Saitoh lab

Kyushu Institute of Technology

SSSD: Speech Scene by Smart Device

What is SSSD?

Each speaker recorded their own utterance scenes with the front-facing camera of a smart device, either placed on a desk or held in the hand. Most scenes were recorded in a university laboratory or an apartment room, but some were recorded in a car. There was no restriction on whether the words were spoken aloud or silently during recording.


  • May 22, 2018: Released 5,000 test samples for the competition 'Lip Reading Challenge' (5th Silent Speech Recognition Workshop).
  • April 8, 2018: Released our database.

  • Speakers: 36
  • Utterances: 25 words
  • Language: Japanese
  • Available data
    • Lower face images (image size: 300 x 300 pixels)
    • Feature points
  • Device: Smart devices, such as Apple iPad Pro and smartphones
  • Frame rate: 30fps
  • Release date: April 8, 2018

Details of SSSD

  • Utterance content: 25 Japanese words
    #   Utterance   #   Utterance           #   Utterance
    0   /ze-ro/     10  /a-ri-ga-to-u/      20  /do-u-i-ta-shi-ma-shi-te/
    1   /i-chi/     11  /i-i-e/             21  /ha-i/
    2   /ni/        12  /o-ha-yo-u/         22  /ha-ji-me-ma-shi-te/
    3   /sa-N/      13  /o-me-de-to-u/      23  /ma-ta-ne/
    4   /yo-N/      14  /o-ya-su-mi/        24  /mo-shi-mo-shi/
    5   /go/        15  /go-me-N-na-sa-i/
    6   /ro-ku/     16  /ko-N-ni-chi-wa/
    7   /na-na/     17  /ko-N-ba-N-wa/
    8   /ha-chi/    18  /sa-yo-u-na-ra/
    9   /kyu/       19  /su-mi-ma-se-N/
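
For scripts that need to turn a class label back into a word, the utterance list above can be held in a small lookup table. This is only a sketch: the names `SSSD_WORDS` and `word_for_label` are hypothetical and are not part of the distributed data, and the romanized spellings are copied from the list as printed.

```python
# Romanized utterances indexed by their number in the list (0 to 24).
# SSSD_WORDS and word_for_label are illustrative names, not part of the database.
SSSD_WORDS = [
    "ze-ro", "i-chi", "ni", "sa-N", "yo-N",
    "go", "ro-ku", "na-na", "ha-chi", "kyu",
    "a-ri-ga-to-u", "i-i-e", "o-ha-yo-u", "o-me-de-to-u", "o-ya-su-mi",
    "go-me-N-na-sa-i", "ko-N-ni-chi-wa", "ko-N-ba-N-wa", "sa-yo-u-na-ra",
    "su-mi-ma-se-N",
    "do-u-i-ta-shi-ma-shi-te", "ha-i", "ha-ji-me-ma-shi-te", "ma-ta-ne",
    "mo-shi-mo-shi",
]


def word_for_label(label: int) -> str:
    """Return the romanized utterance for a label in the range 0 to 24."""
    if not 0 <= label < len(SSSD_WORDS):
        raise ValueError(f"label must be in 0..{len(SSSD_WORDS) - 1}, got {label}")
    return SSSD_WORDS[label]
```

For example, `word_for_label(16)` returns `"ko-N-ni-chi-wa"`.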

  • Number of speakers: 36 (21 males and 15 females)
    • Gender distribution

    • Age distribution
      All speakers are university students.

  • Available data
    • 36 subjects (21M + 15F) x 25 words x 30 samples = 27,000 samples

    • Lower half face ROI (LF-ROI) images (image size: 300 x 300 pixels, JPEG format)
      The frame images of each utterance sample are stored in their own folder.
      All 750 samples of each speaker are compressed into one zip file. Each zip file is approximately 800 MB, for a total of 28.7 GB across the 36 speakers.
      The figure below shows the released LF-ROIs of the 36 speakers.

      Sample: (approx. 640 kB) An utterance sample of word #1 of speaker s10.
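
Since each utterance sample is a folder of JPEG frames recorded at 30 fps, a sample can be read by listing the folder and ordering the frames. The sketch below rests on assumptions: the frame file names are not documented on this page, so it assumes that sorting by file name gives temporal order (for example, zero-padded frame numbers), and `list_frames` and `duration_seconds` are hypothetical helper names.

```python
from pathlib import Path

FRAME_RATE_FPS = 30  # frame rate stated on this page


def list_frames(sample_dir):
    """Return the JPEG files in one sample folder, sorted by file name.

    Assumption: name order equals temporal order (e.g. zero-padded numbers).
    """
    folder = Path(sample_dir)
    frames = [p for p in folder.iterdir()
              if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}]
    return sorted(frames, key=lambda p: p.name)


def duration_seconds(num_frames, fps=FRAME_RATE_FPS):
    """Approximate utterance length implied by the frame count."""
    return num_frames / fps
```

Under these assumptions, a sample with 45 frames corresponds to 1.5 seconds of video.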

    • Facial feature points (70 points, CSV format)
      All 750 samples of each speaker are compressed into one zip file. Each zip file is approximately 5.7 MB, for a total of about 200 MB across the 36 speakers.
      Sample: s10_001_001.csv (approx. 16 kB) An utterance sample of word #1 of speaker s10.
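
The sample name `s10_001_001.csv` suggests a pattern of speaker ID followed by two three-digit numbers. A minimal sketch of splitting such a name is shown here. It is a guess from this single example: the field order (speaker, word, sample) and whether the numbers start at 0 or 1 are not stated on this page, and `SampleId` and `parse_sample_name` are hypothetical names.

```python
import re
from typing import NamedTuple


class SampleId(NamedTuple):
    speaker: str  # e.g. "s10"
    word: int     # assumed to be the word number
    sample: int   # assumed to be the repetition number


# Assumed pattern: s<digits>_<3 digits>_<3 digits>, with an optional .csv suffix.
_NAME_PATTERN = re.compile(r"^(s\d+)_(\d{3})_(\d{3})(?:\.csv)?$")


def parse_sample_name(name: str) -> SampleId:
    """Split a sample name such as 's10_001_001.csv' into its three fields."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized sample name: {name!r}")
    speaker, word, sample = match.groups()
    return SampleId(speaker, int(word), int(sample))
```

With this pattern, `parse_sample_name("s10_001_001.csv")` yields `SampleId(speaker="s10", word=1, sample=1)`.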

    • Audio data: none

    • Relationship between LF-ROI and feature points

      * This is a different sample from s10_001_001.

    • Preprocessing (how to extract LF-ROI): Please refer to [1].

    • Training data and test data of [1]

    • Other: The provided data are current as of April 8, 2018. The contents may change in the future.

  • Devices: Smart device (Apple iPad Pro, and various smartphones)

  • Frame rate: 30 fps

  • Limitation: Use the database for academic research only. Users may not distribute the database or any portion of it in any way.

  • Release date: April 8, 2018

  • Price: free

  • Distribution: To obtain this database, please contact the following address.
    E-mail: saitoh at

  • Publications:
    • [1] Takeshi Saitoh, Michiko Kubokawa: SSSD: Speech Scene Database by Smart Device for Visual Speech Recognition, Proc. of ICPR2018, pp.3228-3232, 2018.

  • Contact:
    Kyushu Institute of Technology
    Takeshi Saitoh
    E-mail: saitoh at

  • This project has been supported by JSPS KAKENHI Grant Number 16H03211 and the Program for Building Regional Innovation Ecosystems.