Download Speaking Faces

The SpeakingFaces dataset is available through the server of the Institute of Smart Systems and Artificial Intelligence (ISSAI) under the Creative Commons Attribution 4.0 International License. ISSAI is a member of DataCite, and a digital object identifier (DOI) was assigned by the ISSAI Repository to the SpeakingFaces dataset (doi: 10.48333/SMGD-YJ77). This research project was approved by the Institutional Research Ethics Committee of Nazarbayev University. All participants signed informed consent forms to participate in the study and agreed to the public sharing of the data.

The data was collected from 142 subjects of various backgrounds. Each subject participated in two trials that were held on two separate days. Each trial consisted of two sessions. In the first session, subjects were silent and still while the operator captured the visual and thermal video streams from a sequence of nine collection angles. In the second session, the subject read a series of commands presented one by one on the video screens, while the visual, thermal, and audio data was collected from the same nine camera positions. The commands were sourced from Thingpedia, an open and crowd-sourced knowledge base for virtual assistants, along with publicly available commands for Siri. Further details on the data acquisition and preprocessing procedure can be found in our paper.

If you are interested in gaining access to our data, please fill in this form. You will be provided with credentials and instructions on how to connect to our server. If you are a reviewer for a related paper, please use the credentials provided in the cover letter.

The public repository consists of annotated data (metadata), raw data, and clean data. Let us first introduce the notation relevant to the names of directories and files in the figures below:

The file structure of the SpeakingFaces repository is presented below. File names are suffixed by subID and trialID, bringing the total number of files to the indicated maximum (142 or 284).

The annotated data is stored in the metadata directory, which consists of subjects.csv and the commands subdirectory. The former contains, for each subject, the ID, split (train/valid/test), gender, ethnicity, age, and accessories (hat, glasses, etc.) worn in both trials. The latter consists of files named sub_subID_trial_trialID.csv, each composed of records on every command uttered by the subject subID in the trial trialID. There are 284 files in total, two files for each of the 142 subjects. A record includes the command name, the command identifier, the identifier of the camera position at which the utterance was captured, the transcription of the uttered command, and information on the artifacts detected in the recording.
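As a minimal sketch of the naming scheme described above, the following Python snippet builds the expected per-trial metadata file names. It assumes that subject IDs run from 1 to 142, that trial IDs are 1 and 2, and that neither is zero-padded; none of this is confirmed by the text, so the format string may need adjusting after inspecting the actual directory. The function names are illustrative only.

```python
from pathlib import Path

NUM_SUBJECTS = 142  # stated in the dataset description
NUM_TRIALS = 2      # two trials per subject


def command_file_name(sub_id: int, trial_id: int) -> str:
    """Build a file name following the pattern sub_subID_trial_trialID.csv.

    Assumption: IDs are 1-based and not zero-padded.
    """
    return f"sub_{sub_id}_trial_{trial_id}.csv"


def expected_command_files(commands_dir: Path) -> list:
    """List the paths of all per-trial command files under commands_dir."""
    return [
        commands_dir / command_file_name(sub_id, trial_id)
        for sub_id in range(1, NUM_SUBJECTS + 1)
        for trial_id in range(1, NUM_TRIALS + 1)
    ]


files = expected_command_files(Path("metadata") / "commands")
print(len(files))  # 284, i.e. 142 subjects x 2 trials
```

Comparing this list against the files actually present on disk is a quick way to check that a download is complete.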

There are four categories of artifacts, corresponding to the four data streams: thermal, visual, audio, and text. For each stream, the table below lists detected artifacts and the corresponding numerical value recorded in metadata. Thus, an utterance that is “clean” of any noise in the data would have 0 in all four categories. Depending on the application of the dataset, users can decide which of the artifacts are acceptable and select the data in accordance with their preference.
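The selection described above can be sketched in Python as follows. The column names (thermal, visual, audio, text, command_id, pos_id) and the sample rows are hypothetical and invented purely for illustration; the real headers and artifact codes should be taken from the metadata files and the artifact table. The sketch treats 0 as "clean" and lets the caller whitelist specific non-zero codes per stream.

```python
import csv
import io

# Hypothetical column names for the four artifact categories.
ARTIFACT_COLUMNS = ("thermal", "visual", "audio", "text")


def is_acceptable(record, allowed=None):
    """Return True if each artifact value is 0 or explicitly allowed.

    `allowed` maps a stream name to a set of tolerated non-zero codes.
    """
    allowed = allowed or {}
    for column in ARTIFACT_COLUMNS:
        value = int(record[column])
        if value != 0 and value not in allowed.get(column, set()):
            return False
    return True


# Made-up rows standing in for one sub_subID_trial_trialID.csv file.
SAMPLE = """command_id,pos_id,thermal,visual,audio,text
1,1,0,0,0,0
2,1,0,0,1,0
3,2,1,0,0,0
"""

records = list(csv.DictReader(io.StringIO(SAMPLE)))

# Strict selection: 0 in all four categories.
clean = [r for r in records if is_acceptable(r)]

# Relaxed selection: additionally tolerate audio artifact code 1.
tolerant = [r for r in records if is_acceptable(r, allowed={"audio": {1}})]
```

Keeping the tolerated codes in a per-stream mapping makes the selection policy explicit, so a thermal-only application can, for example, ignore audio artifacts entirely.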

The raw data for the “non-speaking” session can be found in video_only_raw, which contains the compressed version of unprocessed video files from both trials for a given subject. The raw data for the other session can be found in video_audio_raw.

The clean data corresponds to the result of the whole data preprocessing pipeline. The img_only directory contains the compressed version of thermal, visual, and aligned visual image frames from the first session. In addition to the image frames, the img_audio folder contains the audio tracks for each spoken utterance in the second session.

The folders video_only_raw, video_audio_raw, img_only, and img_audio contain 142 files each. Each file is a .zip archive that contains data for one of the subjects. The data should be extracted first; the resulting file structure is illustrated below:
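The extraction step can be sketched with Python's standard zipfile module. The function name and the example paths are illustrative assumptions, and the archive names are not specified here, so the sketch simply processes every .zip file it finds in a folder and lets each archive's internal layout determine the extracted structure.

```python
import zipfile
from pathlib import Path


def extract_archives(archive_dir, dest_dir):
    """Extract every .zip file found in archive_dir into dest_dir.

    Returns the sorted names of the archives that were extracted.
    """
    archive_dir, dest_dir = Path(archive_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(archive_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest_dir)
        extracted.append(archive.name)
    return extracted


# Example usage with hypothetical local paths:
# names = extract_archives("img_only", "extracted/img_only")
# print(f"extracted {len(names)} archives")  # 142 expected for a full folder
```

Since each of the four folders is stated to hold 142 archives, checking the length of the returned list is a simple sanity check after downloading.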

You can download the scripts and models from our GitHub repository.

Please cite as:

M. Abdrakhmanova, A. Kuzdeuov, S. Jarju, Y. Khassanov, M. Lewis and H. A. Varol. “SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams”. arXiv preprint arXiv:2012.02961 (2020).

M. Abdrakhmanova, A. Kuzdeuov, S. Jarju, Y. Khassanov, M. Lewis, H. A. Varol, “ISSAI SpeakingFaces Dataset.” Institute of Smart Systems and Artificial Intelligence, 2020, doi: 10.48333/SMGD-YJ77.