We present SpeakingFaces, a publicly available large-scale dataset developed to support multimodal machine learning research in contexts that combine thermal, visual, and audio data streams. Example application domains include human-machine interaction, biometric authentication, recognition systems, domain transfer, and speech recognition.

SpeakingFaces consists of well-aligned, high-resolution thermal and visual spectrum image streams of fully framed faces, synchronized with audio recordings of each subject speaking up to 100 imperative phrases. Data was collected from 142 subjects, yielding over 13,000 instances of synchronized data.

Further details on the dataset, including our preliminary experiments, can be found in our paper.


M. Abdrakhmanova, A. Kuzdeuov, S. Jarju, Y. Khassanov, M. Lewis, H. A. Varol, “SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams.” arXiv preprint arXiv:2012.02961, 2020.

M. Abdrakhmanova, A. Kuzdeuov, S. Jarju, Y. Khassanov, M. Lewis, H. A. Varol, “ISSAI SpeakingFaces Dataset.” Institute of Smart Systems and Artificial Intelligence, 2020, doi: 10.48333/SMGD-YJ77.

The protocol for this study was approved by the Institutional Research Ethics Committee of Nazarbayev University.
This work is licensed under a Creative Commons Attribution 4.0 International License.
