ISSAI Datasets

Kazakh Speech Corpus (KSC)

The KSC is the largest publicly available dataset developed to advance various Kazakh speech and language processing applications. It contains around 335 hours of manually transcribed audio comprising over 154,000 utterances spoken by participants from different regions of Kazakhstan.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Kazakh Text-to-Speech (KazakhTTS)

The KazakhTTS is a high-quality open-source speech dataset that contains over 90 hours of audio recorded by two professional speakers (one male and one female).

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Speaking Faces

A large-scale, publicly available dataset designed to encourage research in the general areas of user authentication, facial recognition, speech recognition, and human-computer interaction.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

SF-TL54: A Thermal Facial Landmark Dataset with Visual Pairs

A thermal face dataset with manually annotated bounding boxes and facial landmarks. The dataset was constructed using our large-scale SpeakingFaces dataset (https://issai.nu.edu.kz/speaking-faces/). In total, the dataset contains 2,556 thermal-visual image pairs of 142 subjects, where each subject has 18 thermal-visual image pairs (2 trials × 9 positions).
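
The pair count follows directly from the collection layout: 142 subjects, each recorded in 2 trials at 9 positions. The short Python sketch below enumerates that layout to confirm the totals. The (subject, trial, position) index and the 1-based numbering are assumptions made for illustration only; the actual file naming in the dataset may differ.

```python
from itertools import product

# Counts taken from the dataset description.
NUM_SUBJECTS = 142
NUM_TRIALS = 2
NUM_POSITIONS = 9

# Hypothetical (subject, trial, position) index, one entry per
# thermal-visual image pair. The real naming scheme may differ.
pairs = list(product(range(1, NUM_SUBJECTS + 1),
                     range(1, NUM_TRIALS + 1),
                     range(1, NUM_POSITIONS + 1)))

pairs_per_subject = NUM_TRIALS * NUM_POSITIONS
total_pairs = len(pairs)

print(pairs_per_subject, total_pairs)  # prints: 18 2556
```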

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

TFW: Annotated Thermal Faces in the Wild Dataset

The dataset contains thermal images acquired in indoor (controlled) and outdoor (uncontrolled) environments. The indoor dataset was constructed using our previously published SpeakingFaces dataset. The outdoor dataset was collected using the same FLIR T540 thermal camera, with a resolution of 464x348 pixels, a waveband of 7.5–14 μm, a 24° field of view, and an iron color palette. The dataset was manually annotated with face bounding boxes and five-point facial landmarks.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Uzbek Speech Corpus (USC)

The USC is an open-source speech corpus that has been developed in collaboration between ISSAI and the Image and Speech Processing Laboratory in the Department of Computer Systems of the Tashkent University of Information Technologies (https://tuit.uz/en/kompyuter-tizimlari). The USC comprises 958 different speakers with a total of 105 hours of transcribed audio recordings.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Russian Speech Corpus (OpenSTT-CS334)

The OpenSTT-CS334 is a manually re-transcribed 334-hour clean subset of the Russian OpenSTT (https://github.com/snakers4/open_stt). The dataset contains recordings only from the books and YouTube domains.

LICENCE: Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/)

Kazakh-accented English

The dataset consists of Kazakh-accented English recordings (~7.7 hours) extracted from the SpeakingFaces dataset (https://doi.org/10.48333/smgd-yj77), i.e., native Kazakh speakers uttering English verbal commands given to virtual assistants and other smart devices, such as ‘turn off the lights’ and ‘play the next song’.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

WiFine

A finer-level sequential dataset of WiFi received signal strengths (RSS). The dataset contains 290 trajectories collected across 3 floors of the C4 building of Nazarbayev University. The RSS values, with their corresponding position coordinates (x, y, z), are recorded approximately every 5 seconds.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

IMUWiFine

A finer-level sequential dataset of IMU and WiFi received signal strengths (RSS). The dataset contains 120 trajectories covering an aggregate distance of over 14 kilometers. The dataset was collected across 3 floors of the C4 building of Nazarbayev University.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)
