ISSAI - Institute of Smart Systems and Artificial Intelligence

Tatar Speech Corpus ASR

TatSC contains 269.1 hours of transcribed speech across 271,914 utterances. It is the first open-source Tatar speech corpus covering both crowdsourced and audiobook data.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

A Central Asian Food Dataset for Personalized Dietary Interventions

The first Central Asian Food Dataset, containing 16,499 images across 42 food items. The dataset is unbalanced: the number of images per class varies from 99 to 922. The images were web-scraped and also include frames extracted from videos.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

The Kazakh Named Entity Recognition Dataset (KazNERD)

The Kazakh Named Entity Recognition Dataset (KazNERD) contains 112,702 sentences extracted from television news text and 136,333 annotations for 25 entity classes. All sentences were manually annotated by two native Kazakh-speaking linguists, supervised by an ISSAI researcher. Annotation follows the IOB2 scheme, and the dataset is distributed in the CoNLL 2002 format.
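
For readers unfamiliar with the format, the following is a minimal sketch of how CoNLL-style, IOB2-tagged text can be read and grouped into entities. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences. The sample text, the tag names, and the helper names `read_conll` and `iob2_spans` are illustrative assumptions, not taken from the KazNERD files or code.

```python
from typing import Iterator, List, Optional, Tuple

Token = Tuple[str, str]  # (word, IOB2 tag)


def read_conll(text: str) -> Iterator[List[Token]]:
    """Yield sentences from CoNLL-style text (blank line = sentence boundary)."""
    sentence: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # Assumption: the word is the first column and the tag is the last.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


def iob2_spans(sentence: List[Token]) -> List[Tuple[str, Optional[str]]]:
    """Group IOB2-tagged tokens into (entity_text, entity_class) pairs."""
    spans: List[Tuple[str, Optional[str]]] = []
    words: List[str] = []
    label: Optional[str] = None
    for word, tag in sentence:
        if tag.startswith("B-"):
            # B- always opens a new entity, closing any open one first.
            if words:
                spans.append((" ".join(words), label))
            words, label = [word], tag[2:]
        elif tag.startswith("I-") and words and tag[2:] == label:
            # I- continues the currently open entity of the same class.
            words.append(word)
        else:
            # "O" (or a stray I- tag) closes the open entity, if any.
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
    if words:
        spans.append((" ".join(words), label))
    return spans


# Hypothetical sample, written for illustration only.
SAMPLE = """\
Astana B-GPE
is O
home O
to O
Nazarbayev B-ORGANISATION
University I-ORGANISATION

Abai B-PERSON
wrote O
poems O
"""

sentences = list(read_conll(SAMPLE))
entities = [iob2_spans(s) for s in sentences]
```

In IOB2, every entity starts with a `B-` tag, so adjacent entities of the same class stay separate without any lookahead. The actual column layout of the released files should be checked before reusing this sketch.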

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Turkish Speech Corpus

The corpus contains 218.2 hours of transcribed speech across 186,171 utterances and is the largest publicly available Turkish dataset of its kind. The datasets and code used to train the models are available for download at TurkicASR.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Kazakh Speech Corpus 2 (KSC2)

KSC2 is the first industrial-scale open-source Kazakh speech corpus. It subsumes two previously introduced corpora, the Kazakh Speech Corpus and Kazakh Text-to-Speech 2 (KazakhTTS2), and supplements them with data from other sources such as TV programs, radio, senate, and podcasts. In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Kazakh Text-to-Speech 2 (KazakhTTS2)

An expanded version of the previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three females and two males), and the topic coverage has been diversified with the help of new sources, including a book and Wikipedia articles.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Speaking Faces

A large-scale, publicly available dataset designed to encourage research in the general areas of user authentication, facial recognition, speech recognition, and human-computer interaction.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

SF-TL54: A Thermal Facial Landmark Dataset with Visual Pairs

A thermal face dataset with manually annotated bounding boxes and facial landmarks. The dataset was constructed using our large-scale SpeakingFaces dataset (https://issai.nu.edu.kz/speaking-faces/). In total, the dataset contains 2,556 thermal-visual image pairs of 142 subjects, where each subject has 18 thermal-visual image pairs (2 trials × 9 positions).

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

TFW: Annotated Thermal Faces in the Wild Dataset

The dataset contains thermal images acquired in indoor (controlled) and outdoor (uncontrolled) environments. The indoor dataset was constructed using our previously published SpeakingFaces dataset. The outdoor dataset was collected using the same FLIR T540 thermal camera, with a resolution of 464×348 pixels, a waveband of 7.5–14 μm, a 24° field of view, and an iron color palette. The dataset was manually annotated with face bounding boxes and five-point facial landmarks.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Uzbek Speech Corpus (USC)

The USC is an open-source speech corpus developed jointly by ISSAI and the Image and Speech Processing Laboratory in the Department of Computer Systems of the Tashkent University of Information Technologies (https://tuit.uz/en/kompyuter-tizimlari). The USC comprises 105 hours of transcribed audio recordings from 958 different speakers.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

Russian Speech Corpus (OpenSTT-CS334)

OpenSTT-CS334 is a manually re-transcribed, 334-hour clean subset of the Russian OpenSTT dataset (https://github.com/snakers4/open_stt). It contains recordings from the books and YouTube domains only.

LICENCE: Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/)

Kazakh-accented English

The dataset consists of Kazakh-accented English recordings (~7.7 hours) extracted from the SpeakingFaces dataset (https://doi.org/10.48333/smgd-yj77). In these recordings, native Kazakh speakers utter English verbal commands of the kind given to virtual assistants and other smart devices, such as ‘turn off the lights’ and ‘play the next song’.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

WiFine

A finer-level sequential dataset of WiFi received signal strengths (RSS). The dataset contains 290 trajectories collected across 3 floors of the C4 building of Nazarbayev University. The RSS values, with the corresponding position coordinates (x, y, z), are recorded approximately every 5 seconds.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

IMUWiFine

A finer-level sequential dataset of IMU and WiFi received signal strengths (RSS). The dataset contains 120 trajectories covering an aggregate distance of over 14 kilometers. The dataset was collected across 3 floors of the C4 building of Nazarbayev University.

LICENCE: Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)

ISSAI Datasets