ISSAI - Institute of Smart Systems and Artificial Intelligence

Kazakh Speech Corpus 2

Kazakh Speech Corpus 2 (KSC2) is the first industrial-scale open-source Kazakh speech corpus. KSC2 subsumes two previously introduced corpora, the Kazakh Speech Corpus and Kazakh Text-To-Speech 2, and supplements them with additional data from other sources such as TV programs, radio, the senate, and podcasts. In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.

Importantly, KSC2 contains utterances with Kazakh-Russian code-switching, a common conversational practice among Kazakh speakers.

Professionals can use the dataset to develop a range of Kazakh speech and language processing applications, such as Kazakh-language virtual assistants, Kazakh-speaking robots, smart homes and cars, and voice- and text-enabled applications that can also assist people with special needs.

Like the first version, the KSC2 dataset is freely available to both academic researchers and industry practitioners from the ISSAI website.

If you use the ISSAI Kazakh Speech Corpus 2 for commercial purposes, please add this statement to your product or service:

Our product uses ISSAI Kazakh Speech Corpus 2 (https://doi.org/10.48342/m90y-aj02), which is available under a Creative Commons Attribution 4.0 International License.

If you use the ISSAI Kazakh Speech Corpus 2 for research, please cite it as:

Mussakhojayeva, S., Khassanov, Y., Varol, H.A.: KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus. In: Proceedings of the 23rd INTERSPEECH Conference, pp. 1367-1371 (2022).

This work is licensed under a Creative Commons Attribution 4.0 International license.
