Kazakh Speech Corpus 2

Kazakh Speech Corpus 2 (KSC2) is the first industrial-scale open-source Kazakh speech corpus. It subsumes two previously introduced corpora, Kazakh Speech Corpus and Kazakh Text-To-Speech 2, and supplements them with additional data from other sources such as TV programs, radio, senate, and podcasts. In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.

Importantly, KSC2 contains utterances with Kazakh-Russian code-switching, a common conversational practice among Kazakh speakers.

The dataset can be used by professionals to develop various Kazakh speech and language processing applications, such as virtual assistants in the Kazakh language, robots speaking Kazakh, smart homes and cars, voice and text-enabled applications that can also assist people with special needs, and many more.

Like the first version, the KSC2 dataset is freely available to both academic researchers and industry practitioners from the ISSAI website.

If you use the ISSAI Kazakh Speech Corpus 2 for commercial purposes, please add this statement to your product or service:

Our product uses ISSAI Kazakh Speech Corpus 2, which is available under a Creative Commons Attribution 4.0 International License.

If you use the ISSAI Kazakh Speech Corpus 2 for research, please cite it as:

Mussakhojayeva, S., Khassanov, Y., Varol, H.A.: KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus. In: Proceedings of the 23rd INTERSPEECH Conference, pp. 1367-1371. 2022.

Here is a demo of the automatic speech recognition system built using the Kazakh Speech Corpus.

  • Click the “RECORD” button and speak immediately (in Kazakh) until the countdown reaches zero
  • The recognized output will be displayed above the “RECORD” button after 10 seconds

Some browsers do not support audio recording. If this applies to you, please consider using an up-to-date browser on a desktop device.
