Kazakh Speech Corpus

The Kazakh Speech Corpus (KSC) contains around 335 hours of transcribed audio, comprising over 154,000 utterances spoken by participants of both genders, from various age groups and from different regions of Kazakhstan. Native Kazakh speakers carefully inspected the corpus to ensure high quality. It is the largest publicly available database developed to advance Kazakh speech and language processing applications such as speech recognition, speech synthesis, and speaker recognition. The KSC database is available for public and commercial use upon request under the Creative Commons Attribution 4.0 International License.

You can download the dataset by filling in the request form at the bottom of this webpage (“Download KSC Data”). Alternatively, you can download it from OpenSLR: http://www.openslr.org/102/.

If you use the ISSAI Kazakh Speech Corpus for commercial purposes, please add the following statement to your product or service:

Our product uses the ISSAI Kazakh Speech Corpus (https://doi.org/10.48342/gkg9-gn84), which is available under the Creative Commons Attribution 4.0 International License.

If you use the ISSAI Kazakh Speech Corpus for research, please cite it as follows:

Y. Khassanov, S. Mussakhojayeva, A. Mirzakhmetov, A. Adiyev, M. Nurpeiissov and H. A. Varol. “A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline”. arXiv preprint arXiv:2009.10334 (2020). Link to paper: https://arxiv.org/abs/2009.10334. DOI link: https://doi.org/10.48342/gkg9-gn84

Here is a demo of the automatic speech recognition system built using the Kazakh Speech Corpus. Click the “RECORD” button, start speaking immediately, and continue until the countdown reaches zero. The recognized output will be displayed above the “RECORD” button after 10 seconds.

Some browsers may not support audio recording. If yours does not, please consider using an up-to-date browser on a desktop device.

