Publication

KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus

Abstract

We present the first industrial-scale open-source Kazakh speech corpus for automatic speech recognition research and development. Our corpus subsumes two previously presented corpora: 1) Kazakh speech corpus (KSC) and 2) Kazakh text-to-speech 2 (KazakhTTS2). We also provide additional data from other sources, including television news, television and radio programs, parliament speeches, and podcasts. Our corpus, which we have named KSC2, contains over a thousand hours of high-quality transcribed data, which is triple the size of KSC. KSC2 was manually transcribed with the help of native Kazakh speakers and validated via preliminary speech recognition experiments on various evaluation sets. Moreover, it contains utterances with Kazakh-Russian code-switching, a conversational practice common among Kazakh speakers. We believe that our corpus will facilitate speech processing research for Kazakh, which is widely considered an under-resourced language. To ensure the reproducibility of experiments, we share the KSC2 corpus, training recipes, and pretrained models.

Information about the publication

Authors:

Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol