ISSAI - Institute of Smart Systems and Artificial Intelligence

Multilingual ASR in Kazakh, English, and Russian Languages

This is the first study of multilingual end-to-end (E2E) automatic speech recognition (ASR) for three languages used in Kazakhstan: Kazakh, Russian, and English. Kazakhstan is a multinational country where Kazakh is the official state language, whereas Russian and English are the languages of interethnic and international communication. We therefore study a single joint E2E ASR model that recognizes Kazakh, Russian, and English simultaneously. We believe that this work will advance speech processing research and speech-enabled technology in Kazakhstan and its neighboring countries.

Besides conducting the first detailed study of multilingual E2E ASR for Kazakh, Russian, and English, other contributions of this work are:

We introduce a 7-hour evaluation set of transcribed Kazakh-accented English audio recordings (i.e., native Kazakh speakers reading English sentences extracted from the SpeakingFaces dataset [1]).
We introduce a 334-hour manually-cleaned subset of the OpenSTT [2] dataset for the Russian language, which can also be used to train robust standalone Russian ASR systems.

If you use our dataset for commercial purposes, please add this statement to your product or service:

Our product uses ISSAI Multilingual (Kazakh, Russian, English) Speech Corpus (https://doi.org/10.48342/0qzd-fk83), which is available under a Creative Commons Attribution 4.0 International License.

If you use our dataset for research, please cite it as:

Mussakhojayeva, S., Khassanov, Y., & Varol, H. A. (2021). A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English. arXiv preprint arXiv:2108.01280.

Here is a demo of the automatic speech recognition system built using the ISSAI multilingual speech corpus.

Instructions for using the multilingual ASR demo:

Click the “RECORD” button and speak immediately (in Kazakh, Russian, or English) until the countdown reaches zero
The recognized output will be displayed above the “RECORD” button after 10 seconds

Some browser versions don’t support audio recording. If this applies to you, please consider using an up-to-date browser on a desktop device.


This work is licensed under a Creative Commons Attribution 4.0 International license.

Dataset statistics for the Kazakh, Russian, and English languages. Utterance and word counts are in thousands (k) or millions (M), and durations are in hours (hr). The overall ‘Total’ statistics are obtained by combining the training, validation, and test sets across all languages.

#	Language	Set	Corpus	Duration	Utterances	Words
1	Kazakh	train	KSC [3]	318.4 hr	147.2k	1.6M
		valid		7.1 hr	3.3k	35.3k
		test		7.1 hr	3.3k	35.9k
2	Russian	train	OpenSTT-CS334	327.1 hr	223.0k	2.3M
		valid	OpenSTT-CS334	7.1 hr	4.8k	48.3k
		test-B (books)	OpenSTT [2]	3.6 hr	3.7k	28.1k
		test-Y (YouTube)	OpenSTT [2]	3.4 hr	3.9k	31.2k
3	English	train	CV-330	330.0 hr	208.9k	2.2M
		valid	CV [4]	7.4 hr	4.3k	43.9k
		test	CV [4]	7.4 hr	4.6k	44.3k
		test-SF	SpeakingFaces [1]	7.7 hr	6.8k	37.7k
4	Total	train	-	975.6 hr	579.3k	6.0M
		valid		21.6 hr	12.4k	127.5k
		test		29.1 hr	22.5k	177.3k
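
The ‘Total’ rows can be approximately re-derived by summing the per-row values of each set. The following is a minimal sketch for the Duration column only; the variable and function names are illustrative and are not part of the released code, and the numbers are copied from the table above.

```python
# Durations in hours, copied from the table above (one entry per table row).
durations_hr = {
    "train": [318.4, 327.1, 330.0],     # Kazakh, Russian, English
    "valid": [7.1, 7.1, 7.4],           # Kazakh, Russian, English
    "test": [7.1, 3.6, 3.4, 7.4, 7.7],  # Kazakh, test-B, test-Y, CV test, test-SF
}

# 'Total' rows as reported in the table.
reported_total_hr = {"train": 975.6, "valid": 21.6, "test": 29.1}


def combine(rows):
    """Sum the per-row values of each set, rounded to one decimal place."""
    return {split: round(sum(values), 1) for split, values in rows.items()}


recomputed_hr = combine(durations_hr)

for split, total in recomputed_hr.items():
    diff = abs(total - reported_total_hr[split])
    print(f"{split}: recomputed {total} hr, reported {reported_total_hr[split]} hr, "
          f"difference {diff:.1f} hr")
```

Sums of the rounded per-row values can differ from the reported totals by about 0.1 hr. A plausible explanation (an assumption, not stated in the source) is that the reported totals were computed from unrounded durations.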

[1] Abdrakhmanova, M., Kuzdeuov, A., Jarju, S., Khassanov, Y., Lewis, M., Varol, H.A.: SpeakingFaces: A large-scale multimodal dataset of voice commands with visual and thermal video streams. Sensors 21(10) (2021).

[2] Slizhikova, A., Veysov, A., Nurtdinova, D., Voronin, D.: Russian open speech to text dataset. https://github.com/snakers4/open_stt, accessed: 2021-01-15.

[3] Khassanov, Y., Mussakhojayeva, S., Mirzakhmetov, A., Adiyev, A., Nurpeiissov, M., Varol, H.A.: A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. pp. 697–706. Association for Computational Linguistics (2021).

[4] Ardila, R., Branson, M., Davis, K., Kohler, M., Meyer, J., Henretty, M., Morais, R., Saunders, L., Tyers, F.M., Weber, G.: Common voice: A massively-multilingual speech corpus. In: LREC. pp. 4218–4222. ELRA (2020).
