
Uzbek Speech Corpus and Automatic Speech Recognition

The Uzbek speech corpus (USC) was developed jointly by ISSAI and the Image and Speech Processing Laboratory in the Department of Computer Systems of the Tashkent University of Information Technologies (https://tuit.uz/en/kompyuter-tizimlari). The USC comprises 105 hours of transcribed audio recordings from 958 different speakers. To ensure high quality, the USC has been manually checked by native speakers. The USC is primarily designed for automatic speech recognition (ASR), but it can also be used to aid other speech-related tasks, such as speech synthesis and speech translation. To the best of our knowledge, the USC is the first open-source Uzbek speech corpus available for both academic and commercial use under the Creative Commons Attribution 4.0 International License. We expect that the USC will be a valuable resource for the general speech research community and will become the baseline dataset for Uzbek ASR research.

The Uzbek speech data collection project is ongoing. If you wish to contribute, please visit: https://usc.spai.uz/en/

If you use the USC for commercial purposes, please add this statement to your product or service:

Our product uses ISSAI and TUIT Uzbek Speech Corpus (https://doi.org/10.48342/drss-8q87), which is available under a Creative Commons Attribution 4.0 International License.

If you use the USC for research, please cite it as:

Musaev, M., Mussakhojayeva, S., Khujayorov, I., Khassanov, Y., Ochilov, M., & Varol, H. A. (2021). USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments. arXiv preprint arXiv:2107.14419.

Below is a demo of the automatic speech recognition system built using the Uzbek speech corpus.

Instructions for using the Uzbek ASR demo:

  • Click the “RECORD” button and start speaking immediately (in Uzbek); keep speaking until the countdown reaches zero.
  • The recognized output will be displayed above the “RECORD” button after 10 seconds.

Some browser versions don’t support audio recording. If this applies to you, please consider using an up-to-date browser on a desktop device.


The USC dataset specifications.

Category           Train     Valid    Test     Total
Duration (hours)   96.4      4.0      4.5      104.9
# Utterances       100,767   3,783    3,837    108,387
# Words            569.0k    22.5k    27.1k    618.6k
# Unique Words     59.5k     8.4k     10.5k    63.1k
# Speakers         879       41       38       958
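
The figures in the specification table can be cross-checked with a short script. The following is a minimal Python sketch, not part of the USC release: the names `SPLITS`, `REPORTED_TOTAL`, `column_total`, `mean_utterance_seconds`, and `summarize` are hypothetical, and the numbers are copied by hand from the table above. It confirms that the per-split values add up to the reported totals and derives each split's share of the audio and its average utterance length.

```python
# Figures copied by hand from the USC dataset specification table.
# All names in this sketch are illustrative, not part of any USC tooling.
SPLITS = {
    "train": {"hours": 96.4, "utterances": 100_767, "speakers": 879},
    "valid": {"hours": 4.0, "utterances": 3_783, "speakers": 41},
    "test": {"hours": 4.5, "utterances": 3_837, "speakers": 38},
}
REPORTED_TOTAL = {"hours": 104.9, "utterances": 108_387, "speakers": 958}


def column_total(key):
    """Sum one column of the table across the three splits."""
    return sum(split[key] for split in SPLITS.values())


def mean_utterance_seconds(split):
    """Average utterance length in seconds for one split."""
    return split["hours"] * 3600 / split["utterances"]


def summarize():
    """Return each split's share of total hours and mean utterance length."""
    total_hours = column_total("hours")
    return {
        name: {
            "share_of_hours": split["hours"] / total_hours,
            "mean_utterance_s": mean_utterance_seconds(split),
        }
        for name, split in SPLITS.items()
    }


if __name__ == "__main__":
    # Each column should match the reported total (hours are rounded to 0.1).
    for key, reported in REPORTED_TOTAL.items():
        assert abs(column_total(key) - reported) < 0.05, key
    for name, stats in summarize().items():
        print(f"{name}: {stats['share_of_hours']:.1%} of audio, "
              f"{stats['mean_utterance_s']:.2f} s per utterance")
```

The per-split speaker counts sum exactly to 958, which suggests that no speaker appears in more than one split. The word counts are left out of the sketch because the table reports them only as rounded thousands.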