The Uzbek Speech Corpus (USC) was developed jointly by ISSAI and the Image and Speech Processing Laboratory in the Department of Computer Systems of the Tashkent University of Information Technologies (https://tuit.uz/en/kompyuter-tizimlari). The USC comprises recordings from 958 speakers, totaling about 105 hours of transcribed audio. To ensure high quality, the USC has been manually checked by native speakers. The USC is primarily designed for automatic speech recognition (ASR), but it can also support other speech-related tasks, such as speech synthesis and speech translation. To the best of our knowledge, the USC is the first open-source Uzbek speech corpus available for both academic and commercial use under the Creative Commons Attribution 4.0 International License. We expect the USC to be a valuable resource for the general speech research community and to become the baseline dataset for Uzbek ASR research.
The Uzbek speech data collection project is ongoing. If you wish to contribute, please visit: https://usc.spai.uz/en/
If you use the USC for commercial purposes, please add the following statement to your product or service:
Our product uses ISSAI and TUIT Uzbek Speech Corpus (https://doi.org/10.48342/drss-8q87), which is available under a Creative Commons Attribution 4.0 International License.
If you use the USC for research, please cite it as:
Musaev, M., Mussakhojayeva, S., Khujayorov, I., Khassanov, Y., Ochilov, M., & Varol, H. A. (2021). USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments. arXiv preprint arXiv:2107.14419.
The USC dataset specifications.
| ID | Category | Train | Valid | Test | Total |
|---|---|---|---|---|---|
| 1 | Duration (hours) | 96.4 | 4.0 | 4.5 | 104.9 |
| 2 | # Utterances | 100,767 | 3,783 | 3,837 | 108,387 |
| 3 | # Words | 569.0k | 22.5k | 27.1k | 618.6k |
| 4 | # Unique Words | 59.5k | 8.4k | 10.5k | 63.1k |
| 5 | # Speakers | 879 | 41 | 38 | 958 |
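
The figures in the table can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the official USC tooling: the dictionary layout and the helper names `column_sum` and `mean_utterance_seconds` are assumptions made for illustration, and the numbers are copied from the table above. Unique words are left out of the sum check because vocabularies overlap between splits, so that row is not expected to add up.

```python
# Split statistics copied from the specification table (words in thousands).
SPLITS = {
    "train": {"hours": 96.4, "utterances": 100_767, "words_k": 569.0, "speakers": 879},
    "valid": {"hours": 4.0, "utterances": 3_783, "words_k": 22.5, "speakers": 41},
    "test": {"hours": 4.5, "utterances": 3_837, "words_k": 27.1, "speakers": 38},
}
TOTAL = {"hours": 104.9, "utterances": 108_387, "words_k": 618.6, "speakers": 958}


def column_sum(key):
    """Sum one statistic over the three splits, rounded to one decimal place."""
    return round(sum(split[key] for split in SPLITS.values()), 1)


def mean_utterance_seconds(split_name):
    """Average utterance length in seconds for a split (hours * 3600 / utterances)."""
    split = SPLITS[split_name]
    return split["hours"] * 3600 / split["utterances"]


if __name__ == "__main__":
    for key, total in TOTAL.items():
        status = "ok" if column_sum(key) == total else "mismatch"
        print(f"{key}: sum over splits = {column_sum(key)}, table total = {total} ({status})")
    for name in SPLITS:
        print(f"{name}: mean utterance length = {mean_utterance_seconds(name):.2f} s")
```

The per-split speaker counts add up exactly to the stated total of 958, which is consistent with each speaker appearing in only one split, although the table alone does not confirm this.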