After a long period in which research focused on individual languages, multilingual automatic speech recognition has recently become an active area of research. For instance, Whisper by OpenAI is capable of recognizing speech in 99 languages. However, the performance of Whisper is significantly lower for low-resource languages than for high-resource ones. In this work, we address this gap and present a fine-tuning strategy for the pre-trained Whisper model that improves its performance for a low-resource language family while maintaining performance for a set of high-resource languages.
Specifically, our Söyle model exhibited high performance for both the Turkic language family (11 languages) and the official languages of the United Nations. Our work also presents the first large open-source speech corpus for the Tatar language, which was created together with the Institute of Applied Semiotics of Tatarstan Academy of Sciences. We demonstrate that speech recognition performance for Tatar improves when the model is trained using the new Tatar Speech Corpus (TatSC). Our model is also trained to be noise-robust and to perform long-form transcription. We open-source our model and TatSC to encourage further research. We envision that our fine-tuning approach will guide the creation of multilingual speech recognition models for other low-resource language families.
If you use the ISSAI and Institute of Applied Semiotics of Tatarstan Academy of Sciences Söyle: Noise-Robust Multilingual Speech Recognition with Long Transcription Featuring the Tatar Speech Corpus for commercial purposes, please add this statement to your product or service:
Our product uses Söyle: Noise-Robust Multilingual Speech Recognition with Long Transcription Featuring the Tatar Speech Corpus (doi: 10.48342/hkc6-yq77), which is available under a Creative Commons Attribution 4.0 International License (Creative Commons — Attribution 4.0 International — CC BY 4.0).
If you use the ISSAI and Institute of Applied Semiotics of Tatarstan Academy of Sciences Söyle: Noise-Robust Multilingual Speech Recognition with Long Transcription Featuring the Tatar Speech Corpus for research, please cite it as:
Saida Mussakhojayeva, Rinat Gilmullin, Daniil Orel, Bulat Khakimov, Adal Abilbekov, Mansur Galimov and Huseyin Atakan Varol. Söyle: Noise-Robust Multilingual Speech Recognition with Long Transcription Featuring the Tatar Speech Corpus. MDPI
We thank the Institute of Applied Semiotics of Tatarstan Academy of Sciences for the fruitful collaboration on the creation of the Tatar Speech Corpus.
Demo instructions:
Click the “RECORD” button to start recording; you can speak for up to 60 minutes. When you finish recording, click the “STOP” button and then the “UPLOAD” button. Please note that some browsers do not support audio recording.
Alternatively, upload a file: select it using the “CHOOSE FILE” button, then click the “UPLOAD” button.
When the final text appears, you can download the transcription by clicking the “Download text” button.
Only the following formats are supported: wav, mp4, flac, mp3.
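If you prepare files in a script before uploading them, a quick check against the format list above can save a failed upload. The following is a minimal sketch, not part of the demo itself: the function name `is_supported_audio` is hypothetical, and it assumes the format is judged by file extension only, which may differ from how the demo actually validates uploads.

```python
from pathlib import Path

# Extensions taken from the list of supported formats above.
SUPPORTED_EXTENSIONS = {".wav", ".mp4", ".flac", ".mp3"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file name ends in one of the listed extensions.

    Assumption: only the extension is inspected (case-insensitively);
    the file contents are not checked.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported_audio("interview.WAV")` is accepted, while a file named `interview.ogg` would need to be converted to one of the listed formats first.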