Noise-Robust Automatic Speech Recognition for Industrial and Urban Environments

Automatic Speech Recognition (ASR) models can approach human parity on clean speech, but their performance degrades significantly in noisy industrial and urban environments. In this paper, we present monolingual and multilingual ASR models that remain effective even under extreme noise conditions. Specifically, we first generated a large synthetic noise-augmented dataset of Kazakh and English speech, and then used this data to train mono- and multilingual ASR models based on state-of-the-art deep learning architectures. To evaluate our models, we compared them with models trained on the original data. The results show that our models outperform the original ones in terms of word error rate (WER). For the monolingual case, our model achieves an average WER of 25.1%, while the original model yields 39.9%. For the multilingual case, we created a noise-robust customization of the Whisper model, which significantly improves performance on Kazakh (for both original and noisy data) and yields an average 10% performance improvement on noisy English audio. Finally, we tested our models in an industrial setting, demonstrating that they are robust to unseen noise types and suitable for real-life applications.
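The core of the noise augmentation described above is mixing noise recordings into clean speech at a controlled signal-to-noise ratio (SNR). A minimal sketch of such mixing is shown below; the function name `mix_at_snr` and the choice of SNR value are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into clean speech at a target SNR in dB.

    Illustrative helper (not the paper's actual code): the noise is
    rescaled so that 10 * log10(P_speech / P_noise) equals snr_db,
    then added sample-wise to the speech.
    """
    if len(noise) < len(speech):
        # Tile the noise if it is shorter than the speech clip.
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale factor that brings the noise power to the target level.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example usage with synthetic signals (real augmentation would load
# speech and noise waveforms from audio files instead):
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(8000)     # shorter noise clip, will be tiled
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Sweeping `snr_db` over a range of values (e.g. from high SNR down to 0 dB or below) is one common way to cover the "extreme noise conditions" a training set like the one described here needs.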

Daniil Orel, Huseyin Atakan Varol