The goal of Speech Command Recognition (SCR) is to detect a predefined set of words or phrases in a speech signal and to trigger a specific response or action based on the identified keyword. Because SCR needs less powerful hardware than general-purpose speech recognition, it can run on edge devices and embedded systems with low power consumption.
SCR is widely used in applications such as voice-controlled smart home devices, personal digital assistants, robotics, IoT, and industrial automation. It can also be used in security and surveillance systems to detect specific trigger words and alert law enforcement or security personnel to potential threats. Most SCR systems have been developed for English, owing to the availability of the large-scale Google Speech Commands Dataset (GSCD). This dataset is available in two versions, V1 (30 keywords) and V2 (35 keywords), and provides a wide range of English keyword recordings.
To support the development of these technologies for the Kazakh language, we have developed the Kazakh Speech Commands Dataset (KSCD). The dataset consists of 35 keywords translated from the GSCD-V2. The Kazakh keywords are as follows: “артқа”, “алға”, “оңға”, “солға”, “төмен”, “жоғары”, “жүр”, “тоқта”, “қос”, “өшір”, “иә”, “жоқ”, “үйрен”, “орында”, “нөл”, “бір”, “екі”, “үш”, “төрт”, “бес”, “алты”, “жеті”, “сегіз”, “тоғыз”, “оқы”, “жаз”, “төсек”, “құс”, “мысық”, “ит”, “бақытты”, “үй”, “ағаш”, “көрнекі”, “мәссаған”. Each recording is one second long and stored in WAV format with a sampling rate of 16 kHz. In total, 119 participants (62 males, 57 females) took part in data collection via a Telegram bot. Moderators manually reviewed the collected recordings and removed subpar samples, including incomplete or incorrect readings as well as quiet or silent recordings. The final dataset contains 3,623 recordings. You can still contribute to the development of SCR systems for the Kazakh language by participating in data collection via the Telegram bot: https://t.me/kz_commands_collector_bot. With more data, we can build a more robust model!
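For readers who want to work with recordings in this format, the following is a minimal sketch of how a clip could be checked against the stated specification (one second, 16 kHz, WAV). It uses only Python's standard `wave` module. The function name `check_recording` and the `tolerance_s` allowance are assumptions made for illustration; they are not part of the KSCD tooling, and the moderators' actual review was manual.

```python
import wave

EXPECTED_RATE_HZ = 16_000   # sampling rate stated in the text
EXPECTED_DURATION_S = 1.0   # clip length stated in the text


def check_recording(path, tolerance_s=0.05):
    """Check a WAV file against the stated format.

    Returns (ok, details). `tolerance_s` is a hypothetical allowance for
    clips that are slightly shorter or longer than one second.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        # Duration in seconds = number of frames / frames per second.
        duration = wav.getnframes() / rate

    rate_ok = rate == EXPECTED_RATE_HZ
    duration_ok = abs(duration - EXPECTED_DURATION_S) <= tolerance_s
    return rate_ok and duration_ok, {"rate_hz": rate, "duration_s": duration}
```

Calling `check_recording` on a file path returns a boolean together with the measured sampling rate and duration, so a non-conforming clip can be inspected rather than silently rejected. A check like this covers only the container format; it says nothing about whether the keyword was read correctly or loudly enough.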
To validate the efficacy of KSCD, we trained and evaluated a state-of-the-art SCR model, Keyword-MLP. The model achieved 97% accuracy on the test set. We have made the dataset, source code, and pre-trained models publicly available in our GitHub repository. We have also created comprehensive video tutorials about the project, which are available on our YouTube channel. For more detailed information, we recommend reading our paper, “Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need”.
In addition, we trained a tiny model (538.25 KB) on Edge Impulse and deployed it on the Arduino Nicla Voice (shown in photos), one of the most advanced TinyML development boards available on the market. The board incorporates a high-performance microphone, an IMU (Inertial Measurement Unit), a Cortex-M4 Nordic nRF52832 MCU (Microcontroller Unit), and the Syntiant NDP120 Neural Decision Processor. For more information, please read our comprehensive tutorial, “Small-Footprint Keyword Spotting for Low-Resource Languages”, on Hackster.