During the week of 20–25 June, 2022, the 13th Edition of the Language Resources and Evaluation
Conference (LREC) was held in Marseille, France. While the central streets of the Mediterranean city were crowded with hordes of tourists suffocating under the scorching sun, one of Marseille’s most beautiful landmarks – the Palais du Pharo – opened its doors to nearly a thousand conference participants from all over the world. For six days, the palace, built in 1858 by Emperor Napoleon III for Empress Eugénie, was filled with an informal atmosphere and the voices of participants giving presentations covering the full range of natural language processing subjects from part-of-speech tagging to statistical methods and machine learning.
For the first time in the history of LREC, the Institute of Smart Systems and Artificial Intelligence (ISSAI)
presented two papers – KazakhTTS2: Extending The Open-Source Kazakh TTS Corpus With More Data,
Speakers, And Topics by Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol and KazNERD: Kazakh Named Entity Recognition Dataset by Rustem Yeshpanov, Yerbolat Khassanov, Huseyin Atakan Varol. While the KazakhTTS2 paper, offering a greatly expanded version of the previously published Kazakh text-to- speech synthesis corpus, was presented remotely by ISSAI postdoctoral scholar Yerbolat Khassanov, ISSAI technical writer Rustem Yeshpanov had the luxury of giving an on-site poster presentation on KazNERD.
KazNERD is the largest publicly available dataset for Kazakh named entity recognition. The dataset was
annotated according to guidelines developed by ISSAI specifically for and in the Kazakh language. ISSAI also developed four state-of-the-art machine learning models to automate Kazakh named entity recognition, with the best model achieving an F 1 -score of 97.22%. The KazNERD poster presentation was unanimously well received by conference attendees. They gave high praise to the informativeness of the poster as well as its bright colours and illustrations.
Rustem Yeshpanov also participated in a workshop entitled “Dataset Creation for Lower-Resourced
Languages”, where he was able to share his first-hand experience of creating an annotated dataset for Kazakh. At the conference, Rustem had the honour of meeting ELRA President António Branco, ELRA Honorary President Nicoletta Calzolari, and ELRA Secretary General Khalid Choukri in person and gave them an informal presentation on ISSAI’s past and ongoing projects. President Branco left a congratulatory note on the KazNERD poster and signed it. Rustem was also able to establish friendly contacts with researchers, such as Chihiro Taguchi (Nara Institute of Science and Technology / University of Edinburgh), Sardana Ivanova (University of Helsinki), Jonne Sälevä (Brandeis University), and Buse Çarık (Sabanci University), whose research focuses on Turkic languages.
It is worth noting that there were four research projects from Kazakhstan at LREC. All the projects
represented Nazarbayev University. Besides KazakhTTS2 and KazNERD, the other two papers were
Crowdsourcing Kazakh-Russian Sign Language: DailySigners-50 by Medet Mukushev, Aidyn Ubingazhibov, Aigerim Kydyrbekova, Alfarabi Imashev, Vadim Kimmelman, and Anara Sandygulova and Cyrillic-MNIST: a Cyrillic Version of the MNIST Dataset by Bolat Tleubayev, Zhanel Zhexenova, and Anara Sandygulova from the Department of Robotics and Mechatronics.