English remains the digital lingua franca, accounting for nearly half of all publicly accessible content on the Internet, while most other languages have only a limited digital footprint. Although Kazakh is spoken in a country of more than 21 million people, it represents just 0.035% of global web content, according to Common Crawl 2026 data (source: https://commoncrawl.github.io/cc-crawl-statistics/plots/languages).
A small population is not necessarily an obstacle to a strong digital presence. Estonia provides a compelling example: despite having a population of only 1.3 million people, Estonian accounts for 0.168% of publicly accessible web content, according to Common Crawl 2026 data. When adjusted for population size, Estonia’s raw data wealth per capita (the amount of publicly accessible web data relative to the size of the population) is approximately 76 times higher than Kazakhstan’s.
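The per-capita comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the rounded figures quoted in this article (web shares of 0.035% and 0.168%, populations of 21 million and 1.3 million) and, like the text, treats each country's population as a stand-in for the number of speakers. The function and variable names are illustrative, not part of any ISSAI or Common Crawl tooling. With these rounded inputs the ratio comes out at roughly 77.5, close to the figure of about 76 cited above; the small gap presumably reflects more precise population or share values in the original calculation, which is an assumption on our part.

```python
# Rounded figures quoted in the article (assumed inputs, not exact source data).
KAZAKH_WEB_SHARE_PCT = 0.035      # share of web content in Kazakh, percent
ESTONIAN_WEB_SHARE_PCT = 0.168    # share of web content in Estonian, percent
KAZAKHSTAN_POPULATION_M = 21.0    # population, millions
ESTONIA_POPULATION_M = 1.3        # population, millions


def share_per_million(web_share_pct: float, population_millions: float) -> float:
    """Return the web-content share (in percent) per million inhabitants."""
    return web_share_pct / population_millions


kazakh_per_million = share_per_million(KAZAKH_WEB_SHARE_PCT, KAZAKHSTAN_POPULATION_M)
estonian_per_million = share_per_million(ESTONIAN_WEB_SHARE_PCT, ESTONIA_POPULATION_M)

# How many times larger Estonia's per-capita share is than Kazakhstan's.
ratio = estonian_per_million / kazakh_per_million

print(f"Kazakh share per million people:   {kazakh_per_million:.5f}%")
print(f"Estonian share per million people: {estonian_per_million:.5f}%")
print(f"Estonia-to-Kazakhstan ratio:       {ratio:.1f}x")
```

Because the result is a ratio of two shares, the total size of the web cancels out, so the comparison needs only the percentages and the population figures.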
The availability of publicly accessible web data is particularly important because this is where the story of AI begins: LLMs and other generative AI systems are trained using data collected from the public web. Languages with richer online representation are therefore better positioned to participate in and benefit from the rapidly evolving AI ecosystem.
At the Institute of Smart Systems and Artificial Intelligence (ISSAI) of Nazarbayev University, we are working to address this gap by expanding the availability of high-quality digital Kazakh-language resources (models and datasets). These efforts have been recognized not only locally but also by the international research community. At an Association for Computational Linguistics (ACL) conference, Kazakhstan was recognized as “The Rising Star” for its expanding digital footprint and diverse multimodal datasets, with ISSAI acknowledged as the key contributor to this progress (source: https://aclanthology.org/2025.loreslm-1.25.pdf).
ISSAI’s public repositories on Hugging Face and GitHub, collectively the largest open-source collection of AI models and datasets in Central Asia, comprise 18 open-source models and 50 datasets. These include KazLLM, the first large language model for Kazakh; benchmarks for evaluating LLMs and vision-language models in Kazakh, such as MMBench-Kazakh, MMLU_Redux_2.0_Kazakh, and Beynele-Bench; speech and translation technologies, including Tilmash, KazakhTTS, and KazakhTTS2; multimodal models such as Qolda-AVL and Beynele; and many more. These resources make Kazakh increasingly accessible to the global digital community and strengthen its presence in the digital age.
ISSAI’s models and datasets have already been downloaded more than 500,000 times by organizations and institutes from 38 countries. They are being successfully integrated into the products and services of both international technology companies, including Meta, OpenAI, Qwen, and DeepSeek, and leading Kazakhstani companies, such as Kaspi, Freedom, Beeline, AstanaHub, and Halyk. They have also been downloaded by leading local universities, such as Nazarbayev University, Al-Farabi KazNU, Astana IT University, and Satbayev University, as well as by international universities, including the National University of Singapore, Queen Mary University of London, MBZUAI, and others. ISSAI will continue contributing to the digitalization of the Kazakh language by developing AI models and datasets and by openly sharing them, in line with our principles of “AI for Good” and “AI for Kazakhstan.”
Ultimately, the future of the Kazakh language in the digital age will depend on collective action. ISSAI therefore invites technology companies, research institutions, and other stakeholders in Kazakhstan to contribute by openly sharing Kazakh-language datasets and resources. Such open-source contributions will serve as a foundation for future research, AI models, and innovative products and services that benefit the people of Kazakhstan and strengthen the country’s digital future.