ISSAI KAZ-LLM: Kazakhstan’s Large Language Model

In recent years, the field of generative AI, particularly Large Language Models (LLMs), has witnessed remarkable progress, revolutionizing areas ranging from natural language understanding to creative content generation. Cutting-edge models like OpenAI’s GPT-4o and Google’s Gemini have set new benchmarks, demonstrating unprecedented levels of sophistication and capability. However, these advancements have predominantly catered to high-resource languages like English, Chinese, Japanese, and Russian, leaving a substantial gap in linguistic inclusivity. Recognizing this disparity, numerous countries are now channeling efforts into developing their own national LLMs, aiming to tailor these transformative technologies to diverse linguistic landscapes and cultural contexts.

ISSAI is creating the Kazakh Large Language Model (KAZ-LLM) so that Kazakhstan can also benefit from advances in generative AI, using them to improve the quality of life of its people and to drive economic development.

KAZ-LLM will be able to create content in the languages most relevant to Kazakhstan: Kazakh, Russian, and English. It will play a crucial role in preserving national cultural heritage and will encompass ideological perspectives, historical context, specialized domains, and conversational data that represent Kazakhstan. By tailoring generative AI to local needs, KAZ-LLM exemplifies how national projects can address linguistic gaps and contribute to the global landscape of AI innovation. Most importantly, the project is helping to build an advanced generative AI workforce: through hands-on work preparing the data and training and deploying the model, Kazakhstan is fostering a new wave of research personnel capable of creating generative AI models and tools.

How do we collect data for KAZ-LLM?

To build a strong LLM, both a large quantity and a high quality of data are needed. Specifically, LLMs require billions of tokens for training. Tokens are the basic units of textual data and can be words, parts of words (subwords), characters, or even emojis. Currently, the KAZ-LLM training corpus consists of over 72 billion tokens, with 95% of this data collected and curated by the ISSAI team. The tokens for KAZ-LLM come from a variety of public sources, including Kazakh websites, news articles, and documents from online libraries and databases. We also translate high-quality English data into Kazakh and use data provided by partner organizations.

Over 72 billion training tokens
97% of the training data collected and curated by the ISSAI team
Multilingual data: Kazakh, English, Russian, and Turkish
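To make “tokens” concrete, here is a minimal sketch using the Hugging Face transformers library. The generic public GPT-2 tokenizer is used purely for illustration; it is not the KAZ-LLM tokenizer.

```python
# Illustrative only: counting tokens with a generic public tokenizer
# (GPT-2, byte-level BPE). This is not the KAZ-LLM tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for text in ["Hello, world!", "Сәлем, әлем!"]:  # English vs. Kazakh
    pieces = tok.tokenize(text)
    print(f"{text!r} -> {len(pieces)} tokens")

# A tokenizer trained mostly on English splits Kazakh (Cyrillic) text into
# far more pieces, which is one reason corpus sizes are measured in tokens.
```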

While many companies claim to offer open-source LLMs, they often provide only model weights, withholding the data and training recipes; Meta’s Llama 2 and Llama 3 are examples of this. The research community often reverse-engineers these recipes, as seen in the OpenLLaMA project. A fully open-source alternative, OLMo, was developed by the Allen Institute for AI, featuring a 7-billion-parameter architecture with an open recipe, data, and benchmarking scripts. Initially, we experimented with this model on our dataset in the cloud on NVIDIA H100 systems, successfully creating a tokenizer and generating grammatically correct responses in Kazakh and English.
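As a rough illustration of this kind of prototyping, the sketch below loads a publicly released OLMo checkpoint and generates a completion. It assumes the “allenai/OLMo-7B-hf” checkpoint on Hugging Face; the prompt is illustrative, and ISSAI’s actual experimental setup is not shown.

```python
# Sketch of prototyping with the fully open OLMo 7B model, assuming the
# public Hugging Face checkpoint "allenai/OLMo-7B-hf". This is not ISSAI's
# training pipeline, only an illustration of loading and querying the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B-hf", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Қазақстанның астанасы"  # "The capital of Kazakhstan"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```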

How do we train the LLM?

ISSAI began training our 8B model on 23 July 2024 using 8 NVIDIA DGX H100 nodes in the cloud. The model architecture follows current state-of-the-art designs. We created a new tokenizer optimized for Kazakh by replacing unused tokens, improving efficiency without compromising quality for other languages such as English, Russian, and Turkish. This efficiency is crucial for both training and deploying the Kazakh LLM. A real-time demo of a recent model prototype is available on our YouTube channel.
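ISSAI’s tokenizer was built by replacing unused tokens in an existing vocabulary. The sketch below shows a related route, retraining a subword vocabulary on Kazakh-heavy text with the train_new_from_iterator method of Hugging Face fast tokenizers, to illustrate why an adapted vocabulary encodes Kazakh in fewer tokens. The base checkpoint and the tiny corpus are illustrative.

```python
# Sketch: adapt a subword vocabulary to Kazakh by retraining it on a
# Kazakh-heavy corpus. (ISSAI instead replaced unused tokens in an existing
# vocabulary; both approaches aim to encode Kazakh in fewer tokens.)
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")  # illustrative base tokenizer

corpus = [  # tiny illustrative corpus; the real one has billions of tokens
    "Қазақстан Республикасы Орталық Азиядағы мемлекет.",
    "Kazakhstan is a country in Central Asia.",
    "Казахстан - государство в Центральной Азии.",
]

# Learn new merges so that frequent Kazakh character sequences become single
# tokens; fewer tokens per sentence means cheaper training and inference.
kaz_tok = base.train_new_from_iterator(corpus, vocab_size=32_000)

print(len(base.tokenize(corpus[0])), "tokens with the original vocabulary")
print(len(kaz_tok.tokenize(corpus[0])), "tokens after retraining")
```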

The importance of computational resources

Access to computational resources is crucial for generative AI, including large language models. Training such models requires immense computational power to process large datasets efficiently and effectively. ISSAI’s local hardware consists of a cluster of 4 DGX A100 servers, and the Institute is renting 8 DGX H100 servers from a cloud provider to train KAZ-LLM. A single training run over the full dataset on these 8 servers takes over a week, and larger models, e.g., with 70 billion parameters, would take months to train on them. ISSAI therefore fully supports President Tokayev’s initiative to create a national supercomputer.
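To get a feel for the scale involved, here is a back-of-envelope estimate based on the widely used 6 · N · D approximation for training FLOPs. The utilization figure is an assumption, and real runs take considerably longer because of data loading, checkpointing, evaluation, and multiple passes over the data.

```python
# Back-of-envelope compute estimate via the common "6 * params * tokens"
# approximation for training FLOPs. All figures below are assumptions for
# illustration, not measurements from the KAZ-LLM run.
params = 8e9            # 8B-parameter model
tokens = 72e9           # ~72B training tokens, one pass
total_flops = 6 * params * tokens          # ~3.5e21 FLOPs

gpus = 8 * 8            # 8 DGX H100 nodes x 8 GPUs each
peak_per_gpu = 989e12   # H100 BF16 dense peak, FLOP/s
mfu = 0.30              # assumed model FLOPs utilization

seconds = total_flops / (gpus * peak_per_gpu * mfu)
print(f"~{seconds / 86400:.1f} days per pass at {mfu:.0%} utilization")
# Multiple data passes, evaluation, checkpointing, and restarts push the
# real wall-clock time well beyond this lower bound.
```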

The finalized model weights will be shared with the general public

The finalized model (i.e., the model weights) will be released as open source in December 2024, as a vital component of Kazakhstan’s soft digital infrastructure, stimulating the introduction of generative AI products and services for the Kazakhstani people. The model will also be accessible to the general public, with options for a subscription-based playground and an API for broader usage.
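Once the weights are public, loading them should follow the standard open-weights workflow. The sketch below assumes a Hugging Face release; the repository name is a placeholder, not an announced identifier.

```python
# Hypothetical sketch of loading the released open weights with the standard
# Hugging Face workflow. The repository name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "issai/kaz-llm-8b"  # placeholder, not an announced identifier

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Сәлем! Қазақстан туралы қысқаша айтып берші."  # "Hi! Tell me briefly about Kazakhstan."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```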

Our partners


ISSAI collaborates with the Ministry of Digital Development, Innovation and Aerospace Industry of the Republic of Kazakhstan, the Ministry of Science and Higher Education of the Republic of Kazakhstan, the National Information Technologies company (NIT JSC), the National Scientific and Practical Center “Til-Qazyna” named after Shaysultan Shayakhmetov, the Sustainable Innovation and Technology Foundation (SITF), Maksut Narikbayev University, and Al-Farabi Kazakh National University on the scientific and administrative aspects of the project.

Collaborations and media inquiries

We are open to new collaborations and additional support from other companies and institutions. For more information about the project, media inquiries, or collaboration proposals, please contact us at issai@nu.edu.kz.