31st May 2024

Elevating Kazakhstan’s AI Capabilities: ISSAI’s Transformation and Ambitious Kazakh Large Language Model Project

Following the directive issued by President Kassym-Jomart Tokayev at Digital Bridge 2023, the Institute of Smart Systems and Artificial Intelligence (ISSAI) has been elevated to a full-fledged research institute under Nazarbayev University. This transformation, which took place in May 2024, marks a milestone in Kazakhstan’s technological and academic advancement. Now operating as an autonomous private institution, ISSAI employs a team of 4 administrative staff, 50 researchers (data scientists, research assistants, and computer engineers), and 17 data curators under the leadership of its founding director, Dr. Huseyin Atakan Varol.

Development of the Kazakh Large Language Model

With financial support from Astana Hub (Ministry of Digital Development, Innovations and Aerospace Industry), the Nazarbayev Fund, and the Social Development Fund of Nazarbayev University, ISSAI is developing a Kazakh Large Language Model (LLM), a project set to revolutionize artificial intelligence (AI) capabilities in Kazakhstan and the Central Asian region. The institute has already begun work on at least two LLMs. The initial phase involves training a seven-billion-parameter model based on OLMo to produce a prototype capable of interaction in the Kazakh language. ISSAI also aims to explore larger model architectures, such as those with 13 billion parameters (e.g., Llama and Mistral), which will not only support Kazakh-language interaction but also perform retrieval-augmented generation. Because Kazakhstan currently has no supercomputers, ISSAI has started training on a cloud computing platform with a modest number of NVIDIA H100 nodes.

Multilingual and Multicultural Focus

The training corpus for these models will consist of at least 100 billion tokens drawn from four languages, Kazakh, Russian, English, and Turkish, with each language contributing 25 billion tokens. This multilingual approach reflects Kazakhstan’s diverse cultural landscape, enabling the models to engage fluently in Kazakh, the state language; Russian, which is widely spoken; English, the language of global integration; and Turkish, a Turkic language with abundant publicly available data and linguistic similarities to Kazakh. The initiative is expected to have a significant impact on Kazakhstan’s society and economy. The resulting soft digital infrastructure will allow products and services to be delivered to the people of Kazakhstan, improving local accessibility and offering export potential.
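The corpus composition described above can be checked with a few lines of arithmetic. The following is a minimal illustrative sketch, not ISSAI's actual data pipeline; the variable names are hypothetical, and the only figures used are the 25 billion tokens per language stated in the text.

```python
# Illustrative arithmetic only: the names below are hypothetical and
# the per-language figure is the one stated in the text above.
TOKENS_PER_LANGUAGE = 25_000_000_000  # 25 billion tokens

corpus_plan = {
    "Kazakh": TOKENS_PER_LANGUAGE,
    "Russian": TOKENS_PER_LANGUAGE,
    "English": TOKENS_PER_LANGUAGE,
    "Turkish": TOKENS_PER_LANGUAGE,
}

# Total corpus size and each language's share of it.
total_tokens = sum(corpus_plan.values())
shares = {lang: count / total_tokens for lang, count in corpus_plan.items()}

for lang, count in corpus_plan.items():
    print(f"{lang:<8} {count / 1e9:>4.0f}B tokens  {shares[lang]:.0%}")
print(f"{'Total':<8} {total_tokens / 1e9:>4.0f}B tokens")
```

Under these figures, the four languages are weighted equally at one quarter of the corpus each, and together they reach the stated minimum of 100 billion tokens.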

Data Sources and Language Processing Capabilities

The diverse data sources for this project include Wikipedia articles, news agencies, state-related websites, and open-source datasets (e.g., Common Crawl), all publicly accessible. Over the past five years, ISSAI has developed numerous natural language processing datasets specifically for the Kazakh language. These rich and varied datasets will empower the Kazakh LLM to excel in a wide array of tasks, such as question answering, text summarization, translation, and named entity recognition.

Model Release and Future Plans

The first Kazakh LLM is slated for completion on December 16, 2024, coinciding with the 33rd anniversary of the Independence of the Republic of Kazakhstan. The model, including its weights, will be released as open-source software, forming a critical component of Kazakhstan’s digital infrastructure. To facilitate widespread use, ISSAI will offer a subscription-based playground for general users and a specialized application programming interface (API) for advanced users to integrate the model into their products. The playground will support interaction with the models, reinforcement learning from human feedback, and fine-tuning for optimized performance in various scenarios. The API will enable seamless integration into websites, smartphone apps, scripts, and personal computer programs.

Foundational Speech Model Development

Leveraging its extensive experience in automatic speech recognition, text-to-speech generation, and neural machine translation, ISSAI is developing a foundational speech model. This model will support streaming speech recognition, text-to-text translation, text-to-speech generation, speech-to-text translation, and speech-to-speech translation. It will be integrated into both the ISSAI playground and the Kazakh LLM API, facilitating speech-based interaction.

Training Programs for State and Private Sectors

Recognizing the significant demand for AI training among civil servants and middle- to senior-level company management, ISSAI is developing a comprehensive paid training program tailored to both the state and private sectors. The program will cover essential AI topics, including machine learning, deep learning, AI infrastructure, foundation models, modern AI tools, and AI ethics. Participants will gain both theoretical and practical knowledge.

In addition, soon after the Kazakh LLM is released, ISSAI plans to launch a follow-up program designed to familiarize the general public with the model. This supplementary program will cover topics such as prompt engineering, retrieval-augmented generation, and ISSAI’s playground and API, so that users can make effective use of the Kazakh LLM.

Collaboration and Future Vision

ISSAI extends a warm invitation to local collaborators, calling for passionate and dedicated partners to join this groundbreaking initiative. ISSAI seeks committed professionals who are ready to invest their expertise, resources, and energy, and aims to forge strong alliances that will drive this ambitious project forward. The initiative represents a transformative opportunity to nurture a new generation of intellectual leaders and innovators, poised to spearhead the development and implementation of cutting-edge generative AI technologies in Kazakhstan. Together, we can build a robust foundation for a future in which Kazakhstan stands at the forefront of the global AI revolution.