
Introducing Cultural Knowledge in Language Models: KazCulture Dataset for Kazakh Culture

Although Large Language Models (LLMs) have achieved significant advances in linguistic fluency, they often lack the cultural knowledge associated with low-resource languages. This deficiency can hinder their integration into high-stakes applications across diverse regions. In this paper, we present a systematic method for embedding nation-specific cultural knowledge into LLMs, using Kazakh culture and language as a case study. We introduce KazCulture, a robust Kazakh culture-specific dataset composed of 16,137 human-crafted Passage-Question-Answer (PQA) triplets. KazCulture is rigorously curated from 11 culture-related books and the Koshpendiler.kz digital archive, capturing deep cultural semantics in areas such as customs, traditions, beliefs, cuisine, and household practices. Using KazCulture, we evaluated 36 LLMs, including proprietary frontier models and open-source alternatives. Our benchmarking reveals a critical disparity: while proprietary models such as GPT-5 and Gemini-2.5-Pro achieved up to 80% accuracy, open-source models struggled significantly. To bridge this gap, we propose a two-stage adaptation pipeline: (1) fine-tune on a general multilingual dataset (ISSAI-SFT) for linguistic robustness, then (2) perform targeted fine-tuning on KazCulture. This method boosted the accuracy of the Qwen3-32B model from a baseline of 39.51% to 64.54%. KazCulture provides a timely contribution to Artificial Intelligence (AI) research, serving both as a rigorous benchmark of Kazakh cultural knowledge and as a training resource for developing culture-aware LLMs. The dataset is available at https://huggingface.co/datasets/issai/KazCulture
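To make the benchmark setup concrete, the sketch below shows how accuracy over PQA triplets could be computed. It is a minimal illustration, not the paper's evaluation code: the field names (`passage`, `question`, `answer`), the prompt template, and the exact-match scoring rule are assumptions; the actual schema and metric are defined by the KazCulture dataset and paper.

```python
# Illustrative PQA evaluation harness (a sketch; field names, prompt format,
# and the exact-match metric are assumptions, not the paper's protocol).

def build_prompt(passage: str, question: str) -> str:
    """Compose a context-grounded prompt from one PQA triplet."""
    return f"Passage: {passage}\nQuestion: {question}\nAnswer:"

def accuracy(triplets, answer_fn) -> float:
    """Fraction of triplets whose predicted answer matches the gold answer.

    Uses case-insensitive exact match; real benchmarks may apply softer
    matching (normalization, multiple-choice letters, LLM-as-judge, etc.).
    """
    if not triplets:
        return 0.0
    correct = sum(
        answer_fn(build_prompt(t["passage"], t["question"])).strip().lower()
        == t["answer"].strip().lower()
        for t in triplets
    )
    return correct / len(triplets)

# Toy usage with a hypothetical one-item set and a stub "model":
triplets = [
    {"passage": "A besik toi celebrates placing a newborn in the cradle.",
     "question": "Which celebration marks placing a newborn in the cradle?",
     "answer": "Besik toi"},
]
print(accuracy(triplets, lambda prompt: "besik toi"))  # 1.0
```

In practice, `answer_fn` would wrap a call to an evaluated model (proprietary API or a locally hosted open-source checkpoint), keeping the scoring loop identical across all 36 systems.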

Information about the publication

Authors:

Akylbek Maxutov; Batyr Arystanbekov; Zhanat Makhataeva; Adil Yergen; Nurbek Taizhanov; Guldana Nauryzbaikyzy