Vision-language models (VLMs) have made remarkable progress in multimodal reasoning. However, low-resource languages such as Kazakh remain underserved due to the scarcity of training data, the absence of dedicated evaluation benchmarks, and limited research attention. This paper presents Qolda, one of the first VLMs specifically designed for the Kazakh language. Built upon Qwen3-4B and integrated into the InternVL3.5 architecture, Qolda comprises 4.3 billion parameters, enabling deployment on consumer hardware while supporting both textual and visual inputs in Kazakh. We present a systematic four-stage adaptation pipeline that combines established techniques: 1) language model adaptation through supervised fine-tuning (SFT) on multilingual instruction samples, 2) vision-language alignment via projection layer training, 3) joint fine-tuning on chain-of-thought (CoT) reasoning traces, and 4) Mixed Preference Optimization (MPO) to refine model capabilities and reduce hallucinations. To support training and evaluation, we construct large-scale trilingual multimodal datasets spanning Kazakh, Russian, and English, and introduce the first vision-language benchmarks for the Kazakh language. Experimental results demonstrate that Qolda achieves substantial improvements over baseline models on Kazakh-specific tasks while maintaining competitive performance in English. The model and datasets are publicly released to facilitate further research in low-resource vision-language modeling.