The fusion of audio, visual, and thermal modalities has proven effective in developing reliable person verification systems. In this study, we improved multimodal person verification performance by augmenting the training data using domain transfer methods. Specifically, we enriched the audio-visual-thermal SpeakingFaces dataset with real audio-visual data from the VoxCeleb dataset and synthetic thermal data derived from it. We generated the synthetic thermal data by translating the visual images in VoxCeleb to the thermal domain with a CycleGAN trained on SpeakingFaces. Our results show that the augmented training data improved all unimodal and multimodal models. The score fusion of the unimodal audio, unimodal visual, bimodal, and trimodal systems trained on the combined data achieved the best results on both datasets and was robust in low-illumination and noisy conditions. Our findings highlight the value of synthetic data produced by generative methods for improving deep learning model performance. To facilitate reproducibility and further research in multimodal person verification, we have made our code, pretrained models, and preprocessed dataset freely available in our GitHub repository.
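As an illustration of the score fusion step, the following is a minimal sketch rather than the released implementation. It assumes that each system (unimodal audio, unimodal visual, bimodal, trimodal) scores a verification trial by the cosine similarity between an enrollment embedding and a test embedding, and that fusion is a weighted mean of those scores with equal weights by default. The function names, the system labels, the score values, and the threshold are all hypothetical and chosen only for the example.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two embedding vectors (assumed trial score)."""
    dot = sum(x * y for x, y in zip(enroll, test))
    norm_enroll = math.sqrt(sum(x * x for x in enroll))
    norm_test = math.sqrt(sum(y * y for y in test))
    return dot / (norm_enroll * norm_test)


def fuse_scores(scores, weights=None):
    """Weighted mean of per-system scores; equal weights if none are given.

    `scores` maps a (hypothetical) system label to that system's trial score.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total_weight


def verify(scores, threshold, weights=None):
    """Accept the trial if the fused score reaches the decision threshold."""
    return fuse_scores(scores, weights) >= threshold


# Illustrative scores for one trial; these numbers are made up.
example_scores = {
    "audio": 0.62,
    "visual": 0.48,
    "audio_visual": 0.71,
    "audio_visual_thermal": 0.75,
}
fused = fuse_scores(example_scores)
accepted = verify(example_scores, threshold=0.5)
```

In practice the threshold and any non-uniform weights would be tuned on held-out trials, and scores from different systems are often normalized before they are combined; neither choice is specified here.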