Publication

Integrating Vision-Language Models and Multimodal Retrieval for Real-Time Emergency Response in Healthcare

Injuries and sudden health crises at home demand rapid medical response. We present a lightweight framework that integrates the PrismerZ vision-language model with a key-frame selection algorithm and a multimodal retrieval system to recognize emergencies from video data. The approach combines image captioning and visual question answering with efficient storage and search of embeddings to support healthcare professionals. Evaluated on the Kinetics benchmark (86.5 % image captioning accuracy, 92.5 % visual question answering accuracy) and on a self-collected dataset of emergency scenarios (85.8 % image captioning accuracy, 87.5 % visual question answering accuracy), the system operated within seconds on an embedded edge device. By uniting anomaly detection and multimedia retrieval, the framework extends human activity recognition toward actionable emergency detection in home environments.

Information about the publication

Authors:

Adil Zhiyenbayev, Rakhat Abdrakhmanov, Huseyin Atakan Varol, Adnan Yazici
PDF