Injuries and sudden health crises at home demand rapid medical response. We present a lightweight framework that integrates the PrismerZ vision-language model with a key-frame selection algorithm and a multimodal retrieval system to recognize emergencies from video data. The approach combines image captioning and visual question answering with efficient storage and search of embeddings to support healthcare professionals. The system achieved 86.5% image captioning and 92.5% visual question answering accuracy on the Kinetics benchmark, and 85.8% and 87.5%, respectively, on a self-collected dataset of emergency scenarios, while operating within seconds on an embedded edge device. By uniting anomaly detection and multimedia retrieval, the framework extends human activity recognition toward actionable emergency detection in home environments.
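To make the described pipeline concrete, the sketch below illustrates the general shape of such a system: key-frame selection over a clip, captioning of the selected frames, and embedding storage with similarity search. It is a minimal conceptual sketch, not the authors' implementation; the functions `select_key_frames`, `vlm_caption`, `embed_text`, and `EmbeddingIndex`, along with the thresholds and dimensions, are hypothetical placeholders for the paper's key-frame algorithm, the PrismerZ captioning/VQA step, and the retrieval backend.

```python
# Conceptual sketch of the described pipeline (hypothetical names, not the paper's code).
import numpy as np

def select_key_frames(frames, max_frames=8):
    """Keep frames that differ most from the last selected frame
    (a simple stand-in for the paper's key-frame selection algorithm)."""
    picks = [0]
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float) - frames[picks[-1]].astype(float)))
        if diff > 20.0:  # assumed change threshold
            picks.append(i)
        if len(picks) >= max_frames:
            break
    return [frames[i] for i in picks]

def vlm_caption(frame):
    """Placeholder for vision-language captioning/VQA on one key frame."""
    return "person lying motionless on the kitchen floor"  # illustrative output only

def embed_text(text, dim=384):
    """Placeholder text embedding; a real system would use a learned encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class EmbeddingIndex:
    """Tiny in-memory store supporting cosine-similarity search over captions."""
    def __init__(self):
        self.vectors, self.payloads = [], []

    def add(self, vector, payload):
        self.vectors.append(vector)
        self.payloads.append(payload)

    def search(self, query_vector, k=3):
        scores = np.array(self.vectors) @ query_vector
        top = np.argsort(scores)[::-1][:k]
        return [(self.payloads[i], float(scores[i])) for i in top]

# Usage: caption key frames of a clip, index the captions, then retrieve by text query.
clip = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(30)]
index = EmbeddingIndex()
for frame in select_key_frames(clip):
    caption = vlm_caption(frame)
    index.add(embed_text(caption), caption)
print(index.search(embed_text("person collapsed at home")))
```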