Publication

MMHAR-28: Human Action Recognition Across RGB, Thermal, Depth, and Event Modalities

Currently, most human action recognition (HAR) models are trained on RGB video datasets. However, RGB cameras suffer from motion blur and low illumination, raise privacy concerns, and lack 3D spatial detail. Existing HAR datasets are often unimodal or lack consistent protocols across modalities, hindering robust multimodal learning. To bridge this gap, we present MMHAR-28, an extensive multimodal HAR dataset of 10,298 videos covering 28 actions captured with four camera types: RGB, depth, thermal, and event-based sensors. It spans daily activities, sports exercises, and multi-person interactions, providing a unified benchmark for multimodal learning. To demonstrate the efficacy of MMHAR-28, we trained several HAR models, including VideoMamba, UniFormerV2, and TSM, and examined the effect of increased temporal context, modality-specific performance, and the generalization benefits of multimodal features. Notably, the multimodal UniFormerV2-MM-32 model achieved top-1 accuracies of 97.86% on RGB, 90.71% on depth, 86.58% on event, and 98.57% on thermal data. We have made our dataset, source code, and models publicly available at https://github.com/IS2AI/MMHA-28 to support research in this area.
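To illustrate what a unified benchmark across the four sensor streams implies in practice, the following is a minimal PyTorch sketch of a loader that returns time-synchronized clips from all modalities. The folder layout, file naming, and the `MMHARClips` class are illustrative assumptions, not the repository's actual interface; see the GitHub link above for the released code.

```python
# Minimal sketch of a paired multimodal clip loader (illustrative; the actual
# MMHAR-28 layout and API may differ -- see the repository for the real code).
from pathlib import Path

import torch
from torch.utils.data import Dataset
from torchvision.io import read_video  # decodes a video to a (T, H, W, C) uint8 tensor

MODALITIES = ("rgb", "depth", "thermal", "event")  # the four sensor streams

class MMHARClips(Dataset):
    """Assumes root/<modality>/<action>/<clip>.mp4 with matching file names
    across modalities; this layout is an assumption for illustration."""

    def __init__(self, root: str, num_frames: int = 32):
        self.root = Path(root)
        self.num_frames = num_frames
        # Index clips by their RGB copies; sibling modalities share the name.
        self.items = sorted((self.root / "rgb").glob("*/*.mp4"))
        self.classes = sorted({p.parent.name for p in self.items})

    def __len__(self) -> int:
        return len(self.items)

    def _sample(self, frames: torch.Tensor) -> torch.Tensor:
        # Uniformly subsample num_frames frames, move channels first, scale to [0, 1].
        idx = torch.linspace(0, frames.shape[0] - 1, self.num_frames).long()
        return frames[idx].permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)

    def __getitem__(self, i: int):
        rgb_path = self.items[i]
        label = self.classes.index(rgb_path.parent.name)
        clip = {}
        for m in MODALITIES:
            path = self.root / m / rgb_path.parent.name / rgb_path.name
            frames, _, _ = read_video(str(path), pts_unit="sec")
            clip[m] = self._sample(frames)
        return clip, label
```

Wrapped in a standard `DataLoader`, such a dataset supports both per-modality training and fused multimodal training of the kind reported for UniFormerV2-MM-32; the 32-frame default mirrors the temporal context suggested by that model's name.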

Information about the publication

Authors:

Rakhimzhanova Tomiris, Kuzdeuov Askat, Muratov Artur, Varol Huseyin Atakan
PDF