TMA-Net: a transformer-based multi-modal attention network for abnormal behavior detection
Abstract
Abnormal behavior detection in crowded environments remains challenging due to complex motion patterns, occlusions, and domain variability. This paper presents the transformer-based multi-modal attention network (TMA-Net), a unified framework that integrates red, green, and blue (RGB), optical flow (OF), and heat map (HM) modalities through a dual-stage attention fusion mechanism. The system employs you only look once version 11 (YOLOv11) for human localization and vision transformer (ViT)-B/16 for feature encoding, followed by intra-modal self-attention and cross-modal fusion to capture fine-grained spatial-temporal and motion-energy dependencies. Extensive experiments on six public benchmarks (UMN, Crowd-11, UBNormal, ShanghaiTech, CUHK Avenue, and UCSD Ped2) and the EPUAbN dataset demonstrate that TMA-Net achieves up to 97.5% area under the curve (AUC) and 96–100% accuracy, outperforming previous state-of-the-art approaches. These results highlight the framework's strong generalization and robustness across both single- and cross-dataset evaluations, underscoring its potential for reliable deployment in real intelligent surveillance systems.
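The dual-stage fusion described in the abstract can be illustrated with a minimal sketch, assuming PyTorch and 768-dimensional ViT-B/16 token embeddings; the layer choices, head counts, and concatenation-based cross-modal step are illustrative assumptions, not the authors' exact implementation:

import torch
import torch.nn as nn

class DualStageAttentionFusion(nn.Module):
    """Sketch: intra-modal self-attention per modality (RGB, OF, HM),
    then cross-modal attention over the combined token sequence.
    All dimensions and layers are assumptions for illustration only."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Stage 1: one self-attention block per modality
        self.intra = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(3)
        )
        # Stage 2: cross-modal attention over the concatenated tokens
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)  # normal vs. abnormal

    def forward(self, rgb, flow, heat):
        # Each input: (batch, tokens, dim), e.g. ViT-B/16 patch embeddings
        refined = [blk(x) for blk, x in zip(self.intra, (rgb, flow, heat))]
        fused_seq = torch.cat(refined, dim=1)      # join modality token sequences
        fused, _ = self.cross(fused_seq, fused_seq, fused_seq)
        return self.classifier(fused.mean(dim=1))  # pooled clip-level logits

# Example with random features standing in for ViT-B/16 outputs
x = [torch.randn(2, 197, 768) for _ in range(3)]
print(DualStageAttentionFusion()(*x).shape)  # torch.Size([2, 2])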
Keywords
Abnormal detection; Attention network; Convolutional neural network; Spatial-temporal; Transformer
Full Text: PDF
DOI: http://doi.org/10.11591/ijai.v15.i2.pp1441-1450
Copyright (c) 2026 Huong-Giang Doan, Ngoc-Trung Nguyen

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938
This journal is published by the Institute of Advanced Engineering and Science (IAES).