A blended ensemble approach for accurate human activity recognition
Abstract
Human activity recognition (HAR) is an active area of computer vision with applications in fashion, entertainment, healthcare, and urban planning. Convolutional neural networks (CNNs) have traditionally been used for HAR because of their ability to extract spatial features from images. However, CNNs are less effective at handling varying input sizes and the long-range dependencies found in complex human motion. This work examines an alternative approach based on vision transformers (ViT) and Swin transformers (SwinT), which process images as sequences of patches and apply self-attention. These models excel at learning global relationships and subtle changes in body motion, which makes them well suited to detecting varied and subtle activities. To further improve recognition performance, we propose a hybrid ensemble method that combines ViT and SwinT models of different scales (small, base, and large). Experimental results show that, while individual transformer models are competitive, the hybrid ensemble outperforms them across the board, achieving the highest accuracy together with balanced precision, recall, and F1-score. These findings indicate that the proposed ensemble provides a more scalable and robust solution for accurate human activity recognition than either single-model or CNN-based approaches.
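As an illustration of the ensemble idea described in the abstract, the following is a minimal sketch of one common way to blend several classifiers: a weighted average of their softmax probabilities (soft voting). The abstract does not state the exact blending rule, so the averaging scheme, the function names `softmax`, `blend_predictions`, and `predict_label`, the model names, the logits, and the activity labels are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_predictions(model_logits, weights=None):
    """Weighted average of per-model class probabilities for one image.

    model_logits: dict mapping a (hypothetical) model name to its logits.
    weights: optional dict of non-negative weights; defaults to equal weights.
    """
    names = list(model_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(model_logits[names[0]])

    blended = [0.0] * num_classes
    for name in names:
        probs = softmax(model_logits[name])
        if len(probs) != num_classes:
            raise ValueError("all models must score the same classes")
        for i, p in enumerate(probs):
            blended[i] += (weights[name] / total_weight) * p
    return blended


def predict_label(blended, labels):
    """Return the label with the highest blended probability."""
    best = max(range(len(blended)), key=lambda i: blended[i])
    return labels[best]


# Hypothetical labels and logits, invented for this example only.
labels = ["walking", "sitting", "running"]
logits = {
    "vit_base": [2.0, 0.5, 0.1],
    "swin_base": [0.2, 0.3, 2.5],
    "swin_small": [0.1, 0.2, 1.9],
}
blended = blend_predictions(logits)
print(predict_label(blended, labels))
```

In a sketch like this, the per-model weights would typically be chosen on a held-out validation set, so that stronger models contribute more to the final decision.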
Keywords
Ensemble model; Human activity recognition; Recognition applications; Scalable vision models; Transformer architectures
DOI: http://doi.org/10.11591/ijai.v14.i6.pp5131-5139
Copyright (c) 2025 Rezwana Karim, Afsana Begum, Miskatul Jannat, Abu Kowshir Bitto

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
IAES International Journal of Artificial Intelligence (IJ-AI)
ISSN/e-ISSN 2089-4872/2252-8938
This journal is published by the Institute of Advanced Engineering and Science (IAES).