Skeleton-Based Action Recognition Method Using Spatial-Temporal Graph Convolutional Networks

Mehmet Kaya; Gül Yıldız; Can Özcan

doi:10.64972/dea.2026.v5i1.16911d:147-160

Authors

Mehmet Kaya, Faculty of Engineering, Department of Computer Engineering, Istanbul Technical University, 34469 Istanbul, Turkey
Gül Yıldız, Faculty of Engineering, Department of Computer Engineering, Istanbul Technical University, 34469 Istanbul, Turkey
Can Özcan, Faculty of Engineering, Department of Computer Engineering, Istanbul Technical University, 34469 Istanbul, Turkey

Keywords:

Pattern Recognition, Skeleton-Based Action Recognition, Spatial-Temporal Graph Convolutional Network, Attention Mechanism, Temporal Modeling

Abstract

This paper introduces a skeleton-based action recognition method that combines Spatial-Temporal Graph Convolutional Networks (ST-GCN) with adaptive attention mechanisms. Skeletal data has become a primary input for human action recognition in pattern recognition because it is compact, less affected by noise, and carries explicit structural information. The proposed method constructs an explicit spatial graph for each skeleton sequence, encoding the joint positions and their relative layout. To enhance discriminative power, temporal dynamics are modeled with integrated temporal convolutional layers, and a global attention module adaptively emphasizes the most informative joints and time steps. Experiments were conducted on three public datasets: NTU RGB+D 120, Kinetics Skeleton 400, and PKU-MMD. The model achieved a Top-1 accuracy of 93.1% on NTU RGB+D 120, surpassing many strong baselines. Macro F1 scores across subclasses range from 0.90 to 0.96, indicating that performance holds up under action overlap and class imbalance. Ablation experiments show that both the attention modules and the spatial graph convolutions are necessary: removing them lowers performance by up to 3.8%. In cross-dataset transfer evaluation, the model still achieves over 86% accuracy, demonstrating its generalizability. The model is also relatively robust to noise, maintaining 65.2% accuracy under high-level perturbations, whereas other methods reach only 24.7%. These results indicate that the proposed method is feasible and practical for human-computer interaction and monitoring systems.
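
The pipeline described in the abstract (spatial graph convolution over the skeleton, temporal convolution, then global attention over joints and time steps) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the five-joint toy skeleton, the symmetric adjacency normalization, the fixed smoothing kernel, the energy-based attention score, and all function names are assumptions chosen only to show the data flow.

```python
import numpy as np


def normalized_adjacency(num_joints, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = np.eye(num_joints)                      # self-loops keep each joint's own features
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    d_inv_sqrt = np.diag(a.sum(axis=1) ** -0.5)
    return d_inv_sqrt @ a @ d_inv_sqrt


def spatial_graph_conv(x, a_hat, weight):
    """Aggregate neighbor features per frame, then project channels.

    x: (T, V, C_in), a_hat: (V, V), weight: (C_in, C_out) -> (T, V, C_out)
    """
    return np.einsum("uv,tvc,cd->tud", a_hat, x, weight)


def temporal_conv(x, kernel):
    """Slide an odd-length 1D kernel along the time axis with zero padding.

    The same kernel is shared by every joint and channel (an assumption).
    """
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)))
    return sum(kernel[i] * xp[i:i + x.shape[0]] for i in range(len(kernel)))


def global_attention(x):
    """Softmax over all (time, joint) positions, scored by mean feature energy."""
    scores = (x ** 2).mean(axis=-1)             # (T, V)
    w = np.exp(scores - scores.max())           # subtract max for numerical stability
    return w / w.sum()


rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]        # hypothetical toy skeleton
T, V, C_in, C_out = 8, 5, 3, 4                  # frames, joints, channels
x = rng.standard_normal((T, V, C_in))           # e.g. 3D joint coordinates

a_hat = normalized_adjacency(V, edges)
h = spatial_graph_conv(x, a_hat, rng.standard_normal((C_in, C_out)))
h = temporal_conv(h, np.array([0.25, 0.5, 0.25]))
attn = global_attention(h)                      # (T, V) weights summing to 1
pooled = (attn[..., None] * h).sum(axis=(0, 1)) # attention-weighted sequence descriptor
```

In a trained model the projection weights, temporal kernels, and attention scores would be learned, and `pooled` would feed a classifier over action labels.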