Occlusion-Robust Cross-Modal 3D Human Pose Estimation via Adaptive RGB-D Fusion and Hierarchical Graph Reasoning

İlknur Mağden; Gülcan Kahraman; Nuran Şahbaz

doi:10.64972/jaat.2024v2.254p10e:138-148

Authors

İlknur Mağden Faculty of Computer Science, Istanbul University, Istanbul, 34452, Turkey
Gülcan Kahraman Faculty of Computer Science, Istanbul University, Istanbul, 34452, Turkey
Nuran Şahbaz Faculty of Computer Science, Istanbul University, Istanbul, 34452, Turkey

Keywords:

Cross-Modal Learning, 3D Pose Estimation, Graph Convolutional Networks, Occlusion Robustness

Abstract

Three-dimensional (3D) human pose estimation is central to intelligent systems for human–computer interaction, motion analysis, and clinical assessment, yet existing deep learning models that rely on a single modality remain vulnerable to occlusion, sensor noise, and the perceptual limits of single-modal input. To address these issues, we propose a unified cross-modal 3D pose estimation framework that integrates RGB and depth data through a novel adaptive fusion module, which selectively combines key features from both modalities at an intermediate stage of the network to preserve complementary information. The fused features are represented as a skeleton map and processed by a hierarchical graph convolutional network that captures both local and global structural dependencies. Experiments on the Human3.6M and MSRA datasets validate the approach: on Human3.6M, the proposed method achieves a mean per-joint position error (MPJPE) of 27.6 mm, compared with 34.1 mm for the single-modal baseline and 30.8 mm for previous cross-modal methods (reductions of 6.5 mm and 3.2 mm, or about 19% and 10%, respectively), while the Percentage of Correct Keypoints (PCK) at 50 mm reaches 94.5%. These results indicate that the proposed framework improves estimation accuracy and robustness under occlusion and noisy input while maintaining high inference efficiency and manageable model complexity. Our findings highlight the importance of joint multimodal fusion and hierarchical structural reasoning for robust, scalable 3D pose perception in unconstrained environments.
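For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how MPJPE and PCK at a distance threshold are commonly computed for a single pose. It is not the authors' evaluation code. The function names, the toy coordinates, and the choice to count a joint as correct when its error is less than or equal to the threshold are assumptions made for illustration, and any root-joint alignment that a benchmark protocol may apply before measuring the error is omitted.

```python
import math


def per_joint_errors(pred, gt):
    """Euclidean distance (mm) between each predicted and ground-truth joint.

    pred and gt are sequences of (x, y, z) tuples in millimetres, in the
    same joint order. (Hypothetical helper, not from the paper.)
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of joints")
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mpjpe(pred, gt):
    """Mean per-joint position error: the average of the per-joint distances."""
    errors = per_joint_errors(pred, gt)
    return sum(errors) / len(errors)


def pck(pred, gt, threshold_mm=50.0):
    """Percentage of joints whose error is within threshold_mm.

    Assumption: a joint exactly at the threshold counts as correct.
    """
    errors = per_joint_errors(pred, gt)
    correct = sum(1 for e in errors if e <= threshold_mm)
    return 100.0 * correct / len(errors)


# Toy three-joint pose (made-up numbers): the per-joint errors are
# 30 mm, 40 mm, and 80 mm.
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(0.0, 0.0, 30.0), (100.0, 40.0, 0.0), (0.0, 200.0, 80.0)]

print(f"MPJPE: {mpjpe(pred, gt):.1f} mm")       # MPJPE: 50.0 mm
print(f"PCK@50mm: {pck(pred, gt, 50.0):.1f}%")  # PCK@50mm: 66.7%
```

In the toy pose, two of the three joints fall within 50 mm, so PCK at 50 mm is 66.7% while MPJPE is 50.0 mm. The two metrics are complementary: MPJPE is sensitive to the size of large errors, whereas PCK only records whether each joint clears the threshold.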