CNN-BiLSTM-Based Automatic Speech Recognition for Factory Noise Environments

Authors

  • Tomasz Andrzej Woźniak, Faculty of Energy and Environmental Engineering, Silesian University of Technology, 44-100 Gliwice, Poland
  • Agata Barbara Pietrzak, Faculty of Energy and Environmental Engineering, Silesian University of Technology, 44-100 Gliwice, Poland

Keywords:

Automatic Speech Recognition, Deep Learning, Industrial Noise, CNN-BiLSTM Architecture, Signal Processing, Edge Computing, Robustness, Factory Automation

Abstract

To achieve stable voice command recognition in extremely noisy industrial environments, this paper proposes an optimized automatic speech recognition (ASR) system based on a CNN-BiLSTM architecture. The goal is robust recognition of spoken commands on the factory floor despite fluctuating background noise, sudden acoustic disturbances, and changing operating conditions. For joint temporal and spectral modeling, the proposed architecture combines log-Mel feature extraction, multi-layer convolutional modules, and a deep BiLSTM, followed by a connectionist temporal classification (CTC) decoder. Experiments used a dataset of more than 1800 hours of transcribed speech, covering noise from both simulated and real-world environments. The results show that the CNN-BiLSTM ASR system consistently maintains a word error rate (WER) below 23% at extremely low signal-to-noise ratios (SNR < 5 dB). Under the same acoustic conditions, it outperformed competing transformer and hybrid HMM-DNN models by 7% and 14%, respectively, and sentence-level command accuracy on complex factory instructions improved by more than 10%. Further analysis shows that, compared with the baseline method, deletion and substitution errors were reduced by up to 44% and 50%, respectively. Evaluation on an industrial edge computing platform confirmed the feasibility of real-time inference, with a real-time factor close to 1.0.
Based on these findings, the CNN-BiLSTM ASR network delivers high accuracy and stable operation, and it can also be applied to quality inspection, voice-controlled automation, and related applications.
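
The abstract reports word error rate, deletion and substitution errors, and real-time factor. As an illustration only, the following minimal Python sketch shows how these quantities are conventionally computed; it is not the authors' evaluation code. The function names (align_counts, word_error_rate, real_time_factor) and the example commands are hypothetical, and the definitions assumed are the standard ones: WER = (S + D + I) / N over reference words, and RTF = processing time / audio duration.

```python
def align_counts(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions) for a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)              # match, no cost
            else:
                diag = (t + 1, s + 1, d, n)      # substitution
            t, s, d, n = dp[i - 1][j]
            delete = (t + 1, s, d + 1, n)        # reference word missing
            t, s, d, n = dp[i][j - 1]
            insert = (t + 1, s, d, n + 1)        # extra hypothesis word
            # Tuples compare on total edits first, so this keeps a minimum-edit path.
            dp[i][j] = min(diag, delete, insert)
    _, subs, dels, ins = dp[-1][-1]
    return subs, dels, ins


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N, with N the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(reference, hypothesis)
    return (subs + dels + ins) / n_ref


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values at or below 1.0 keep up with real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical factory commands, used only to exercise the functions.
    print(align_counts("stop the main pump", "stop main pump"))
    print(word_error_rate("stop the main pump", "stop main pump"))
    print(real_time_factor(9.5, 10.0))
```

Counting substitutions, deletions, and insertions separately, rather than only the total edit distance, is what makes it possible to report reductions per error type as the abstract does.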

Published

2026-01-07

How to Cite

Woźniak, T. A., & Pietrzak, A. B. (2026). CNN-BiLSTM-Based Automatic Speech Recognition for Factory Noise Environments. Journal of Green Energy and Environmental Engineering, 4(1), 15–28. Retrieved from http://www.wpias.edu.pl/ojs/index.php/JGEEE/article/view/124

Section

Articles