CLIP-Based Approach for Zero-Shot Visual Recognition in Industrial Assembly Scenarios

Authors

  • Sylwia Katarzyna Lisowska, Faculty of Computing and Telecommunications, Poznan University of Technology, 61-138 Poznan, Poland

DOI:

https://doi.org/10.64972/dea.2026.v5i1.1624d:46-59

Keywords:

Visual Recognition, Zero-Shot Learning, Industrial Assembly, Multi-Modal Embedding, Semantic Alignment, Contrastive Learning, Open-Set Recognition

Abstract

Due to the complexity and variability of the assembly process, automatic detection and classification of industrial parts remain an open problem. Existing supervised recognition methods are poorly suited to dynamic production environments because they require large amounts of manual labeling and do not generalize to new categories. This paper introduces a zero-shot visual recognition framework based on Contrastive Language-Image Pretraining (CLIP) for industrial assembly applications. The method builds a unified multimodal embedding space in which technical component descriptions are aligned with image features, so new components can be identified without retraining. Semantic alignment mechanisms, adaptive category prototypes, and domain-specific prompts connect heterogeneous textual documentation with visual features. A large-scale industrial dataset of over 60,000 labeled images was created and evaluated under varying lighting, equipment, and occlusion conditions at four production sites. The system reaches a Top-1 accuracy of 86.7%, significantly higher than the transformer-based and convolution-based baselines, exceeding them by 4.3% and 6.7%, respectively. The Macro-F1 score also improves in medium-frequency and rare categories, and performance remains stable in mobile deployment and production line environments. Ablation studies validate the effectiveness of the adaptive prompt module and context aggregation. The framework is therefore a scalable and practical option for open-set recognition and flexible quality control in high-end manufacturing.
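
The zero-shot step summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes image and text embeddings have already been produced by CLIP-style encoders, and it uses toy 3-dimensional vectors in their place. The function names (`class_prototype`, `zero_shot_classify`), the class names, and the temperature value are all hypothetical. Each class prototype is the renormalized mean of several prompt embeddings, and an image is assigned to the prototype with the highest cosine similarity.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_prototype(prompt_embeddings):
    """Average the normalized embeddings of several prompts for one class,
    then renormalize (a simple form of prompt ensembling)."""
    normed = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def zero_shot_classify(image_embedding, prototypes, temperature=0.01):
    """Return a softmax distribution over classes from cosine similarities.
    The temperature is an assumed value, not one taken from the paper."""
    img = l2_normalize(image_embedding)
    names = list(prototypes)
    logits = [
        sum(a * b for a, b in zip(img, prototypes[n])) / temperature
        for n in names
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}


# Toy vectors standing in for text-encoder outputs (two prompts per class).
text_embeddings = {
    "hex bolt": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "washer": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
prototypes = {name: class_prototype(e) for name, e in text_embeddings.items()}

# Adding a new category needs only new text embeddings, no retraining.
prototypes["spring clip"] = class_prototype([[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]])

# Toy vector standing in for an image-encoder output.
probs = zero_shot_classify([0.05, 0.1, 0.95], prototypes)
predicted = max(probs, key=probs.get)
print(predicted)
```

In this toy setup the third category is introduced purely through its text prompts, which mirrors the abstract's claim that new components can be recognized without retraining the model.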

Published

2026-01-05

How to Cite

Lisowska, S. K. (2026). CLIP-Based Approach for Zero-Shot Visual Recognition in Industrial Assembly Scenarios. Data Engineering and Applications, 5(1), 4d:46–59. https://doi.org/10.64972/dea.2026.v5i1.1624d:46-59

Section

Articles