Research Article
Jun Liu1, Jianguang Yi2, Hongli Deng3, Wencan Li3, Feng Yang4, Xiaobo Zhao1, Baobin Luo5
1Key Laboratory of Intelligent Manufacturing for Aerodynamic Equipment of Zhejiang Province, College of Mechanical Engineering, Quzhou University, 2College of Mechanical Engineering, Zhejiang University of Technology, 3Longyan Tobacco Industrial Co., Ltd., 4College of Civil Engineering and Architecture, Quzhou University, 5Baoji Cigarette Factory, China Tobacco Shaanxi Industrial Co., Ltd.
Erratum Notice
Important: There has been an erratum issued for this article.
To address challenges like occlusion and lighting changes in automated warehouses, this paper introduces an improved single-stage regression architecture for cigarette box brand detection. By integrating adaptive downsampling, an inverted efficient multi-scale attention mechanism, and a dynamic detection head, the proposed model achieves accurate, real-time intelligent stocktaking.
Visual recognition for automated cigarette inventory faces significant hurdles, including illumination changes, diverse box dimensions, and partial feature occlusion, which complicate brand verification and misplaced-box detection. This article proposes an improved real-time detection model to address these image recognition problems and improve accuracy. First, an adaptive downsampling module replaces the downsampling convolution modules in both the backbone and neck networks of the YOLO-series baseline detector, which retains more feature details and makes the model more lightweight. Second, an inverted efficient multi-scale attention module is introduced to capture spatial context information at different scales and generate a more accurate spatial attention map, improving the baseline model's prediction accuracy for complex features and occluded targets. Finally, a dynamic detection head module replaces the original detection head of the baseline model and performs multi-scale object detection on the feature maps extracted from the backbone and neck networks, achieving accurate localization and classification of predicted targets. To evaluate the performance of the improved model in the field, we constructed a visual dataset of cigarette box brands. The dataset was augmented using region-specific copy-paste and traditional augmentation techniques, and the resulting dataset covers complex backgrounds, occlusion and overlap, small targets, and other challenging factors. The experiments demonstrate that the improved model presented in this article effectively meets the requirements for real-time detection in the field. The proposed model achieves a mAP of 97.9%, with parameters and FLOPs of 1,849,679 and 5.1 G, respectively. Compared with the baseline model, the proposed model improves mAP by 0.9% while reducing parameters by 28.78% and floating-point operations by 1.4 G.
Additionally, the model reaches an inference speed of 38.5 FPS, satisfying the requirements for real-time industrial detection.
Modern manufacturing and logistics rely heavily on Automated Storage and Retrieval Systems (AS/RS)1 to achieve high-density storage, efficient material handling, and accurate inventory management. Owing to its efficiency and intelligent characteristics, the AS/RS has become a crucial means for enterprises to improve warehouse efficiency and reduce costs. Built on multi-layer shelves and various loading and unloading equipment, the AS/RS is a computer-controlled mechatronic system that integrates multiple technologies, including mechanics, electronics, computer science, communications, networking, sensors, and automatic control2,3. The AS/RS realizes efficient, accurate, and safe storage and handling of goods through the integration and optimization of shelves, stacker cranes, conveyor lines, control systems, and other equipment, which greatly boosts storage efficiency. The integration of computer vision into AS/RS, a key aspect of Industry 4.0, enables unprecedented levels of automation and intelligence by allowing systems to see and make decisions based on visual data4.
With the widespread application of the AS/RS in tobacco factories5, a new stocktaking requirement for cigarette box brands has been proposed. Traditional manual stocktaking methods suffer from inefficiency, high false detection rates, and safety hazards, which fail to meet AS/RS needs. With the continuous advancement of computer vision technology, machine vision solutions are increasingly applied in industrial inspection domains6,7,8. Since brand information with unique patterns and text is printed on each cigarette box, machine vision-based image recognition technology enables cigarette box brand identification and stocktaking.
Since 2012, deep learning-based object detection algorithms have brought significant breakthroughs to image recognition9,10,11. From early two-stage object detection models like R-CNN12 to contemporary single-stage models dominated by the YOLO-series baseline detectors13,14,15, these approaches have contributed significantly to the development of object detection algorithms. Object detection models are increasingly used in intelligent stocktaking tasks to accurately identify and locate multiple targets in images or videos. Li et al.16 proposed an intelligent stocktaking method integrating weighing sensors and image recognition technology to improve stocktaking efficiency and accuracy. Wang et al.17 used Mask R-CNN to segment the bookshelf number and title position from book spine images. Sun et al.18 designed an integrated visual recognition system that used OpenCV to recognize the two-dimensional codes on materials and shelves in multiple modes. They optimized the original detector (v5s) by introducing the C3CBAM attention mechanism and SimOTA label assignment to enhance the loss functions for accurate multi-material detection.
In the field of object detection, while two-stage models, represented by Faster R-CNN, possess traditional advantages in detection precision, their complex region proposal generation mechanisms result in relatively slow inference speeds. Consequently, they struggle to meet the stringent real-time requirements of industrial scenarios19,20. In contrast, the regression-based architecture treats object detection as a single regression problem, directly predicting target locations and category probabilities from input images. This architectural design enables such models to significantly outperform most two-stage models in inference efficiency, lightweight design, and deployment flexibility, while maintaining competitive accuracy. Thus, this architecture has become the preferred solution for real-time industrial detection21.
The original model ecosystem continues to evolve rapidly, with recent variants addressing specific challenges. For instance, the original detector (v12)22 focuses on further enhancing training-time optimization and architectural refinement for better accuracy-efficiency trade-offs. Meanwhile, the original detector (World)23 introduces open-vocabulary detection capabilities, enabling real-time inference on a vast array of objects beyond the training-set categories by leveraging vision-language modeling. While these advancements push the boundaries of general-purpose object detection, this work focuses on a specialized industrial application where robustness to specific challenges, such as partial occlusion and illumination variance, is paramount, and the category space is fixed and known a priori. As the most popular version of this model family, the baseline model24 inherits advantages from the original detector (v8)25,26; it not only reduces the number of model parameters but also balances detection efficiency and accuracy. Compared to other object detection models, it stands out in visual recognition tasks.
The challenge of detecting small or partially occluded cigarette boxes is a manifestation of a broader problem in computer vision. Small object detection has been actively researched, with approaches ranging from feature pyramid network (FPN) refinements27 to context-aware attention mechanisms, which inspire the integration of multi-scale feature processing. Furthermore, the pursuit of operational efficiency in an industrial setting necessitates a lightweight model design. MobileNetV328 and GhostNet29 use depthwise convolution and linear transformations to reduce computational complexity while maintaining representational ability; inspired by these efficient architecture concepts, we chose lightweight modules to improve the model. Therefore, to address the stocktaking of cigarette box brands and the factors that affect it, such as illumination changes, feature size differences among brands, and partial occlusion between cigarette boxes, we propose an improved single-stage regression architecture object detection model in this article.
The principal innovations of the proposed model are threefold. First, the Adaptive Downsampling (ADown)30 module is deployed to replace the standard convolutional downsampling in both the backbone and neck networks. This novel design mitigates information loss during feature compression, which is critical for preserving small brand logos and text. Second, an inverted Efficient Multi-scale Attention (iEMA) mechanism is introduced to empower the model with dynamic, multi-scale contextual perception, significantly boosting its performance on small and partially occluded cigarette boxes. Third, the original detection head is replaced by the DynamicHead31 module, which unifies scale-, spatial-, and task-aware attention, enabling more precise localization and classification across targets of vastly different sizes. These architectural modifications are cohesively designed to address the specific pain points of cigarette brand identification in industrial settings.
The dataset used in this article was acquired from the AS/RS field of the Logistics Department of Longyan Tobacco Industry Co., Ltd. This study did not involve human participants or animals. Ethical approval was not required.
Dataset acquisition and setup
Hardware configuration: Mount an industrial camera (e.g., MV-CA013-A0GM) equipped with an 8 mm focal length lens (e.g., MVL-HF0828M-6MPE) onto the cargo platform of the stacker crane. Connect the camera to a control PC via a GigE cable and a wireless AP device (e.g., BH-ANT5158S-14HV, BH-MS-AC1600HWH). Position a strip LED light (e.g., MV-LLDS-1002-38-W) around the camera to ensure uniform illumination.
Camera parameter setting: Configure the camera using the manufacturer's software (e.g., MVS). Set the acquisition resolution to 1280 x 1024 pixels. Set the trigger mode to Hardware Trigger and synchronize it with the stacker crane's positioning system. Set the exposure time to 100 ms and the gain to 5 dB to minimize motion blur and noise. The working distance between the camera and the cigarette boxes is approximately 550 mm.
Image collection: An industrial camera with a trigger captures an image automatically each time a tray is positioned during the storage and retrieval of cigarette boxes. Collect a total of 4,835 raw images; typical images are shown in Figure 1.
Image annotation: Use the open-source tool LabelImg. For each image, draw bounding boxes around every visible cigarette brand feature. Assign the correct brand name (e.g., Hongzhuang, Gutian) as the class label for each bounding box. Save the annotations in a standard single-stage detection format, with one txt file per image.
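As an illustration of the txt annotation format, a pixel-coordinate bounding box can be converted to one normalized line per object with a short helper (a sketch; the class ID and box values below are hypothetical):

```python
def to_yolo_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-coordinate box to one normalized txt line:
    'class x_center y_center width height', all values in [0, 1]."""
    xc = (x_min + x_max) / 2.0 / img_w
    yc = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Hypothetical box on a 1280 x 1024 image (the acquisition resolution above)
line = to_yolo_line(0, 320, 256, 960, 768, 1280, 1024)
print(line)  # -> "0 0.500000 0.500000 0.500000 0.500000"
```

Each image then gets one such line per visible brand feature in its txt file.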
Quality control: A second annotator reviews a randomly selected 20% of the annotated images. Any discrepancies are discussed and resolved, ensuring labeling consistency.
Data augmentation strategy
This section details both traditional and advanced augmentation techniques applied to the datasets. The initial data and augmented data are combined into a new dataset, which is then divided into training set, validation set, and test set to ensure fairness in evaluation.
Traditional data augmentation32: These transformations are applied in real-time by the training pipeline (e.g., using the Albumentations or PyTorch Transforms libraries) with a defined probability for each batch.
Geometric transformations: Rotation: Randomly rotate images by an angle between -45° and +45°. Scaling: Randomly scale images by a factor between 0.8 and 1.2. Horizontal flip: Flip images horizontally (applied with 100% probability). Random cropping: Randomly crop a section between 80% and 100% of the original image area.
Photometric transformations: Brightness and Contrast: Adjust brightness and contrast by a factor randomly chosen from [0.6, 1.4]. Gaussian noise: Add Gaussian noise with a standard deviation randomly chosen from [0, 0.01 * 255] to 10% of the images. Gaussian blur: Apply a Gaussian blur with a kernel size of 3x3 to 10% of the images.
Occlusion simulation: Randomly place 1 to 3 rectangular occlusion patches (covering up to 5% of the image area each) on the images. The patches are filled with random noise or mean image pixel values.
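The occlusion simulation described above can be sketched in NumPy (a minimal illustration; the patch-size sampling details are assumptions of this sketch):

```python
import numpy as np

def add_occlusion_patches(img, rng, max_patches=3, max_area_frac=0.05):
    """Paste 1-3 rectangular patches, each covering at most max_area_frac
    of the image area, filled with random noise or the mean pixel value."""
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(rng.integers(1, max_patches + 1)):
        # Side lengths capped at sqrt(max_area_frac) of each dimension,
        # so each patch area stays below max_area_frac * h * w
        ph = rng.integers(1, max(2, int(h * np.sqrt(max_area_frac))))
        pw = rng.integers(1, max(2, int(w * np.sqrt(max_area_frac))))
        y = rng.integers(0, h - ph + 1)
        x = rng.integers(0, w - pw + 1)
        if rng.random() < 0.5:
            fill = rng.integers(0, 256, size=(ph, pw) + out.shape[2:],
                                dtype=out.dtype)
        else:
            fill = np.full((ph, pw) + out.shape[2:], out.mean(),
                           dtype=out.dtype)
        out[y:y + ph, x:x + pw] = fill
    return out
```

The function returns a copy, leaving the source image untouched for the other augmentation branches.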
Advanced data augmentation: region-specific copy-paste
Motivation and impact: The primary motivation for this targeted augmentation was to address a critical data issue: several cigarette brands had extremely few instances (as low as one image). A standard random train-validation test split could potentially allocate all instances of a rare brand to a single set (e.g., all to the test set), resulting in the model having no opportunity to learn or validate that brand. This method artificially increased the sample size for these underrepresented brands, ensuring that every brand has a sufficient number of instances distributed across the training, validation, and test sets. This foundational step prevents catastrophic failure in rare categories and is a prerequisite for any meaningful model performance and generalization on the complete set of brands.
This step requires a segmentation variant of the baseline model pre-trained on the initial dataset, as well as the Segment Anything Model (SAM)33.
Segment source objects: For cigarette box images from underrepresented brands, use the SAM model to generate precise segmentation masks. Use the point prompt mode, clicking on the Brand Feature of the Cigarette Box to initiate segmentation. Manually validate and refine generated masks to ensure accuracy. Save the segmented cigarette box images with transparent backgrounds as PNG files. This creates a library of isolated object instances.
Generate background masks: Use the pre-trained segmental baseline model to perform instance segmentation on a set of background images (images with ample free space or simple backgrounds). The model will output binary masks indicating the regions already occupied by objects.
Process masks and paste objects: For each background mask, use the OpenCV library ('cv2.approxPolyDP' function with an epsilon factor of 0.02 * contour perimeter) to perform curve fitting and linearize the mask edges, obtaining a simplified polygon representation of the free space. Identify the largest contiguous polygon area within the background mask as the target paste region. Randomly select a segmented source object from the library. Resize it (maintaining aspect ratio) and apply an affine transformation (minor rotation between -5° and +5°, and slight shear) to fit naturally within the target paste region, ensuring it does not overlap with the boundaries of existing objects. Use alpha blending to seamlessly paste the transformed object onto the background image at the calculated location. The overall framework of data augmentation technology based on the copy-paste technique in specific areas is shown in Figure 2. Programmatically generate a corresponding new txt annotation file for the pasted object, updating the bounding box coordinates and class label.
Final augmented dataset: This offline process generates 600 new synthetic images. Combine them with the original 4,835 images and an additional 1,167 images generated by applying the traditional augmentation techniques as described above to form the final dataset of 6,602 images. Split the final dataset into training (70%, 4,621 images), validation (20%, 1,320 images), and test (10%, 661 images) sets, ensuring that images from the same original sequence are kept within the same split to prevent data leakage.
Model modification for constructing the proposed improved detector
Optimized algorithms of object detection34 have achieved notable progress in industrial inspection in recent years. This section provides a detailed, step-by-step procedure for modifying the baseline model architecture to create the proposed model. The modifications are implemented by editing the model's configuration file (e.g., config.yaml) and ensuring that corresponding custom module definitions are included in the codebase. The rationale for each modification is provided to elucidate its role in addressing specific detection challenges. The structure of the proposed model is shown in Figure 3.
Replacement of downsampling modules with the ADown module: The standard Convolution-BatchNorm-SiLU (CBS) module used for downsampling employs a single strided convolution, which can act as an information bottleneck35, potentially discarding fine-grained features crucial for recognizing small cigarette boxes and detailed brand logos. The ADown module is designed to alleviate this by implementing a dual-branch structure that captures and preserves a richer set of features during spatial resolution reduction. One branch prioritizes the retention of high-frequency local details, while the other captures a broader, context-rich overview. The fusion of these complementary features results in a more robust and informative representation for subsequent layers.
Action-implementation steps
Module definition: Ensure that a custom PyTorch module named ADown is defined within the project's model definition files. This module must contain two parallel branches. The first branch should consist of a two-dimensional convolution layer with a 3x3 kernel and a stride of 2. The second branch should sequentially consist of a 2x2 max-pooling layer (with stride 2) followed by a 1x1 convolution layer. The outputs of these two branches are to be concatenated along the channel dimension, and the resulting feature map is then passed through a final 1x1 convolution layer to integrate the channels and produce the output. The structure of the ADown module is shown in Figure 4.
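Following the description above, a minimal PyTorch sketch of the ADown module might look as follows (the even channel split between branches is an assumption of this sketch):

```python
import torch
import torch.nn as nn

class ADown(nn.Module):
    """Dual-branch downsampling: a strided 3x3 conv branch and a
    maxpool + 1x1 conv branch, concatenated and fused by a 1x1 conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.branch_conv = nn.Sequential(          # local-detail branch
            nn.Conv2d(c_in, c_mid, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.branch_pool = nn.Sequential(          # contextual branch
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.fuse = nn.Sequential(                 # channel integration
            nn.Conv2d(2 * c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        y = torch.cat([self.branch_conv(x), self.branch_pool(x)], dim=1)
        return self.fuse(y)

x = torch.randn(1, 64, 80, 80)
print(ADown(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```

Both branches halve the spatial resolution, so their outputs concatenate cleanly before the fusing 1x1 convolution.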
Configuration file editing: Open the baseline model configuration file (config.yaml). Systematically identify every instance of the CBS module that is configured for downsampling (i.e., where the stride parameter is set to 2). Replace each of these CBS module entries with the new ADown module.
Design motivation: The design of ADown is motivated by the need to mitigate information loss in standard strided convolutions, which can be detrimental for small objects. Its dual-branch structure is inspired by the idea of parallel processing for capturing both high-frequency spatial details (via convolution) and low-frequency contextual information (via pooling), thereby providing a richer multi-scale feature representation for subsequent layers.
Integration of the iEMA module
Rationale and design principle: To enhance the model's capability to detect small targets and those that are partially obscured, it is essential to incorporate a mechanism that can effectively model multi-scale spatial context. The iEMA module achieves this by embedding an Efficient Multi-scale Attention (EMA)36 component within an inverted residual block (iRMB)37 structure. The iRMB provides a computationally efficient foundation using depthwise separable convolutions, while the EMA component explicitly captures cross-scale spatial interactions through a parallel multi-branch design. This integration allows the model to dynamically recalibrate feature weights, focusing attention on the most discriminative spatial regions across different scales, thereby improving recognition under occlusion and for small objects.
Action-implementation steps
Module definition: Ensure that a custom PyTorch module named iEMA is defined. This module should first implement an inverted residual block. This block typically begins with a pointwise convolution for channel expansion, followed by a depthwise convolution (e.g., 3x3) for spatial feature extraction, and finally a pointwise convolution for channel projection. Immediately following this block, the EMA attention mechanism should be implemented. The EMA mechanism typically employs grouped convolutions and cross-dimensional interactions to efficiently generate a multi-scale spatial attention map without excessive computational overhead. The structure of the iEMA module is shown in Figure 5.
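The inverted residual block at the core of the iEMA module can be sketched as follows (the EMA attention component is omitted here, and the expansion ratio of 4 is an assumption of this sketch):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual block: pointwise expansion -> 3x3 depthwise
    conv -> pointwise projection, with a skip connection."""
    def __init__(self, c, expand=4):
        super().__init__()
        h = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, h, 1, bias=False),                      # expand
            nn.BatchNorm2d(h), nn.SiLU(),
            nn.Conv2d(h, h, 3, padding=1, groups=h, bias=False), # depthwise
            nn.BatchNorm2d(h), nn.SiLU(),
            nn.Conv2d(h, c, 1, bias=False),                      # project
            nn.BatchNorm2d(c))

    def forward(self, x):
        return x + self.block(x)  # residual connection keeps the shape
```

In the full iEMA module, the EMA attention map would be applied to the block's output before the residual addition.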
Configuration file editing: In the config.yaml configuration file, locate the sections defining the neck network, specifically the layers immediately following the C2f modules. At these strategic locations, insert a new layer that calls the iEMA module.
Design motivation: The iEMA module is founded on the principle of enhancing feature discriminability by integrating local feature extraction (via depth-wise convolution) with a lightweight yet effective global context modeling mechanism (EMA). This hybrid approach allows the model to adaptively focus on informative regions across different scales, which is crucial for recognizing partially visible brand logos and text.
Replacement of the detection head with the DynamicHead module
Rationale and design principle: The original detection head may not optimally handle the significant scale variation of cigarette boxes and the complex interplay between localization and classification tasks. The DynamicHead module addresses this by unifying three forms of attention into a coherent structure: scale-aware, spatial-aware, and task-aware attention. The scale-aware component dynamically fuses features from different levels of the feature pyramid, ensuring that objects of all sizes are represented effectively. The spatial-aware component employs a sparse sampling strategy to focus computational resources on the most informative regions within the feature maps. The task-aware component adapts the feature representation specifically for the distinct objectives of bounding box regression and category classification. This unified approach leads to more accurate and robust predictions.
Action-implementation steps
Module definition: This is a complex module that encompasses the three attention mechanisms described by Equations (1) through (4). Its internal structure will include layers for computing scale-wise weights, performing deformable spatial attention, and applying task-specific feature transformations.
\( W(F) = \pi_C\big(\pi_S\big(\pi_L(F) \cdot F\big) \cdot F\big) \cdot F \)  (1)
\( \pi_L(F) \cdot F = \sigma\Big(f\Big(\tfrac{1}{SC}\sum_{S,C} F\Big)\Big) \cdot F \)  (2)
\( \pi_S(F) \cdot F = \tfrac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k}\, F(l;\, p_k + \Delta p_k;\, c) \cdot \Delta m_k \)  (3)
\( \pi_C(F) \cdot F = \max\big(\alpha^1(F) \cdot F_c + \beta^1(F),\ \alpha^2(F) \cdot F_c + \beta^2(F)\big) \)  (4)
Where W(F) denotes the final attention-refined feature map, obtained by sequentially applying the scale-aware πL(F), spatial-aware πS(F), and task-aware πC(F) attention mechanisms to the input feature F. Specifically, in Eq. (2), πL(F) computes a scale-wise attention weight by first averaging across the spatial (S) and channel (C) dimensions, then passing the result through a linear function f(⋅) and a hard-sigmoid activation σ(⋅). In Eq. (3), πS(F) performs sparse spatial attention by aggregating features from K sampled key points pk, each with a learnable offset Δpk to focus on discriminative regions and an importance scalar Δmk. In Eq. (4), πC(F) implements task-aware attention by applying a linear transformation (governed by the learned parameters (α1, β1, α2, β2)) to each channel and selecting the maximum response, allowing the head to specialize for classification and regression tasks. The structure of the DynamicHead module is shown in Figure 6.
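The task-aware attention of Eq. (4) can be illustrated with a minimal standalone function (not the full DynamicHead; the per-channel parameter shapes are assumptions of this sketch):

```python
import torch

def task_aware_attention(f, a1, b1, a2, b2):
    """Eq. (4) applied per channel: for a feature map f of shape (C, H, W)
    and per-channel parameters a1, b1, a2, b2 of shape (C,), take the
    elementwise maximum of two linear transforms (a DY-ReLU-style gate)."""
    a1, b1 = a1.view(-1, 1, 1), b1.view(-1, 1, 1)
    a2, b2 = a2.view(-1, 1, 1), b2.view(-1, 1, 1)
    return torch.maximum(a1 * f + b1, a2 * f + b2)

# With (a1, b1) = (1, 0) and (a2, b2) = (0, 0) this reduces to ReLU
f = torch.tensor([[[-1.0, 2.0]]])  # shape (1, 1, 2)
out = task_aware_attention(f, torch.ones(1), torch.zeros(1),
                           torch.zeros(1), torch.zeros(1))
print(out)  # tensor([[[0., 2.]]])
```

In DynamicHead, the (α, β) pairs are themselves predicted from the feature map, letting the head reshape each channel's activation for its task.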
Configuration file editing: In the config.yaml file, find the section that defines the model's head, which originally contains the Detect module. Replace it with a new entry that calls the DynamicHead module.
Design motivation: The DynamicHead module is grounded in the observation that object detection heads should be adaptive to scale, spatial location, and task. Its unified attention framework, formalized in Eqs. (1-4), allows the model to dynamically recalibrate feature responses based on these three dimensions, leading to more robust predictions against scale variation and background clutter.
Model training
Ensure PyTorch 2.1.0, CUDA 12.1, and all dependencies are installed.
Training command
Before retraining, prepare the divided image data and annotation data in a standard single-stage detection format. Create a .yaml file containing the image paths and data categories. In line with the model improvements, replace the original detector .yaml file with the proposed model .yaml file. Write the train.py file, import the single-stage detector package, load the training model and data paths, and set the training parameters, including imgsz=640, epochs=100, batch=4, workers=0, device='0', optimizer='SGD', close_mosaic=10, etc.
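The train.py described above can be sketched with the widely used ultralytics-style Python API (an assumption, since the protocol does not name the package; the .yaml file names are placeholders):

```python
from ultralytics import YOLO

# Load the modified architecture from its .yaml and train on the
# cigarette dataset with the parameters listed in the protocol.
model = YOLO("proposed_model.yaml")
model.train(
    data="cigarette_dataset.yaml",  # image paths and class names
    imgsz=640,
    epochs=100,
    batch=4,
    workers=0,
    device="0",
    optimizer="SGD",
    close_mosaic=10,  # disable mosaic for the last 10 epochs
)
```

Passing an architecture .yaml (rather than a .pt checkpoint) builds the modified model from scratch before training.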
Hyperparameters: The key parameters are: input image size (640 x 640), batch size (4), initial learning rate (0.01) with cosine annealing scheduler, SGD optimizer with momentum (0.937), and weight decay (0.0005). Mosaic augmentation is enabled by default.
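The cosine annealing schedule can be illustrated as a standalone function (the final learning-rate ratio of 0.01 is an assumption, not stated in the protocol):

```python
import math

def cosine_lr(epoch, total_epochs=100, lr0=0.01, lrf_ratio=0.01):
    """Cosine-annealed learning rate decaying from lr0 at epoch 0
    down to lr0 * lrf_ratio at the final epoch."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr0 * (lrf_ratio + (1 - lrf_ratio) * cos)

print(round(cosine_lr(0), 4))    # 0.01   (initial learning rate)
print(round(cosine_lr(100), 6))  # 0.0001 (final learning rate)
```

The schedule keeps early updates aggressive while letting the final epochs fine-tune with a small step size.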
Training monitoring: The training process will output metrics for each epoch. Monitor the curves for training/validation loss and mAP at 0.5 to ensure convergence and avoid overfitting. Training for 100 epochs is typically sufficient.
Model evaluation and deployment
Evaluation: After training, the best model is saved as runs/train/exp/weights/best.pt. Evaluate it on the validation set using: python val.py --data cigarette_dataset.yaml --weights runs/train/exp/weights/best.pt. The script will output Precision, Recall, mAP at 0.5, etc. Confirm the mAP meets the required threshold (e.g., 97.9%).
Inference for verification: Run inference on sample images to visually verify detection results: python detect.py --weights runs/train/exp/weights/best.pt --source path/to/test/images --conf 0.5. This verification yields the detection results for each image and the model's inference speed.
Deployment: For real-time deployment on the stacker crane's system, convert the PyTorch model (best.pt, ~4.7 MB) to a format like TensorRT or ONNX for optimized inference speed. The final output is bounding boxes with class labels (brand names) and confidence scores.
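The ONNX conversion can be sketched with the same ultralytics-style API (an assumption about the package; the weight path follows the protocol text):

```python
from ultralytics import YOLO

# Export the trained weights to ONNX for optimized runtime inference;
# a TensorRT engine can be produced analogously with format="engine".
model = YOLO("runs/train/exp/weights/best.pt")
model.export(format="onnx")  # writes an .onnx file next to the weights
```

The exported graph is then loaded by the stacker crane's inference runtime, which consumes the bounding boxes, brand labels, and confidence scores.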
Experimental environment and parameter setting
Experiments to verify the performance of the proposed model were conducted on the deep learning framework PyTorch, and details of the experimental environment are listed in Table 1. Here, we used a purpose-built dataset containing 6,602 images that was randomly split into training, validation, and test sets at a ratio of 7:2:1. All experiments used input images with a resolution of 640 x 640 and an initial learning rate of 0.01, with stochastic gradient descent (momentum 0.937) as the optimizer and weight decay applied to prevent overfitting. During training, mosaic augmentation combined four images in each iteration to diversify the backgrounds for detection. All models were trained for 100 epochs with a batch size of 4, while the number of worker threads was set to 0.
Evaluation indicators
The samples of interest for multi-class classification were designated as the positive class, while all other samples constituted the negative class. True positives (TP) are samples correctly predicted as positive, false positives (FP) are samples incorrectly predicted as positive, false negatives (FN) are samples incorrectly predicted as negative, and true negatives (TN) are samples correctly predicted as negative. Building upon this foundation, a variety of commonly used evaluation indicators can be derived to comprehensively assess model performance on object detection tasks. This article evaluates the model using precision, recall, average precision (AP), mean average precision (mAP), and recognition speed (FPS)38. The calculation formulas for these indicators are as follows:
\( P = \frac{TP}{TP + FP} \)  (5)
\( R = \frac{TP}{TP + FN} \)  (6)
\( AP = \sum_{i}\left(R_i - R_{i-1}\right) P_i \)  (7)
\( mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \)  (8)
Where Pi and Ri are the precision and recall values at threshold i, N is the number of categories, and APi is the AP of the i-th category.
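These indicators can be computed directly in Python (a sketch; the counts and PR-curve points below are hypothetical):

```python
def precision_recall(tp, fp, fn):
    """Eqs. (5) and (6): P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Eq. (7): AP as the sum of (R_i - R_{i-1}) * P_i over the sorted
    points of the precision-recall curve."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """Eq. (8): mAP as the mean of the N per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts for one class
p, r = precision_recall(tp=90, fp=10, fn=5)
print(round(p, 3), round(r, 3))  # 0.9 0.947
```

FPS, the remaining indicator, is simply the number of images processed per second during inference.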
Data augmentation results
The region-specific copy-paste augmentation was first and foremost employed to ensure the structural integrity of the dataset. Before augmentation, the 5 rarest brands (e.g., qipilang(lan), jiangjun) had fewer than 5 images each, posing a high risk of being absent from the training set after splitting. The region-specific copy-paste augmentation successfully generated 600 high-quality synthetic images; each of these brands was represented by over 60 instances, guaranteeing their presence across all data splits. Qualitative inspection confirmed that the pasted objects were seamless and contextually appropriate (Figure 2). This process balanced the dataset, which was confirmed to be crucial for improving the recall of underrepresented brands by approximately 15% in subsequent experiments.
To quantitatively evaluate the fidelity of the synthetic images, we computed the Fréchet Inception Distance (FID) score between the 600 generated images and a random sample of 600 real images from the dataset. The achieved FID score of 45.2 indicates a reasonable level of visual similarity.
ADown module results
The integration of the ADown module was designed to preserve fine-grained feature details during downsampling, which is crucial for recognizing small text and logos on cigarette boxes. To verify this, we visualized the feature maps and observed that the ADown module retained more high-frequency spatial information compared to the standard convolutional downsampling. This qualitative improvement translates into quantitative gains in detection accuracy, as demonstrated in the ablation study results, where the ADown module contributes notably to the reduction in model parameters and FLOPs while improving mAP.
iEMA mechanism results
The iEMA module was introduced to enhance the model's focus on small and partially occluded targets through multi-scale contextual awareness. We observed that the attention maps generated by iEMA actively highlighted regions containing small brand logos and partially visible boxes, even under clutter. The consequent boost in performance, especially for challenging cases, is quantitatively analyzed in ablation results, which show a marked improvement in recall for occluded targets.
DynamicHead results
The DynamicHead module aims to unify scale, spatial, and task-aware attention for robust multi-scale detection. Its effect is particularly evident in scenes containing cigarette boxes of vastly different sizes, where it helps in accurately localizing and classifying all instances simultaneously. The significant contribution of DynamicHead to the overall mean average precision is systematically evaluated in the ablation experiments, confirming it as the most impactful individual improvement among the three proposed modules.
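As an illustration, the scale-aware component (πL) of DynamicHead can be sketched as a per-level gate computed from globally pooled features. This is a simplified approximation: the full DynamicHead (Figure 6) additionally applies spatial-aware (πS) and task-aware (πC) attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """Each pyramid level receives a learned scalar gate derived from its
    globally pooled features, letting the head emphasize the levels most
    relevant to the object sizes present in the image."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 1)

    def forward(self, levels):
        # levels: list of (B, C, H_l, W_l) feature maps.
        out = []
        for f in levels:
            g = f.mean(dim=(2, 3))            # (B, C) global average pooling
            gate = F.hardsigmoid(self.fc(g))  # (B, 1) per-level weight in [0, 1]
            out.append(f * gate[:, :, None, None])
        return out
```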
Comparison experiments
To validate the performance of the enhanced model on the task of identifying cigarette brands from their boxes, we compared it with several mainstream single-stage detection baselines (YOLOv5s, YOLOv8n, YOLOv10n, and YOLOv11n) as well as the two-stage algorithm Faster R-CNN. All experiments were performed in the same environment with the same dataset and data partitions to ensure a fair comparison. Table 2 shows that the presented model achieved the best performance in mAP, precision, and recall: its mAP was 97.9%, compared with 97.0% for the baseline model. It also achieved the best figures for parameter count and floating-point operations (FLOPs). Its lower complexity reduced the risk of overfitting the training data, thereby strengthening its generalization capability, and made the model easier to deploy on devices. These results demonstrate that the presented model is suitable for real-time cigarette brand identification.
For a more comprehensive comparison, we also evaluated another state-of-the-art lightweight detector, PP-YOLOE-l, and the transformer-based RT-DETR-l. As shown in Table 2, the proposed model consistently outperforms these competitors in mAP while maintaining a highly competitive model size and computational cost. The superiority of the improved model over the baseline in mAP was confirmed to be statistically significant (p = 0.015 < 0.05) using a paired t-test on the per-image AP scores from the test set.
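The paired t-test used here can be reproduced with scipy. The AP arrays below are hypothetical placeholders (the paper's real per-image scores are not reproduced here); the key point is that the same images are scored by both models, so the per-image differences carry the signal:

```python
import numpy as np
from scipy import stats

# Hypothetical per-image AP scores for illustration only.
rng = np.random.default_rng(42)
ap_baseline = np.clip(rng.normal(0.970, 0.02, size=200), 0, 1)
ap_improved = np.clip(ap_baseline + rng.normal(0.009, 0.01, size=200), 0, 1)

# Paired t-test on matched per-image scores.
t_stat, p_value = stats.ttest_rel(ap_improved, ap_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A p-value below 0.05 rejects the null hypothesis that the two models have the same mean per-image AP.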
Ablation experiments
Ablation experiments systematically assess the contribution of each component of a model to its performance by progressively removing or modifying specific modules. We performed ablation experiments on the three improved modules using a specialized dataset, with the aim of verifying the impact of each module, and of their fusion, on the overall accuracy of the model. The results are shown in Table 3.
Table 3 shows that the enhanced model achieved the best detection performance under the same experimental conditions, with a 0.9% improvement in mAP over the baseline model. Integrating the ADown module alone yielded a 0.5% gain in mAP. This module uses dual-branch collaborative processing for adaptive downsampling, fuses multi-scale features to preserve rich detail while reducing spatial resolution, and uses grouped convolutions to reduce redundant inter-channel computation. Fusing the modules yielded further gains: integrating ADown with iEMA improved the mAP by 0.6%, while combining ADown with DynamicHead increased it by 0.8%. In standalone integration, the iEMA module boosted the mAP by 0.3% and the DynamicHead module by 0.6%. Combining iEMA and DynamicHead improved the mAP by 0.6%; although this did not exceed the gain from adding DynamicHead alone, it increased the precision by 1.1%, indicating complementary benefits. The full integration of all three modules improved the mAP by 0.9% while reducing the number of parameters and FLOPs, which lowered the model's complexity and its risk of overfitting, thereby enhancing its generalization capability.
In summary, the presented model achieved a 0.9% improvement in mAP over the baseline, reaching 97.9%. While reducing both the parameter count and FLOPs, it attained a detection speed of 38.46 FPS, which satisfies the requirements of real-time operation. This also shows that the proposed model can identify diverse cigarette brands across varying box sizes.
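The reported FPS figure corresponds to averaged per-image inference time. A minimal timing sketch (the callable stands in for the actual model forward pass, which is not reproduced here):

```python
import time

def measure_fps(infer_fn, n_warmup=10, n_runs=100):
    """Average frames-per-second of a single-image inference callable.
    Warm-up iterations are excluded so one-time costs (CUDA kernel
    compilation, cache warming) do not skew the estimate."""
    for _ in range(n_warmup):
        infer_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn()
    return n_runs / (time.perf_counter() - start)
```

On GPU, a synchronization call (e.g., `torch.cuda.synchronize()`) should wrap the timed region so that asynchronous kernel launches are fully accounted for.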
Overall conclusion of experimental findings
Figure 7 shows the cigarette brand detection results of the presented model. The model's bounding boxes accurately located the regions containing the key features of each cigarette box and identified the brands correctly, even in images with incomplete features and uneven illumination.
In summary, the experimental results conclusively demonstrate the efficacy of each proposed component in the presented model. The synergistic effect of the region-specific data augmentation, the detail-preserving ADown module, the context-aware iEMA mechanism, and the versatile DynamicHead leads to a model that is not only more accurate but also more efficient and robust to the challenges present in automated warehouse stocktaking. These findings validate the hypothesis that targeted architectural innovations can significantly advance the state of practical object detection in industrial applications.
Data availability:
The cigarette box brand image dataset generated and analyzed during this study, along with the source code for the proposed model, is available from the corresponding author upon reasonable request. The data belong exclusively to the cigarette factory and, owing to legal and ethical limitations, cannot be deposited in public repositories. Anyone wishing to obtain the data should contact the corresponding author, Xiaobo Zhao (zxbggg123@163.com).

Figure 1: Image data obtained by the visual system. Representative raw images of cigarette box brands captured by the industrial camera system under field conditions. Images are captured at a resolution of 1280 x 1024 pixels with a strip LED light to ensure uniform illumination. The standard dimensions of a cigarette box serve as a reference for scale, with the camera positioned at an approximate working distance of 550 mm. Please click here to view a larger version of this figure.

Figure 2: Overall framework of data augmentation technology. Schematic workflow illustrating the integration of region-specific copy-paste and traditional augmentation techniques. SAM refers to the Segment Anything Model used for precise mask generation, and YOLOv11-Seg(baseline_model-Seg) denotes the instance segmentation model used for identifying available background space. Arrows indicate the sequential flow from object segmentation to synthetic image generation. Please click here to view a larger version of this figure.

Figure 3: Structure of the proposed model. Detailed architectural diagram illustrating the integration of the three core innovations. ADown indicates Adaptive Downsampling; iEMA indicates the inverted Efficient Multi-scale Attention mechanism; and DynamicHead represents the unified scale, spatial, and task-aware detection head. Please click here to view a larger version of this figure.

Figure 4: Structure of the ADown module. Detailed internal architecture of the Adaptive Downsampling module. It features a dual-branch structure: the first branch uses a 3x3 convolution with stride 2, while the second branch combines 2x2 max-pooling with a 1x1 convolution. This design preserves both high-frequency details and broad contextual information during spatial reduction. Please click here to view a larger version of this figure.

Figure 5: Structure of the iEMA module. Internal layout of the inverted Efficient Multi-scale Attention module. The module embeds an EMA component within an inverted residual block (iRMB) structure to capture cross-scale spatial interactions and dynamically recalibrate feature weights. Please click here to view a larger version of this figure.

Figure 6: Structure of the DynamicHead module. Unified attention framework for the detection head. It sequentially applies scale-aware (πL), spatial-aware (πS), and task-aware (πC) attention mechanisms to the input feature maps to improve localization and classification accuracy. Please click here to view a larger version of this figure.

Figure 7: Detection results of the proposed model under various challenging scenarios. (A-B) Accurate detection under uneven illumination. (C-D) Robust identification of partially occluded targets. (E-F) Successful detection of small-sized cigarette boxes. (G-H) Correct classification amidst complex background clutter. Please click here to view a larger version of this figure.
| Parameter | Value |
| CPU | Intel(R) Core i9-9900K |
| GPU | NVIDIA GeForce RTX 3080 Ti (12 GB) |
| Operating system | Windows 10 |
| Development environment | Python 3.9, PyTorch 2.1.0, CUDA 12.1 |
Table 1: Experimental environment. Specifications of the hardware and software utilized for model training and performance evaluation. Definitions: GPU: Graphics Processing Unit (NVIDIA GeForce RTX 3080 Ti); CPU: Central Processing Unit; CUDA: Compute Unified Device Architecture for GPU acceleration.
| Model | mAP (%) | Precision (%) | Recall (%) | Parameters | FLOPs (G) |
| Faster-RCNN | 9.97 | 11.2 | 15.3 | 41127845 | 207.1 |
| YOLOv5s | 67.2 | 88.6 | 63.1 | 7112611 | 16.1 |
| YOLOv8n | 92.9 | 89.2 | 86.7 | 3013058 | 8.1 |
| YOLOv10n | 95.6 | 93 | 89.4 | 2709236 | 8.3 |
| YOLOv11n | 97 | 96.6 | 92.1 | 2597234 | 6.5 |
| RT-DETR-l | 96.5 | 96 | 95 | 32884166 | 108.2 |
| PP-YOLOE-l | 72.7 | 58.3 | 79.7 | 52200000 | 110.07 |
| Improved YOLOv11n-AiD | 97.9 | 95.4 | 94.6 | 1849678 | 5.1 |
Table 2: Comparison of results. Quantitative performance metrics of the proposed enhanced detector versus baseline one-stage (V5s, V8n, V10n, V11n) and two-stage (Faster R-CNN) detectors. Definitions: mAP (%): Mean Average Precision at a 0.5 Intersection over Union (IoU) threshold; Params: Total number of trainable parameters; FLOPs (G): Floating-point operations in Giga-operations (computational complexity); FPS: Frames Per Second (inference speed).
| Model | mAP (%) | P (%) | R (%) | Parameters | FLOPs (G) |
| YOLOv11n | 97 | 96.6 | 92 | 2597234 | 6.5 |
| YOLOv11n-ADown | 97.5 | 93.1 | 95 | 2107002 | 5.2 |
| YOLOv11n-iEMA | 97.3 | 94.7 | 93 | 2594538 | 6.4 |
| YOLOv11n-DynamicHead | 97.6 | 95.3 | 95 | 2274014 | 6.1 |
| YOLOv11n-ADown-iEMA | 97.6 | 94.8 | 95 | 2111978 | 5.2 |
| YOLOv11n-ADown-DynamicHead | 97.8 | 96.7 | 94 | 1844702 | 5 |
| YOLOv11n-iEMA-DynamicHead | 97.6 | 96.4 | 94 | 2278990 | 6.2 |
| YOLOv11n-AiD | 97.9 | 95.4 | 95 | 1849678 | 5.1 |
Table 3: Results of ablation experiments. Systematic evaluation of the individual and collective contributions of the proposed modules. The modules included in each configuration (ADown, iEMA, or DynamicHead) are indicated by the model-name suffixes; AiD denotes the full combination of all three. Performance indicators (Precision, Recall, mAP) follow the calculation formulas defined in Equations 5-8.
Accuracy-efficiency trade-off
A central achievement of this model is its superior balance between accuracy and computational efficiency. As evidenced in Table 2, the presented model achieves the highest mAP while simultaneously having the lowest number of parameters and the second-lowest FLOPs among the single-stage detector models compared. This translates directly to practical benefits: a smaller model is faster to deploy, requires less memory, and has lower inference latency and power consumption on both GPU and edge hardware. This makes it exceptionally suitable for real-time applications on resource-constrained platforms, such as the stacker crane's onboard computer, where computational budgets are tight.
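The parameter counts in Tables 2 and 3 (e.g., 1,849,678 for the proposed model) can be reproduced for any PyTorch model with a one-line helper:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters, the 'Parameters' column
    reported in the comparison and ablation tables."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

FLOPs, by contrast, depend on the input resolution and are typically measured with a profiling tool rather than counted from the parameter list.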
Limitations
Despite its strong performance, the enhanced model has several limitations. Its accuracy is contingent upon the representativeness of the training data. Performance may degrade under extreme lighting conditions (e.g., strong specular reflections) or when cigarette boxes are occluded beyond ~70%, scenarios that were underrepresented in this dataset. Exploring and integrating advanced techniques designed explicitly for heavy occlusion, such as those proposed in a relevant robust detection paper39, could be a promising direction for addressing this limitation. Additionally, the real-time performance of 38.5 FPS was achieved on a high-end GPU (NVIDIA GeForce RTX 3080 Ti); the frame rate and power consumption on embedded hardware (e.g., NVIDIA Jetson Nano) require further empirical validation. Finally, although the model is robust to the moderate occlusion and lighting changes present in the dataset, the dataset contains few samples captured under extreme conditions (e.g., occlusion above 70% or very low resolution), so a thorough quantitative evaluation under such conditions was not feasible.
Comparison with alternative methods
This work builds upon the single-stage detector architecture, prized for its speed-accuracy balance. Other paradigms, such as Transformer-based detectors (e.g., DETR), offer the advantage of simplified post-processing but often demand more data and computational resources for training. Another relevant lightweight model is PaddlePaddle You Only Look Once Everything, which utilizes efficient structures such as RepResBlocks. The choice of the baseline model allowed us to build upon a mature, highly optimized codebase, while the custom modules (ADown, iEMA, DynamicHead) provided the specific enhancements needed for our unique industrial challenge, a strategy that may be more pragmatic than adopting a completely different architecture.
Future work
Future research will proceed along several directions. A primary focus will be constructing a dedicated dataset encompassing a wide range of extreme scenarios, including severe occlusion, extreme lighting, and low-resolution images, to enable rigorous quantitative evaluation and further model refinement. As a preliminary exploration, we qualitatively inspected the few challenging samples available in our test set. As shown in Figure 7, the model successfully detected some cigarette boxes with approximately 50% occlusion and others under extremely low lighting, demonstrating its potential robustness; these observations, while not statistically conclusive, are encouraging and warrant further investigation with a larger-scale dataset. We also plan to conduct rigorous deployment testing on embedded systems to quantify real-world power consumption and inference latency, which are critical for large-scale industrial adoption; to explore model compression techniques such as quantization and pruning to further reduce the model size for edge deployment; and to investigate the transferability of the proposed improvements to other industrial object detection tasks, such as parts inspection in manufacturing or parcel sorting in logistics.
Broad implications
This study contributes to the field of industrial intelligent vision by demonstrating a holistic approach to solving a practical problem. It underscores that beyond merely selecting a state-of-the-art model, significant gains can be made through domain-specific data augmentation and targeted architectural refinements. The success of the enhanced model provides a valuable blueprint for applying deep learning to automated warehouse systems and other industrial settings where accuracy, speed, and efficiency are paramount.
The model's significantly reduced parameter count and FLOPs make it a compelling candidate for deployment in the embedded systems commonly used in industrial machinery. This addresses a critical practical concern beyond pure accuracy: the total cost of ownership, which encompasses hardware cost, power consumption, and integration complexity. Future empirical measurements on such devices will provide concrete data on power draw and inference latency, further validating the model's suitability for large-scale deployment.
Conclusions
This article proposes an improved Single-stage Regression Architecture to identify brands of cigarettes based on their packaging, with the aim of enhancing the detection of cigarette brands in intelligent stocktaking. First, we addressed the severe class imbalance in the dataset by designing a region-specific copy-paste method of data augmentation to improve the capability of generalization of the model. Second, the ADown module was used for convolutional downsampling in both the backbone and neck networks. It can preserve rich feature details in the images. Third, the iEMA module was introduced to enhance the capability of the model to identify small and occluded targets. Finally, the DynamicHead module was used to replace the original detection head. It enabled the identification of multi-scale objects in the feature maps extracted from the backbone and neck networks. This yielded the precise localization of the predicted targets, their categorical classification, and improved assessment of the accuracy of the bounding boxes.
The experimental results showed that the above optimizations collectively increased the mAP of the proposed model by 0.9% over the baseline, to 97.9%. Its frame rate reached 38.5 FPS, satisfying the requirements of real-time detection40. With only 1,849,678 parameters, the proposed model is feasible for deployment in real-world settings and can provide effective technical and theoretical support for identifying cigarette brands from their boxes. Despite these promising results, this study has limitations. The model's robustness under a wider array of extreme occlusions and lighting conditions requires further validation. Future work will focus on testing the model in such demanding scenarios, extending the dataset to include more cigarette brands, and optimizing the model for deployment on low-power embedded hardware to assess its practical industrial viability and cost-effectiveness.
The authors declare no direct competing financial interests. This research was conducted in collaboration with Longyan Tobacco Industrial Co., Ltd., where authors Hongli Deng and Wencan Li are employed, and Baoji Cigarette Factory, where author Baobin Luo is employed. These affiliations provided the application scenario and field data for the study. The academic authors (Jun Liu, Jianguang Yi, Feng Yang, Xiaobo Zhao) declare no financial or non-financial conflicts of interest regarding the publication of this work.
This work was financially supported by the Temperature Measuring Method of Casting Billet Based on Preceded Reflector and Multi-wavelength (No.LZY24E050002) funded by the Joint Funds of the Zhejiang Provincial Natural Science Foundation of China, and the Online Temperature Field Measuring Method of Non-closed and Non-isothermal Ladle Cavity (No. 2023K231) funded by the Quzhou Science and Technology Planning Project.
| code reader | Hikvision | ID5050 | Hikvision equipment used to read barcodes on pallets; requires a 24 V power supply |
| industrial camera | Hikvision | MV-CA013-A0GM, MV-CS016-10GM | Requires a 24 V power supply |
| LED light | HIKROBOT | MV-LLDS-1002-38-W | One-meter strip light source providing sufficient illumination for the camera |
| lens | Hikvision | MVL-HF0828M-6MPE | Focal length: 8 mm; aperture range: F2.8-F16; resolution: 6 MP |
| MVS | Hikvision | V4.5.1 | Hikvision software used for debugging cameras, setting camera parameters, and data collection |
| Pycharm | JetBrains | 2020.1.3 x64 | Used for training deep learning models and developing detection systems |
| Several metal brackets | Longyan Tobacco Industrial Co., Ltd. | 7075-T6 Aluminum Alloy | Used for fixing cameras and code readers |
| Several network cables | Longyan Tobacco Industrial Co., Ltd. | RJ45 Crystal Head | Connect the industrial computer to the wireless AP base station, the camera to the switch, etc. |
| Switch | TP-Link | TL-SH1005 | Eight Ethernet ports; requires a 220 V power supply |
| Vision Master | Hikvision | V4.3.0 | Hikvision software used for template matching and checking detection results |
| wireless AP device | Shandong Huachuangxunlian | BH-ANT5158S-14HV, BH-MS-AC1600HWH | Because the stacker crane moves over a large area, a network cable cannot connect the camera to the industrial computer; a wireless AP ensures stable camera networking |