Research Article
Jun Liu1, Jianguang Yi2, Hongli Deng3, Wencan Li3, Feng Yang4, Xiaobo Zhao1, Baobin Luo5
1Key Laboratory of Intelligent Manufacturing for Aerodynamic Equipment of Zhejiang Province, College of Mechanical Engineering, Quzhou University, 2College of Mechanical Engineering, Zhejiang University of Technology, 3Longyan Tobacco Industrial Co., Ltd., 4College of Civil Engineering and Architecture, Quzhou University, 5Baoji Cigarette Factory, China Tobacco Shaanxi Industrial Co., Ltd.
Erratum Notice
Important: There has been an erratum issued for this article.
To address challenges like occlusion and lighting changes in automated warehouses, this paper introduces an improved single-stage regression architecture for cigarette box brand detection. By integrating adaptive downsampling, an inverted efficient multi-scale attention mechanism, and a dynamic detection head, the proposed model achieves accurate, real-time intelligent stocktaking.
Visual recognition for automated cigarette inventory faces significant hurdles, including illumination changes, diverse box dimensions, and partial feature occlusion, which complicate brand verification and misplaced-box detection. This article proposes an improved real-time detection model to address these image recognition problems and improve accuracy. First, an adaptive downsampling module replaces the downsampling convolution modules in both the backbone and neck networks of the YOLO-series baseline detector, which retains more feature details and makes the model more lightweight. Second, an inverted efficient multi-scale attention module is introduced to capture spatial context information at different scales and generate a more accurate spatial attention map, improving the baseline model's prediction accuracy for complex features and occluded targets. Finally, a dynamic detection head module replaces the original detection head of the baseline model and performs multi-scale object detection on the feature maps extracted from the backbone and neck networks, achieving accurate localization and classification of predicted targets. To evaluate the performance of the improved model in the field, we constructed a visual dataset of cigarette box brands. The dataset was augmented using region-specific copy-paste and traditional augmentation techniques, and the resulting dataset covers complex backgrounds, occlusion and overlap, small targets, and other challenging factors. The experiments demonstrate that the improved model presented in this article effectively meets the requirements for real-time detection in the field. The proposed model achieves a mAP of 97.9%, with parameters and FLOPs of 1,849,679 and 5.1 G, respectively. Compared with the baseline model, the proposed model improves mAP by 0.9% while reducing parameters by 28.78% and floating-point operations by 1.4 G.
Additionally, the model reaches an inference speed of 38.5 FPS, satisfying the requirements for real-time industrial detection.
Modern manufacturing and logistics rely heavily on Automated Storage and Retrieval Systems (AS/RS)1 to achieve high-density storage, efficient material handling, and accurate inventory management. Owing to its efficiency and intelligent characteristics, the AS/RS has become a crucial means for enterprises to improve warehouse efficiency and reduce costs. Built on multi-layer shelves and various loading and unloading equipment, the AS/RS is a computer-controlled mechatronic system that integrates multiple technologies, including mechanics, electronics, computer science, communications, networking, sensors, and automatic control2,3. The AS/RS realizes efficient, accurate, and safe storage and handling of goods through the integration and optimization of shelves, stacker cranes, conveyor lines, control systems, and other equipment, which greatly boosts storage efficiency. The integration of computer vision into AS/RS, a key aspect of Industry 4.0, enables unprecedented levels of automation and intelligence by allowing systems to see and make decisions based on visual data4.
With the widespread application of the AS/RS in tobacco factories5, a new stocktaking requirement for cigarette box brands has been proposed. Traditional manual stocktaking methods suffer from inefficiency, high false detection rates, and safety hazards, which fail to meet AS/RS needs. With the continuous advancement of computer vision technology, machine vision solutions are increasingly applied in industrial inspection domains6,7,8. Since brand information with unique patterns and text is printed on each cigarette box, machine vision-based image recognition technology enables cigarette box brand identification and stocktaking.
Since 2012, deep learning-based object detection algorithms have brought significant breakthroughs to image recognition9,10,11. From early two-stage object detection models like R-CNN12 to contemporary single-stage models dominated by the YOLO-series baseline detectors13,14,15, these approaches have contributed significantly to the development of object detection algorithms. Object detection models are increasingly used in intelligent stocktaking tasks to accurately identify and locate multiple targets in images or videos. Li et al.16 proposed an intelligent stocktaking method integrating weighing sensors and image recognition technology to improve stocktaking efficiency and accuracy. Wang et al.17 used Mask R-CNN to segment the bookshelf number and title position from book spine images. Sun et al.18 designed an integrated visual recognition system that used OpenCV to recognize the two-dimensional codes on materials and shelves in multiple modes. They optimized the original detector (v5s) by introducing the C3CBAM attention mechanism and SimOTA label assignment to enhance the loss functions for accurate multi-material detection.
In the field of object detection, while two-stage models, represented by Faster R-CNN, possess traditional advantages in detection precision, their complex region proposal generation mechanisms result in relatively slow inference speeds. Consequently, they struggle to meet the stringent real-time requirements of industrial scenarios19,20. In contrast, the regression-based architecture treats object detection as a single regression problem, directly predicting target locations and category probabilities from input images. This architectural design enables such models to significantly outperform most two-stage models in inference efficiency, lightweight design, and deployment flexibility, while maintaining competitive accuracy. Thus, this architecture has become the preferred solution for real-time industrial detection21.
The original model ecosystem continues to evolve rapidly, with recent variants addressing specific challenges. For instance, the original detector (v12)22 focuses on further enhancing training-time optimization and architectural refinement for better accuracy-efficiency trade-offs. Meanwhile, the original detector (World)23 introduces open-vocabulary detection capabilities, enabling real-time inference on a vast array of objects beyond the training-set categories by leveraging vision-language modeling. While these advancements push the boundaries of general-purpose object detection, this work focuses on a specialized industrial application where robustness to specific challenges, such as partial occlusion and illumination variance, is paramount, and the category space is fixed and known a priori. As the most popular version of this model family, the baseline model24 inherits advantages from the original detector (v8)25,26; it not only reduces the number of model parameters but also balances detection efficiency and accuracy. Compared to other object detection models, it stands out in visual recognition tasks.
The challenge of detecting small or partially occluded cigarette boxes is a manifestation of a broader problem in computer vision. Small object detection has been actively researched, with approaches ranging from feature pyramid network (FPN) refinements27 to context-aware attention mechanisms, which inspire the integration of multi-scale feature processing. Furthermore, the pursuit of operational efficiency in an industrial setting necessitates a lightweight model design. MobileNetV328 and GhostNet29 use depthwise convolution and linear transformations to reduce computational complexity while maintaining representational ability; inspired by these efficient architecture concepts, we chose lightweight modules to improve the model. Therefore, to address the stocktaking of cigarette box brands and the factors that affect it, such as illumination changes, feature size differences among brands, and partial occlusion between cigarette boxes, we propose an improved single-stage regression architecture object detection model in this article.
The principal innovations of the proposed model are threefold. First, the Adaptive Downsampling (ADown)30 module is deployed to replace the standard convolutional downsampling in both the backbone and neck networks. This novel design mitigates information loss during feature compression, which is critical for preserving small brand logos and text. Second, an inverted Efficient Multi-scale Attention (iEMA) mechanism is introduced to empower the model with dynamic, multi-scale contextual perception, significantly boosting its performance on small and partially occluded cigarette boxes. Third, the original detection head is replaced by the DynamicHead31 module, which unifies scale-, spatial-, and task-aware attention, enabling more precise localization and classification across targets of vastly different sizes. These architectural modifications are cohesively designed to address the specific pain points of cigarette brand identification in industrial settings.
The dataset used in this article was acquired from the AS/RS field of the Logistics Department of Longyan Tobacco Industry Co., Ltd. This study did not involve human participants or animals. Ethical approval was not required.
Dataset acquisition and setup
Hardware configuration: Mount an industrial camera (e.g., MV-CA013-A0GM) equipped with an 8 mm focal length lens (e.g., MVL-HF0828M-6MPE) onto the cargo platform of the stacker crane. Connect the camera to a control PC via a GigE cable and a wireless AP device (e.g., BH-ANT5158S-14HV, BH-MS-AC1600HWH). Position a strip LED light (e.g., MV-LLDS-1002-38-W) around the camera to ensure uniform illumination.
Camera parameter setting: Configure the camera using the manufacturer's software (e.g., MVS). Set the acquisition resolution to 1280 x 1024 pixels. Set the trigger mode to Hardware Trigger and synchronize it with the stacker crane's positioning system. Set the exposure time to 100 ms and the gain to 5 dB to minimize motion blur and noise. The working distance between the camera and the cigarette boxes is approximately 550 mm.
Image collection: An industrial camera with a trigger captures an image automatically each time a tray is positioned during the storage and retrieval of cigarette boxes. Collect a total of 4,835 raw images; typical images are shown in Figure 1.
Image annotation: Use the open-source tool LabelImg. For each image, draw bounding boxes around every visible cigarette brand feature. Assign the correct brand name (e.g., Hongzhuang, Gutian) as the class label for each bounding box. Save the annotations in a standard single-stage detection format, with one txt file per image.
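As an illustration of the txt annotation format, a pixel-coordinate bounding box can be converted to one normalized line per object with a short helper (a sketch; the class ID and box values below are hypothetical):

```python
def to_yolo_line(cls_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-coordinate box to one normalized txt line:
    'class x_center y_center width height', all values in [0, 1]."""
    xc = (x_min + x_max) / 2.0 / img_w
    yc = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Hypothetical box on a 1280 x 1024 image (the acquisition resolution above)
line = to_yolo_line(0, 320, 256, 960, 768, 1280, 1024)
print(line)  # -> "0 0.500000 0.500000 0.500000 0.500000"
```

Each image then gets one such line per visible brand feature in its txt file.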
Quality control: A second annotator reviews a randomly selected 20% of the annotated images. Any discrepancies are discussed and resolved, ensuring labeling consistency.
Data augmentation strategy
This section details both traditional and advanced augmentation techniques applied to the datasets. The initial data and augmented data are combined into a new dataset, which is then divided into training set, validation set, and test set to ensure fairness in evaluation.
Traditional data augmentation32: These transformations are applied in real-time by the training pipeline (e.g., using the Albumentations or PyTorch Transforms libraries) with a defined probability for each batch.
Geometric transformations: Rotation: Randomly rotate images by an angle between -45° and +45°. Scaling: Randomly scale images by a factor between 0.8 and 1.2. Horizontal flip: Flip images horizontally (applied with 100% probability). Random cropping: Randomly crop a section between 80% and 100% of the original image area.
Photometric transformations: Brightness and Contrast: Adjust brightness and contrast by a factor randomly chosen from [0.6, 1.4]. Gaussian noise: Add Gaussian noise with a standard deviation randomly chosen from [0, 0.01 * 255] to 10% of the images. Gaussian blur: Apply a Gaussian blur with a kernel size of 3x3 to 10% of the images.
Occlusion simulation: Randomly place 1 to 3 rectangular occlusion patches (covering up to 5% of the image area each) on the images. The patches are filled with random noise or mean image pixel values.
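The occlusion simulation described above can be sketched in NumPy (a minimal illustration; the patch-size sampling details are assumptions of this sketch):

```python
import numpy as np

def add_occlusion_patches(img, rng, max_patches=3, max_area_frac=0.05):
    """Paste 1-3 rectangular patches, each covering at most max_area_frac
    of the image area, filled with random noise or the mean pixel value."""
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(rng.integers(1, max_patches + 1)):
        # Side lengths capped at sqrt(max_area_frac) of each dimension,
        # so each patch area stays below max_area_frac * h * w
        ph = rng.integers(1, max(2, int(h * np.sqrt(max_area_frac))))
        pw = rng.integers(1, max(2, int(w * np.sqrt(max_area_frac))))
        y = rng.integers(0, h - ph + 1)
        x = rng.integers(0, w - pw + 1)
        if rng.random() < 0.5:
            fill = rng.integers(0, 256, size=(ph, pw) + out.shape[2:],
                                dtype=out.dtype)
        else:
            fill = np.full((ph, pw) + out.shape[2:], out.mean(),
                           dtype=out.dtype)
        out[y:y + ph, x:x + pw] = fill
    return out
```

The function returns a copy, leaving the source image untouched for the other augmentation branches.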
Advanced data augmentation: region-specific copy-paste
Motivation and impact: The primary motivation for this targeted augmentation was to address a critical data issue: several cigarette brands had extremely few instances (as low as one image). A standard random train-validation test split could potentially allocate all instances of a rare brand to a single set (e.g., all to the test set), resulting in the model having no opportunity to learn or validate that brand. This method artificially increased the sample size for these underrepresented brands, ensuring that every brand has a sufficient number of instances distributed across the training, validation, and test sets. This foundational step prevents catastrophic failure in rare categories and is a prerequisite for any meaningful model performance and generalization on the complete set of brands.
This step requires a segmentation variant of the baseline model pre-trained on the initial dataset, as well as the Segment Anything Model (SAM)33.
Segment source objects: For cigarette box images from underrepresented brands, use the SAM model to generate precise segmentation masks. Use the point prompt mode, clicking on the Brand Feature of the Cigarette Box to initiate segmentation. Manually validate and refine generated masks to ensure accuracy. Save the segmented cigarette box images with transparent backgrounds as PNG files. This creates a library of isolated object instances.
Generate background masks: Use the pre-trained segmental baseline model to perform instance segmentation on a set of background images (images with ample free space or simple backgrounds). The model will output binary masks indicating the regions already occupied by objects.
Process masks and paste objects: For each background mask, use the OpenCV library ('cv2.approxPolyDP' function with an epsilon factor of 0.02 * contour perimeter) to perform curve fitting and linearize the mask edges, obtaining a simplified polygon representation of the free space. Identify the largest contiguous polygon area within the background mask as the target paste region. Randomly select a segmented source object from the library. Resize it (maintaining aspect ratio) and apply an affine transformation (minor rotation between -5° and +5°, and slight shear) to fit naturally within the target paste region, ensuring it does not overlap with the boundaries of existing objects. Use alpha blending to seamlessly paste the transformed object onto the background image at the calculated location. The overall framework of data augmentation technology based on the copy-paste technique in specific areas is shown in Figure 2. Programmatically generate a corresponding new txt annotation file for the pasted object, updating the bounding box coordinates and class label.
Final augmented dataset: This offline process generates 600 new synthetic images. Combine them with the original 4,835 images and an additional 1,167 images generated by applying the traditional augmentation techniques as described above to form the final dataset of 6,602 images. Split the final dataset into training (70%, 4,621 images), validation (20%, 1,320 images), and test (10%, 661 images) sets, ensuring that images from the same original sequence are kept within the same split to prevent data leakage.
Model modification for constructing the proposed improved detector
Optimized algorithms of object detection34 have achieved notable progress in industrial inspection in recent years. This section provides a detailed, step-by-step procedure for modifying the baseline model architecture to create the proposed model. The modifications are implemented by editing the model's configuration file (e.g., config.yaml) and ensuring that corresponding custom module definitions are included in the codebase. The rationale for each modification is provided to elucidate its role in addressing specific detection challenges. The structure of the proposed model is shown in Figure 3.
Replacement of downsampling modules with the ADown module: The standard Convolution-BatchNorm-SiLU (CBS) module used for downsampling employs a single strided convolution, which can act as an information bottleneck35, potentially discarding fine-grained features crucial for recognizing small cigarette boxes and detailed brand logos. The ADown module is designed to alleviate this by implementing a dual-branch structure that captures and preserves a richer set of features during spatial resolution reduction. One branch prioritizes the retention of high-frequency local details, while the other captures a broader, context-rich overview. The fusion of these complementary features results in a more robust and informative representation for subsequent layers.
Action-implementation steps
Module definition: Ensure that a custom PyTorch module named ADown is defined within the project's model definition files. This module must contain two parallel branches. The first branch should consist of a two-dimensional convolution layer with a 3x3 kernel and a stride of 2. The second branch should sequentially consist of a 2x2 max-pooling layer (with stride 2) followed by a 1x1 convolution layer. The outputs of these two branches are to be concatenated along the channel dimension, and the resulting feature map is then passed through a final 1x1 convolution layer to integrate the channels and produce the output. The structure of the ADown module is shown in Figure 4.
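Following the description above, a minimal PyTorch sketch of the ADown module might look as follows (the even channel split between branches is an assumption of this sketch):

```python
import torch
import torch.nn as nn

class ADown(nn.Module):
    """Dual-branch downsampling: a strided 3x3 conv branch and a
    maxpool + 1x1 conv branch, concatenated and fused by a 1x1 conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.branch_conv = nn.Sequential(          # local-detail branch
            nn.Conv2d(c_in, c_mid, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.branch_pool = nn.Sequential(          # contextual branch
            nn.MaxPool2d(2, stride=2),
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.fuse = nn.Sequential(                 # channel integration
            nn.Conv2d(2 * c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        y = torch.cat([self.branch_conv(x), self.branch_pool(x)], dim=1)
        return self.fuse(y)

x = torch.randn(1, 64, 80, 80)
print(ADown(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```

Both branches halve the spatial resolution, so their outputs concatenate cleanly before the fusing 1x1 convolution.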
Configuration file editing: Open the baseline model configuration file (config.yaml). Systematically identify every instance of the CBS module that is configured for downsampling (i.e., where the stride parameter is set to 2). Replace each of these CBS module entries with the new ADown module.
Design motivation: The design of ADown is motivated by the need to mitigate information loss in standard strided convolutions, which can be detrimental for small objects. Its dual-branch structure is inspired by the idea of parallel processing for capturing both high-frequency spatial details (via convolution) and low-frequency contextual information (via pooling), thereby providing a richer multi-scale feature representation for subsequent layers.
Integration of the iEMA module
Rationale and design principle: To enhance the model's capability to detect small targets and those that are partially obscured, it is essential to incorporate a mechanism that can effectively model multi-scale spatial context. The iEMA module achieves this by embedding an Efficient Multi-scale Attention (EMA)36 component within an inverted residual block (iRMB)37 structure. The iRMB provides a computationally efficient foundation using depthwise separable convolutions, while the EMA component explicitly captures cross-scale spatial interactions through a parallel multi-branch design. This integration allows the model to dynamically recalibrate feature weights, focusing attention on the most discriminative spatial regions across different scales, thereby improving recognition under occlusion and for small objects.
Action-implementation steps
Module definition: Ensure that a custom PyTorch module named iEMA is defined. This module should first implement an inverted residual block. This block typically begins with a pointwise convolution for channel expansion, followed by a depthwise convolution (e.g., 3x3) for spatial feature extraction, and finally a pointwise convolution for channel projection. Immediately following this block, the EMA attention mechanism should be implemented. The EMA mechanism typically employs grouped convolutions and cross-dimensional interactions to efficiently generate a multi-scale spatial attention map without excessive computational overhead. The structure of the iEMA module is shown in Figure 5.
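The inverted residual block at the core of the iEMA module can be sketched as follows (the EMA attention component is omitted here, and the expansion ratio of 4 is an assumption of this sketch):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual block: pointwise expansion -> 3x3 depthwise
    conv -> pointwise projection, with a skip connection."""
    def __init__(self, c, expand=4):
        super().__init__()
        h = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, h, 1, bias=False),                      # expand
            nn.BatchNorm2d(h), nn.SiLU(),
            nn.Conv2d(h, h, 3, padding=1, groups=h, bias=False), # depthwise
            nn.BatchNorm2d(h), nn.SiLU(),
            nn.Conv2d(h, c, 1, bias=False),                      # project
            nn.BatchNorm2d(c))

    def forward(self, x):
        return x + self.block(x)  # residual connection keeps the shape
```

In the full iEMA module, the EMA attention map would be applied to the block's output before the residual addition.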
Configuration file editing: In the config.yaml configuration file, locate the sections defining the neck network, specifically the layers immediately following the C2f modules. At these strategic locations, insert a new layer that calls the iEMA module.
Design motivation: The iEMA module is founded on the principle of enhancing feature discriminability by integrating local feature extraction (via depth-wise convolution) with a lightweight yet effective global context modeling mechanism (EMA). This hybrid approach allows the model to adaptively focus on informative regions across different scales, which is crucial for recognizing partially visible brand logos and text.
Replacement of the detection head with the DynamicHead module
Rationale and design principle: The original detection head may not optimally handle the significant scale variation of cigarette boxes and the complex interplay between localization and classification tasks. The DynamicHead module addresses this by unifying three forms of attention into a coherent structure: scale-aware, spatial-aware, and task-aware attention. The scale-aware component dynamically fuses features from different levels of the feature pyramid, ensuring that objects of all sizes are represented effectively. The spatial-aware component employs a sparse sampling strategy to focus computational resources on the most informative regions within the feature maps. The task-aware component adapts the feature representation specifically for the distinct objectives of bounding box regression and category classification. This unified approach leads to more accurate and robust predictions.
Action-implementation steps
Module definition: This is a complex module that encompasses the three attention mechanisms described by Equations (1) through (4). Its internal structure will include layers for computing scale-wise weights, performing deformable spatial attention, and applying task-specific feature transformations.
\( W(F) = \pi_C\big(\pi_S\big(\pi_L(F) \cdot F\big) \cdot F\big) \cdot F \)  (1)
\( \pi_L(F) \cdot F = \sigma\Big(f\Big(\tfrac{1}{SC}\sum_{S,C} F\Big)\Big) \cdot F \)  (2)
\( \pi_S(F) \cdot F = \tfrac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k}\, F(l;\, p_k + \Delta p_k;\, c) \cdot \Delta m_k \)  (3)
\( \pi_C(F) \cdot F = \max\big(\alpha^1(F) \cdot F_c + \beta^1(F),\ \alpha^2(F) \cdot F_c + \beta^2(F)\big) \)  (4)
Where W(F) denotes the final attention-refined feature map, obtained by sequentially applying the scale-aware πL(F), spatial-aware πS(F), and task-aware πC(F) attention mechanisms to the input feature F. Specifically, in Eq. (2), πL(F) computes a scale-wise attention weight by first averaging across the spatial (S) and channel (C) dimensions, then passing the result through a linear function f(⋅) and a hard-sigmoid activation σ(⋅). In Eq. (3), πS(F) performs sparse spatial attention by aggregating features from K sampled key points pk, each with a learnable offset Δpk to focus on discriminative regions and an importance scalar Δmk. In Eq. (4), πC(F) implements task-aware attention by applying a linear transformation (governed by the learned parameters (α1, β1, α2, β2)) to each channel and selecting the maximum response, allowing the head to specialize for classification and regression tasks. The structure of the DynamicHead module is shown in Figure 6.
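The task-aware attention of Eq. (4) can be illustrated with a minimal standalone function (not the full DynamicHead; the per-channel parameter shapes are assumptions of this sketch):

```python
import torch

def task_aware_attention(f, a1, b1, a2, b2):
    """Eq. (4) applied per channel: for a feature map f of shape (C, H, W)
    and per-channel parameters a1, b1, a2, b2 of shape (C,), take the
    elementwise maximum of two linear transforms (a DY-ReLU-style gate)."""
    a1, b1 = a1.view(-1, 1, 1), b1.view(-1, 1, 1)
    a2, b2 = a2.view(-1, 1, 1), b2.view(-1, 1, 1)
    return torch.maximum(a1 * f + b1, a2 * f + b2)

# With (a1, b1) = (1, 0) and (a2, b2) = (0, 0) this reduces to ReLU
f = torch.tensor([[[-1.0, 2.0]]])  # shape (1, 1, 2)
out = task_aware_attention(f, torch.ones(1), torch.zeros(1),
                           torch.zeros(1), torch.zeros(1))
print(out)  # tensor([[[0., 2.]]])
```

In DynamicHead, the (α, β) pairs are themselves predicted from the feature map, letting the head reshape each channel's activation for its task.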
Configuration file editing: In the config.yaml file, find the section that defines the model's head, which originally contains the Detect module. Replace it with a new entry that calls the DynamicHead module.
Design motivation: The DynamicHead module is grounded in the observation that object detection heads should be adaptive to scale, spatial location, and task. Its unified attention framework, formalized in Eqs. (1-4), allows the model to dynamically recalibrate feature responses based on these three dimensions, leading to more robust predictions against scale variation and background clutter.
Model training
Ensure PyTorch 2.1.0, CUDA 12.1, and all dependencies are installed.
Training command
Before retraining, prepare the divided image data and annotation data in a standard single-stage detection format. Create a .yaml file containing the image paths and data categories. In line with the model improvements, replace the original detector .yaml file with the proposed model .yaml file. Write the train.py file, import the single-stage detector package, load the training model and data paths, and set the training parameters, including imgsz=640, epochs=100, batch=4, workers=0, device='0', optimizer='SGD', close_mosaic=10, etc.
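The train.py described above can be sketched with the widely used ultralytics-style Python API (an assumption, since the protocol does not name the package; the .yaml file names are placeholders):

```python
from ultralytics import YOLO

# Load the modified architecture from its .yaml and train on the
# cigarette dataset with the parameters listed in the protocol.
model = YOLO("proposed_model.yaml")
model.train(
    data="cigarette_dataset.yaml",  # image paths and class names
    imgsz=640,
    epochs=100,
    batch=4,
    workers=0,
    device="0",
    optimizer="SGD",
    close_mosaic=10,  # disable mosaic for the last 10 epochs
)
```

Passing an architecture .yaml (rather than a .pt checkpoint) builds the modified model from scratch before training.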
Hyperparameters: The key parameters are: input image size (640 x 640), batch size (4), initial learning rate (0.01) with cosine annealing scheduler, SGD optimizer with momentum (0.937), and weight decay (0.0005). Mosaic augmentation is enabled by default.
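The cosine annealing schedule can be illustrated as a standalone function (the final learning-rate ratio of 0.01 is an assumption, not stated in the protocol):

```python
import math

def cosine_lr(epoch, total_epochs=100, lr0=0.01, lrf_ratio=0.01):
    """Cosine-annealed learning rate decaying from lr0 at epoch 0
    down to lr0 * lrf_ratio at the final epoch."""
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr0 * (lrf_ratio + (1 - lrf_ratio) * cos)

print(round(cosine_lr(0), 4))    # 0.01   (initial learning rate)
print(round(cosine_lr(100), 6))  # 0.0001 (final learning rate)
```

The schedule keeps early updates aggressive while letting the final epochs fine-tune with a small step size.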
Training monitoring: The training process will output metrics for each epoch. Monitor the curves for training/validation loss and mAP at 0.5 to ensure convergence and avoid overfitting. Training for 100 epochs is typically sufficient.
Model evaluation and deployment
Evaluation: After training, the best model is saved as runs/train/exp/weights/best.pt. Evaluate it on the validation set using: python val.py --data cigarette_dataset.yaml --weights runs/train/exp/weights/best.pt. The script will output Precision, Recall, mAP at 0.5, etc. Confirm the mAP meets the required threshold (e.g., 97.9%).
Inference for verification: Run inference on sample images to visually verify detection results: python detect.py --weights runs/train/exp/weights/best.pt --source path/to/test/images --conf 0.5. This verification yields the detection results for each image and the model's inference speed.
Deployment: For real-time deployment on the stacker crane's system, convert the PyTorch model (best.pt, ~4.7 MB) to a format like TensorRT or ONNX for optimized inference speed. The final output is bounding boxes with class labels (brand names) and confidence scores.
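The ONNX conversion can be sketched with the same ultralytics-style API (an assumption about the package; the weight path follows the protocol text):

```python
from ultralytics import YOLO

# Export the trained weights to ONNX for optimized runtime inference;
# a TensorRT engine can be produced analogously with format="engine".
model = YOLO("runs/train/exp/weights/best.pt")
model.export(format="onnx")  # writes an .onnx file next to the weights
```

The exported graph is then loaded by the stacker crane's inference runtime, which consumes the bounding boxes, brand labels, and confidence scores.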
Experimental environment and parameter setting
Experiments to verify the performance of the proposed model were conducted on the deep learning framework PyTorch, and details of the experimental environment are listed in Table 1. Here, we used a purpose-built dataset containing 6,602 images that was randomly split into training, validation, and test sets at a ratio of 7:2:1. All experiments used input images with a resolution of 640 x 640 and an initial learning rate of 0.01, with stochastic gradient descent (momentum 0.937) as the optimizer and weight decay applied to prevent overfitting. During training, mosaic augmentation combined four images in each iteration to diversify the backgrounds for detection. All models were trained for 100 epochs with a batch size of 4, while the number of worker threads was set to 0.
Evaluation indicators
The samples of interest for multi-class classification were designated as the positive class, while all other samples constituted the negative class. True positives (TP) are samples correctly predicted as positive, false positives (FP) are samples incorrectly predicted as positive, false negatives (FN) are samples incorrectly predicted as negative, and true negatives (TN) are samples correctly predicted as negative. Building upon this foundation, a variety of commonly used evaluation indicators can be derived to comprehensively assess model performance on object detection tasks. This article evaluates the model using precision, recall, average precision (AP), mean average precision (mAP), and recognition speed (FPS)38. The calculation formulas for these indicators are as follows:
\( P = \frac{TP}{TP + FP} \)  (5)
\( R = \frac{TP}{TP + FN} \)  (6)
\( AP = \sum_{i}\left(R_i - R_{i-1}\right) P_i \)  (7)
\( mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \)  (8)
Where Pi and Ri are the precision and recall values at threshold i, N is the number of categories, and APi is the AP of the i-th category.
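These indicators can be computed directly in Python (a sketch; the counts and PR-curve points below are hypothetical):

```python
def precision_recall(tp, fp, fn):
    """Eqs. (5) and (6): P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Eq. (7): AP as the sum of (R_i - R_{i-1}) * P_i over the sorted
    points of the precision-recall curve."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """Eq. (8): mAP as the mean of the N per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts for one class
p, r = precision_recall(tp=90, fp=10, fn=5)
print(round(p, 3), round(r, 3))  # 0.9 0.947
```

FPS, the remaining indicator, is simply the number of images processed per second during inference.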
Data augmentation results
The region-specific copy-paste augmentation was first and foremost employed to ensure the structural integrity of the dataset. Before augmentation, the 5 rarest brands (e.g., qipilang(lan), jiangjun) had fewer than 5 images each, posing a high risk of being absent from the training set after splitting. The region-specific copy-paste augmentation successfully generated 600 high-quality synthetic images; each of these brands was represented by over 60 instances, guaranteeing their presence across all data splits. Qualitative inspection confirmed that the pasted objects were seamless and contextually appropriate (Figure 2). This process balanced the dataset, which was confirmed to be crucial for improving the recall of underrepresented brands by approximately 15% in subsequent experiments.
To quantitatively evaluate the fidelity of the synthetic images, we computed the Fréchet Inception Distance (FID) score between the 600 generated images and a random sample of 600 real images from the dataset. The achieved FID score of 45.2 indicates a reasonable level of visual similarity.
ADown module results
The integration of the ADown module was designed to preserve fine-grained feature details during downsampling, which is crucial for recognizing small text and logos on cigarette boxes. To verify this, we visualized the feature maps and observed that the ADown module retained more high-frequency spatial information compared to the standard convolutional downsampling. This qualitative improvement translates into quantitative gains in detection accuracy, as demonstrated in the ablation study results, where the ADown module contributes notably to the reduction in model parameters and FLOPs while improving mAP.
iEMA mechanism results
The iEMA module was introduced to enhance the model's focus on small and partially occluded targets through multi-scale contextual awareness. We observed that the attention maps generated by iEMA actively highlighted regions containing small brand logos and partially visible boxes, even under clutter. The consequent boost in performance, especially for challenging cases, is quantitatively analyzed in ablation results, which show a marked improvement in recall for occluded targets.
DynamicHead results
The DynamicHead module aims to unify scale, spatial, and task-aware attention for robust multi-scale detection. Its effect is particularly evident in scenes containing cigarette boxes of vastly different sizes, where it helps in accurately localizing and classifying all instances simultaneously. The significant contribution of DynamicHead to the overall mean average precision is systematically evaluated in the ablation experiments, confirming it as the most impactful individual improvement among the three proposed modules.
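As an illustration, the scale-aware component (πL) of DynamicHead can be sketched as a per-level gate computed from globally pooled features. This is a simplified approximation: the full DynamicHead (Figure 6) additionally applies spatial-aware (πS) and task-aware (πC) attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """Each pyramid level receives a learned scalar gate derived from its
    globally pooled features, letting the head emphasize the levels most
    relevant to the object sizes present in the image."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 1)

    def forward(self, levels):
        # levels: list of (B, C, H_l, W_l) feature maps.
        out = []
        for f in levels:
            g = f.mean(dim=(2, 3))            # (B, C) global average pooling
            gate = F.hardsigmoid(self.fc(g))  # (B, 1) per-level weight in [0, 1]
            out.append(f * gate[:, :, None, None])
        return out
```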
Comparison experiments
To validate the performance of the enhanced model on the task of identifying cigarette brands from their boxes, we compared it with several mainstream single-stage detection baselines (YOLOv5s, YOLOv8n, YOLOv10n, and YOLOv11n) as well as the two-stage algorithm Faster R-CNN. All experiments were performed in the same environment with the same dataset and data partitions to ensure a fair comparison. Table 2 shows that the presented model achieved the best performance in mAP, precision, and recall: its mAP was 97.9%, compared with 97.0% for the baseline model. It also achieved the best figures for parameter count and floating-point operations (FLOPs). Its lower complexity reduced the risk of overfitting the training data, thereby strengthening its generalization capability, and made the model easier to deploy on devices. These results demonstrate that the presented model is suitable for real-time cigarette brand identification.
For a more comprehensive comparison, we also evaluated another state-of-the-art lightweight detector, PP-YOLOE-l, and the transformer-based RT-DETR-l. As shown in Table 2, the proposed model consistently outperforms these competitors in mAP while maintaining a highly competitive model size and computational cost. The superiority of the improved model over the baseline in mAP was confirmed to be statistically significant (p = 0.015 < 0.05) using a paired t-test on the per-image AP scores from the test set.
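The paired t-test used here can be reproduced with scipy. The AP arrays below are hypothetical placeholders (the paper's real per-image scores are not reproduced here); the key point is that the same images are scored by both models, so the per-image differences carry the signal:

```python
import numpy as np
from scipy import stats

# Hypothetical per-image AP scores for illustration only.
rng = np.random.default_rng(42)
ap_baseline = np.clip(rng.normal(0.970, 0.02, size=200), 0, 1)
ap_improved = np.clip(ap_baseline + rng.normal(0.009, 0.01, size=200), 0, 1)

# Paired t-test on matched per-image scores.
t_stat, p_value = stats.ttest_rel(ap_improved, ap_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A p-value below 0.05 rejects the null hypothesis that the two models have the same mean per-image AP.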
Ablation experiments
Ablation experiments systematically assess the contribution of each component of a model to its performance by progressively removing or modifying specific modules. We performed ablation experiments on the three improved modules using a specialized dataset, with the aim of verifying the impact of each module, and of their fusion, on the overall accuracy of the model. The results are shown in Table 3.
Table 3 shows that the enhanced model achieved the best detection performance under the same experimental conditions, with a 0.9% improvement in mAP over the baseline model. Integrating the ADown module alone yielded a 0.5% gain in mAP. This module uses dual-branch collaborative processing for adaptive downsampling, fuses multi-scale features to preserve rich detail while reducing spatial resolution, and uses grouped convolutions to reduce redundant inter-channel computation. Fusing the modules yielded further gains: integrating ADown with iEMA improved the mAP by 0.6%, while combining ADown with DynamicHead increased it by 0.8%. In standalone integration, the iEMA module boosted the mAP by 0.3% and the DynamicHead module by 0.6%. Combining iEMA and DynamicHead improved the mAP by 0.6%; although this did not exceed the gain from adding DynamicHead alone, it increased the precision by 1.1%, indicating complementary benefits. The full integration of all three modules improved the mAP by 0.9% while reducing the number of parameters and FLOPs, which lowered the model's complexity and its risk of overfitting, thereby enhancing its generalization capability.
In summary, the presented model achieved a 0.9% improvement in mAP over the baseline, reaching 97.9%. While reducing both the parameter count and FLOPs, it attained a detection speed of 38.46 FPS, which satisfies the requirements of real-time operation. This also shows that the proposed model can identify diverse cigarette brands across varying box sizes.
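The reported FPS figure corresponds to averaged per-image inference time. A minimal timing sketch (the callable stands in for the actual model forward pass, which is not reproduced here):

```python
import time

def measure_fps(infer_fn, n_warmup=10, n_runs=100):
    """Average frames-per-second of a single-image inference callable.
    Warm-up iterations are excluded so one-time costs (CUDA kernel
    compilation, cache warming) do not skew the estimate."""
    for _ in range(n_warmup):
        infer_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn()
    return n_runs / (time.perf_counter() - start)
```

On GPU, a synchronization call (e.g., `torch.cuda.synchronize()`) should wrap the timed region so that asynchronous kernel launches are fully accounted for.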
Overall conclusion of experimental findings
Figure 7 shows the cigarette brand detection results of the presented model. The model's bounding boxes accurately located the regions containing the key features of each cigarette box and identified the brands correctly, even in images with incomplete features and uneven illumination.
In summary, the experimental results conclusively demonstrate the efficacy of each proposed component in the presented model. The synergistic effect of the region-specific data augmentation, the detail-preserving ADown module, the context-aware iEMA mechanism, and the versatile DynamicHead leads to a model that is not only more accurate but also more efficient and robust to the challenges present in automated warehouse stocktaking. These findings validate the hypothesis that targeted architectural innovations can significantly advance the state of practical object detection in industrial applications.
Data availability:
The cigarette box brand image dataset generated and analyzed during this study, along with the source code for the proposed model, is available from the corresponding author upon reasonable request. The data belong exclusively to the cigarette factory and, owing to legal and ethical limitations, cannot be deposited in public repositories. Anyone wishing to obtain the data should contact the corresponding author, Xiaobo Zhao (zxbggg123@163.com).

Figure 1: Image data obtained by the visual system. Representative raw images of cigarette box brands captured by the industrial camera system under field conditions. Images are captured at a resolution of 1280 x 1024 pixels with a strip LED light to ensure uniform illumination. The standard dimensions of a cigarette box serve as a reference for scale, with the camera positioned at an approximate working distance of 550 mm. Please click here to view a larger version of this figure.

Figure 2: Overall framework of data augmentation technology. Schematic workflow illustrating the integration of region-specific copy-paste and traditional augmentation techniques. SAM refers to the Segment Anything Model used for precise mask generation, and YOLOv11-Seg(baseline_model-Seg) denotes the instance segmentation model used for identifying available background space. Arrows indicate the sequential flow from object segmentation to synthetic image generation. Please click here to view a larger version of this figure.

Figure 3: Structure of the proposed model. Detailed architectural diagram illustrating the integration of the three core innovations. ADown indicates Adaptive Downsampling; iEMA indicates the inverted Efficient Multi-scale Attention mechanism; and DynamicHead represents the unified scale, spatial, and task-aware detection head. Please click here to view a larger version of this figure.

Figure 4: Structure of the ADown module. Detailed internal architecture of the Adaptive Downsampling module. It features a dual-branch structure: the first branch uses a 3x3 convolution with stride 2, while the second branch combines 2x2 max-pooling with a 1x1 convolution. This design preserves both high-frequency details and broad contextual information during spatial reduction. Please click here to view a larger version of this figure.

Figure 5: Structure of the iEMA module. Internal layout of the inverted Efficient Multi-scale Attention module. The module embeds an EMA component within an inverted residual block (iRMB) structure to capture cross-scale spatial interactions and dynamically recalibrate feature weights. Please click here to view a larger version of this figure.

Figure 6: Structure of the DynamicHead module. Unified attention framework for the detection head. It sequentially applies scale-aware (πL), spatial-aware (πS), and task-aware (πC) attention mechanisms to the input feature maps to improve localization and classification accuracy. Please click here to view a larger version of this figure.

Figure 7: Detection results of the proposed model under various challenging scenarios. (A-B) Accurate detection under uneven illumination. (C-D) Robust identification of partially occluded targets. (E-F) Successful detection of small-sized cigarette boxes. (G-H) Correct classification amidst complex background clutter. Please click here to view a larger version of this figure.
| Parameter | Value |
| CPU | Intel(R) Core i9-9900K |
| GPU | NVIDIA GeForce RTX 3080 Ti (12 GB) |
| Operating system | Windows 10 |
| Development environment | Python 3.9, PyTorch 2.1.0, CUDA 12.1 |
Table 1: Experimental environment. Specifications of the hardware and software utilized for model training and performance evaluation. Definitions: GPU: Graphics Processing Unit (NVIDIA GeForce RTX 3080 Ti); CPU: Central Processing Unit; CUDA: Compute Unified Device Architecture for GPU acceleration.
| Model | mAP (%) | Precision (%) | Recall (%) | Parameters | FLOPs (G) |
| Faster-RCNN | 9.97 | 11.2 | 15.3 | 41127845 | 207.1 |
| YOLOv5s | 67.2 | 88.6 | 63.1 | 7112611 | 16.1 |
| YOLOv8n | 92.9 | 89.2 | 86.7 | 3013058 | 8.1 |
| YOLOv10n | 95.6 | 93 | 89.4 | 2709236 | 8.3 |
| YOLOv11n | 97 | 96.6 | 92.1 | 2597234 | 6.5 |
| RT-DETR-l | 96.5 | 96 | 95 | 32884166 | 108.2 |
| PP-YOLOE-l | 72.7 | 58.3 | 79.7 | 52200000 | 110.07 |
| Improved YOLOv11n-AiD | 97.9 | 95.4 | 94.6 | 1849678 | 5.1 |
Table 2: Comparison of results. Quantitative performance metrics of the proposed enhanced detector versus baseline one-stage (V5s, V8n, V10n, V11n) and two-stage (Faster R-CNN) detectors. Definitions: mAP (%): Mean Average Precision at a 0.5 Intersection over Union (IoU) threshold; Params: Total number of trainable parameters; FLOPs (G): Floating-point operations in Giga-operations (computational complexity); FPS: Frames Per Second (inference speed).
| Model | mAP (%) | P (%) | R (%) | Parameters | FLOPs (G) |
| YOLOv11n | 97 | 96.6 | 92 | 2597234 | 6.5 |
| YOLOv11n-ADown | 97.5 | 93.1 | 95 | 2107002 | 5.2 |
| YOLOv11n-iEMA | 97.3 | 94.7 | 93 | 2594538 | 6.4 |
| YOLOv11n-DynamicHead | 97.6 | 95.3 | 95 | 2274014 | 6.1 |
| YOLOv11n-ADown-iEMA | 97.6 | 94.8 | 95 | 2111978 | 5.2 |
| YOLOv11n-ADown-DynamicHead | 97.8 | 96.7 | 94 | 1844702 | 5 |
| YOLOv11n-iEMA-DynamicHead | 97.6 | 96.4 | 94 | 2278990 | 6.2 |
| YOLOv11n-AiD | 97.9 | 95.4 | 95 | 1849678 | 5.1 |
Table 3: Results of ablation experiments. Systematic evaluation of the individual and collective contributions of the proposed modules. The modules included in each configuration (ADown, iEMA, or DynamicHead) are indicated by the model-name suffixes; AiD denotes the full combination of all three. Performance indicators (Precision, Recall, mAP) follow the calculation formulas defined in Equations 5-8.
Accuracy-efficiency trade-off
A central achievement of this model is its superior balance between accuracy and computational efficiency. As evidenced in Table 2, the presented model achieves the highest mAP while simultaneously having the lowest number of parameters and the second-lowest FLOPs among the single-stage detector models compared. This translates directly to practical benefits: a smaller model is faster to deploy, requires less memory, and has lower inference latency and power consumption on both GPU and edge hardware. This makes it exceptionally suitable for real-time applications on resource-constrained platforms, such as the stacker crane's onboard computer, where computational budgets are tight.
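The parameter counts in Tables 2 and 3 (e.g., 1,849,678 for the proposed model) can be reproduced for any PyTorch model with a one-line helper:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters, the 'Parameters' column
    reported in the comparison and ablation tables."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

FLOPs, by contrast, depend on the input resolution and are typically measured with a profiling tool rather than counted from the parameter list.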
Limitations
Despite its strong performance, the enhanced model has several limitations. Its accuracy is contingent upon the representativeness of the training data. Performance may degrade under extreme lighting conditions (e.g., strong specular reflections) or when cigarette boxes are occluded beyond ~70%, scenarios that were underrepresented in this dataset. Exploring and integrating advanced techniques designed explicitly for heavy occlusion, such as those proposed in a relevant robust detection paper39, could be a promising direction for addressing this limitation. Additionally, the real-time performance of 38.5 FPS was achieved on a high-end GPU (NVIDIA GeForce RTX 3080 Ti); the frame rate and power consumption on embedded hardware (e.g., NVIDIA Jetson Nano) require further empirical validation. Finally, although the model is robust to the moderate occlusion and lighting changes present in the dataset, the dataset contains few samples captured under extreme conditions (e.g., occlusion above 70% or very low resolution), so a thorough quantitative evaluation under such conditions was not feasible.
Comparison with alternative methods
This work builds upon the single-stage detector architecture, prized for its speed-accuracy balance. Other paradigms, such as Transformer-based detectors (e.g., DETR), offer the advantage of simplified post-processing but often demand more data and computational resources for training. Another relevant lightweight model is PaddlePaddle You Only Look Once Everything, which utilizes efficient structures such as RepResBlocks. The choice of the baseline model allowed us to build upon a mature, highly optimized codebase, while the custom modules (ADown, iEMA, DynamicHead) provided the specific enhancements needed for our unique industrial challenge, a strategy that may be more pragmatic than adopting a completely different architecture.
Future work
Future research will proceed along several directions. A primary focus will be constructing a dedicated dataset encompassing a wide range of extreme scenarios, including severe occlusion, extreme lighting, and low-resolution images, to enable rigorous quantitative evaluation and further model refinement. As a preliminary exploration, we qualitatively inspected the few challenging samples available in our test set. As shown in Figure 7, the model successfully detected some cigarette boxes with approximately 50% occlusion and others under extremely low lighting, demonstrating its potential robustness; these observations, while not statistically conclusive, are encouraging and warrant further investigation with a larger-scale dataset. We also plan to conduct rigorous deployment testing on embedded systems to quantify real-world power consumption and inference latency, which are critical for large-scale industrial adoption; to explore model compression techniques such as quantization and pruning to further reduce the model size for edge deployment; and to investigate the transferability of the proposed improvements to other industrial object detection tasks, such as parts inspection in manufacturing or parcel sorting in logistics.
Broad implications
This study contributes to the field of industrial intelligent vision by demonstrating a holistic approach to solving a practical problem. It underscores that beyond merely selecting a state-of-the-art model, significant gains can be made through domain-specific data augmentation and targeted architectural refinements. The success of the enhanced model provides a valuable blueprint for applying deep learning to automated warehouse systems and other industrial settings where accuracy, speed, and efficiency are paramount.
The model's significantly reduced parameter count and FLOPs make it a compelling candidate for deployment in the embedded systems commonly used in industrial machinery. This addresses a critical practical concern beyond pure accuracy: the total cost of ownership, which encompasses hardware cost, power consumption, and integration complexity. Future empirical measurements on such devices will provide concrete data on power draw and inference latency, further validating the model's suitability for large-scale deployment.
Conclusions
This article proposes an improved Single-stage Regression Architecture to identify brands of cigarettes based on their packaging, with the aim of enhancing the detection of cigarette brands in intelligent stocktaking. First, we addressed the severe class imbalance in the dataset by designing a region-specific copy-paste method of data augmentation to improve the capability of generalization of the model. Second, the ADown module was used for convolutional downsampling in both the backbone and neck networks. It can preserve rich feature details in the images. Third, the iEMA module was introduced to enhance the capability of the model to identify small and occluded targets. Finally, the DynamicHead module was used to replace the original detection head. It enabled the identification of multi-scale objects in the feature maps extracted from the backbone and neck networks. This yielded the precise localization of the predicted targets, their categorical classification, and improved assessment of the accuracy of the bounding boxes.
The experimental results showed that the above optimizations collectively increased the mAP of the proposed model by 0.9% over the baseline, to 97.9%. Its frame rate reached 38.5 FPS, satisfying the requirements of real-time detection40. With only 1,849,678 parameters, the proposed model is feasible for deployment in real-world settings and can provide effective technical and theoretical support for identifying cigarette brands from their boxes. Despite these promising results, this study has limitations. The model's robustness under a wider array of extreme occlusions and lighting conditions requires further validation. Future work will focus on testing the model in such demanding scenarios, extending the dataset to include more cigarette brands, and optimizing the model for deployment on low-power embedded hardware to assess its practical industrial viability and cost-effectiveness.
The authors declare no direct competing financial interests. This research was conducted in collaboration with Longyan Tobacco Industrial Co., Ltd., where authors Hongli Deng and Wencan Li are employed, and Baoji Cigarette Factory, where author Baobin Luo is employed. These affiliations provided the application scenario and field data for the study. The academic authors (Jun Liu, Jianguang Yi, Feng Yang, Xiaobo Zhao) declare no financial or non-financial conflicts of interest regarding the publication of this work.
This work was financially supported by the Temperature Measuring Method of Casting Billet Based on Preceded Reflector and Multi-wavelength (No.LZY24E050002) funded by the Joint Funds of the Zhejiang Provincial Natural Science Foundation of China, and the Online Temperature Field Measuring Method of Non-closed and Non-isothermal Ladle Cavity (No. 2023K231) funded by the Quzhou Science and Technology Planning Project.
| code reader | Hikvision | ID5050 | Hikvision equipment used to read barcodes on pallets; requires a 24 V power supply |
| industrial camera | Hikvision | MV-CA013-A0GM, MV-CS016-10GM | Requires a 24 V power supply |
| LED light | HIKROBOT | MV-LLDS-1002-38-W | One-meter strip light source providing sufficient illumination for the camera |
| lens | Hikvision | MVL-HF0828M-6MPE | Focal length: 8 mm; aperture range: F2.8-F16; resolution: 6 MP |
| MVS | Hikvision | V4.5.1 | Hikvision software used for debugging cameras, setting camera parameters, and data collection |
| Pycharm | JetBrains | 2020.1.3 x64 | Used for training deep learning models and developing detection systems |
| Several metal brackets | Longyan Tobacco Industrial Co., Ltd. | 7075-T6 Aluminum Alloy | Used for fixing cameras and code readers |
| Several network cables | Longyan Tobacco Industrial Co., Ltd. | RJ45 Crystal Head | Connect the industrial computer to the wireless AP base station, the camera to the switch, etc. |
| Switch | TP-Link | TL-SH1005 | Eight Ethernet ports; requires a 220 V power supply |
| Vision Master | Hikvision | V4.3.0 | Hikvision software used for template matching and checking detection results |
| wireless AP device | Shandong Huachuangxunlian | BH-ANT5158S-14HV, BH-MS-AC1600HWH | Because the stacker crane moves over a large area, a network cable cannot connect the camera to the industrial computer; a wireless AP ensures stable camera networking |