This protocol implements a U-shaped deep learning network integrating pinwheel convolution, dual attention, and multi-scale fusion to segment colorectal polyps.
Method Article
Accurate segmentation of colorectal polyps is crucial for the early prevention and diagnosis of colorectal cancer. However, due to the high heterogeneity of polyps in terms of shape, size, and texture, as well as the complexity of the intestinal environment (such as folds, specular reflections, and fecal residues), existing methods still face significant challenges in boundary localization and small-polyp detection. To address these issues, this paper proposes a Polyp Segmentation Network based on Pinwheel Convolution and Dual Attention (PWD-Net). The proposed network adopts a U-shaped encoder–decoder architecture, where a pretrained ResNet is employed as the encoder to extract multi-level local features. Specifically, a Pinwheel Convolution Module (PCM) is introduced at the bottleneck layer to capture the global geometric structure and multi-directional contextual information of polyps through multi-angle rotated convolution kernels. A Dual-Attention Mechanism (DAM) that integrates channel attention and spatial attention is designed to adaptively suppress background noise and enhance polyp-region features. In addition, a Multi-scale Feature Fusion (MSF) strategy is employed to combine deep semantic information with shallow boundary details, ensuring both completeness and precision of segmentation results. Experiments conducted on the Kvasir-SEG and CVC-ClinicDB datasets demonstrate that PWD-Net achieves average Dice coefficients of 0.865 and 0.944, and IoU scores of 0.765 and 0.892, respectively, significantly outperforming existing state-of-the-art methods. Ablation studies verify the effectiveness of each module, and cross-dataset evaluations confirm the strong generalization ability of the model. This study provides a high-precision and robust solution for clinical polyp segmentation, offering significant value for the early diagnosis of colorectal precancerous lesions and supporting computer-aided intervention.
Colorectal cancer is one of the most common malignant tumors worldwide, with consistently high incidence and mortality rates. Studies have shown that most colorectal cancers develop from adenomatous polyps, a process that typically takes 10–15 years, providing a valuable time window for early detection and intervention. An increase of 1% in the adenoma detection rate (ADR) can reduce the risk of colorectal cancer by approximately 3%, significantly lowering patient mortality1. Colonoscopy, regarded as the gold standard for colorectal cancer screening, enables direct removal of polyps during examination, thereby effectively reducing cancer incidence and mortality.
However, conventional colonoscopy heavily depends on the experience and skill level of endoscopists. Factors such as subjective judgment, visual fatigue, and distraction may lead to a miss rate of 20%–30%, which directly affects screening effectiveness2. Therefore, developing computer-aided detection (CAD) systems for automatic segmentation of colorectal polyps holds considerable importance for improving ADR and reducing missed diagnoses. Recent clinical surveys have further highlighted the interest in integrating artificial intelligence into endoscopic lesion assessment workflows, reinforcing the need for robust and reproducible segmentation methods3.
In recent years, deep learning has achieved remarkable progress in medical image analysis, particularly convolutional neural networks (CNNs), which demonstrate strong capability in feature extraction and representation for image segmentation tasks4. As a classical medical image segmentation model, U-Net employs a symmetric encoder–decoder architecture and skip connections to achieve accurate pixel-level segmentation, becoming a benchmark in this field5. Building upon U-Net, many improved architectures have been proposed to address complex medical image segmentation tasks. UNet++ reduces the semantic gap between encoder and decoder feature maps by introducing nested and dense skip connections6. ResUNet++ integrates residual blocks, squeeze-and-excitation modules, dilated convolutions, and attention mechanisms, achieving strong performance in polyp segmentation7. U2-Net adopts a two-level nested U-shaped structure to capture multi-scale feature information8. More recently, a dual encoder-decoder-based deep polyp segmentation network has been proposed, leveraging parallel encoding and decoding paths to further enhance segmentation accuracy9.
Meanwhile, the introduction of attention mechanisms provides new solutions for feature enhancement and noise suppression. Attention U-Net employs attention gates to focus on target regions while suppressing irrelevant background information10. The Dual Attention Network (DANet) adaptively weights features from both channel and spatial dimensions11, improving the perception of critical features. Triple Attention Networks (TANet) further enhance segmentation performance through adaptive selection of multi-scale features12.
With the success of Transformer architectures in natural language processing and computer vision13, researchers have begun exploring their application in medical image segmentation. TransUNet was the first to employ a Transformer as an encoder to model long-range dependencies effectively14. Swin-UNet adopts a pure Transformer architecture and achieves efficient global information aggregation through a shifted-window mechanism15. UTNet proposes a hybrid architecture that combines the local feature extraction capability of CNNs with the global modeling ability of Transformers16.
In the field of polyp segmentation, Polyp-PVT utilizes a pyramid vision Transformer to capture multi-scale global semantic information17, while multi-scale nested UNet enhances contextual understanding by integrating Transformers18. Recent studies have also explored negative correlation learning strategies for cross-domain polyp segmentation19, Gompertz-augmented segmentation enhancement20, and attention-based architectures incorporating boundary guidance21. Although these approaches improve segmentation performance to some extent, polyp segmentation still faces several challenges. First, polyps exhibit high heterogeneity in morphology, size, and texture, ranging from micro-polyps smaller than 5 mm to large polyps exceeding 30 mm, with shapes varying from circular and elliptical to highly irregular forms. Second, the intestinal environment is complex and variable, where mucosal folds, specular reflections, fecal residues, and food debris introduce severe background interference. Third, many polyps have blurred boundaries, may be partially occluded by folds, or submerged in intestinal fluids, making precise boundary localization extremely challenging22.
Existing methods still present clear limitations in addressing these challenges. Traditional CNNs are effective at extracting local texture and edge features; however, fixed square convolution kernels are not well suited to capturing diverse geometric shapes23, especially for highly irregular polyps, and cannot effectively model multi-directional geometric features. Transformer-based methods can model global dependencies but are less effective at capturing fine local details and boundary information. Moreover, their high computational complexity makes them less suitable for real-time clinical applications24. Recent polyp segmentation approaches such as PraNet, which uses reverse attention modules to refine key regions25, boundary-guided cascade attention networks that enhance boundary feature extraction26, and CAFE-Net, which fuses encoder and decoder features through cross-attention mechanisms27, still encounter insufficient feature representation and inaccurate boundary localization when dealing with small polyps28, blurred boundaries, and complex backgrounds. Furthermore, most methods neglect geometric morphology and fail to fully exploit multi-directional contextual information, resulting in suboptimal segmentation of irregularly shaped polyps.
In summary, current CNN-based methods lack the ability to capture multi-directional geometric features due to their reliance on fixed square convolution kernels. Transformer-based approaches offer global modeling but sacrifice local boundary precision and impose high computational costs. Meanwhile, existing attention-enhanced and multi-scale fusion strategies have not been jointly optimized within a unified framework specifically tailored for polyp segmentation29. These gaps motivate the development of a method that simultaneously addresses geometric feature modeling, adaptive noise suppression, and cross-scale feature integration.
To address these issues, this protocol presents a Polyp Segmentation Network based on Pinwheel Convolution and Dual Attention (PWD-Net). The proposed network integrates geometric feature modeling, multi-dimensional attention enhancement, and multi-scale feature fusion to segment polyps with irregular shapes and indistinct boundaries. The main contributions of this work are as follows. First, the Pinwheel Convolution Module (PCM), inspired by the structure of a pinwheel, uses a rotated convolution kernel design that captures multi-directional geometric features of polyps through convolution operations at eight angles (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°). This module replaces the conventional convolution layer at the bottleneck stage, allowing the network to perceive diverse edge orientations and improving the representation of irregularly shaped polyps. Second, the Dual-Attention Mechanism (DAM) addresses background noise such as folds, reflections, and fecal residues in colonoscopy images by integrating channel attention and spatial attention. Embedded within the skip connections, this module adaptively suppresses background interference and enhances feature responses in polyp regions by jointly identifying "what" is important (channel dimension) and "where" the target is located (spatial dimension), so that only refined features take part in subsequent fusion. Third, the Multi-scale Feature Fusion (MSF) strategy preserves both deep semantic information and shallow boundary details through a hierarchical mechanism in the decoder. By progressively integrating DAM-enhanced encoder features with upsampled decoder features, this strategy compensates for the spatial detail lost during downsampling, which supports the detection of small polyps and precise boundary delineation.
This study uses only publicly available, anonymized colonoscopy image datasets (Kvasir-SEG). No new human subject data were collected. Institutional ethics approval and informed patient consent were not required, as confirmed by the institutional review policies for retrospective analyses of de-identified public datasets.
1. Data Preparation
2. Overall Architecture
NOTE: Refer to Figure 1 for the macro-level encoder–decoder backbone of PWD-Net, and to Figure 2 for the integration and interaction of core modules within the feature flow. The overall architecture follows a U-shaped encoder–decoder design to handle scale variations of polyps and background interference in colonoscopy images.
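As a rough aid for tracing tensor sizes through the U-shaped design, the sketch below computes feature-map shapes for a 352 × 352 input. It assumes the common torchvision-style ResNet-50 stage split (a stem at stride 2, followed by four residual stages at strides 4, 8, 16, and 32, with 64, 256, 512, 1024, and 2048 channels) and assumes that each decoder step doubles the spatial resolution. The protocol text does not state how the five stages are defined, so the stage table and both function names are illustrative assumptions.

```python
# Assumed ResNet-50 stage layout: (name, output channels, cumulative stride).
# This split is a common convention, not something the protocol confirms.
ASSUMED_STAGES = [
    ("e1", 64, 2),
    ("e2", 256, 4),
    ("e3", 512, 8),
    ("e4", 1024, 16),
    ("e5", 2048, 32),
]


def encoder_shapes(height, width):
    """Return {stage: (channels, h, w)} for an input whose sides are multiples of 32."""
    if height % 32 or width % 32:
        raise ValueError("this sketch assumes sides divisible by 32")
    return {name: (ch, height // s, width // s) for name, ch, s in ASSUMED_STAGES}


def decoder_resolutions(height, width):
    """Trace the spatial size of b and d4..d1, assuming each Up() doubles resolution."""
    shapes = encoder_shapes(height, width)
    _, h, w = shapes["e5"]
    trace = {"b": (h, w)}
    for level in (4, 3, 2, 1):
        h, w = h * 2, w * 2
        skip_h, skip_w = shapes[f"e{level}"][1:]
        # Concat(Up(.), s_i) is only possible if both tensors share a spatial size.
        assert (h, w) == (skip_h, skip_w)
        trace[f"d{level}"] = (h, w)
    return trace


if __name__ == "__main__":
    for name, shape in encoder_shapes(352, 352).items():
        print(name, shape)
    print(decoder_resolutions(352, 352))
```

Under these assumptions the bottleneck sits at 11 × 11 and d1 at 176 × 176, so one further upsampling step would be needed to reach the input resolution. Algorithm 1 does not show such a step, and the text does not state how the supplementary implementation handles it.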
3. Pinwheel Convolution Module (Figure 3)

4. Dual-Attention Mechanism (Figure 4)
NOTE: The Dual-Attention Mechanism (DAM) is embedded within each skip connection to suppress background noise and enhance polyp-region features from both channel and spatial dimensions.
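The fusion step of the DAM, written in Algorithm 1 as F' = F ⊗ (α · Ac + β · As), relies on broadcasting a per-channel weight against a per-pixel weight. The sketch below illustrates only that broadcasting and the channel-wise pooling that feeds the spatial branch, using nested lists in place of tensors. The learned parts (the MLP and the 7 × 7 convolution) are not modeled; the attention maps are passed in as ready-made inputs, and the function names are hypothetical.

```python
def spatial_pool(features):
    """Channel-wise average and max maps for a C x H x W nested list.

    These two H x W maps are what the spatial branch stacks before its convolution.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    avg_map = [
        [sum(features[c][y][x] for c in range(channels)) / channels for x in range(width)]
        for y in range(height)
    ]
    max_map = [
        [max(features[c][y][x] for c in range(channels)) for x in range(width)]
        for y in range(height)
    ]
    return avg_map, max_map


def fuse_attention(features, channel_att, spatial_att, alpha=0.5, beta=0.5):
    """Apply F' = F * (alpha * A_c + beta * A_s).

    channel_att holds one weight per channel (length C); spatial_att holds one
    weight per pixel (H x W). alpha and beta are learnable in the protocol and
    are fixed here at their stated initial value of 0.5.
    """
    return [
        [
            [
                value * (alpha * channel_att[c] + beta * spatial_att[y][x])
                for x, value in enumerate(row)
            ]
            for y, row in enumerate(plane)
        ]
        for c, plane in enumerate(features)
    ]


if __name__ == "__main__":
    feats = [[[1.0, 2.0]], [[3.0, 4.0]]]  # C=2, H=1, W=2
    print(spatial_pool(feats))
    print(fuse_attention(feats, channel_att=[1.0, 0.0], spatial_att=[[1.0, 0.0]]))
```

Because the two maps are added rather than multiplied, a feature is fully suppressed only where both its channel weight and its pixel weight are close to zero.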


5. Multi-scale Feature Fusion
6. Loss Function and Training Configuration



7. Pseudocode
Algorithm 1: PWD-Net Polyp Segmentation
Input: Colonoscopy image I ∈ ℝ^(H×W×3)
Output: Segmentation mask M ∈ {0,1}^(H×W)

function PCM(X) ▷ Pinwheel Convolution Module
    Define base kernel W (3 × 3), angles Θ = {0°, 45°, ..., 315°}
    for each θ ∈ Θ do
        W_θ ← BilinearRotate(W, θ) ▷ Rotate kernel
        Y_θ ← Conv2d(X, W_θ) ▷ Direction-specific features
    end for
    Y_out ← ReLU(BN(Conv1×1(Concat({Y_θ})))) ▷ Aggregate
    return Y_out
end function

function DAM(F) ▷ Dual-Attention Mechanism
    A_c ← Sigmoid(MLP(AvgPool(F))) ▷ Channel attention (r = 16)
    A_s ← Sigmoid(Conv7×7([AvgPool(F); MaxPool(F)])) ▷ Spatial attention
    F' ← F ⊗ (α · A_c + β · A_s) ▷ Fuse with learnable α, β (init = 0.5)
    return F'
end function

function PWD-Net(I)
    Encoder: e1, e2, e3, e4, e5 ← ResNet50_Stages(I) ▷ 5-stage pretrained encoder
    Bottleneck: b ← PCM(e5) ▷ Apply PCM at bottleneck
    Skip connections: s_i ← DAM(e_i) for i = 1, 2, 3, 4 ▷ Filter encoder features
    Decoder:
        d4 ← DoubleConv(Concat(Up(b), s4))
        d3 ← DoubleConv(Concat(Up(d4), s3))
        d2 ← DoubleConv(Concat(Up(d3), s2))
        d1 ← DoubleConv(Concat(Up(d2), s1))
    M ← Sigmoid(Conv1×1(d1))
    return M
end function

Training:
for each epoch do
    M̂ ← PWD-Net(I)
    ℒ ← 0.5 · BCE(M̂, M_gt) + 0.5 · DiceLoss(M̂, M_gt) ▷ λ = 0.5
    Update parameters via backpropagation (Adam optimizer)
end for
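The BilinearRotate step of Algorithm 1 can be illustrated with a small, dependency-free sketch. It rotates a square kernel about its center by inverse-mapping each output cell and sampling the source kernel bilinearly, treating samples outside the kernel as zero. The zero-padding choice, the rotation direction, and the names `rotate_kernel` and `pinwheel_bank` are assumptions made for illustration; the supplementary implementation may differ.

```python
import math


def rotate_kernel(kernel, angle_deg):
    """Rotate a square kernel about its centre with bilinear interpolation.

    Positive angles turn the kernel counter-clockwise when rows are drawn
    top to bottom. Samples that fall outside the kernel count as zero.
    """
    n = len(kernel)
    centre = (n - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def sample(y, x):
        # Bilinear lookup over the four neighbouring cells, zero outside the grid.
        y0, x0 = math.floor(y), math.floor(x)
        fy, fx = y - y0, x - x0
        value = 0.0
        for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
            for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
                if 0 <= yy < n and 0 <= xx < n:
                    value += wy * wx * kernel[yy][xx]
        return value

    rotated = []
    for i in range(n):
        row = []
        for j in range(n):
            dy, dx = i - centre, j - centre
            # Inverse mapping: find where output cell (i, j) came from.
            # Rounding removes floating-point noise such as cos(90 deg) ~ 6e-17.
            src_y = round(centre + sin_t * dx + cos_t * dy, 9)
            src_x = round(centre + cos_t * dx - sin_t * dy, 9)
            row.append(sample(src_y, src_x))
        rotated.append(row)
    return rotated


def pinwheel_bank(kernel, step_deg=45):
    """Return the rotated copies of `kernel` for 0, 45, ..., 315 degrees."""
    return [rotate_kernel(kernel, angle) for angle in range(0, 360, step_deg)]


if __name__ == "__main__":
    base = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for angle, rotated in zip(range(0, 360, 45), pinwheel_bank(base)):
        print(angle, [[round(v, 3) for v in row] for row in rotated])
```

In the PCM, each rotated kernel would then be used in its own 2D convolution, with the eight responses concatenated and fused by the 1 × 1 convolution. Under the zero-padding assumption used here, rotations by odd multiples of 45° sample partly outside the 3 × 3 grid, so some corner weight is attenuated.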
Experimental Setup
Dataset
The Kvasir-SEG dataset was used to evaluate the segmentation behavior of PWD-Net on colonoscopy images with heterogeneous polyp appearances. The dataset contains 1,000 pixel-annotated polyp images and includes variation in polyp size, shape, texture, illumination, and background complexity, making it suitable for assessing small-target detection, boundary localization, and robustness to visual interference. The dataset was divided into training, validation, and test subsets, and the final test set was used only for performance evaluation. The distribution of images is summarized in Table 1.
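One way to obtain a reproducible 800/100/100 partition is a seeded shuffle of the sorted file list, sketched below. The seed value of 42 is taken from the protocol; the exact splitting procedure, the file names, and the function `split_dataset` are assumptions for illustration and will not necessarily reproduce the authors' partition.

```python
import random


def split_dataset(items, n_train=800, n_val=100, n_test=100, seed=42):
    """Deterministically partition `items` into train, validation, and test lists."""
    if len(items) != n_train + n_val + n_test:
        raise ValueError("subset sizes must add up to the number of items")
    ordered = sorted(items)  # fix the starting order so the seed alone decides the split
    random.Random(seed).shuffle(ordered)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical file names standing in for the 1,000 Kvasir-SEG images.
    names = [f"image_{i:04d}.jpg" for i in range(1000)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))
```

Sorting before shuffling keeps the result independent of the order in which the file system lists the images.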
Implementation Details
The implementation settings required for reproducibility are summarized in Table 2, and the complete procedural details are provided in the Data Preparation steps and Section 5.2 of the Protocol. All reported experiments used the same input resolution, hardware environment, and evaluation conditions listed in the Table of Materials. The reported values come from a single run with seed = 42, using the checkpoint selected by validation Dice, so the results should be interpreted as performance under a fixed experimental split rather than as averaged cross-validation outcomes.
Evaluation Metrics
Segmentation performance was evaluated using the Dice coefficient, Intersection over Union, pixel-level accuracy, and inference speed. Dice coefficient and Intersection over Union were used as the primary overlap-based metrics because they directly reflect agreement between the predicted mask and the expert-annotated polyp region. Pixel-level accuracy was reported as a supplementary measure because colonoscopy images often contain large background regions. Inference speed, reported as frames per second, was included to assess whether the model maintains practical computational efficiency while improving segmentation quality.
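For reference, the three mask-level metrics can be computed from the confusion counts of a predicted and a ground-truth binary mask, as in the minimal sketch below. It uses the standard definitions Dice = 2TP / (2TP + FP + FN), IoU = TP / (TP + FP + FN), and accuracy = (TP + TN) / N. Returning 1.0 when both masks are empty is a convention chosen here, not one stated in the protocol, and the function names are illustrative.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN over two binary masks of equal shape (nested lists)."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def dice_coefficient(pred, target):
    tp, fp, fn, _ = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom  # empty-vs-empty treated as a match


def iou_score(pred, target):
    tp, fp, fn, _ = confusion_counts(pred, target)
    denom = tp + fp + fn
    return 1.0 if denom == 0 else tp / denom


def pixel_accuracy(pred, target):
    tp, fp, fn, tn = confusion_counts(pred, target)
    return (tp + tn) / (tp + fp + fn + tn)


if __name__ == "__main__":
    pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
    target = [[1, 0, 0, 0], [0, 1, 1, 0]]
    print(dice_coefficient(pred, target), iou_score(pred, target), pixel_accuracy(pred, target))
```

Accuracy counts true negatives, so an image dominated by background can score highly even when the polyp itself is poorly segmented, which is why it is treated as supplementary.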
Comparison with Existing Methods
To assess the effectiveness of PWD-Net, it is compared with five representative polyp segmentation methods: CBSA (Channel-Boosted Spatial Attention network)34, FSSA (Feature-Shared Spatial Attention network), MSF (Multi-Scale Fusion network), Pinwheel-Conv (Pinwheel Convolution baseline without attention or fusion modules), and PolaLinear (Polarized Linear attention network). All comparison methods are reimplemented using their officially released source code and trained on the same Kvasir-SEG training set (800 images) under identical preprocessing, input resolution (352 × 352), and evaluation settings to ensure a fair comparison. Table 3 presents the quantitative results on the test set.
As shown in Table 3, PWD-Net achieves a Dice coefficient of 0.865 and an IoU of 0.765, which are 1.8 and 4.8 percentage points higher, respectively, than the next-best method (CBSA). Notably, PWD-Net achieves this with 9.1M parameters, compared with 18.4M for CBSA, indicating favorable efficiency. While PolaLinear and Pinwheel-Conv offer faster inference speeds (79 and 72 FPS, respectively), their segmentation accuracy is noticeably lower, suggesting that PWD-Net provides a reasonable balance between accuracy and computational cost for the evaluated dataset. To illustrate the qualitative segmentation behavior, five representative test samples covering small polyps, large polyps, complex backgrounds, and blurred boundaries are selected for visual comparison. Figure 5 presents the segmentation results of three comparator methods (CBSA, FSSA, and MSF) and PWD-Net alongside the ground truth. Each prediction column is labeled with the corresponding method name. Pinwheel-Conv and PolaLinear are omitted from this figure for visual clarity, as their quantitative performance is substantially lower; this figure therefore represents a selected subset of the methods compared in Table 3.
As shown in Figure 5, in small-polyp scenarios (first and fifth rows), FSSA and MSF exhibit missed detections, whereas PWD-Net captures the targets more completely. In large-polyp scenarios (second and third rows), CBSA and FSSA produce noticeable boundary irregularities, while PWD-Net generates smoother boundaries. In the blurred-boundary scenario (fourth row), PWD-Net demonstrates effective suppression of background noise via the dual-attention mechanism.
Ablation Study
To analyze the contribution of each core component in PWD-Net, a systematic ablation study is conducted. Using ResNet-50 as the backbone encoder to form the baseline model, the Pinwheel Convolution Module (Pinwheel), Dual-Attention Mechanism (Dual-Attn), and Multi-Scale Feature Fusion (MSF) module are incrementally incorporated. Table 4 summarizes the quantitative results.
The key findings from Table 4 can be summarized as follows; all gains are absolute differences, expressed in percentage points. First, adding any single module improves the performance of the baseline model. The Dual-Attention Mechanism brings the largest gains (Dice: +2.0 points, IoU: +2.7 points), supporting the effectiveness of adaptive noise suppression. The Pinwheel Convolution Module contributes a 1.6-point improvement in Dice, indicating the benefit of multi-directional feature extraction for irregular polyp shapes. Second, combining the Pinwheel Convolution Module and the Dual-Attention Mechanism further increases performance to Dice = 0.858 and IoU = 0.748, suggesting complementarity between the two modules. Finally, the complete PWD-Net (integrating all three modules) achieves the best observed performance (Dice = 0.865, IoU = 0.765), which is 3.3 and 6.0 points higher, respectively, than the baseline, demonstrating the contribution of each proposed component on this dataset.
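The gains quoted from Table 4 are absolute differences in the metric (percentage points), which is easy to confuse with relative improvement. The short sketch below recomputes both from the Table 4 values; the dictionary layout and the function name are illustrative only.

```python
# (Dice, IoU) values copied from Table 4.
TABLE_4 = {
    "Baseline": (0.832, 0.705),
    "+ Pinwheel": (0.848, 0.725),
    "+ Dual-Attn": (0.852, 0.732),
    "+ MSF": (0.844, 0.720),
    "+ Pinwheel + Dual-Attn": (0.858, 0.748),
    "Full (PWD-Net)": (0.865, 0.765),
}


def gain_over_baseline(config, table=TABLE_4):
    """Return absolute (percentage-point) and relative (percent) gains for one row."""
    base_dice, base_iou = table["Baseline"]
    dice, iou = table[config]
    return {
        "dice_points": (dice - base_dice) * 100,
        "iou_points": (iou - base_iou) * 100,
        "dice_relative_pct": (dice - base_dice) / base_dice * 100,
        "iou_relative_pct": (iou - base_iou) / base_iou * 100,
    }


if __name__ == "__main__":
    for name in TABLE_4:
        if name != "Baseline":
            gains = gain_over_baseline(name)
            print(name, {key: round(value, 1) for key, value in gains.items()})
```

For the full model, for example, the Dice gain is 3.3 points in absolute terms, which corresponds to a relative improvement of roughly 4% over the baseline value of 0.832.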
Training Process Analysis
To illustrate the training dynamics and convergence characteristics of PWD-Net, key performance metrics are recorded and visualized over 50 training epochs. Figure 6 shows the variations of the loss function, Dice coefficient, IoU, and accuracy during training.
As shown in Figure 6(a), both the training loss and validation loss decrease rapidly within the first 10 epochs and then gradually stabilize. The validation loss remains slightly higher than the training loss throughout, but the two curves follow a consistent trend with a small gap, indicating that the model does not suffer from severe overfitting. Figure 6(b) shows that the Dice coefficient rises sharply in the early training stage, converges after approximately the 30th epoch, and stabilizes above 0.86. The IoU curve in Figure 6(c) exhibits a similar growth trend, reaching around 0.765 in the late training phase. Figure 6(d) indicates that accuracy converges above 94%. The stable validation trends in the middle and late training stages suggest that the adopted data augmentation strategy and cosine annealing schedule contribute to mitigating overfitting on this dataset.
Performance across Polyp Sizes
To further evaluate the applicability of PWD-Net across different clinical scenarios, the test set (100 images) is divided into three categories according to the ratio of polyp area to the total image area: small polyps (< 5%), medium polyps (5%–30%), and large polyps (> 30%). This classification reflects the influence of the polyp scale on segmentation difficulty. Table 5 presents the quantitative performance on each category. As shown in Table 5, PWD-Net achieves the best performance in the medium-polyp category (Dice = 0.882, IoU = 0.790), which is consistent with the larger representation of this category (54 out of 100 test images). Performance on large polyps remains at a comparable level (Dice = 0.861, IoU = 0.760). Performance on small polyps is relatively lower (Dice = 0.812, IoU = 0.685), primarily because small targets occupy a small proportion of the image and are more susceptible to background noise with sparser boundary information.
These results suggest that the multi-directional feature capture capability of the Pinwheel Convolution Module and the spatial localization ability of the Dual-Attention Mechanism contribute to maintaining reasonable segmentation quality across different polyp scales on the evaluated test set.
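The size grouping described above can be reproduced from a ground-truth mask by computing the fraction of foreground pixels, as sketched below. The thresholds (5% and 30%) come from the text; assigning a ratio of exactly 5% or exactly 30% to the medium category follows the "5%–30%" label and is otherwise an assumption, as is the function name.

```python
def polyp_size_category(mask):
    """Classify a binary ground-truth mask (nested lists) by polyp-to-image area ratio."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask has no pixels")
    area = sum(1 for row in mask for value in row if value)
    ratio = area / total
    if ratio < 0.05:
        return "small"
    if ratio <= 0.30:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Toy 1 x 100 masks with 4, 20, and 45 foreground pixels.
    for foreground in (4, 20, 45):
        mask = [[1] * foreground + [0] * (100 - foreground)]
        print(foreground, polyp_size_category(mask))
```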

Figure 1: Framework of the PWD-Net Model. Overall structural framework of the proposed Polyp Segmentation Network based on Pinwheel Convolution and Dual Attention (PWD-Net), illustrating the encoder (ResNet-50), bottleneck (PCM), DAM-enhanced skip connections, MSF decoder, and output generation for colorectal polyp segmentation.

Figure 2: Overall Architecture Flowchart of PWD-Net. Detailed flowchart of the full PWD-Net architecture, showing the five-stage ResNet-50 encoder, PCM bottleneck, DAM skip connections, multi-scale feature fusion decoder, and final prediction generation.

Figure 3: Schematic Diagram of the Pinwheel Convolution Module. Structural and operational schematic of the Pinwheel Convolution Module, demonstrating multi-angle rotated convolution kernels, bilinear interpolation-based rotation, channel concatenation, and 1 × 1 convolution aggregation.

Figure 4: Structure Diagram of the Dual-Attention Mechanism. Architectural diagram of the DAM, showing the parallel channel attention branch (Global Average Pooling → MLP with reduction ratio r = 16 → Sigmoid) and spatial attention branch (channel-wise pooling → 7 × 7 convolution → Sigmoid), followed by weighted fusion with learnable coefficients α and β.

Figure 5: Qualitative comparison of segmentation results. Each row represents a test sample. Columns from left to right: Input image, Ground Truth, CBSA, FSSA, MSF, and PWD-Net (Ours). Pinwheel-Conv and PolaLinear are omitted from this figure for visual clarity; see Table 3 for the complete quantitative comparison.

Figure 6: Training curves of PWD-Net over 50 epochs. (a) Training and validation loss. (b) Dice coefficient. (c) Intersection over Union (IoU). (d) Pixel-level accuracy.
| Subset | Number of Samples | Proportion |
| Training Set | 800 | 80% |
| Validation Set | 100 | 10% |
| Test Set | 100 | 10% |
| Total | 1000 | 100% |
Table 1: Dataset Statistics. Dataset split distribution for the Kvasir-SEG dataset (1,000 images total), showing the number of images and proportion assigned to the training, validation, and test subsets (random seed = 42).
| Category | Parameter Item | Parameter Setting |
| Deep Learning Framework | Framework | PyTorch |
| Hardware Environment | GPU | NVIDIA Tesla P100 |
| Acceleration Method | GPU Acceleration | CUDA |
| Input Settings | Input Image Size | 352 × 352 |
| Image Format | Image Format | RGB Image |
| Optimizer | Optimizer | Adam |
| Initial Learning Rate | Initial LR | 1 × 10⁻⁴ |
| Batch Size | Batch Size | 16 |
| Training Epochs | Epochs | 50 |
| Loss Function | Loss Function | Dice Loss + BCE |
Table 2: Experimental Parameter Settings. Experimental parameter settings for PWD-Net training and evaluation. Refer to the Data Preparation steps and Section 5.2 of the Protocol for the complete step-by-step implementation procedure.
| Method | Dice ↑ | IoU ↑ | Accuracy ↑ | Parameters (M) ↓ | FPS ↑ |
| CBSA | 0.8466 | 0.717 | 0.9325 | 18.4 | 36 |
| FSSA | 0.7109 | 0.551 | 0.9012 | 9.8 | 61 |
| MSF | 0.7337 | 0.585 | 0.9086 | 11.5 | 54 |
| Pinwheel-Conv | 0.8007 | 0.6742 | 0.9401 | 7.9 | 72 |
| PolaLinear | 0.7213 | 0.5707 | 0.9113 | 6.6 | 79 |
| PWD-Net (Ours) | 0.865 | 0.7651 | 0.9478 | 9.1 | 63 |
Table 3: Quantitative Comparison Results. Quantitative comparison of PWD-Net with five existing polyp segmentation methods on the Kvasir-SEG test set (100 images). All methods are evaluated under identical data splits, preprocessing, and input resolution (352 × 352). ↑ indicates higher is better; ↓ indicates lower is better.
| Configuration | Pinwheel | Dual-Attn | MSF | Dice ↑ | IoU ↑ |
| Baseline | × | × | × | 0.832 | 0.705 |
| + Pinwheel | √ | × | × | 0.848 | 0.725 |
| + Dual-Attn | × | √ | × | 0.852 | 0.732 |
| + MSF | × | × | √ | 0.844 | 0.720 |
| + Pinwheel + Dual-Attn | √ | √ | × | 0.858 | 0.748 |
| Full (PWD-Net) | √ | √ | √ | 0.865 | 0.765 |
Table 4: Ablation Study Results. Ablation study results on the Kvasir-SEG test set, showing the incremental contribution of the Pinwheel Convolution Module (Pinwheel), Dual-Attention Mechanism (Dual-Attn), and Multi-Scale Feature Fusion (MSF) to the baseline ResNet-50 encoder.
| Polyp Type | Number | Dice ↑ | IoU ↑ |
| Small Polyps (< 5%) | 21 | 0.812 | 0.685 |
| Medium Polyps (5%–30%) | 54 | 0.882 | 0.790 |
| Large Polyps (> 30%) | 25 | 0.861 | 0.760 |
Table 5: Performance of PWD-Net on Different Polyp Types. Performance of PWD-Net on different polyp size categories within the Kvasir-SEG test set (100 images). Polyp size is defined by the ratio of polyp area to total image area.
Supplementary file: Compressed archive containing the implementation of the PWD-Net framework. The file includes model.py defining the network architecture with the Pinwheel Convolution Module (PCM) and Dual-Attention Mechanism (DAM), train.py implementing the data loading pipeline, loss function, and training procedure, test.py for model inference and evaluation on test datasets, and requirements.txt listing all required Python libraries and their corresponding versions.
Several design choices in the PWD-Net protocol are critical for achieving reliable segmentation results and merit careful attention during implementation. First, the selection and initialization of the encoder backbone directly influence convergence behavior and final performance. The protocol employs a ResNet-50 encoder pretrained on ImageNet, which provides robust low-level and mid-level feature initialization. This is particularly important for medical image segmentation tasks where the available training data are limited (800 images in the present study). Fine-tuning all encoder layers, rather than freezing them, allows the network to adapt the pretrained features to the specific characteristics of colonoscopy images, such as mucosal textures and specular reflections. Second, the placement of each core module within the architecture is intentional. The Pinwheel Convolution Module (PCM) is positioned at the bottleneck, where spatial resolution is lowest but semantic information is richest, enabling efficient capture of global geometric patterns without excessive computational cost. The Dual-Attention Mechanism (DAM) is embedded in the skip connections rather than in the decoder, so that background noise is suppressed before features reach the decoder and contaminated features do not propagate through the fusion stages. The ablation study (Table 4) is consistent with this design: the DAM contributes the largest individual performance gain (Dice: +2.0 percentage points), supporting the importance of early noise suppression in the feature pipeline. Third, the hybrid loss function (0.5 · BCE + 0.5 · Dice) balances pixel-level classification accuracy with region-level overlap optimization. This combination is particularly relevant for polyp segmentation, where foreground-background class imbalance is common. The equal weighting (λ = 0.5) is adopted as a default; adjusting this ratio may be necessary for datasets with different class distributions (see Modifications and Troubleshooting below).
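The hybrid loss can be written out explicitly as ℒ = λ · BCE + (1 − λ) · DiceLoss with λ = 0.5. The sketch below evaluates it on flat lists of predicted probabilities and binary targets. The clamping constant `eps` and the Dice smoothing term `smooth` are common numerical safeguards assumed here; the protocol does not specify them, and the function names are illustrative.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * sum(p*t) + s) / (sum(p) + sum(t) + s)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)


def hybrid_loss(probs, targets, lam=0.5):
    """Weighted sum: lam * BCE + (1 - lam) * Dice loss (lam = 0.5 in the protocol)."""
    return lam * bce_loss(probs, targets) + (1.0 - lam) * dice_loss(probs, targets)


if __name__ == "__main__":
    probs = [0.9, 0.1, 0.8, 0.3]
    targets = [1, 0, 1, 0]
    for lam in (0.4, 0.5, 0.6):
        print(lam, round(hybrid_loss(probs, targets, lam), 4))
```

Lowering `lam` shifts weight from the per-pixel BCE term toward the region-level Dice term, which is the adjustment discussed for datasets with different class distributions.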
Modifications and Troubleshooting
The following modifications and troubleshooting guidelines are provided for adapting the protocol to different experimental settings. When applying the protocol to datasets with different image resolutions or polyp size distributions, the input resolution (352 × 352) may need adjustment. Larger input sizes may improve small-polyp detection at the cost of increased memory consumption and reduced inference speed. If training loss does not converge within 50 epochs, consider reducing the initial learning rate (e.g., to 5 × 10⁻⁵) or increasing the cosine annealing cycle length. If the model exhibits high false-positive rates in regions with severe specular reflections or mucosal folds, increasing the weight of the Dice loss component (e.g., λ = 0.4 for BCE, 0.6 for Dice) may improve boundary precision at the expense of pixel-level accuracy. Conversely, if the model under-segments small polyps, increasing the BCE weight may help. The number of rotation angles in the PCM (currently eight, from 0° to 315° in 45° increments) represents a balance between directional coverage and computational cost. Reducing to four angles (0°, 90°, 180°, 270°) decreases computation but may reduce sensitivity to oblique polyp boundaries. The reduction ratio r = 16 in the channel attention branch of the DAM follows the convention established by prior squeeze-and-excitation networks32; smaller ratios (e.g., r = 8) increase model capacity but may lead to overfitting on small datasets. For datasets significantly larger than Kvasir-SEG, consider increasing the batch size and training epochs accordingly, and monitor validation metrics to determine the appropriate stopping point.
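To see how the learning-rate adjustments above interact with the schedule, the closed-form cosine annealing curve can be evaluated directly. The sketch assumes a single cycle spanning all 50 epochs and a floor of zero; the protocol gives only the initial rate (1 × 10⁻⁴), so `total_epochs` as the cycle length, `min_lr`, and the function name are assumptions.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-4, total_epochs=50, min_lr=0.0):
    """Learning rate at `epoch` for one cosine cycle from base_lr down to min_lr."""
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 10, 25, 40, 50):
        print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
    # Halving the initial rate, as suggested for non-converging runs.
    print(f"{cosine_annealing_lr(10, base_lr=5e-5):.2e}")
```

Increasing the cycle length flattens the curve, so the rate stays closer to its initial value for more epochs before decaying.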
Significance Relative to Alternative Methods
The PWD-Net architecture addresses specific limitations of existing approaches through three complementary modules. Compared with methods relying on standard square convolution kernels, the PCM provides directional sensitivity through multi-angle rotated kernels, enabling better adaptation to the irregular and diverse morphology of colorectal polyps. Compared with single-dimension attention mechanisms (e.g., channel-only attention in squeeze-and-excitation networks33), the DAM jointly models channel and spatial importance, offering more comprehensive noise suppression in the complex colonoscopy environment. Compared with Transformer-based architectures such as TransUNet34 and Polyp-PVT35, which offer strong global modeling but at higher computational cost, PWD-Net achieves competitive performance with a relatively compact model size (9.1M parameters) and practical inference speed (63 FPS), as documented in Table 3.
It should be noted that the comparisons presented in this study (Table 3) are conducted under controlled conditions with identical data splits, preprocessing, and evaluation protocols. The performance differences observed are specific to the Kvasir-SEG test set (100 images) used in this study and may not directly generalize to other datasets or clinical settings. A broader comparison incorporating additional established baselines (e.g., PraNet36, ResUNet++37) under standardized multi-dataset benchmarks would further strengthen the evidence and is planned for future work. Recent work on dual encoder-decoder architectures for polyp segmentation38 has demonstrated the potential of parallel encoding and decoding paths. The PWD-Net architecture differs by focusing on rotational geometric modeling and dual-attention filtering within a single encoder-decoder pipeline, representing a complementary design philosophy.
Several important limitations of this study should be acknowledged. First, regarding experimental scope, the current study reports results exclusively on the Kvasir-SEG dataset with a single random split of 800 training, 100 validation, and 100 test images. The test set size (100 images) is relatively small, and only a single training run is reported without repeated experiments or cross-validation. Consequently, the reported performance metrics may be subject to variance related to the specific data split. Future work should incorporate k-fold cross-validation or multiple random splits with reported standard deviations to provide more robust performance estimates. Second, the PCM introduces additional computational overhead through multi-angle kernel rotation and aggregation. Although the overall model remains compact (9.1M parameters), deployment on resource-constrained devices in clinical environments may require further optimization through techniques such as knowledge distillation or model pruning. Third, the model is trained and evaluated exclusively on static images, whereas clinical colonoscopy involves real-time video streams in which polyp appearance, size, and viewpoint change dynamically across consecutive frames. Although the inference speed of 63 FPS is compatible with real-time frame rates, this metric alone does not constitute clinical validation. Prospective validation on endoscopic video data, reader studies, and downstream clinical endpoint analyses would be necessary before any claims of clinical readiness can be made39,40,41. The current work should be understood as a methodological contribution rather than a clinically validated system.
Fourth, the clinical translation pathway for AI-assisted polyp segmentation extends well beyond segmentation accuracy. Recent reviews have highlighted that advanced imaging and analysis tools must be integrated within broader endoluminal workflows, including lesion classification, staging, and treatment planning. The current protocol focuses exclusively on binary polyp segmentation and does not address pathological42 classification (e.g., adenomatous vs. hyperplastic polyps) or malignancy risk assessment, which are essential for guiding clinical decisions. Fifth, the datasets used in this study are derived primarily from adult colonoscopy examinations. Data on pediatric polyps, polyps associated with inflammatory bowel disease, and other special pathological types are not represented. The generalizability of the model to these populations remains untested. Sixth, while ablation experiments and qualitative visualizations are provided to illustrate the function of each module, the interpretability of the model remains limited. The decision-making process of deep learning models is not fully transparent, which may affect clinician trust and adoption. Future work could incorporate gradient-based visualization techniques to provide more intuitive explanations of model predictions43.
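The first limitation above recommends k-fold cross-validation with reported standard deviations. A minimal standard-library sketch of how such splits and summary statistics could be produced is given below. The function names, the choice of k = 5, and the fixed seed are assumptions for illustration, and the per-fold training and evaluation step is deliberately left out.

```python
import random
import statistics

def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once with a fixed seed so that the folds
    are reproducible; every item appears in exactly one test fold.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Kvasir-SEG contains 1,000 images (800 + 100 + 100 in the split above).
for fold_id, (train, test) in enumerate(k_fold_splits(1000, k=5)):
    # A real run would train on `train` and compute Dice/IoU on `test` here.
    print(fold_id, len(train), len(test))
```

Once a metric such as the Dice coefficient has been collected for each fold, `summarize` gives the mean and standard deviation to report alongside it. A validation subset for early stopping could be carved out of each training fold in the same way.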
Despite the limitations noted above, the PWD-Net protocol provides a reproducible framework for polyp segmentation that may serve as a foundation for further development. Potential directions include: extending the model to video-based colonoscopy analysis by incorporating temporal modeling techniques; adding a classification branch for end-to-end segmentation and pathological typing; expanding evaluation to larger and more diverse multi-center datasets; and exploring integration within endoluminal robotic platforms, where AI-assisted image analysis is increasingly recognized as a key enabling technology44,45. The supplementary code package provided with this protocol is intended to facilitate reproduction and adaptation of the method by other research groups.
The authors have nothing to disclose.
This study was funded by the National Key R&D Program of China (Program Nos. 2022YFC3500200 and 2022YFC3500204).
| Name | Company | Catalog Number | Comments |
|---|---|---|---|
| Adam Optimizer | — | — | Included in PyTorch |
| Albumentations | Albumentations Team | v1.0+ | Data augmentation library |
| CUDA Toolkit | NVIDIA | v11.3+ | GPU acceleration |
| Kvasir-SEG dataset | SimulaMet | — | https://datasets.simula.no/kvasir-seg/ |
| Matplotlib | Matplotlib Community | v3.4+ | Visualization of training curves |
| NumPy | NumPy Community | v1.21+ | Numerical computation |
| NVIDIA Tesla P100 | NVIDIA | P100-PCIE-16GB | GPU for training and inference |
| OpenCV | OpenCV Community | v4.5+ | Image preprocessing |
| Python | Python Software Foundation | v3.8+ | Programming language |
| PyTorch | Meta Platforms | v1.12+ | Deep learning framework |
| ResNet-50 pretrained weights | PyTorch Model Zoo | — | ImageNet-1K pretrained |
| Ubuntu | Canonical | 18.04+ | Operating system |