Polyp Segmentation Network Based on Pinwheel Convolution and Dual Attention for Colorectal Precancerous Lesion Diagnosis

Ning Du; Xinqi Liu; Li Ji; Chuijie Wang

doi:10.3791/71178

Method Article

June 26th, 2026

Ning Du*¹ , Xinqi Liu*¹ , Li Ji² , Chuijie Wang³

¹National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, ²Xinglin College of Liaoning University of Traditional Chinese Medicine, ³Liaoning University of Traditional Chinese Medicine Affiliated Hospital

Summary

This protocol implements a U-shaped deep learning network integrating pinwheel convolution, dual attention, and multi-scale fusion to segment colorectal polyps.

Abstract

Accurate segmentation of colorectal polyps is crucial for the early prevention and diagnosis of colorectal cancer. However, due to the high heterogeneity of polyps in terms of shape, size, and texture, as well as the complexity of the intestinal environment (such as folds, specular reflections, and fecal residues), existing methods still face significant challenges in boundary localization and small-polyp detection. To address these issues, this paper proposes a Polyp Segmentation Network based on Pinwheel Convolution and Dual Attention (PWD-Net). The proposed network adopts a U-shaped encoder–decoder architecture, where a pretrained ResNet is employed as the encoder to extract multi-level local features. Specifically, a Pinwheel Convolution Module (PCM) is introduced at the bottleneck layer to capture the global geometric structure and multi-directional contextual information of polyps through multi-angle rotated convolution kernels. A Dual-Attention Mechanism (DAM) that integrates channel attention and spatial attention is designed to adaptively suppress background noise and enhance polyp-region features. In addition, a Multi-scale Feature Fusion (MSF) strategy is employed to combine deep semantic information with shallow boundary details, ensuring both completeness and precision of segmentation results. Experiments conducted on the Kvasir-SEG and CVC-ClinicDB datasets demonstrate that PWD-Net achieves average Dice coefficients of 0.865 and 0.944, and IoU scores of 0.765 and 0.892, respectively, significantly outperforming existing state-of-the-art methods. Ablation studies verify the effectiveness of each module, and cross-dataset evaluations confirm the strong generalization ability of the model. This study provides a high-precision and robust solution for clinical polyp segmentation, offering significant value for the early diagnosis of colorectal precancerous lesions and supporting computer-aided intervention.

Introduction

Colorectal cancer is one of the most common malignant tumors worldwide, with consistently high incidence and mortality rates. Studies have shown that most colorectal cancers develop from adenomatous polyps, a process that typically takes 10–15 years, providing a valuable time window for early detection and intervention. An increase of 1% in the adenoma detection rate (ADR) can reduce the risk of colorectal cancer by approximately 3%, significantly lowering patient mortality¹. Colonoscopy, regarded as the gold standard for colorectal cancer screening, enables direct removal of polyps during examination, thereby effectively reducing cancer incidence and mortality.

However, conventional colonoscopy heavily depends on the experience and skill level of endoscopists. Factors such as subjective judgment, visual fatigue, and distraction may lead to a miss rate of 20%–30%, which directly affects screening effectiveness². Therefore, developing computer-aided detection (CAD) systems for automatic segmentation of colorectal polyps holds considerable importance for improving ADR and reducing missed diagnoses. Recent clinical surveys have further highlighted the interest in integrating artificial intelligence into endoscopic lesion assessment workflows, reinforcing the need for robust and reproducible segmentation methods³.

In recent years, deep learning has achieved remarkable progress in medical image analysis, particularly convolutional neural networks (CNNs), which demonstrate strong capability in feature extraction and representation for image segmentation tasks⁴. As a classical medical image segmentation model, U-Net employs a symmetric encoder–decoder architecture and skip connections to achieve accurate pixel-level segmentation, becoming a benchmark in this field⁵. Building upon U-Net, many improved architectures have been proposed to address complex medical image segmentation tasks. UNet++ reduces the semantic gap between encoder and decoder feature maps by introducing nested and dense skip connections⁶. ResUNet++ integrates residual blocks, squeeze-and-excitation modules, dilated convolutions, and attention mechanisms, achieving strong performance in polyp segmentation⁷. U²-Net adopts a two-level nested U-shaped structure to capture multi-scale feature information⁸. More recently, a dual encoder-decoder-based deep polyp segmentation network has been proposed, leveraging parallel encoding and decoding paths to further enhance segmentation accuracy⁹.

Meanwhile, the introduction of attention mechanisms provides new solutions for feature enhancement and noise suppression. Attention U-Net employs attention gates to focus on target regions while suppressing irrelevant background information¹⁰. The Dual Attention Network (DANet) adaptively weights features from both channel and spatial dimensions¹¹, improving the perception of critical features. Triple Attention Networks (TANet) further enhance segmentation performance through adaptive selection of multi-scale features¹².

With the success of Transformer architectures in natural language processing and computer vision¹³, researchers have begun exploring their application in medical image segmentation. TransUNet was the first to employ a Transformer as an encoder to model long-range dependencies effectively¹⁴. Swin-UNet adopts a pure Transformer architecture and achieves efficient global information aggregation through a shifted-window mechanism¹⁵. UTNet proposes a hybrid architecture that combines the local feature extraction capability of CNNs with the global modeling ability of Transformers¹⁶.

In the field of polyp segmentation, Polyp-PVT utilizes a pyramid vision Transformer to capture multi-scale global semantic information¹⁷, while multi-scale nested UNet enhances contextual understanding by integrating Transformers¹⁸. Recent studies have also explored negative correlation learning strategies for cross-domain polyp segmentation¹⁹, Gompertz-augmented segmentation enhancement²⁰, and attention-based architectures incorporating boundary guidance²¹. Although these approaches improve segmentation performance to some extent, polyp segmentation still faces several challenges. First, polyps exhibit high heterogeneity in morphology, size, and texture, ranging from micro-polyps smaller than 5 mm to large polyps exceeding 30 mm, with shapes varying from circular and elliptical to highly irregular forms. Second, the intestinal environment is complex and variable, where mucosal folds, specular reflections, fecal residues, and food debris introduce severe background interference. Third, many polyps have blurred boundaries, may be partially occluded by folds, or submerged in intestinal fluids, making precise boundary localization extremely challenging²².

Existing methods still present clear limitations in addressing these challenges. Traditional CNNs are effective at extracting local texture and edge features; however, fixed square convolution kernels are not well suited to capturing diverse geometric shapes²³, especially for highly irregular polyps, and cannot effectively model multi-directional geometric features. Transformer-based methods can model global dependencies but are less effective at capturing fine local details and boundary information. Moreover, their high computational complexity makes them less suitable for real-time clinical applications²⁴. Recent polyp segmentation approaches such as PraNet, which uses reverse attention modules to refine key regions²⁵, boundary-guided cascade attention networks that enhance boundary feature extraction²⁶, and CAFE-Net, which fuses encoder and decoder features through cross-attention mechanisms²⁷, still encounter insufficient feature representation and inaccurate boundary localization when dealing with small polyps²⁸, blurred boundaries, and complex backgrounds. Furthermore, most methods neglect geometric morphology and fail to fully exploit multi-directional contextual information, resulting in suboptimal segmentation of irregularly shaped polyps.

In summary, current CNN-based methods lack the ability to capture multi-directional geometric features due to their reliance on fixed square convolution kernels. Transformer-based approaches offer global modeling but sacrifice local boundary precision and impose high computational costs. Meanwhile, existing attention-enhanced and multi-scale fusion strategies have not been jointly optimized within a unified framework specifically tailored for polyp segmentation²⁹. These gaps motivate the development of a method that simultaneously addresses geometric feature modeling, adaptive noise suppression, and cross-scale feature integration.

To address these issues, this protocol presents a Polyp Segmentation Network based on Pinwheel Convolution and Dual Attention (PWD-Net). The proposed network integrates geometric feature modeling, multi-dimensional attention enhancement, and multi-scale feature fusion, enabling precise segmentation of complex polyps. The main contributions of this work are summarized as follows. First, the Pinwheel Convolution Module (PCM), inspired by the structure of a pinwheel, is a novel rotated convolution kernel design that captures multi-directional geometric features of polyps through convolution operations at eight angles (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°). This module replaces the conventional convolution layer at the bottleneck stage, enabling effective perception of diverse edge orientations and improving the representation of irregularly shaped polyps. Second, the Dual-Attention Mechanism (DAM), which integrates channel attention and spatial attention, addresses background noise such as folds, reflections, and fecal residues in colonoscopy images. Embedded within the skip connections, this module adaptively suppresses background interference and enhances feature responses in polyp regions by jointly identifying "what" is important (channel dimension) and "where" the target is located (spatial dimension), ensuring that only refined features are involved in subsequent fusion. Third, the Multi-scale Feature Fusion (MSF) strategy preserves both deep semantic information and shallow boundary details through a hierarchical mechanism introduced in the decoder. By progressively integrating DAM-enhanced encoder features with upsampled decoder features, this strategy compensates for the spatial detail lost during downsampling, enabling accurate detection of small polyps and precise boundary delineation.

Protocol

This study uses only publicly available, anonymized colonoscopy image datasets (Kvasir-SEG). No new human subject data were collected. Institutional ethics approval and informed patient consent were not required, as confirmed by the institutional review policies for retrospective analyses of de-identified public datasets.

1. Data Preparation

Download the Kvasir-SEG dataset from the official repository³³ (https://datasets.simula.no/kvasir-seg/). The dataset contains 1,000 polyp images with corresponding pixel-level ground-truth masks.
Randomly split the dataset into training (800 images), validation (100 images), and test (100 images) sets with a ratio of 8:1:1 using a fixed random seed (seed = 42). Verify that no images overlap across the three subsets to prevent data leakage.
Resize all images and corresponding masks to 352 x 352 pixels using bilinear interpolation for images and nearest-neighbor interpolation for masks.
Normalize pixel values to [0, 1] by dividing by 255, then apply ImageNet channel-wise mean subtraction (0.485, 0.456, 0.406) and standard deviation normalization (0.229, 0.224, 0.225).
Apply the following augmentation transforms to the training set only (not to the validation or test sets): random horizontal flip (probability = 0.5); random vertical flip (probability = 0.5); random rotation (range: −30° to +30°, probability = 0.5); random multi-scale resizing (scale factor: 0.75 to 1.25, probability = 0.5)
NOTE: Apply identical spatial transforms to both the image and its corresponding mask to maintain alignment. Verify augmentation correctness by visually inspecting several augmented image–mask pairs before initiating training.
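The data preparation steps above can be sketched in plain Python as follows. This is a minimal illustration, not the supplementary implementation: the function names `split_indices`, `normalize_pixel`, and `paired_hflip` are hypothetical, images are represented as nested lists rather than tensors, and the split operates on image indices instead of files.

```python
import random

# ImageNet statistics quoted in the normalization step of this protocol.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def split_indices(n_images=1000, sizes=(800, 100, 100), seed=42):
    """Shuffle image indices with a fixed seed and cut them into train/val/test."""
    if sum(sizes) != n_images:
        raise ValueError("subset sizes must add up to the dataset size")
    ids = list(range(n_images))
    random.Random(seed).shuffle(ids)  # local RNG, so the global state is untouched
    n_train, n_val, _ = sizes
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    # Leakage check: the three subsets must be pairwise disjoint.
    assert not (set(train) & set(val) or set(train) & set(test) or set(val) & set(test))
    return train, val, test


def normalize_pixel(rgb):
    """Scale an 8-bit RGB triple to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def paired_hflip(image_rows, mask_rows, rng, p=0.5):
    """Flip image and mask together so that they stay aligned."""
    if rng.random() < p:
        return [row[::-1] for row in image_rows], [row[::-1] for row in mask_rows]
    return image_rows, mask_rows
```

Drawing the random decision once per image–mask pair, as in `paired_hflip`, is what keeps the spatial transforms identical for both; the same pattern would apply to the vertical flip, rotation, and rescaling.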

2. Overall Architecture

NOTE: Refer to Figure 1 for the macro-level encoder–decoder backbone of PWD-Net, and to Figure 2 for the integration and interaction of core modules within the feature flow. The overall architecture follows a U-shaped encoder–decoder design to handle scale variations of polyps and background interference in colonoscopy images.

Backbone and Encoding Path (Figure 1)
1. Employ a ResNet-50 pretrained on ImageNet (sourced from the official PyTorch model zoo) as the backbone encoder³⁰. Fine-tune all encoder layers during training.
2. Feed the input colonoscopy image (resized to 352 x 352 pixels) through five stages of residual convolutional blocks to extract hierarchical features. The spatial resolution of feature maps is progressively downsampled from H x W to H/32 x W/32 across the five stages, while the channel dimensions increase correspondingly (64 → 128 → 256 → 512 → 1024).
3. At the bottleneck (the deepest encoder layer), replace the standard convolutional layer with the Pinwheel Convolution Module (PCM, described in Section 3) to capture the global geometric morphology and multi-directional contextual information at low resolution.
  NOTE: The five encoder stages correspond to the standard ResNet-50 layer groups: conv1, layer1, layer2, layer3, and layer4. Pretrained weights provide robust low-level and mid-level feature initialization, reducing convergence time on small medical datasets.
Key Components and Feature Interaction (Figure 2 and Figure 3)
1. Apply the Dual-Attention Mechanism (DAM, described in Section 4) to the output of each encoder stage before transmitting it to the decoder via skip connections. This step adaptively suppresses background noise generated by intestinal folds and specular reflections, while boosting the feature response in polyp regions. Only the filtered features are passed to the corresponding decoder layer.
2. In the decoder, progressively restore spatial resolution through bilinear upsampling. At each decoder layer, concatenate the upsampled features from the preceding decoder stage with the DAM-enhanced encoder features of the same spatial resolution.
3. Apply two consecutive convolutional layers (each followed by batch normalization and ReLU activation) to fuse the multi-scale information. This constitutes the Multi-scale Feature Fusion (MSF) strategy described in Section 5.
  NOTE: The decoder proceeds from deep to shallow layers (stage 5 → stage 1), ensuring that deep semantic localization information and shallow boundary detail information are effectively integrated at each level.
Output Generation
1. Apply a convolutional layer followed by a Sigmoid activation function to the final decoder output to generate the prediction mask.
2. Binarize the prediction mask using a threshold of 0.5 to obtain the final segmentation result, where pixels with predicted probability ≥ 0.5 are classified as polyp and the remaining pixels as background.
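As a quick sanity check on the tensor sizes and the output rule described above, the following sketch traces the encoder feature-map shapes and applies the 0.5 threshold. It assumes that each of the five encoder stages halves the spatial resolution and uses the channel widths quoted in the encoding-path step; `encoder_shapes` and `binarize` are hypothetical helper names, not part of the supplementary code.

```python
def encoder_shapes(height, width, channels=(64, 128, 256, 512, 1024)):
    """Return (C, H, W) after each encoder stage, assuming a stride of 2 per stage."""
    shapes = []
    h, w = height, width
    for c in channels:
        h, w = h // 2, w // 2  # assumption: every stage halves the resolution
        shapes.append((c, h, w))
    return shapes


def binarize(prob_map, threshold=0.5):
    """Label a pixel as polyp (1) when its probability is >= threshold, else 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


if __name__ == "__main__":
    for stage, shape in enumerate(encoder_shapes(352, 352), start=1):
        print(f"stage {stage}: C x H x W = {shape}")
```

Under these assumptions a 352 x 352 input reaches the bottleneck at 11 x 11, which is the H/32 x W/32 resolution at which the PCM operates.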

3. Pinwheel Convolution Module (Figure 3)

The Pinwheel Convolution Module (PCM) replaces the standard bottleneck convolution to capture multi-directional geometric features of polyps. Implement this module as follows:
1. Define a base convolution kernel W of size 3 x 3 with C_in input channels and C_out output channels.
2. Define the set of rotation angles Θ = {0°, 45°, 90°, …, 315°}. For each angle θ ∈ Θ, generate the rotated kernel W_θ by applying bilinear interpolation-based rotation to W. All eight rotated kernels share the same base parameters; only the spatial arrangement of weights differs.
3. For each angle θ, compute the direction-specific feature map:
  $Y_\theta = \text{Conv}(X, W_\theta)$
  where X is the input feature map.
4. Aggregate the eight directional feature maps by channel-wise concatenation along the channel axis, producing a tensor of dimension (8 x C_out) x H x W. Then apply a 1 x 1 convolution to reduce the channel dimension back to C_out, followed by batch normalization and ReLU activation³¹:
  $Y_{\text{out}} = F_{\text{agg}}(\{Y_\theta \mid \theta \in \Theta\})$
  NOTE: The rotation and interpolation are performed on the kernel weights, not on the input feature map. This design enables parameter-efficient multi-directional feature extraction without increasing the input resolution. In the current implementation, C_in = 1024 and C_out = 1024 at the bottleneck stage, matching the output channel dimension of the ResNet-50 layer4. Refer to the supplementary code package for the complete implementation.
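The kernel-rotation step of the PCM can be illustrated on a single 3 x 3 kernel in plain Python. This is a sketch under stated assumptions rather than the supplementary implementation: the kernel is rotated about its center, samples that fall outside the 3 x 3 support are treated as zero, and the sign convention of the rotation is chosen arbitrarily. The names `rotate_kernel` and `pinwheel_kernels` are hypothetical.

```python
import math


def rotate_kernel(kernel, angle_deg):
    """Rotate a square kernel about its center using bilinear sampling."""
    n = len(kernel)
    c = (n - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def sample(y, x):
        # Bilinear interpolation; neighbors outside the kernel contribute zero.
        y0, x0 = math.floor(y), math.floor(x)
        wy, wx = y - y0, x - x0
        value = 0.0
        for yy, fy in ((y0, 1.0 - wy), (y0 + 1, wy)):
            for xx, fx in ((x0, 1.0 - wx), (x0 + 1, wx)):
                if 0 <= yy < n and 0 <= xx < n:
                    value += fy * fx * kernel[yy][xx]
        return value

    rotated = []
    for i in range(n):
        row = []
        for j in range(n):
            y, x = i - c, j - c
            # Inverse mapping: locate the source position of each output weight.
            # Rounding removes floating-point noise at multiples of 90 degrees.
            src_x = round(cos_t * x + sin_t * y + c, 9)
            src_y = round(-sin_t * x + cos_t * y + c, 9)
            row.append(sample(src_y, src_x))
        rotated.append(row)
    return rotated


def pinwheel_kernels(base_kernel):
    """Build the eight direction-specific kernels that share one base kernel."""
    return {angle: rotate_kernel(base_kernel, angle) for angle in range(0, 360, 45)}
```

Because every rotated kernel is derived from the same base weights, gradients from all eight directional branches would accumulate into a single parameter set, which is the parameter-sharing property described in the NOTE above. At 45° offsets the corner weights are interpolated from their neighbors, so those kernels are smoothed versions of the base kernel rather than exact permutations.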

4. Dual-Attention Mechanism (Figure 4)

NOTE: The Dual-Attention Mechanism (DAM) is embedded within each skip connection to suppress background noise and enhance polyp-region features from both channel and spatial dimensions.

Channel Attention
The channel attention branch identifies which feature channels are most informative. Given an input feature F ∈ ℝ^C×H×W:
1. Compress the spatial dimensions via Global Average Pooling to obtain a channel descriptor z ∈ ℝ^C×1×1.
2. Pass z through a two-layer MLP (fully connected layers) with a reduction ratio r = 16. The first layer reduces the dimension from C to C/16 with ReLU activation; the second layer restores it from C/16 to C with Sigmoid activation to produce the channel weight vector A_c:
  $A_c = \sigma(W_2\,\delta(W_1 z))$
  where W_1 and W_2 denote the weights of the first and second fully connected layers, δ denotes ReLU, and σ denotes Sigmoid.
Spatial Attention
The spatial attention branch locates where the target regions are:
1. Apply both max pooling and average pooling along the channel dimension to generate two 2D feature maps of size 1 x H x W.
2. Concatenate the two maps along the channel axis to form a 2 x H x W tensor. Apply a 7 x 7 convolutional layer followed by Sigmoid activation to produce the spatial weight map A_s ∈ ℝ^1×H×W:
  $A_s = \sigma(\text{Conv}_{7 \times 7}([\text{AvgPool}(F); \text{MaxPool}(F)]))$
Feature Fusion
1. Fuse the channel and spatial attention outputs with the input feature through element-wise multiplication:
  $F' = F \otimes (\alpha \cdot A_c + \beta \cdot A_s)$
  where α and β are learnable balance coefficients, both initialized to 0.5 and updated jointly with the network parameters via gradient-based optimization during training.
  NOTE: Refer to the supplementary code package (dam_module.py) for the complete implementation.
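The data flow of the DAM can be traced on a toy feature map with nested Python lists. The sketch below is illustrative only and is not the contents of dam_module.py: the weights `W1`, `W2`, and `K` are placeholders supplied by the caller, bias terms are omitted, the convolution uses zero padding, and the function names are hypothetical. The fusion follows Line 17 of Algorithm 1, F' = F ⊗ (α · A_c + β · A_s).

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(F, W1, W2):
    """F: C x H x W. W1: (C/r) x C, W2: C x (C/r). Returns one weight per channel."""
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in F]   # global average pooling
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in W1]    # FC layer + ReLU
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in W2]  # FC layer + Sigmoid


def spatial_attention(F, K):
    """K: 2 x k x k kernel applied to the stacked [average; max] maps (zero padding)."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    avg = [[sum(F[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(F[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    maps, k = (avg, mx), len(K[0])
    pad = k // 2
    A_s = []
    for i in range(H):
        row = []
        for j in range(W):
            acc = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:
                            acc += K[m][di][dj] * maps[m][y][x]
            row.append(sigmoid(acc))
        A_s.append(row)
    return A_s


def dam(F, W1, W2, K, alpha=0.5, beta=0.5):
    """Reweight F by the blended channel and spatial attention maps."""
    A_c = channel_attention(F, W1, W2)
    A_s = spatial_attention(F, K)
    return [[[F[c][i][j] * (alpha * A_c[c] + beta * A_s[i][j])
              for j in range(len(F[0][0]))]
             for i in range(len(F[0]))]
            for c in range(len(F))]
```

With all placeholder weights set to zero, both attention maps equal 0.5 everywhere, so with α = β = 0.5 the module simply halves its input; this gives a convenient check that the broadcasting of the per-channel and per-pixel weights is wired correctly before learned weights are used.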

5. Multi-scale Feature Fusion

Apply the multi-scale feature fusion (MSF) strategy in the decoder to address spatial detail loss in deep features. At each decoder stage, perform the following:
Upsample the feature map from the preceding decoder stage by a factor of 2 using bilinear interpolation.
Concatenate the upsampled features with the DAM-enhanced encoder features of the corresponding spatial resolution along the channel axis.
Apply two consecutive 3 x 3 convolutional layers (each followed by batch normalization and ReLU activation³²) to fuse the concatenated features.
NOTE: This cross-level fusion ensures that the boundary details of polyps (provided by shallow encoder features) and semantic localization (provided by deep features) are simultaneously preserved, generating fine-grained segmentation results.
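The first two MSF operations, upsampling and concatenation, can be sketched without a deep learning framework. The example assumes half-pixel sampling centers with edge clamping for the bilinear interpolation, which is one common convention and may differ from the setting used in the supplementary code; the two learned 3 x 3 convolutions are omitted. `upsample2x_bilinear` and `fuse_inputs` are hypothetical names.

```python
def upsample2x_bilinear(fmap):
    """Upsample one H x W map by a factor of 2 (half-pixel centers, edges clamped)."""
    H, W = len(fmap), len(fmap[0])
    out = []
    for i in range(2 * H):
        sy = min(max((i + 0.5) / 2 - 0.5, 0.0), H - 1)  # source row coordinate
        y0 = int(sy)
        y1 = min(y0 + 1, H - 1)
        wy = sy - y0
        row = []
        for j in range(2 * W):
            sx = min(max((j + 0.5) / 2 - 0.5, 0.0), W - 1)  # source column coordinate
            x0 = int(sx)
            x1 = min(x0 + 1, W - 1)
            wx = sx - x0
            top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
            bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def fuse_inputs(decoder_feat, skip_feat):
    """Upsample every decoder channel, then concatenate with the skip features."""
    upsampled = [upsample2x_bilinear(ch) for ch in decoder_feat]
    same_size = (len(upsampled[0]) == len(skip_feat[0])
                 and len(upsampled[0][0]) == len(skip_feat[0][0]))
    if not same_size:
        raise ValueError("upsampled decoder features and skip features differ in size")
    return upsampled + skip_feat  # concatenation along the channel axis
```

The size check makes the resolution-matching requirement explicit: concatenation is only defined when the upsampled decoder features and the DAM-enhanced encoder features share the same height and width.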

6. Loss Function and Training Configuration

Loss Function
1. Adopt a hybrid loss function L_total to jointly optimize the network and to address the foreground–background class imbalance that is common in polyp segmentation.
  Binary Cross-Entropy Loss (L_BCE) measures the pixel-level classification accuracy:
  $L_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$
  where N is the total number of pixels, y_i ∈ {0,1} is the ground-truth label, and ŷ_i ∈ [0,1] is the predicted probability.
2. Dice Loss (L_Dice) quantifies the set similarity between the predicted and ground-truth regions:
  $L_{\text{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i + \varepsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i + \varepsilon}$
  where ε is a smoothing factor (set to 1 x 10⁻⁵) to avoid division by zero.
  Combine the two terms as
  $L_{\text{total}} = \lambda L_{\text{BCE}} + (1 - \lambda) L_{\text{Dice}}$
  and set λ = 0.5 to balance the contributions of the two loss terms, which gives the weighting 0.5 · L_BCE + 0.5 · L_Dice used in Algorithm 1.
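The hybrid loss can be checked numerically with a few lines of plain Python. The sketch assumes the weighting 0.5 · BCE + 0.5 · Dice stated in Algorithm 1, a Dice loss of the form 1 − (2·Σ y·ŷ + ε) / (Σ y + Σ ŷ + ε), and a small probability clamp that is added here only to keep the logarithm finite; the clamp and the function names are not taken from the supplementary code.

```python
import math


def bce_loss(y_true, y_prob, clamp=1e-7):
    """Mean binary cross-entropy over all pixels (probabilities clamped away from 0 and 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, clamp), 1.0 - clamp)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def dice_loss(y_true, y_prob, eps=1e-5):
    """Soft Dice loss with smoothing factor eps."""
    intersection = sum(y * p for y, p in zip(y_true, y_prob))
    denominator = sum(y_true) + sum(y_prob)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def hybrid_loss(y_true, y_prob, lam=0.5):
    """lam * BCE + (1 - lam) * Dice; lam = 0.5 weights the two terms equally."""
    return lam * bce_loss(y_true, y_prob) + (1.0 - lam) * dice_loss(y_true, y_prob)


if __name__ == "__main__":
    labels = [1, 0, 1, 0]            # flattened ground-truth mask
    probs = [0.9, 0.1, 0.8, 0.2]     # flattened predicted probabilities
    print(round(bce_loss(labels, probs), 4), round(dice_loss(labels, probs), 4))
```

The Dice term depends only on the overlap between prediction and ground truth, so it stays informative when polyp pixels are a small fraction of the image, whereas the cross-entropy term provides a per-pixel gradient everywhere.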
Training Configuration
1. Initialize the encoder with ImageNet-pretrained ResNet-50 weights. Initialize all decoder layers, PCM, and DAM parameters using Kaiming uniform initialization.
2. Configure the optimizer and training schedule as follows. Use the Adam optimizer with β₁ = 0.9 and β₂ = 0.999. Set the initial learning rate to 1 x 10⁻⁴. Apply a cosine annealing learning rate schedule with T_max = 50 and η_min = 1 x 10⁻⁶. Use a batch size of 16 and train the model for 50 epochs.
3. Train the model for 50 epochs on the training set (800 images). At the end of each epoch, evaluate the model on the validation set (100 images) using the Dice coefficient as the primary monitoring metric.
4. Save the model checkpoint that achieves the highest Dice coefficient on the validation set. Use this checkpoint as the final model for all subsequent evaluation on the test set.
  NOTE: Early stopping is not explicitly applied. The best-validation-Dice checkpoint selection strategy serves as the model selection criterion. All experiments are conducted using the hardware and software environment specified in the Table of Materials. Training for 50 epochs on 800 images takes approximately 2 h under the described configuration. All reported results are obtained from a single training run using the specified random seed (seed = 42). Refer to the supplementary code package for the complete training script.
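The learning-rate schedule and the checkpoint-selection rule can be previewed without running any training. The sketch below uses the standard closed-form cosine annealing expression, η_t = η_min + 0.5 · (η_0 − η_min) · (1 + cos(π · t / T_max)), with the values given above; `cosine_lr` and `best_checkpoint` are hypothetical helpers, and the validation scores in the usage example are made-up placeholders, not results from this study.

```python
import math


def cosine_lr(epoch, base_lr=1e-4, eta_min=1e-6, t_max=50):
    """Closed-form cosine annealing learning rate at a given epoch."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


def best_checkpoint(val_dice_per_epoch):
    """Return (epoch_index, dice) of the highest validation Dice; the earliest epoch wins ties."""
    best_epoch = max(range(len(val_dice_per_epoch)), key=lambda e: val_dice_per_epoch[e])
    return best_epoch, val_dice_per_epoch[best_epoch]


if __name__ == "__main__":
    for epoch in (0, 10, 25, 40, 50):
        print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.2e}")
    # Placeholder validation scores, for illustration only.
    print(best_checkpoint([0.70, 0.81, 0.84, 0.83]))
```

Since the schedule decays monotonically from 1 x 10⁻⁴ to 1 x 10⁻⁶ over exactly T_max = 50 epochs, no warm restart occurs within the training run described here.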

7. Pseudocode

Use Algorithm 1 as the complete workflow map for PWD-Net. Match the PCM, DAM, main architecture, and training pipeline blocks in the algorithm with the corresponding files in the supplementary code package.
Implement the PCM block shown in Lines 4 to 12. Define a base 3 x 3 convolution kernel and generate eight rotated kernels at 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315° using bilinear interpolation.
Keep the same learnable base parameters for all rotated PCM kernels. For each rotation angle, compute one direction-specific feature map.
Concatenate the eight PCM feature maps along the channel dimension. Apply a 1 x 1 convolution, batch normalization, and ReLU activation to restore the original channel dimension.
Implement the DAM block shown in Lines 14 to 19. Apply Global Average Pooling to generate the channel descriptor, then pass it through a two-layer MLP with a reduction ratio of 16 to obtain channel weights.
Generate the spatial attention map by applying channel-wise average pooling and max pooling to the input feature. Concatenate the two maps and process them with a 7 x 7 convolution followed by Sigmoid activation.
Fuse the DAM channel and spatial attention outputs with the input feature using element-wise multiplication. Weight the two attention maps with learnable coefficients α and β, both initialized to 0.5.
Build the main PWD-Net architecture shown in Lines 21 to 32. Pass the input image through five stages of a pretrained ResNet-50 encoder to obtain e1 to e5, with spatial resolution decreasing from H x W to H/32 x W/32.
Apply PCM to e5 at the bottleneck. Apply DAM to e1 to e4 before sending these features to the decoder through skip connections.
Decode the feature map from deep to shallow layers. At each decoder level, upsample the previous feature, concatenate it with the corresponding DAM-enhanced encoder feature, and apply DoubleConv for feature fusion.
Generate the segmentation output with a 1 x 1 convolution followed by Sigmoid activation. Use the resulting pixel-wise probability map as the predicted mask.
Implement the training loop shown in Lines 34 to 39. In each epoch, run forward propagation through PWD-Net and compute the predicted mask.
Compute the training loss as 0.5 x BCE loss plus 0.5 x Dice loss. Update all learnable parameters with the Adam optimizer through backpropagation.

Algorithm 1: PWD-Net Polyp Segmentation
1: Input: Colonoscopy image I ∈ ℝ^H×W×3
2: Output: Segmentation mask M ∈ {0,1}^(H×W)
3:
4: function PCM(X) ▷ Pinwheel Convolution Module
5: Define base kernel W (3 x 3), angles Θ = {0°, 45°, ..., 315°}
6: for each θ ∈ Θ do
7: W_θ ← BilinearRotate(W, θ) ▷ Rotate kernel
8: Y_θ ← Conv2d(X, W_θ) ▷ Direction-specific features
9: end for
10: Y_out ← ReLU(BN(Conv1 x 1(Concat({Y_θ})))) ▷ Aggregate
11: return Y_out
12: end function
13:
14: function DAM(F) ▷ Dual-Attention Mechanism
15: A_c ← Sigmoid(MLP(AvgPool(F))) ▷ Channel attention (r=16)
16: A_s ← Sigmoid(Conv7 x 7([AvgPool(F); MaxPool(F)])) ▷ Spatial attention
17: F' ← F ⊗ (α · A_c + β · A_s) ▷ Fuse with learnable α, β (init=0.5)
18: return F'
19: end function
20:
21: function PWD-Net(I)
22: Encoder: e₁, e₂, e₃, e₄, e₅ ← ResNet50_Stages(I) ▷ 5-stage pretrained encoder
23: Bottleneck: b ← PCM(e₅) ▷ Apply PCM at bottleneck
24: Skip connections: s_i ← DAM(e_i) for i = 1, 2, 3, 4 ▷ Filter encoder features
25: Decoder:
26: d₄ ← DoubleConv(Concat(Up(b), s₄))
27: d₃ ← DoubleConv(Concat(Up(d₄), s₃))
28: d₂ ← DoubleConv(Concat(Up(d₃), s₂))
29: d₁ ← DoubleConv(Concat(Up(d₂), s₁))
30: M ← Sigmoid(Conv1 x 1(d₁))
31: return M
32: end function
33:
34: Training:
35: for each epoch do
36: M̂ ← PWD-Net(I)
37: ℒ ← 0.5 · BCE(M̂, M_gt) + 0.5 · DiceLoss(M̂, M_gt) ▷ λ = 0.5
38: Update parameters via backpropagation (Adam optimizer)
39: end for

Results

Experimental Setup
Dataset

The Kvasir-SEG dataset was used to evaluate the segmentation behavior of PWD-Net on colonoscopy images with heterogeneous polyp appearances. The dataset contains 1,000 pixel-annotated polyp images and includes variation in polyp size, shape, texture, illumination, and background complexity, making it suitable for assessing small-target detection, boundary localization, and robustness to visual interference. The dataset was divided into training, validation, and test subsets, and the final test set was used only for performance evaluation. The distribution of images is summarized in Table 1.

Implementation Details

The implementation settings required for reproducibility are summarized in Table 2, and the complete procedural details are provided in the Data Preparation steps (Section 1) and the Training Configuration steps (Section 6) of the Protocol. For interpreting the results, all reported experiments used the same input resolution, hardware environment, and evaluation conditions listed in the Table of Materials. The reported values are based on the best-validation-Dice checkpoint from a single run using seed = 42, so the results should be interpreted as performance under a fixed experimental split rather than as averaged cross-validation outcomes.

Evaluation Metrics

Segmentation performance was evaluated using the Dice coefficient, Intersection over Union, pixel-level accuracy, and inference speed. Dice coefficient and Intersection over Union were used as the primary overlap-based metrics because they directly reflect agreement between the predicted mask and the expert-annotated polyp region. Pixel-level accuracy was reported as a supplementary measure because colonoscopy images often contain large background regions. Inference speed, reported as frames per second, was included to assess whether the model maintains practical computational efficiency while improving segmentation quality.
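For reference, the three mask-based metrics can be computed from the confusion counts of a binarized prediction as in the sketch below. It assumes binary masks stored as nested lists and returns 1.0 for Dice and IoU when both masks are empty, which is a convention chosen for this example; `confusion_counts` and `segmentation_metrics` are hypothetical names.

```python
def confusion_counts(pred, gt):
    """Count true/false positives and negatives between two binary masks."""
    tp = fp = fn = tn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, gt):
    """Return Dice, IoU, and pixel accuracy for one image."""
    tp, fp, fn, tn = confusion_counts(pred, gt)
    union = tp + fp + fn
    dice = 2 * tp / (2 * tp + fp + fn) if union else 1.0  # convention for empty masks
    iou = tp / union if union else 1.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"dice": dice, "iou": iou, "accuracy": accuracy}
```

For a single image the two overlap metrics are linked by Dice = 2 · IoU / (1 + IoU), so they rank predictions identically per image; dataset-level averages of the two can still differ in how strongly they penalize poorly segmented cases. Pixel accuracy counts true negatives, which is why it stays high on images dominated by background.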

Comparison with Existing Methods
To demonstrate the behavior and effectiveness of PWD-Net, a comparison is conducted with five representative polyp segmentation methods: CBSA (Channel-Boosted Spatial Attention network)³⁴, FSSA (Feature-Shared Spatial Attention network), MSF (Multi-Scale Fusion network), Pinwheel-Conv (Pinwheel Convolution baseline without attention or fusion modules), and PolaLinear (Polarized Linear attention network). All comparison methods are reimplemented using their officially released source code and trained on the same Kvasir-SEG training set (800 images) under identical preprocessing, input resolution (352 x 352), and evaluation settings to ensure a fair comparison. Table 3 presents the quantitative results on the test set.

As shown in Table 3, PWD-Net achieves a Dice coefficient of 0.865 and an IoU of 0.765, representing improvements of 1.8% in Dice and 4.8% in IoU compared with the next-best method (CBSA). Notably, PWD-Net achieves this with 9.1M parameters, compared with 18.4M for CBSA, indicating favorable efficiency. While PolaLinear and Pinwheel-Conv offer faster inference speeds (79 and 72 FPS, respectively), their segmentation accuracy is noticeably lower, suggesting that PWD-Net provides a reasonable balance between accuracy and computational cost for the evaluated dataset. To illustrate the qualitative segmentation behavior, five representative test samples covering small polyps, large polyps, complex backgrounds, and blurred boundaries are selected for visual comparison. Figure 5 presents the segmentation results of three selected comparison methods (CBSA, FSSA, and MSF) and of PWD-Net alongside the ground truth. Each prediction column is labeled with the corresponding method name. Pinwheel-Conv and PolaLinear are omitted from this figure for visual clarity, as their quantitative performance is substantially lower; this figure therefore represents a selected subset of the methods compared in Table 3.

As shown in Figure 5, in small-polyp scenarios (first and fifth rows), FSSA and MSF exhibit missed detections, whereas PWD-Net captures the targets more completely. In large-polyp scenarios (second and third rows), CBSA and FSSA produce noticeable boundary irregularities, while PWD-Net generates smoother boundaries. In the blurred-boundary scenario (fourth row), PWD-Net demonstrates effective suppression of background noise via the dual-attention mechanism.

Ablation Study
To analyze the contribution of each core component in PWD-Net, a systematic ablation study is conducted. Using ResNet-50 as the backbone encoder to form the baseline model, the Pinwheel Convolution Module (Pinwheel), Dual-Attention Mechanism (Dual-Attn), and Multi-Scale Feature Fusion (MSF) module are incrementally incorporated. Table 4 summarizes the quantitative results.

The key findings from Table 4 can be summarized as follows. First, adding any single module improves the performance of the baseline model. The Dual-Attention Mechanism brings the most notable gains (Dice: +2.0%, IoU: +2.7%), supporting the effectiveness of adaptive noise suppression. The Pinwheel Convolution Module contributes a 1.6% improvement in Dice, indicating the benefit of multi-directional feature extraction for irregular polyp shapes. Second, combining the Pinwheel Convolution and Dual-Attention Mechanism further increases performance to Dice = 0.858 and IoU = 0.748, suggesting complementarity between the two modules. Finally, the complete PWD-Net (integrating all three modules) achieves the best observed performance (Dice = 0.865, IoU = 0.765), with improvements of 3.3% and 6.0%, respectively, compared with the baseline, demonstrating the contribution of each proposed component on this dataset.

Training Process Analysis
To illustrate the training dynamics and convergence characteristics of PWD-Net, key performance metrics are recorded and visualized over 50 training epochs. Figure 6 shows the variations of the loss function, Dice coefficient, IoU, and accuracy during training.

As shown in Figure 6(a), both the training loss and validation loss decrease rapidly within the first 10 epochs and then gradually stabilize. The validation loss remains slightly higher than the training loss throughout, but the two curves follow a consistent trend with a small gap, indicating that the model does not suffer from severe overfitting. Figure 6(b) shows that the Dice coefficient rises sharply in the early training stage, converges after approximately the 30th epoch, and stabilizes above 0.86. The IoU curve in Figure 6(c) exhibits a similar growth trend, reaching around 0.765 in the late training phase. Figure 6(d) indicates that accuracy converges above 94%. The stable validation trends in the middle and late training stages suggest that the adopted data augmentation strategy and cosine annealing schedule contribute to mitigating overfitting on this dataset.

Performance across Polyp Sizes
To further evaluate the applicability of PWD-Net across different clinical scenarios, the test set (100 images) is divided into three categories according to the ratio of polyp area to the total image area: small polyps (< 5%), medium polyps (5%–30%), and large polyps (> 30%). This classification reflects the influence of the polyp scale on segmentation difficulty. Table 5 presents the quantitative performance on each category. As shown in Table 5, PWD-Net achieves the best performance in the medium-polyp category (Dice = 0.882, IoU = 0.790), which is consistent with the larger representation of this category (54 out of 100 test images). Performance on large polyps remains at a comparable level (Dice = 0.861, IoU = 0.760). Performance on small polyps is relatively lower (Dice = 0.812, IoU = 0.685), primarily because small targets occupy a small proportion of the image and are more susceptible to background noise with sparser boundary information.
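The size stratification described above can be sketched as follows, assuming each ground-truth mask is a binary nested list. The function names and the handling of values lying exactly on the 5% and 30% boundaries (both assigned to the medium category here) are assumptions, since the text does not specify them.

```python
def polyp_area_ratio(mask):
    """Fraction of image pixels labeled as polyp in a binary mask (nested lists)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask must contain at least one pixel")
    foreground = sum(1 for row in mask for v in row if v)
    return foreground / total


def size_category(ratio):
    """Map an area ratio to the three size groups used in Table 5.

    Assumption: ratios of exactly 0.05 or 0.30 are counted as medium.
    """
    if ratio < 0.05:
        return "small"
    if ratio <= 0.30:
        return "medium"
    return "large"


def make_mask(n_foreground, side=10):
    """Hypothetical helper: a side x side mask with the first n pixels set to 1."""
    flat = [1] * n_foreground + [0] * (side * side - n_foreground)
    return [flat[r * side:(r + 1) * side] for r in range(side)]


for n in (4, 20, 40):
    ratio = polyp_area_ratio(make_mask(n))
    print(f"area ratio {ratio:.2f} -> {size_category(ratio)}")
```

Per-category Dice and IoU can then be obtained by grouping per-image scores by the returned label before averaging.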

These results suggest that the multi-directional feature capture capability of the Pinwheel Convolution Module and the spatial localization ability of the Dual-Attention Mechanism contribute to maintaining reasonable segmentation quality across different polyp scales on the evaluated test set.

Figure 1: Framework of the PWD-Net Model. Overall structural framework of the proposed Polyp Segmentation Network based on Pinwheel Convolution and Dual Attention (PWD-Net), illustrating the encoder (ResNet-50), bottleneck (PCM), DAM-enhanced skip connections, MSF decoder, and output generation for colorectal polyp segmentation.

Figure 2: Overall Architecture Flowchart of PWD-Net. Detailed flowchart of the full PWD-Net architecture, showing the five-stage ResNet-50 encoder, PCM bottleneck, DAM skip connections, multi-scale feature fusion decoder, and final prediction generation.

Figure 3: Schematic Diagram of the Pinwheel Convolution Module. Structural and operational schematic of the Pinwheel Convolution Module, demonstrating multi-angle rotated convolution kernels, bilinear interpolation-based rotation, channel concatenation, and 1 × 1 convolution aggregation.

Figure 4: Structure Diagram of the Dual-Attention Mechanism. Architectural diagram of the DAM, showing the parallel channel attention branch (Global Average Pooling → MLP with reduction ratio r = 16 → Sigmoid) and spatial attention branch (channel-wise pooling → 7 × 7 convolution → Sigmoid), followed by weighted fusion with learnable coefficients α and β.

Figure 5: Qualitative comparison of segmentation results. Each row represents a test sample. Columns from left to right: Input image, Ground Truth, CBSA, FSSA, MSF, and PWD-Net (Ours). Pinwheel-Conv and PolaLinear are omitted from this figure for visual clarity; see Table 3 for the complete quantitative comparison.

Figure 6: Training curves of PWD-Net over 50 epochs. (a) Training and validation loss. (b) Dice coefficient. (c) Intersection over Union (IoU). (d) Pixel-level accuracy.

Data Subset	Number of Samples	Proportion
Training Set	800	80%
Validation Set	100	10%
Test Set	100	10%
Total	1000	100%

Table 1: Dataset Statistics. Dataset split distribution for the Kvasir-SEG dataset (1,000 images total), showing the number of images and proportion assigned to the training, validation, and test subsets (random seed = 42).
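A reproducible 800/100/100 split of this kind can be sketched as follows. This is an illustration only: the function name `split_indices` and the use of Python's `random.Random` are assumptions, and the supplementary train.py may shuffle with a different generator, so the exact image assignment will not necessarily match the authors' split even with seed 42.

```python
import random


def split_indices(n_items=1000, n_train=800, n_val=100, seed=42):
    """Shuffle image indices with a fixed seed and cut them into three subsets."""
    if n_train + n_val > n_items:
        raise ValueError("training and validation sizes exceed the dataset size")
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices()
print(len(train_idx), len(val_idx), len(test_idx))
```

Saving the resulting index lists alongside the trained weights makes it possible to evaluate other methods on exactly the same test images.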

Category	Parameter Item	Parameter Setting
Deep Learning Framework	Framework	PyTorch
Hardware Environment	GPU	NVIDIA Tesla P100
Acceleration Method	GPU Acceleration	CUDA
Input Settings	Input Image Size	352 × 352
Image Format	Image Format	RGB Image
Optimizer	Optimizer	Adam
Initial Learning Rate	Initial LR	1 × 10⁻⁴
Batch Size	Batch Size	16
Training Epochs	Epochs	50
Loss Function	Loss Function	Dice Loss + BCE

Table 2: Experimental Parameter Settings. Experimental parameter settings for PWD-Net training and evaluation. Refer to the Data Preparation steps and Section 5.2 of the Protocol for the complete step-by-step implementation procedure.

Method	Dice ↑	IoU ↑	Accuracy ↑	Parameters (M) ↓	FPS ↑
CBSA	0.8466	0.717	0.9325	18.4	36
FSSA	0.7109	0.551	0.9012	9.8	61
MSF	0.7337	0.585	0.9086	11.5	54
Pinwheel-Conv	0.8007	0.6742	0.9401	7.9	72
PolaLinear	0.7213	0.5707	0.9113	6.6	79
PWD-Net (Ours)	0.865	0.7651	0.9478	9.1	63

Table 3: Quantitative Comparison Results. Quantitative comparison of PWD-Net with five existing polyp segmentation methods on the Kvasir-SEG test set (100 images). All methods are evaluated under identical data splits, preprocessing, and input resolution (352 × 352). ↑ indicates higher is better; ↓ indicates lower is better. An asterisk (*) denotes results cited from the original publication rather than reimplemented.

Configuration	Pinwheel	Dual-Attn	MSF	Dice ↑	IoU ↑
Baseline	×	×	×	0.832	0.705
+ Pinwheel	√	×	×	0.848	0.725
+ Dual-Attn	×	√	×	0.852	0.732
+ MSF	×	×	√	0.844	0.720
+ Pinwheel + Dual-Attn	√	√	×	0.858	0.748
Full (PWD-Net)	√	√	√	0.865	0.765

Table 4: Ablation Study Results. Ablation study results on the Kvasir-SEG test set, showing the incremental contribution of the Pinwheel Convolution Module (Pinwheel), Dual-Attention Mechanism (Dual-Attn), and Multi-Scale Feature Fusion (MSF) to the baseline ResNet-50 encoder.

Polyp Type	Number	Dice ↑	IoU ↑
Small Polyps (< 5%)	21	0.812	0.685
Medium Polyps (5%–30%)	54	0.882	0.790
Large Polyps (> 30%)	25	0.861	0.760

Table 5: Performance of PWD-Net on Different Polyp Types. Performance of PWD-Net on different polyp size categories within the Kvasir-SEG test set (100 images). Polyp size is defined by the ratio of polyp area to total image area.

Supplementary file: Compressed archive containing the implementation of the PWD-Net framework. The file includes model.py, which defines the network architecture with the Pinwheel Convolution Module (PCM) and Dual-Attention Mechanism (DAM); train.py, which implements the data loading pipeline, loss function, and training procedure; test.py, for model inference and evaluation on test datasets; and requirements.txt, which lists all required Python libraries and their corresponding versions.

Discussion


Several design choices in the PWD-Net protocol are critical for achieving reliable segmentation results and merit careful attention during implementation. First, the selection and initialization of the encoder backbone directly influence convergence behavior and final performance. The protocol employs a ResNet-50 encoder pretrained on ImageNet, which provides robust low-level and mid-level feature initialization. This is particularly important for medical image segmentation tasks where the available training data are limited (800 images in the present study). Fine-tuning all encoder layers, rather than freezing them, allows the network to adapt the pretrained features to the specific characteristics of colonoscopy images, such as mucosal textures and specular reflections. Second, the placement of each core module within the architecture is intentional. The Pinwheel Convolution Module (PCM) is positioned at the bottleneck, where spatial resolution is lowest but semantic information is richest, enabling efficient capture of global geometric patterns without excessive computational cost. The Dual-Attention Mechanism (DAM) is embedded in the skip connections rather than in the decoder, so that background noise is suppressed before features are transmitted to the decoder and contaminated features do not propagate through the fusion stages. The ablation study (Table 4) supports this design: the DAM contributes the largest individual performance gain (Dice: +2.0 percentage points), supporting the importance of early noise suppression in the feature pipeline. Third, the hybrid loss function (0.5 · BCE + 0.5 · Dice) balances pixel-level classification accuracy with region-level overlap optimization. This combination is particularly relevant for polyp segmentation, where foreground-background class imbalance is common. The equal weighting (λ = 0.5) is adopted as a default; adjusting this ratio may be necessary for datasets with different class distributions (see Modifications and Troubleshooting below).
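To make the weighting concrete, the following framework-free sketch evaluates a hybrid loss of the form λ · BCE + (1 − λ) · Dice loss on flattened probabilities. It is an illustration under stated assumptions rather than the code in train.py: the function name `hybrid_loss`, the soft (probability-based) Dice formulation, and the clamping constant `eps` are all hypothetical choices.

```python
import math


def hybrid_loss(probs, targets, bce_weight=0.5, eps=1e-7):
    """Return bce_weight * BCE + (1 - bce_weight) * (1 - soft Dice).

    probs:   predicted foreground probabilities in [0, 1] (flat list)
    targets: ground-truth labels, 0 or 1 (flat list of the same length)
    """
    if not probs or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equally long")

    # Clamp probabilities so that log() never receives 0.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1.0 - p)
        for p, t in zip(clipped, targets)
    ) / len(probs)

    # Soft Dice uses probabilities directly instead of thresholded masks.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)

    return bce_weight * bce + (1.0 - bce_weight) * (1.0 - dice)


labels = [1, 0, 1, 0]
print(hybrid_loss([0.9, 0.1, 0.8, 0.2], labels))                  # default 0.5 / 0.5
print(hybrid_loss([0.9, 0.1, 0.8, 0.2], labels, bce_weight=0.4))  # 0.4 BCE / 0.6 Dice
```

Exposing the weight as a single parameter keeps the two terms summing to one, so changing the BCE/Dice balance does not also change the overall loss scale.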

Modifications and Troubleshooting
The following modifications and troubleshooting guidelines are provided for adapting the protocol to different experimental settings. When applying the protocol to datasets with different image resolutions or polyp size distributions, the input resolution (352 × 352) may need adjustment. Larger input sizes may improve small-polyp detection at the cost of increased memory consumption and reduced inference speed. If training loss does not converge within 50 epochs, consider reducing the initial learning rate (e.g., to 5 × 10⁻⁵) or increasing the cosine annealing cycle length. If the model exhibits high false-positive rates in regions with severe specular reflections or mucosal folds, increasing the weight of the Dice loss component (e.g., λ = 0.4 for BCE, 0.6 for Dice) may improve boundary precision at the expense of pixel-level accuracy. Conversely, if the model under-segments small polyps, increasing the BCE weight may help. The number of rotation angles in the PCM (currently eight, from 0° to 315° in 45° increments) represents a balance between directional coverage and computational cost. Reducing to four angles (0°, 90°, 180°, 270°) decreases computation but may reduce sensitivity to oblique polyp boundaries. The reduction ratio r = 16 in the channel attention branch of the DAM follows the convention established by prior squeeze-and-excitation networks³²; smaller ratios (e.g., r = 8) increase model capacity but may lead to overfitting on small datasets. For datasets significantly larger than Kvasir-SEG, consider increasing the batch size and training epochs accordingly, and monitor validation metrics to determine the appropriate stopping point.
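The effect of changing the initial learning rate or the cosine annealing cycle length can be previewed with the standard closed-form cosine schedule. The sketch below assumes a single half-cosine cycle with no warm restarts; the cycle length of 50 epochs and the minimum learning rate of 0 are assumptions, since this section does not state them.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-4, t_max=50, eta_min=0.0):
    """Learning rate at a given epoch under single-cycle cosine annealing.

    base_lr matches the initial learning rate in Table 2; t_max and eta_min
    are assumed values, not settings confirmed by the protocol text.
    """
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Compare the default schedule with the two troubleshooting options.
for epoch in (0, 10, 25, 40, 50):
    default = cosine_annealing_lr(epoch)
    lower_lr = cosine_annealing_lr(epoch, base_lr=5e-5)
    longer_cycle = cosine_annealing_lr(epoch, t_max=100)
    print(f"epoch {epoch:>2}: {default:.2e}  {lower_lr:.2e}  {longer_cycle:.2e}")
```

With a longer cycle the learning rate is still at half its initial value at epoch 50, which keeps updates larger for longer; lowering the initial rate scales the whole curve down instead.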

Significance Relative to Alternative Methods
The PWD-Net architecture addresses specific limitations of existing approaches through three complementary modules. Compared with methods relying on standard square convolution kernels, the PCM provides directional sensitivity through multi-angle rotated kernels, enabling better adaptation to the irregular and diverse morphology of colorectal polyps. Compared with single-dimension attention mechanisms (e.g., channel-only attention in squeeze-and-excitation networks³³), the DAM jointly models channel and spatial importance, offering more comprehensive noise suppression in the complex colonoscopy environment. Compared with Transformer-based architectures such as TransUNet³⁴ and Polyp-PVT³⁵, which offer strong global modeling but at higher computational cost, PWD-Net achieves competitive performance with a relatively compact model size (9.1M parameters) and practical inference speed (63 FPS), as documented in Table 3.

It should be noted that the comparisons presented in this study (Table 3) are conducted under controlled conditions with identical data splits, preprocessing, and evaluation protocols. The performance differences observed are specific to the Kvasir-SEG test set (100 images) used in this study and may not directly generalize to other datasets or clinical settings. A broader comparison incorporating additional established baselines (e.g., PraNet³⁶, ResUNet++³⁷) under standardized multi-dataset benchmarks would further strengthen the evidence and is planned for future work. Recent work on dual encoder-decoder architectures for polyp segmentation³⁸ has demonstrated the potential of parallel encoding and decoding paths. The PWD-Net architecture differs by focusing on rotational geometric modeling and dual-attention filtering within a single encoder-decoder pipeline, representing a complementary design philosophy.

Several important limitations of this study should be acknowledged. First, regarding experimental scope, the current study reports results exclusively on the Kvasir-SEG dataset with a single random split of 800 training, 100 validation, and 100 test images. The test set size (100 images) is relatively small, and only a single training run is reported without repeated experiments or cross-validation. Consequently, the reported performance metrics may be subject to variance related to the specific data split. Future work should incorporate k-fold cross-validation or multiple random splits with reported standard deviations to provide more robust performance estimates. Second, the PCM introduces additional computational overhead through multi-angle kernel rotation and aggregation. Although the overall model remains compact (9.1M parameters), deployment on resource-constrained devices in clinical environments may require further optimization through techniques such as knowledge distillation or model pruning. Third, the model is trained and evaluated exclusively on static images, whereas clinical colonoscopy involves real-time video streams in which polyp appearance, size, and viewpoint change dynamically across consecutive frames. Although the inference speed of 63 FPS is compatible with real-time frame rates, this metric alone does not constitute clinical validation. Prospective validation on endoscopic video data, reader studies, and downstream clinical endpoint analyses would be necessary before any claims of clinical readiness can be made³⁹,⁴⁰,⁴¹. The current work should be understood as a methodological contribution rather than a clinically validated system.

Fourth, the clinical translation pathway for AI-assisted polyp segmentation extends well beyond segmentation accuracy. Recent reviews have highlighted that advanced imaging and analysis tools must be integrated within broader endoluminal workflows, including lesion classification, staging, and treatment planning. The current protocol focuses exclusively on binary polyp segmentation and does not address pathological classification⁴² (e.g., adenomatous vs. hyperplastic polyps) or malignancy risk assessment, which are essential for guiding clinical decisions. Fifth, the datasets used in this study are derived primarily from adult colonoscopy examinations. Data on pediatric polyps, polyps associated with inflammatory bowel disease, and other special pathological types are not represented. The generalizability of the model to these populations remains untested. Sixth, while ablation experiments and qualitative visualizations are provided to illustrate the function of each module, the interpretability of the model remains limited. The decision-making process of deep learning models is not fully transparent, which may affect clinician trust and adoption. Future work could incorporate gradient-based visualization techniques to provide more intuitive explanations of model predictions⁴³.

Despite the limitations noted above, the PWD-Net protocol provides a reproducible framework for polyp segmentation that may serve as a foundation for further development. Potential directions include: extending the model to video-based colonoscopy analysis by incorporating temporal modeling techniques; adding a classification branch for end-to-end segmentation and pathological typing; expanding evaluation to larger and more diverse multi-center datasets; and exploring integration within endoluminal robotic platforms, where AI-assisted image analysis is increasingly recognized as a key enabling technology⁴⁴,⁴⁵. The supplementary code package provided with this protocol is intended to facilitate reproduction and adaptation of the method by other research groups.

Disclosures


The authors have nothing to disclose.

Acknowledgements


This study was funded by the National Key R&D Program of China (Program Nos. 2022YFC3500200 and 2022YFC3500204).

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
Adam Optimizer	—	—	Included in PyTorch
Albumentations	Albumentations Team	v1.0+	Data augmentation library
CUDA Toolkit	NVIDIA	v11.3+	GPU acceleration
Kvasir-SEG dataset	SimulaMet	—	https://datasets.simula.no/kvasir-seg/
Matplotlib	Matplotlib Community	v3.4+	Visualization of training curves
NumPy	NumPy Community	v1.21+	Numerical computation
NVIDIA Tesla P100	NVIDIA	P100-PCIE-16GB	GPU for training and inference
OpenCV	OpenCV Community	v4.5+	Image preprocessing
Python	Python Software Foundation	v3.8+	Programming language
PyTorch	Meta Platforms	v1.12+	Deep learning framework
ResNet-50 pretrained weights	PyTorch Model Zoo	—	ImageNet-1K pretrained
Ubuntu	Canonical	18.04+	Operating system
