1. Dataset characteristics
The study used the publicly available Breast Ultrasound Images (BUSI) dataset, a well-documented repository for the analysis and classification of breast cancer [29]. The dataset was appropriate because it supplied the labeled ultrasound images needed to train and validate deep learning models for medical imaging. It comprised 780 ultrasound images collected from 600 women aged 25–75 years, covering a broad age range for breast cancer diagnosis. The images had an average resolution of 500 x 500 pixels and were stored in PNG format, which preserved image quality for analysis. Figure 2 illustrates the overall workflow of the proposed framework, including preprocessing, augmentation, EfficientNet-B0 feature extraction, classification, and Grad-CAM interpretability.

Figure 1: Representative samples from the BUSI dataset. (A) Benign, (B) Malignant, and (C) Normal images. Panels illustrate morphological diversity used for training.
2. Experimental workflow
To ensure reproducibility, the processing pipeline ran in sequential stages. First, the BUSI ultrasound images and their ground-truth masks were loaded, resized to 224 x 224 pixels, and normalized with ImageNet statistics. Second, each image was overlaid with its mask to emphasize lesion regions before the dataset was split into training (70%), validation (15%), and testing (15%) subsets using stratified sampling. Third, targeted augmentation consisting of horizontal flipping, rotation (±15°), and color jittering was applied, primarily to the minority classes. Fourth, the EfficientNet-B0 model, pretrained on ImageNet, was fine-tuned after its final classification layer was replaced with a three-class output layer. Training used the Adam optimizer (learning rate 0.00005), step learning-rate decay (γ = 0.1 every 7 epochs), cross-entropy loss, a batch size of 8, and early stopping with a patience of 2. Finally, Grad-CAM generated attention heatmaps from the last convolutional layer to visualize diagnostically relevant regions and support clinical interpretation. The Table of Materials lists the datasets, software libraries, and hardware configurations required to reproduce the breast ultrasound classification and interpretability framework.
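The 70/15/15 stratified split described above can be sketched in plain Python. This is a minimal illustration rather than the study's code: the function name `stratified_split`, the seed, the rounding rule, and the toy label counts are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that keep per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within each class only
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels (hypothetical counts, not the BUSI distribution).
labels = ["benign"] * 40 + ["malignant"] * 20 + ["normal"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

Because each class is shuffled and cut separately, every subset keeps roughly the same class ratio as the full dataset, and no index can appear in more than one subset.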
The BUSI dataset exhibited a natural class imbalance, a characteristic frequently observed in medical datasets. The distribution favored the benign class, followed by malignant and normal cases: the dataset contained 437 benign, 210 malignant, and 133 normal images, which together account for the 780 images. While data augmentation increased training variability, it did not alter the underlying sample count in each class. This distribution reflected real-world clinical scenarios, where benign conditions occur more frequently than malignant tumors. Although the imbalance accurately portrayed the prevalence of breast abnormalities in clinical environments, it presented challenges for deep learning training, as it could lead to biased predictions and suboptimal performance on minority classes. Mitigating this imbalance was therefore essential to obtain an effective and generalizable model that performed well across every diagnostic category. Figure 1 shows representative ultrasound images from the dataset, illustrating examples of benign, malignant, and normal breast tissue samples.

Figure 2: Proposed model workflow. Sequential pipeline from input and mask-overlay preprocessing to EfficientNet-B0 training and Grad-CAM output.
3. Data Preprocessing
The study resized all images to 224 x 224 pixels, adhering to the standard input dimensions for the EfficientNet-B0 architecture. This resizing ensured dataset consistency and reduced computational requirements, as defined by equation 1.
The framework normalized pixel values using the mean and standard deviation of the ImageNet dataset ([0.485, 0.456, 0.406] for mean and [0.229, 0.224, 0.225] for std). This normalization scaled the input data to enhance training stability and convergence speed, calculated according to equation 2.
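As an illustration of the normalization step, the sketch below standardizes 8-bit RGB pixels with the ImageNet statistics quoted above. It assumes pixel values are first scaled to [0, 1], which is the usual convention for these constants but is not stated explicitly in the text; the function names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

def normalize_image(image):
    """Apply normalize_pixel to every pixel of an image given as rows of RGB tuples."""
    return [[normalize_pixel(pixel) for pixel in row] for row in image]
```

For example, a black pixel `(0, 0, 0)` maps to roughly `-2.12` in the red channel, and a white pixel `(255, 255, 255)` maps to `2.64` in the blue channel.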
To improve feature extraction and the identification of regions of interest, the researchers superimposed the ground-truth masks onto their corresponding original images. Figure 3 illustrates the ground-truth lesion masks and the resulting overlaid ultrasound images that highlighted tumor boundaries.

Figure 3: Mask and overlaid images. (A) Original ultrasound, (B) ground-truth mask, and (C) resulting overlay highlighting tumor boundaries for spatial attention.
This overlay identified critical regions, such as tumor margins and texture patterns, providing the model with localized spatial context to improve classification accuracy, as shown in equation 3.
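The exact form of the overlay is not reproduced in this section, so the following is only one plausible reading: an alpha blend that brightens pixels inside the binary mask and leaves the rest unchanged. The blend formula, the `alpha` value, and the function name `overlay_mask` are assumptions for illustration.

```python
def overlay_mask(image, mask, alpha=0.3):
    """Blend a binary mask into a grayscale image.

    Assumed rule: out = (1 - alpha * m) * I + alpha * m * 255, with m in {0, 1}.
    """
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [
            (1.0 - alpha * m) * pixel + alpha * m * 255.0
            for pixel, m in zip(image_row, mask_row)
        ]
        for image_row, mask_row in zip(image, mask)
    ]
```

With `alpha=0.5`, a pixel of intensity 100 stays at 100 outside the mask and becomes 177.5 inside it.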
The study utilized the mask-overlay operation exclusively during the preparation of training data to supplement lesion boundary knowledge during feature learning. In contrast, the validation and inference phases employed only the original ultrasound images without masks, maintaining a conventional image-classification pipeline. Because the binary segmentation masks in the BUSI dataset represented structures already inherent in the ultrasound images, they did not introduce new class labels. To prevent information leakage, the investigation partitioned the dataset into training, validation, and testing sets, ensuring that no overlapping pairs of images or masks were shared across subsets. Consequently, the overlay functioned as a spatial attention guide that highlighted anatomically significant areas without compromising the independence of the evaluation data.
4. Data Augmentation
The study implemented targeted data augmentation to mitigate class imbalance within the training set, focusing specifically on the malignant and normal minority classes. The framework utilized a broad range of transformations to artificially increase the variability of the training data and improve the model's capacity to generalize across diverse imaging conditions.
First, the system performed horizontal flipping with a probability of 0.9 (90%), allowing the model to recognize features regardless of scanning direction. This transformation followed equation 4:
Second, the investigation applied random rotations within a range of ±15° to simulate variations in scanning angles and patient positioning. This process forced the model to learn rotation-invariant features, such as tumor margins and texture patterns, as defined by equation 5:
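A standard way to write such a rotation, given here as a hedged reconstruction rather than the article's own equation, maps each pixel coordinate (x, y) to (x', y') about a center (c_x, c_y). Rotation about the image center and uniform sampling of the angle are assumptions.

```latex
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x - c_x \\ y - c_y \end{pmatrix}
+
\begin{pmatrix} c_x \\ c_y \end{pmatrix},
\qquad \theta \sim \mathcal{U}(-15^{\circ},\, 15^{\circ})
```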
Third, the researchers utilized color jittering to simulate variations in lighting, contrast, and color balance. The framework adjusted brightness, contrast, and saturation by a factor of 0.2, and hue by 0.1. These stochastic adjustments improved the model's robustness to appearance changes, calculated according to equation 6:
The selection of these specific hyperparameter ranges (horizontal flip probability 0.9; rotation ±15°; brightness, contrast, and saturation 0.2; hue 0.1) resulted from an empirical stability assessment. The investigation determined that these settings balanced diversity generation against anatomical realism, yielding steady convergence and stable validation accuracy without inducing unrealistic lesion deformations.
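The augmentation settings above can be sketched as a parameter sampler plus two simple transforms. This is a stdlib-only illustration: interpreting a jitter factor of 0.2 as a multiplier drawn uniformly from [0.8, 1.2] and a hue value of 0.1 as a shift drawn from [-0.1, 0.1] is an assumption, as are all function names.

```python
import random

def sample_augmentation(rng, flip_prob=0.9, max_angle=15.0, jitter=0.2, hue=0.1):
    """Draw one random set of augmentation parameters within the stated ranges."""
    return {
        "flip": rng.random() < flip_prob,
        "angle_deg": rng.uniform(-max_angle, max_angle),
        "brightness": rng.uniform(1.0 - jitter, 1.0 + jitter),
        "contrast": rng.uniform(1.0 - jitter, 1.0 + jitter),
        "saturation": rng.uniform(1.0 - jitter, 1.0 + jitter),
        "hue_shift": rng.uniform(-hue, hue),
    }

def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def adjust_brightness(image, factor):
    """Multiply 8-bit grayscale intensities by factor and clip to [0, 255]."""
    return [[min(255.0, max(0.0, pixel * factor)) for pixel in row] for row in image]

rng = random.Random(42)  # fixed seed so the draw is repeatable
params = sample_augmentation(rng)
```

Drawing fresh parameters for every training sample is what makes the adjustments stochastic; the clipping in `adjust_brightness` keeps intensities in the valid 8-bit range.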
5. Model Architecture
The study utilized EfficientNet-B0, a convolutional neural network (CNN) known for its efficiency, scalability, and performance relative to other image classification architectures. The model scaled network depth, width, and resolution jointly, which allowed high accuracy with significantly fewer parameters than standard architectures such as ResNet and VGG. This scaling approach enabled the framework to handle high-resolution images while remaining computationally efficient, which suited the needs of medical imaging. Figure 4 illustrates the architecture of EfficientNet-B0, highlighting the sequence of convolutional and MBConv blocks used for hierarchical feature extraction.

Figure 4: EfficientNet-B0 architecture. Schematic of MBConv blocks and compound scaling factors used for hierarchical feature extraction.
To adapt EfficientNet-B0 to the breast ultrasound classification task, the researchers introduced specific adjustments to the model structure. The original classification layer, intended for the 1,000-class ImageNet dataset, was replaced with a fully connected (dense) layer consisting of 3 output neurons. These neurons corresponded to the three target classes within the BUSI dataset: benign, normal, and malignant. The model applied a SoftMax activation to the output layer, transforming raw scores into a probability distribution, as calculated in equation 7. All layers of the pretrained EfficientNet-B0 backbone were set as trainable and optimized jointly with the new classification layer, following a full fine-tuning strategy; the training process did not apply staged freezing or gradual unfreezing. Training used the Adam optimizer with a learning rate of 0.00005 and a step learning-rate scheduler that reduced the rate by a factor of 0.1 every 7 epochs. Table 2 describes the model architecture.
| Layer Type | Description |
| --- | --- |
| Input Layer | Accepts images of size 224 x 224 x 3 |
| Convolutional Layers | Includes initial convolution and MBConv blocks for feature extraction. |
| Global Average Pooling | Reduces spatial dimensions to a 1D feature vector. |
| Fully Connected Layer | Maps the feature vector to 3 output neurons (normal, benign, malignant). |
| SoftMax Activation | Converts logits into probabilities for multi-class classification. |
Table 2: Description of the model architecture.
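The SoftMax step that turns the three output scores into class probabilities can be sketched as follows. The class ordering in `CLASSES` and the helper name `predict` are assumptions for illustration; the max-subtraction is a standard numerical-stability trick and does not change the result.

```python
import math

CLASSES = ("benign", "malignant", "normal")  # assumed output ordering

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the most probable class name and the full probability list."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs
```

The probabilities always sum to 1, so the predicted class is simply the index of the largest logit.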
The model was trained with the Adam optimizer, an initial learning rate of 0.00005, and categorical cross-entropy loss over the three classes, calculated as per equation 9. EfficientNet-B0 was initialized with ImageNet-pretrained weights and fully fine-tuned: all backbone layers were trained, and no staged freezing was applied. The original classifier was replaced with a fully connected layer of three output neurons followed by SoftMax activation, representing the benign, malignant, and normal classes. The images were resized to 224 x 224 pixels, normalized with the ImageNet mean and standard deviation, and loaded in mini-batches of 8. A step learning-rate scheduler multiplied the learning rate by 0.1 every 7 epochs. Early stopping on validation loss, with a patience of 2 epochs, was used to avoid overfitting. Grad-CAM activations were produced from the last convolutional layer by calculating gradients, applying global average pooling to obtain channel weights, and upscaling the activation maps to the input resolution.
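For a single sample, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. The sketch below shows that calculation and its mini-batch average; the `eps` clamp and the function names are illustrative assumptions, not details taken from the study.

```python
import math

def cross_entropy(probabilities, target_index, eps=1e-12):
    """Negative log-probability assigned to the true class of one sample."""
    return -math.log(max(probabilities[target_index], eps))

def mean_cross_entropy(batch_probabilities, batch_targets):
    """Average the per-sample loss over a mini-batch."""
    losses = [cross_entropy(p, t) for p, t in zip(batch_probabilities, batch_targets)]
    return sum(losses) / len(losses)
```

A confident correct prediction (probability near 1) gives a loss near 0, while a probability of 0.5 on the true class gives a loss of ln 2, about 0.693.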
To confirm stability, training was repeated over several runs, which produced matching validation performance. Training and validation loss curves were monitored to identify overfitting, and final evaluation used a stratified hold-out test set that was independent of training. The small difference between validation and test performance indicates consistent convergence and reproducible optimization. Equation 10 defines the stepwise learning-rate schedule. Equation 15 expresses the early-stopping condition: two consecutive increases in validation loss, indicating possible overfitting.
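The step schedule and the early-stopping rule can be sketched together. Two conventions here are assumptions: epochs are 0-indexed, and an epoch counts against the patience whenever validation loss fails to improve on the best value seen so far, which is a common reading of a patience of 2.

```python
def step_lr(epoch, base_lr=5e-5, gamma=0.1, step_size=7):
    """Learning rate at a 0-indexed epoch: base_lr * gamma ** floor(epoch / step_size)."""
    return base_lr * gamma ** (epoch // step_size)

def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-indexed epoch at which training stops, or None if it never stops.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

Under these conventions the learning rate is 0.00005 for epochs 0 to 6, 0.000005 for epochs 7 to 13, and so on, and a loss sequence such as 0.9, 0.7, 0.6, 0.65, 0.62 stops at the fifth epoch (index 4).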
EfficientNet-B0 is well suited to breast ultrasound classification because its architecture combines several features that support efficiency and speed. Its central building block is the MBConv block, which unites depthwise separable convolutions with squeeze-and-excitation modules to lower computational complexity without loss of accuracy. The model also supports compound scaling, which scales depth, width, and resolution uniformly to achieve good performance under different resource constraints, as expressed in equation 8. Because the model starts from weights trained on the ImageNet dataset, it can exploit transfer learning, which greatly reduces the need to train from scratch on small medical datasets. These architectural features, together with the customized training setup (the modified final classification layer, the Adam optimizer, the learning-rate scheduler, and early stopping), allow the model to be accurate while remaining computationally efficient. This combination of architecture and training strategy enables the model to learn effectively from limited medical imaging data, making it a useful tool for breast ultrasound classification.
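For reference, compound scaling as formulated in the original EfficientNet work can be written as below, where φ is a user-chosen compound coefficient and α, β, γ are constants found by a small grid search. This is offered as a hedged restatement of the standard formulation; it may differ in notation from the article's equation 8.

```latex
\begin{aligned}
\text{depth: } & d = \alpha^{\phi}, \\
\text{width: } & w = \beta^{\phi}, \\
\text{resolution: } & r = \gamma^{\phi}, \\
\text{subject to } & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1 .
\end{aligned}
```

The constraint means that each unit increase in φ roughly doubles the computational cost, since cost grows linearly with depth and quadratically with width and resolution.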
Algorithm 1: Breast Ultrasound Image Classification Using EfficientNet-B0 and Grad-CAM
Input: Breast ultrasound images.
Output: Classification (normal, benign, malignant) and Grad-CAM heatmaps.
1. Preprocess Images:
Resize to 224×224.
Normalize using ImageNet mean and std.
Augment with flipping, rotation, and color jittering.
2. Initialize Model:
Load EfficientNet-B0 (pre-trained, exclude top layer).
Add Global Average Pooling and a dense layer (3 neurons, SoftMax).
3. Train Model:
Optimizer: Adam (learning rate = 0.00005).
Loss: Cross-entropy.
Train for 18 epochs with early stopping (patience = 2).
4. Grad-CAM Visualization:
Compute gradients of the target class.
Generate and overlay heatmaps on input images.
Algorithm 1 summarizes the workflow for classifying breast ultrasound images with EfficientNet-B0 and Grad-CAM, covering preprocessing, model training, inference, and the interpretable heatmaps used for clinical validation.
6. Grad-CAM as Explainable AI
The study used Gradient-weighted Class Activation Mapping (Grad-CAM) to add interpretability to the deep learning model. In response to the pressing need for explainable AI (XAI) in clinical imaging, Grad-CAM lets clinicians intuitively follow the model's decision-making through heatmaps that highlight the image regions most important to a prediction. This interpretability is critical to building trust in AI systems, particularly in high-stakes domains such as healthcare, where accountability and transparency are essential. Figure 5 presents Grad-CAM heatmaps highlighting the image regions that most influenced the model's predictions for benign, malignant, and normal cases.

Figure 5: Grad-CAM activation heatmaps. (A) Benign, (B) Malignant, and (C) Normal cases. Warm colors (red) signify high diagnostic influence; cool colors (blue) indicate lower attention.
Each Grad-CAM image is shown alongside the input image that was fed to the model. Figure 5 presents the original ultrasound image and the model-input image so that the highlighted regions can be related directly to the processed input used for the prediction. To obtain channel-wise importance, Grad-CAM computes the gradients of the predicted class score with respect to the feature maps of the final convolutional layer. These gradients are globally averaged to give the weighting coefficients, which are then multiplied with the respective feature maps to produce a localization heatmap (equations 16 and 17). The heatmap is upsampled to the input resolution and superimposed on the ultrasound image to indicate areas of diagnostic interest, such as tumor margins and texture changes. This visualization shows which spatial regions the model relied on for its prediction. The study assessed interpretability through visual analysis of attention localization. It did not include quantitative localization measures or clinician-based assessment, which remain valuable avenues for future validation of clinical relevance.
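The two Grad-CAM steps described above (global averaging of gradients into channel weights, then a weighted sum of feature maps) can be sketched on small nested lists. The ReLU follows the original Grad-CAM formulation; the final scaling to [0, 1] is an assumption added for display, and the upsampling step is omitted.

```python
def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM map from K feature maps and their gradients.

    feature_maps, gradients: lists of K two-dimensional lists with equal shapes.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weights: global average of each gradient map.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale to [0, 1] for display; an all-zero map is left unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return weights, cam
```

Locations where positively weighted channels are strongly active end up near 1, while locations dominated by negatively weighted channels are clipped to 0 by the ReLU.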
Grad-CAM provides a visual explanation of a model's prediction by highlighting the image areas that contributed to it. The heatmaps can be used to confirm that the network concentrates on clinically relevant structures, including tumor margins and acoustic patterns. For false alarms, the visualizations can reveal attention on irrelevant areas, which assists model assessment and optimization. Such correspondence between medical features and the model's attention enhances the interpretability of AI-assisted diagnosis. Figure 6 shows a Grad-CAM visualization.

Figure 6: Grad-CAM visualization. Grad-CAM heatmap showing important regions (red = high, blue = low) for classification; color intensity indicates feature importance.
Grad-CAM increases the interpretability of the EfficientNet-B0 model by highlighting the image regions that contribute most to its classification decisions. The resulting heatmaps indicate clinically significant structures, such as tumor margins and texture changes, which supports transparent analysis of model predictions. Together, classification accuracy and visual interpretability strengthen the reliability of the proposed framework for analyzing breast ultrasound images.
7. Statistical Analysis
The model was evaluated with several metrics: precision, which measures the accuracy of positive predictions; recall, which measures how well the model identifies all relevant cases; and the F1 score, the harmonic mean of precision and recall, which provides a balanced measure of overall performance. These metrics were calculated as per equations 18–21.
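The standard definitions of these metrics can be sketched from per-class counts of true positives (TP), false positives (FP), and false negatives (FN). Returning 0.0 when a denominator is zero is an assumption made to keep the example well defined.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For a class with 8 true positives, 2 false positives, and 2 false negatives, precision, recall, and F1 all equal 0.8.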
Mean Absolute Error (MAE) quantifies the average absolute difference between predictions and true values, mathematically calculated as per equation 22.
Root Mean Squared Error (RMSE) is the square root of the mean squared difference between predictions and true values, so it penalizes large errors more heavily than MAE; it was calculated as per equation 23.
Mean Absolute Error (MAE) quantifies the average absolute difference between predictions and true values, mathematically calculated as per equation 24.
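Both error measures follow directly from their definitions, as the sketch below shows. The text does not specify whether they were computed on encoded class labels or on predicted probabilities, so the integer-label example is an assumption.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical encoded labels (0, 1, 2) for four samples.
y_true = [0, 1, 2, 1]
y_pred = [0, 2, 2, 1]
mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
```

With one of four predictions off by a single class index, the MAE is 0.25 and the RMSE is 0.5.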
A confusion matrix was used to provide a detailed per-class breakdown of true positives, true negatives, false positives, and false negatives, giving further insight into how the model classifies [30]. Taken together, these evaluation measures give an overall assessment of the model's error, robustness, and generalization to unseen data, supporting its reliability as a tool for classifying breast ultrasound images. Because lesion masks are available in the BUSI dataset, quantitative localization measures such as intersection-over-union can be introduced in future work to objectively assess explanation accuracy. Stratified hold-out validation was chosen to preserve independence between the training and evaluation subsets while retaining the class distribution. Cross-validation was not used, in order to prevent repeated exposure of validation samples to the optimization process; future studies will use repeated or nested cross-validation to further quantify performance variation.
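A three-class confusion matrix and its one-vs-rest counts can be sketched as follows. The row/column convention (rows are true classes, columns are predicted classes) and the integer class encoding are assumptions for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_counts(matrix, k):
    """Return (TP, FP, FN, TN) for class k in a one-vs-rest reading of the matrix."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # true class k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp   # predicted k, true class differs
    tn = total - tp - fn - fp
    return tp, fp, fn, tn
```

Reading the matrix one class at a time in this way yields the per-class counts from which precision, recall, and F1 are derived.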