An Explainable Privacy Preserving Multimodal Ensemble Framework For Skin Lesion Classification

Amrita Koul; N. P. Singh

doi:10.3791/71472

Research Article


June 12th, 2026

Amrita Koul¹ , N. P. Singh¹

¹Department of Computer Science and Engineering, School of Engineering and Technology, MVN University, Palwal

Summary


The proposed work aims to develop and evaluate an explainable, privacy-preserving multimodal ensemble framework for accurate skin lesion classification. It integrates deep learning features, clinical metadata, and explainable AI techniques to improve diagnostic accuracy, transparency, and the reliability of clinical decision support for early skin cancer detection.

Abstract


Skin cancer is among the most life-threatening dermatological diseases, and early, accurate diagnosis is important for improving a patient's prognosis. Nevertheless, traditional AI-based diagnostic methods face several challenges, including privacy concerns, limited interpretability, and severe class imbalance in multi-class skin lesion datasets. To overcome these challenges, this paper proposes a privacy-aware, explainable multimodal skin lesion classification model that combines deep learning feature extractors and an ensemble modeling approach with explainable artificial intelligence methods. Experimental evaluation is conducted on the publicly available HAM10000 benchmark dataset for multi-class skin lesion classification, accessed through Kaggle Hub, which covers seven clinically significant lesion classes (akiec, bcc, bkl, df, mel, nv, vasc). A class-balancing technique is used to boost the minority classes. EfficientNet-B4, DenseNet201, and MobileNetV2 are used to extract deep feature representations, which are then combined with salient clinical metadata to create a robust multimodal feature space. These multimodal features are used to train XGBoost, LightGBM, and a Deep Neural Classifier (DNC), which achieved classification accuracies of 92%, 90%, and 94%, respectively. A stacked ensemble strategy combines the outputs of XGBoost, LightGBM, and the DNC and improves accuracy to 96%. Model interpretability techniques provide feature-level explanations that increase transparency. The experimental findings indicate that the proposed framework is practical and efficient for clinically relevant, real-world classification of skin lesions.

Introduction


Skin cancer represents a significant global health burden, with increasing incidence rates reported worldwide¹. Artificial radiation is recognized as a major contributing factor to skin cancer, leading to genetic mutations that result in uncontrolled cell proliferation and tumor development in skin cells¹,². Skin cancers comprise a group of diseases, including melanoma, squamous cell carcinoma, and basal cell carcinoma (bcc). The causes, clinical presentation, and prognostic factors of these conditions all differ³. Skin diseases are difficult to diagnose from images because of pixel-level similarities between lesions⁴. In 2022, there were an estimated 331,722 melanoma cases (58,667 deaths) and 1.2 million NMSC cases (69,416 deaths) globally. The highest age-standardized incidence rates (ASR) for melanoma were in Oceania (29.78/100,000), North America (16.3), and Europe (10.43). However, the mortality-to-incidence ratio was highest in Africa (0.35) and Asia (0.30) compared to North America and Oceania (0.02 in both), which may reflect a poorer prognosis¹. In dermatology, the diagnosis and monitoring of skin lesions have primarily relied on visual examination and other non-invasive assessments. Invasive methods are avoided because they can damage the lesions and prevent clinical follow-up of lesion growth⁵. Skin lesions can be of different types: melanoma (MEL), dermatofibroma (DF), actinic keratosis and intraepithelial carcinoma (AKIEC), basal cell carcinoma (BCC), benign keratosis (BKL), melanocytic nevus (NV), and vascular lesions (VASC), as defined in the HAM10000 dataset⁵. Major challenges in the classification of dermatoscopic images are the presence of hairs, inks, ruler marks, colored patches, glimmers, drops, oil bubbles, blood vessels, hyperpigmented areas, and/or inflammatory lesions⁶. Previous studies have addressed feature selection and deep learning for medical imaging and skin lesion classification⁷,⁸.

Computer vision-based approaches for skin cancer diagnosis and the integration of handcrafted and deep features have also been investigated⁹, along with feature fusion strategies for improved classification performance¹⁰. Recent advancements further emphasize the integration of machine learning in healthcare systems and secure medical data processing frameworks¹¹,¹². AI-driven healthcare powered by advanced computational algorithms has the potential to deliver personalized and efficient integrated care programs, which is especially beneficial for patients in remote and home care settings¹³. By utilizing extensive datasets of dermatoscopic images, deep learning models, particularly convolutional neural networks (CNNs), can be trained to accurately identify and classify various skin lesions. Several techniques show strong outcomes in skin lesion segmentation, including fully convolutional networks (FCNs), CNNs, deep CNNs (DCNNs), fully convolutional residual networks (FCRNs), and U-Net architectures. Deep neural networks (DNNs) are not easily interpretable because of their highly complex architecture, so their decision-making process is hard to comprehend¹⁴,¹⁵. Recent advances in medical image analysis have demonstrated that deep CNNs significantly improve performance in skin lesion classification tasks. Several studies on dermoscopic datasets such as HAM10000 have shown that CNN-based architectures, including ResNet, DenseNet, and EfficientNet, achieve strong multi-class classification performance by learning hierarchical feature representations from lesion images. Hybrid feature fusion approaches, in which multiple CNN backbones are combined, have further improved diagnostic accuracy by integrating complementary deep representations¹⁶. Moreover, recent studies have investigated hybrid CNN-Transformer models in medical image analysis. Models that combine vision transformer and CNN feature extractors have achieved better outcomes in skin lesion classification tasks because they capture both local texture content and global contextual relationships¹⁷. These hybrid designs are also regarded as state-of-the-art in medical imaging because of their balanced representation learning ability.

Feature fusion strategies have also been used extensively in areas of medicine outside dermatology. CNN-based hybrid systems have been applied to the analysis of histopathological images to achieve better classification of lung and colon cancer through enhanced feature representations and spatial learning dynamics¹⁶. Similarly, in ophthalmology, deep learning models trained on fused feature representations have been applied successfully to diabetic retinopathy staging of fundus images, with better robustness and classification accuracy in a multi-class grading task¹⁸. Multimodal fusion methods in these fields all suggest that heterogeneous feature representations yield better generalization and classification, especially on imbalanced medical data¹⁹.

Despite these improvements, current approaches often do not integrate multiple modalities, do not adequately address class imbalance, and offer limited support for clinical decision-making. To overcome these issues, this paper presents an explainable, privacy-conscious skin lesion classification model that integrates model interpretability methods. These explainability methods explain the model's predictions by showing which features are most important and by highlighting significant areas of dermoscopic images, which improves clinical transparency, builds trust, and supports the safe implementation of AI systems in clinical practice. The HAM10000 dataset is significantly imbalanced, with a few classes having far fewer samples than others. To overcome this problem, the Synthetic Minority Over-sampling Technique (SMOTE), referred to here as the class-balancing technique, is used to generate synthetic samples for underrepresented classes. Class balancing enables the model to learn better from minority lesion types, which increases sensitivity and allows more reliable prediction of clinically significant yet less frequent classes of skin cancer. Deep features from EfficientNet-B4, DenseNet201, and MobileNetV2 are combined with the clinical metadata to form a more informative representation of every skin lesion. This dual representation captures both the visual patterns of dermoscopic images and additional patient information for a more in-depth analysis. Several classifiers, including XGBoost, LightGBM, and a Deep Neural Network, are then trained on these features to strengthen the skin lesion classification model. The models are combined with a stacking ensemble technique. This composite model leverages the strengths of the individual models by learning from the predictions of all models in the ensemble while mitigating their limitations.


Protocol


This study used publicly available, fully anonymized dermoscopic datasets and involved no direct human participation; therefore, ethical committee approval was not required. The Table of Materials contains details of all the materials or tools used in this study. Table 1 includes details of the hardware and software environment, such as processor type, memory, operating system, and software frameworks. Table 2 includes details of the class-wise precision, recall, F1-score, and support for each skin lesion category.

Overall workflow of the proposed multimodal skin lesion classification framework

The overall goal of this research is to create an accurate and interpretable scheme for multi-class classification of skin lesions. The workflow starts with data collection and preprocessing of the HAM10000 dataset, then proceeds to feature extraction using deep learning architectures and the inclusion of clinical metadata. Afterward, several machine learning classifiers are trained and optimized, and their results are aggregated in an ensemble strategy. Lastly, the predictions of the model are interpreted using explainability techniques, and the effectiveness of the model is evaluated for use in real-world clinical decision support.

To improve the predictive accuracy of the proposed system, a multimodal machine learning pipeline is used that combines image-based features and clinical metadata (Figure 1). The model combines the visual information from dermoscopic images with patient-related information to identify more detailed patterns associated with various skin lesions. With this combination, the system can make better predictions, which ultimately improves the quality and usefulness of skin lesion classification. Deep features are extracted with three pre-trained convolutional neural networks (EfficientNet-B4, DenseNet201, and MobileNetV2), which capture a variety of complementary patterns in dermoscopic images. These architectures learn high-level patterns in the appearance of skin lesions, such as variations in color, texture, and structure. A feature fusion module then combines the deep features with the clinical and demographic features to produce a rich multimodal feature representation. The fused data are split into training, validation, and test sets for model evaluation. An ensemble strategy is used to further enhance prediction accuracy: the class probabilities predicted by several models are averaged, and the final prediction is derived from the averaged probabilities, which improves generalization and reduces the variance of individual models. In addition, model interpretability techniques are integrated to explain how the model makes its decisions. Feature-level explanations quantify the contribution of each input variable, while pixel-level explanations highlight the regions of dermoscopic images that influence the prediction. Combined, these techniques make the models more interpretable and help clinicians understand how the system reaches its decisions. As a result, the proposed pipeline provides a system that is understandable and privacy-conscious, increasing transparency and trust and enabling more dependable skin cancer diagnosis in a real-world healthcare setting.
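The probability-averaging step described above can be illustrated with a small sketch. It is a minimal example under stated assumptions, not the study's implementation: the three probability vectors are made-up values standing in for the outputs of XGBoost, LightGBM, and the deep neural classifier, and the function names are hypothetical.

```python
# Minimal sketch of probability averaging (soft voting) for one sample.
# The probability vectors below are invented illustration values, not results.

CLASSES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]


def average_probabilities(model_probs):
    """Average the per-class probabilities predicted by several models."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def predict_class(model_probs, class_names):
    """Return the class with the highest averaged probability."""
    avg = average_probabilities(model_probs)
    best = max(range(len(avg)), key=lambda c: avg[c])
    return class_names[best], avg


# Hypothetical outputs of three classifiers for a single lesion.
xgb_probs = [0.05, 0.05, 0.10, 0.05, 0.40, 0.30, 0.05]
lgbm_probs = [0.05, 0.05, 0.10, 0.05, 0.30, 0.40, 0.05]
dnc_probs = [0.02, 0.03, 0.05, 0.05, 0.60, 0.20, 0.05]

label, avg = predict_class([xgb_probs, lgbm_probs, dnc_probs], CLASSES)
print(label, [round(p, 3) for p in avg])
```

In this toy case the second model alone would favor nv, but the averaged probabilities favor mel, which shows how averaging can smooth out disagreement between individual models.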

Dataset description with preparation

In this paper, the HAM10000 (Human Against Machine with 10,000 training images) dataset is used as the primary dataset for multi-class skin lesion classification. The dataset contains over 10,000 dermoscopic images collected from various clinical sources and populations, making it one of the most widely used benchmark datasets in dermatological image analysis. Each image in the dataset is accompanied by important clinical metadata, including image identifiers, diagnostic labels, patient age, sex, and the anatomical location of the lesion. The dataset covers seven diagnostic categories: actinic keratoses (akiec), basal cell carcinoma (bcc), benign keratosis (bkl), dermatofibroma (df), melanocytic nevi (nv), vascular lesions (vasc), and melanoma (mel).

Clinical metadata preprocessing

Clinical metadata, such as age, sex, and the anatomical location of the lesion, were added to the classification pipeline as auxiliary features. Missing or unknown values were handled with a deterministic preprocessing approach. For the numerical age variable, missing values were imputed with the median age calculated on the training set. Median imputation was chosen because it is robust to outliers and skewed distributions, which are prevalent in clinical data. For the categorical variables (sex and lesion location), missing or unspecified values were not excluded; they were assigned to a dedicated category labeled 'unknown'. This approach retains all available samples and lets the model determine whether missingness itself is predictive. One-hot encoding was then applied to the categorical variables to make them compatible with machine learning models. All preprocessing steps, including imputation and encoding, were fitted on the training set only, and the same transformations were then applied to the validation and test sets to avoid data leakage. No samples were excluded solely because of missing clinical metadata, which maximized data utilization and ensured methodological consistency.
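The fit-on-training-only rule can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the field names `age`, `sex`, and `localization`, the toy rows, and the choice to map categories unseen during training to 'unknown' are all assumptions made for the example.

```python
# Sketch: median imputation for age, an explicit 'unknown' category, and
# one-hot encoding, with all statistics learned from the training rows only.
from statistics import median

CATEGORICAL = ("sex", "localization")  # assumed column names


def fit_preprocessor(train_rows):
    """Learn the age median and the category vocabulary from training data."""
    ages = [r["age"] for r in train_rows if r["age"] is not None]
    categories = {}
    for col in CATEGORICAL:
        seen = {r[col] if r[col] else "unknown" for r in train_rows}
        seen.add("unknown")  # always reserve the 'unknown' slot
        categories[col] = sorted(seen)
    return {"age_median": median(ages), "categories": categories}


def transform(row, prep):
    """Apply the fitted preprocessing to one row (train, validation, or test)."""
    age = row["age"] if row["age"] is not None else prep["age_median"]
    features = [float(age)]
    for col, cats in prep["categories"].items():
        # Missing values and categories unseen in training map to 'unknown'
        # (the second part is an assumption of this sketch).
        value = row[col] if row[col] in cats else "unknown"
        features.extend(1.0 if value == c else 0.0 for c in cats)
    return features


train = [
    {"age": 40, "sex": "male", "localization": "back"},
    {"age": None, "sex": "female", "localization": None},
    {"age": 70, "sex": None, "localization": "face"},
]
prep = fit_preprocessor(train)
test_row = {"age": None, "sex": "male", "localization": "scalp"}
encoded_train = transform(train[1], prep)
encoded_test = transform(test_row, prep)
print(encoded_train)
print(encoded_test)
```

Because `fit_preprocessor` never sees validation or test rows, the imputed median and the one-hot layout cannot leak information from the held-out data.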

Figure 1: Multimodal system for skin lesion classification. The study approach combines dermoscopic image features with patient metadata to classify skin lesions using ensemble deep learning models. The framework includes preprocessing, feature extraction, multimodal fusion, and classification, allowing for enhanced diagnostic performance and interpretability.

The workflow depicts the proposed classification pipeline, based on dermoscopic images and clinical metadata from the HAM10000 skin lesion dataset. Images are preprocessed, and deep features are extracted with EfficientNet-B4, DenseNet201, and MobileNetV2. The clinical metadata are encoded, and feature fusion combines the image features with the clinical metadata. To address class imbalance, the class-balancing technique is applied in the fused multimodal feature space rather than to the raw images or individual feature streams, so that synthetic samples preserve the combination of visual and clinical features and do not produce unrealistic samples. The fused features are then used to train classifiers such as XGBoost, LightGBM, and a deep neural classifier.

Figure 2: Example dermoscopic images from seven different diagnostic groups from the HAM10000 dataset. Images show typical visual features used for automated classification. (A) Actinic keratoses (akiec), demonstrating rough surfaces with irregular pigmentation. (B) Basal cell carcinoma (bcc), with irregular shapes and blood vessels. (C) Benign keratosis-like lesions (bkl), showing keratotic features with light brown surfaces. (D) Dermatofibroma (df), with a central scar-like appearance and pigmentation. (E) Melanocytic nevi (nv), benign and relatively symmetric moles. (F) Vascular lesions (vasc), showing a reddish-purple appearance due to blood vessels. (G) Melanoma (mel), which presents as an irregularly shaped, asymmetric, and multi-pigmented lesion.

These dermoscopic images reveal the visual heterogeneity of skin lesions, which vary in pigmentation, texture, and structural morphology. These variations pose a great challenge to automated classification systems and underline the importance of deep learning-based feature extraction techniques that are sensitive to subtle diagnostic patterns. Figure 2 illustrates the seven categories of skin lesions included in the HAM10000 dataset, which are commonly studied in dermatological diagnostic imaging research. These classes include Actinic Keratoses (akiec), Basal Cell Carcinoma (bcc), Benign Keratosis (bkl), Dermatofibroma (df), Melanocytic Nevi (nv), Vascular Lesions (vasc), and Melanoma (mel)²¹. Each lesion type has distinctive visual features, as shown in Figure 2, including variation in pigmentation patterns, surface texture, color distribution, and abnormalities along the lesion borders. Dermatologists consider these characteristics during clinical examination, so machine learning models must capture them well to classify lesions correctly. Despite these differentiating characteristics, many lesions appear virtually identical, which makes them difficult to tell apart from dermoscopic images alone. The distinction between certain lesion types is typically extremely subtle but clinically pertinent, which makes automatic classification challenging. This is why powerful AI models are needed that can learn fine-grained visual patterns and subtle differences between lesion classes. Describing these properties appropriately improves the model's ability to discriminate between lesion types and helps to diagnose dangerous conditions, such as melanoma, earlier. Ultimately, this can enhance diagnostic accuracy and support clinicians in making decisions that improve patient outcomes.

Figure 3: Class-wise distribution of skin lesions in the HAM10000 dataset. The figure shows the distribution of the seven lesion categories considered in this study: Actinic Keratoses (akiec), Basal Cell Carcinoma (bcc), Benign Keratosis-like lesions (bkl), Dermatofibroma (df), Melanocytic Nevi (nv), Vascular Lesions (vasc), and Melanoma (mel). This graph illustrates the class imbalance of the lesion classes.

The analysis of the dataset shows a pronounced imbalance among the lesion classes (Figure 3). Melanocytic Nevi (nv) is the most common class, with approximately 6,705 samples, followed by Melanoma (1,113) and Benign Keratosis (1,099). In contrast, some clinically significant lesion types are far less represented, such as Dermatofibroma (115) and Vascular Lesions (142). This imbalance is a problem for machine learning models, which tend to be biased toward the majority classes and may fail to detect rare but clinically significant lesions. To deal with this issue and improve model performance across all classes, advanced preprocessing strategies are required, including targeted data augmentation and class balancing. The data can be balanced using a class-balancing technique and class weight adjustment, which encourage the model to learn meaningful patterns in the underrepresented classes. The hyperparameters used for XGBoost and LightGBM were primarily set to their default configurations, with minor adjustments based on preliminary experiments. For the deep neural classifier, architectural and training parameters such as the number of layers, neurons, learning rate, batch size, and number of epochs were selected empirically using validation data. The complete set of hyperparameters is provided in Table 3. In total, 10,015 dermoscopic images are used in the present study. This provides a large collection of data for training and testing and a challenging yet useful benchmark for appraising the effectiveness of the proposed skin lesion classification system.
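To make the idea of class weight adjustment concrete, the sketch below computes inverse-frequency weights from the class counts. It is an illustration under assumptions rather than the configuration used in the study: the weighting formula (total / (number of classes × class count)) is one common heuristic, and the counts for bcc (514) and akiec (327) are not stated in the text above; they are taken from the public HAM10000 summary and are consistent with the stated total of 10,015 images.

```python
# Sketch: inverse-frequency ("balanced") class weights from HAM10000 counts.
# bcc and akiec counts are assumptions not given in the text (see lead-in).

COUNTS = {
    "nv": 6705,
    "mel": 1113,
    "bkl": 1099,
    "bcc": 514,    # assumed from the public dataset summary
    "akiec": 327,  # assumed from the public dataset summary
    "vasc": 142,
    "df": 115,
}


def balanced_class_weights(counts):
    """weight_c = total / (n_classes * count_c); rare classes get larger weights."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(COUNTS)
for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:6s} count={COUNTS[label]:5d} weight={w:.3f}")
```

With this heuristic the weighted class counts all become equal, so a loss that uses these weights treats each class as if it contributed the same number of samples.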

Data preprocessing

The preprocessing pipeline prepares the HAM10000 dataset for multimodal learning by standardizing images, extracting deep features, integrating clinical metadata, and addressing class imbalance.

Image Standardization: All dermoscopic images were resized to 224 × 224 pixels and normalized using z-score normalization.

$$I_{\text{norm}} = \frac{I - \mu}{\sigma}$$ (1)

where I represents the raw image, µ denotes the pixel-wise mean, and σ is the standard deviation.
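A small worked example may help to check Equation (1). The sketch below is a simplification under assumptions: it treats a tiny 2 × 2 grayscale array as the image and computes µ and σ over that single image, whereas the study may compute the statistics per channel or over the whole dataset.

```python
# Sketch: z-score normalization of a toy grayscale image (list of rows).
# Assumption: mu and sigma are computed over the single image.
from statistics import mean, pstdev


def zscore_normalize(pixels):
    """Return (I - mu) / sigma for every pixel; constant images map to zeros."""
    flat = [v for row in pixels for v in row]
    mu = mean(flat)
    sigma = pstdev(flat)
    if sigma == 0:
        return [[0.0 for _ in row] for row in pixels]
    return [[(v - mu) / sigma for v in row] for row in pixels]


image = [[0, 50], [100, 150]]  # toy intensities, not real dermoscopic data
normalized = zscore_normalize(image)
flat_norm = [v for row in normalized for v in row]
print(normalized)
```

After normalization the pixel values have zero mean and unit standard deviation, which is the property the formula is meant to guarantee.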

Deep Feature Extraction: Complementary deep features were extracted using three pre-trained convolutional neural networks: EfficientNet-B4, DenseNet201, and MobileNetV2. Each network maps the normalized image to a feature vector.

$$F_{\text{EffB4}} = f_{\text{EffB4}}(I_{\text{norm}}), \quad F_{\text{Dense}} = f_{\text{Dense}}(I_{\text{norm}}), \quad F_{\text{MobV2}} = f_{\text{MobV2}}(I_{\text{norm}})$$ (2)

The extracted features were concatenated to form a unified representation:

$$F_{\text{fusion}} = F_{\text{EffB4}} \,\|\, F_{\text{Dense}} \,\|\, F_{\text{MobV2}}$$ (3)

(where || means concatenation)

Clinical Metadata Integration: Clinical attributes, including age, sex, and lesion localization, were cleaned, label-encoded, and normalized using min-max scaling:

$$x_{\text{scaled}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}$$ (4)

The processed metadata vector M_clinical was fused with the image features to construct the final multimodal input:

$$F_{\text{combined}} = F_{\text{fusion}} \,\|\, M_{\text{clinical}}$$ (5)
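The scaling and fusion steps in Equations (3) to (5) amount to a rescaling followed by concatenation, as the following sketch shows. The vectors are tiny made-up stand-ins; the real backbone outputs are much longer, and the function names are hypothetical.

```python
# Sketch: min-max scaling of a metadata column and concatenation-based fusion.
# All vectors are toy values chosen for illustration.


def min_max_scale(values):
    """Map values linearly to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(f_effb4, f_dense, f_mobv2, m_clinical):
    """F_combined = (F_EffB4 || F_Dense || F_MobV2) || M_clinical."""
    f_fusion = list(f_effb4) + list(f_dense) + list(f_mobv2)
    return f_fusion + list(m_clinical)


ages = [20, 50, 80]
scaled_ages = min_max_scale(ages)  # one scaled value per patient

# Toy feature vectors for the second patient (age 50).
f_combined = fuse(
    f_effb4=[0.1, 0.2, 0.3],
    f_dense=[0.4, 0.5],
    f_mobv2=[0.6, 0.7],
    m_clinical=[scaled_ages[1], 1.0, 0.0],  # scaled age plus two encoded fields
)
print(scaled_ages, len(f_combined))
```

Concatenation keeps every image and metadata feature in a fixed position, so downstream classifiers such as gradient-boosted trees can treat the fused vector as an ordinary tabular row.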

Dataset Splitting: A stratified split was applied to preserve the class distribution:

$$D_{\text{train}}, D_{\text{test}} = \text{Split}(F_{\text{combined}}, 0.8)$$ (6)
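A stratified 80/20 split can be sketched without any external library by splitting each class separately. This is an illustrative sketch, not the study's code: the function name, the fixed seed, the rounding rule, and the toy label list are assumptions.

```python
# Sketch: stratified train/test split that keeps class proportions.
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced label list: 50 nv, 10 mel, 5 df.
labels = ["nv"] * 50 + ["mel"] * 10 + ["df"] * 5
train_idx, test_idx = stratified_split(labels, train_fraction=0.8)
print(len(train_idx), len(test_idx))
```

Splitting per class guarantees that even the rarest class appears in both partitions in roughly the 80/20 ratio, which a purely random split does not ensure for small classes.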

Class imbalance handling: The HAM10000 dataset has a severe class imbalance: nevus (NV) samples dominate, while minority classes such as DF and VASC are underrepresented. To reduce this problem, the Synthetic Minority Over-sampling Technique (SMOTE), referred to here as the class-balancing technique, was employed. New synthetic samples were produced as:

$$x_{\text{new}} = x_i + \lambda\,(x_{zi} - x_i), \qquad \lambda \sim U(0, 1)$$ (7)

where x_i is a minority class sample, x_zi is one of its nearest neighbors, and λ is a random value sampled from a uniform distribution between 0 and 1. The synthetic sample is generated along the line segment joining x_i and x_zi. Figure 4 shows the class distribution before and after balancing.
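The interpolation in Equation (7) can be sketched directly. The example below is a simplified illustration, not the library routine used in the study: the neighbor count `k`, the Euclidean distance, the helper names, and the four toy minority samples are all assumptions.

```python
# Sketch: generating one synthetic minority sample by interpolation (Eq. 7).
import math
import random


def interpolate(x_i, x_zi, lam):
    """x_new = x_i + lam * (x_zi - x_i), applied feature by feature."""
    return [a + lam * (b - a) for a, b in zip(x_i, x_zi)]


def nearest_neighbors(index, samples, k):
    """Indices of the k samples closest (Euclidean) to samples[index]."""
    others = [j for j in range(len(samples)) if j != index]
    others.sort(key=lambda j: math.dist(samples[index], samples[j]))
    return others[:k]


def smote_sample(minority, k=2, rng=None):
    """Pick a minority sample, one of its k neighbors, and interpolate."""
    rng = rng if rng is not None else random.Random(0)
    i = rng.randrange(len(minority))
    z = rng.choice(nearest_neighbors(i, minority, k))
    lam = rng.random()  # uniform draw from [0, 1)
    return interpolate(minority[i], minority[z], lam)


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region already occupied by that class, which is why applying the technique in the fused feature space keeps the visual and clinical components of a sample consistent with each other.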

Figure 4: Class distribution in the HAM10000 dataset before/after applying the class balancing technique. (A) Before class balancing, with imbalance across lesion classes. (B) After class balancing in the combined feature space, where the representation of all classes is equal to avoid bias in the classifier training process.

To address the class imbalance in the HAM10000 dataset, the Synthetic Minority Over-sampling Technique (SMOTE, the class-balancing technique) is applied. It generates synthetic samples for the minority classes by interpolating between existing data points, which increases the representation of underrepresented lesion categories. Producing more examples of these minority classes yields a dataset that is more balanced across all seven lesion types. This balanced representation enables the classification models to learn each class better and reduces bias toward the majority classes. Consequently, the model classifies more fairly and is more sensitive, especially to rare yet clinically important skin lesions.

Privacy-preserving learning framework

The proposed system is a privacy-aware and interpretable multimodal system for automated skin lesion classification. Its ultimate aim is to enhance diagnostic performance while safeguarding sensitive patient information throughout the training process. Patient privacy is essential in medical practice because of healthcare data privacy laws and ethical considerations. The proposed model therefore includes a decentralized learning scheme based on the ideas of federated learning. In this decentralized environment, model training is carried out across a group of distributed clients instead of aggregating all patient data in a centralized location. Each participating client trains the model locally on its own data, and raw patient data do not leave the local environment. Instead of moving sensitive medical records, clients send model updates or parameters to a central server to be aggregated. This cooperative approach to learning enables different institutions or data sources to contribute to model training without compromising data privacy.

Let w_t^(k) be the model parameters of the kth client at the tth iteration, and let n_k be the sample size at that client. The update of the global model is calculated as:

$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{N}\, w_t^{(k)}, \qquad N = \sum_{k=1}^{K} n_k$$ (8)

This aggregation strategy ensures that clients with larger datasets contribute proportionally more to the global model while still allowing smaller clients to participate in the learning process. By enabling collaborative training without exchanging raw patient data, the proposed framework maintains privacy while still benefiting from distributed knowledge across datasets.
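The weighted aggregation described above can be written in a few lines. The sketch below assumes that each client's model is flattened into a plain list of parameters; the client parameter values and sample sizes are invented for illustration, and the function name is hypothetical.

```python
# Sketch: sample-size-weighted averaging of client parameters (FedAvg-style).
# Client parameters and sizes are toy values, not values from the study.


def fedavg(client_weights, client_sizes):
    """Return sum_k (n_k / N) * w_k for each parameter position."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n / total for w, n in zip(client_weights, client_sizes))
        for j in range(n_params)
    ]


# Three hypothetical clients, each holding a two-parameter model.
client_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
client_sizes = [100, 300, 600]

global_weights = fedavg(client_weights, client_sizes)
print(global_weights)
```

In this toy case the third client holds 60% of the samples, so the aggregated parameters (about 4.0 and 5.0) sit closer to its values than a plain unweighted mean (3.0 and 4.0) would.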

Federated experimental setup

A simulated federated learning system based on the HAM10000 dataset was designed to assess the effectiveness of the proposed privacy-aware framework. The data were divided among three clients to simulate a real-life multi-institutional environment with non-identically distributed (non-IID) data. Each client has a different mix of lesion classes, which represents the variation found between clinical centers (Figure 5). The identical multimodal feature extraction pipeline (EfficientNet-B4, DenseNet201, MobileNetV2, and clinical metadata) was run locally at every client. During training, clients updated their local models independently, and only the learned parameters were exchanged with the central server, where they were aggregated by the FedAvg algorithm. The federated model was compared with the centralized training approach to measure the trade-off between predictive accuracy and privacy. The test outcomes show that the federated model performs competitively, with only a slight decrease in accuracy relative to centralized learning and much improved data privacy.

Figure 5: Client-wise distribution of the HAM10000 dataset. This shows the allocation of skin lesion data among clients, demonstrating the diversity in data distribution. This demonstrates the heterogeneity of data among clients, a critical aspect of federated learning.

To model real-life clinical conditions, HAM10000 was divided among three clients with heterogeneous (non-IID) distributions. The distribution of lesion categories differs across clients, especially for the nevus (nv) class, which is not evenly distributed. This arrangement reflects a real-world difficulty of federated learning, in which data are not evenly distributed across institutions.
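The text does not specify how the non-IID partition was generated, so the following is only one possible way to simulate label skew across three clients. The per-class client probabilities in `client_mix`, the seed, and the toy label list are entirely hypothetical.

```python
# Sketch: one possible label-skew partition across three simulated clients.
# The mixing probabilities are hypothetical and not taken from the study.
import random
from collections import Counter


def label_skew_partition(labels, client_mix, seed=0):
    """Assign each sample to a client using class-dependent probabilities."""
    rng = random.Random(seed)
    n_clients = len(next(iter(client_mix.values())))
    clients = [[] for _ in range(n_clients)]
    for idx, label in enumerate(labels):
        k = rng.choices(range(n_clients), weights=client_mix[label])[0]
        clients[k].append(idx)
    return clients


labels = ["nv"] * 60 + ["mel"] * 30 + ["df"] * 10
client_mix = {
    "nv": [0.7, 0.2, 0.1],   # most nevus samples go to client 0
    "mel": [0.1, 0.3, 0.6],  # melanoma is concentrated on client 2
    "df": [0.0, 0.0, 1.0],   # dermatofibroma appears only on client 2
}
clients = label_skew_partition(labels, client_mix)
for k, indices in enumerate(clients):
    print(k, dict(Counter(labels[i] for i in indices)))
```

Each client ends up with a different class mix, which mimics the situation in which individual clinical centers see different proportions of lesion types.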

Performance comparison: centralized vs federated learning

To evaluate the effectiveness of the proposed federated learning framework, a comparative analysis was conducted between centralized and federated training strategies using the HAM10000 dataset, as shown in Figure 6. In the centralized setting, all data samples were aggregated into a single training pool. The best-performing centralized model, the stacked ensemble, achieved an overall accuracy of 96%. In contrast, the federated setting distributed the dataset across three clients with non-identically distributed (non-IID) data, where each client trained the model locally and shared only model parameters using FedAvg. The federated model achieved an overall accuracy of approximately 94%, corresponding to a performance difference of 2% compared to the centralized approach, as shown in Table 4. This marginal decrease is expected due to decentralized optimization and heterogeneous data distribution across clients.

Despite this small decrease, the federated model retained strong predictive performance. In centralized training, class-wise behavior shows that majority classes such as nevus (nv) (F1-score = 1.00) remain stable, while minority classes such as dermatofibroma (df) (F1-score ≈ 0.65–0.66) are more sensitive to distribution imbalance, which could affect federated performance even more. Notably, the federated structure reduces the risk of exposing sensitive patient information, since it does not require sharing raw medical data among clients.

Figure 6: Federated learning vs. centralized learning comparison. The figure compares the two learning paradigms using performance metrics such as accuracy, precision, recall, and F1-score, demonstrating that federated learning can achieve performance comparable to that of the traditional learning approach while preserving privacy.

The results in Table 4 indicate that the federated learning model remains competitive, with accuracy only approximately 2 percentage points lower than that of the centralized model. This slight reduction can be attributed to decentralized optimization and the non-IID data distribution. In return, the federated model offers a substantial privacy advantage, since sensitive patient information is not shared among clients. To ensure a fair comparison with the centralized stacked ensemble model, the federated model was tested with the same architecture and hyperparameters. The privacy-preserving aspect discussed in this study is conceptual and is intended to highlight the potential integration of techniques such as federated learning in future work. No experimental validation of privacy-preserving mechanisms is performed in the current implementation.

Multimodal feature fusion

The diagnosis of skin lesions usually involves both visual examination and clinical history. In most cases, dermatologists do not consider dermoscopic images in isolation; they interpret them in relation to patient information (age, sex, and lesion location) when making diagnostic judgments. The proposed system is inspired by this clinical workflow and adopts a multimodal learning approach that combines image-based and clinical data. Pre-trained CNNs are used to extract deep features from the dermoscopic images. Such networks recognize intricate visual patterns, including color variations, lesion shapes, structural anomalies, and texture features. Nevertheless, image features alone might not capture the clinical context of a lesion, so the clinical metadata associated with each image is also included in learning. A feature fusion module integrates the deep image features with the processed clinical attributes and demographic information. The result is a unified multimodal feature representation that carries both visual and contextual information for every lesion. By integrating several data sources, the model can exploit complementary patterns that improve overall classification ability. The multimodal representation allows the system to differentiate more effectively between visually similar lesions while also taking clinical indicators into account. Because it more closely approximates how dermatologists assess lesions in clinical practice, the model is more clinically meaningful and effective.
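One simple way to realize such a fusion step is to concatenate the image feature vector with encoded metadata. The sketch below assumes concatenation, one-hot encoding for sex and lesion site, and a fixed age scaling; the function name `fuse_features`, the category levels, and the random stand-in image features are hypothetical and are not taken from the original pipeline.

```python
import numpy as np

def fuse_features(image_features, age, sex, site, sex_levels, site_levels):
    """Concatenate deep image features with encoded clinical metadata."""
    age = np.asarray(age, dtype=float)
    # Scale age to roughly [0, 1]; the constant 100 is an illustrative choice.
    age_col = (age / 100.0).reshape(-1, 1)
    # One-hot encode the categorical attributes against fixed level lists.
    sex_onehot = np.array([[s == lvl for lvl in sex_levels] for s in sex], dtype=float)
    site_onehot = np.array([[s == lvl for lvl in site_levels] for s in site], dtype=float)
    return np.hstack([image_features, age_col, sex_onehot, site_onehot])

# Stand-in for CNN embeddings: 4 lesions, 8 features each (real embeddings are much wider).
rng = np.random.default_rng(0)
image_features = rng.normal(size=(4, 8))
age = [45, 60, 30, 75]
sex = ["male", "female", "female", "male"]
site = ["back", "face", "back", "lower extremity"]

fused = fuse_features(
    image_features, age, sex, site,
    sex_levels=["male", "female"],
    site_levels=["back", "face", "lower extremity"],
)
print(fused.shape)  # 8 image + 1 age + 2 sex + 3 site columns
```

The fused matrix can then be passed to any tabular classifier, which is why the metadata columns are encoded numerically before concatenation.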

Stacked ensemble learning
The proposed framework uses a stacked ensemble learning strategy to further improve the predictive ability of the system. Ensemble learning combines two or more predictive models to improve generalization and reduce the prediction errors that can occur with single models. Rather than relying on a single classifier, multiple base learners are trained independently on the multimodal feature representation. Each base learner estimates the probability that a given sample belongs to each lesion class. These probability predictions are then aggregated at a meta-level. A weight is assigned to each base learner to reflect its relative importance to the final prediction, and a softmax activation function is applied to the aggregated output to generate normalized class probabilities. The stacked ensemble method has several benefits. First, it reduces prediction variance by combining multiple models and thus improves generalization. Second, it improves robustness, since different models capture different patterns in the data. Third, it improves the classification of minority lesion classes, which matters in medical data, where some clinically important conditions are rare.
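The meta-level aggregation described above can be sketched as a weighted sum of base-learner probabilities followed by a softmax. The weights and probability values below are hypothetical; in a trained stack the weights would be learned by the meta-learner, and the exact form of the authors' aggregation is not given in the text.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(base_probs, weights):
    """Combine base-learner outputs into normalized class probabilities.

    base_probs: array of shape (n_models, n_samples, n_classes).
    weights:    array of shape (n_models,), one weight per base learner.
    """
    base_probs = np.asarray(base_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the model axis gives (n_samples, n_classes) scores.
    scores = np.tensordot(weights, base_probs, axes=1)
    return softmax(scores, axis=1)

# Toy outputs from three base learners for two samples and three classes.
base_probs = [
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]],
    [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]],
]
weights = [0.3, 0.2, 0.5]  # hypothetical base-learner weights
ensemble_probs = aggregate(base_probs, weights)
print(ensemble_probs.argmax(axis=1))  # predicted class index per sample
```

The softmax guarantees that each row of the output sums to one, whatever scale the weights have.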

Explainable artificial intelligence integration

Although high prediction accuracy is critical, medical AI systems should also offer clear explanations of their decisions. To trust AI systems and use them effectively in practice, clinicians need to understand how a model arrives at the diagnosis it produces. To meet this need, the proposed framework incorporates explainable artificial intelligence (XAI) methods, as depicted in Figure 7.

Figure 7: Confusion matrices of different classification models for multi-class skin lesion classification. (A) XGBoost, (B) LightGBM, (C) Deep Neural Classifier, and (D) Stacked Ensemble model. Each confusion matrix shows the relationship between the true class (rows) and the predicted class (columns) for all seven types of skin lesions: akiec, bcc, bkl, df, mel, nv, and vasc. The XGBoost and LightGBM models perform well for the nv and bkl classes, though there is some confusion between mel and nv. The Deep Neural Classifier improves the classification of bkl and df and decreases off-diagonal confusion. The Stacked Ensemble model shows the greatest classification consistency, with the diagonal becoming increasingly dominant.
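A confusion matrix laid out as in Figure 7 (true classes in rows, predicted classes in columns) can be reduced to per-class precision, recall, and F1-score as sketched below. The 3 × 3 matrix is made-up toy data, not values read from the figure, and the helper name `per_class_metrics` is hypothetical.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, F1 and overall accuracy from a confusion matrix.

    cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: support of each true class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Toy three-class confusion matrix (illustrative values only).
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]
precision, recall, f1, accuracy = per_class_metrics(cm)
print("precision:", precision)  # [0.8, 0.75, 0.75]
print("recall:   ", recall)     # [0.8, 0.6, 0.9]
print("accuracy: ", accuracy)   # 23 / 30
```

Off-diagonal mass in a row lowers recall for that class, while off-diagonal mass in a column lowers its precision, which is how the mel/nv confusion noted in the caption shows up in class-wise scores.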

The system includes two popular explainability approaches, SHapley Additive Explanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), to give insight into the model's predictions. SHAP provides feature-level explanations by measuring how much each input feature contributes to the overall prediction. It helps determine which clinical variables or visual characteristics have the greatest impact on the classification result, enabling researchers and clinicians to see the model's overall behavior across the dataset. LIME, on the other hand, provides local explanations of individual predictions. It highlights the areas of the dermoscopic image that have the greatest impact on the model's decision. These pixel-level visual explanations enable clinicians to visually inspect the areas of the lesion that informed the classification. By integrating SHAP and LIME, the proposed framework offers both global and local interpretability. This dual-explanation mechanism enhances transparency and enables clinicians to assess whether the model is targeting medically significant patterns.
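The attribution idea behind Shapley-based explanations can be illustrated with an exact computation on a tiny model. This sketch does not use the SHAP library and is not the authors' code; the toy function `f`, the all-zero baseline, and the helper name `shapley_values` are assumptions chosen so the result can be checked by hand. Exact enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take baseline values."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        # Evaluate f with only the features in `subset` switched to their real values.
        z = np.array(baseline, dtype=float)
        for j in subset:
            z[j] = x[j]
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                phi[i] += weight * (value(s + (i,)) - value(s))
    return phi

# Toy model: one additive feature and one pairwise interaction.
def f(z):
    return 2.0 * z[0] + z[1] * z[2] - 1.0

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(f, x, baseline)
print(phi)  # [2. 3. 3.]: the interaction term 2*3 = 6 is split equally
print(phi.sum(), f(np.array(x)) - f(np.array(baseline)))  # both equal 8.0
```

The contributions sum to the difference between the prediction and the baseline prediction, which is the property that makes feature-level attributions of this kind additive.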

Clinical decision support potential

Privacy-preserving learning, multimodal feature fusion, ensemble modeling, and explainable AI are the key components of an integrated and robust system for automatic skin lesion classification. Ideally, such a system should not only have high predictive power but also be transparent and secure, two key requirements for medical systems, as shown in Figure 8.

Figure 8: Receiver operating characteristic (ROC) curves for the stacked ensemble model. (A–C) ROC curves for the seven skin lesion types, plotting true positive rate (sensitivity) against false positive rate (1 − specificity). The area under the curve (AUC) represents the performance of the stacked ensemble model in discriminating between the classes.

This system provides explainable predictions and privacy protection, which makes it a useful complement to other dermatological diagnostic systems. It allows health practitioners and dermatologists to assess lesion suspiciousness and improve diagnostic accuracy, and thereby helps them diagnose patients at an early stage of potentially serious disease (e.g., melanoma). In essence, as shown in Figure 9, the system seeks to bring advanced artificial intelligence (AI) into real-world practice, helping dermatologists diagnose patients more accurately and with more confidence while safeguarding patient privacy, security, and comfort.

Figure 9: Explainability results using model interpretability techniques for multi-class skin lesion classification. (A) SHAP plot showing feature contributions influencing benign and malignant lesion predictions. (B) LIME explanation for the bcc prediction, illustrating the features contributing positively and negatively to the classification outcome. (C) LIME explanation for the akiec prediction, highlighting the most influential features involved in the model decision-making process. These interpretability visualizations demonstrate the regions and extracted features that significantly affect the model’s predictions, improving transparency and understanding of the classification process in skin lesion assessment.

Evaluation strategy

To avoid sampling bias and maintain the original class distribution across all skin lesion categories, the dataset was split 80:20 into training and test sets in a way that preserves class proportions. The training subset was then split 90:10 into training and validation sets for hyperparameter tuning and model optimization. The test set was not used at any stage of training; it was applied only once, for the final evaluation, to avoid data leakage and ensure an unbiased performance assessment. All models were pre-processed and trained under identical settings, with the same data partitioning, augmentation, and evaluation protocols, which allowed fair and reproducible comparisons. The models were evaluated on accuracy, precision, recall, F1-score, and AUC, with a detailed class-wise analysis to determine their robustness for both majority and minority lesion classes. This standardized validation protocol helps increase the reliability, transparency, and generalizability of the proposed approach and avoids potential inconsistencies in performance reporting.
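The two-stage split described above can be sketched with a small class-wise (stratified) splitter. The helper `stratified_split`, the seeds, and the synthetic label counts are assumptions for illustration; scikit-learn's `train_test_split` with its `stratify` argument is a common alternative that serves the same purpose.

```python
import numpy as np

def stratified_split(labels, holdout_fraction, seed=0):
    """Return (keep_idx, holdout_idx) with class proportions roughly preserved."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep, holdout = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then carve off the holdout share.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_holdout = int(round(holdout_fraction * len(idx)))
        holdout.extend(idx[:n_holdout].tolist())
        keep.extend(idx[n_holdout:].tolist())
    return np.array(sorted(keep)), np.array(sorted(holdout))

# Synthetic imbalanced labels for seven classes (illustrative counts only).
labels = np.repeat(np.arange(7), [30, 50, 110, 10, 110, 670, 20])

# Stage 1: 80:20 train/test. Stage 2: 90:10 train/validation within the training part.
trainval_idx, test_idx = stratified_split(labels, 0.20, seed=0)
tr_rel, val_rel = stratified_split(labels[trainval_idx], 0.10, seed=1)
train_idx, val_idx = trainval_idx[tr_rel], trainval_idx[val_rel]

print(len(train_idx), len(val_idx), len(test_idx))
```

Because the test indices are fixed before the validation split is drawn, nothing from the test set can leak into hyperparameter tuning.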

Results

Four classification methods (XGBoost, LightGBM, a Deep Neural Classifier, and a stacked ensemble model) were evaluated for multi-class skin lesion classification. The models achieved overall accuracies of 92%, 90%, 94%, and 96%, respectively, demonstrating that c

Class-wise performance

A detailed class-wise evaluation, including precision, recall, and F1-score for each lesion category, is provided. For the akiec class (support = 65), the stacked ens...

Discussion

The current protocol outlines a reproducible pipeline for creating an interpretable, privacy-sensitive, multimodal framework to automatically classify skin lesions. The protocol follows a systematic pattern of enhancing diagnostic performance through model transparency, combining dermoscopic image analysis with clinical metadata and interpretable machine learning methods. The HAM10000 skin lesion dataset is publicly available and allows the standardized assessment and facilitates the reproducibility of further research i...

Disclosures

The authors have no conflicts of interest to disclose. The authors declare that artificial intelligence tools were used solely for language editing and formatting. All scientific content, analysis, and interpretations were developed and verified by the authors.

Acknowledgements

The authors thank MVN University, Palwal, for providing academic guidance and research support. The authors also acknowledge the publicly available HAM10000 skin lesion dataset, which was used for the experimental evaluation of this study.

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
DenseNet201 CNN Architecture	IBM	https://arxiv.org/abs/1608.06993	Deep learning model for image classification
EfficientNet-B4 CNN Architecture	Google	https://arxiv.org/abs/1905.11946	Deep learning model for image classification
Google Colaboratory Platform	Google	https://colab.research.google.com	Cloud-based computational environment
HAM10000 Skin Lesion Dataset	Harvard Dataverse	https://doi.org/10.7910/DVN/DBW86T	Dermoscopic image dataset
Keras Deep Learning API	Google	Version 2.x	Neural network API
LIME Explainability Library	LIME Project	Version 0.x	Model interpretability technique
MobileNetV2 CNN Architecture	Google	https://arxiv.org/abs/1801.04381	Deep learning model for image classification
Matplotlib Visualization Library	Matplotlib Development Team	Version 3.x	Used for generating plots and performance visualization
NVIDIA GPU	NVIDIA	RTX Series	Computational hardware for model training
NumPy Numerical Computing Library	NumPy Developers	Version 1.x	Data analysis software
OpenCV Image Processing Library	OpenCV Foundation	Version 4.x	Image processing library
Pandas Data Analysis Library	Pandas Development Team	Version 1.x	Data analysis software
Python Programming Environment	Python Software Foundation	Version 3.9+	Data analysis software
SHAP Explainability Library	SHAP Project	Version 0.x	Model interpretability technique
SMOTE Oversampling Technique	imbalanced-learn Project	Version 0.x	Class balancing technique for handling imbalanced datasets
Scikit-learn Machine Learning Library	scikit-learn Project	Version 1.x	Machine learning library
TensorFlow Deep Learning Framework	Google	Version 2.x	Deep learning framework