This study used publicly available, fully anonymized dermoscopic datasets and involved no direct human participation; therefore, ethical committee approval was not required. The Table of Materials contains details of all the materials or tools used in this study. Table 1 includes details of the hardware and software environment, such as processor type, memory, operating system, and software frameworks. Table 2 includes details of the class-wise precision, recall, F1-score, and support for each skin lesion category.
Overall workflow of the proposed multimodal skin lesion classification framework
The overall aim of this research is to build an accurate and interpretable framework for multi-class classification of skin lesions. The workflow starts with data collection and preprocessing of the HAM10000 dataset, then proceeds to feature extraction using deep learning architectures and the inclusion of clinical metadata. Afterward, several machine learning classifiers are trained and optimized, and their results are aggregated in an ensemble strategy. Lastly, the model's predictions are interpreted using explainability techniques, and the model's effectiveness is evaluated for use in real-world clinical decision support.
To improve the predictive accuracy of the proposed system, a multimodal machine learning pipeline is used that combines image-based features with clinical metadata (as shown in Figure 1). By combining the visual information in dermoscopic images with patient-related information, the model can identify more detailed patterns associated with different skin lesions. This combination allows the system to make better predictions, which ultimately improves the quality and usefulness of skin lesion classification. Deep features are extracted with three pre-trained convolutional neural networks (EfficientNet-B4, DenseNet201, and MobileNetV2), which capture a variety of complementary patterns in dermoscopic images. These architectures learn high-level patterns in the appearance of skin lesions, such as variations in color, texture, and structure. Next, a feature fusion module merges the deep features with the clinical features and demographic data to produce a rich multimodal feature representation. The merged data are then split into training, validation, and test sets to ensure appropriate model evaluation. An ensemble strategy is used to further enhance prediction accuracy: the class probabilities produced by several models are averaged, and the final prediction is made from the averaged probabilities, which improves generalization and reduces the variance of individual models. In addition, explainability methods are integrated to clarify how the model makes its decisions.
SHapley Additive Explanations (SHAP) provide feature-level explanations by quantifying the contribution of each input variable, while Local Interpretable Model-agnostic Explanations (LIME) highlight the regions within dermoscopic images, at the pixel level, that influence the prediction. Combined, these techniques make the models more interpretable and help clinicians understand how the system reaches its decisions. As a result, the proposed pipeline provides a system that is understandable and privacy-conscious, increasing transparency and trust and enabling more dependable skin cancer diagnosis in a real-world healthcare setting.
Dataset description with preparation
In this paper, the HAM10000 (Human Against Machine with 10,000 training images) dataset is used as the primary dataset for multi-class skin lesion classification. The dataset contains over 10,000 dermoscopic images collected from various clinical sources and populations, making it one of the most widely used benchmark datasets in dermatological image analysis. Each image in the dataset is accompanied by important clinical metadata, including image identifiers, diagnostic labels, patient age, sex, and the anatomical location of the lesion. The dataset covers seven diagnostic categories: actinic keratoses (akiec), basal cell carcinoma (bcc), benign keratosis (bkl), dermatofibroma (df), melanocytic nevi (nv), vascular lesions (vasc), and melanoma (mel).
Clinical metadata preprocessing
Clinical metadata, namely age, sex, and the anatomical location of the lesion, were added to the classification pipeline as auxiliary features. Missing or unknown values were handled with a deterministic preprocessing approach. For the age variable (numerical), missing values were imputed with the median age calculated on the training set. Median imputation was chosen because it is robust to outliers and skewed distributions, both of which are prevalent in clinical data. For sex and lesion location (categorical variables), missing or unspecified values were not excluded; they were assigned to a special category labeled 'unknown'. This approach retains all available samples and leaves the model free to determine whether missingness itself is predictive. One-hot encoding was then applied to the categorical variables to make them compatible with machine learning models. All preprocessing steps, including imputation and encoding, were fitted only on the training set, and the same fitted transformations were then applied to the validation and test sets to avoid data leakage. No samples were excluded solely because of missing clinical metadata, which ensured maximal use of the data and methodological consistency.
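The metadata preprocessing described above can be sketched as follows. This is a minimal illustration rather than the study's implementation: the function names, the dictionary-based record layout, and the column name `localization` are assumptions, and mapping categories that never appeared in the training set to 'unknown' is a design choice made here for the sketch.

```python
import numpy as np

CATEGORICAL = ("sex", "localization")  # hypothetical column names

def fit_metadata_preprocessor(train_rows):
    """Learn imputation and encoding parameters from the training rows only."""
    ages = [r["age"] for r in train_rows if r["age"] is not None]
    categories = {}
    for col in CATEGORICAL:
        seen = {r[col] if r[col] else "unknown" for r in train_rows}
        seen.add("unknown")  # always reserve a slot for missing values
        categories[col] = sorted(seen)
    return {"median_age": float(np.median(ages)), "categories": categories}

def transform_metadata(rows, params):
    """Apply the fitted median imputation and one-hot encoding to any split."""
    out = []
    for r in rows:
        age = params["median_age"] if r["age"] is None else float(r["age"])
        vec = [age]
        for col in CATEGORICAL:
            cats = params["categories"][col]
            # Missing or unseen values fall into the 'unknown' category.
            value = r[col] if r[col] in cats else "unknown"
            vec.extend(1.0 if value == c else 0.0 for c in cats)
        out.append(vec)
    return np.array(out)

train_rows = [
    {"age": 40, "sex": "male", "localization": "back"},
    {"age": None, "sex": "female", "localization": None},
    {"age": 60, "sex": None, "localization": "face"},
]
test_rows = [{"age": None, "sex": "male", "localization": "scalp"}]

params = fit_metadata_preprocessor(train_rows)   # fitted on training data only
X_train = transform_metadata(train_rows, params)
X_test = transform_metadata(test_rows, params)   # reuses the training median
```

Because the median and the category lists come from the training rows alone, the test row is imputed with the training median, which is the leakage-avoidance property the paragraph describes.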

Figure 1: Multimodal system for skin lesion classification. The study approach combines dermoscopic image features with patient metadata to classify skin lesions using ensemble deep learning models. The framework includes preprocessing, feature extraction, multimodal fusion, and classification, allowing for enhanced diagnostic performance and interpretability.
The workflow depicts the proposed classification pipeline, based on dermoscopic images and clinical metadata from the HAM10000 skin lesion dataset. Images are preprocessed, and deep features are extracted using EfficientNet-B4, DenseNet201, and MobileNetV2. The clinical metadata are encoded, and feature fusion combines the image features with the clinical metadata. To address class imbalance, the class balancing technique is applied in the fused multimodal feature space rather than to the raw images or individual feature streams, so that synthetic samples preserve the joint visual and clinical features and unrealistic samples are avoided. Classifiers such as XGBoost, LightGBM, and a deep neural classifier are then trained on the merged features.

Figure 2: Example dermoscopic images from seven different diagnostic groups from the HAM10000 dataset. Images show typical visual features used for automated classification. (A) Actinic keratoses (akiec), demonstrating rough surfaces with irregular pigmentation. (B) Basal cell carcinoma (bcc), with irregular shapes and blood vessels. (C) Benign keratosis-like lesions (bkl), showing keratotic features with light brown surfaces. (D) Dermatofibroma (df), with a central scar-like appearance and pigmentation. (E) Melanocytic nevi (nv), benign and relatively symmetric moles. (F) Vascular lesions (vasc), showing a reddish-purple appearance due to blood vessels. (G) Melanoma (mel), which presents as an irregularly shaped, asymmetric, and multi-pigmented lesion.
These dermoscopic images reveal the visual heterogeneity of skin lesions, which vary in pigmentation, texture, and structural morphology. These variations pose a great challenge to automated classification systems and underline the importance of deep learning-based feature extraction techniques that are sensitive to subtle diagnostic patterns. Following the dataset description, Figure 2 illustrates the seven categories of skin lesions included in the HAM10000 dataset, which are commonly studied in dermatological diagnostic imaging research. These classes include Actinic Keratoses (akiec), Basal Cell Carcinoma (bcc), Benign Keratosis (bkl), Dermatofibroma (df), Melanocytic Nevi (nv), Vascular Lesions (vasc), and Melanoma (mel)21. Each lesion type has distinctive visual features, as shown in Figure 3, which include variation in pigmentation patterns, surface texture, color distribution, and abnormalities along the lesion borders. These are the characteristics that dermatologists consider during clinical examination, and machine learning models must therefore capture them well to classify lesions correctly. Despite these differentiating characteristics, many lesions appear virtually identical, which makes them difficult to tell apart from dermoscopic images alone. The distinction between certain lesion types is typically extremely subtle but clinically pertinent, which makes automatic classification challenging. This is why there is an urgent need for powerful AI models that can learn fine-grained visual patterns and subtle differences among lesion classes.
Accurately modeling these characteristics not only improves the model's ability to discriminate among lesion types but also supports earlier diagnosis of dangerous conditions such as melanoma. Ultimately, it can improve diagnostic accuracy and inform clinical decisions that lead to better patient outcomes.

Figure 3: Class-wise distribution of skin lesions in the HAM10000 dataset. The figure shows the distribution of the seven lesion categories considered in this study: Actinic Keratoses (akiec), Basal Cell Carcinoma (bcc), Benign Keratosis-like lesions (bkl), Dermatofibroma (df), Melanocytic Nevi (nv), Vascular Lesions (vasc), and Melanoma (mel). This graph illustrates the class imbalance of the lesion classes.
The analysis of the dataset shows that the lesion classes are imbalanced. Melanocytic Nevi (nv) is the most common type, with approximately 6,705 samples, followed by Melanoma (1,113) and Benign Keratosis (1,099). In contrast, some clinically significant lesion types are far less represented, such as Dermatofibroma (115) and Vascular Lesions (142). This imbalance is a risk for machine learning models because they tend to become biased towards the majority classes and may fail to detect rare but clinically significant lesions. To deal with this issue and improve model performance across all classes, advanced preprocessing strategies are needed, including targeted data augmentation and class balancing. The data can be balanced using the class balancing technique and class weight adjustment, which encourage the model to learn meaningful patterns in the underrepresented classes. The hyperparameters used for XGBoost and LightGBM were primarily set to their default configurations, with minor adjustments based on preliminary experiments. For the deep neural classifier, architectural and training parameters such as the number of layers, neurons, learning rate, batch size, and number of epochs were selected empirically using validation data. The complete set of hyperparameters is provided in Table 3. In total, 10,015 dermoscopic images were used in the present study. This provides a large collection of data for training and testing, as well as a demanding benchmark against which to appraise the effectiveness of the proposed skin lesion classification system.
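To make the class weight adjustment concrete, the sketch below computes inverse-frequency ("balanced") class weights from the class counts. The weighting formula, total / (number of classes × class count), is one common heuristic and is an assumption here, since the text does not state which weighting scheme was used. The counts for nv, mel, bkl, df, and vasc are those quoted above; the counts for bcc and akiec are not given in the text and are assumed from the public HAM10000 summary.

```python
# Class counts: nv, mel, bkl, df, and vasc are quoted in the text;
# bcc and akiec are ASSUMED from the public HAM10000 summary.
counts = {
    "nv": 6705, "mel": 1113, "bkl": 1099,
    "bcc": 514, "akiec": 327, "vasc": 142, "df": 115,
}

total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: rare classes receive proportionally larger weights.
class_weights = {cls: total / (n_classes * n) for cls, n in counts.items()}

for cls, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{cls:>5}: {weight:.3f}")
```

Under this heuristic, the rarest class (df) is weighted far more heavily than nv, which is the intended counterweight to the majority-class bias described above.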
Data preprocessing
The preprocessing pipeline prepares the HAM10000 dataset for multimodal learning by standardizing images, extracting deep features, integrating clinical metadata, and addressing class imbalance.
Image Standardization: All dermoscopic images were resized to 224 × 224 pixels and normalized using z-score normalization:
$$I_{\text{norm}} = \frac{I - \mu}{\sigma} \qquad (1)$$
where $I$ represents the raw image, $\mu$ denotes the pixel-wise mean, and $\sigma$ is the standard deviation.
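As a rough illustration of the z-score normalization step, the sketch below standardizes one image with NumPy. It assumes that the mean and standard deviation are computed over all pixels and channels of a single image; the text does not say whether per-image, per-channel, or dataset-level statistics were used. The random array only stands in for a resized 224 × 224 dermoscopic image, and `zscore` and `eps` are illustrative names.

```python
import numpy as np

def zscore(image, eps=1e-8):
    """Standardize an image to zero mean and unit variance."""
    mu = image.mean()
    sigma = image.std()
    # eps guards against division by zero for a constant image.
    return (image - mu) / (sigma + eps)

rng = np.random.default_rng(0)
# Placeholder for a resized RGB dermoscopic image (values 0..255).
image = rng.integers(0, 256, size=(224, 224, 3)).astype(np.float64)
normalized = zscore(image)
```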
Deep Feature Extraction: Complementary deep features were extracted using three pre-trained convolutional neural networks: EfficientNet-B4, DenseNet201, and MobileNetV2. Each network $\phi_k$ maps the normalized image $I_{\text{norm}}$ to a feature vector:
$$F_k = \phi_k(I_{\text{norm}}), \quad k \in \{\text{EffB4}, \text{Dense}, \text{MobV2}\} \qquad (2)$$
The extracted features were concatenated to form a unified representation:
$$F_{\text{fusion}} = F_{\text{EffB4}} \,\|\, F_{\text{Dense}} \,\|\, F_{\text{MobV2}} \qquad (3)$$
where $\|$ denotes concatenation.
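Clinical Metadata Integration: Clinical attributes, including age, sex, and lesion localization, were cleaned, label encoded, and normalized using min-max scaling:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (4)$$
where $x$ is a raw attribute value and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that attribute.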
The processed metadata vector $M_{\text{clinical}}$ was fused with the image features to construct the final multimodal input:
$$F_{\text{combined}} = F_{\text{fusion}} \,\|\, M_{\text{clinical}} \qquad (5)$$
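The concatenation, min-max scaling, and fusion steps can be sketched together as below. This is an illustration under stated assumptions: the feature matrices are random placeholders rather than real network outputs, and the feature widths (1792, 1920, and 1280) assume the globally pooled final convolutional features of each backbone, which the text does not specify. The metadata values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5

# Placeholder deep features; widths are ASSUMED pooled-feature sizes.
F_effb4 = rng.normal(size=(n_samples, 1792))
F_dense = rng.normal(size=(n_samples, 1920))
F_mobv2 = rng.normal(size=(n_samples, 1280))

# Image-feature fusion by concatenation along the feature axis.
F_fusion = np.concatenate([F_effb4, F_dense, F_mobv2], axis=1)

def min_max_scale(x, x_min, x_max):
    """Min-max scaling; constant columns are left at zero."""
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span

# Made-up label-encoded metadata: [age, sex code, localization code].
M_raw = np.array([
    [45.0, 0, 2],
    [80.0, 1, 0],
    [30.0, 1, 5],
    [60.0, 0, 3],
    [55.0, 2, 1],
])
x_min, x_max = M_raw.min(axis=0), M_raw.max(axis=0)
M_clinical = min_max_scale(M_raw, x_min, x_max)

# Multimodal fusion: append the scaled metadata to the image features.
F_combined = np.concatenate([F_fusion, M_clinical], axis=1)
```

In a real pipeline, `x_min` and `x_max` would be taken from the training split only and reused for the validation and test splits.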
Dataset Splitting: A stratified split was applied to preserve the class distribution:
$$D_{\text{train}}, D_{\text{test}} = \text{Split}(F_{\text{combined}}, 0.8) \qquad (6)$$
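A stratified split can be sketched with plain NumPy as below. This hand-rolled helper is only illustrative (the name `stratified_split` and the toy labels are assumptions); in practice, scikit-learn's `train_test_split` with its `stratify` argument is a common way to obtain the same behavior.

```python
import numpy as np

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices so each class keeps roughly the same train/test ratio."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        cut = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

# Toy imbalanced labels standing in for the lesion classes.
labels = np.array(["nv"] * 50 + ["mel"] * 10 + ["df"] * 5)
train_idx, test_idx = stratified_split(labels, train_fraction=0.8)
```

Splitting class by class guarantees that even the smallest class contributes samples to both partitions, which a purely random split does not.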
Class imbalance handling: The HAM10000 dataset has a severe class imbalance: nevus (nv) samples dominate, whereas minority classes such as df and vasc are underrepresented. To reduce this problem, the Synthetic Minority Oversampling Technique (SMOTE), referred to here as the class balancing technique, was employed. New synthetic samples were produced as:
$$x_{\text{new}} = x_i + \lambda\,(x_{zi} - x_i) \qquad (7)$$

where $x_i$ is a minority class sample, $x_{zi}$ is one of its nearest neighbors, and $\lambda$ is a random value sampled from a uniform distribution between 0 and 1. The synthetic sample, as shown in Figure 4, is generated along the line segment joining $x_i$ and $x_{zi}$.
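The interpolation rule above can be illustrated with a small NumPy sketch. It is a simplified stand-in for a full Synthetic Minority Oversampling Technique implementation (for example, the one provided by the imbalanced-learn library): the function name, the brute-force neighbor search, and the default `k=5` are assumptions, and the minority samples are assumed to be distinct points.

```python
import numpy as np

def smote_sample(X_minority, k=5, n_new=1, seed=0):
    """Generate synthetic points between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    k = min(k, len(X) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))                   # pick a minority sample x_i
        dists = np.linalg.norm(X - X[i], axis=1)
        # Skip the closest entry, which is x_i itself (distance 0).
        neighbors = np.argsort(dists)[1:k + 1]
        zi = rng.choice(neighbors)                 # pick one nearest neighbor
        lam = rng.uniform(0.0, 1.0)                # interpolation factor
        synthetic.append(X[i] + lam * (X[zi] - X[i]))
    return np.array(synthetic)

# Toy minority class: the four corners of the unit square.
X_min = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(X_min, k=2, n_new=20)
```

Because each synthetic point lies on a segment between two existing minority samples, every generated point stays inside the region spanned by the original data.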

Figure 4: Class distribution in the HAM10000 dataset before/after applying the class balancing technique. (A) Before class balancing, with imbalance across lesion classes. (B) After class balancing in the combined feature space, where the representation of all classes is equal to avoid bias in the classifier training process.
To address the class imbalance in the HAM10000 dataset, the Synthetic Minority Over-Sampling Technique (the class balancing technique) is applied. It generates synthetic samples for the minority classes by interpolating between existing data points, which increases the representation of underrepresented lesion categories. Producing more examples of these minority classes yields a dataset that is balanced across all seven lesion types. This balanced representation enables the classification models to learn every class more effectively and reduces bias towards the majority classes. Consequently, the model classifies more fairly and is more sensitive to rare yet clinically important skin lesions.
Privacy-preserving learning framework
The proposed system is a privacy-aware and interpretable multimodal framework for automated skin lesion classification. Its ultimate aim is to enhance diagnostic performance while safeguarding sensitive patient information throughout the training process. Patient privacy is essential in medical practice because healthcare data privacy laws and ethical considerations carry great weight in healthcare settings. The proposed framework therefore includes a decentralized learning scheme based on the ideas of federated learning. In this decentralized setting, model training is carried out across a group of distributed clients instead of aggregating all patient data in a centralized location. Each participating client trains the model locally on its own data, and raw patient data do not leave the local environment. Instead of sensitive medical records, model updates or parameters are sent to a central server to be aggregated. This cooperative approach to learning enables different institutions or data sources to contribute to model training without compromising data privacy.
Let $w_t^{(k)}$ be the model parameters of the $k$th client at the $t$th iteration, and let $n_k$ be the sample size at that client. The update of the global model is calculated as:
$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_t^{(k)}, \qquad n = \sum_{k=1}^{K} n_k \qquad (8)$$
where $K$ is the number of participating clients.
This aggregation strategy ensures that clients with larger datasets contribute proportionally more to the global model while still allowing smaller clients to participate in the learning process. By enabling collaborative training without exchanging raw patient data, the proposed framework maintains privacy while still benefiting from distributed knowledge across datasets.
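The sample-size-weighted aggregation can be illustrated with a short NumPy sketch. The parameter vectors and client sizes below are made-up values chosen so the result is easy to check by hand, and `fedavg` is an illustrative name; a real model would aggregate each layer's parameters in the same way.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()          # n_k / n for every client
    return sum(c * w for c, w in zip(coeffs, client_weights))

# Three hypothetical clients with two-parameter models.
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 1.0])
w3 = np.array([1.0, 1.0])
w_global = fedavg([w1, w2, w3], client_sizes=[600, 300, 100])
# 0.6 * w1 + 0.3 * w2 + 0.1 * w3 = [0.7, 0.4]
```

The largest client (600 samples) pulls the global parameters most strongly, while the smallest client (100 samples) still contributes a tenth of the average.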
Federated experimental setup
A simulated federated learning setup using the HAM10000 dataset was designed to assess the effectiveness of the proposed privacy-aware framework. The data were divided among three clients to simulate a realistic multi-institutional environment with non-identically distributed (non-IID) data. Each client holds a different mix of lesion classes, which reflects the real-world variation between clinical centers. The identical multimodal feature extraction pipeline (EfficientNet-B4, DenseNet201, MobileNetV2, and clinical metadata) was run locally at every client. During training, clients updated their local models independently, and only the learned parameters were sent to the central server to be aggregated by the FedAvg algorithm. The federated model was compared with the centralized training approach to assess the trade-off between predictive accuracy and privacy. With the client partition shown in Figure 5, the test results indicate that the federated model performs competitively, with only a slight decrease in accuracy relative to centralized learning and much improved data privacy.

Figure 5: Client-wise distribution of the HAM10000 dataset. This shows the allocation of skin lesion data among clients and demonstrates the heterogeneity of data among clients, a critical aspect of federated learning.
The HAM10000 dataset was divided among three clients with heterogeneous (non-IID) distributions to model real-life clinical conditions. The distribution of lesion categories differs from client to client, especially for the nevus (nv) class, which is not evenly distributed across clients. This arrangement reflects the real-world difficulties of federated learning, in which data are not evenly distributed across institutions.
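The text does not say how the non-IID client partition was generated. One widely used option is Dirichlet label-skew partitioning, sketched below purely as an assumption and not as the procedure used in this study. The function name, the concentration parameter `alpha`, and the toy labels are illustrative; smaller `alpha` values give more strongly skewed clients.

```python
import numpy as np

def dirichlet_partition(labels, n_clients=3, alpha=0.5, seed=0):
    """Assign sample indices to clients with class proportions drawn per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Share of this class that each client receives.
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Toy labels: seven classes with 100 samples each.
labels = np.repeat(np.arange(7), 100)
parts = dirichlet_partition(labels, n_clients=3, alpha=0.5)
```

Every sample is assigned to exactly one client, so the union of the client datasets reproduces the full dataset while the per-client class mix differs.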
Performance comparison: centralized vs federated learning
To evaluate the effectiveness of the proposed federated learning framework, a comparative analysis was conducted between centralized and federated training strategies using the HAM10000 dataset, as shown in Figure 6. In the centralized setting, all data samples were aggregated into a single training pool. The best-performing centralized model, the stacked ensemble, achieved an overall accuracy of 96%. In contrast, the federated setting distributed the dataset across three clients with non-identically distributed (non-IID) data, where each client trained the model locally and shared only model parameters using FedAvg. The federated model achieved an overall accuracy of approximately 94%, corresponding to a performance difference of 2% compared to the centralized approach, as shown in Table 4. This marginal decrease is expected due to decentralized optimization and heterogeneous data distribution across clients.
Despite this small decrease, the federated model retained strong predictive performance. In centralized training, the class-wise results show that majority classes, such as nevus (nv) (F1-score = 1.00), remain stable, while minority classes, such as dermatofibroma (df) (F1-score ≈ 0.65–0.66), are more sensitive to distribution imbalance, which could affect federated performance even more. Notably, the federated structure reduces the risk of exposing sensitive patient information because it does not require clients to share raw medical data.

Figure 6: Federated learning vs. centralized learning comparison. This figure compares learning paradigms using performance metrics such as accuracy, precision, recall, and F1-score. This demonstrates the capability of federated learning to achieve performance comparable to that of the traditional learning approach while preserving privacy.
The results in Table 4 indicate that the federated learning model is competitive, with accuracy only approximately 2% lower than that of the centralized model. This slight reduction can be explained by the decentralized optimization and the non-IID data distribution. However, the federated model has a considerable advantage in terms of privacy protection, as sensitive patient information is not shared among the clients. To provide a fair comparison with the centralized stacked ensemble model, the federated model was tested with the same architecture and hyperparameters. The privacy-preserving aspect discussed in this study is conceptual and intended to highlight the potential integration of techniques such as federated learning in future work. No experimental validation of privacy-preserving mechanisms is performed in the current implementation.
Multimodal feature fusion
The diagnosis of skin lesions usually draws on both visual examination and clinical history. In most cases, dermatologists do not consider dermoscopic images in isolation; they interpret them in relation to patient information (age, sex, and location of the lesion) to reach a diagnostic judgment. The proposed system is inspired by this clinical workflow and uses a multimodal learning approach to combine image-based and clinical data. Pre-trained CNNs extract deep features from the dermoscopic images. These networks capture intricate visual patterns, including color changes, lesion shapes, structural anomalies, and texture features. Nevertheless, image features alone might not be sufficient to capture the clinical context of a lesion, so the clinical metadata associated with every image are also included in learning. A feature fusion module integrates the deep image features with the processed clinical attributes and demographic information. The result is an integrated multimodal representation that contains both visual and contextual information for every lesion. By integrating several data sources, the model can exploit complementary patterns that enhance overall classification ability. The multimodal representation allows the system to differentiate more effectively between visually similar lesions while also taking clinical indicators into account. Because it more closely approximates how dermatologists assess lesions in clinical practice, the model is more clinically meaningful and effective.
Stacked ensemble learning
The proposed framework uses a stacked ensemble learning strategy to further improve the predictive ability of the system. Ensemble learning combines two or more predictive models to improve generalization and reduce the prediction errors that can occur with single models. Rather than relying on a single classifier, multiple base learners are trained independently on the multimodal feature representation. Each base learner estimates the probability that a given sample belongs to each lesion class. These probability predictions are then aggregated at a meta-level. A weight is assigned to each base learner to reflect its relative importance to the final prediction, and a softmax activation function is applied to the aggregated output to generate normalized class probabilities. The stacked ensemble method has several benefits. First, combining different models reduces prediction variance and thus improves generalization. Second, it improves robustness, since different models capture different patterns in the data. Third, ensemble learning improves the classification of minority lesion classes, which matters in medical data, where certain conditions of clinical interest are less prevalent.
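The meta-level aggregation described above (a weighted combination of base-learner probabilities followed by a softmax) can be sketched as follows. The weights and probabilities below are hypothetical values for a three-class toy problem; in the actual framework the weights would be learned or tuned, and the text does not state how.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def stacked_predict(base_probs, weights):
    """Combine base-learner probabilities with per-model weights, then softmax."""
    base_probs = np.asarray(base_probs)          # (n_models, n_samples, n_classes)
    weights = np.asarray(weights, dtype=float)   # (n_models,)
    scores = np.tensordot(weights, base_probs, axes=1)  # (n_samples, n_classes)
    return softmax(scores)

# Hypothetical outputs of three base learners for one sample, three classes.
base_probs = [
    [[0.7, 0.2, 0.1]],
    [[0.6, 0.3, 0.1]],
    [[0.2, 0.5, 0.3]],
]
weights = [0.4, 0.4, 0.2]  # illustrative model weights

final_probs = stacked_predict(base_probs, weights)
predicted_class = final_probs.argmax(axis=1)
```

Although the third learner favors the second class, the two more heavily weighted learners outvote it, which illustrates how the ensemble dampens the error of a single model.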
Explainable artificial intelligence integration
Although high prediction accuracy is critical, medical AI systems should also offer clear explanations of their decisions. To trust AI systems and use them effectively in practice, clinicians should be able to understand how a model arrives at the diagnosis it produces. To meet this need, the proposed framework incorporates explainable artificial intelligence (XAI) methods, as depicted in Figure 7.

Figure 7: Confusion matrices of different classification models for multi-class skin lesion classification. (A) XGBoost, (B) LightGBM, (C) Deep Neural Classifier, and (D) Stacked Ensemble model. Each confusion matrix shows the relationship between the true class (rows) and the predicted class (columns) for all seven types of skin lesions: akiec, bcc, bkl, df, mel, nv, and vasc. The XGBoost and LightGBM models perform well for the nv and bkl classes, though there is some confusion between mel and nv. The Deep Neural Classifier improves the classification of bkl and df and decreases off-diagonal confusion. The Stacked Ensemble model shows the greatest classification consistency, with the diagonal becoming increasingly dominant.
The system includes two popular explainability approaches, SHapley Additive Explanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), to give insight into the model's predictions. SHAP provides feature-level explanations by measuring how much each input feature contributes to the overall prediction. It helps determine which clinical variables and visual features have the most impact on the classification result, which enables researchers and clinicians to see the model's overall behavior across the dataset. LIME, on the other hand, provides local explanations of individual predictions. It highlights the areas of the dermoscopic image that have the greatest impact on the model's decision. These pixel-level visual explanations enable clinicians to inspect the areas of the lesion that informed the classification. By integrating both techniques, the proposed framework offers global and local interpretability. This dual-explanation mechanism enhances transparency and enables clinicians to assess whether the model is focusing on medically significant patterns.
Clinical decision support potential
Privacy-preserving learning, multimodal feature fusion, ensemble modeling, and explainable AI are key components of an integrated and robust system for automatic skin lesion classification. Ideally, the system should not only have high prognostic power, but also be transparent and secure, which are two key factors in medical systems, as shown in Figure 8.

Figure 8: Receiver operating characteristic (ROC) curves for the stacked ensemble model. (A–C) This shows the ROC curves for the seven skin lesion types, with true positive rate (sensitivity) and false positive rate (1-specificity). The area under the curve (AUC) represents the performance of the stacked ensemble model in discriminating between the classes.
This system provides explainable predictions and privacy protection, which makes it a useful complement to other dermatological diagnostic systems. It allows health practitioners and dermatologists to assess lesion suspiciousness and improve diagnostic accuracy and, as a result, helps them diagnose patients at an early stage when a more serious disease (e.g., melanoma) may be present. In essence, as shown in Figure 9, this system seeks to bring advanced artificial intelligence (AI) into real-world practice, helping dermatologists diagnose patients more accurately and with more confidence while protecting the privacy, security, and comfort of patients.

Figure 9: Explainability results using model interpretability techniques for multi-class skin lesion classification. (A) SHAP plot showing feature contributions influencing benign and malignant lesion predictions. (B) LIME explanation for the bcc prediction, illustrating the features contributing positively and negatively to the classification outcome. (C) LIME explanation for the akiec prediction, highlighting the most influential features involved in the model decision-making process. These interpretability visualizations demonstrate the regions and extracted features that significantly affect the model’s predictions, improving transparency and understanding of the classification process in skin lesion assessment.
Evaluation strategy
To avoid sampling bias and maintain the original class distribution across all skin lesion categories, the dataset was divided using a stratified 80:20 train–test split. The training subset was then split 90:10 into training and validation sets to tune the hyperparameters and optimize the model. The test set was not used at any stage of training; it was applied only once, after training, as a final test to avoid data leakage and ensure an unbiased performance assessment. All models were preprocessed and trained under identical settings, with the same data partitioning, augmentation, and evaluation protocols, which allowed for fair and reproducible comparisons. The models were evaluated on accuracy, precision, recall, F1 score, and AUC, with a detailed analysis of the class-wise results to determine their robustness for both majority and minority lesion classes. This standardized validation protocol increases the reliability, transparency, and generalizability of the proposed approach and helps avoid inconsistencies in performance reporting.
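The class-wise metrics named above can be derived from a confusion matrix whose rows are true classes and whose columns are predicted classes. The sketch below uses a made-up three-class confusion matrix, not results from this study, and `classwise_metrics` is an illustrative helper name.

```python
import numpy as np

def classwise_metrics(cm):
    """Per-class precision, recall, F1, and support from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: samples predicted as each class
    actual = cm.sum(axis=1)      # row sums: true samples of each class (support)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1, actual.astype(int)

# Made-up confusion matrix (rows = true class, columns = predicted class).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 4],
]
precision, recall, f1, support = classwise_metrics(cm)
accuracy = np.trace(np.asarray(cm)) / np.sum(cm)
```

Reporting these values per class, rather than accuracy alone, is what exposes weak performance on minority classes that a high overall accuracy can hide.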