Research Article

A Computational Intelligence-based Early Diagnosis of Asthma Disease: A Saudi Arabian Case Study

DOI:

10.3791/70040

June 16th, 2026

Summary


The current study investigates several machine learning algorithms on a Saudi Arabian dataset for asthma detection. In contrast to state-of-the-art approaches that employ machine learning on clinical datasets for asthma diagnostics, the proposed scheme achieves better performance, with a 3.9% improvement in diagnosis accuracy.

Abstract


Research indicates a rise in chronic diseases, including asthma. The number of asthma patients in Saudi Arabia is a cause for concern due to weather conditions and lifestyle, especially in the post-pandemic era. This calls for an intelligent system that detects asthma at an early stage, thereby preventing the disease or enabling early treatment. In this study, machine learning is used to develop tools that track asthma symptoms at an early stage. Although there have been various prior attempts to apply machine learning to predict the occurrence of asthma, identifying the disease at the pre-symptomatic stage, particularly in the Saudi Arabian context, remains a relatively neglected area. The dataset for the current study was obtained from King Fahad University Hospital, Dammam, Saudi Arabia, and included standard tests performed on patients, including blood tests, viral tests, and biochemistry tests. The dataset contains 17 attributes and 328 instances: 165 asthma-positive and 163 asthma-negative. The methods applied are random forests (RF), artificial neural networks (ANN), support vector machines (SVM), and naive Bayes (NB). Each method was chosen for its distinctive features. The experiments showed that RF, SVM, and ANN each reached 94% accuracy, the highest observed, which improves upon the state of the art by 3.9%. Notably, only nine of the seventeen available features were used to achieve this accuracy. Although RF, SVM, and ANN achieved the same accuracy, ANN has a higher error cost. Therefore, RF and SVM are superior based on the pattern of results and are recommended for this problem.

Introduction


Prior research indicates that chronic diseases have become increasingly common1. Asthma is a chronic inflammatory disease of the airways characterized by recurrent episodes of breathlessness and wheezing. The prevalence of asthma varies globally, ranging from 1–20% in both children and adults, and the disease affects millions of people worldwide2. This wide variation is related to environmental differences between countries, as well as the use of different assessment instruments and diverse epidemiologic definitions of asthma. Asthma has a relatively low fatality rate compared with several other well-known chronic diseases2. However, reports indicate that the prevalence of asthma in the Kingdom of Saudi Arabia is increasing3,4.

Asthma is a chronic inflammatory condition that narrows the airway lumen. The narrowing of the airway is caused by increased mucus secretion, as well as bronchial wall thickening due to edema, subepithelial fibrosis, and smooth muscle hypertrophy. Pathophysiological mechanisms driven by various cell types, including immune cells, have been used to characterize these conditions5. Due to severe weather conditions and a mainly indoor lifestyle, asthma is among the most widespread chronic diseases in Saudi Arabia. Several surveys have demonstrated a high rate of asthma6. The disease has become a significant issue, requiring an early response system that helps reduce the number of cases and makes early intervention feasible, so that the disease can be prevented or treated at an earlier stage.

Beyond individuals, asthma has a profound impact on families and communities. If not controlled, it imposes harsh limitations on the patient's life and prevents them from engaging in many daily activities. In Saudi Arabia, harsh weather conditions, including a hot atmosphere, widespread sandstorms, and excessive use of indoor air conditioning/cooling systems, are significant contributors to asthma. The Saudi Thoracic Society (STS) has made noticeable efforts to provide the best medications for respiratory infections7. However, a proactive diagnosis of asthma is needed to reduce the infection rate, especially in a post-pandemic (COVID-19) scenario. Previous work further found that family history, smoking, and allergic rhinitis are among the most significant risk factors for developing asthma among Saudi adults, and it recommended additional research into other factors, which sets the basis for the present study.

In this research, building on previous studies of asthma diagnosis, the aim is to enhance diagnostic accuracy. The proposed techniques, namely random forest (RF), artificial neural network (ANN), support vector machine (SVM), and naive Bayes (NB), have been utilized in earlier studies, yielding noticeable improvements in diagnosis results. For example, researchers have used ANN, SVM, and NB in studies8,9,10. In the current study, machine learning approaches are investigated as an early warning system that detects even small indicators of asthma at the earliest (pre-symptomatic) stage. Major factors behind the prevalence of asthma in Saudi Arabia, especially in the post-pandemic era, include an indoor-oriented lifestyle due to severe weather conditions, air conditioning ducts, and sandstorms. However, no noteworthy study has investigated these crucial factors, which have been witnessed in the recent past. To address this research gap, the current study investigates the role of machine learning in the early detection of asthma among Saudi adults. Furthermore, the primary objective is to achieve the highest possible accuracy with the fewest features. Although there have been recognizable efforts in the past to utilize machine learning to predict the presence of asthma8,9,10, the current study is the first to utilize a Saudi Arabian dataset obtained from a public sector hospital in the Eastern Province, focusing on diagnosing the disease at an early stage (preemptive diagnosis). Machine learning (ML) is recognized as a method for extracting knowledge from collected data11,12. It makes predictions by learning from earlier instances using various techniques. ML encompasses multiple methods, algorithms, and strategies to address analytical and predictive problems in broad medical fields, such as early detection of disease13,14.

Several studies have predicted asthma using various techniques. Researchers developed an artificial neural network (ANN) for asthma classification based on dynamic and static patient tests8. The ANN uses measured parameters from spirometry, impulse oscillometry (IOS), auscultation, allergy results, and information on symptoms, which enable it to suggest the appropriate asthma classification. Likewise, researchers conducted a study to address the challenges in diagnosing chest diseases7. Support vector machine (SVM) and adaptive SVM (ASVM) methods were used to diagnose chest diseases such as asthma, and the best classification accuracy for all chest diseases was obtained with the ASVM method. Other researchers used several neural network structures, such as a multi-layer NN (MLNN) trained with the backpropagation algorithm (BPA) with one and two hidden layers, probabilistic NN (PNN), learning vector quantization (LVQ), and general regression NN (GRNN), for chest disease diagnosis, including asthma15. Additionally, a comparative study aimed at diagnosing chest diseases using an artificial immune system (AIS). These studies implemented various structures for diagnosing chest diseases using different datasets. For asthma, the highest result was achieved using AIS, with an overall accuracy of 90.91%16. Moreover, researchers investigated the efficacy of a deep NN (DNN) for improving accuracy in asthma diagnosis and compared its prediction performance with that of other machine learning algorithms, particularly logistic regression and SVM. The study showed that the empirical test utilizing the model of asthma signs was more effective for the accurate diagnosis of asthma9. Another study demonstrated the capability of machine learning to predict asthma exacerbations before they occur from tele-monitoring data, using approaches such as SVM and naive Bayes10.

In related work, researchers employed several machine learning techniques to preemptively diagnose Alzheimer's disease (AD), including SVM, k-NN, adaptive boosting (AdaBoost), and eXtreme gradient boosting (XGBoost)17. Researchers also conducted a study to preemptively diagnose diabetes by investigating four machine learning approaches, namely NB, SVM, k-NN, and ANN. The top three features of the Saudi Arabian dataset yielded the highest average accuracy of 68%. The techniques were compared on accuracy, precision, recall, and F1-score, and the ANN obtained the highest classification accuracy of 77.5%18. Likewise, the same group of researchers evaluated four classification techniques, namely NB, SVM, k-NN, and ANN, to preemptively diagnose chronic kidney disease on a Saudi Arabian dataset19. Moreover, the authors conducted a study to preemptively diagnose thyroid cancer using four machine learning techniques: ANN, SVM, RF, and NB. Utilizing 14 features of the Saudi Arabian dataset, the authors obtained the highest average accuracy of 95%. The techniques were compared using accuracy, precision, recall, and F1-score, and the RF obtained the highest classification accuracy of 90.91%20. The researchers also conducted two studies that yielded remarkable results in the proactive diagnosis of rheumatoid arthritis. Several machine learning techniques, such as SVM, logistic regression (LR), and AdaBoost, were shortlisted as candidates for a voting ensemble. One dataset was used in its original form, and the other was balanced by applying SMOTE. The voting ensemble was tested with two sequential feature selection methods, forward and backward selection, to find the subset of the feature space that gives the best indicators for each technique. Using 30 features of the original data and sequential forward feature selection, the results were promising, with accuracy, recall, and precision values of 94.03%, 96%, and 93.51%, respectively21. Similarly, other intelligent methods in medicine and related fields can be found in many studies22,23,24,25,26.

The techniques proposed in this work are RF, SVM, ANN, and NB, chosen for their outstanding results on similar problems. The experimental results, obtained after various optimization and feature selection steps, demonstrate that the proposed techniques are promising for asthma diagnosis with the targeted dataset.

Protocol


This section describes the machine learning algorithms investigated in the proposed asthma diagnostics approach. The criteria and reasons behind the selection of the proposed method are based on a comprehensive literature review and a long history of dealing with preemptive diagnosis of several chronic diseases27,28,29,30,31. These algorithms produced promising results. All the materials and tools used in this study are listed in the Table of Materials.

Support vector machines (SVM)
The SVM algorithm is among the most successful algorithms in pattern recognition and classification32. In its basic form, it is a binary classifier that separates a training set of vectors into two classes. It belongs to the family of supervised learning algorithms, in which the model is trained on labeled outputs32. SVM often provides strong classification performance in comparison with other machine learning classifiers. Therefore, it has been used extensively in applications such as speech, face, and handwriting recognition, as well as disease prediction and stock prediction. Additionally, because it is relatively insensitive to noise and overfitting, SVM can handle unbalanced data, which is typical in medical diagnostics, where positive cases are fewer in number than negative cases32.

In classification, SVM separates two classes by drawing a decision boundary, with which it predicts the label of a feature vector32. The decision boundary is referred to as the hyperplane; the chosen hyperplane is the one farthest from the closest data points of each class. Those data points are referred to as support vectors. Figure 1 shows some candidate hyperplanes in linearly separable data. Equation 1 gives the hyperplane, while equations 2 and 3 describe the points located on or above it and below it32:

$$w^{T} x + b = 0 \quad \text{(hyperplane equation)} \qquad (1)$$

Here, w is the weight vector, x is the input vector, and b is the bias.

$$w^{T} x_j + b \geq 0 \quad \text{for all } j \text{ such that } y_j = +1 \qquad (2)$$

$$w^{T} x_i + b < 0 \quad \text{for all } i \text{ such that } y_i = -1 \qquad (3)$$
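The decision rule expressed by equations 1–3 can be illustrated with a minimal sketch. The function names and the example hyperplane below are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def decision_value(w, x, b):
    """Return w^T x + b for weight vector w, input vector x, and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_label(w, x, b):
    """Label +1 on or above the hyperplane (Eq. 2) and -1 below it (Eq. 3)."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical 2-D hyperplane x1 + x2 - 3 = 0 (illustrative values only).
w, b = [1.0, 1.0], -3.0
points = [[4.0, 2.0], [1.0, 1.0], [1.5, 1.5]]
labels = [predict_label(w, x, b) for x in points]
```

A trained SVM would additionally choose w and b so that the margin to the support vectors is as large as possible; this sketch covers only the classification step.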

Figure 1: A typical SVM with two separable classes. This figure visualizes a typical SVM with two separable classes. Abbreviation: SVM = support vector machine.

Artificial neural network (ANN)
ANN is a computational intelligence technique in soft computing that mimics the functionality of the human brain system. It has a vast number of interconnected neurons33. It is among the supervised machine learning algorithms that can learn from example datasets and improve their performance through learning, generating useful outputs33. ANN's architecture comprises multiple layers: an input layer, a hidden layer, and an output layer. Each of them is formed by a set of connected neurons33.

The input layer contains the neurons that receive data from outside the network. Its purpose is not to perform any complex processing; it simply copies each input value and sends it to the next layer33. The output layer contains neurons that produce data for the user program. Between these two layers lies the hidden layer, an optional intermediate layer that receives inputs from the input neurons, performs mathematical operations on them, and passes the results to the next layer33. Figure 2 shows the ANN architecture.

Figure 2: Feed-forward and backpropagation in a typical ANN structure. This figure visualizes a feed-forward network with backpropagation in a typical ANN structure. Abbreviation: ANN = artificial neural network.

In a feed-forward NN, input data is passed through the network in the forward direction. In the opposite direction, backpropagation computes the gradients of the error and updates the network weights from the output layer backward, including in networks with several hidden layers. A drawback of this process is that gradients can vanish or explode. The activation function gives the ANN its nonlinear characteristics, which enable it to learn detailed relations between input and output data and underlie its description as a universal approximator. Figure 3 shows the structure of a single neuron. The activation function is applied to the weighted sum of the neuron's inputs, to which the bias is added as an extra term33.

Figure 3: A typical single perceptron neuron for an ANN. This figure depicts a single perceptron neuron of a typical ANN. Abbreviation: ANN = artificial neural network.
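The computation performed by a single neuron, as depicted in Figure 3, can be sketched as follows. The input, weight, and bias values are made up for illustration, and the function names are hypothetical.

```python
import math


def neuron_output(inputs, weights, bias, activation):
    """Apply the activation function to the weighted sum of inputs plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return max(0.0, z)


# Illustrative values only; a trained network learns weights and bias from data.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
bias = 0.2

out_relu = neuron_output(inputs, weights, bias, relu)
out_tanh = neuron_output(inputs, weights, bias, math.tanh)
```

Here the weighted sum is 0.2 - 0.3 + 0.2 + 0.2 = 0.3, so the two activations differ only in how they transform that value.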

Random forest (RF)
RF is among the prominent classification algorithms in machine learning. It consists of several individual decision trees working together as an ensemble to produce the final output29. The basic concept behind the successful performance of RF is that many relatively uncorrelated trees (forming a forest) act as a committee, which tends to outperform any of the individual trees34. The RF algorithm was primarily developed by a group of researchers35. To understand how RF works, it is essential to understand decision trees35. RF applies to both discrete and continuous problems, that is, classification and detection on the one hand and regression on the other36. Increasing the number of trees in the forest generally makes the result more accurate and stable, but at the cost of added computational complexity36. Figure 4 depicts a typical random forest.

Figure 4: Random Forest schematic. This figure demonstrates a schematic of a typical random forest.
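The committee idea behind RF can be sketched as a majority vote over individual trees. In the sketch below, each "tree" is reduced to a hypothetical one-rule function with made-up thresholds; real decision trees are learned from bootstrapped samples and random feature subsets, which this illustration omits.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the class that receives the most votes across all trees."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical one-rule "trees"; the thresholds are illustrative assumptions,
# not values learned in the study.
trees = [
    lambda s: 1 if s["MPV"] > 9.0 else 0,
    lambda s: 1 if s["GENDER"] == 1 else 0,
    lambda s: 1 if s["HCT"] < 40.0 else 0,
]

sample = {"MPV": 9.8, "GENDER": 0, "HCT": 38.5}
prediction = forest_predict(trees, sample)
```

Using an odd number of trees avoids ties in a two-class vote.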

Naive Bayes (NB)
Naïve Bayes (NB) is a classification algorithm that employs Bayes' theorem to classify objects. It is a popular method in many medical fields, especially for diagnosis from symptoms. The method makes the naive independence assumption, which treats the attributes as independent of each other given the class. Due to this naive assumption, it is among the simplest classification algorithms in machine learning37. Based on the given dataset, NB determines the likelihood of a particular output. The NB model is simple to develop, can manage substantial data, and is computationally light. It handles both numeric and nominal attributes. Robustness and ease of interpretation are further potential advantages of the NB classifier. However, when trained on a limited amount of data, it can produce relatively inaccurate results; it depends on large numbers of records and is considered less precise than some other techniques37.
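As a brief worked statement of this idea, Bayes' theorem combined with the independence assumption can be written as follows. The symbols c (a class) and x_1, ..., x_n (the attribute values of one instance) are notation introduced here for illustration and do not appear in the source.

$$P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)} \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)$$

The predicted class is the one with the largest product of the class prior and the per-attribute likelihoods; the denominator is the same for every class and can be ignored.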

Statistical analysis
Statistical analysis of the dataset helps visualize and understand its patterns for better data pre-processing and modeling. In this regard, the following steps were carried out. First, statistical analysis was performed in SPSS software to investigate whether the demographic features, age and gender, are imbalanced between the positive and negative groups. Second, quantitative data were compared using an independent-samples t-test if normal distribution and homogeneity of variance were satisfied, or the Mann-Whitney U test otherwise. Finally, categorical data were compared using the chi-squared test or Fisher's exact test38.

Implementation
Dataset description
The dataset for the current study was obtained from King Fahad University Hospital in Dammam, Saudi Arabia, following approval by the institutional review board (IRB). Asthma is often diagnosed through standard tests, including blood tests, virus tests, and biochemistry tests, and the data were collected from such tests. For the current study, the patients and healthy controls (HC) were recruited and clinically evaluated based on the standard triage and pathological tests conducted at the hospital under the recommendations of physicians in the department. The clinical inclusion and exclusion criteria were defined by the doctors based on the laboratory results. Subsequently, the qualified and verified data were provided for the study. The dataset contains seventeen attributes, which are available for all patients. The dataset includes 328 instances in total, with 165 asthma-positive and 163 asthma-negative instances. The gender distribution of the positive and negative groups is listed in Table 1.

| Gender | Instances | Positive | Negative |
|---|---|---|---|
| Male | 145 | 107 | 38 |
| Female | 183 | 57 | 126 |
| Total | 328 | 164 | 164 |

Table 1: Dataset gender distribution. This table shows the gender distribution in the dataset.

Table 2 shows the age distribution between the two classes.

| Age range | Age distribution (Class = 1) | Age distribution (Class = 0) |
|---|---|---|
| 1–10 | 0.059113 | 0 |
| 11–20 | 0.113301 | 0.060869 |
| 21–30 | 0.083743 | 0.177947 |
| 31–40 | 0.157632 | 0.164912 |
| 41–50 | 0.17734 | 0.164912 |
| 51–60 | 0.197044 | 0.099566 |
| 61–70 | 0.108374 | 0.047619 |
| 71–80 | 0.078817 | 0.025974 |
| 81–90 | 0.019704 | 0.012987 |
| 91–100 | 0.004926 | 0.012987 |

Table 2: Dataset age distribution. This table shows the age distribution in the dataset.

Age (p < 0.001, Mann-Whitney U test) and gender (p < 0.001, chi-squared test) were unbalanced between the positive and negative groups. However, there was no bias in the recruitment process, and the unbalanced distribution is believed to reflect the unique age distribution of this patient cohort. Inclusion criteria cover clinical factors for the presence and absence of the disease; demographics, including gender, age, and local population, were also considered. Moreover, the data were collected under informed consent. Age and gender are included as potential features in the feature set, as listed below. However, explicit age- and gender-based analyses are outside the scope of this study and are left as future work.
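As a rough check of the gender comparison, the chi-squared statistic can be recomputed from the counts in Table 1. The sketch below is plain Python rather than the SPSS procedure used in the study; it assumes no continuity correction, and the function name is hypothetical. The constant 10.828 is the chi-squared critical value for p = 0.001 with one degree of freedom.

```python
def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a contingency table (no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: male, female; columns: asthma-positive, asthma-negative (Table 1).
gender_by_class = [[107, 38], [57, 126]]
chi2 = chi_squared_2x2(gender_by_class)

# Critical value for p = 0.001 with one degree of freedom.
significant_at_0_001 = chi2 > 10.828
```

The statistic comes out far above the critical value, which is consistent with the reported p < 0.001 for gender.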

The dataset is stored as numeric values. The "Class" column contains the values 0 and 1, where 0 represents the "Normal patient" and 1 represents the "Asthma patient".
In this regard, the following seventeen attributes were used:
Class: (0 = “Normal”, 1 = “Asthma Patient”)

1. age in years (AGE)

2. gender (1 = male, 0 = female): GENDER

3. Basophils (BASO)

4. Eosinophils (EOS)

5. Hematocrit (HCT)

6. Haemoglobin (HGB)

7. Lymphocytes (LYMP)

8. Mean corpuscular haemoglobin (MCH)

9. Mean corpuscular haemoglobin concentration (MCHC)

10. Mean corpuscular volume (MCV)

11. Monocytes (MONO)

12. Mean platelet volume (MPV)

13. Neutrophils Function (NEUT)

14. Platelet count (PLATELET_COUNT)

15. Red blood cell count (RBC)

16. Red cell distribution width (RDW)

17. White Blood Cell (WBC)

Statistical features of the dataset
For better understanding, a statistical analysis of the dataset is presented in the following tables38. Table 3 shows the mean, median, standard deviation, maximum, and minimum values for the dataset under consideration.

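| Attribute | Mean | Median | Standard deviation | Max | Min |
|---|---|---|---|---|---|
| AGE | 39.1 | 36 | 18.136 | 93 | 2 |
| GENDER | 0.442 | 0 | 0.4974 | 1 | 0 |
| BASO | 0.07 | 0.07 | 0.0701 | 0.72 | 0 |
| EOS | 0.231 | 0.229 | 0.2048 | 2 | 0 |
| HCT | 40.88 | 40.7 | 6.0313 | 56.5 | 3.4 |
| HGB | 13.67 | 13.45 | 4.3446 | 81.3 | 5.9 |
| LYMP | 2.595 | 2.33 | 2.1843 | 33.4 | 0.4 |
| MCH | 26.78 | 27 | 3.3177 | 34.5 | 2.6 |
| MCHC | 32.71 | 33 | 2.1017 | 43.1 | 15 |
| MCV | 81.15 | 82.8 | 9.4426 | 102 | 13 |
| MONO | 1.189 | 0.613 | 6.1198 | 95 | 0.2 |
| MPV | 9.217 | 9.217 | 1.7297 | 13.4 | 0 |
| NEUT | 4.23 | 4.23 | 2.3074 | 16 | 0.6 |
| PLATELET_COUNT | 279.7 | 272.5 | 85.473 | 685 | 0 |
| RBC | 5.677 | 5 | 11.615 | 215 | 2.1 |
| RDW | 14.72 | 13.4 | 15.769 | 296 | 0 |
| WBC | 6.777 | 6.505 | 3.5836 | 33.4 | 1.1 |
| Class | 0.5 | 0.5 | 0.5008 | 1 | 0 |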

Table 3: Statistical features of the dataset. This table lists the statistical features of the data.

Additionally, Table 4 shows the correlation coefficient between each attribute and the class attribute. A correlation coefficient lies between -1 and 1, and its magnitude indicates the strength of the relationship, from 0 (least correlated) to 1 (most correlated). It is added as a preliminary statistical analysis to explain the dataset attributes. By magnitude, MPV, gender, HCT, BASO, HGB, MCHC, and age are among the most strongly correlated features.

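| Attribute | Correlation coefficient with Class |
|---|---|
| AGE | 0.212473 |
| GENDER | 0.423584 |
| BASO | -0.33242 |
| EOS | 0.048184 |
| HCT | -0.376843 |
| HGB | 0.275277 |
| LYMP | 0.055856 |
| MCH | 0.128111 |
| MCHC | 0.272635 |
| MCV | 0.013022 |
| MONO | 0.055476 |
| MPV | 0.564992 |
| NEUT | 0.031185 |
| PLATELET_COUNT | 0.095253 |
| RBC | 0.030787 |
| RDW | 0.025979 |
| WBC | 0.148842 |
| Class | 1 |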

Table 4: Correlation with class attribute. This table shows the correlation between the attributes (features) and the class attribute.

Experimental setup
In the current study, the Python programming language is used to preprocess the collected dataset. Some values are missing because of human error during data entry and selection. Missing values were handled by imputation, replacing each missing value with the attribute's average. This approach was chosen, in consultation with healthcare professionals, for its simplicity, model compatibility, and preservation of sample mean values. Moreover, feature selection is performed using correlation coefficients to rank the attributes. The features with higher correlation values were selected for analysis alongside the complete feature set. Python libraries were used to implement the proposed approach, tune hyperparameters, and obtain results. Feature selection and elimination were performed with an attribute selection method to identify the feature groups with the highest accuracy. Additionally, accuracy was evaluated in Python using 10-fold cross-validation and direct partition ratio evaluation methods, as specified by the given equations39,40. Finally, the mean of the outputs of all evaluation methods was calculated to compare them and determine the method with the best average accuracy38. In addition, confusion matrices were generated using the optimal parameters of SVM, ANN, RF, and NB. The false positive (FP), false negative (FN), true positive (TP), and true negative (TN) counts were then compared through the F1-score, which is an effective measure for identifying the best method41,42.
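The two preprocessing steps described above, mean imputation and correlation-based ranking, can be sketched in plain Python. This is a minimal illustration under stated assumptions: missing values are represented as None, the function names are hypothetical, and the toy column below is made up rather than taken from the hospital dataset.

```python
import math


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy data for illustration only: one attribute with a missing value, and
# the class label (0 = normal, 1 = asthma).
mpv = impute_mean([9.0, None, 11.0, 8.0, 12.0])
labels = [0, 0, 1, 0, 1]
r = pearson(mpv, labels)
```

Features can then be ranked by the absolute value of r, so that strongly negative correlations count as informative too.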

Evaluation metrics
To evaluate the performance of the proposed approach, four well-known performance metrics are considered: accuracy, precision, recall, and F1-score. Classification accuracy is the primary measure; it is the percentage of instances that are classified correctly. Precision is the fraction of predicted positives that are true positives. Recall is the fraction of actual positives that are correctly identified. The F1-score combines precision and recall into a single value for each class. The metrics are computed from the following outcomes43,44,45:

True positive (TP): an asthma case correctly diagnosed as asthma.

False positive (FP): a non-asthma case incorrectly diagnosed as asthma.

True negative (TN): a non-asthma case correctly diagnosed as non-asthma.

False negative (FN): an asthma case incorrectly diagnosed as non-asthma.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (6)$$
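Equations 4–6 translate directly into code. The sketch below also includes the F1-score using its standard definition as the harmonic mean of precision and recall, for which the text gives no explicit equation. The confusion-matrix counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall (Eqs. 4-6) and the F1-score."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Standard F1 definition: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, accuracy is 85/100, precision is 40/50, and recall is 40/45; the function assumes at least one predicted positive and one actual positive so that no denominator is zero.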

Optimization strategy
To determine the optimal parameters for each of the proposed approaches, the grid search technique was employed. A set of candidate values is specified for each parameter, and grid search evaluates every combination of these values to find the best-performing configuration.

Figures 5–7 depict the outcome of the grid search for the proposed approaches and the implementation procedure on the dataset using the full set of seventeen attributes. Figure 5 presents the SVM accuracies obtained with various parameter values. The input values for the cost and gamma, corresponding to multiple types of kernels, are as follows:

Kernel types: Linear, Radial, Sigmoid, and Polynomial.

Gamma: 0.001 and 0.0001.

Cost: 1, 10, 100, and 1000.

Figure 5: SVM with different cost and gamma types. This figure demonstrates SVM behavior with different cost and gamma types. Abbreviation: SVM = support vector machine.

For the RF, the parameter names and input values indicated in Figure 6 are as follows:

Number of features: 'auto', 'sqrt'.

Max depth: 1, 3, 5, 7.

Min samples split: 2, 3, 4, 5.

Min samples leaf: 2, 3, 4, 5.

Figure 6: RF with different depth values and minimum sample leaves. This figure depicts RF behavior with different depths and minimum sample leaf values. Abbreviation: RF = random forest.

For the ANN, the parameter names and input values indicated in Figure 7 are as follows:

Hidden layer sizes: (50, 50, 50), (50, 100, 50), and (100,).
Activation: 'tanh', and 'relu'.
Solver: 'sgd', and 'adam'.
Alpha: 0.1, 0.01, 0.001, 0.0001, 0.5, 0.005, 0.0005, and 0.05.
Learning rate: 'constant', and 'adaptive'.

Figure 7: ANN with different alpha values and hidden layer sizes. This figure depicts the behavior of the ANN with different alpha values and hidden layer sizes. Abbreviation: ANN = artificial neural network.
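To give a sense of the search space, the candidate values listed for SVM, RF, and ANN can be expanded into every combination that a grid search would evaluate. The sketch below uses only the Python standard library. The dictionary keys are assumed, scikit-learn-style labels (the text does not name the library), and 'rbf' and 'poly' are assumed spellings for the radial and polynomial kernels.

```python
from itertools import product

# Candidate values as listed in the text; key names are assumptions.
svm_grid = {
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
    "gamma": [0.001, 0.0001],
    "C": [1, 10, 100, 1000],
}
rf_grid = {
    "max_features": ["auto", "sqrt"],
    "max_depth": [1, 3, 5, 7],
    "min_samples_split": [2, 3, 4, 5],
    "min_samples_leaf": [2, 3, 4, 5],
}
ann_grid = {
    "hidden_layer_sizes": [(50, 50, 50), (50, 100, 50), (100,)],
    "activation": ["tanh", "relu"],
    "solver": ["sgd", "adam"],
    "alpha": [0.1, 0.01, 0.001, 0.0001, 0.5, 0.005, 0.0005, 0.05],
    "learning_rate": ["constant", "adaptive"],
}


def expand_grid(grid):
    """Return one dict per combination of the candidate parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


grid_sizes = {
    name: len(expand_grid(grid))
    for name, grid in [("SVM", svm_grid), ("RF", rf_grid), ("ANN", ann_grid)]
}
```

Each combination would be scored with cross-validation, so the cost of the search grows with the product of the list lengths.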

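| Classifier | Parameter | Value |
|---|---|---|
| SVM | Gamma | 0.0001 |
| SVM | Kernel type | 'rbf' |
| SVM | Cost 'C' | 10 |
| SVM | Random state | 42 |
| RF | Max depth | 3 |
| RF | Min samples leaf | 2 |
| RF | Min samples split | 2 |
| RF | Random state | 42 |
| ANN | Activation | 'relu' |
| ANN | Alpha | 0.001 |
| ANN | Size of hidden layer | (50,50,50) |
| ANN | Learning rate | 'constant' |
| ANN | Solver | 'adam' |
| ANN | Random state | 42 |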

Table 5: Summary of optimal parameters for the proposed classifiers. This table summarizes the optimal parameters obtained for the proposed classifiers.

Results


Influence of feature selection
In the current study, feature selection uses recursive feature elimination based on the correlation coefficient. The correlation coefficient is used to order the features from highest to lowest correlation. Recursive feature elimination then helps select suitable features for the model by progressively eliminating the weakest features until only one remains. Each iteration runs the four selected classifiers with 10-fold cross-validation to find the feature group with the highest accuracy. The elimination process is as follows:

First, the NB, RF, SVM, and ANN models are built with N features. Second, the correlation coefficient of every feature is calculated, and the top N/2 features are kept. Third, the second step is repeated until a single feature is left. Finally, the best-performing feature set is selected using direct partition. Moreover, to ensure an unbiased measure of performance, feature selection was carried out within each fold during cross-validation, so that the selected features are based on training instances alone, with the validation instances completely "unseen". With half of the features, that is, nine features, an average accuracy of 91.29% was obtained across the NB, RF, SVM, and ANN algorithms. Moreover, ANN, RF, and SVM each achieved 93.94% accuracy using nine features, which is higher than the accuracies reported in earlier studies that utilized larger feature sets, as shown in Table 6.

Number of attributes | NB accuracy | RF accuracy | SVM accuracy | ANN accuracy | Average
17 (All) | 96.97% | 96.63% | 100% | 96.97% | 97.64%
9 | 83.33% | 93.94% | 93.94% | 93.94% | 91.29%
5 | 84.85% | 92.42% | 87.88% | 89.39% | 88.64%
3 | 83.33% | 90.91% | 86.36% | 89.39% | 87.50%
2 | 84.85% | 90.91% | 74.24% | 89.39% | 84.85%
1 | 77.27% | 98.39% | 74.24% | 77.27% | 81.79%

Table 6: Results of different attributes. This table shows the outcome of classifiers using various attributes.

According to Table 6, the first iteration, which uses all seventeen features (Age, Gender, BASO, EOS, HCT, HGB, LYMP, MCH, MCHC, MCV, MONO, MPV, NEUT, PLATELET_COUNT, RBC, RDW, and WBC), achieved the highest average accuracy of 97.64%. The second iteration achieved 91.29% with nine attributes (MPV, Gender, HCT, BASO, HGB, MCHC, Age, WBC, and MCH), and the third achieved 88.64% with five attributes (MPV, Gender, HCT, BASO, and HGB). The fourth iteration achieved 87.50% with three attributes (MPV, Gender, and HCT), the fifth achieved 84.85% with two attributes (MPV and Gender), and the last achieved 81.79% with MPV alone.

As shown in Table 6, the investigation was carried out with the full feature set (17), followed by successively smaller subsets (9, 5, 3, 2, and 1). The full feature set outperformed all of the subsets; reducing the number of features lowered the average accuracy by 6.35, 9.00, 10.14, 12.79, and 15.85 percentage points, respectively. Separate gender- and age-based analyses were outside the scope of the current study, mainly because of the limited number of instances, and are left for future work.
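
The averages and the reported degradation can be re-derived from the per-classifier accuracies in Table 6. The sketch below uses Python's `decimal` module with half-up rounding to two decimals, which is an assumption chosen because it reproduces the reported averages; the names `table6`, `average`, and `drops` are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-classifier accuracies (%) from Table 6, in the order NB, RF, SVM, ANN.
table6 = {
    17: ["96.97", "96.63", "100", "96.97"],
    9: ["83.33", "93.94", "93.94", "93.94"],
    5: ["84.85", "92.42", "87.88", "89.39"],
    3: ["83.33", "90.91", "86.36", "89.39"],
    2: ["84.85", "90.91", "74.24", "89.39"],
    1: ["77.27", "98.39", "74.24", "77.27"],
}

TWO_PLACES = Decimal("0.01")


def average(values):
    """Mean of the accuracies, rounded half-up to two decimals."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)


averages = {n: average(vals) for n, vals in table6.items()}

# Drop in average accuracy (percentage points) relative to all 17 features.
drops = {n: averages[17] - avg for n, avg in averages.items() if n != 17}

for n, drop in drops.items():
    print(f"{n} features: average {averages[n]}%, drop {drop} points")
```

Working in decimal arithmetic avoids binary floating-point surprises at rounding boundaries such as 88.635.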

Investigating the effect of partitioning on the proposed techniques
The training-to-testing split ratio of the dataset plays a crucial role in model performance on the aforementioned figures of merit. Table 7 presents the ratios applied, along with the results for the dataset. According to Table 7, an 80:20 split was found to be the optimal ratio for accuracy.

Ratio | SVM | ANN | RF | NB | Average accuracy
50-50 | 82.91% | 89.33% | 93% | 78.59% | 85.96%
60-40 | 82.91% | 89.33% | 93% | 78.59% | 85.96%
70-30 | 82.91% | 89.33% | 93% | 78.59% | 85.96%
80-20 | 82.91% | 89.33% | 93% | 78.59% | 85.96%
90-10 | 82.91% | 89.33% | 93% | 78.59% | 85.96%

Table 7: The result of different partition ratios. This table shows the outcome of classifiers using various partition ratios.
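
For orientation, the number of training and testing instances implied by each ratio can be computed as below for the 328-instance dataset. The sketch assumes that the test size is rounded up and the remainder is used for training, which is the convention of scikit-learn's `train_test_split` when only `test_size` is given; the function name `split_sizes` is hypothetical.

```python
def split_sizes(n_samples, test_percent):
    """Return (train, test) sizes for a direct partition.

    Assumption: the test size is rounded up and the remaining
    instances are used for training.
    """
    n_test = -(-n_samples * test_percent // 100)  # integer ceiling division
    return n_samples - n_test, n_test


# Test percentages for the ratios 50-50, 60-40, 70-30, 80-20, and 90-10.
for test_percent in (50, 40, 30, 20, 10):
    n_train, n_test = split_sizes(328, test_percent)
    print(f"{100 - test_percent}-{test_percent}: train={n_train}, test={n_test}")
```

Under this assumption, the 80-20 ratio yields 66 test instances, which agrees with the row totals of the confusion matrices reported later (32 + 34).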

Comparing 10-fold cross-validation and direct partition
The relative performance of the four classifiers is given in Table 8, which compares 10-fold cross-validation with the ratio-based direct data split. The direct-partition accuracies in the bottom row are the highest accuracies from Table 7.

Method | SVM | ANN | RF | NB | Average accuracy
10-fold cross-validation | 94% | 94% | ### | 83% | 91%
Direct partition | 82.9# | #### | #### | #### | 85.96%

Table 8: The result of the dataset split ratio and cross-validation. This table shows the outcomes of the classifiers under direct partition and under 10-fold cross-validation.
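
To make the difference from a direct partition concrete, the following sketch generates the index sets for 10-fold cross-validation on 328 instances. It is a simplified illustration with contiguous, unshuffled folds; whether the study shuffled or stratified its folds is not stated, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.

    The first n_samples % k folds receive one extra sample, so every
    instance is used for validation exactly once.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(328, k=10))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)
```

Each instance is validated once and used for training nine times, whereas a direct partition evaluates a single held-out subset.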

Figure 8 presents the area under the receiver operating characteristic curve (AUROC) for the proposed algorithms: NB, RF, SVM, and ANN. The mean AUROC value for all the classifiers is above 93%, which indicates that the proposed algorithms are promising for asthma detection.

Figure 8: AUROC for the proposed classifiers. (A) AUROC behavior of the NB classifier. (B) AUROC behavior of the RF classifier. (C) AUROC behavior of the SVM classifier. (D) AUROC behavior of the ANN classifier. Abbreviations: AUROC = area under the receiver operating characteristic curve; NB = naive Bayes; RF = random forest; SVM = support vector machine; ANN = artificial neural network.
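
As a reminder of what the AUROC measures, it can be computed as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise definition on hypothetical labels and scores that are not taken from the study; the function name `auroc` is an assumption.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
area = auroc(y_true, y_score)
print(area)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.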

The highlighted row represents the optimal accuracy for the partition ratio. The performance of SVM, ANN, RF, and NB with the best features and parameters is presented in Tables 9–12. The parameters of each classifier were also re-examined to optimize them for the classification problem at hand.

Class | Precision | Recall | F1-score | Support
0 | 0.94 | 0.94 | 0.94 | 32
1 | 0.94 | 0.94 | 0.94 | 34

Table 9: Classification performance of SVM with the best features and parameters. This table presents the per-class performance of the SVM classifier.

Class | Precision | Recall | F1-score | Support
0 | 0.97 | 0.91 | 0.94 | 32
1 | 0.92 | 0.97 | 0.94 | 34

Table 10: Classification performance of ANN with the best features and parameters. This table presents the per-class performance of the ANN classifier.

Class | Precision | Recall | F1-score | Support
0 | 0.94 | 0.94 | 0.94 | 32
1 | 0.94 | 0.94 | 0.94 | 34

Table 11: Classification performance of RF with the best features and parameters. This table presents the per-class performance of the RF classifier.

Class | Precision | Recall | F1-score | Support
0 | 0.92 | 0.72 | 0.81 | 32
1 | 0.78 | 0.94 | 0.85 | 34

Table 12: Classification performance of NB with the best features and parameters. This table presents the per-class performance of the NB classifier.

Analyses of confusion matrices
Tables 13–16 show the confusion matrices obtained with the best feature selection for SVM, ANN, RF, and NB, using the optimum parameters shown in Table 5. In these tables, the positive and negative labels represent the presence and absence of asthma, respectively. The diagonal elements, positive-positive and negative-negative, are the true positives and true negatives, respectively. The off-diagonal elements, positive-negative and negative-positive, are the false negatives and false positives, respectively.

 | Predicted Positive | Predicted Negative
Actual Positive | 30 | 2
Actual Negative | 2 | 32

Table 13: Results of the SVM based on the optimal parameters. This table presents the performance of the SVM classifier using optimum parameters.

 | Predicted Positive | Predicted Negative
Actual Positive | 29 | 3
Actual Negative | 1 | 33

Table 14: Results of the ANN based on the optimal parameters. This table presents the performance of the ANN classifier using optimum parameters.

 | Predicted Positive | Predicted Negative
Actual Positive | 30 | 2
Actual Negative | 2 | 32

Table 15: Results of the RF based on the optimal parameters. This table presents the performance of the RF classifier using optimum parameters.

 | Predicted Positive | Predicted Negative
Actual Positive | 23 | 9
Actual Negative | 2 | 32

Table 16: Results of the NB based on the optimal parameters. This table presents the performance of the NB classifier using optimum parameters.
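
The entries of Tables 13–16 are sufficient to recompute the headline metrics. The following sketch reads each matrix as (TP, FN, FP, TN), with "positive" meaning that asthma is present, and derives the accuracy, the positive-class precision and recall, and the false-negative rate; the names `matrices` and `summarize` are hypothetical.

```python
# Confusion matrices from Tables 13-16 as (TP, FN, FP, TN),
# where "positive" means asthma is present.
matrices = {
    "SVM": (30, 2, 2, 32),
    "ANN": (29, 3, 1, 33),
    "RF": (30, 2, 2, 32),
    "NB": (23, 9, 2, 32),
}


def summarize(tp, fn, fp, tn):
    """Accuracy, positive-class precision and recall, and false-negative rate."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fn_rate": fn / (tp + fn),
    }


summary = {name: summarize(*cells) for name, cells in matrices.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

With these counts, SVM, ANN, and RF each classify 62 of 66 instances correctly (about 93.94%), while NB classifies 55 of 66 correctly (about 83.33%) and misses 9 of the 32 positive cases.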

Two types of error must be considered: false negatives (FN), which occur when a negative test result is incorrect, and false positives (FP), which occur when a positive test result is incorrect. Knowing these error rates helps to prevent missed diagnoses. In this case, the NB technique has a severe FN problem: its error analysis shows 9 FN cases against 2 FP cases. The 9 FN cases are the more critical issue because each one represents a patient who has asthma while the test result shows no disease. In medical diagnostics, the lower the false negative rate, the better the technique's performance. Comparing the methods by the F1-score, which combines TP, FP, and FN through precision and recall, is the best way to identify which method will yield the best performance. The F1-score is given in Equation 7 [46,47,48].

$$F1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$

The F1-score for each technique is shown in Table 17.  

 | SVM | ANN | RF | NB
F-measure (weighted avg.) | 0.94 | 0.94 | 0.94 | 0.83

Table 17: F1-score for each classifier. This table presents the F1-score analysis of all classifiers.
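
The weighted averages in Table 17 can be reproduced from the confusion matrices by computing the F1-score of Equation 7 for each class and weighting it by the class support. This is a minimal sketch under the assumption that "weighted avg." denotes support weighting, as in scikit-learn's classification report; the function names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (Equation 7)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(tp, fn, fp, tn):
    """Support-weighted average of the two per-class F1-scores."""
    pos_support, neg_support = tp + fn, fp + tn
    # Positive class: precision = TP/(TP+FP), recall = TP/(TP+FN).
    f1_pos = f1(tp / (tp + fp), tp / (tp + fn))
    # Negative class: the roles of the cells are mirrored.
    f1_neg = f1(tn / (tn + fn), tn / (tn + fp))
    total = pos_support + neg_support
    return (f1_pos * pos_support + f1_neg * neg_support) / total


# (TP, FN, FP, TN) read from the confusion matrices in Tables 13-16.
matrices = {
    "SVM": (30, 2, 2, 32),
    "ANN": (29, 3, 1, 33),
    "RF": (30, 2, 2, 32),
    "NB": (23, 9, 2, 32),
}

scores = {name: round(weighted_f1(*cells), 2) for name, cells in matrices.items()}
print(scores)
```

Because the two classes have nearly equal support (32 and 34), the weighted average stays close to the unweighted mean of the per-class scores.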

The best performance is observed for SVM, ANN, and RF, which achieved the highest accuracy. Among them, RF yielded better results than ANN and SVM.

DATA AVAILABILITY:
The raw data set used for this study is provided as a supplementary file.

Discussion

The current study employs several machine learning algorithms: SVM, ANN, RF, and NB. After several experiments involving hyperparameter fine-tuning and feature selection, SVM, ANN, and RF were found to reach a diagnostic accuracy of 94%, while NB was lower at 83.33%. The results were validated using 10-fold cross-validation and optimized feature selection, as well as the best split ratio. The seventeen patient parameters were ranked by their correlation coefficient values, which showed that low MPV and gender have the highest positive association with asthma. In contrast, HCT and BASO showed strong negative associations, indicating systemic inflammation and migration of cells under respiratory strain. Other common immune factors include EOS, where a high peripheral eosinophil count indicates the severity of the disease.

In contrast to state-of-the-art approaches that apply machine learning to clinical datasets for asthma diagnostics, the proposed scheme exhibits better performance, with a 3.9% improvement in diagnostic accuracy. Notably, the studies in [16,17] both achieved a highest accuracy of 90.91%. The current study is a collaboration between computer scientists and clinicians. At each stage, from dataset acquisition to results and discussion, the team of clinicians was involved and their continuous feedback was obtained. Their satisfaction with the findings suggests that the study adequately contributes to clinical practice. The higher accuracy of the algorithms is mainly attributed to the dataset being comprehensive and adequately labeled, which enabled the machine learning algorithms to achieve good classification and robust results.

Regarding limitations, the current investigation focuses on the impact of machine learning algorithms on a single locally collected, preprocessed, and manually annotated dataset, which lacks a longitudinal aspect. To generalize the study, data integration and augmentation techniques are recommended, as well as the incorporation of additional datasets from open-source repositories. Demographic and social factors, such as family history and smoking and drinking habits, must also be considered in future studies. Separate gender- and age-based analyses were outside the scope of the current study, mainly because of the limited number of instances, and should be investigated in a future extension. Further analysis based on matched groups is needed to validate the other significant factors identified in this study and to develop machine learning models that suit different cohorts. Moreover, the current study investigates a set of four machine learning algorithms. In the future, other algorithms, such as ensemble techniques and deep learning, may be examined to further improve the results. Explainable AI (XAI) methods, such as SHAP and LIME, could add further insight by ranking and identifying the essential features in the dataset [43].

The current study investigates the potential of machine learning for the early diagnosis of asthma in Saudi Arabia. To this end, a real-life dataset was collected from a local public sector hospital. The study aimed to achieve the highest level of accuracy with a relatively small number of features, so feature selection and hyperparameter tuning were incorporated into model development. After a careful literature review, four promising techniques were shortlisted: random forest (RF), artificial neural network (ANN), support vector machine (SVM), and naive Bayes (NB). The highest accuracy, 94%, was achieved by the RF, SVM, and ANN techniques, while the NB technique attained an accuracy of 83.33%. These accuracies were obtained using nine features and are promising. RF, SVM, and ANN are therefore recommended under this scenario. The results improve on the state of the art, in which a few researchers achieved a highest accuracy of 90.91% using similar techniques [15,16]. Further research could explore other machine learning techniques, particularly ensemble approaches, to achieve higher accuracy in this domain. Data augmentation techniques can be employed to enhance the dataset and thereby improve generalizability. Moreover, explainable AI can provide further insight into the root causes of the disease [49,50].

Disclosures

The dataset was obtained from the hospital under IRB-2020-09-429. The authors declare that they have no conflicts of interest regarding the current study. The study has been posted to the Research Square repository as a preprint [50].

AUTHORS’ CONTRIBUTIONS:
Conceptualization: S.O., M.I.B.A., and A.R.; Supervision: S.O., M.I.B.A., and J.A.; Writing of the original draft: S.S.A., E.A., and Z.A.; Implementation: S.A.A., Y.A., and R.A.; Validation: J.A., M.A., and S.D.; Data Curation: A.B., Y.A., E.A., and Z.A.; Review and Editing: A.R., S.D., and A.B.; Project Administration: A.R., S.D., and M.A.; Funding Acquisition: A.B., S.D., and A.R.

Acknowledgements

The authors would like to acknowledge the support of healthcare professionals in validating the study's findings.

Materials

List of materials used in this article
Name | Company | Catalog Number | Comments
Excel 365 | Microsoft | Excel 365 | Used to store raw data in CSV format
Laptop/Machine | Dell | XPS9320 | RAM 16 GB, 12th Gen Intel(R) Core(TM) i7-1260P
Mlxtend 0.23.4 | Google Colab Notebook | Mlxtend 0.23.4 | Used for model building and training
Numpy 2.0.2 | Google Colab Notebook | Numpy 2.0.2 | Used for model building and training
Pandas 2.2.2 | Google Colab Notebook | Pandas 2.2.2 | Used for model building and training
Python 3.12.12 | Google Colab Notebook | Python 3.12.12 | Used for model building and training
Sklearn 1.6.1 | Google Colab Notebook | Sklearn 1.6.1 | Used for model building and training
SPSS | IBM, USA | Version 26.0 | Used for statistical analysis
XGBoost 3.1.2 | Google Colab Notebook | XGBoost 3.1.2 | Used for model building and training

Tags

Asthma Diagnosis, Computational Intelligence, Machine Learning, Early Detection, Saudi Arabia Asthma, Random Forests, Support Vector Machines, Artificial Neural Networks, Medical Data Analysis, Chronic Disease Prediction
