This section describes the machine learning algorithms investigated in the proposed asthma diagnostics approach. The selection of the proposed methods is based on a comprehensive literature review and on prior experience with the preemptive diagnosis of several chronic diseases27,28,29,30,31, in which these algorithms produced promising results. All the materials and tools used in this study are listed in the Table of Materials.
Support vector machines (SVM)
The SVM algorithm is among the most successful algorithms in pattern recognition and classification32. In its basic form, it is a binary classifier that separates a training set of vectors into two classes. It belongs to the family of supervised learning algorithms, in which the model is trained on examples with known outputs32. SVM often achieves strong classification performance compared with other machine learning classifiers and has therefore been used extensively in applications such as speech, face, and handwriting recognition, as well as in disease prediction and stock prediction. Additionally, because it is relatively insensitive to noise and overfitting, SVM can handle unbalanced data, which is typical in medical diagnostics, where positive cases are usually fewer than negative cases32.
In classification, SVM separates two classes with a decision boundary and predicts the label of a feature vector from the side of the boundary on which it falls32. The decision boundary is a hyperplane, chosen to be as far as possible from the closest data points of each class. Those data points are referred to as support vectors. Figure 1 shows some candidate hyperplanes in linearly separable data. Equation 1 gives the hyperplane equation, while equations 2 and 3 describe the elements located above and below the plane32:
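$$w \cdot x + b = 0 \qquad (1)$$
Here, w is the weight vector, x is the input vector, b is the bias, and y denotes the class label (+1 or -1) of a training point.
For all j such that $y_j = +1$:
$$w \cdot x_j + b \geq +1 \qquad (2)$$
For all i such that $y_i = -1$:
$$w \cdot x_i + b \leq -1 \qquad (3)$$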

Figure 1: A typical SVM with two separable classes. This figure visualizes a typical SVM with two separable classes. Abbreviation: SVM = support vector machine.
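As a minimal sketch of how the hyperplane is used at prediction time, the snippet below evaluates w · x + b for a point and assigns the class from the sign of the result. The weight vector, bias, and sample points are made-up illustrative values, not parameters learned from the study's data, and the function names are hypothetical.

```python
def decision_value(w, x, b):
    """Return w . x + b, the signed score of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """Assign +1 to points on or above the hyperplane and -1 to points below it."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical 2-D hyperplane: x1 + x2 - 3 = 0
w = [1.0, 1.0]
b = -3.0

above = [3.0, 2.0]  # score = 3 + 2 - 3 = 2
below = [0.5, 1.0]  # score = 0.5 + 1 - 3 = -1.5

print(predict(w, above, b), predict(w, below, b))
```

Training an SVM amounts to choosing w and b so that the margin between the two classes is as wide as possible; the sketch only covers the prediction step.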
Artificial neural network (ANN)
ANN is a computational intelligence technique in soft computing that mimics the functionality of the human brain system. It has a vast number of interconnected neurons33. It is among the supervised machine learning algorithms that can learn from example datasets and improve their performance through learning, generating useful outputs33. ANN's architecture comprises multiple layers: an input layer, a hidden layer, and an output layer. Each of them is formed by a set of connected neurons33.
The input layer consists of the neurons that receive data from outside the network. Its purpose is not to perform any complex processing on the data; it simply passes each input value on to the next layer33. The output layer contains neurons that produce data for the user program. Between these two layers, the optional hidden layer performs the computational processes. It serves as an intermediate layer that receives inputs from the input neurons, performs mathematical operations on them, and then passes the results to the next layer33. Figure 2 shows the ANN architecture.

Figure 2: Feed-forward and backpropagation in a typical ANN structure. This figure visualizes a feed-forward network with backpropagation in a typical ANN structure. Abbreviation: ANN = artificial neural network.
In a feed-forward NN, input data propagate through the network in the forward direction, from the input layer to the output layer. In the opposite direction, back-propagation passes the prediction error backward through the network to obtain the gradients used to update the weights. A drawback of this process, particularly in networks with several hidden layers, is that the gradients can vanish or explode. The activation function gives the ANN its nonlinear character, which enables it to learn detailed relations between input and output data and underlies its role as a universal approximator. Figure 3 demonstrates the structure of a single neuron of the artificial neural network. It also illustrates the activation function, which is applied to the weighted sum of the neuron's inputs to produce its output. The bias is an additional term that is added to this weighted sum before the activation function is applied33.

Figure 3: A typical single perceptron neuron for an ANN. This figure depicts a single perceptron neuron of a typical ANN. Abbreviation: ANN = artificial neural network.
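The computation performed by a single neuron can be sketched in a few lines: a weighted sum of the inputs, plus a bias, passed through an activation function. The input values, weights, and bias below are arbitrary illustrative numbers, and the function names are hypothetical.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to zero."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation):
    """Apply the activation function to the weighted sum of inputs plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Arbitrary example values
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
bias = 0.1  # weighted sum is 0.2 - 0.3 + 0.2 = 0.1, so z is about 0.2

print(neuron_output(inputs, weights, bias, relu))
print(neuron_output(inputs, weights, bias, math.tanh))
```

A feed-forward pass repeats this computation for every neuron, layer by layer; back-propagation then adjusts the weights and biases from the gradients of the error.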
Random forest (RF)
RF is among the prominent classification algorithms in machine learning. It consists of several individual decision trees working together as an ensemble to produce the final output29. The basic concept behind the successful performance of RF is that it is made up of many relatively uncorrelated trees (forming a forest), which act as a committee. As a committee, they tend to outperform any of the individual trees34. The RF algorithm was primarily developed by a group of researchers35. To better understand how RF works, it is essential to understand decision trees35. RF is applicable to both discrete and continuous problems, that is, to classification and detection on the one hand and regression on the other36. Adding more trees to the forest generally makes the result more accurate, but at added computational cost36, as depicted in Figure 4.

Figure 4: Random Forest schematic. This figure demonstrates a schematic of a typical random forest.
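The committee idea can be illustrated with a minimal majority-vote sketch. The three "trees" below are hypothetical one-split rules on placeholder features (f1, f2, f3), standing in for trained decision trees; they are not derived from the study's data.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees: each is a single
# threshold rule on a different placeholder feature.
def tree_1(sample):
    return 1 if sample["f1"] > 0.5 else 0


def tree_2(sample):
    return 1 if sample["f2"] > 0.5 else 0


def tree_3(sample):
    return 1 if sample["f3"] > 0.5 else 0


FOREST = [tree_1, tree_2, tree_3]


def forest_predict(sample):
    """Collect one vote per tree and return the majority class."""
    return majority_vote([tree(sample) for tree in FOREST])


sample = {"f1": 0.8, "f2": 0.2, "f3": 0.9}
print(forest_predict(sample))  # the three votes are 1, 0, 1
```

With an odd number of trees and two classes, the vote cannot tie. A real random forest additionally trains each tree on a bootstrap sample with a random subset of features, which is what keeps the trees relatively uncorrelated.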
Naive Bayes (NB)
Naïve Bayes (NB) is a classification algorithm that employs Bayes' theorem to classify objects. It is a popular method in many medical fields, especially for diagnosis from symptoms. The method makes a strong independence assumption: given the class, the attributes are treated as independent of each other. This naïve assumption makes it among the simplest classification algorithms in machine learning37. Based on the given dataset, NB determines the likelihood of a particular output. The NB model is simple to develop, can manage substantial data, and is computationally light. Additionally, it can handle both numeric and nominal attributes. Robustness and ease of interpretation are further potential advantages of the NB classifier. However, if it is employed on a limited amount of data, it will produce relatively inaccurate outcomes; it depends on a large number of records and is considered less precise than some other techniques37.
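The independence assumption can be written compactly. Using generic symbols that do not appear in the original text (c for a class label and x_1, ..., x_n for the attribute values of one instance), Bayes' theorem with conditionally independent attributes gives the following standard form, shown here as a sketch rather than the exact variant used in the study:

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The predicted class is the one with the largest product of the class prior and the per-attribute likelihoods, each of which is estimated from the training data.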
Statistical analysis
Statistical analysis of the dataset helps visualize and understand its patterns for better data pre-processing and modeling. In this regard, the following steps were carried out. First, a statistical analysis was performed in SPSS software to investigate whether the demographic features, namely age and gender, were imbalanced between the positive and negative groups. Second, quantitative data were compared using an independent-samples t-test if the assumptions of normal distribution and homogeneity of variance were satisfied, or the Mann-Whitney U test otherwise. Finally, categorical data were compared using the chi-squared test or Fisher’s exact test38.
Implementation
Dataset description
The dataset for the current study was obtained from King Fahad University Hospital in Dammam, Saudi Arabia, following approval by the institutional review board (IRB). Asthma is often diagnosed through standard tests among patients, including blood tests, virus tests, and biochemistry tests, and the data were collected according to these tests. For the current study, the patients and healthy controls (HC) were recruited and clinically evaluated based on the standard triage and pathological tests conducted at the hospital under the recommendations of physicians in the department. The clinical inclusion and exclusion criteria were defined by the doctors based on the laboratory results. Subsequently, the qualified and verified data were provided for the study. The dataset contains seventeen attributes, chosen because they are available for all asthma patients. The dataset includes 328 instances in total, with 164 asthma-positive and 164 asthma-negative instances. The distribution of gender in the positive and negative groups is listed in Table 1.
| Gender | Instances | Positive | Negative |
| Male | 145 | 107 | 38 |
| Female | 183 | 57 | 126 |
| Total | 328 | 164 | 164 |
Table 1: Dataset gender distribution. This table shows the gender distribution in the dataset.
Table 2 shows the age distribution between the two classes.
| Age Range | Age Distribution (Class = 1) | Age Distribution (Class = 0) |
| 1–10 | 0.059113 | 0 |
| 11–20 | 0.113301 | 0.060869 |
| 21–30 | 0.083743 | 0.177947 |
| 31–40 | 0.157632 | 0.164912 |
| 41–50 | 0.17734 | 0.164912 |
| 51–60 | 0.197044 | 0.099566 |
| 61–70 | 0.108374 | 0.047619 |
| 71–80 | 0.078817 | 0.025974 |
| 81–90 | 0.019704 | 0.012987 |
| 91–100 | 0.004926 | 0.012987 |
Table 2: Dataset age distribution. This table shows the age distribution in the dataset.
Age (p < 0.001, Mann-Whitney U test) and gender (p < 0.001, chi-squared test) were unbalanced between the positive and negative groups. However, there was no bias in the recruitment process, and the unbalanced distribution is believed to reflect the particular age distribution of this patient cohort. The inclusion criteria comprised clinical factors for the presence and absence of the disease; demographics, including gender, age, and local population, were also considered. Moreover, the data were collected under informed consent. Age and gender are included as potential features in the feature set, as listed below. However, explicit age- and gender-based analyses are outside the scope of this study and are left as future work.
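As a rough check of the gender imbalance, the Pearson chi-squared statistic can be recomputed from the counts in Table 1. The sketch below uses the closed-form expression for a 2 × 2 table without a continuity correction; whether the original SPSS analysis applied such a correction is not stated, so this is an assumption.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Counts from Table 1: rows are male and female, columns are positive and negative
chi2 = chi_squared_2x2(107, 38, 57, 126)

# Critical value of the chi-squared distribution with 1 degree of freedom
# at a significance level of 0.001
CRITICAL_0_001 = 10.828

print(round(chi2, 2), chi2 > CRITICAL_0_001)
```

The statistic comes out at roughly 58.9, well above the critical value, which is consistent with the reported p < 0.001 for gender.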
The dataset is stored as numeric values. The "Class" column contains the values 0 and 1, where 0 represents the "Normal patient" and 1 represents the "Asthma patient".
In this regard, the following seventeen attributes were used:
Class: (0 = “Normal”, 1 = “Asthma Patient”)
1. Age in years (AGE)
2. Gender (GENDER; 1 = male, 0 = female)
3. Basophils (BASO)
4. Eosinophils (EOS)
5. Hematocrit (HCT)
6. Haemoglobin (HGB)
7. Lymphocytes (LYMP)
8. Mean corpuscular haemoglobin (MCH)
9. Mean corpuscular haemoglobin concentration (MCHC)
10. Mean corpuscular volume (MCV)
11. Monocytes (MONO)
12. Mean platelet volume (MPV)
13. Neutrophils Function (NEUT)
14. Platelet count (PLATELET_COUNT)
15. Red blood cell count (RBC)
16. Red cell distribution width (RDW)
17. White Blood Cell (WBC)
Statistical features of the dataset
For better understanding, a statistical analysis of the dataset is depicted in the subsequent tables38. Table 3 shows the mean, median, standard deviation, maximum, and minimum values of each attribute in the dataset under consideration.
| Attribute | Mean | Median | Standard deviation | Max | Min |
| AGE | 39.1 | 36 | 18.136 | 93 | 2 |
| GENDER | 0.442 | 0 | 0.4974 | 1 | 0 |
| BASO | 0.07 | 0.07 | 0.0701 | 0.72 | 0 |
| EOS | 0.231 | 0.229 | 0.2048 | 2 | 0 |
| HCT | 40.88 | 40.7 | 6.0313 | 56.5 | 3.4 |
| HGB | 13.67 | 13.45 | 4.3446 | 81.3 | 5.9 |
| LYMP | 2.595 | 2.33 | 2.1843 | 33.4 | 0.4 |
| MCH | 26.78 | 27 | 3.3177 | 34.5 | 2.6 |
| MCHC | 32.71 | 33 | 2.1017 | 43.1 | 15 |
| MCV | 81.15 | 82.8 | 9.4426 | 102 | 13 |
| MONO | 1.189 | 0.613 | 6.1198 | 95 | 0.2 |
| MPV | 9.217 | 9.217 | 1.7297 | 13.4 | 0 |
| NEUT | 4.23 | 4.23 | 2.3074 | 16 | 0.6 |
| PLATELET_COUNT | 279.7 | 272.5 | 85.473 | 685 | 0 |
| RBC | 5.677 | 5 | 11.615 | 215 | 2.1 |
| RDW | 14.72 | 13.4 | 15.769 | 296 | 0 |
| WBC | 6.777 | 6.505 | 3.5836 | 33.4 | 1.1 |
| Class | 0.5 | 0.5 | 0.5008 | 1 | 0 |
Table 3: Statistical features of the dataset. This table lists the statistical features of the data.
Additionally, Table 4 shows the correlation coefficient between each attribute and the class attribute. The coefficient ranges from -1 to 1; values near 0 indicate little correlation, and a larger absolute value indicates a stronger correlation. It is included as a preliminary statistical analysis to describe the dataset attributes. MPV, gender, HGB, MCHC, and age are the most strongly positively correlated features, while HCT and BASO show the strongest negative correlations.
| Attribute | Correlation coefficient |
| AGE | 0.212473 |
| GENDER | 0.423584 |
| BASO | -0.33242 |
| EOS | 0.048184 |
| HCT | -0.376843 |
| HGB | 0.275277 |
| LYMP | 0.055856 |
| MCH | 0.128111 |
| MCHC | 0.272635 |
| MCV | 0.013022 |
| MONO | 0.055476 |
| MPV | 0.564992 |
| NEUT | 0.031185 |
| PLATELET_COUNT | 0.095253 |
| RBC | 0.030787 |
| RDW | 0.025979 |
| WBC | 0.148842 |
| Class | 1 |
Table 4: Correlation with class attribute. This table shows the correlation between the attributes (features) and the class attribute.
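The kind of ranking summarized in Table 4 can be sketched as follows, assuming Pearson's correlation coefficient (the text does not name the variant used). The feature names and values are toy data for illustration only, not values from the study's dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_abs_correlation(features, labels):
    """Sort features by the absolute value of their correlation with the labels."""
    scores = {name: pearson(values, labels) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data: four instances with a binary class label
labels = [0, 0, 1, 1]
features = {
    "weak": [1.0, 2.0, 1.0, 2.0],        # unrelated to the label
    "strong_neg": [9.0, 8.0, 2.0, 1.0],  # decreases as the label increases
}

ranked = rank_by_abs_correlation(features, labels)
print(ranked)
```

Ranking by absolute value matters because a strongly negative coefficient is as informative for classification as a strongly positive one.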
Experimental setup
In the current study, the Python programming language was used to preprocess the collected dataset. Because of human error during data entry and selection, some values were missing. These were handled by imputation, replacing each missing value with the attribute's average. This technique was chosen, in discussion with healthcare professionals, for its simplicity, model compatibility, and preservation of sample mean values. Moreover, feature selection was performed by ranking the attributes by their correlation coefficients. The features with higher correlation values were selected for analysis alongside the complete feature set. Python libraries were used for the implementation of the proposed approach, hyperparameter tuning, and obtaining results. Feature selection and elimination were performed with an attribute selection method to identify the feature groups with the highest accuracy. Additionally, accuracy was evaluated in Python using 10-fold cross-validation and Direct Partition Ratio evaluation methods, as specified by the given equations39,40. Finally, the mean of the outputs of all evaluation methods was calculated to compare the assessment methods and determine the one with the best average accuracy38. In addition, confusion matrices were generated using the optimal parameters of SVM, ANN, RF, and NB. The false positive (FP), false negative (FN), true positive (TP), and true negative (TN) ratios were then compared by calculating the F1-score, which is an efficient measure for identifying the best method41,42.
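Two of the preprocessing and evaluation steps described above, mean imputation and 10-fold splitting, can be sketched in plain Python. This is an illustrative sketch, not the study's code: the function names and the sample column are hypothetical, and the fold splitter omits the shuffling or stratification that a library implementation would normally provide.

```python
def impute_with_mean(column):
    """Replace None entries with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Hypothetical attribute column with one missing value
column = [13.0, None, 14.0, 12.0]
filled = impute_with_mean(column)  # the observed mean is 13.0

# 328 instances split into 10 folds; each fold serves once as the test set
folds = k_fold_indices(328, 10)
print(filled, [len(fold) for fold in folds])
```

In each of the 10 rounds, one fold is held out for testing and the remaining nine are used for training; the reported accuracy is the average over the rounds.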
Evaluation metrics
To evaluate the performance of the proposed approach, four well-known performance metrics are considered, namely accuracy, precision, recall, and F1-score. Classification accuracy is the primary measure; it is the percentage of instances that are correctly classified. Precision is the proportion of predicted positives that are true positives. Recall is the proportion of actual positives that are correctly identified. The F1-score is the harmonic mean of precision and recall. These metrics are computed from the following outcome counts43,44,45:
True positive (TP): an asthma case correctly diagnosed as asthma.
False positive (FP): a non-asthma case incorrectly diagnosed as asthma.
True negative (TN): a non-asthma case correctly diagnosed as non-asthma.
False negative (FN): an asthma case incorrectly diagnosed as non-asthma.
(4)
(5)
(6)
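The four metrics named above follow directly from the TP, FP, TN, and FN counts. The sketch below uses their standard definitions; the counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical confusion-matrix counts for illustration
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

Because the F1-score combines precision and recall, it penalizes a classifier that achieves one at the expense of the other, which is why it is useful for comparing methods.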
Optimization strategy
To determine the optimal parameters for each of the proposed approaches, the Grid Search technique was employed. A set of candidate values is specified for each parameter, and Grid Search evaluates every combination of these values to identify the one that yields the best performance.
Figures 5–7 depict the outcome of the Grid Search technique for SVM, RF, and ANN, applied to the dataset using the top eighteen attributes. Figure 5 presents the SVM accuracies obtained with various parameter values. The input values for the cost and gamma, together with the kernel types, are as follows:
Kernel types: Linear, Radial, Sigmoid, and Polynomial.
Gamma: 0.001 and 0.0001.
Cost: 1, 10, 100, and 1000.

Figure 5: SVM with different cost and gamma types. This figure demonstrates SVM behavior with different cost and gamma types. Abbreviation: SVM = support vector machine.
For the RF, the input values indicated in Figure 6, listed by parameter name, are as follows:
Number of features: 'auto', 'sqrt'.
Max depth: 1, 3, 5, 7.
Min samples split: 2, 3, 4, 5.
Min samples leaf: 2, 3, 4, 5.

Figure 6: RF with different depth values and minimum sample leaves. This figure depicts RF behavior with different depths and minimum sample leaf values. Abbreviation: RF = random forest.
For the ANN, the input values indicated in Figure 7, listed by parameter name, are as follows:
Hidden layer sizes: (50, 50, 50), (50, 100, 50), and (100,).
Activation: 'tanh' and 'relu'.
Solver: 'sgd' and 'adam'.
Alpha: 0.1, 0.01, 0.001, 0.0001, 0.5, 0.005, 0.0005, and 0.05.
Learning rate: 'constant' and 'adaptive'.

Figure 7: ANN with different alpha values and hidden layer sizes. This figure depicts the behavior of the ANN with different alpha values and hidden layer sizes. Abbreviation: ANN = artificial neural network.
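The size of each search space can be made concrete by enumerating the grids listed above. The sketch below expands every combination with `itertools.product`. The dictionary keys follow scikit-learn parameter naming, which is an assumption, since the text does not name the library; `expand_grid` and `grid_search` are hypothetical helper names, and the scoring function is left to the caller.

```python
from itertools import product

# Candidate values as listed in the text (keys follow scikit-learn naming, an assumption)
svm_grid = {
    "kernel": ["linear", "rbf", "sigmoid", "poly"],
    "gamma": [0.001, 0.0001],
    "C": [1, 10, 100, 1000],
}
rf_grid = {
    "max_features": ["auto", "sqrt"],
    "max_depth": [1, 3, 5, 7],
    "min_samples_split": [2, 3, 4, 5],
    "min_samples_leaf": [2, 3, 4, 5],
}
ann_grid = {
    "hidden_layer_sizes": [(50, 50, 50), (50, 100, 50), (100,)],
    "activation": ["tanh", "relu"],
    "solver": ["sgd", "adam"],
    "alpha": [0.1, 0.01, 0.001, 0.0001, 0.5, 0.005, 0.0005, 0.05],
    "learning_rate": ["constant", "adaptive"],
}


def expand_grid(grid):
    """Return one parameter dictionary per combination of candidate values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def grid_search(grid, score_fn):
    """Return the combination with the highest score under score_fn.

    In practice, score_fn would train a model with the given parameters
    and return its cross-validated accuracy.
    """
    return max(expand_grid(grid), key=score_fn)


for name, grid in [("SVM", svm_grid), ("RF", rf_grid), ("ANN", ann_grid)]:
    print(name, len(expand_grid(grid)))
```

The grids contain 32, 128, and 192 combinations for SVM, RF, and ANN, respectively, and each combination is evaluated once per cross-validation fold, which is what makes an exhaustive grid search computationally expensive.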
| Classifier | Parameter | Value |
| SVM | Gamma | 0.0001 |
| SVM | Kernel type | 'rbf' |
| SVM | Cost (C) | 10 |
| SVM | Random state | 42 |
| RF | Max depth | 3 |
| RF | Min samples leaf | 2 |
| RF | Min samples split | 2 |
| RF | Random state | 42 |
| ANN | Activation | 'relu' |
| ANN | Alpha | 0.001 |
| ANN | Hidden layer sizes | (50, 50, 50) |
| ANN | Learning rate | 'constant' |
| ANN | Solver | 'adam' |
| ANN | Random state | 42 |
Table 5: Summary of optimal parameters for the proposed classifiers. This table summarizes the optimal parameters obtained for the proposed classifiers.