Selecting Multiple Biomarker Subsets with Similarly Effective Binary Classification Performances

Xin Feng; Shaofei Wang; Quewang Liu; Han Li; Jiamei Liu; Cheng Xu; Weifeng Yang; Yayun Shu; Weiwei Zheng; Bingxin Yu; Mingran Qi; Wenyang Zhou; Fengfeng Zhou

doi:10.3791/57738

Cancer Research

Selecting Multiple Biomarker Subsets with Similarly Effective Binary Classification Performances

Published: October 11, 2018 doi: 10.3791/57738

Xin Feng¹, Shaofei Wang¹, Quewang Liu¹, Han Li², Jiamei Liu², Cheng Xu², Weifeng Yang², Yayun Shu², Weiwei Zheng¹, Bingxin Yu³, Mingran Qi⁴, Wenyang Zhou¹, Fengfeng Zhou¹

¹College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, ²College of Software, Jilin University, ³Ultrasonography Department, China-Japan Union Hospital of Jilin University, ⁴Department of Pathogenobiology, College of Basic Medical Science, Jilin University

Summary

Existing algorithms generate one solution for a biomarker detection dataset. This protocol demonstrates the existence of multiple similarly effective solutions and presents a user-friendly software to help biomedical researchers investigate their datasets for the proposed challenge. Computer scientists may also provide this feature in their biomarker detection algorithms.

Abstract

Biomarker detection is one of the more important biomedical questions for high-throughput 'omics' researchers, and almost all existing biomarker detection algorithms generate one biomarker subset with the optimized performance measurement for a given dataset. However, a recent study demonstrated the existence of multiple biomarker subsets with similarly effective or even identical classification performances. This protocol presents a simple and straightforward methodology for detecting biomarker subsets with binary classification performances, better than a user-defined cutoff. The protocol consists of data preparation and loading, baseline information summarization, parameter tuning, biomarker screening, result visualization and interpretation, biomarker gene annotations, and result and visualization exportation at publication quality. The proposed biomarker screening strategy is intuitive and demonstrates a general rule for developing biomarker detection algorithms. A user-friendly graphical user interface (GUI) was developed using the programming language Python, allowing biomedical researchers to have direct access to their results. The source code and manual of kSolutionVis can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.

Introduction

Binary classification, one of the most commonly investigated and challenging data mining problems in the biomedical area, is used to build a classification model trained on two groups of samples with the most accurate discrimination power¹^,²^,³^,⁴^,⁵^,⁶^,⁷. However, the big data generated in the biomedical field has the inherent "large p small n" paradigm, with the number of features usually much larger than the number of samples⁶^,⁸^,⁹. Therefore, biomedical researchers have to reduce the feature dimension before utilizing the classification algorithms to avoid the overfitting problem⁸^,⁹. Diagnosis biomarkers are defined as a subset of detected features separating patients of a given disease from healthy control samples¹⁰^,¹¹. Patients are usually defined as the positive samples, and the healthy controls are defined as the negative samples¹².

Recent studies have suggested that there exists more than one solution with identical or similarly effective classification performances for a biomedical dataset⁵. Almost all the feature selection algorithms are deterministic algorithms, producing only one solution for the same dataset. Genetic algorithms may simultaneously generate multiple solutions with similar performances, but they still try to select one solution with the best fitness function as the output for a given dataset¹³^,¹⁴.

Feature selection algorithms can be roughly grouped as either filters or wrappers¹². A filter algorithm chooses the top-k features ranked by their significant individual association with the binary class labels based on the assumption that features are independent of each other¹⁵^,¹⁶^,¹⁷. Although this assumption does not hold true for almost all real-world datasets, the heuristic filter rule performs well in many cases, for instance, the mRMR (Minimum Redundancy and Maximum Relevance) algorithm, the Wilcoxon test based feature filtering (WRank) algorithm, and the ROC (Receiver operating characteristic) plot based filtering (ROCRank) algorithm. mRMR, is an efficient filter algorithm because it approximates the combinatorial estimation problem with a series of much smaller problems, comparing to the maximum-dependency feature selection algorithm, each of which only involves two variables, and therefore uses pairwise joint probabilities which are more robust¹⁸^,¹⁹. However, mRMR may underestimate the usefulness of some features as it does not measure the interactions between features which can increase relevancy, and thus misses some feature combinations that are individually useless but are useful only when combined. The WRank algorithm calculates a non-parametric score of how discriminative a feature is between two classes of samples, and is known for its robustness for outliers²⁰^,²¹. Furthermore, the ROCRank algorithm evaluates how significant the Area Under the ROC Curve (AUC) of a particular feature is for the investigated binary classification performance²²^,²³.

On the other hand, a wrapper evaluates the pre-defined classifier's performance of a given feature subset, iteratively generated by a heuristic rule, and creates the feature subset with the best performance measurement²⁴. A wrapper generally outperforms a filter in the classification performance but runs slower²⁵. For example, the Regularized Random Forest (RRF)²⁶^,²⁷ algorithm uses a greedy rule, by evaluating the features on a subset of the training data at each random forest node, whose feature importance scores are evaluated by the Gini index. The choice of a new feature will be penalized if its information gain does not improve that of the chosen features. Additionally, the Prediction Analysis for Microarrays (PAM)²⁸^,²⁹ algorithm, also a wrapper algorithm, calculates a centroid for each of the class labels, and then selects features to shrink the gene centroids toward the overall class centroid. PAM is robust for outlying features.

Multiple solutions with the top classification performance may be necessary for any given dataset. Firstly, the optimization goal of a deterministic algorithm is defined by a mathematical formula, e.g., minimum error rate³⁰, which is not necessarily ideal for biological samples. Secondly, a dataset may have multiple, significantly different, solutions with similar effective or even identical performances. Almost all existing feature selection algorithms will randomly select one of these solutions as the output³¹.

This study will introduce an informatics analytic protocol for generating multiple feature selection solutions with similar performances for any given binary classification dataset. Considering that most biomedical researchers are not familiar with informatic techniques or computer coding, a user-friendly graphical user interface (GUI) was developed to facilitate the rapid analysis of biomedical binary classification datasets. The analytic protocol consists of data loading and summarizing, parameter tuning, pipeline execution, and result interpretations. With a simple click, the researcher is able to generate the biomarker subsets and publication-quality visualization plots. The protocol has been tested using the transcriptomes of two binary classification datasets of Acute Lymphoblastic Leukemia (ALL), i.e., ALL1 and ALL2¹². The datasets of ALL1 and ALL2 were downloaded from the Broad Institute Genome Data Analysis Center, available at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. ALL1 contains 128 samples with 12,625 features. Of these samples, 95 are B-cell ALL and 33 are T-cell ALL. ALL2 includes 100 samples with 12,625 features as well. Of these samples, there are 65 patients that suffered relapse and 35 patients that did not. ALL1 was an easy binary classification dataset, with a minimum accuracy of four filters and four wrappers being 96.7%, and 6 of the 8 feature selection algorithms achieving 100%¹². While ALL2 was a more difficult dataset, with the above 8 feature selection algorithms achieving no better than 83.7% accuracy¹². This best accuracy was achieved with 56 features detected by the wrapper algorithm, Correlation-based Feature Selection (CFS).

Subscription Required. Please recommend JoVE to your librarian.

Protocol

NOTE: The following protocol describes the details of the informatics analytic procedure and pseudo-codes of the major modules. The automatic analysis system was developed using Python version 3.6.0 and the Python modules pandas, abc, numpy, scipy, sklearn, sys, PyQt5, sys, mRMR, math and matplotlib. The materials used in this study are listed in the Table of Materials.

1. Prepare the Data Matrix and Class Labels

Prepare the data matrix file as a TAB- or comma-delimited matrix file, as illustrated in Figure 1A.
NOTE: Each row has all the values of a feature, and the first item is the feature name. A feature is a probeset ID for the microarray-based transcriptome dataset or may be another value ID like a cysteine residue with its methylation value in a methylomic dataset. Each column gives the feature values of a given sample, with the first item being the sample name. A row is separated into columns by a TAB (Figure 1B) or a comma (Figure 1C). A TAB-delimited matrix file is recognized by the file extension .tsv, and a comma-delimited matrix file has the extension .csv. This file may be generated by saving a matrix as either the .tsv or .csv format from software such as Microsoft Excel. The data matrix may also be generated by computer coding.
Prepare the class label file as a TAB- or comma-delimited matrix file (Figure 1D), similar to the data matrix file.
NOTE: The first column gives the sample names, and the class label of each sample is given in the column titled Class. Maximal compatibility is considered in the coding process, so that additional columns may be added. The class label file may be formatted as a .tsv or .csv file. The names in the column Class may be any terms, and there may be more than two classes of samples. The user may choose any two of the classes for the following analysis.

2. Load the Data Matrix and Class Labels

Load the data matrix and class labels into the software. Click the button Load data matrix to choose the user-specified data matrix file. Click the button Load class labels to choose the corresponding class label file.
NOTE: After both files are loaded, kSolutionVis will conduct a routine screen of the compatibility between the two files.
Summarize the features and samples from the data matrix file. Estimate the size of the data matrix file.
Summarize the samples and classes from the class label file. Estimate the size of the class label file.
Test whether each sample from the data matrix has a class label. Summarize the numbers of the samples with the class labels.

3. Summarize and Display the Baseline Statistics of the Dataset

Click the button Summarize, without any specified keyword input, and the software will display 20 indexed features and the corresponding features names.
NOTE: Users need to specify the feature name they wish to find to see its baseline statistics and corresponding value distribution among all input samples.
Provide a keyword, e.g. “1000_at”, in the textbox Feature to find a specific feature to be summarized. Click the button Summarize to get the baseline statistics for this given feature.
NOTE: The keyword may appear anywhere in the target feature names, facilitating the search process for users.
Click the button Summarize to find more than one feature with the given keyword, and then specify the unique feature ID to proceed with the above step of summarizing one particular feature.

4. Determine the Class Labels and the Number of Top-ranked Features

Choose the names of Positive (“P (33)”) and Negative (“N (95)”) classes in the dropdown boxes Class Positive and Class Negative, as shown in Figure 2 (middle).
NOTE: It is suggested to choose a balanced binary classification dataset, i.e., the difference between the numbers of positive and negative samples is minimal. The number of samples is also given in parenthesis after the name of each class label in the two dropdown boxes.
Choose 10 as the number of top-ranked features (parameter pTopX) in the dropdown box Top_X (?) for a comprehensive screen of the feature-subset.
NOTE: The software automatically ranks all the features by the P-value calculated by a t-test of each feature comparing the positive and negative classes. A feature with a smaller P-value has a better discriminating power between the two classes of samples. The comprehensive screening module is computationally intensive. The parameter pTopX is 10 by default. Users can change this parameter in the range of 10 to 50, until they find satisfying feature subsets with good classification performances.

5. Tune System Parameters for Different Performances

Choose the performance measurement (pMeasurement) Accuracy (Acc) in the dropdown box Acc/bAcc (?) for the selected classifier Extreme Learning Machine (ELM). Another option of this parameter is the measurement Balanced Accuracy (bAcc).
NOTE: Let TP, FN, TN, and FP be the numbers of true positives, false negatives, true negatives and false positives, respectively. The measurement Acc is defined as (TP+TN)/(TP+FN+TN+FP), which works best on a balanced dataset⁶. But a classifier optimized for Acc tends to assign all the samples to the negative class if the number of negative samples is much larger than that of positive ones. The bAcc is defined as (Sn+Sp)/2, where Sn = TP/(TP+FN) and Sp = TN/(TN+FP) are the correctly predicted rates for positive and negative samples, respectively. Therefore, bAcc normalizes the prediction performances over the two classes, and may lead to a balanced prediction performance over two unbalanced classes. Acc is the default choice of pMeasurement. The software uses the classifier ELM by default to calculate the classification performances. The user may also choose a classifier from SVM (Support Vector Machine), KNN (k Nearest Neighbor), Decision Tree, or Naïve Bayes.
Choose the cutoff value 0.70 (parameter pCutoff) for the specified performance measurement in the input box pCutoff:.
NOTE: Both Acc and bAcc range between 0 and 1, and the user may specify a value pCutoff[0, 1] as the cutoff to display the matched solutions. The software carries out a comprehensive feature-subset screening, and an appropriate choice of pCutoff will make the 3D visualization more intuitive and explicit. The default value for pCutoff is 0.70.

6. Run the Pipeline and Produce the INTERACTIVE VISUALIZED RESULTS

Click the button Analyze to run the pipeline and generate the visualization plots, as shown in Figure 2 (bottom).
NOTE: The left table gives all the feature subsets and their pMeasurement calculated by the 10-fold cross validation strategy of the classifier ELM, as described previously⁵. Two 3D scatter plots and two-line plots are generated for the feature-subset screening procedure with the current parameter settings.
Choose 0.70 as the default value of the pMeasurement cutoff (parameter piCutoff, input box Value), and 10 as the default of the number of best feature subsets (parameter piFSNum).
NOTE: The pipeline is executed using the parameters pTopX, pMeasurement, and pCutoff. The detected feature subsets may be further screened using the cutoff piCutoff, however piCutoff cannot be smaller than pCutoff. Therefore, piCutoff is initialized as pCutoff and only the feature subsets with the performance measurement ≥ piCutoff will be visualized. The default value of piCutoff is pCutoff. Sometimes kSolutionVis detects many solutions, and only the best piFSNum (default: 10) feature subsets will be visualized. If the number of feature subsets detected by the software is smaller than piFSNum, all the feature subsets will be visualized.
Collect and interpret the features detected by the software, as shown in Figure 3.
NOTE: The table in the left box shows the detected feature subsets and their performance measurements. The names of the first three columns are “F1”, “F2”, and “F3”. The three features in each feature subset are given in their ranking order in one row (F1 < F2 < F3). The last column gives the performance measurement (Acc or bAcc) of each feature subset, and its column name (Acc or bAcc) is the value of pMeasurement.

7. Interpret the 3D Scatter Plots-Visualize and Interpret the Feature Subsets with Similarly Effective Binary Classification Performances Using 3D Scatter Plots

Click the button Analyze to generate the 3D scatter plot of the top 10 feature subsets with the best classification performances (Acc or bAcc) detected by the software, as shown in Figure 3 (middle box). Sort the three features in a feature subset in ascending order of their ranks and use the ranks of the three features as the F1/F2/F3 axes, i.e., F1 < F2 < F3.
NOTE: The color of a dot represents the binary classification performance of the corresponding feature subset. A dataset may have multiple feature subsets with similarly effective performance measurements. Therefore, an interactive and simplified scatter plot is necessary.
Change the value to 0.70 in the input box pCutoff: and click the button Analyze to generate the 3D scatter plot of the feature subsets with the performance measurement ≥ piCutoff, as seen in Figure 3 (right box). Click the button 3D tuning to open a new window to manually tune the viewing angles of the 3D scatter plot.
NOTE: Each feature subset is represented by a dot in the same way as above. The 3D scatter plot was generated in the default angle. To facilitate the 3D visualization and tuning, a separate window will be opened by clicking the button 3D tuning.
Click the button Reduce to reduce the redundancy of the detected feature subsets.
NOTE: If users wish to further select the feature triplets and minimize the redundancy of the feature subsets, the software also provides this function using the mRMR feature selection algorithm. After clicking the Reduce button, kSolutionVis will remove those redundant features in the feature triplets and regenerate the table and the two scatter plots mentioned above. The removed features of the feature triplets will be replaced by the key word in the table. The values of None in the F1/F2/F3 axis will be denoted as the value of piFSNum (the range of the normal value of F1/F2/F3 is [1, top_x]). Therefore, the dots that include a None value may appear to be “outlier” dots in the 3D plots. The manually tunable 3D plots may be found in “Manual tuning of the 3D dot plots” in the supplementary material.

8. Find Gene Annotations and Their Associations with Human Diseases

NOTE: Steps 8 to 10 will illustrate how to annotate a gene from the sequence level of both DNA and protein. Firstly, the gene symbol of each biomarker ID from the above steps will be retrieved from the database DAVID³², and then two representative web servers will be used to analyze this gene symbol from the levels of DNA and protein, respectively. The server GeneCard provides a comprehensive functional annotation of a given gene symbol, and the Online Mendelian Inheritance in Man database (OMIM) provides the most comprehensive curation of disease-gene associations. The server UniProtKB is one of the most comprehensive protein database, and the server Group-based Prediction System (GPS) predicts the signaling phosphorylation’s for a very large list of kinases.

Copy and paste the web link of the database DAVID into a web browser and open the web page of this database. Click the link Gene ID Conversion seen in Figure 4A and input the feature IDs 38319_at/38147_at/33238_at of the first biomarker subset of the dataset ALL1 (Figure 4B). Click the link Gene List and click Submit List as shown in Figure 4B. Retrieve the annotations of interest and click Show Gene List (Figure 4C). Get the list of gene symbols (Figure 4D).
NOTE: The gene symbols retrieved here will be used for further functional annotations in the next steps.
Copy and paste the web link of the database Gene Cards into a web browser and open the web page of this database. Search a gene’s name CD3D in the database query input box and find the annotations of this gene from Gene Cards³³^,³⁴, as shown in Table 1 and Figure 5A.
NOTE: Gene Cards is a comprehensive gene knowledgebase, providing nomenclature, genomics, proteomics, subcellular localization, and involved pathways and other functional modules. It also provides external links to various other biomedical databases like PDB/PDB_REDO³⁵, Entrez Gene³⁶, OMIM³⁷, and UniProtKB³⁸. If the feature name is not a standard gene symbol, use the database ENSEMBL to convert it³⁹. CD3D is the name of the gene T-Cell Receptor T3 Delta Chain.
Copy and paste the web link of the database OMIM into a web browser and open the web page of this database. Search a gene’s name CD3D and find the annotations of this gene from the database OMIM³⁷, as shown in Table 1 and Figure 5B.
NOTE: OMIM serves now as one of the most comprehensive and authoritative sources of human gene connections with inheritable diseases. OMIM was initiated by Dr. Victor A. McKusick to catalog the disease-associated genetic mutations⁴⁰. OMIM now covers over 15,000 human genes and over 8,500 phenotypes, as of December 1^st 2017.

9. Annotate the Encoded Proteins and the Post-Translational Modifications

Copy and paste the web link of the database UniProtKB into a web browser and open the web page of this database. Search a gene’s name CD3D in the query input box of UniProtKB and find the annotations of this gene from the database³⁸, as shown in Table 1 and Figure 5C.
NOTE: UniProtKB collects a rich source of annotations for proteins, including both nomenclature and functional information. This database also provides external links to other widely used databases, including PDB/PDB_REDO³⁵, OMIM³⁷, and Pfam⁴¹.
Copy and paste the web link of the web server GPS into a web browser and open the web page of this web server. Retrieve the protein sequence encoded by the biomarker gene CD3D from the UniProtKB database³⁸ and predict the protein’s post-translational modification (PTM) residues using the online tool GPS, as shown in Table 1 and Figure 5D.
NOTE: A biological system is dynamic and complicated, and the existing databases collect only known information. Therefore, biomedical prediction online tools as well as offline programs may provide useful evidence to complement a hypothesized mechanism. GPS has been developed and improved for over 12 years⁷^,⁴² and may be used to predict a protein’s PTM residues in a given peptide sequence⁴³^,⁴⁴. Tools are also available for various research topics, including the prediction of a protein’s subcellular location⁴⁵ and transcription factor binding motifs ⁴⁶ among others .

10. Annotate Protein-Protein Interactions and Their Enriched Functional Modules

Copy and paste the web link of the web server String into a web browser and open the web page of this web server. Search the list for the genes CD3D and P53, and find their orchestrated properties using the database String⁴⁷. The same procedure may be carried out using another web server, DAVID³².
NOTE: Besides the aforementioned annotations for individual genes, there are many large-scale informatics tools available to investigate the properties of a group of genes. A recent study demonstrated that individually bad marker genes might constitute a much-improved gene set⁵. Therefore, it’s worth the computational cost to screen for more complicated biomarkers. The database String may visualize the known or predicted interaction connections, and the David server may detect the functional modules with significant phenotype-associations in the queried genes⁴⁷,³². Various other large-scale informatics analysis tools are also available.

11. Export the Generated Biomarker Subsets and the Visualization Plots

Export the detected biomarker subsets as a .tsv or .csv text file for further analysis. Click the button Export the Table under the table of all the detected biomarker subsets and choose which text format to save as.
Export the visualization plots as an image file. Click the button Save under each plot and choose which image format to save as.
NOTE: The software supports the pixel format .png and the vector format .svg. The pixel images are good for displaying on the computer screen, while the vector images may be converted to any resolution required for journal publication purposes.

Subscription Required. Please recommend JoVE to your librarian.

Representative Results

The goal of this workflow (Figure 6) is to detect multiple biomarker subsets with similar efficiencies for a binary classification dataset. The whole process is illustrated by two example datasets ALL1 and ALL2 extracted from a recently-published biomarker detection study¹²^,⁴⁸. A user may install kSolutionVis by following the instructions in the supplementary materials.

Dataset ALL1 profiled 12 625 transcriptomic features of 95 B-cell and 33 T-cell ALL patient blood samples. While dataset ALL2 detected the expression levels of 12 625 transcriptomic features for 65 ALL patients who relapsed after the treatment and 35 ALL patients who did not. For the user's convenience, both transcriptomic datasets and their class labels are provided in version 1.4 of the software. Both datasets are in the subdirectory "data" of the software's source code directory.

The two datasets, ALL1 and ALL2, were formatted as .csv files and loaded into the software using the Load data matrix and Load class labels buttons, as shown in Figure 7A-B. Figure 7A shows that all 128 samples with 12 625 features were loaded, and all 128 samples also have class labels. The final data matrix has 95 negative samples (B-cell ALL) and 33 positive samples (T-cell ALL). Additionally, users may also determine which class label is the positive class label (Figure 7A, bottom). If the class label file defines more than two classes, users may want to choose which two class labels to investigate. Similar operations were also conducted for the difficult dataset ALL2, as shown in Figure 7B.

The value distributions of the features in the data matrix may be investigated by clicking the button Summarize while searching for a user-specific keyword in the feature names, as shown in Figure 8. Figure 8A illustrates the histogram of feature 1012_at in the dataset ALL1. Furthermore, as seen in Figure 8B, the same feature 1012_at has a similar distribution of expression in both datasets. If no keyword was specified by the user, some feature names would be listed to help the users to decide which features to summarize.

The easier dataset ALL1 screened the top 10 ranked features (pTopX) for biomarker subsets with the pMeasurement Acc ≥ 0.90 (pCutoff). After clicking the button Run, the algorithm was executed, and the results as seen in Figure 9A, were illustrated in the bottom part of the software after a few seconds. From this, 120 qualified biomarker subsets were detected and listed in the left table of Figure 9A. ALL1 was an easy-to-discriminate dataset, in that it has 57 triplet biomarker subsets with 100% in Acc. This protocol emphasizes the existence of multiple similarly effective solutions for a binary classification problem. Therefore, the first 3D scatter plot may illustrate more than 10 (parameter piFSNum) biomarker subsets, if they have the classification performance Acc (parameter pMeasurement) ≥ that of the top 10 ranked (parameter piFSNum) biomarker subset. The user may also choose to display fewer biomarker subsets by changing the parameter piCutoff in the parameter box above the table in Figure 9A. The manual tuning of the 3D plots may be found in the section Manual tuning of the 3D dot plots in the supplementary material.

Furthermore, all the results may be exported as external files for further analysis by clicking the button Export the Table under the table or scatter plots, as shown in Figure 9.

The first biomarker subset (38319_at, 38147_at, and 33238_at) for the dataset ALL1 was chosen for functional investigations, as shown in Figure 9A. The search module of ENSEMBL (http://useast.ensembl.org/Multi/Search/New?db=core) annotated these three features as a gene cluster of differentiation 3 delta (CD3D, 38319_at), Signaling Lymphocytic Activation Molecule-Associated gene (SH2D1A, 38147_at) and Lymphocyte Cell-Specific Protein-Tyrosine Kinase (LCK, 33238_at). Furthermore, the gene-disease association database OMIM³⁷^,⁴⁰ suggested that the gene CD3D encodes the delta subunit of the T-cell antigen receptor complex and is involved in the 11q23 translocations frequently observed in acute leukemia in humans⁴⁹^,⁵⁰. OMIM also suggested that genomic mutations within the gene SH2D1A in the chromosome region of Xq25 may be associated with B-cell leukemia⁵¹^,⁵². Additionally, OMIM also highlighted a possible T-cell ALL associated fusion event of the LCK and beta T-cell receptor (TCRB)⁵³. Users may investigate other functional aspects of these biomarkers with their gene symbols, e.g., gene function annotations in Entrez Gene³⁶, protein function annotations in UniProtKB³⁸ or Pfam⁴¹, 3D protein structures in PDB/PDB_REDO³⁵, and PTM residues in GPS⁷^,⁴²^,⁴³^,⁴⁴. The interacting sub-network (database string⁴⁷) and enriched functional modules (database David³²) may also be screened for these biomarkers as an entirety. Various other databases or web servers may also facilitate the annotations and in silico predictions using the symbols or primary gene/protein sequences of these genes.

As seen in Table 2, the necessity of detecting more than one solution with identical or similarly effective performances is evident, with 57 groups of features with binary classification accuracies of 100% between B-cell and T-cell ALL samples. These particular biomarker subsets were called the perfect solutions. Quite a few biomarkers appeared in these perfect solutions repeatedly, suggesting that they may represent the key differences, on the molecular level, between B- and T-cell ALL. If the biomarker detection algorithm stops at detecting the first perfect solution of three genes CD3D/SH2D1A/LCK, another perfect solution CD74/HLA-DPB1/PRKCQ will be missed. For example, HLA-DPB1 is known to be significantly associated with the pediatric T-cell ALL but not B-cell ALL⁵⁴.

The three features of the first biomarker subset of ALL2 were chromatin assembly factor 1 subunit B (CHAF1B, 36912_at), exonuclease 1 (EXO1, 36041_at), and signal transducer and activator of transcription 6 (STAT6, 41222_at). CHAF1B was observed to be highly expressed in leukemia cell lines and the antibody against the CHAF1B encoded protein was significantly developed in acute myeloid leukemia (AML) patients⁵⁵. EXO1 was lost in some cases of acute leukemia⁵⁶, and upregulated in the leukemia cell line HL-60[R]. It also has been found to negatively regulate the alternative lengthening of telomeres (ALT) pathway, which facilitated the formation of ALT-associated PML (promyelocytic leukemia) bodies (APBs)⁵⁷. STAT6 was phosphorylated to activate the pro-survival and proliferative signaling pathway in the cases of relapsed AML⁵⁸. Taken together, the three genes were associated with the development and relapse of leukemia, but no explicit evidence was published on their associations with the ALL relapse. This may represent an interesting topic for further investigation.

The same annotation procedure may be conducted on any biomarker subset for ALL1 and ALL2. The three biomarkers investigated in the above section were not identified as relapse biomarkers in the dataset ALL2, as shown in Figure 9B. This suggests that biomarkers are phenotype-specific, which is another major challenge for biomarker detection, alongside the existence of multiple similarly effective solutions.

Some technical modules were implemented and described here for interested users. The error handling module provides informative messages for the user when errors occur during the execution of the software. The main error messages are listed and explained in "Error messages" in the supplementary material. A parallel calculation of the biomarkers was implemented for computers with more than one CPU core. The detailed improvements to the running time may be found in "Parallel running time" in the supplementary material. The data suggests that the usage of more CPU cores may not improve the running time due to the cost of switching between different CPU cores.

Figure 1: The example dataset extracted from the transcriptome dataset ALL1 has the first six features of the first nine samples of ALL1. The data matrix was formatted in (a) the visualization form, (b) the TAB-delimited text format file, and (c) the comma-delimited text format file. (d) The class label data was formatted in the visualization form. Due to the TAB character is invisible, it is illustrated as [TAB] in (b). The column Platform gives the microarray platform Affy in (b), and is not a required data column. Please click here to view a larger version of this figure.

Figure 2: Graphical user interface of the software. The baseline statistics are summarized in the top left box. Users may search for features of interest and investigate the value distributions in the two top right boxes. All the parameters for biomarker detection procedure may be tuned in the middle horizontal bar. All the biomarker subsets and their corresponding visualized distributions may be found in the bottom part. Please click here to view a larger version of this figure.

Figure 3: Biomarker subsets and their visualizations generated. Users may further refine the table and two 3D scatter plots using the parameters piCutoff and piFSNum. Please click here to view a larger version of this figure.

Figure 4: Gene annotations of the feature IDs detected in this study. Take the three feature IDs 38319_at/38147_at/33238_at of the first biomarker subset of the dataset ALL1. (a) Get the ID conversion module by clicking the link Gene ID Conversion. (b) Input the feature IDs in the red box 1, choose the feature type in the red box 2 (default "AFFYMETRIX_3PRIME_IVT_ID" is correct for this study), choose Gene List in the red box 3, and click Submit List in the red box 4. (c) Get all the functional annotations in this page and click Show Gene List to get the gene symbols of these queried features. (d) Get the gene symbols of the queried feature IDs. Please click here to view a larger version of this figure.

Figure 5: Annotations and enrichment analysis of the detected feature subsets. (a) Gene annotations from Gene Card. (b) OMIM describes the disease associations of each feature/gene. (c) Annotate the protein encoded by the gene of interest in the database UniProtKB. (d) Predict the tyrosine phosphorylation residues in the given protein using the online tool GPS. A red box was added to show the user where to click to input the query data. The primary sequence of the example protein CD3D may be retrieved as the FASTA format from the red box in (c), and input in the query window by click the red box in (d). Please click here to view a larger version of this figure.

Figure 6: Workflow of kSolutionVis. Each module of the software was described in the above protocol. Please click here to view a larger version of this figure.

Figure 7: Baseline statistics of the two representative datasets. The numbers of samples, features and classes in (a) ALL1 and (b) ALL2 are calculated. The file sizes of the data matrix and class labels are also detected. And a new data matrix is extracted from the samples with class labels. Please click here to view a larger version of this figure.

Figure 8: Histogram visualization of the feature 1012_at in the two datasets. Both baseline statistics and histogram were generated for (a) ALL1 and (b) ALL2. Please click here to view a larger version of this figure.

Figure 9: Biomarker subsets and the scatter plots of the two datasets. Users may change the parameters in the second row of parameter boxes to further refine the lists of biomarker subsets and 3D scatter plots for the datasets (a) ALL1 and (b) ALL2. Please click here to view a larger version of this figure.

Web site	Link	Functionality
GeneCards	http://www.genecards.org/cgi-bin/carddisp.pl?gene=CD3D	Gene annotation
OMIM	https://omim.org/entry/186790?search=CD3D&highlight=cd3d	Gene-disease association
UniProtKB	http://www.uniprot.org/uniprot/P04234	Protein annotation
GPS	http://gps.biocuckoo.org/	Protein’s PTM prediction
String	https://string-db.org/	Protein-protein interaction
David	https://david.ncifcrf.gov/	Gene Set Enrichment Analysis

Table 1. Websites for annotating and analyzing the detected biomarkers. A list of useful online tools that help annotate the detected biomarkers.

f1	f2	f3	Acc	Symbol1	Symbol2	Symbol3
38319_at	38147_at	33238_at	1.0000	CD3D	SH2D1A	LCK
33238_at	35016_at	37039_at	1.0000	LCK	CD74	HLA-DRA
38147_at	33238_at	35016_at	1.0000	SH2D1A	LCK	CD74
38147_at	33238_at	2059_s_at	1.0000	SH2D1A	LCK	LCK
38147_at	33238_at	37039_at	1.0000	SH2D1A	LCK	HLA-DRA
38147_at	33238_at	38095_i_at	1.0000	SH2D1A	LCK	HLA-DPB1
38147_at	33238_at	33039_at	1.0000	SH2D1A	LCK	TRAT1
38147_at	35016_at	2059_s_at	1.0000	SH2D1A	CD74	LCK
38147_at	35016_at	33039_at	1.0000	SH2D1A	CD74	TRAT1
38147_at	35016_at	38949_at	1.0000	SH2D1A	CD74	PRKCQ
38147_at	2059_s_at	37039_at	1.0000	SH2D1A	LCK	HLA-DRA
38147_at	2059_s_at	38095_i_at	1.0000	SH2D1A	LCK	HLA-DPB1
38147_at	37039_at	33039_at	1.0000	SH2D1A	HLA-DRA	TRAT1
38147_at	37039_at	38949_at	1.0000	SH2D1A	HLA-DRA	PRKCQ
38319_at	38147_at	35016_at	1.0000	CD3D	SH2D1A	CD74
38147_at	38833_at	38949_at	1.0000	SH2D1A	HLA-DPA1	PRKCQ
33238_at	35016_at	33039_at	1.0000	LCK	CD74	TRAT1
38319_at	38833_at	38949_at	1.0000	CD3D	HLA-DPA1	PRKCQ
33238_at	35016_at	38949_at	1.0000	LCK	CD74	PRKCQ
33238_at	2059_s_at	37039_at	1.0000	LCK	LCK	HLA-DRA
33238_at	37039_at	38095_i_at	1.0000	LCK	HLA-DRA	HLA-DPB1
33238_at	37039_at	33039_at	1.0000	LCK	HLA-DRA	TRAT1
33238_at	37039_at	38949_at	1.0000	LCK	HLA-DRA	PRKCQ
33238_at	38095_i_at	38949_at	1.0000	LCK	HLA-DPB1	PRKCQ
33238_at	38833_at	38949_at	1.0000	LCK	HLA-DPA1	PRKCQ
33238_at	33039_at	38949_at	1.0000	LCK	TRAT1	PRKCQ
35016_at	2059_s_at	33039_at	1.0000	CD74	LCK	TRAT1
35016_at	2059_s_at	38949_at	1.0000	CD74	LCK	PRKCQ
35016_at	38095_i_at	38949_at	1.0000	CD74	HLA-DPB1	PRKCQ
2059_s_at	37039_at	33039_at	1.0000	LCK	HLA-DRA	TRAT1
2059_s_at	38095_i_at	38949_at	1.0000	LCK	HLA-DPB1	PRKCQ
2059_s_at	38833_at	38949_at	1.0000	LCK	HLA-DPA1	PRKCQ
38319_at	33039_at	38949_at	1.0000	CD3D	TRAT1	PRKCQ
38147_at	38095_i_at	38949_at	1.0000	SH2D1A	HLA-DPB1	PRKCQ
38319_at	33238_at	38833_at	1.0000	CD3D	LCK	HLA-DPA1
38319_at	2059_s_at	38833_at	1.0000	CD3D	LCK	HLA-DPA1
38319_at	33238_at	33039_at	1.0000	CD3D	LCK	TRAT1
38319_at	33238_at	38095_i_at	1.0000	CD3D	LCK	HLA-DPB1
38319_at	33238_at	37039_at	1.0000	CD3D	LCK	HLA-DRA
38319_at	35016_at	38833_at	1.0000	CD3D	CD74	HLA-DPA1
38319_at	33238_at	2059_s_at	1.0000	CD3D	LCK	LCK
38319_at	35016_at	33039_at	1.0000	CD3D	CD74	TRAT1
38319_at	33238_at	35016_at	1.0000	CD3D	LCK	CD74
38319_at	35016_at	38949_at	1.0000	CD3D	CD74	PRKCQ
38319_at	2059_s_at	37039_at	1.0000	CD3D	LCK	HLA-DRA
38319_at	38147_at	38949_at	1.0000	CD3D	SH2D1A	PRKCQ
38319_at	38147_at	33039_at	1.0000	CD3D	SH2D1A	TRAT1
38319_at	33238_at	38949_at	1.0000	CD3D	LCK	PRKCQ
38319_at	2059_s_at	38095_i_at	1.0000	CD3D	LCK	HLA-DPB1
38319_at	38147_at	38833_at	1.0000	CD3D	SH2D1A	HLA-DPA1
38319_at	2059_s_at	33039_at	1.0000	CD3D	LCK	TRAT1
38319_at	38147_at	38095_i_at	1.0000	CD3D	SH2D1A	HLA-DPB1
38319_at	37039_at	33039_at	1.0000	CD3D	HLA-DRA	TRAT1
38319_at	38147_at	37039_at	1.0000	CD3D	SH2D1A	HLA-DRA
38319_at	38147_at	2059_s_at	1.0000	CD3D	SH2D1A	LCK
38319_at	2059_s_at	38949_at	1.0000	CD3D	LCK	PRKCQ
38319_at	35016_at	2059_s_at	1.0000	CD3D	CD74	LCK
2059_s_at	37039_at	38095_i_at	0.9922	LCK	HLA-DRA	HLA-DPB1
35016_at	33039_at	38949_at	0.9922	CD74	TRAT1	PRKCQ
2059_s_at	37039_at	38949_at	0.9922	LCK	HLA-DRA	PRKCQ
35016_at	2059_s_at	37039_at	0.9922	CD74	LCK	HLA-DRA
35016_at	37039_at	38949_at	0.9922	CD74	HLA-DRA	PRKCQ
35016_at	38833_at	38949_at	0.9922	CD74	HLA-DPA1	PRKCQ
2059_s_at	33039_at	38949_at	0.9922	LCK	TRAT1	PRKCQ
37039_at	38833_at	38949_at	0.9922	HLA-DRA	HLA-DPA1	PRKCQ
37039_at	33039_at	38949_at	0.9922	HLA-DRA	TRAT1	PRKCQ
38319_at	38095_i_at	38949_at	0.9922	CD3D	HLA-DPB1	PRKCQ
33238_at	37039_at	38833_at	0.9922	LCK	HLA-DRA	HLA-DPA1
38095_i_at	33039_at	38949_at	0.9922	HLA-DPB1	TRAT1	PRKCQ
33238_at	2059_s_at	38949_at	0.9922	LCK	LCK	PRKCQ
38319_at	38833_at	33039_at	0.9922	CD3D	HLA-DPA1	TRAT1
38833_at	33039_at	38949_at	0.9922	HLA-DPA1	TRAT1	PRKCQ
38147_at	33039_at	38949_at	0.9922	SH2D1A	TRAT1	PRKCQ
38319_at	37039_at	38833_at	0.9922	CD3D	HLA-DRA	HLA-DPA1
38147_at	2059_s_at	38949_at	0.9922	SH2D1A	LCK	PRKCQ
38147_at	38095_i_at	38833_at	0.9922	SH2D1A	HLA-DPB1	HLA-DPA1
38147_at	33238_at	38949_at	0.9922	SH2D1A	LCK	PRKCQ
38147_at	2059_s_at	33039_at	0.9922	SH2D1A	LCK	TRAT1
38319_at	37039_at	38949_at	0.9922	CD3D	HLA-DRA	PRKCQ
38319_at	38095_i_at	38833_at	0.9922	CD3D	HLA-DPB1	HLA-DPA1
38147_at	2059_s_at	38833_at	0.9922	SH2D1A	LCK	HLA-DPA1
33238_at	35016_at	2059_s_at	0.9922	LCK	CD74	LCK
38319_at	35016_at	38095_i_at	0.9922	CD3D	CD74	HLA-DPB1
33238_at	35016_at	38095_i_at	0.9922	LCK	CD74	HLA-DPB1
38319_at	35016_at	37039_at	0.9922	CD3D	CD74	HLA-DRA
38147_at	33238_at	38833_at	0.9922	SH2D1A	LCK	HLA-DPA1
38147_at	37039_at	38095_i_at	0.9844	SH2D1A	HLA-DRA	HLA-DPB1
38147_at	35016_at	38833_at	0.9844	SH2D1A	CD74	HLA-DPA1
38147_at	35016_at	38095_i_at	0.9844	SH2D1A	CD74	HLA-DPB1
35016_at	2059_s_at	38095_i_at	0.9844	CD74	LCK	HLA-DPB1
38147_at	37039_at	38833_at	0.9844	SH2D1A	HLA-DRA	HLA-DPA1
35016_at	2059_s_at	38833_at	0.9844	CD74	LCK	HLA-DPA1
38319_at	37039_at	38095_i_at	0.9844	CD3D	HLA-DRA	HLA-DPB1
37039_at	38095_i_at	38949_at	0.9844	HLA-DRA	HLA-DPB1	PRKCQ
38147_at	38833_at	33039_at	0.9844	SH2D1A	HLA-DPA1	TRAT1
38095_i_at	38833_at	38949_at	0.9844	HLA-DPB1	HLA-DPA1	PRKCQ
33238_at	35016_at	38833_at	0.9844	LCK	CD74	HLA-DPA1
38319_at	38095_i_at	33039_at	0.9844	CD3D	HLA-DPB1	TRAT1
2059_s_at	37039_at	38833_at	0.9844	LCK	HLA-DRA	HLA-DPA1
2059_s_at	38833_at	33039_at	0.9766	LCK	HLA-DPA1	TRAT1
2059_s_at	38095_i_at	33039_at	0.9766	LCK	HLA-DPB1	TRAT1
2059_s_at	38095_i_at	38833_at	0.9766	LCK	HLA-DPB1	HLA-DPA1
33238_at	2059_s_at	38095_i_at	0.9766	LCK	LCK	HLA-DPB1
35016_at	38095_i_at	33039_at	0.9766	CD74	HLA-DPB1	TRAT1
38147_at	38095_i_at	33039_at	0.9766	SH2D1A	HLA-DPB1	TRAT1
33238_at	2059_s_at	33039_at	0.9766	LCK	LCK	TRAT1
35016_at	37039_at	33039_at	0.9766	CD74	HLA-DRA	TRAT1
33238_at	38095_i_at	33039_at	0.9766	LCK	HLA-DPB1	TRAT1
33238_at	38833_at	33039_at	0.9766	LCK	HLA-DPA1	TRAT1
35016_at	38833_at	33039_at	0.9766	CD74	HLA-DPA1	TRAT1
33238_at	38095_i_at	38833_at	0.9688	LCK	HLA-DPB1	HLA-DPA1
37039_at	38833_at	33039_at	0.9688	HLA-DRA	HLA-DPA1	TRAT1
38147_at	35016_at	37039_at	0.9688	SH2D1A	CD74	HLA-DRA
33238_at	2059_s_at	38833_at	0.9688	LCK	LCK	HLA-DPA1
37039_at	38095_i_at	33039_at	0.9688	HLA-DRA	HLA-DPB1	TRAT1
38095_i_at	38833_at	33039_at	0.9609	HLA-DPB1	HLA-DPA1	TRAT1
35016_at	38095_i_at	38833_at	0.9609	CD74	HLA-DPB1	HLA-DPA1
37039_at	38095_i_at	38833_at	0.9531	HLA-DRA	HLA-DPB1	HLA-DPA1
35016_at	37039_at	38095_i_at	0.9531	CD74	HLA-DRA	HLA-DPB1
35016_at	37039_at	38833_at	0.9531	CD74	HLA-DRA	HLA-DPA1

Table 2. Annotations of all the features from the dataset ALL1. This is a binary classification dataset between B-cell and T-cell ALL samples. The gene symbols were collected for all the microarray features in the last three columns.

Subscription Required. Please recommend JoVE to your librarian.

Discussion

This study presents an easy-to-follow multi-solution biomarker detection and characterization protocol for a user-specified binary classification dataset. The software puts an emphasis on user-friendliness and flexible import/export interfaces for various file formats, allowing a biomedical researcher to investigate their dataset easily using the GUI of the software. This study also highlights the necessity of generating more than one solution with similarly effective modeling performances, previously ignored by many existing biomarker detection algorithms. In the future, newly developed biomarker detection algorithms may include this option by recording all the intermediate biomarker subsets with sufficient modeling performances.

In this protocol, steps 1 and 5 are of most importance, as the software is a fully automatic system that relies on correctly formatted input files. It was found that during our testing step, the mis-match of sample names from data matrix and class labels files may cause errors in the software, where the software will pop out a warning dialog about this error. Therefore, if the user finds no samples were loaded from the data matrix or class label files, the troubleshooting trick is to double-check whether the sample names in the two input files are inconsistent. If no dots were visualized in the 3D scatter plots, this may be due to the parameter pCutoff being higher than the best solution. In this instance, the troubleshooting trick is to lower the cutoff of the classification performance measurement (parameter pCutoff). However, the maximum performance measurement achieved by the biomarker subsets may be still blocked by the cutoff for a difficult dataset. A warning dialog will give this best performance measurement, and the user may choose a smaller cutoff to continue further analysis.

The main limitations of the software are its slow calculation speed and its ability to only focus on, at most, three features. Feature selection is an NP-hard problem, defined as a computational problem whose globally optimal solution cannot be resolved within polynomial time⁵⁹. The comprehensive biomarker subset screening step consumes a high volume of computational power. The running time complexity of kSolutionVis is O(n³) where n is the parameter pTopX. Additionally, this multiple-biomarker detection algorithm focuses on visualizing the screen of features, therefore confining the number of the features to three or fewer. This limitation may impede some users who may work on difficult problems and wish to find feature subsets consisting of more than three features. However, the software visualizes feature subsets in the 3D space and it's difficult to directly visualize feature subsets in more than three dimensions. In addition, based on the representative results presented above, the multiple feature triplets selected by kSolutionVis is a highly effective method in classification and shows significant results with important biomedical meaning.

The software represents useful complementary software to the existing feature selection algorithms. In the field of biomedicine, feature selection is termed biomarker, with the goal to find a subset of features achieving improved modeling performance⁶⁰^,⁶¹^,⁶². The software is a comprehensive screening tool of all the triplet biomarker subsets based on the strategy proposed in a recent study⁵. The two representative datasets screened by the software's protocol, and their results demonstrate the existences of quite a few solutions with similarly effective or even identical modeling performances. However, heuristic rules⁶³^,⁶⁴^,⁶⁵^,⁶⁶ may be employed to find sub-optimal solutions, but such algorithms have a strong tendency to produce only one solution, ignoring many other solutions with similarly effective or even identical modeling performances. Therefore, the computer power and the lengthy running time of the software are worthwhile to ensure a more comprehensive detection of potential biomarkers in the future.

The representative results were calculated on two transcriptome datasets, however, the software handles input data in various standard file formats and may also be used to analyze other 'omic' datasets, including proteomics and metabolomics. Additionally, parallelization may speed up the calculation of the biomarker detection module in the software. There is some multi-core hardware including GPGPU (General-Purpose Graphical Processing Unite) and Intel Xeon Phi processors available for this purpose. However, these technologies require different coding strategies and will be considered in the next version of the software.

Subscription Required. Please recommend JoVE to your librarian.

Disclosures

We have no conflicts of interest related to this report.

Acknowledgments

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040400) and the startup grant from Jilin University. Anonymous reviewers and biomedical testing users were appreciated for their constructive comments on improving the usability and functionality of kSolutionVis.

Materials

Name	Company	Catalog Number	Comments
Hardware
laptop	Lenovo	X1 carbon	Any computer works. Recommended minimum configuration: 1GB extra hard disk space, 1 GB memory, 2.0MHz CPU
Name	Company	Catalog Number	Comments
Software
Python 3.0	WingWare	Wing Personal	Any python programming and running environments support Python version 3.0 or above