Here, we introduce a protocol for converting transcriptomic data into a mqTrans view, enabling the identification of dark biomarkers. While not differentially expressed in conventional transcriptomic analyses, these biomarkers exhibit differential expression in the mqTrans view. The approach serves as a complementary technique to traditional methods, unveiling previously overlooked biomarkers.
Transcriptome represents the expression levels of many genes in a sample and has been widely used in biological research and clinical practice. Researchers usually focused on transcriptomic biomarkers with differential representations between a phenotype group and a control group of samples. This study presented a multitask graph-attention network (GAT) learning framework to learn the complex inter-genic interactions of the reference samples. A demonstrative reference model was pre-trained on the healthy samples (HealthModel), which could be directly used to generate the model-based quantitative transcriptional regulation (mqTrans) view of the independent test transcriptomes. The generated mqTrans view of transcriptomes was demonstrated by prediction tasks and dark biomarker detection. The coined term "dark biomarker" stemmed from its definition that a dark biomarker showed differential representation in the mqTrans view but no differential expression in its original expression level. A dark biomarker was always overlooked in traditional biomarker detection studies due to the absence of differential expression. The source code and the manual of the pipeline HealthModelPipe can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.
Transcriptome consists of the expressions of all the genes in a sample and may be profiled by high-throughput technologies like microarray and RNA-seq1. The expression levels of one gene in a dataset are called a transcriptomic feature, and the differential representation of a transcriptomic feature between the phenotype and control groups defines this gene as a biomarker of this phenotype2,3. Transcriptomic biomarkers have been extensively utilized in the investigations of disease diagnosis4, biological mechanism5, and survival analysis6,7, etc.
Gene activity patterns in the healthy tissues carry crucial information about the lives8,9. These patterns offer invaluable insights and act as ideal references for understanding the complex developmental trajectories of benign disorders10,11 and lethal diseases12. Genes interact with each other, and transcriptomes represent the final expression levels after their complicated interactions. Such patterns are formulated as transcriptional regulation network13 and metabolism network14, etc. The expressions of messenger RNAs (mRNAs) can be transcriptionally regulated by transcription factors (TFs) and long intergenic non-coding RNAs (lincRNAs)15,16,17. Conventional differential expression analysis ignored such complex gene interactions with the assumption of inter-feature independence18,19.
Recent advancements in graph neural networks (GNNs) demonstrate extraordinary potential in extracting important information from OMIC-based data for cancer studies20, e.g., identifying co-expression modules21. The innate capacity of GNNs renders them ideal for modeling the intricate relationships and dependencies among genes22,23.
Biomedical studies often focus on accurately predicting a phenotype against the control group. Such tasks are commonly formulated as binary classifications24,25,26. Here, the two class labels are typically encoded as 1 and 0, true and false, or even positive and negative27.
This study aimed to provide an easy-to-use protocol for generating the transcriptional regulation (mqTrans) view of a transcriptome dataset based on the pre-trained graph-attention network (GAT) reference model. The multitask GAT framework from a previously published work26 was used to transform transcriptomic features to the mqTrans features. A large dataset of healthy transcriptomes from the University of California, Santa Cruz (UCSC) Xena platform28 was used to pre-train the reference model (HealthModel), which quantitatively measured the transcription regulations from the regulatory factors (TFs and lincRNAs) to the target mRNAs. The generated mqTrans view could be used to build prediction models and detect dark biomarkers. This protocol utilizes the colon adenocarcinoma (COAD) patient dataset from The Cancer Genome Atlas (TCGA) database29 as an illustrative example. In this context, patients in stages I or II are categorized as negative samples, while those in stages III or IV are considered positive samples. The distributions of dark and traditional biomarkers across the 26 TCGA cancer types are also compared.
Description of the HealthModel pipeline
The methodology employed in this protocol is based on the previously published framework26, as outlined in Figure 1. To commence, users are required to prepare the input dataset, feed it into the proposed HealthModel pipeline, and obtain mqTrans features. Detailed data preparation instructions are provided in section 2 of the protocol section. Subsequently, users have the option to combine mqTrans features with the original transcriptomic features or proceed with the generated mqTrans features only. The produced dataset is then subjected to a feature selection process, with users having the flexibility to choose their preferred value for k in k-fold cross-validation for classification. The primary evaluation metric utilized in this protocol is accuracy.
HealthModel26 categorizes the transcriptomic features into three distinct groups: TF (Transcription Factor), lincRNA (long intergenic non-coding RNA), and mRNA (messenger RNA). The TF features are defined based on the annotations available in the Human Protein Atlas30,31. This work utilizes the annotations of lincRNAs from the GTEx dataset32. Genes belonging to the third-level pathways in the KEGG database33 are considered as mRNA features. It is worth noting that if an mRNA feature exhibits regulatory roles for a target gene as documented in the TRRUST database34, it is reclassified into the TF class.
This protocol also manually generates the two example files for the gene IDs of regulatory factors (regulatory_geneIDs.csv) and target mRNA (target_geneIDs.csv). The pairwise distance matrix among the regulatory features (TFs and lincRNAs) is calculated by the Pearson correlation coefficients and clustered by the popular tool weighted gene co-expression network analysis (WGCNA)36 (adjacent_matrix.csv). Users can directly utilize the HealthModel pipeline together with these example configuration files to generate the mqTrans view of a transcriptomic dataset.
Technical details of HealthModel
HealthModel represents the intricate relationships among TFs and lincRNAs as a graph, with the input features serving as the vertices denoted by V and an inter-vertex edge matrix designated as E. Each sample is characterized by K regulatory features, symbolized as VK×1. Specifically, the dataset encompassed 425 TFs and 375 lincRNAs, resulting in a sample dimensionality of K = 425 + 375 = 800. To establish the edge matrix E, this work employed the popular tool WGCNA35. The pairwise weight linking two vertices represented as and , is determined by the Pearson correlation coefficient. The gene regulatory network exhibits a scale-free topology36, characterized by the presence of hub genes with pivotal functional roles. We compute the correlation between two features or vertices, and , using the topological overlap measure (TOM) as follows:
(1)
(2)
The soft threshold β is calculated using the 'pickSoft Threshold' function from the WGCNA package. The power exponential function aij is applied, where represents a gene excluding i and j, and represents the vertex connectivity. WGCNA clusters the expression profiles of the transcriptomic features into multiple modules using a commonly employed dissimilarity measure (37.
The HealthModel framework was originally designed as a multitask learning architecture26. This protocol only utilizes the model pre-training task for the construction of the transcriptomic mqTrans view. The user may choose to further refine the pre-trained HealthModel under the multitask graph attention network with additional task-specific transcriptomic samples.
Technical details of feature selection and classification
The feature selection pool implements eleven feature selection (FS) algorithms. Among them, three are filter-based FS algorithms: selecting K best features using the Maximal Information Coefficient (SK_mic), selecting K features based on the FPR of MIC (SK_fpr), and selecting K features with the highest false discovery rate of MIC (SK_fdr). Additionally, three tree-based FS algorithms assess individual features using a decision tree with the Gini index (DT_gini), adaptive boosted decision trees (AdaBoost), and random forest (RF_fs). The pool also incorporates two wrapper methods: Recursive feature elimination with the linear support vector classifier (RFE_SVC) and recursive feature elimination with the logistic regression classifier (RFE_LR). Finally, two embedding algorithms are included: linear SVC classifier with the top-ranked L1 feature importance values (lSVC_L1) and logistic regression classifier with the top-ranked L1 feature importance values (LR_L1).
The classifier pool employs seven different classifiers to build classification models. These classifiers comprise linear support vector machine (SVC), Gaussian Naïve Bayes (GNB), logistic regression classifier (LR), k-nearest neighbor, with k set to 5 by default (KNN), XGBoost, random forest (RF), and decision tree (DT).
The random split of the dataset into the train: test subsets can be set in the command line. The demonstrated example uses the ratio of train: test = 8: 2.
NOTE: The following protocol describes the details of the informatics analytic procedure and Python commands of the major modules. Figure 2 illustrates the three major steps with example commands utilized in this protocol and refer to previously published works26,38 for more technical details. Do the following protocol under a normal user account in a computer system and avoid using the administrator or root account. This is a computational protocol and has no biomedical hazardous factors.
1. Prepare Python environment
2. Using the pre-trained HealthModel to generate the mqTrans features
3. Select mqTrans Features
Evaluation of the mqTrans view of the transcriptomic dataset
The test code uses eleven feature selection (FS) algorithms and seven classifiers to evaluate how the generated mqTrans view of the transcriptomic dataset contributes to the classification task (Figure 6). The test dataset consists of 317 colon adenocarcinoma (COAD) from The Cancer Genome Atlas (TCGA) database29. The COAD patients at stages I or II are regarded as the negative samples, while those at stages III or IV are the positive ones.
Eleven FS algorithms are implemented in the test code. There are three filter-based FS algorithms, including, select K best features by MIC (SK_mic), select K features by the FPR of MIC (SK_fpr), and select K features by the highest FDR of MIC (SK_fpr). Three tree-based FS algorithms evaluate the individual features by a decision tree with gini index (DT_gini), the adaptive boosted decision trees (AdaBoost) and the random forest (RF_fs), respectively. The FS pool of the test code also evaluates two wrappers recursive feature elimination (RFE) with the linear support vector classifier (SVC)(RFE_SVC) and RFE with the logistic regression classifier (RFE_LR), and two embedding algorithms linear SVC classifier with the top-ranked L1 feature importance values (lSVC_L1) and logistic regression classifier with the top-ranked L1 feature importance values (LR_L1).
The test code builds the classification models using seven classifiers, including linear support vector machine (SVC), Gaussian Naïve Bayes (GNB), logistic regression classifier (LR), k-nearest neighbor, k-5 by default (KNN), XGBoost, random forest (RF) and decision tree (DT).
Figure 6 shows the maximum test accuracy of the mqTrans features, the original mRNA features, and the combined subset of the mRNA and mqTrans features recommended by each FS algorithm.
The combined feature subsets (mRNA+mqTrans) have achieved the highest accuracy 0.7656 on the "SK_fpr" FS method, better than the individual feature types mqTrans (0.7188) and original mRNA (0.7188). Similar patterns can be observed for the other FS algorithms. The user can check the selected features in the output file Output-SelectedFeatures.csv.
Detecting the dark biomarkers
Previous studies showed the existence of the undifferentially expressed genes with significantly differentially represented mqTrans values between the phenotypic and control groups26,38,39. These genes are called dark biomarkers because traditional biomarker detection studies ignore them by their undifferential expressions. The statistical analysis function t.test in Microsoft Excel can be used to define a feature that is differentially expressed if its statistical p-value is smaller than 0.05.
Among the 3062 features with the generated mqTrans values 221 dark biomarkers were detected (Figure 7). The third-ranked gene ENSG00000163697 (APBB2, Amyloid Beta Precursor Protein Binding Family B Member 2) shows significantly differentially represented mqTrans values (mqTrans.P = 2.03 x 10-4) while its original expression level shows no differential expression (mRNA.P = 3.80 x 10-1). The keyword APBB2 hit 27 publications in the PubMed database40, but no connections with the colon or intestine were detected.
Another gene ENSG00000048052 (HDAC9, Histone Deacetylase 9) has the differentially represented mqTrans values (mqTrans.P = 6.09 x 10-3) while maintaining practically the same normal distributions between the phenotypic and control groups (mRNA.P = 9.62 x 10-1). The keyword HDAC9 hit 417 publications in the PubMed database. Three studies also mentioned the keywords "colon" or "intestine" in the abstracts41,42,43. But, none of them investigated the roles of HDAC9 in colon cancer.
The data suggested the necessity of further evaluations of these dark biomarkers from their post-transcription activities, e.g., the translated protein levels44,45.
Pan-cancer distributions of metabolism-related dark and traditional biomarkers
The metabolism-related traditional biomarkers were screened and compared against dark biomarkers across 26 cancer types in the TCGA dataset38. Both categories of biomarkers underwent statistical evaluation to discern significance levels across early (Stages I and II) and late (Stages III and IV) cancer stages. This evaluation employed Student's t-tests for p-values, subsequently corrected for multiple testing using false discovery rates (FDRs). Detailed data for each of the 26 cancer types are provided in Figure 8.
Genes yielding FDR-corrected p-values below 0.05 were classified as traditional biomarkers. In contrast, dark biomarkers were defined as those with FDR-corrected p-values below 0.05 in the mqTrans view while concurrently exhibiting no statistically significant differences in expression levels.
Figure 9 reveals a general scarcity of dark biomarkers in comparison to traditional biomarkers across most cancer types. Noteworthy exceptions include BRCA, MESO, and TGCT, which manifest a greater prevalence of dark biomarkers. It is revealed that various factors, including transcription factors, methylation patterns, gene mutations, and environmental conditions, could modulate the transcriptional dysregulation of these dark biomarkers. Further complexity may arise due to overlapping non-coding RNA transcripts that could confound the expression levels of dark biomarkers. Transcription dysregulations of some dark biomarkers were supported by their differential protein levels44,45. The dark biomarkers are often overlooked in traditional studies and present intriguing avenues for future mechanistic investigations.
Figure 1: An overview of the HealthModel and feature selection modules in this protocol. Replace the specific algorithms in the feature selection pool and the classifier pool if the user is familiar with the Python programming. Please click here to view a larger version of this figure.
Figure 2: Complete Code Flow for this Protocol. (A) Prepare Python Environment. To begin, create a virtual environment and install essential packages. For comprehensive instructions, refer to Section 1. (B) Generate mqTrans Features. Obtain mqTrans features by executing the provided code step by step. Detailed explanations can be found in Section 2. (C) Select mqTrans Features. This section focuses on assessing the mqTrans features. Refer to Section 3 for in-depth details. Please click here to view a larger version of this figure.
Figure 3: Prepare environment for Python. (A) The command to create healthmodel. (B) Enter y during the creating VE process. (C) The most common command for activating the VE. (D) The command for installing torch 1.13.1. (E) Install additional libraries for torch-geometric package. (F) Install torch-geometric package. Please click here to view a larger version of this figure.
Figure 4: Run the HealthModel to get mqTrans feature. (A) Download the code. (B) The example of data file. Each column has all the values of a regulatory factor, and the first item is the gene ID. Each row gives the values of a given sample, with the first item being the sample name. (C) The example of a label file. The first column gives the sample names, and the class label of each sample is given in the column titled label. The 0 value in the label column means this sample is alive, 1 means dead. (D) the outputs of mqTrans. Please click here to view a larger version of this figure.
Figure 5: Run the feature selection algorithm for the mqTrans feature. The results of the feature selection algorithm are shown to the user. Please click here to view a larger version of this figure.
Figure 6: The maximum test set accuracy of each feature selection algorithm. The horizontal axis lists the feature selection algorithms, and the vertical axis gives the values of accuracies. The histograms show the experimental data of the three settings, i.e., mqTrans, mRNA, mRNA+mqTrans. Please click here to view a larger version of this figure.
Figure 7: Top 50 dark biomarkers with the smallest p-values in the mqTrans view. The column "Dark Biomarker" gives the dark biomarker names. The columns "mRNA.P" and "mqTrans.P" are the statistical t-test p-values between the phenotypic and control groups. The background colors of the p-values are colored between the p-values 1.00 (blue) and 0.00 (red), and the white color represents p-value = 0.05. Please click here to view a larger version of this figure.
Figure 8: The details of the 26 cancers in The Cancer Genome Atlas (TCGA) at different stages. The columns "Cohort" and "Disease Tissue" describe the patient group and the tissues with disease for each dataset. The last four columns give the numbers of samples in the developmental stages I, II, III, and IV, respectively. Please click here to view a larger version of this figure.
Figure 9: The numbers of dark biomarkers and traditional biomarkers in 26 cancers. The horizontal axis lists the 26 cancer types. The vertical axis gives the numbers of dark biomarkers and traditional biomarkers for these cancer types. Please click here to view a larger version of this figure.
Supplementary Coding File 1: HealthModel-mqTrans-v1-00.tar Please click here to download this File.
Section 2 (Use the pre-trained HealthModel to generate the mqTrans features) of the protocol is the most critical step within this protocol. After preparing the computational working environment in section 1, section 2 generates the mqTrans view of a transcriptomic dataset based on the pre-trained large reference model. Section 3 is a demonstrative example of selecting the generated mqTrans features for biomarker detections and prediction tasks. The users can conduct other transcriptomic analyses on this mqTrans dataset using their own tools or codes.
The original HealthModel framework can further refine the pre-trained HealthModel using the multitask architecture, as described in26. This protocol focuses on the utilization of the pre-trained reference model to generate the mqTrans view of a transcriptomic dataset.
The default pre-trained reference model was established on the healthy samples and may not be a good choice for some specific tasks, e.g., the investigation between the primary and metastatic cancers. The computational speed is also slow for a large transcriptomic dataset.
The significance of this protocol is to provide a complementary mqTrans view of the most abundantly-available OMIC data type, i.e., transcriptome. Dark biomarkers can be revealed from the undifferentially expressed genes ignored by the conventional transcriptomic analysis. A recent study detected seven dark biomarkers of metastatic colon cancer (mCC) based on three independent cohorts of 805 samples in total44. Dark biomarkers received limited wet-lab investigations due to their undifferential expressions. However, one of the detected mCC dark biomarker YTHDC2 encodes the protein YTH domain containing 2, whose protein levels were observed to be positively correlated with the metastasis status of human gastric cancer cells46 and colon cancers47. Novel biological insights of dark biomarkers remain to be resolved through in vitro and in vivo technologies.
This protocol is designed to be fully modular. Reference models pre-trained on other large datasets like primary cancers will facilitate the investigation of tumor metastasis. This protocol will also be explored for applications in other life domains, including plants, fungi, and microbes.
The computational efficiency of this protocol is planned to be enhanced through parallelization and algorithmic optimization.
This protocol describes the procedure to transform a transcriptomic dataset to a new mqTrans view, and the transformed mqTrans values of a gene quantitatively measure the transcription regulation changes compared against the reference samples. A default model was pre-trained on the healthy transcriptomes and released as the reference HealthModel.
The source code of two downstream tasks is provided to facilitate the easy utilization of this protocol by biomedical researchers. The experimental data shows that the transformed mqTrans features could improve the prediction tasks using only the original expression levels. The mqTrans view can also unveil the latent phenotypic connections of some dark biomarkers without differential expressions in the original transcriptomic data.
The authors have nothing to disclose.
This work was supported by the Senior and Junior Technological Innovation Team (20210509055RQ), Guizhou Provincial Science and Technology Projects (ZK2023-297), the Science and Technology Foundation of Health Commission of Guizhou Province (gzwkj2023-565), Science and Technology Project of Education Department of Jilin Province (JJKH20220245KJ and JJKH20220226SK), the National Natural Science Foundation of China (U19A2061), the Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), and the Fundamental Research Funds for the Central Universities, JLU. We extend our sincerest appreciation to the review editor and the three anonymous reviewers for their constructive critiques, which have been instrumental in substantially enhancing the rigor and clarity of this protocol.
Anaconda | Anaconda | version 2020.11 | Python programming platform |
Computer | N/A | N/A | Any general-purpose computers satisfy the requirement |
GPU card | N/A | N/A | Any general-purpose GPU cards with the CUDA computing library |
pytorch | Pytorch | version 1.13.1 | Software |
torch-geometric | Pytorch | version 2.2.0 | Software |