Author Spotlight: Impact of Intergenic Interactions on Disease-Identifying Dark Biomarkers

Kewei Li; Yusi Fan; Yaqing Liu; Hongmei Liu; Gongyou Zhang; Meiyu Duan; Lan Huang; Fengfeng Zhou

doi:10.3791/66030

Biology

Generating the Transcriptional Regulation View of Transcriptomic Features for Prediction Task and Dark Biomarker Detection on Small Datasets

Published: March 1, 2024 doi: 10.3791/66030

Kewei Li¹, Yusi Fan¹, Yaqing Liu¹, Hongmei Liu², Gongyou Zhang², Meiyu Duan¹, Lan Huang¹, Fengfeng Zhou¹

¹College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, ²School of Biology and Engineering, Guizhou Medical University

Summary

Here, we introduce a protocol for converting transcriptomic data into a mqTrans view, enabling the identification of dark biomarkers. While not differentially expressed in conventional transcriptomic analyses, these biomarkers exhibit differential expression in the mqTrans view. The approach serves as a complementary technique to traditional methods, unveiling previously overlooked biomarkers.

Abstract

Transcriptome represents the expression levels of many genes in a sample and has been widely used in biological research and clinical practice. Researchers usually focused on transcriptomic biomarkers with differential representations between a phenotype group and a control group of samples. This study presented a multitask graph-attention network (GAT) learning framework to learn the complex inter-genic interactions of the reference samples. A demonstrative reference model was pre-trained on the healthy samples (HealthModel), which could be directly used to generate the model-based quantitative transcriptional regulation (mqTrans) view of the independent test transcriptomes. The generated mqTrans view of transcriptomes was demonstrated by prediction tasks and dark biomarker detection. The coined term "dark biomarker" stemmed from its definition that a dark biomarker showed differential representation in the mqTrans view but no differential expression in its original expression level. A dark biomarker was always overlooked in traditional biomarker detection studies due to the absence of differential expression. The source code and the manual of the pipeline HealthModelPipe can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.

Introduction

Transcriptome consists of the expressions of all the genes in a sample and may be profiled by high-throughput technologies like microarray and RNA-seq¹. The expression levels of one gene in a dataset are called a transcriptomic feature, and the differential representation of a transcriptomic feature between the phenotype and control groups defines this gene as a biomarker of this phenotype²^,³. Transcriptomic biomarkers have been extensively utilized in the investigations of disease diagnosis⁴, biological mechanism⁵, and survival analysis⁶^,⁷, etc.

Gene activity patterns in the healthy tissues carry crucial information about the lives⁸^,⁹. These patterns offer invaluable insights and act as ideal references for understanding the complex developmental trajectories of benign disorders¹⁰^,¹¹ and lethal diseases¹². Genes interact with each other, and transcriptomes represent the final expression levels after their complicated interactions. Such patterns are formulated as transcriptional regulation network¹³ and metabolism network¹⁴, etc. The expressions of messenger RNAs (mRNAs) can be transcriptionally regulated by transcription factors (TFs) and long intergenic non-coding RNAs (lincRNAs)¹⁵^,¹⁶^,¹⁷. Conventional differential expression analysis ignored such complex gene interactions with the assumption of inter-feature independence¹⁸^,¹⁹.

Recent advancements in graph neural networks (GNNs) demonstrate extraordinary potential in extracting important information from OMIC-based data for cancer studies²⁰, e.g., identifying co-expression modules²¹. The innate capacity of GNNs renders them ideal for modeling the intricate relationships and dependencies among genes²²^,²³.

Biomedical studies often focus on accurately predicting a phenotype against the control group. Such tasks are commonly formulated as binary classifications²⁴^,²⁵^,²⁶. Here, the two class labels are typically encoded as 1 and 0, true and false, or even positive and negative²⁷.

This study aimed to provide an easy-to-use protocol for generating the transcriptional regulation (mqTrans) view of a transcriptome dataset based on the pre-trained graph-attention network (GAT) reference model. The multitask GAT framework from a previously published work²⁶ was used to transform transcriptomic features to the mqTrans features. A large dataset of healthy transcriptomes from the University of California, Santa Cruz (UCSC) Xena platform²⁸ was used to pre-train the reference model (HealthModel), which quantitatively measured the transcription regulations from the regulatory factors (TFs and lincRNAs) to the target mRNAs. The generated mqTrans view could be used to build prediction models and detect dark biomarkers. This protocol utilizes the colon adenocarcinoma (COAD) patient dataset from The Cancer Genome Atlas (TCGA) database²⁹ as an illustrative example. In this context, patients in stages I or II are categorized as negative samples, while those in stages III or IV are considered positive samples. The distributions of dark and traditional biomarkers across the 26 TCGA cancer types are also compared.

Description of the HealthModel pipeline
The methodology employed in this protocol is based on the previously published framework²⁶, as outlined in Figure 1. To commence, users are required to prepare the input dataset, feed it into the proposed HealthModel pipeline, and obtain mqTrans features. Detailed data preparation instructions are provided in section 2 of the protocol section. Subsequently, users have the option to combine mqTrans features with the original transcriptomic features or proceed with the generated mqTrans features only. The produced dataset is then subjected to a feature selection process, with users having the flexibility to choose their preferred value for k in k-fold cross-validation for classification. The primary evaluation metric utilized in this protocol is accuracy.

HealthModel²⁶ categorizes the transcriptomic features into three distinct groups: TF (Transcription Factor), lincRNA (long intergenic non-coding RNA), and mRNA (messenger RNA). The TF features are defined based on the annotations available in the Human Protein Atlas³⁰^,³¹. This work utilizes the annotations of lincRNAs from the GTEx dataset³². Genes belonging to the third-level pathways in the KEGG database³³ are considered as mRNA features. It is worth noting that if an mRNA feature exhibits regulatory roles for a target gene as documented in the TRRUST database³⁴, it is reclassified into the TF class.

This protocol also manually generates the two example files for the gene IDs of regulatory factors (regulatory_geneIDs.csv) and target mRNA (target_geneIDs.csv). The pairwise distance matrix among the regulatory features (TFs and lincRNAs) is calculated by the Pearson correlation coefficients and clustered by the popular tool weighted gene co-expression network analysis (WGCNA)³⁶ (adjacent_matrix.csv). Users can directly utilize the HealthModel pipeline together with these example configuration files to generate the mqTrans view of a transcriptomic dataset.

Technical details of HealthModel
HealthModel represents the intricate relationships among TFs and lincRNAs as a graph, with the input features serving as the vertices denoted by V and an inter-vertex edge matrix designated as E. Each sample is characterized by K regulatory features, symbolized as V^K×1. Specifically, the dataset encompassed 425 TFs and 375 lincRNAs, resulting in a sample dimensionality of K = 425 + 375 = 800. To establish the edge matrix E, this work employed the popular tool WGCNA³⁵. The pairwise weight linking two vertices represented as Equation 1 and Equation 2 , is determined by the Pearson correlation coefficient. The gene regulatory network exhibits a scale-free topology³⁶, characterized by the presence of hub genes with pivotal functional roles. We compute the correlation between two features or vertices, and , using the topological overlap measure (TOM) as follows:

Equation 3 (1)

Equation 4 (2)

The soft threshold β is calculated using the 'pickSoft Threshold' function from the WGCNA package. The power exponential function a_ij is applied, where Equation 5 represents a gene excluding i and j, and Equation 6 represents the vertex connectivity. WGCNA clusters the expression profiles of the transcriptomic features into multiple modules using a commonly employed dissimilarity measure ( Equation 7 ³⁷.

The HealthModel framework was originally designed as a multitask learning architecture²⁶. This protocol only utilizes the model pre-training task for the construction of the transcriptomic mqTrans view. The user may choose to further refine the pre-trained HealthModel under the multitask graph attention network with additional task-specific transcriptomic samples.

Technical details of feature selection and classification
The feature selection pool implements eleven feature selection (FS) algorithms. Among them, three are filter-based FS algorithms: selecting K best features using the Maximal Information Coefficient (SK_mic), selecting K features based on the FPR of MIC (SK_fpr), and selecting K features with the highest false discovery rate of MIC (SK_fdr). Additionally, three tree-based FS algorithms assess individual features using a decision tree with the Gini index (DT_gini), adaptive boosted decision trees (AdaBoost), and random forest (RF_fs). The pool also incorporates two wrapper methods: Recursive feature elimination with the linear support vector classifier (RFE_SVC) and recursive feature elimination with the logistic regression classifier (RFE_LR). Finally, two embedding algorithms are included: linear SVC classifier with the top-ranked L1 feature importance values (lSVC_L1) and logistic regression classifier with the top-ranked L1 feature importance values (LR_L1).

The classifier pool employs seven different classifiers to build classification models. These classifiers comprise linear support vector machine (SVC), Gaussian Naïve Bayes (GNB), logistic regression classifier (LR), k-nearest neighbor, with k set to 5 by default (KNN), XGBoost, random forest (RF), and decision tree (DT).

The random split of the dataset into the train: test subsets can be set in the command line. The demonstrated example uses the ratio of train: test = 8: 2.

Subscription Required. Please recommend JoVE to your librarian.

Protocol

NOTE: The following protocol describes the details of the informatics analytic procedure and Python commands of the major modules. Figure 2 illustrates the three major steps with example commands utilized in this protocol and refer to previously published works²⁶^,³⁸ for more technical details. Do the following protocol under a normal user account in a computer system and avoid using the administrator or root account. This is a computational protocol and has no biomedical hazardous factors.

1. Prepare Python environment

Create a virtual environment.
1. This study used the Python programming language and a Python virtual environment (VE) with Python 3.7. Follow these steps (Figure 3A):
  conda create -n healthmodel python=3.7
  conda create is the command to create a new VE. The parameter -n specifies the name of the new environment, in this case, healthmodel. And python=3.7 specifies the Python version to be installed. Choose any preferred name and Python version supporting the above command.
2. After running the command, the output is similar to Figure 3B. Enter y and wait for the process to complete.
Activate the virtual environment
1. In most cases, activate the created VE with the following command (Figure 3C):
  conda activate healthmodel
2. Follow the platform-specific instructions for the VE activation, if some platforms require the user to upload the platform-specific configuration files for activation.
Install PyTorch 1.13.1
1. PyTorch is a popular Python package for artificial intelligence (AI) algorithms. Use PyTorch 1.13.1, based on the CUDA 11.7 GPU programming platform, as an example. Find other versions at https://pytorch.org/get-started/previous-versions/. Use the following command (Figure 3D):
  pip3 install torch torchvision torchaudio
  NOTE: Using PyTorch version 1.12 or newer is strongly recommended. Otherwise, installing the required package torch_geometric may be challenging, as noted on the official torch_geometric website: https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html.
Install additional packages for torch-geometric
1. Following the guidelines at https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html, install the following packages: torch_scatter, torch_sparse, torch_cluster, and torch_spline_conv using the command (Figure 3E):
  pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
Install torch-geometric package.
1. This study requires a specific version, 2.2.0, of the torch-geometric package. Run the command (Figure 3F):
  pip install torch_geometric==2.2.0
Install other packages.
1. Packages like pandas are usually available by default. If not, install them using the pip command. For instance, to install pandas and xgboost, run:
  pip install pandas
  pip install xgboost

2. Using the pre-trained HealthModel to generate the mqTrans features

Download the code and the pre-trained model.
1. Download the code and the pre-trained HealthModel from the website: http://www.healthinformaticslab.org/supp/resources.php, which is named HealthModel-mqTrans-v1-00.tar.gz (Figure 4A). The downloaded file can be decompressed to a user-specified path. The detailed formulation and the supporting data of the implemented protocol can be found in²⁶.
Introduce the parameters to run HealthModel.
1. Firstly, change the working directory to the HealthModel-mqTrans folder in the command line. Use the following syntax for running the code:
  python main.py <data folder> <model folder> <output folder>
  The details regarding each parameter and the data, model and output folders are as follows:
  data folder: This is the source data folder, and each data file is in the csv format. This data folder has two files (see detailed descriptions in steps 2.3 and 2.4). These files need to be replaced with personal data.
  data.csv: The transcriptomic matrix file. The first row lists the feature (or gene) IDs, and the first column gives the sample IDs. The list of genes includes the regulatory factors (TFs and lincRNAs), and the regulated mRNA genes.
  label.csv: The sample label file. The first column lists the sample IDs, and the column with the name "label" gives the sample label.
  model folder: The folder to save information about the model:
  HealthModel.pth: The pre-trained HealthModel.
  regulatory_geneIDs.csv: The regulatory gene IDs used in this study.
  target_geneIDs.csv: The target genes used in this study.
  adjacent_matrix.csv: The adjacent matrix of regulatory genes.
  output folder: The output files are written to this folder, created by the code.
  test_target.csv: The gene expression value of target genes after Z-Normalization and imputation.
  pred_target.csv: The predicted gene expression value of target genes.
  mq_target.csv: The predicted gene expression value of target genes.
Prepare the transcriptomic matrix file in the csv format.
1. Each row represents a sample, and each column represents a gene (Figure 4B). Name the transcriptomic data matrix file as data.csv in the data folder.
  NOTE: This file may be generated by manually saving a data matrix in the .csv format from software such as Microsoft Excel. The transcriptomic matrix may also be generated by computer programming.
Prepare the label file in the csv format.
1. Similar to the transcriptomic matrix file, name the label file as label.csv in the data folder (Figure 4C).
  NOTE: The first column gives the sample names, and the class label of each sample is given in the column titled label. The 0 value in the label column means this sample is negative, 1 means a positive sample.
Generate the mqTrans features.
1. Run the following command to generate the mqTrans features and get the outputs shown in Figure 4D. The mqTrans features are generated as the file ./output/mq_targets.csv, and the label file is resaved as the file ./output/label.csv. For the convenience of further analysis, the original expression values of the mRNA genes are also extracted as the file ./output/ test_target.csv.
  python ./Get_mqTrans/code/main.py ./data ./Get_mqTrans/model ./output

3. Select mqTrans Features

Syntax of the feature selection code
1. Firstly, change the working directory to the HealthModel-mqTrans folder. Use the following syntax:
  python ./FS_classification/testMain.py <in-data-file.csv> <in-label-file.csv> <output folder> <select_feature_number> <test_size> <combine> <combine file>
  The details of each parameter are as follows:
  in-data-file: The input data file
  in-label-file: The label of the input data file
  output folder: Two output files are saved in this folder, including Output-score.xlsx (the feature selection method and the accuracy of the corresponding classifier), and Output-SelectedFeatures.xlsx (the selected feature names for each feature selection algorithm).
  1. select_feature_number: select the number of features, ranging from 1 to the number of the features of the data file.
  2. test_size: Set the ratio of the test sample to split. For example, 0.2 means that the input dataset is randomly split into the train: test subsets by the ratio of 0.8:0.2.
  3. combine: If true, combine two data files together for feature selection, i.e., the original expression values and the mqTrans features. If false, just use one data file for feature selection, i.e., the original expression values or the mqTrans features.
  4. combine file: If combine is true, provide this file name to save the combined data matrix.
    NOTE: This pipeline aims to demonstrate how the generated mqTrans features perform on classification tasks, and it directly uses the file generated by section 2 for the following operations.
Run feature selection algorithm for mqTrans feature selection.
1. Turn combine =False if the user selects mqTrans features or original features.
2. Firstly, select 800 original features and split the dataset into train: test=0.8:0.2:
  python ./FS_classification/testMain.py ./output/test_target.csv ./output/label.csv ./result 800 0.2 False
3. Turn combine =True, if the user wants to combine the mqTrans features with the original expression values to select features. Here, the demonstrative example is to select 800 features and split the dataset into train: test=0.8:0.2:
  python ./FS_classification/testMain.py ./output/mq_targets.csv ./output/label.csv ./result_combine 800 0.2 True ./output/test_target.csv
  NOTE: Figure 5 shows the output information. The supplementary files required for this protocol are in HealthModel-mqTrans-v1-00.tar folder (Supplementary Coding File 1).

Subscription Required. Please recommend JoVE to your librarian.

Representative Results

Evaluation of the mqTrans view of the transcriptomic dataset
The test code uses eleven feature selection (FS) algorithms and seven classifiers to evaluate how the generated mqTrans view of the transcriptomic dataset contributes to the classification task (Figure 6). The test dataset consists of 317 colon adenocarcinoma (COAD) from The Cancer Genome Atlas (TCGA) database²⁹. The COAD patients at stages I or II are regarded as the negative samples, while those at stages III or IV are the positive ones.

Eleven FS algorithms are implemented in the test code. There are three filter-based FS algorithms, including, select K best features by MIC (SK_mic), select K features by the FPR of MIC (SK_fpr), and select K features by the highest FDR of MIC (SK_fpr). Three tree-based FS algorithms evaluate the individual features by a decision tree with gini index (DT_gini), the adaptive boosted decision trees (AdaBoost) and the random forest (RF_fs), respectively. The FS pool of the test code also evaluates two wrappers recursive feature elimination (RFE) with the linear support vector classifier (SVC)(RFE_SVC) and RFE with the logistic regression classifier (RFE_LR), and two embedding algorithms linear SVC classifier with the top-ranked L1 feature importance values (lSVC_L1) and logistic regression classifier with the top-ranked L1 feature importance values (LR_L1).

The test code builds the classification models using seven classifiers, including linear support vector machine (SVC), Gaussian Naïve Bayes (GNB), logistic regression classifier (LR), k-nearest neighbor, k-5 by default (KNN), XGBoost, random forest (RF) and decision tree (DT).

Figure 6 shows the maximum test accuracy of the mqTrans features, the original mRNA features, and the combined subset of the mRNA and mqTrans features recommended by each FS algorithm.

The combined feature subsets (mRNA+mqTrans) have achieved the highest accuracy 0.7656 on the "SK_fpr" FS method, better than the individual feature types mqTrans (0.7188) and original mRNA (0.7188). Similar patterns can be observed for the other FS algorithms. The user can check the selected features in the output file Output-SelectedFeatures.csv.

Detecting the dark biomarkers
Previous studies showed the existence of the undifferentially expressed genes with significantly differentially represented mqTrans values between the phenotypic and control groups²⁶^,³⁸^,³⁹. These genes are called dark biomarkers because traditional biomarker detection studies ignore them by their undifferential expressions. The statistical analysis function t.test in Microsoft Excel can be used to define a feature that is differentially expressed if its statistical p-value is smaller than 0.05.

Among the 3062 features with the generated mqTrans values 221 dark biomarkers were detected (Figure 7). The third-ranked gene ENSG00000163697 (APBB2, Amyloid Beta Precursor Protein Binding Family B Member 2) shows significantly differentially represented mqTrans values (mqTrans.P = 2.03 x 10^-4) while its original expression level shows no differential expression (mRNA.P = 3.80 x 10^-1). The keyword APBB2 hit 27 publications in the PubMed database⁴⁰, but no connections with the colon or intestine were detected.

Another gene ENSG00000048052 (HDAC9, Histone Deacetylase 9) has the differentially represented mqTrans values (mqTrans.P = 6.09 x 10^-3) while maintaining practically the same normal distributions between the phenotypic and control groups (mRNA.P = 9.62 x 10^-1). The keyword HDAC9 hit 417 publications in the PubMed database. Three studies also mentioned the keywords "colon" or "intestine" in the abstracts⁴¹^,⁴²^,⁴³. But, none of them investigated the roles of HDAC9 in colon cancer.

The data suggested the necessity of further evaluations of these dark biomarkers from their post-transcription activities, e.g., the translated protein levels⁴⁴^,⁴⁵.

Pan-cancer distributions of metabolism-related dark and traditional biomarkers
The metabolism-related traditional biomarkers were screened and compared against dark biomarkers across 26 cancer types in the TCGA dataset³⁸. Both categories of biomarkers underwent statistical evaluation to discern significance levels across early (Stages I and II) and late (Stages III and IV) cancer stages. This evaluation employed Student's t-tests for p-values, subsequently corrected for multiple testing using false discovery rates (FDRs). Detailed data for each of the 26 cancer types are provided in Figure 8.

Genes yielding FDR-corrected p-values below 0.05 were classified as traditional biomarkers. In contrast, dark biomarkers were defined as those with FDR-corrected p-values below 0.05 in the mqTrans view while concurrently exhibiting no statistically significant differences in expression levels.

Figure 9 reveals a general scarcity of dark biomarkers in comparison to traditional biomarkers across most cancer types. Noteworthy exceptions include BRCA, MESO, and TGCT, which manifest a greater prevalence of dark biomarkers. It is revealed that various factors, including transcription factors, methylation patterns, gene mutations, and environmental conditions, could modulate the transcriptional dysregulation of these dark biomarkers. Further complexity may arise due to overlapping non-coding RNA transcripts that could confound the expression levels of dark biomarkers. Transcription dysregulations of some dark biomarkers were supported by their differential protein levels⁴⁴^,⁴⁵. The dark biomarkers are often overlooked in traditional studies and present intriguing avenues for future mechanistic investigations.

Figure 1: An overview of the HealthModel and feature selection modules in this protocol. Replace the specific algorithms in the feature selection pool and the classifier pool if the user is familiar with the Python programming. Please click here to view a larger version of this figure.

Figure 2: Complete Code Flow for this Protocol. (A) Prepare Python Environment. To begin, create a virtual environment and install essential packages. For comprehensive instructions, refer to Section 1. (B) Generate mqTrans Features. Obtain mqTrans features by executing the provided code step by step. Detailed explanations can be found in Section 2. (C) Select mqTrans Features. This section focuses on assessing the mqTrans features. Refer to Section 3 for in-depth details. Please click here to view a larger version of this figure.

Figure 3: Prepare environment for Python. (A) The command to create healthmodel. (B) Enter y during the creating VE process. (C) The most common command for activating the VE. (D) The command for installing torch 1.13.1. (E) Install additional libraries for torch-geometric package. (F) Install torch-geometric package. Please click here to view a larger version of this figure.

Figure 4: Run the HealthModel to get mqTrans feature. (A) Download the code. (B) The example of data file. Each column has all the values of a regulatory factor, and the first item is the gene ID. Each row gives the values of a given sample, with the first item being the sample name. (C) The example of a label file. The first column gives the sample names, and the class label of each sample is given in the column titled label. The 0 value in the label column means this sample is alive, 1 means dead. (D) the outputs of mqTrans. Please click here to view a larger version of this figure.

Figure 5: Run the feature selection algorithm for the mqTrans feature. The results of the feature selection algorithm are shown to the user. Please click here to view a larger version of this figure.

Figure 6: The maximum test set accuracy of each feature selection algorithm. The horizontal axis lists the feature selection algorithms, and the vertical axis gives the values of accuracies. The histograms show the experimental data of the three settings, i.e., mqTrans, mRNA, mRNA+mqTrans. Please click here to view a larger version of this figure.

Figure 7: Top 50 dark biomarkers with the smallest p-values in the mqTrans view. The column "Dark Biomarker" gives the dark biomarker names. The columns "mRNA.P" and "mqTrans.P" are the statistical t-test p-values between the phenotypic and control groups. The background colors of the p-values are colored between the p-values 1.00 (blue) and 0.00 (red), and the white color represents p-value = 0.05. Please click here to view a larger version of this figure.

Figure 8: The details of the 26 cancers in The Cancer Genome Atlas (TCGA) at different stages. The columns "Cohort" and "Disease Tissue" describe the patient group and the tissues with disease for each dataset. The last four columns give the numbers of samples in the developmental stages I, II, III, and IV, respectively. Please click here to view a larger version of this figure.

Figure 9: The numbers of dark biomarkers and traditional biomarkers in 26 cancers. The horizontal axis lists the 26 cancer types. The vertical axis gives the numbers of dark biomarkers and traditional biomarkers for these cancer types. Please click here to view a larger version of this figure.

Supplementary Coding File 1: HealthModel-mqTrans-v1-00.tar Please click here to download this File.

Subscription Required. Please recommend JoVE to your librarian.

Discussion

Section 2 (Use the pre-trained HealthModel to generate the mqTrans features) of the protocol is the most critical step within this protocol. After preparing the computational working environment in section 1, section 2 generates the mqTrans view of a transcriptomic dataset based on the pre-trained large reference model. Section 3 is a demonstrative example of selecting the generated mqTrans features for biomarker detections and prediction tasks. The users can conduct other transcriptomic analyses on this mqTrans dataset using their own tools or codes.

The original HealthModel framework can further refine the pre-trained HealthModel using the multitask architecture, as described in²⁶. This protocol focuses on the utilization of the pre-trained reference model to generate the mqTrans view of a transcriptomic dataset.

The default pre-trained reference model was established on the healthy samples and may not be a good choice for some specific tasks, e.g., the investigation between the primary and metastatic cancers. The computational speed is also slow for a large transcriptomic dataset.

The significance of this protocol is to provide a complementary mqTrans view of the most abundantly-available OMIC data type, i.e., transcriptome. Dark biomarkers can be revealed from the undifferentially expressed genes ignored by the conventional transcriptomic analysis. A recent study detected seven dark biomarkers of metastatic colon cancer (mCC) based on three independent cohorts of 805 samples in total⁴⁴. Dark biomarkers received limited wet-lab investigations due to their undifferential expressions. However, one of the detected mCC dark biomarker YTHDC2 encodes the protein YTH domain containing 2, whose protein levels were observed to be positively correlated with the metastasis status of human gastric cancer cells⁴⁶ and colon cancers⁴⁷. Novel biological insights of dark biomarkers remain to be resolved through in vitro and in vivo technologies.

This protocol is designed to be fully modular. Reference models pre-trained on other large datasets like primary cancers will facilitate the investigation of tumor metastasis. This protocol will also be explored for applications in other life domains, including plants, fungi, and microbes.

The computational efficiency of this protocol is planned to be enhanced through parallelization and algorithmic optimization.

This protocol describes the procedure to transform a transcriptomic dataset to a new mqTrans view, and the transformed mqTrans values of a gene quantitatively measure the transcription regulation changes compared against the reference samples. A default model was pre-trained on the healthy transcriptomes and released as the reference HealthModel.

The source code of two downstream tasks is provided to facilitate the easy utilization of this protocol by biomedical researchers. The experimental data shows that the transformed mqTrans features could improve the prediction tasks using only the original expression levels. The mqTrans view can also unveil the latent phenotypic connections of some dark biomarkers without differential expressions in the original transcriptomic data.

Subscription Required. Please recommend JoVE to your librarian.

Disclosures

The authors have nothing to disclose.

Acknowledgments

This work was supported by the Senior and Junior Technological Innovation Team (20210509055RQ), Guizhou Provincial Science and Technology Projects (ZK2023-297), the Science and Technology Foundation of Health Commission of Guizhou Province (gzwkj2023-565), Science and Technology Project of Education Department of Jilin Province (JJKH20220245KJ and JJKH20220226SK), the National Natural Science Foundation of China (U19A2061), the Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), and the Fundamental Research Funds for the Central Universities, JLU. We extend our sincerest appreciation to the review editor and the three anonymous reviewers for their constructive critiques, which have been instrumental in substantially enhancing the rigor and clarity of this protocol.

Materials

Name	Company	Catalog Number	Comments
Anaconda	Anaconda	version 2020.11	Python programming platform
Computer	N/A	N/A	Any general-purpose computers satisfy the requirement
GPU card	N/A	N/A	Any general-purpose GPU cards with the CUDA computing library
pytorch	Pytorch	version 1.13.1	Software
torch-geometric	Pytorch	version 2.2.0	Software