Research Article
Yiqiang Da*1, Xuerui Chen*1, Lanxin Yu2, Yunxin Li2, Zixu Chen2, Yuyue Zhang2, Hualin Wang3, Zhuang Zhu3, Yuan Liu1, Yao Geng4
1The First Clinical Medical College, Nanjing Medical University; 2School of Pediatrics, Nanjing Medical University; 3Department of Neurology, The First Affiliated Hospital with Nanjing Medical University; 4Rehabilitation Medicine Center, The First Affiliated Hospital with Nanjing Medical University
Here, we present a protocol to thoroughly analyze the role of a single gene in bladder cancer (BLCA) based on transcriptome analysis and single-cell analysis, together with 101 combinations of machine learning algorithms, to build a prognostic model related to that gene.
This article introduces an analysis method based on public transcriptomic and single-cell datasets that can be used to comprehensively describe the role of a single gene in tumors, including shaping the tumor immune microenvironment, shaping tumor molecular subtypes, and predicting patient prognosis. Introducing single-cell data not only reduces the randomness and heterogeneity associated with analyzing transcriptome data alone but also allows a deeper exploration of which specific cell clusters express the gene and of the role the gene plays within pathways. Because many researchers may not be proficient in single-cell analysis, this method article introduces an online tool, TISCH2 (http://tisch.compbio.cn/), with which the single-cell analysis can be completed. In addition, 101 combinations of machine learning algorithms are applied to construct the most accurate prognostic model. In conclusion, this integrated single-gene analysis method, which combines bioinformatics analysis, machine learning, and single-cell analysis, can play a crucial role in studying the functions of single genes in tumor progression and in pathways.
Bladder cancer (BLCA) is one of the most aggressive and metastatic malignant tumors worldwide, and current biomarkers for bladder cancer still have shortcomings, such as inaccuracy1. To predict the prognosis and outcomes of BLCA patients, the search for biomarkers and the establishment of prognostic models are of great importance. Although some methods for biomarker research have been developed, most are currently limited to transcriptomics, which inevitably introduces sample heterogeneity2. Moreover, transcriptomic data alone often cannot reveal the roles played by different cell subpopulations or the functions of genes in different pathways at the single-cell level, which made past research less precise and effective3. Because single-cell analysis can be challenging for beginners, an online platform for single-cell analysis is introduced to help them learn the skills quickly4. Finally, even if a single gene is found to be a suitable biomarker through transcriptome and single-cell analysis, there is no guarantee that the biomarker will be applicable across all cohorts, so it is necessary to construct prognostic models associated with the biomarker to make the conclusions more broadly applicable5. Here, "101 machine learning" refers to the construction of 101 prognostic models from combinations of 10 different machine learning algorithms, with the goal of identifying the optimal prognostic model. The inclusion of algorithms such as random forest, XGBoost, and SVM, which are better able to handle complex data and are more stable, makes this combined approach remarkably stable.
To make the prognostic model more accurate and relevant to the biomarker, a correlation analysis is first conducted between all genes and the biomarker. Around 20-30 genes are then selected based on requirements, the 101 machine learning combinations are used to build the prognostic model, and a series of analyses follows to finalize the results.
Compared with biomarkers predicted by transcriptomic analysis alone6, biomarkers derived from combined single-cell and transcriptomic analysis allow an unbiased breakdown of tissues or samples into their basic cellular components. This approach enables the clear differentiation of cell types (such as T cells, B cells, and macrophages), the discovery of new subpopulations (such as exhausted T cells and regulatory T cells), and the capture of continuous dynamic processes (such as cell differentiation trajectories)4. This is akin to laying out fruit and milk separately rather than blended together, so that each component and its quantity is clearly visible. Compared with single-cohort Cox models7, the prognostic model constructed using 101 machine learning combinations is also more accurate, as it can automatically learn complex, non-linear interactions between variables from the data. For instance, the impact of a specific genetic mutation might only be significant in patients of certain ages and tumor sizes. Machine learning models such as random forests and neural networks can automatically capture these high-order interactions without the need for manual specification8. The key applicability constraints are that the dataset used for analysis must cover the vast majority of genes, and that the number of genes used to construct the prognostic model should not be too high; around 20-30 is generally optimal. For single-cell quality control, the cell number per dataset should be greater than 1000, the UMI count per cell should be greater than 1000, and more than 500 genes should be detected per cell9.
Here, a step-by-step approach is provided for identifying novel biomarkers from public transcriptomic and single-cell datasets, taking the role of TRPM4 in BLCA as an example. Multiple research methods and diverse datasets, including the Cancer Genome Atlas-Bladder Cancer (TCGA-BLCA) dataset, the single-cell dataset GSE145281, and the BLCA datasets GSE32894 and GSE31684, are employed to enhance the precision and applicability of biomarkers in oncology from multiple perspectives and to advance the study of TRPM4 in BLCA. The TCGA-BLCA dataset was obtained from the University of California, Santa Cruz Xena (UCSC Xena) website. GSE32894 and GSE31684 were downloaded from the Gene Expression Omnibus (GEO), and the single-cell dataset GSE145281 was sourced from Tumor Immune Single-Cell Hub 2 (TISCH2). Additionally, data on BLCA molecular subtypes and treatment responses were extracted from the supplementary spreadsheet files of a relevant article. Bioinformatics analysis, single-cell analysis, and machine learning are then conducted, establishing an integrated single-gene research methodology.
NOTE: All code used in this article can be found at https://github.com/YaoGeng-nmu/Analysis-of-TRPM4-based-on-TCGA-data-and-single-cell-data-in-BLCA/blob/main/code.
1. Preparation of transcriptomics data
2. Preparation of single-cell analysis
3. Preparation of BLCA molecular subtype and treatment response data
4. BLCA immune microenvironment analysis of TRPM4
5. Single-cell analysis of TRPM4 in the GSE145281 dataset
6. Construction of a TRPM4-related prognostic model using 101 machine learning combinations
The core benefit of using transcriptomics to analyze the immune microenvironment for an individual gene is that it directly relates the expression of that gene to immune cells and immune molecules. This makes it possible to observe immune characteristics directly and to identify differences in the proportion of infiltrating immune cells, or in the expression of immune checkpoint molecules, between groups with high and low expression of the gene18. The representative results of the bioinformatics analysis are those related to the immune microenvironment and immune molecules, such as violin plots. The representative results of the single-cell analysis are the target gene expression levels and the pathway heatmap for each cell cluster. The representative results of the 101 machine learning models are the heatmaps of the algorithm combinations used to build prognostic models.
First, the heatmap shows that the 133 immunomodulators are more highly expressed in the low-TRPM4 group (Figure 1A). According to the ESTIMATE R package, the StromalScore, ImmuneScore, and ESTIMATEScore are all lower in the high-TRPM4 group, indicating that immune infiltration and stromal infiltration are often insufficient in samples with high TRPM4 expression (Figure 1B-D). In addition, the triangular correlation heatmap shows that TRPM4 is negatively correlated with several immune checkpoints, such as VTCN1, CD274, PDCD1, CTLA4, HAVCR2, TIGIT, IDO1, CD80, CD86, LAIR1, PVR, CD200R1, CD200, LGALS3, CEACAM1, BTLA, ADORA2A, KIR3DL1, KLRC1, CD276, CD47, KLRD1, and LAG3 (Figure 1E). Correlation plots were also generated between TRPM4 and T cell exhaustion genes, including CTLA4, LAG3, HAVCR2, CD3E, TIGIT, and PDCD1 (Figure 1F).
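The high/low group comparison described above can be sketched as follows. This is only an illustration in Python, not the R code of the protocol. It assumes that samples are split at the median expression of the gene, which the text does not state, and every sample name and value is made up.

```python
import statistics

def split_by_median(expression_by_sample):
    """Split samples into high/low groups at the median expression (assumed cutoff)."""
    cutoff = statistics.median(expression_by_sample.values())
    high = [s for s, v in expression_by_sample.items() if v > cutoff]
    low = [s for s, v in expression_by_sample.items() if v <= cutoff]
    return high, low

def group_means(score_by_sample, high, low):
    """Mean of a per-sample score (e.g., an immune score) in each group."""
    high_mean = statistics.mean(score_by_sample[s] for s in high)
    low_mean = statistics.mean(score_by_sample[s] for s in low)
    return high_mean, low_mean

# Hypothetical toy data: expression of the gene and an immune score per sample.
toy_expression = {"S1": 1.2, "S2": 3.4, "S3": 0.8, "S4": 5.1, "S5": 2.0, "S6": 4.4}
toy_immune_score = {"S1": 900, "S2": 300, "S3": 1100, "S4": 150, "S5": 700, "S6": 250}

high_group, low_group = split_by_median(toy_expression)
high_mean, low_mean = group_means(toy_immune_score, high_group, low_group)
print(high_group, low_group, high_mean, low_mean)
```

In a real analysis, the two groups would additionally be compared with a statistical test, as the significance stars in the violin plots indicate.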
Although transcriptomics clearly demonstrates the association between TRPM4 and the tumor microenvironment (TME), analysis at this level cannot distinguish which cell type or subpopulation is responsible for these phenotypes. To pinpoint the cellular origin of the immune phenotype and to clarify whether a specific type of immune cell has undergone a functional change, single-cell transcriptomic analysis is necessary. By analyzing the gene expression characteristics of individual cells, the immune phenotype can be traced back to specific cell subpopulations. GSE145281, a single-cell dataset of bladder cancer immune cells, was therefore introduced to investigate the role of TRPM4 in immune cell pathways. All 14462 immune cells are divided into 17 clusters (Figure 2A) and are annotated as B cells, CD4 T cells, CD8 T cells, monocytes (Mono), and NK cells (Figure 2B). Among all 14462 immune cells, 8465 are monocytes, 2677 are CD8 T cells, 1771 are CD4 T cells, 566 are B cells, and 995 are NK cells (Figure 2C). The distribution of TRPM4 is also shown (Figure 2D). The results show that TRPM4 is predominantly expressed in monocytes. The violin plot shows that TRPM4 is mostly expressed in the C1 and C2 clusters (Figure 2E), and the results in Figure 2F confirm this finding. In addition, the cell chat analysis shows that cluster C1 is most associated with clusters C0 and C13, while cluster C2 is most associated with clusters C0 and C11 (Figure 2G-H). The role of TRPM4 is also explored within the Hallmark gene set and depicted using a heatmap (Figure 2I). The heatmap shows that the C1 cluster is mostly enriched in the TNF-α pathway, which suggests that TRPM4 likely exerts its effects through the TNF-α pathway. The C2 cluster is mostly enriched in the xenobiotic metabolism pathway.
The single-cell signature analysis also shows how the two pathways mentioned above are distributed in relation to TRPM4 (Figure 3A-B). Cell chat bubble plots are generated to describe the cell chat of the C1 and C2 clusters (Figure 3C-D). Lastly, the TF enrichment heatmap is generated (Figure 3E). The rank plots indicate that SPI1 could be a transcription factor that regulates gene expression in each cell population (Figure 3F-G). Through single-cell analysis, the cell subpopulations associated with TRPM4 have been identified, and the potential mechanisms by which they influence disease progression have been clarified. These findings provide a theoretical foundation for understanding the nature of the disease, but they have not yet been translated into tools that can directly guide clinical decision-making. To bring these insights into practice, machine learning is integrated to build prognostic models. This involves collecting genes that are strongly correlated with TRPM4 and ultimately developing a model that can reliably predict patient survival. All the genes highly related to TRPM4 are listed in Supplementary Table 4. Finally, 101 machine learning models are employed to predict the prognosis of BLCA patients.
To avoid the randomness and inaccuracy that come from relying on a single data source, multiple BLCA datasets are used, such as GSE32894 and GSE31684. The results show that CoxBoost+RSF predicts the prognosis of BLCA patients best, with a C-index of 0.879 (Figure 4).
To illustrate how the protocol can be adapted to another gene, the ESTIMATE R package was also applied to analyze the StromalScore, ImmuneScore, and ESTIMATEScore in different SDCBP expression groups (Supplementary Figure 1A-C), and a triangular correlation heatmap was generated between SDCBP and immune checkpoints (Supplementary Figure 1D). A qPCR experiment was also performed, and its results align with our bioinformatics analysis. The human BLCA cell line T24 and the normal bladder cell line SVHUC1 were used. As shown in the result graph, cells with high TRPM4 expression also exhibit high levels of TRPM4 (Supplementary Figure 2).
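The selection of genes correlated with TRPM4 can be sketched as below. This is a minimal Python illustration, not the R code of the protocol. It assumes Pearson correlation and ranking by absolute correlation, neither of which is specified in the text; `top_correlated_genes` and the toy genes `GENE_A` to `GENE_C` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant gene carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def top_correlated_genes(expression, target, n_genes=25):
    """Rank genes by |r| with the target gene and keep the top n_genes (assumed rule)."""
    target_values = expression[target]
    scores = {
        gene: pearson(values, target_values)
        for gene, values in expression.items()
        if gene != target
    }
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return [(gene, scores[gene]) for gene in ranked[:n_genes]]

# Hypothetical toy matrix: gene -> expression across the same five samples.
toy_expression = {
    "TRPM4": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_A": [2.0, 4.1, 6.0, 8.2, 10.0],  # strongly positive
    "GENE_B": [5.0, 4.0, 3.1, 2.0, 1.0],   # strongly negative
    "GENE_C": [3.0, 1.0, 4.0, 1.0, 5.0],   # weakly related
}
selected = top_correlated_genes(toy_expression, "TRPM4", n_genes=2)
print(selected)
```

With a real expression matrix, `n_genes` would be set to roughly 20-30, in line with the range recommended in this article.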
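For readers unfamiliar with the C-index, the sketch below shows how Harrell's concordance index is computed in general. It is a plain Python illustration with made-up survival data, and it is not the implementation used by the R packages in this protocol.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable patient pairs ranked correctly.

    times       -- follow-up time per patient
    events      -- 1 if the event (death) was observed, 0 if censored
    risk_scores -- model output; a higher score means a worse predicted prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy cohort of five patients.
toy_times = [5, 10, 12, 20, 25]
toy_events = [1, 1, 0, 1, 0]
toy_risk = [0.9, 0.7, 0.8, 0.3, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_risk)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a C-index close to 1 indicates a well-performing prognostic model.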

Figure 1: Bioinformatics analysis of TRPM4 in shaping tumor microenvironment. (A) Heatmap plot of 133 immunomodulators in different TRPM4 groups. (B-D) Violin plots of StromalScore, ImmuneScore, and ESTIMATEScore in different TRPM4 groups. ****p ≤ 0.0001. (E) Triangular correlation heatmap between TRPM4 and immune checkpoints. (F) Correlation plots between T cell exhaustion genes and TRPM4.

Figure 2: Single-cell analysis of TRPM4 using the GSE145281 dataset. (A) Different clusters of GSE145281. (B) Different immune cells of GSE145281. (C) Pie plot of GSE145281. (D) The distribution of TRPM4. (E-F) Violin plot of TRPM4 in different clusters and immune cells. (G-H) Cell chat analysis of C1 and C2 clusters. (I) Hallmark pathway analysis of different clusters.

Figure 3: Single-cell analysis of TRPM4 using the GSE145281 dataset. (A-B) Single-cell signature analysis using the GSE145281 dataset. (C-D) Bubble plot of C1 and C2 clusters. (E-G) TF enrichment heatmap plots.

Figure 4: 101 machine learning heatmap plot of prognostic model relating to TRPM4. The heatmap plot of different combinations of machine learning algorithms.
Supplementary Figure 1: Bioinformatics analysis of SDCBP in shaping tumor microenvironment. (A-C) Violin plots of StromalScore, ImmuneScore, and ESTIMATEScore in different SDCBP groups. ****p ≤ 0.0001. (D) Triangular correlation heatmap between SDCBP and immune checkpoints.
Supplementary Figure 2: Gene expression data in cell lines.
Supplementary Table 1: IDs of all TCGA-BLCA patients.
Supplementary Table 2: Versions of the packages used in this article.
Supplementary Table 3: All marker genes used for TISCH2.
Supplementary Table 4: All the genes correlated with TRPM4.
In previous studies, bioinformatics analysis of single genes often suffered from superficiality, inaccuracy, and limited applicability19. Moreover, previous research on single genes has typically been confined to transcriptome data, where sample heterogeneity and randomness make it difficult to identify universal patterns20. To address these issues, single-cell data have been introduced to better understand the role of individual genes at the pathway level. One key step in this article is the use of the online tool TISCH2 (http://tisch.compbio.cn/) for single-cell analysis4. The quality control standard is that the cell number per dataset is over 1000, the UMI count per cell is over 1000, and the gene number per cell is over 500. The other key step is the 101 machine learning approach, which is used to find the best prognostic model for predicting patient prognosis and clinical outcomes21. Finally, several types of bioinformatics plots should be used to describe the results for a single gene.
Issues may arise while creating plots and performing analyses. If few or no genes appear in the heatmap output, carefully check whether there are spaces after the gene IDs, as these spaces can significantly affect the results. If the code frequently throws errors, the cause is often a modification that was not made correctly, most commonly a file path that still points to a location that does not exist on the user's computer. To resolve this, set the path to the correct location on the user's own machine. Finally, building a prognostic model using 101 machine learning requires patience. The process can take from half an hour to an hour, and all necessary R packages should be installed beforehand.
However, this method still has several limitations. Due to time limitations, the conclusion was validated through bioinformatics analysis only in BLCA and not in other cancer types. The conclusion may therefore not be applicable to other cancers, and more in-depth and extensive research is needed to provide a comprehensive explanation. The protocol could be enriched by adding more kinds of bioinformatics analysis. In addition, the bioinformatics results should be integrated with experiments. For example, if the analysis finds a connection between TRPM4 and the TNF-α pathway, PCR experiments can be conducted to verify this connection, and the bioinformatics results can guide the direction of experimental verification.
This method article also provides convenient download methods for a large amount of data and introduces an online single-cell analysis website, which greatly reduces the analysis burden for beginners and offers more options to more people. Various bioinformatics code is also provided so that users can create their own plots, such as a violin plot, a heatmap, a correlation plot, and a triangular correlation heatmap22. Another significant advantage of the method is that the gene can be replaced and other single genes studied in the same manner, which makes the method broadly applicable. In the future, we will increase the sample size by collecting data from multiple hospitals and extend the study to additional cancer types to assess whether the findings hold true across diverse cancers. Concurrently, a robust prognostic model will be built by integrating TRPM4 with established biomarkers commonly used in BLCA, such as TNM stage, in order to enhance patient outcome prediction. Machine learning techniques will be employed to build this model, with a focus on improving its interpretability and predictive accuracy, which constitutes the primary future work.
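The quality control thresholds stated above can be expressed as a simple filter. The Python sketch below is only an illustration with made-up cells. It assumes the thresholds are strict ("over"), and the record fields `umi_count` and `gene_count` are hypothetical names rather than fields defined by TISCH2.

```python
def filter_cells(cells, min_umi=1000, min_genes=500):
    """Keep cells whose UMI count and detected gene number exceed the thresholds."""
    return [
        cell for cell in cells
        if cell["umi_count"] > min_umi and cell["gene_count"] > min_genes
    ]

def dataset_has_enough_cells(cells, min_cells=1000):
    """Dataset-level check: the number of cells must exceed min_cells."""
    return len(cells) > min_cells

# Hypothetical toy records, one dictionary per cell.
toy_cells = [
    {"barcode": "cell_1", "umi_count": 2500, "gene_count": 900},
    {"barcode": "cell_2", "umi_count": 800, "gene_count": 650},   # too few UMIs
    {"barcode": "cell_3", "umi_count": 1500, "gene_count": 450},  # too few genes
    {"barcode": "cell_4", "umi_count": 1001, "gene_count": 501},
]
kept_cells = filter_cells(toy_cells)
print([cell["barcode"] for cell in kept_cells])
print(dataset_has_enough_cells(toy_cells))
```

Whether the dataset-level cell count is checked before or after per-cell filtering is not specified in the text, so the two checks are kept as separate functions here.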
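The gene ID problem described above can be checked before plotting. The Python sketch below is an illustration rather than part of the published R code; the helper `clean_gene_ids` and the toy lists are hypothetical. It shows how stray whitespace prevents gene IDs from matching the expression matrix.

```python
def clean_gene_ids(gene_ids):
    """Strip surrounding whitespace, drop empty entries, and remove duplicates in order."""
    cleaned = []
    seen = set()
    for raw in gene_ids:
        gene = raw.strip()
        if gene and gene not in seen:
            seen.add(gene)
            cleaned.append(gene)
    return cleaned

# Hypothetical gene list as it might be pasted from a spreadsheet.
raw_ids = ["CD274 ", " PDCD1", "CTLA4", "CTLA4 ", "", "LAG3\t"]
# Gene names present in a toy expression matrix.
matrix_genes = {"CD274", "PDCD1", "CTLA4", "LAG3"}

matched_before = [g for g in raw_ids if g in matrix_genes]
matched_after = [g for g in clean_gene_ids(raw_ids) if g in matrix_genes]
print(matched_before)
print(matched_after)
```

Only one gene matches before cleaning, whereas all four match afterwards, which mirrors the nearly empty heatmap described in the troubleshooting note.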
The authors declare that they have no conflict of interest.
We would like to thank the BioBean Informatics Consortium for developing an intelligent analytical framework (available at http://www.sxdyc.com/). Their innovative computational infrastructure substantially accelerated the research workflow through precision analytics and automated data interpretation modules.
| R 4.3.3 | none | none | none |