Analysis Based on TCGA Data and Single-cell Data, taking TRPM4 as an Example

Yiqiang Da; Xuerui Chen; Lanxin Yu; Yunxin Li; Zixu Chen; Yuyue Zhang; Hualin Wang; Zhuang Zhu; Yuan Liu; Yao Geng

doi:10.3791/69304

Method Article

December 5th, 2025

Yiqiang Da^*¹ , Xuerui Chen^*¹ , Lanxin Yu² , Yunxin Li² , Zixu Chen² , Yuyue Zhang² , Hualin Wang³ , Zhuang Zhu³ , Yuan Liu¹ , Yao Geng⁴

¹The First Clinical Medical College, Nanjing Medical University, ²School of Pediatrics, Nanjing Medical University, ³Department of Neurology, The First Affiliated Hospital with Nanjing Medical University, ⁴Rehabilitation Medicine Center, The First Affiliated Hospital with Nanjing Medical University

^* These authors contributed equally

Summary

Here, we present a protocol for thoroughly analyzing the role of a single gene in bladder cancer (BLCA) using transcriptome analysis and single-cell analysis, together with 101 machine learning algorithm combinations, to build a prognostic model related to that gene.

Abstract

This article introduces an analysis method based on public transcriptomic datasets and single-cell datasets that can be used to describe comprehensively the role of a single gene in tumors, including shaping the tumor immune microenvironment, shaping tumor molecular subtypes, and predicting the prognosis of tumor patients. Adding single-cell data not only reduces the randomness and heterogeneity that arise when transcriptome data are analyzed alone but also allows a deeper exploration of which specific cell clusters express the gene and of the role the gene plays within pathways. Because many researchers may not be proficient in single-cell analysis, this method article introduces an online resource, TISCH2 (http://tisch.compbio.cn/), with which the single-cell analysis can be completed. In addition, 101 machine learning algorithm combinations are applied to identify the most accurate prognostic model. In conclusion, this integrated single-gene analysis method, which combines bioinformatics analysis, machine learning, and single-cell analysis, is expected to be valuable for studying the functions of single genes in tumor progression and in pathways.

Introduction

Bladder cancer (BLCA) is one of the most aggressive and metastatic malignant tumors, and the current biomarkers for bladder cancer still have problems, such as limited accuracy¹. Finding biomarkers for bladder cancer and establishing prognostic models are therefore of great importance for predicting the outcomes of BLCA patients. Although some research methods for biomarkers have been developed, most of them are currently limited to transcriptomics, which inevitably leads to heterogeneity among samples². Moreover, studying transcriptomics data alone often fails to reveal the roles played by different cell subpopulations, as well as the functions of genes in different pathways at the single-cell level, which makes past research less precise and effective³. Because single-cell analysis can be challenging for beginners, an online platform for single-cell analysis is introduced here to help them learn the skills quickly⁴. Finally, even if a single gene is found to be a suitable biomarker through transcriptome analysis and single-cell analysis, there is no guarantee that the biomarker will be applicable across all cohorts, so it is necessary to construct prognostic models associated with the biomarker to make the conclusions more widely applicable⁵.

The 101 machine learning approach refers to the construction of 101 prognostic models from combinations of 10 different machine learning algorithms, with the goal of identifying the optimal prognostic model. The inclusion of algorithms such as random forest, XGBoost, and SVM, which are better able to handle complex data and are more stable, contributes to the stability of the combined approach. To make the prognostic model more accurate and more relevant to the biomarker, a correlation analysis is first conducted between all genes and the biomarker; around 20-30 genes are then selected according to requirements; and the 101 machine learning approach is subsequently used to build the prognostic model, followed by a series of analyses to finalize the results.

Compared with biomarkers predicted in the past by transcriptomics analysis alone⁶, biomarkers derived from combined single-cell and transcriptomics analysis allow an unbiased breakdown of tissues or samples into their basic cellular components. This enables the clear differentiation of cell types (such as T cells, B cells, and macrophages), the discovery of new subpopulations (such as exhausted T cells and regulatory T cells), and the capture of continuous dynamic processes (such as cell differentiation trajectories)⁴. It is akin to arranging fruits and milk separately, so that each component and its quantity can be seen clearly. Compared with single-cohort Cox models⁷, the prognostic model constructed using the 101 machine learning approach is also more accurate, as it can automatically learn complex, non-linear interactions between variables from the data. For instance, the impact of a specific genetic mutation might only be significant in patients of certain ages and tumor sizes. Machine learning models such as random forests and neural networks can capture these high-order interactions automatically, without manual specification⁸. The key applicability constraints are that the dataset used for analysis must include the vast majority of genes, so that the number of genes is not too low, and that the number of genes used to construct the prognostic model must not be too high; around 20-30 is generally optimal. For quality control of the single-cell dataset, the cell number per dataset should be greater than 1000, the UMI count per cell should be greater than 1000, and the gene number per cell should be greater than 500⁹.

Here, a step-by-step approach is provided for identifying novel biomarkers from public transcriptomic datasets and single-cell datasets, taking the role of TRPM4 in BLCA as an example. Multiple research methods and diverse datasets, including the Cancer Genome Atlas-Bladder Cancer (TCGA-BLCA) dataset, the single-cell dataset GSE145281, and the BLCA datasets GSE32894 and GSE31684, are employed to enhance the precision and applicability of biomarkers in oncology from multiple perspectives and to advance the study of TRPM4 in BLCA. The TCGA-BLCA dataset was obtained from the University of California, Santa Cruz Xena (UCSC Xena) website. GSE32894 and GSE31684 were downloaded from the Gene Expression Omnibus (GEO), and the single-cell data GSE145281 were sourced from Tumor Immune Single-Cell Hub 2 (TISCH2). Additionally, data on BLCA molecular subtypes and treatment responses were extracted from the supplementary spreadsheet files of a relevant article. Bioinformatics analysis, single-cell analysis, and machine learning are then conducted, establishing an integrated single-gene research methodology.

Protocol

NOTE: All the code used in this article can be found at https://github.com/YaoGeng-nmu/Analysis-of-TRPM4-based-on-TCGA-data-and-single-cell-data-in-BLCA/blob/main/code.

1. Preparation of transcriptomics data

Preparation of the TCGA-BLCA dataset
1. Download the TCGA datasets from the UCSC Xena website (https://xenabrowser.net/datapages/)¹⁰. The downloaded datasets are pre-processed, eliminating the need for additional work such as gene annotation¹¹.
2. Select the TCGA dataset and then click on the TCGA-BLCA dataset. On the TCGA-BLCA dataset page, click on Gene Expression RNA-seq, and download the clinical data and gene expression profiles. Ensure that the dataset contains mRNA expression data for 400 BLCA patients.
3. To differentiate tumor samples from adjacent normal tissue samples, separate the samples whose IDs end in 01 from those ending in 11: 01 indicates a tumor sample, while 11 indicates cancer-adjacent tissue (a normal sample).
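The 01/11 split in step 3 can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the function name and the sample IDs are hypothetical, and it assumes that IDs follow the standard TCGA barcode layout, in which the fourth hyphen-separated field begins with the two-digit sample-type code.

```python
def split_tcga_samples(sample_ids):
    """Split TCGA sample IDs into tumor (01) and adjacent-normal (11) lists."""
    tumor, normal = [], []
    for sid in sample_ids:
        clean = sid.strip()
        fields = clean.split("-")
        if len(fields) < 4:
            continue  # not a sample-level barcode, skip it
        # The fourth field starts with the sample-type code, e.g. "01" or "01A".
        code = fields[3][:2]
        if code == "01":
            tumor.append(clean)
        elif code == "11":
            normal.append(clean)
        # Other sample types are ignored in this sketch.
    return tumor, normal


# Hypothetical IDs, for illustration only.
example_ids = [
    "TCGA-XX-0001-01A",
    "TCGA-XX-0001-11A",
    "TCGA-XX-0002-01",
    "TCGA-XX-0003-06A",
]
tumor_ids, normal_ids = split_tcga_samples(example_ids)
```

Reading the code from the fourth field, rather than from the last two characters of the ID, keeps the split correct whether or not the ID carries a trailing vial letter.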
Preparation of GSE32894 and GSE31684
1. Open the GEO website (https://www.ncbi.nlm.nih.gov) to download the BLCA datasets GSE32894 and GSE31684¹²^,¹³.
2. Use the respective entries in the upper-right corner to move to the pages for datasets GSE32894 and GSE31684.
3. Download the mRNA data, clinical data, and survival data for GSE32894: click on the http button of the GSE32894 dataset, click on the Series Matrix File(s) button, and then click on GSE32894_series_matrix.txt.gz. Also, download the GPL6947 platform data using the platform button.
4. Download the mRNA data, clinical data, and survival data for GSE31684: click on the http button of the GSE31684 dataset, click on the Series Matrix File(s) button, and then click on GSE31684_series_matrix.txt.gz. Also, download the GPL570 platform data using the platform button.
5. In the filtering step, remove uninformative genes, such as those with an expression level of zero or close to zero. Use the ComBat method from the sva package¹⁴ for batch correction.
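The filtering part of step 5 can be illustrated with a short Python sketch (the protocol's own code is in R). The threshold `min_mean=0.1` is an assumption chosen only for illustration, since the text does not say how "close to zero" is defined, and the gene names other than TRPM4 are hypothetical. ComBat batch correction is not reproduced here.

```python
def filter_low_expression(expr, min_mean=0.1):
    """Keep genes whose mean expression across samples exceeds min_mean.

    expr maps a gene name to a list of expression values, one per sample.
    min_mean is an illustrative cutoff, not a value taken from the protocol.
    """
    kept = {}
    for gene, values in expr.items():
        if values and sum(values) / len(values) > min_mean:
            kept[gene] = values
    return kept


# Toy expression table with hypothetical values.
toy_expr = {
    "TRPM4": [5.1, 4.8, 6.0],
    "GENE_ZERO": [0.0, 0.0, 0.0],
    "GENE_LOW": [0.0, 0.05, 0.0],
}
filtered = filter_low_expression(toy_expr)
```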

2. Preparation of single-cell analysis

Preparation of GSE145281
1. To conduct research at the single-cell level, find a BLCA single-cell dataset, because transcriptome data alone are insufficient to determine which cell cluster expresses TRPM4 most strongly.
2. Go to TISCH2 (http://tisch.compbio.cn/), an online resource for single-cell analysis of tumors.
3. Select the Cancer Type required for the experiment; BLCA is used as the example here.
4. Click on the BLCA button and choose the dataset called GSE145281.

3. Preparation of BLCA's molecular subtypes and response to therapeutic choices data

Download the data for BLCA's molecular subtypes and response to therapeutic choices from the article referenced here¹⁵.
Open the PubMed website (https://pubmed.ncbi.nlm.nih.gov/). Search for the article and click on Supplementary Tables. The compressed file package contains a total of 18 files.

4. BLCA immune microenvironment analysis of TRPM4

Heatmap plot of the expression of 133 immunomodulators in different TRPM4 groups
1. According to the expression of TRPM4, divide all BLCA samples into a high-TRPM4 group and a low-TRPM4 group.
2. Copy all the sample names and the TRPM4 expression data, sort the samples in descending order of TRPM4 expression, and, in the Group column, assign the first half of the samples to the High-TRPM4 group and the second half to the Low-TRPM4 group. All TCGA-BLCA patients' IDs are listed in Supplementary Table 1.
3. Install the R packages required for the transcriptomic data analysis. Make slight modifications to the provided code, such as changing the file location, and then run the code. Carefully check that the gene IDs and the IDs of the immunomodulators are correctly matched, ensuring that there are no spaces or other inconspicuous characters after the gene IDs, as these are the primary causes of failed heatmap generation.
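The grouping described in step 2 can be sketched in Python as below (the protocol itself performs this step by hand in a spreadsheet and in R). The function name and sample values are hypothetical, and placing the middle sample in the Low-TRPM4 group when the sample count is odd is an assumption, because the text does not cover that case.

```python
def assign_trpm4_groups(expression):
    """Label samples High-TRPM4 or Low-TRPM4 by a descending half split.

    expression maps a sample ID to its TRPM4 expression value.
    """
    # Sort samples from highest to lowest TRPM4 expression.
    ranked = sorted(expression.items(), key=lambda item: item[1], reverse=True)
    half = len(ranked) // 2  # with an odd count, the middle sample goes to Low
    groups = {}
    for rank, (sample, _value) in enumerate(ranked):
        groups[sample] = "High-TRPM4" if rank < half else "Low-TRPM4"
    return groups


# Hypothetical samples and values, for illustration only.
toy_expression = {"S1": 5.2, "S2": 1.1, "S3": 3.3, "S4": 0.4}
groups = assign_trpm4_groups(toy_expression)
```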
Stromal and immunological analysis of tumor samples using the ESTIMATE R package
1. Make sure that all R packages required for the transcriptomic data analysis are installed. Name the input data files as required by the code, and then click the Run button.
  NOTE: The ESTIMATE website (https://bioinformatics.mdanderson.org/estimate/rpackage.html) also offers online analysis capabilities. While it is possible to obtain the desired results by visiting the website, this method is not suitable for analyzing self-collected data.
2. On the website, click on the Disease button and choose the Bladder Urothelial Carcinoma option. Then, select the RNA-seq-V2 platform button. Download the stromal score, immune score, and estimate score for all samples.
Violin plot of the expression between different TRPM4 groups
1. Install all R packages required for the transcriptomic data analysis. All the R packages needed are listed in Supplementary Table 2. Name the input data files as required by the code, and then click the Run button.
2. Carefully verify that the gene IDs are entered correctly, ensuring there are no spaces or other inconspicuous characters at the end of the gene IDs, as these are the primary causes of failures in the creation of violin plots.
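The gene ID check in step 2 can be automated along the following lines. This is a hedged Python sketch, not part of the authors' R code; the function names are hypothetical, as is the `NOT_A_GENE` entry used to show a failed match.

```python
def clean_gene_ids(gene_ids):
    """Strip leading and trailing whitespace (spaces, tabs) from each gene ID."""
    return [gene_id.strip() for gene_id in gene_ids]


def unmatched_ids(requested, available):
    """Return the requested IDs that are still missing after cleaning."""
    available_set = set(clean_gene_ids(available))
    return [gid for gid in clean_gene_ids(requested) if gid not in available_set]


# "TIGIT " and "CTLA4\t" carry invisible trailing characters.
requested_ids = ["TIGIT ", "CD80", "CTLA4\t", "NOT_A_GENE"]
available_ids = ["TIGIT", "CD80", "CTLA4", "TRPM4"]
missing = unmatched_ids(requested_ids, available_ids)
```

Any ID left in `missing` after cleaning is absent from the expression table for a reason other than stray whitespace, such as a misspelling or an alias.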
Triangle heatmap plot between different TRPM4 groups
1. Install all R packages required for the transcriptomic data analysis. Prepare the expression data of several immune checkpoint genes, such as TIGIT and CD80. Because the number of immune checkpoint genes is small, open the TCGA-BLCA transcriptome data and search for each gene individually.
2. Name the input data files as required by the code, and then click the Run button.
Correlation plot between different genes and TRPM4
1. Install all R packages required for the transcriptomic data analysis. Enter TRPM4 as the first gene and the gene being studied as the second gene, which is CD3E in this case.
2. Click the Run button. To analyze other genes, such as CTLA4, replace CD3E with the gene of interest and run the code again.
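The quantity behind such a correlation plot can be sketched as below. The text does not state which correlation coefficient the R code uses, so Pearson's r is an assumption here, and the expression values are toy numbers rather than TCGA data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length value lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values: the second gene falls as TRPM4 rises.
trpm4_values = [2.0, 4.0, 6.0, 8.0]
cd3e_values = [8.0, 6.0, 4.0, 2.0]
r_value = pearson_r(trpm4_values, cd3e_values)
```

Swapping `cd3e_values` for the expression vector of another gene, such as CTLA4, repeats the analysis in the same way as step 2.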

5. Single-cell analysis of TRPM4 in the GSE145281 dataset

BLCA GSE145281 dataset analysis using single-cell analysis
1. Enter the TISCH2 website and click on BLCA. Compared with transcriptomics analysis, single-cell analysis allows a more in-depth exploration of which cell cluster shows high expression of TRPM4. Make sure that the single-cell data are from GSE145281, and check them against the quality control standard: the cell number per dataset should be greater than 1000, the UMI count per cell should be greater than 1000, and the gene number per cell should be greater than 500. The differentially expressed genes of each cluster relative to all other cells are identified based on the log-transformed fold change (|logFC| >= 0.25), and the clustering resolution should be 0.5.
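The quality control thresholds above can be written out as simple checks. This Python sketch only restates the stated cutoffs; TISCH2 applies its own pipeline, and the function names and the dictionary keys `umi` and `genes` are hypothetical.

```python
def cell_passes_qc(umi_count, gene_count, min_umi=1000, min_genes=500):
    """True if a cell has more than min_umi UMIs and more than min_genes genes."""
    return umi_count > min_umi and gene_count > min_genes


def dataset_has_enough_cells(n_cells, min_cells=1000):
    """True if the dataset contains more than min_cells cells."""
    return n_cells > min_cells


def is_differentially_expressed(log_fc, threshold=0.25):
    """True if |logFC| meets the threshold used for cluster marker genes."""
    return abs(log_fc) >= threshold


# Hypothetical per-cell summaries, for illustration only.
toy_cells = [
    {"umi": 2500, "genes": 900},
    {"umi": 800, "genes": 600},
    {"umi": 1500, "genes": 400},
]
passing_cells = [c for c in toy_cells if cell_passes_qc(c["umi"], c["genes"])]
```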
Cell annotation for different cell clusters
1. Make sure that all cell clusters are annotated. For example, B cells can be annotated by CD19, MS4A1, CD27, IGHD, IGHM, TCL1A, and FCRL5. All the genes used for cell annotation can be found in Supplementary Table 3.
2. Then, choose GSE145281 and download all results for TRPM4. Click on the Overall button and download the UMAP plots of GSE145281 colored by cluster and by cell type.
3. Click on the Gene button and choose the single gene to be analyzed. Click on the GSEA button and observe the functions of this gene across various pathways, such as the KEGG pathways and Hallmark pathways¹⁶^,¹⁷.
4. Click on the CCI button and find the cell chat plot and cell chat bubble plot of the cluster that highly expresses the gene.
5. Click on the TF enrichment button to see the enriched transcription factors for each cluster.

6. 101 Machine learning to construct a prognostic model relating to TRPM4

Install all R packages required for the transcriptomic data analysis. Because a single machine learning approach might not always be the optimal method for predicting prognosis, construct the prognostic model using the 101 machine learning approach. Conduct a correlation analysis between TRPM4 and all genes in the TCGA-BLCA, GSE32894, and GSE31684 cohorts.
Select the genes whose correlation with TRPM4 is greater than or equal to 0.6 or less than or equal to -0.6, and retain those that are strongly associated with TRPM4 in all cohorts.
Combine all the data from the GSE32894 and GSE31684 cohorts to form a validation set and use TCGA-BLCA as the training set.
Name the input data files as required by the code, and then click the Run button.
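The gene selection rule in this section (|r| >= 0.6 in every cohort) can be sketched as follows. This is an illustrative Python version, not the authors' R code; the function names, gene names, and correlation values are hypothetical, and the 101 model combinations themselves are not reproduced.

```python
def strongly_correlated(correlations, cutoff=0.6):
    """Genes whose correlation with TRPM4 satisfies |r| >= cutoff in one cohort."""
    return {gene for gene, r in correlations.items() if abs(r) >= cutoff}


def shared_strong_genes(cohort_correlations, cutoff=0.6):
    """Genes that pass the cutoff in every cohort, sorted by name."""
    gene_sets = [
        strongly_correlated(correlations, cutoff)
        for correlations in cohort_correlations.values()
    ]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Hypothetical correlation values with TRPM4, for illustration only.
toy_cohorts = {
    "TCGA-BLCA": {"GENE_A": 0.72, "GENE_B": -0.65, "GENE_C": 0.30},
    "GSE32894": {"GENE_A": 0.61, "GENE_B": -0.70, "GENE_C": 0.66},
    "GSE31684": {"GENE_A": 0.60, "GENE_B": -0.58, "GENE_C": 0.64},
}
selected_genes = shared_strong_genes(toy_cohorts)
```

Intersecting the per-cohort sets, rather than pooling the cohorts first, keeps only genes whose association with TRPM4 holds in both the training set and the validation cohorts.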

Results

It is well-known that the core benefit of using transcriptomics to analyze the immune microenvironment for individual genes is that it allows for a direct correlation between individual genes and the dynamic interactions with immune cells and molecules at the level of gene expression, enabling the direct observation of immune characteristics and the identification of differences in the proportion of immune cells infiltrating between groups with high/low expression of individual genes or changes in the expression of immun...

Discussion

In previous studies, bioinformatics analysis of single genes often suffered from superficiality, inaccuracy, and limited applicability¹⁹. Previous research on single genes has also typically been confined to transcriptome data, where issues such as sample heterogeneity and randomness may arise, making it difficult to identify universal patterns²⁰. Therefore, to address the aforementioned issues, single-cell d...

Disclosures

The authors declare that they have no conflict of interest.

Acknowledgements

We would like to thank the BioBean Informatics Consortium for developing an intelligent analytical framework (available at http://www.sxdyc.com/). Their innovative computational infrastructure substantially accelerated the research workflow through precision analytics and automated data interpretation modules.

Materials

List of materials used in this article
Name	Company	Catalog Number	Comments
R 4.3.3	none	none	none

References

1. Zhang, C., et al. Identification of multicohort-based predictive signature for NMIBC recurrence reveals SDCBP as a novel oncogene in bladder cancer. Ann Med. 57 (1), 2458211 (2025).
2. Han, M. H., et al. Plasma GFAP and Amyloid Pathology Predict Cognitive Response to Multidomain Interventions in MCI. Aging Dis. (2025).
3. Xie, S., et al. Towards Precision Aging Biology: Single-Cell Multi-Omics and Advanced AI-Driven Strategies. Aging Dis. (2025).
4. Han, Y., et al. TISCH2: expanded datasets and new tools for single-cell transcriptome analyses of the tumor microenvironment. Nucleic Acids Res. 51 (D1), D1425-D1431 (2023).
5. Yao, Y., et al. Advances in prognostic models for osteosarcoma risk. Heliyon. 10 (7), e28493 (2024).
6. Ding, X., et al. Glutamine metabolism reprogramming promotes bladder cancer progression via PYCR1: a multi-omics and functional validation study. J Transl Med. 23 (1), 1277 (2025).
7. Zhu, T., et al. Methylparaben and propylparaben promote bladder cancer invasion via MMP2 and PPARG modulation. Ecotoxicol Environ Saf. 306, 119383 (2025).
8. Xie, J. H., et al. Deciphering cutaneous melanoma prognosis through LDL metabolism: Single-cell transcriptomics analysis via 101 machine learning algorithms. Exp Dermatol. 33 (4), e15070 (2024).
9. Sun, D., et al. TISCH: a comprehensive web resource enabling interactive single-cell transcriptome visualization of tumor microenvironment. Nucleic Acids Res. 49 (D1), D1420-D1430 (2021).
10. Cao, X., et al. D-Mannose Upregulates Testin via the NF-κB Pathway to Inhibit Breast Cancer Proliferation. J Biochem Mol Toxicol. 39 (8), e70398 (2025).
11. Goldman, M. J., et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol. 38 (6), 675-678 (2020).
12. Yu, Q., et al. GREM1 may be a biological indicator and potential target of bladder cancer. Sci Rep. 14 (1), 23280 (2024).
13. Wu, Q., et al. Membrane palmitoylated protein MPP1 inhibits immune escape by regulating the USP12/CCL5 axis in urothelial carcinoma. Int Immunopharmacol. 146, 113802 (2025).
14. Johnson, W. E., Li, C., Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 8 (1), 118-127 (2007).
15. Hu, J., et al. Siglec15 shapes a non-inflamed tumor microenvironment and predicts the molecular subtype in bladder cancer. Theranostics. 11 (7), 3089-3108 (2021).
16. Kanehisa, M., Sato, Y., Morishima, K. BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences. J Mol Biol. 428 (4), 726-731 (2016).
17. Munkley, J., et al. Hallmarks of glycosylation in cancer. Oncotarget. 7 (23), 35478-35489 (2016).
18. Da, Y., et al. A high stroma-tumor ratio is associated with an immunosuppressive tumor microenvironment and a poor prognosis in bladder cancer. Front Oncol. 15, 1604609 (2025).
19. Chen, C., et al. Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis. Int J Mol Sci. 21 (8), 2873 (2020).
20. Jonauskaite, D., et al. Universal Patterns in Color-Emotion Associations Are Further Shaped by Linguistic and Geographic Proximity. Psychol Sci. 31 (10), 1245-1260 (2020).
21. Zhu, W., et al. Integrated machine learning identifies epithelial cell marker genes for improving outcomes and immunotherapy in prostate cancer. J Transl Med. 21 (1), 782 (2023).
22. Zhao, J., et al. Bioinformatics prediction and experimental verification of key biomarkers for diabetic kidney disease based on transcriptome sequencing in mice. PeerJ. 10, e13932 (2022).
