Here, we present a protocol to explore biomarkers and survival predictors of breast cancer through comprehensive analysis of pooled clinical datasets derived from a variety of publicly accessible databases, proceeding step by step through expression, correlation and survival analysis.
Chen, M. n., Zeng, D., Zheng, Z. q., Li, Z., Wu, J. l., Jin, J. y., Wang, H. j., Huang, C. z., Lin, H. y. Performing Data Mining And Integrative Analysis Of Biomarker in Breast Cancer Using Multiple Publicly Accessible Databases. J. Vis. Exp. (147), e59238, doi:10.3791/59238 (2019).
In recent years, emerging databases have been designed to lower the barriers to working with intricate cancer genomic datasets, thereby helping investigators analyze and interpret genes, samples and clinical data across different types of cancer. Herein, we describe a practical operation procedure, taking ID1 (Inhibitor of DNA binding proteins 1) as an example, to characterize the expression patterns of biomarkers and survival predictors of breast cancer based on pooled clinical datasets derived from online accessible databases, including ONCOMINE, bcGenExMiner v4.0 (Breast cancer gene-expression miner v4.0), GOBO (Gene expression-based Outcome for Breast cancer Online), HPA (The human protein atlas), and Kaplan-Meier plotter. The analysis begins with querying the expression pattern of the gene of interest (e.g., ID1) in cancerous samples vs. normal samples. Then, the correlation between ID1 and clinicopathological characteristics in breast cancer is analyzed. Next, the expression profiles of ID1 are stratified according to different subgroups. Finally, the association between ID1 expression and survival outcome is analyzed. The operation procedure simplifies the integration of multidimensional data types at the gene level from different databases and the testing of hypotheses regarding recurrence and genomic context of gene alteration events in breast cancer. This method can improve the credibility and representativeness of the conclusions, thereby presenting an informative perspective on a gene of interest.
Breast cancer is a heterogeneous disease whose molecular subtypes differ in prognosis and treatment strategy, and whose pathogenesis and development are probably associated with disparate molecular mechanisms1,2,3. However, identifying a therapeutic target usually takes years, or even decades, from initial discovery in basic research to clinical use4. Genome-wide application of high-throughput sequencing technology to cancer genomes has greatly advanced the search for valuable biomarkers and therapeutic targets5.
The overwhelming amount of cancer genomics data generated from large-scale cancer genomics platforms, such as the ICGC (International Cancer Genome Consortium) and TCGA (The Cancer Genome Atlas), poses a great challenge for researchers performing data exploration, integration, and analytics, particularly for users lacking intensive training in informatics and computation6,7,8,9,10. In recent years, emerging databases (e.g., ONCOMINE, bcGenExMiner v4.0, and Kaplan-Meier plotter) have been designed and developed to lower the bar for working with intricate cancer genomic datasets, thereby helping investigators analyze and interpret genes, samples and clinical data across various types of cancer11. The goal of this protocol is to describe a research strategy that integrates multiple levels of gene information from a series of open-access databases, which have been widely recognized by a great number of researchers, to identify potential biomarkers and prognostic factors for breast cancer.
The ONCOMINE database is a web-based data-mining platform with cancer microarray information, designed to facilitate the discovery of novel biomarkers and therapeutic targets11. Currently, there are more than 48 million gene expression measurements from 65 gene expression datasets in this database11,12. The bcGenExMiner v4.0 (a free tool for non-profit institutions), also called breast cancer Gene-Expression Miner, is a user-friendly web-based application comprising DNA microarray results of 3,414 recovered breast cancer patients and 1,209 who experienced a pejorative event13. It is designed to improve gene prognostic analysis performance with R statistical software and packages.
GOBO is a multifunctional, user-friendly online tool with microarray information (e.g., Affymetrix U133A) from a 51-sample breast cancer cell line set and a 1,881-sample breast tumor data set that allows a wide array of analyses14. A variety of applications are available in the GOBO database, including rapid analysis of gene expression profiles in different molecular subtypes of breast tumors and cell lines, screening for co-expressed genes for creation of potential metagenes, and correlation analysis between outcome and gene expression levels of single genes, sets of genes, or gene signatures in the breast cancer data set15.
The Human Protein Atlas is an open-access program designed for scientists to explore the human proteome, which has already contributed to a large number of publications in the field of human biology and disease. The Human Protein Atlas is recognized as a European core resource for the life science community16,17.
The Kaplan-Meier plotter is an online tool that integrates gene expression and clinical data and allows assessment of the prognostic effect of 54,675 genes based on 10,461 cancer samples, which include 1,065 gastric, 2,437 lung, 1,816 ovarian and 5,143 breast cancer patients with a mean follow-up of 33/49/40/69 months, respectively18. Gene expression, relapse-free survival (RFS) and overall survival (OS) data are downloadable from this database19,20.
Here, we describe a practical operation procedure of using multiple publicly accessible databases to compare, analyze and visualize patterns of alterations in the expression of the gene of interest across multiple cancer studies, with the goal of summarizing the expression profiles, prognostic values and potential biological functions in breast cancer. For example, recent studies have indicated that ID proteins have oncogenic properties in tumors and are associated with malignant features, including cellular transformation, immortalization, enhanced proliferation and metastasis21,22,23. However, each member of the ID family plays distinct roles in different types of solid tumors, and their role in breast cancer remains unclear24. In a previous study that used this method, we found that ID1 was a meaningful prognostic indicator in breast cancer25. Therefore, the protocol takes ID1 as an example to introduce the data mining methods.
The analysis starts with querying the expression pattern of the gene of interest in cancerous samples vs. normal samples in ONCOMINE. Then, the expression correlation of the gene of interest in breast cancer is analyzed using bc-GenExMiner v4.0, GOBO, and ONCOMINE. Next, the expression profiles of ID1 are stratified according to different subgroups using the above three databases. Finally, the association between ID1 expression and survival outcome is analyzed using bc-GenExMiner v4.0, the human protein atlas, and Kaplan-Meier plotter. The operation procedure is shown as a flowchart in Figure 1.
1. Expression Pattern Analysis
- Go to the ONCOMINE web interface26.
- Obtain the relative expression levels of the gene ID1 in various types of malignancies by typing ID1 into the Search Box.
- Select Analysis Type from the Primary Filters menu. Then, select Cancer vs. Normal Analysis, Breast Cancer vs. Normal Analysis.
- Select Gene Summary View from the OTHER VIEWS menu. Set the threshold of P-value at 0.01. Download the figures.
NOTE: The threshold of fold change is 2, as described in the previous study27.
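The effect of the two thresholds used above (P-value 0.01, fold change 2) can be illustrated with a minimal sketch. The study labels and values below are hypothetical, not ONCOMINE output, and the signed fold-change convention (negative meaning lower expression in tumor than in normal tissue) is an assumption made for illustration only.

```python
from dataclasses import dataclass

P_THRESHOLD = 0.01     # P-value cutoff set in the Gene Summary View
FOLD_THRESHOLD = 2.0   # fold-change cutoff noted in the protocol


@dataclass
class Analysis:
    study: str          # hypothetical study label
    fold_change: float  # assumed signed: negative = lower in tumor than normal
    p_value: float


def classify(a: Analysis) -> str:
    """Label one tumor-vs-normal analysis against both thresholds."""
    if a.p_value > P_THRESHOLD or abs(a.fold_change) < FOLD_THRESHOLD:
        return "not significant"
    return "over-expressed" if a.fold_change > 0 else "under-expressed"


# Hypothetical example values, for illustration only.
analyses = [
    Analysis("study_A", -2.6, 0.0004),
    Analysis("study_B", 1.3, 0.0020),
    Analysis("study_C", -3.1, 0.0300),
    Analysis("study_D", 2.4, 0.0050),
]

summary = {a.study: classify(a) for a in analyses}
for study, label in summary.items():
    print(study, label)
```

An analysis is counted only when it passes both cutoffs, which is why a large fold change with a weak P-value (study_C) is not reported as significant.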
2. Expression Correlation Analysis
- Go to the bc-GenExMiner v4.0 web interface28.
- Select CORRELATION from the ANALYSIS menu, press the EXHAUSTIVE button. Type ID1 into the search box. Press the Submit button and the Start analysis button.
NOTE: The default setting shows the expression correlation analysis for all patients. The analysis can be restricted to specific subtypes of breast cancer by pressing the Molecule subtype filter.
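For readers who want to see what a gene-gene expression correlation amounts to, the following is a minimal sketch, assuming a Pearson correlation coefficient computed across patients. The expression values and the gene names `gene_X` and `gene_Y` are hypothetical, and the code does not query or reproduce bc-GenExMiner itself.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression values per patient (arbitrary units).
id1 = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "gene_X": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with ID1
    "gene_Y": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as ID1 rises
}

# Rank candidate genes from most positive to most negative correlation.
ranked = sorted(
    ((name, pearson_r(id1, values)) for name, values in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

Sorting by the coefficient mirrors the idea of listing the best positively and negatively correlated genes for the gene of interest.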
3. Subgroup Analysis
- Subgroup analysis in bc-GenExMiner v4.0
- Go to the bc-GenExMiner v4.0 web interface28.
- Select EXPRESSION from the ANALYSIS menu, press the EXHAUSTIVE button. Type ID1 into the search box and press the Submit button and the Start analysis button.
- Click the Nodal status (LN) and Scarff Bloom & Richardson grade status (SBR) thumbnails to view full images. In the SBR images, press the button below to visualize the P-values of the figures. Download the figures.
- Subgroup analysis in Gene expression-based Outcome for Breast Cancer Online (GOBO)
- Go to the GOBO web interface14.
- Type the gene symbol of interest (ID1) on the screen to upload the gene set.
- Set the search range of Define gene/probe identifiers to Gene Symbol. Set All in Tumor selection. Select Node status and Grade stratified in the Multivariate parameters. Other items remain default. Submit the inquiry and download the figures.
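The kind of two-group comparison behind a subgroup analysis (for example, node-negative vs. node-positive patients) can be sketched as follows. This is an illustration only: it assumes a Welch test for unequal variances, which may differ from the tests the databases apply, and the expression values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # Sample variance of each group divided by its size.
    var_a = variance(group_a) / n_a
    var_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical ID1 expression values (arbitrary units) per subgroup.
node_negative = [2.0, 2.5, 3.0, 3.5]
node_positive = [1.0, 1.5, 2.0, 2.5]

t_stat, dof = welch_t(node_negative, node_positive)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A positive t statistic here means higher mean expression in the first group. Converting it to a P-value requires the t distribution, which is left to a statistics package.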
4. Survival Analysis
- Survival analysis in bc-GenExMiner v4.0
- Go to the bc-GenExMiner v4.0 web interface28.
- Select PROGNOSTIC from the ANALYSIS menu, press the EXHAUSTIVE button. Type ID1 into the search box and press the Submit button and the Start analysis button.
- In the Exhaustive prognostic analysis, select Nm, ERm, MR in the Population and event criteria and press the Submit button to obtain more information. Press the Kaplan-Meier curve thumbnails to export the full graphs.
NOTE: N (+, -, m): nodal status (+: positive, -: negative, m: mixed); ER (+, -, m): oestrogen receptor status (+: positive, -: negative, m: mixed); MR: metastatic relapse
- Survival analysis in The Human Protein Atlas (HPA)
- Go to the Human Protein Atlas web interface29.
- Type ID1 into the search box and click the Search button. Select the Pathology sub-atlas.
NOTE: The mRNA expression levels across the 17 cancer types are shown in the RNA Expression overview section. Every cancer tissue label of the box plot is clickable to access a detailed page providing survival analysis data and RNA expression levels.
- Click the Breast Cancer label to open the detailed page showing the interactive survival scatter plot and survival analysis. Download the figures.
- Survival analysis in the Kaplan-Meier Plotter
- Go to the Kaplan-Meier Plotter web interface30. Click Start KM plotter for breast cancer in the mRNA gene chip zone.
- Type ID1 into the search bar and select the green item in the candidate menu.
- Select RFS as the survival type and leave the other items at their defaults. Click Draw Kaplan-Meier plot and download the figures.
NOTE: Settings of the survival types, cutoff types, and follow-up threshold, as well as probe set options, can be changed as required. Subgroup prognostic analysis by ER, PR, HER-2, lymph node status, grade, TP53 status, and molecular subtype can be obtained by changing the settings in the Restrict analysis to subtypes box. Likewise, treatment filters can be set in the Restrict analysis to selected cohorts box.
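The survival curves produced in this section rest on the Kaplan-Meier estimator applied to patients split by expression level. The following is a minimal sketch of that idea, assuming a median cutoff. The cohort is hypothetical, and none of the names correspond to an interface of the tools above.

```python
from statistics import median


def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an event.

    events: 1 if the event (e.g., relapse) was observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Hypothetical cohort: (ID1 expression, follow-up in months, event flag).
cohort = [
    (5.1, 12, 1), (7.9, 60, 0), (6.0, 24, 1),
    (8.3, 48, 1), (4.7, 9, 1), (9.0, 72, 0),
]

cutoff = median(expr for expr, _, _ in cohort)  # assumed median split
high = [(t, e) for expr, t, e in cohort if expr > cutoff]
low = [(t, e) for expr, t, e in cohort if expr <= cutoff]

high_curve = kaplan_meier(*zip(*high))
low_curve = kaplan_meier(*zip(*low))
print("high expression:", high_curve)
print("low expression:", low_curve)
```

Censored patients lower the number at risk without lowering the survival estimate, which is what distinguishes this estimator from a simple proportion of event-free patients.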
A representative data mining and integrative analysis of a breast cancer biomarker was performed using ID1, one of the inhibitor of DNA-binding family members, as reported in the previous study25.
As demonstrated in Figure 2, the differences in ID1 mRNA expression between tumor and normal tissues in multiple types of cancer were analyzed using the ONCOMINE database, which contained a total of 445 unique analyses. Five studies revealed that the mRNA expression level of ID1 was significantly higher in normal tissues than in breast cancer tissues. These data indicated dysregulated expression of ID1 in breast cancer. Figure 3 shows the genes most strongly positively and negatively correlated with ID1 in the analysis performed in bc-GenExMiner v4.0. To identify the correlation between mRNA expression of ID1 and the clinicopathological parameters of breast cancer (BC) patients, the bc-GenExMiner v4.0 database was used for the analysis. As shown in Figure 4, a significantly increased mRNA level of ID1 was found in breast cancer patients without lymph node metastasis, as compared to those with lymph node metastasis (P=0.0005). Furthermore, the analysis in GOBO demonstrated that increased mRNA levels of ID1 were correlated with lower tumor grade (Figure 5, P<0.00001). These results implied that increased expression of ID1 was linked to lower metastatic potential and lower pathological grade in BC. The analysis from the bc-GenExMiner v4.0 database indicated that a higher mRNA level of ID1 was correlated with longer distant metastasis-free survival (DMFS) in breast cancer patients (Figure 6, HR=0.82, 95% CI: 0.73-0.92, P=0.001). Consistently, analysis from The Human Protein Atlas suggested that an elevated protein level of ID1 was associated with better survival outcome in breast cancer patients (Figure 7, P=0.0389). Survival analysis from the Kaplan-Meier Plotter also showed that a higher mRNA level of ID1 predicted better recurrence-free survival (RFS) in breast cancer patients (Figure 8, HR=0.81, P=0.00023).
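A reported hazard ratio and its confidence interval can be checked for internal consistency against the reported P-value. The sketch below assumes the interval is a Wald interval computed on the log hazard ratio scale, which is common for Cox models but is not stated by the databases, and uses the rounded DMFS values quoted above (HR=0.82, 95% CI: 0.73-0.92). The function name is our own.

```python
import math
from statistics import NormalDist


def wald_from_ci(hr, lower, upper, level=0.95):
    """Recover the standard error, z statistic and two-sided P-value
    from a hazard ratio and its confidence interval (assumed Wald, log scale)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    # The interval spans 2 * z_crit standard errors on the log scale.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = 2 * (1 - std_normal.cdf(abs(z)))
    return se, z, p


se, z, p = wald_from_ci(hr=0.82, lower=0.73, upper=0.92)
print(f"SE(log HR) = {se:.4f}, z = {z:.2f}, P = {p:.4f}")
```

Because the inputs are rounded to two decimals, the recovered P-value can only be expected to agree with the reported one approximately. A large disagreement would suggest a transcription error or a different interval method.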
Figure 1. Overview of exploring the expression patterns and prognostic values of distinct breast cancer biomarkers and online databases selection. Systematic analysis of distinct breast cancer biomarkers was performed step by step in a variety of databases. First, the expression pattern of the gene of interest in cancerous samples vs. normal samples was queried. Then, the expression correlation of the gene of interest in breast cancer was analyzed. Next, the expression profiles of ID1 were stratified according to different subgroups. Finally, the association between ID1 expression and survival outcome was analyzed.
Figure 2. The mRNA expression pattern of ID1 in different types of human cancer. The mRNA expression of ID1 was analyzed with the ONCOMINE database. The graphic shows the numbers of datasets with statistically significant mRNA overexpression (red) or downregulated expression (blue) of the target gene. The number in each cell represents the number of analyses that meet the threshold within that analysis and cancer type. The gene rank was determined by the percentile of the target gene among all genes measured in each study. Cell color was determined by the best gene rank percentile for the analyses within the cell. The P-value threshold was set at 0.01 and the fold-change threshold at 2, as shown in the red frame. This figure has been modified from the previous study25.
Figure 3. Gene correlation analysis of ID1 in bc-GenExMiner v4.0. The mRNA expression correlation of ID1 and relevant genes in 5,696 breast cancer patients within 36 studies was analyzed in bcGenExMiner v4.0. This figure has been modified from the previous study25.
Figure 4. The relationship between ID1 expression and lymph node metastasis status. The mRNA expression level of ID1 in 4,307 breast cancer patients with different lymph node (LN) status was analyzed in bcGenExMiner v4.0. This figure has been modified from the previous study25.
Figure 5. The relationship between the gene expression level of ID1 and tumor grade. The mRNA expression level of ID1 in breast cancer patients with different pathological grades was analyzed in GOBO. The global significant difference between groups was assessed to generate P-values, and P<0.05 was considered to indicate a statistically significant difference. The labels 1, 2 and 3 on the x-axis stand for the subgroups of patients with pathological grade 1, grade 2 and grade 3. This figure has been modified from the previous study25.
Figure 6. The prognostic values of ID1 for distant metastasis-free survival in breast cancer patients. The association between ID1 mRNA levels and distant metastasis-free survival estimates was analyzed in bcGenExMiner v4.0. This figure has been modified from the previous study25.
Figure 7. The survival probability of breast cancer patients according to ID1 level. The impact of ID1 protein level on the survival of patients with breast cancer was analyzed in the human protein atlas (HPA). This figure has been modified from the previous study25.
Figure 8. The prognostic values of ID1 in breast cancer according to recurrence-free survival (RFS). Different ID1 mRNA levels in all 3,951 breast cancer patients were analyzed in Kaplan-Meier plotter. This figure has been modified from the previous study25.
Comprehensive analysis of public databases may indicate the underlying function of the gene of interest and reveal the potential link between this gene and clinicopathological parameters in a specific cancer27,31. Exploration and analysis based on a single database might provide limited or isolated perspectives due to potential selection bias or, to a certain extent, due to variation in data quality, including data collection and the analytical algorithm of the database19. The most important step of this protocol is to select appropriate databases, which should be widely recognized by a large number of scientists and adequately representative. The investigator should use multiple databases to test the hypothesis and corroborate the results derived from different databases, rather than use a single database.
The protocol described here is an investigator-friendly operation procedure. The advantage of this method is that it allows for the rapid visualization and interpretation of a gene’s potential role in breast cancer. Moreover, all the results obtained through this procedure can be immediately tested and repeated by simply querying the corresponding websites. The limitation of this method is that conclusions drawn from the comprehensive analysis of the databases may not exactly reflect the actual function or relationship in the clinical setting. This could stem from systematic bias in the database and, in some cases, from inadequate sample size32,33. Using more than one database to query the same research question allows the results to be mutually confirmed and increases the credibility of the conclusion34. It is strongly recommended to use samples from the investigator’s institution to verify the results or, if feasible, to perform related basic experiments to test the results.
More and more online cancer genomics and proteomics databases will become available and accessible to researchers35,36. The protocol might provide an efficient and economical method for researchers to identify a potential target gene and the associated signaling pathway through in-depth analysis of online databases using genomics, transcriptomics, and epigenomics approaches.
The authors have nothing to disclose
This work was partly supported by the Natural Science Foundation of Guangdong Province, China (No. 2018A030313562), the Teaching Reform Project of Guangdong Clinical Teaching Base (NO. 2016JDB092), National Natural Science Foundation of China (81600358), and Youth Innovative Talent Project of Colleges and Universities in Guangdong Province, China (NO. 2017KQNCX073)
|Microsoft||051690762553||We support and test the following browsers: Google Chrome, Firefox 3.0 and above, Safari, and Internet Explorer 9.0 and above|
|Adobe Flash player||Adobe Systems Inc.||It can be freely downloaded from http://get.adobe.com/flashplayer/.||This browser plug-in is required for visualizing networks on the network|
|Chrome Browser||Google Inc.||It can be freely downloaded from https://www.google.cn/chrome/||This is necessary for viewing PDF files including the Pathology Reports and many of the downloadable files.|
|Java Runtime Environment||Oracle Corporation||It can be downloaded from http://www.java.com/getjava/.|
|Office 365 ProPlus for Faculty||Microsoft||2003BFFD8117EA68||This is necessary for viewing the Pathology Reports and for viewing many of the downloadable files.|
|Vectr Online||Vectr Labs Inc.||It can be freely used from https://vectr.com/new||This is necessary for visualizing and editing many of the downloadable files and pictures.|
- van 't Veer, L. J., et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 415, (6871), 530-536 (2002).
- Loi, S., et al. Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. Journal of Clinical Oncology. 25, (10), 1239-1246 (2007).
- Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 490, (7418), 61-70 (2012).
- Emerson, J. W., Dolled-Filhart, M., Harris, L., Rimm, D. L., Tuck, D. P. Quantitative assessment of tissue biomarkers and construction of a model to predict outcome in breast cancer using multiple imputation. Cancer Informatics. 7, 29-40 (2009).
- Yu, H., et al. Integrative genomic and transcriptomic analysis for pinpointing recurrent alterations of plant homeodomain genes and their clinical significance in breast cancer. Oncotarget. 8, (8), 13099-13115 (2017).
- He, W., et al. TCGA dataset-based construction and integrated analysis of aberrantly expressed long noncoding RNA mediated competing endogenous RNA network in gastric cancer. Oncology Reports. (2018).
- Liu, J., et al. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 173, (2), e411 400-416 (2018).
- Esgueva, R., et al. Next-generation prostate cancer biobanking: toward a processing protocol amenable for the International Cancer Genome Consortium. Diagnostic Molecular Pathology. 21, (2), 61-68 (2012).
- Joly, Y., Dove, E. S., Knoppers, B. M., Bobrow, M., Chalmers, D. Data sharing in the post-genomic world: the experience of the International Cancer Genome Consortium (ICGC) Data Access Compliance Office (DACO). PLoS Computational Biology. 8, (7), e1002549 (2012).
- Zhang, J., et al. International Cancer Genome Consortium Data Portal--a one-stop shop for cancer genomics data. Database (Oxford). 2011, bar026 (2011).
- Rhodes, D. R., et al. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia. 6, (1), 1-6 (2004).
- Rhodes, D. R., et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia. 9, (2), 166-180 (2007).
- Jezequel, P., et al. bc-GenExMiner: an easy-to-use online platform for gene prognostic analyses in breast cancer. Breast Cancer Research and Treatment. 131, (3), 765-775 (2012).
- Available from: http://co.bmc.lu.se/gobo/gsa.plb (2018).
- Ringner, M., Fredlund, E., Hakkinen, J., Borg, A., Staaf, J. GOBO: gene expression-based outcome for breast cancer online. PLoS One. 6, (3), e17911 (2011).
- Ponten, F., Jirstrom, K., Uhlen, M. The Human Protein Atlas--a tool for pathology. Journal of Pathology. 216, (4), 387-393 (2008).
- Ponten, F., Schwenk, J. M., Asplund, A., Edqvist, P. H. The Human Protein Atlas as a proteomic resource for biomarker discovery. Journal of Internal Medicine. 270, (5), 428-446 (2011).
- Gyorffy, B., et al. An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients. Breast Cancer Research and Treatment. 123, (3), 725-731 (2010).
- Stevinson, C., Lawlor, D. A. Searching multiple databases for systematic reviews: added value or diminishing returns? Complementary Therapies in Medicine. 12, (4), 228-232 (2004).
- Yin, J., et al. Integrating multiple genome annotation databases improves the interpretation of microarray gene expression data. BMC Genomics. 11, 50 (2010).
- Patel, D., Morton, D. J., Carey, J., Havrda, M. C., Chaudhary, J. Inhibitor of differentiation 4 (ID4): From development to cancer. Biochimica et Biophysica Acta. 1855, (1), 92-103 (2015).
- Kamalian, L., et al. Increased expression of Id family proteins in small cell lung cancer and its prognostic significance. Clinical Cancer Research. 14, (8), 2318-2325 (2008).
- Cruz-Rodriguez, N., et al. High expression of ID family and IGJ genes signature as predictor of low induction treatment response and worst survival in adult Hispanic patients with B-acute lymphoblastic leukemia. Journal of Experimental and Clinical Cancer Research. 35, 64 (2016).
- Yang, H. Y., et al. Expression and prognostic value of Id protein family in human breast carcinoma. Oncology Reports. 23, (2), 321-328 (2010).
- Zhou, X. L., et al. Prognostic values of the inhibitor of DNA-binding family members in breast cancer. Oncology Reports. 40, (4), 1897-1906 (2018).
- Available from: https://www.oncomine.org (2018).
- Lin, H. Y., Zeng, D., Liang, Y. K., Wei, X. L., Chen, C. F. GATA3 and TRPS1 are distinct biomarkers and prognostic factors in breast cancer: database mining for GATA family members in malignancies. Oncotarget. 8, (21), 34750-34761 (2017).
- Available from: http://bcgenex.centregauducheau.fr/BCGEM/GEM-requete.php (2018).
- Available from: https://www.proteinatlas.org (2018).
- Available from: http://kmplot.com/analysis (2018).
- Zhu, Y. F., Dong, M. Expression of TUSC3 and its prognostic significance in colorectal cancer. Pathology-Research and Practice. 214, (9), 1497-1503 (2018).
- Nelson, J. C., et al. Validation sampling can reduce bias in health care database studies: an illustration using influenza vaccination effectiveness. Journal of Clinical Epidemiology. 66, (8 Suppl), S110-S121 (2013).
- Haibe-Kains, B., Desmedt, C., Sotiriou, C., Bontempi, G. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics. 24, (19), 2200-2208 (2008).
- Yang, C., et al. Understanding genetic toxicity through data mining: the process of building knowledge by integrating multiple genetic toxicity databases. Toxicology Mechanisms and Methods. 18, (2-3), 277-295 (2008).
- Cannata, N., Merelli, E., Altman, R. B. Time to organize the bioinformatics resourceome. PLoS Computational Biology. 1, (7), e76 (2005).
- Wren, J. D., Bateman, A. Databases, data tombs and dust in the wind. Bioinformatics. 24, (19), 2127-2128 (2008).