
Cancer Research

Development of Compendium for Esophageal Squamous Cell Carcinoma

Published: April 12, 2024 doi: 10.3791/65480

Abstract

Esophageal cancer (EC) ranks as the 8th most aggressive malignancy, and its treatment remains challenging due to the lack of biomarkers facilitating early detection. EC manifests in two major histological forms, adenocarcinoma (EAD) and squamous cell carcinoma (ESCC), both exhibiting variations in incidence across geographically distinct populations. High-throughput technologies are transforming the understanding of diseases, including cancer. A significant challenge for the scientific community is dealing with data scattered across the literature. To address this, a simple pipeline is proposed for analyzing publicly available microarray datasets and collecting molecules that are differentially regulated between cancer and normal conditions. The pipeline can serve as a standard approach for differential gene expression analysis, identifying genes differentially expressed between cancer and normal tissues or among different cancer subtypes. It involves four steps: data preprocessing (quality control and normalization of raw gene expression data to remove technical variation between samples), differential expression analysis (identifying genes differentially expressed between two or more groups of samples using statistical tests such as t-tests, ANOVA, or linear models), functional analysis (using bioinformatics tools to identify biological pathways and functions enriched among the differentially expressed genes), and validation (confirming the findings with independent datasets or experimental methods such as qPCR or immunohistochemistry). Using this pipeline, a collection of differentially expressed molecules (DEMs) can be generated for any type of cancer, including esophageal cancer. This compendium can be used to identify potential biomarkers and drug targets for cancer and to enhance understanding of the molecular mechanisms underlying the disease. Additionally, population-specific screening of esophageal cancer using this pipeline will help identify specific drug targets for distinct populations, leading to personalized treatments for the disease.

Introduction

EC is the eighth most common cancer worldwide and the sixth leading cause of cancer-related death. China, India, and Iran have particularly high incidence and mortality rates. There are two main types of EC: esophageal adenocarcinoma (EAC or EAD) and esophageal squamous cell carcinoma (ESCC)1. EAC is more common in the Western world, whereas ESCC is more common in Eastern countries, especially China and Iran2. Several risk factors are associated with EC, including tobacco and alcohol use, obesity, and gastroesophageal reflux disease (GERD). Additionally, dietary factors such as a lack of fruits and vegetables and the consumption of hot drinks and foods are associated with ESCC risk in high-risk areas. Early diagnosis and treatment are important for improving the outcomes of patients with EC3,4. Therefore, it is important to raise awareness of the risk factors, signs, and symptoms of EC and to encourage regular screening of high-risk individuals. EAD arises in the cells of mucus-producing glands in the lower part of the esophagus, near the stomach. It is often associated with GERD, in which stomach acid and contents flow back into the esophagus. In contrast, ESCC arises from the flat, thin cells that line the upper part of the esophagus5. It is more common in areas where tobacco and alcohol use are widespread, such as China and Iran.

Among various conditions related to the esophagus, Barrett's esophagus (BE), a condition in which the lining of the esophagus is replaced by glandular cells, is a known precursor of EAC6. BE can develop without GERD, but the presence of GERD increases the risk of developing BE 3- to 5-fold. Additionally, the presence of BE increases the risk of developing EAC 50- to 100-fold7. Hot or spicy foods and liquids have been linked to ESCC but not to EAC. Understanding the risk factors for EC is important for its prevention and early detection. Efforts to address modifiable risk factors, such as tobacco use, alcohol consumption, obesity, and unhealthy dietary habits, may help reduce the incidence of EC. Furthermore, routine screening and surveillance of high-risk individuals, such as those with dysphagia or BE, may improve outcomes by enabling early detection and treatment.

Omics-driven studies, including genomics, transcriptomics, proteomics, methylomics, miRNAomics, and metabolomics, have contributed greatly to our understanding of EC, especially ESCC8,9,10,11,12,13. These studies have allowed the identification of novel biomarkers, potential therapeutic targets, and new pathways involved in the development and progression of ESCC. However, the data generated from these studies are scattered throughout the literature, making it difficult for the scientific community to access and use this information. Therefore, it is important to create a repository or database that compiles data obtained from high- or low-throughput studies on specific cancers. Building such a compendium can be streamlined by following some basic guidelines: selecting relevant studies, extracting and organizing data from these studies, and ensuring data quality and consistency. In addition, the compendium should be updated regularly to include new studies and data as they become available. By combining data from different studies in a compendium or database, researchers can use a single platform to retrieve and analyze data on a specific cancer. This will help accelerate research efforts and ultimately lead to more effective treatments and better outcomes for cancer patients.

The development of the cancer compendium incorporates data from both low-throughput and high-throughput studies. This compendium will be a valuable resource for researchers looking to identify potential diagnostic or therapeutic targets for cancer. One way to build this collection is by reviewing microarray studies available in publicly accessible repositories such as Gene Expression Omnibus (GEO). Microarray studies can provide information about gene expression levels in cancer cells, and these data can be used to identify differentially expressed genes (DEGs) that may play a role in cancer development and progression.

However, different studies might have used different methods to analyze their data, which may have led to the identification of different DEGs. Therefore, it is important to carefully review each study and consider any potential bias or limitations when pooling data for the compendium. Once the data are gathered on a common platform, researchers can use them to identify potential molecular targets for further study. Such studies include examining the expression of a particular gene in clinical samples or conducting mechanistic studies to understand how a particular gene or protein is involved in cancer development and progression. Overall, a cancer compendium will be a valuable resource for cancer researchers and will help identify new targets for diagnosis and therapeutic intervention.


Protocol

1. Manual curation of the differentially regulated molecules in ESCC

  1. Finding relevant low-throughput studies using PubMed
    NOTE: It is important to understand the basic difference between low-throughput and high-throughput techniques. In the former, only a limited number of samples are studied, and the process is usually time-consuming; in contrast, the latter is faster, and the number of samples that can be analyzed in one go is significantly higher. Northern blot and Western blot are low-throughput techniques, while cDNA microarray and LC-MS/MS-based quantitative proteomics are high-throughput techniques14. Search engines such as NCBI PubMed (see Table of Materials) are a good source for finding studies relevant to any cancer, as PubMed is a publicly available resource for biomedical scientists to find literature on diseases such as ESCC. To find relevant studies, follow the steps below.
    1. Open the Google search engine (see Table of Materials). Researchers also use it because some journals are not indexed in PubMed but still publish good-quality articles. It is always good to use more than one search engine to minimize the chance of missing an important article.
    2. Select the PubMed search bar and use Boolean operators (AND, OR, NOT). Boolean operators refine the search to the keywords relevant to the study in question.

2. Finding relevant studies using PubMed

  1. Type the NCBI web address (see Table of Materials) and open the page.
  2. Select PubMed from the All Databases tab of the left sidebar menu.
  3. Type the relevant keywords in the search bar to fetch the relevant articles. Combine the keywords with Boolean operators, as this helps retrieve articles closely related to the cancer or disease in question (e.g., esophageal squamous cell carcinoma).
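The combination of keywords and Boolean operators described above can also be assembled programmatically. The following is a minimal sketch and not part of the published protocol: the helper name `build_query` and the excluded term are illustrative assumptions, and the URL simply pre-fills the PubMed search box through its `term` parameter.

```python
from urllib.parse import urlencode


def build_query(synonyms, required=(), excluded=()):
    """Join synonyms with OR, then append AND / NOT terms (hypothetical helper)."""
    def quote(term):
        # Quote multi-word phrases so they are searched as a phrase.
        return f'"{term}"' if " " in term else term

    query = "(" + " OR ".join(quote(t) for t in synonyms) + ")"
    for term in required:
        query += " AND " + quote(term)
    for term in excluded:
        query += " NOT " + quote(term)
    return query


# Keyword variants taken from the protocol; "review" is an illustrative exclusion.
synonyms = [
    "esophageal squamous cell carcinoma",
    "ESCC",
    "oesophageal squamous cell carcinoma",
]
query = build_query(synonyms, required=["Homo sapiens"], excluded=["review"])
url = "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": query})

print(query)
print(url)
```

Grouping the synonyms in parentheses before adding AND or NOT terms keeps the operators from binding to only the last synonym.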

3. Finding relevant studies using Gene Expression Omnibus (GEO)

NOTE: Gene Expression Omnibus (GEO) is a freely available repository that stores DNA microarray data. The wealth of data available in GEO is a good resource for data mining to identify molecules that are differentially regulated between cancer or disease and normal conditions.

  1. Type the GEO web address (see Table of Materials) and open the page.
  2. Type Esophageal squamous cell carcinoma AND Homo sapiens and press Enter.
    NOTE: To build a compendium of differentially regulated genes for a cancer, it is important to select the relevant studies from which the molecules differentially regulated between cancer and normal conditions will be taken. The primary task is to set the selection criteria for molecules based on fold change, which indicates whether a gene or protein is upregulated or downregulated. The fold-change and p-value cutoffs can significantly alter the interpretation of the data in a given experiment, including microarray, RNA-seq, or proteomics studies15.
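The relationship between a log2 fold change and the fold-change cutoffs applied in this protocol can be written out explicitly. This is a worked restatement rather than part of the original text, and it assumes that expression values are on a log2 scale (as is typical for normalized microarray data); the symbols are introduced here for illustration only.

```latex
\log_2 \mathrm{FC} = \bar{x}_{\mathrm{cancer}} - \bar{x}_{\mathrm{normal}},
\qquad
\mathrm{FC} = 2^{\log_2 \mathrm{FC}}
\\[4pt]
\mathrm{FC} > 2.0 \iff \log_2 \mathrm{FC} > 1 \quad (\text{upregulated}),
\qquad
\mathrm{FC} < 0.5 \iff \log_2 \mathrm{FC} < -1 \quad (\text{downregulated})
```

For example, a log2 fold change of 3 corresponds to a 2^3 = 8-fold upregulation, whereas a log2 fold change of -2 corresponds to a fold change of 2^-2 = 0.25.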

4. Microarray analysis using GEO2R

NOTE: The first step is to find relevant studies using Boolean operators (AND, OR, NOT) in combination with the keywords 'esophageal squamous cell carcinoma', 'ESCC', or 'oesophageal squamous cell carcinoma'. GEO2R (see Table of Materials) is a freely available R-based web tool integrated with GEO that enables users to analyze data from microarray studies in a user-friendly manner. It takes GEO accession IDs as input and provides an interface for performing complex R-based analyses to identify DEGs, using Bioconductor R packages at the back end. The tool not only transforms the GEO data but also presents its output as .txt tables, which can be further modified according to the user's needs16. GEO2R presents genes in order of statistical significance based on p-value, but the list can also be sorted by log2 fold change. Additionally, users can view gene expression profiles as GEO profile images. Unlike other analysis tools, GEO2R does not depend on curated dataset records and can directly interrogate the original data submitted by the investigators. More than 90% of GEO studies can be analyzed using this method17. The GEO2R workflow, with the steps involved in analyzing microarray data, is shown in Figure 1.

  1. Open the GEO2R website.
  2. Enter the GSE accession in the search box labeled "GEO accession" and click Set. Create two group labels, first Cancer and then Normal.
  3. Assign each sample to its group by type, assigning the Cancer samples first, followed by the Normal samples.
  4. Click on Analyze without changing any default parameters.
  5. Click on the full gene list and download the data in .tsv format. Convert the .tsv file into .xlsx before processing the data further.
  6. Convert the log2 fold change into fold change in Excel using the formula =2^(Nx), where Nx is the cell containing the log2 fold-change value.
  7. Filter the data obtained in step 4.6 to get the DEGs by applying a false discovery rate (FDR) of 5%, i.e., an adjusted p-value (adj. p-value) <0.05, together with a fold change of >2.0 for upregulated genes and <0.5 for downregulated genes.
    NOTE: FDR is a correction applied to p-values to account for multiple testing; raw p-values and FDR-adjusted p-values cannot be treated as synonymous. FDR adjustment is crucial to control the rate of false discoveries when performing a large number of hypothesis tests.
  8. Identify the genes unique to the current analysis, relative to the previously published study, by submitting the DEGs to a freely available online Venn diagram generator (Pangloss; see Table of Materials), which separates the common and unique DEGs.
  9. Subject the unique gene list to Gene Ontology (GO) analysis.
    NOTE: Different GO categories, namely Gene Ontology: Molecular Function (GO: MF), Gene Ontology: Biological Process (GO: BP), and Gene Ontology: Cellular Component (GO: CC), can be generated using either g:Profiler (see Table of Materials) or PANTHER (see Table of Materials), both of which are freely available online programs. The GO-based analysis here was done using g:Profiler.
  10. To obtain the distribution of DEGs across individual chromosomes in the genome, use ShinyGO (see Table of Materials).
  11. Paste the entire unique gene list obtained from GEO2R into the ShinyGO search tab and select Homo sapiens as the best-matching species. Use an FDR cutoff of 5%.
  12. Select DEGs obtained from GEO2R for further validation at the protein level. The workflow for extracting additional but crucial information on the DEGs is shown in Figure 2.
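Steps 4.6 to 4.8 are performed in Excel and an online Venn diagram tool in this protocol, but the same logic can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the column names (`Gene.symbol`, `adj.P.Val`, `logFC`) are assumed to match the downloaded GEO2R table, and the gene names, values, and "published" list are invented placeholders.

```python
import csv
import io

# Hypothetical excerpt of a GEO2R full-gene table (tab-separated).
# Column names are assumed; gene names and values are invented.
GEO2R_TSV = (
    "Gene.symbol\tadj.P.Val\tlogFC\n"
    "GENE_A\t0.001\t3.0\n"
    "GENE_B\t0.020\t-2.0\n"
    "GENE_C\t0.300\t4.0\n"
    "GENE_D\t0.010\t0.5\n"
)


def filter_degs(tsv_text, adj_p_cutoff=0.05, up_cutoff=2.0, down_cutoff=0.5):
    """Return (upregulated, downregulated) dicts mapping gene -> fold change."""
    up, down = {}, {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        fold_change = 2 ** float(row["logFC"])  # step 4.6: log2 FC -> FC
        if float(row["adj.P.Val"]) >= adj_p_cutoff:
            continue  # step 4.7: discard genes failing the adjusted p-value cutoff
        if fold_change > up_cutoff:
            up[row["Gene.symbol"]] = fold_change
        elif fold_change < down_cutoff:
            down[row["Gene.symbol"]] = fold_change
    return up, down


up, down = filter_degs(GEO2R_TSV)
degs = set(up) | set(down)

# Step 4.8: compare with a previously published gene list (invented here).
published = {"GENE_A", "GENE_X"}
common = degs & published
unique = degs - published

print("up:", up, "down:", down)
print("common:", common, "unique:", unique)
```

Set intersection and difference give the same common and unique groups that the Venn diagram displays graphically.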

5. Finding alias for a gene/protein

  1. Open the HPRD page (see Table of Materials) and click on the Query button.
  2. Enter the protein name or HPRD identifier in the respective search tab and press the Search button at the bottom.
    NOTE: A page will open with a tab labeled "ALTERNATE NAMES"; click it, and the aliases for the gene or protein will appear.

6. Finding official gene symbol for the DEGs

  1. To find the official gene symbol, open the HUGO Gene Nomenclature Committee (HGNC) page (see Table of Materials).
  2. Type the name obtained in the GEO2R analysis in the search bar of the HGNC page and click on the search symbol. The search results will show the official gene symbol that the term matches exactly.
  3. Click on that link and note the official "gene symbol" given there.
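Once aliases and official symbols have been looked up, the mapping can be applied consistently to a gene list. The sketch below is illustrative only: the small alias table is hand-written for this example (it is not downloaded from HGNC or HPRD), and the function name `to_official_symbol` is hypothetical.

```python
# Hand-written alias table for illustration; in practice each mapping
# would be confirmed on the HGNC page as described above.
ALIAS_TO_SYMBOL = {
    "PERIOSTIN": "POSTN",
    "OSF-2": "POSTN",
    "OSTEOPONTIN": "SPP1",
    "OPN": "SPP1",
    "BRK": "PTK6",
}
OFFICIAL_SYMBOLS = set(ALIAS_TO_SYMBOL.values())


def to_official_symbol(name):
    """Return the official symbol for a name or alias, or None if unknown."""
    key = name.strip().upper()
    if key in OFFICIAL_SYMBOLS:
        return key  # already an official symbol
    return ALIAS_TO_SYMBOL.get(key)


names = ["Periostin", "opn", "PTK6", "unknown1"]
resolved = [to_official_symbol(n) for n in names]
print(resolved)
```

Names that cannot be resolved are returned as `None` so that they can be checked manually rather than silently dropped.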

7. Finding gene locus of the DEGs

  1. Open the Gene page by visiting the link on NCBI.
  2. Type the official gene symbol of the gene and click on search.
  3. In the results, carefully check the gene and the organism it belongs to. Once the correct match is confirmed, click on the link and proceed further.
    NOTE: On the gene page, the location listed under the heading "Genomic context" is the gene locus.

8. Finding information about DEGs on the OMIM page

  1. Open the OMIM page (see Table of Materials).
  2. Enter the name or the gene symbol of the gene to retrieve its information, particularly the gene-phenotype relationship.
  3. Click on the search button, and the results will appear for the gene of interest. The display provides information about the "Gene-Phenotype Relationship".

9. Finding protein localization, domain, and motif, and secretory nature of the protein encoded by the gene

  1. Open the HPRD page (see Table of Materials).
  2. Once the page is open, enter the protein name or HPRD identifier in the respective search tab and press the Search button at the bottom.
  3. A page will open with a tab labeled "Summary". Scroll all the way down the page and check the localization tab to find the primary and alternate ("secondary") localization.
    NOTE: On the same page, below "localization", there is a tab "Domains and Motifs" where the domains and motifs of the protein of interest are listed (if any). Adjacent to the "Domains and Motifs" tab, there is a tab "Expression" with a sub-tab "site of expression". Any protein that has been reported to be detected in a biological fluid can be considered "secretory" in nature. If the protein of interest has been reported in a biological fluid such as plasma, serum, semen, or tears, it will be listed there.

10. Cherry-picking proteins for validation and further assessment for diagnosis or prognosis of the malignancy of interest

NOTE: Once unique molecules are identified, the biggest challenge is how to validate them. A microarray study usually provides expression at the mRNA level, but for disease diagnosis or prognosis, a readout of protein levels is crucial. To this end, patient samples, patient-derived samples, or cell lines of the same cancer must be screened to determine whether the molecule is actually expressed and whether it can discriminate between cancer and normal, between good and bad prognosis, or between early and late stages of the disease. Western blot, enzyme-linked immunosorbent assay (ELISA), immunoprecipitation, immunohistochemistry, and immunocytochemistry are useful techniques for validating a candidate molecule18,19,20. All these assays require antibodies to detect the antigen present in the samples. Antibodies are costly, so it is always better to select them based on the following points:

  1. Check whether the antibody has already been reported in previously published research papers on other malignancies.
  2. If not, check the antibody provider's website to see whether they provide a datasheet with an image of a blot, immunocytochemistry, or immunohistochemistry.
    NOTE: If immunohistochemistry is planned but the antibody is validated for ELISA only, using it is risky, especially if the target molecule bears a transmembrane domain but no signal peptide. Also, check whether the antibody is monoclonal or polyclonal; a monoclonal antibody is preferable.


Representative Results

As an example, GEO accession GSE161533 was used to study differentially expressed genes in ESCC. The representative results of the analysis are shown in Figure 3. GEO2R generates a volcano plot that is useful for identifying genes that differ significantly between two groups of experimental subjects. The volcano plot presents the overall gene distribution with -log10-transformed significance (p-value) on the y-axis and log2-transformed fold change on the x-axis (Figure 3A), and it is useful for visualizing differentially expressed genes. Highlighted genes are significantly differentially expressed at the default adj. p-value cutoff of 0.05 (blue = downregulated, red = upregulated).

A mean difference (MD) plot displays log2 fold change vs. average log2 expression values and is useful for visualizing differentially expressed genes. In the MD plot, the log2-transformed fold change is on the y-axis, and the average log2 expression is on the x-axis (Figure 3B). The highlighted genes are significantly differentially expressed at the default adj. p-value cutoff of 0.05 (blue = downregulated, red = upregulated). Volcano plots share the limitation of MA plots in that they display information from only two treatments at once21.
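The coordinates used in the volcano and MD plots follow directly from the axis definitions given above. The sketch below illustrates them for a single gene; it is not GEO2R's own code, the function names are hypothetical, and the input values in the usage lines are invented.

```python
import math


def volcano_point(log2_fc, p_value):
    """Volcano plot coordinates: x = log2 fold change, y = -log10(p-value)."""
    return log2_fc, -math.log10(p_value)


def md_point(mean_log2_cancer, mean_log2_normal):
    """MD plot coordinates: x = average log2 expression, y = log2 fold change."""
    average = (mean_log2_cancer + mean_log2_normal) / 2
    return average, mean_log2_cancer - mean_log2_normal


def highlight(log2_fc, adj_p, cutoff=0.05):
    """Colour by direction if adj. p-value passes the cutoff, else None."""
    if adj_p >= cutoff:
        return None
    return "red" if log2_fc > 0 else "blue"


# Invented values for one gene: mean log2 expression 10 in cancer, 8 in normal.
x_md, y_md = md_point(10.0, 8.0)
x_v, y_v = volcano_point(y_md, 0.01)
print((x_md, y_md), (x_v, y_v), highlight(y_md, 0.01))
```

A gene far from zero on the volcano x-axis and high on the y-axis is both strongly changed and statistically well supported, which is why the two cutoffs are applied together.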

Further, Uniform Manifold Approximation and Projection (UMAP)22 was used to assess the relatedness between ESCC and normal samples (Figure 3C). Although most of the samples grouped within their respective categories, two ESCC samples clustered with the normal samples.

GEO2R presented a 2D interactive expression density plot (Figure 3D), which shows the density of expression values in the dataset. This plot is useful for determining whether normalization is necessary before identifying DEGs. In this plot, the y-axis denotes density, while the x-axis denotes intensity for both ESCC (green) and normal (violet) samples.

The distribution of the values across the different samples, including ESCC and normal, is shown in the box plot. These distributions indicate whether the samples are suitable for differential expression analysis. The median-centered values clearly indicate that the data are normalized and cross-comparable (Figure 3E).

The identified genes were filtered based on p < 0.05 and the fold-change criteria. Unchanged genes (with a fold change between 0.5 and 2.0) were removed from the analysis. When compared with a previously published study, only 514 genes were common, whereas 1193 genes were unique. This is important because identifying unique genes using GEO2R can help not only decrease redundancy but also enrich the compendium.

A partial list of DEGs is given in Table 1, while the complete list is provided in Supplementary File 1. Some of the upregulated genes belong to the extracellular matrix, such as MMP1 [8,23,24], MMP12 [23,25], SPP1 [8,26], POSTN [9], and VCAN [8,27]. Among the other genes listed in Table 1, CMPK2, AURKA [28,29], CHEK1 [27], and CDK1 [30] are upregulated, and EMP1 [27], PTK6 [31,32], GPX3 [27], DPT [33], FHL1 [34,35], and CRNN [8,36] are downregulated in ESCC as compared with normal epithelia. POSTN (periostin) is upregulated in ESCC and has also been reported in esophageal adenocarcinoma. A previous study on ESCC reported that POSTN protein expression was observed not only in the stromal region but also in the tumor cells, suggesting an interaction between the tumor and its microenvironment [9]. Periostin is a protein that is primarily secreted by mesenchymal cells and plays a crucial role in the regulation, adhesion, and differentiation of osteoblasts, as well as in wound repair. In addition, periostin has been implicated in tumor progression and metastasis in various cancers, including ESCC. Studies have shown that periostin is involved in epithelial-to-mesenchymal transition (EMT) and tumor angiogenesis, promoting cell migration, motility, adhesion, and metastatic growth of tumors. In Barrett's esophagus, a precancerous condition of the esophagus, POSTN, the gene that codes for periostin, is significantly upregulated compared to normal esophageal tissue [37]. In eosinophilic esophagitis, an inflammatory disease of the esophagus, both periostin mRNA and protein expression levels are upregulated compared to normal esophageal epithelium. Similarly, in ESCC, POSTN was found to be 11-fold upregulated in gene expression analysis [9]. These findings suggest that POSTN may serve as a potential biomarker for ESCC and other cancers. Furthermore, increased levels of serum POSTN have been reported in breast cancer patients diagnosed with bone metastases, suggesting that POSTN could also be investigated as a potential metastatic biomarker in the sera of ESCC patients. Overall, POSTN appears to play important roles in tumor progression and may have clinical implications for cancer diagnosis, prognosis, and treatment.

The chromosomal distribution of DEGs shows that the largest numbers of genes were from chromosomes 1-6 and X (Figure 4). The ShinyGO-based pathway analysis of the DEGs highlighted a number of crucial pathways in ESCC. These included the IL-17 signaling pathway, protein digestion and absorption, ECM-receptor interaction, TNF signaling pathway, Toll-like receptor signaling pathway, chemokine signaling pathway, cytokine-cytokine receptor interaction, alcoholic liver disease, microRNAs in cancer, transcriptional dysregulation in cancer, cell cycle, and NOD-like receptor signaling pathway. Further, enrichment of GO terms in the DEGs was assessed using g:Profiler. Different GO terms for molecular function (GO: MF), cellular component (GO: CC), and biological process (GO: BP) were enriched (Figure 5). The list of these GO terms is provided in Table 2.

Figure 1: Schematic representation of the processing of studies on esophageal squamous cell carcinoma available in Gene Expression Omnibus using the GEO2R program. The schema shows the steps involved in the identification of differentially regulated genes (DEGs) or differentially regulated molecules (DEMs), including the selection criteria for DEGs: fold change >2.0 and p < 0.05 for upregulated genes, and fold change <0.5 and p < 0.05 for downregulated genes.

Figure 2: Schematic representation of finding additional information on differentially regulated genes in esophageal squamous cell carcinoma using other publicly available resources. Additional information on DEGs is crucial in deciding which DEGs need to be selected for further validation and assessment in the clinical setting. Information such as aliases, official gene symbol, chromosome location/gene locus, OMIM entry, domains/motifs, secretory nature of the protein, and availability of a suitable antibody for validation at the protein level can be obtained from different online resources.

Figure 3: Analysis of the study with GEO accession GSE161533 using the GEO2R program for identification of DEGs between ESCC and normal. The GEO2R program was used with default parameters, which gave rise to (A) a volcano plot representing the gene distribution with -log10-transformed significance (p-value) on the y-axis and log2-transformed fold change on the x-axis, (B) an MD plot displaying log2 fold change vs. average log2 expression values for visualizing differentially expressed genes, (C) a UMAP (Uniform Manifold Approximation and Projection) plot showing the segregation of samples based on their types, (D) an expression density plot, which serves as a check on the normalization of the data before differential expression analysis, and (E) a box plot showing median-centered values across the samples, indicating that the data are normalized and cross-comparable.

Figure 4: Distribution of DEGs on different chromosomal loci using the ShinyGO enrichment tool. (A) Unique genes were identified by generating a Venn diagram to compare current vs. previously published studies. (B) The distribution of DEGs on the different chromosomes in the genome. (C) Pathway enrichment for DEGs using ShinyGO based enrichment analysis.

Figure 5: Manhattan plot illustrating GO term enrichment of target genes using g:Profiler. The differentially expressed genes were analyzed with g:Profiler, and the enrichment in GO terms (MF: molecular function; BP: biological process; CC: cellular component), KEGG pathways, Reactome pathways (REAC), WikiPathways (WP), transcription factors (TF), and microRNA targets (MIRNA) is depicted in a Manhattan plot in which the x-axis shows the functional terms colored by category. Each colored dot represents a term. The y-axis shows the adjusted p-values on a -log10 scale. The GO terms that are statistically significant for ESCC are shown on the x-axis. HP: Human Phenotype.

Table 1: Partial list of differentially expressed genes in ESCC.

Table 2: Enrichment of GO-terms in ESCC using g:Profiler.

Supplementary File 1: Complete list of differentially expressed genes in ESCC.


Discussion

Since the introduction of high-throughput omics techniques in cancer biology, the rate of data generation has increased significantly. This poses a challenge for researchers, especially those who are not computationally inclined. To overcome this, bioinformaticians have, over the years, developed databases that provide data in an organized manner. This has generated a positive response from researchers, especially those less familiar with computational tools. Omics data scattered throughout the literature are of little use. Therefore, to make proper use of these data, there has always been a need for a common platform where researchers with specialized interests can access them. There are a number of databases on different cancers, including ONCOMINE38, ESCC ATLAS39, the pancreatic cancer database (PCD)40, and DDEC41.

The concept of differentially expressed genes (DEGs) arises from the analysis of RNA sequencing data, where genes that show significant changes in expression levels across two or more conditions (such as cancer vs. normal, or treatment vs. control) are identified. Several tools have been developed to determine DEGs; they perform statistical tests on gene expression quantified computationally from either raw RNA-seq reads or the intensity ratios generated between probe and target sequence in the cancer vs. normal groups. These tools provide information on the expression level and the pairwise magnitude of difference for each gene. Differential gene expression (DGE) analyses are useful for understanding the genetic mechanisms that contribute to phenotypic differences in organisms. DGE analyses have been applied to study a variety of biological processes, including tumor origin detection and microbiome analysis. By identifying DEGs, these analyses can provide insight into the underlying genetic factors that contribute to the biological processes involved in ESCC tumorigenesis21.

GEO2R, which is publicly available, was the preferred method here. Most of the studies available in the literature have been analyzed using different algorithms, which leads to large differences in the resulting data; to avoid these differences, this platform was used because it is free and easy to use. It allows comparisons between conditions such as 'Cancer vs. Normal' or 'Treatment vs. No Treatment'.

In this case, ESCC was chosen because it is an emerging cancer of the gastrointestinal (GI) tract in India and China. We chose GEO accession GSE161533 for analysis with GEO2R to identify DEGs between ESCC and normal. The study was chosen because it did not include ESCC patients who had previously received chemotherapy or radiotherapy. It is preferable to use paired samples, if available (ESCC and adjacent normal tissue from the same patient), for any analysis. This is because the genomes of ESCC and normal tissues from the same patient are expected to be very similar, since they come from the same genetic background and the tissues are in the same environment. Using paired samples helps to avoid the bias that might be introduced by comparing ESCC and normal tissues from different patients with different genetic backgrounds. It allows for a more accurate identification of the differences in gene expression between ESCC and normal tissues within the same patient, which can help improve the specificity of the results. This approach is frequently used in gene expression studies to control for individual variability and enhance analytical power.
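The benefit of pairing described above can be illustrated numerically. The sketch below compares a paired t-statistic with an unpaired (Welch) t-statistic for one hypothetical gene; the log2 expression values are invented, the function names are illustrative, and the calculation is independent of GEO2R.

```python
import math
import statistics


def paired_t(cancer, normal):
    """t-statistic on within-patient differences (cancer - adjacent normal)."""
    diffs = [c - n for c, n in zip(cancer, normal)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def welch_t(cancer, normal):
    """Unpaired Welch t-statistic that ignores which patient a sample is from."""
    standard_error = math.sqrt(
        statistics.variance(cancer) / len(cancer)
        + statistics.variance(normal) / len(normal)
    )
    return (statistics.mean(cancer) - statistics.mean(normal)) / standard_error


# Invented log2 expression of one gene in three patients. Baseline levels
# differ widely between patients, but each tumor is about 1 unit higher.
cancer = [5.0, 7.2, 8.8]
normal = [4.0, 6.0, 8.0]

t_paired = paired_t(cancer, normal)
t_unpaired = welch_t(cancer, normal)
print(round(t_paired, 2), round(t_unpaired, 2))
```

Because the paired statistic is computed on within-patient differences, the large patient-to-patient variation in baseline expression cancels out, and the consistent tumor-associated shift stands out much more clearly than in the unpaired comparison.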

We took all the sample data from the subjects involved in the study and used the GEO2R platform to analyze the gene expression data. First, we assigned the cancer samples, followed by the normal samples. The default parameters available in GEO2R were then used to compare the cancer (or treatment) samples with the normal (or control) samples. To differentiate between cancer and normal samples, an adjusted p-value (adj. P Val) threshold of <0.05 and a fold-change threshold of >2.0 were set for upregulated genes, and an adjusted p-value threshold of <0.05 and a fold-change threshold of <0.5 were set for downregulated genes. These thresholds are commonly used in gene expression studies to identify genes differentially expressed between cancer and normal tissue. It is important to note that the choice of significance thresholds can affect the number and identity of the genes identified as differentially expressed. It is also important to carefully evaluate the biological relevance of the identified genes and to perform further validation studies to confirm the results.
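The thresholds described above can be sketched as a small filter. This is an illustration under stated assumptions rather than the GEO2R implementation: it assumes each gene comes with an adjusted p-value and a base-2 log fold change, and the function name, field names, and example rows are hypothetical. A fold-change of >2.0 corresponds to a log2 fold change of >1, and a fold-change of <0.5 to a log2 fold change of <-1.

```python
ADJ_P_CUTOFF = 0.05   # adjusted p-value must be below this
UP_FC_CUTOFF = 2.0    # fold-change above this counts as upregulated
DOWN_FC_CUTOFF = 0.5  # fold-change below this counts as downregulated

def classify_gene(adj_p, log2_fc):
    """Label one gene as 'up', 'down', or 'not_significant'."""
    if adj_p >= ADJ_P_CUTOFF:
        return "not_significant"
    fold_change = 2.0 ** log2_fc  # assumes a base-2 log fold change
    if fold_change > UP_FC_CUTOFF:
        return "up"
    if fold_change < DOWN_FC_CUTOFF:
        return "down"
    return "not_significant"

# Hypothetical rows, not taken from GSE161533.
example_rows = [
    {"gene": "GENE_A", "adj_p": 0.001, "log2_fc": 2.5},
    {"gene": "GENE_B", "adj_p": 0.010, "log2_fc": -1.8},
    {"gene": "GENE_C", "adj_p": 0.200, "log2_fc": 3.0},
    {"gene": "GENE_D", "adj_p": 0.010, "log2_fc": 0.5},
]

labels = {row["gene"]: classify_gene(row["adj_p"], row["log2_fc"])
          for row in example_rows}
print(labels)
```

Changing the three constants at the top is enough to explore how a different cutoff, such as the older 1.5-fold and 0.67-fold convention, alters the resulting gene list.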

In the literature, there has been a trend of reporting only genes with at least a 2-fold change for upregulation and a <0.5-fold change for downregulation, especially in microarray and proteomics studies42. In earlier studies, a fold-change of >1.5 was considered upregulated and <0.67 downregulated43,44, but literature trends over the last decade clearly show that a higher fold-change is preferred, largely because validation experiments on candidates with low fold-change values show either weak or no correlation between mRNA and protein levels45. The drawback of choosing a higher fold-change cutoff is that some molecules that are biologically relevant to the disease or cancer may be missed simply because they fall outside the cutoff used to compile the list of DEGs/DEMs. Furthermore, the literature is biased toward reporting upregulated or overexpressed molecules rather than underexpressed ones. Molecules that show the same pattern of upregulation or overexpression across multiple studies, whether or not those studies concern the same cancer or disease, are favored by scientists. Likewise, when the same pattern of overexpression is observed in multiple diseases and reported in the literature, it is widely accepted in the scientific community.

Moreover, the similarity of diseases is contingent on whether microarray data or literature is employed for the comparison. Lastly, loosely defined descriptions of differential expression magnitudes in the literature exhibit only a limited correlation with microarray fold-change data46.

Further, a compendium can provide additional information from databases such as NCBI Entrez Gene47, HGNC48, OMIM49, HPRD50,51, Ensembl52, KEGG53, WikiPathways54, GO55, miRBase56, and DGV57. In GEO2R, the UMAP plot shows how the samples are related. In the current analysis, two ESCC samples cluster with the normal samples, suggesting either a sampling error or that these ESCC samples are heterogeneous enough to fall within the group of normal samples.

The GEO2R tool is user-friendly and easily accessible, but it has some limitations. GEO2R lacks the ability to generate PCA plots and heat maps or to filter samples after quality control. It provides only a single Venn diagram for sample comparisons within the same series. GEO2R is limited to Series Matrix files, which prevents cross-series comparisons. Additionally, GEO2R only analyzes microarray data and does not have quality controls for sample normality or cross-comparability. GEO2R does not return an unlimited number of results and displays only the top 250 genes for any given pairwise comparison within a dataset. It will also analyze datasets that have too few sample replicates for a robust statistical analysis. GEO2R reports data as log fold change, which must be converted into fold-change using either R or an Excel sheet. Also, to represent up- and downregulated genes, one has to use another software package or online tool to make a heatmap58,59,60.
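The log fold change conversion mentioned above can also be scripted. The following is a minimal sketch, assuming the log fold change is base 2 and that the results were saved as tab-delimited text with a `logFC` column; the probe IDs, the sample text, and the function names are hypothetical, and the column names should be checked against the actual export.

```python
import csv
import io

def log2fc_to_fold_change(log2_fc):
    """Convert a base-2 log fold change into a linear fold-change."""
    return 2.0 ** log2_fc

def add_fold_change(tsv_text, log_fc_column="logFC"):
    """Parse tab-delimited text and add a 'fold_change' field to each row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["fold_change"] = log2fc_to_fold_change(float(row[log_fc_column]))
        rows.append(row)
    return rows

# Hypothetical export with made-up probe IDs and values.
sample_export = (
    "ID\tadj.P.Val\tlogFC\n"
    "PROBE_1\t0.001\t2.0\n"
    "PROBE_2\t0.004\t-1.0\n"
)

converted = add_fold_change(sample_export)
for row in converted:
    print(row["ID"], row["fold_change"])
```

With this convention, a log fold change of 2.0 becomes a 4-fold increase and a log fold change of -1.0 becomes a fold-change of 0.5, which matches the up- and downregulation cutoffs used in this protocol.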

In summary, this article provides a simple pipeline that can be used, with minor modifications, to build a compendium for any kind of malignancy. Compendia are the need of the hour to support biomedical scientists, especially in biomarker discovery, by providing candidate molecules for validation in the clinical setting for use in either prognosis or diagnosis.

Disclosures

The authors have nothing to disclose.

Acknowledgments

MKK is a recipient of the TARE fellowship (Grant # TAR/2018/001054) and an extramural grant (Grant # 5/13/55/2020/NCD-III) from the Science and Engineering Research Board (SERB), Department of Science and Technology, and the Indian Council of Medical Research (ICMR), Government of India, New Delhi, respectively.

Materials

| Name | Company | Catalog Number | Comments |
| --- | --- | --- | --- |
| NCBI-PUBMED | NCBI | https://ncbi.nlm.nih.gov/pubmed | Referring to section 1; required for searching the literature |
| A laptop/MacBook or personal computer with internet facility and a web browser | | | |
| g:Profiler | ELIXIR infrastructure | https://biit.cs.ut.ee/gprofiler/gost | Referring to section 4.10; required for enrichment of GO:MF, GO:BP, and GO:CC |
| Gene Expression Omnibus | NCBI | https://www.ncbi.nlm.nih.gov/geo/ | Referring to section 3.1; required for searching the microarray study database |
| GEO2R | NCBI | https://www.ncbi.nlm.nih.gov/geo/geo2r/ | Referring to section 3.2; required for analyzing the data using the GEO2R tool |
| Google | Google | https://www.google.com | Referring to section 1.1; required for searching the literature |
| HGNC | HGNC is a committee of the Human Genome Organisation (HUGO) | https://www.genenames.org | Referring to section 6.1; required to know the official gene symbol of the DEGs |
| HPRD | Institute of Bioinformatics, Bengaluru | http://hprd.org | Referring to section 5.1; required for information about protein architecture |
| OMIM | Johns Hopkins University, Baltimore | http://www.omim.org/entry | Referring to section 8.1; required to know the OMIM ID of a particular gene/DEG |
| Pangloss | Program developed by Chris Seidel | http://www.pangloss.com/seidel/Protocols/venn.cgi | Referring to section 4.9; required for generating the Venn diagram |
| PANTHER | Thomas lab at the University of Southern California | http://www.pantherdb.org/geneListAnalysis.do | Referring to section 4.10; required for enrichment of GO:MF, GO:BP, and GO:CC |
| ShinyGO | South Dakota State University | http://bioinformatics.sdstate.edu/go | Referring to section 4.10; required for allocation of DEGs on the chromosomes |

References

  1. Zeng, H., et al. Esophageal cancer statistics in China, 2011: Estimates based on 177 cancer registries. Thorac Cancer. 7 (2), 232-237 (2016).
  2. Zhang, H., Jin, G., Shen, H. Epidemiologic differences in esophageal cancer between Asian and Western populations. Chin J Cancer. 31 (6), 281-286 (2012).
  3. Chen, C., et al. Consumption of hot beverages and foods and the risk of esophageal cancer: a meta-analysis of observational studies. BMC Cancer. 15, 449 (2015).
  4. Yousefi, M., et al. Esophageal cancer in the world: incidence, mortality and risk factors. Biomedical Research and Therapy. 5 (7), 2504-2517 (2018).
  5. Jemal, A., Center, M. M., DeSantis, C., Ward, E. M. Global patterns of cancer incidence and mortality rates and trends. Cancer Epidemiol Biomarkers Prev. 19 (8), 1893-1907 (2010).
  6. Kambhampati, S., Tieu, A. H., Luber, B., Wang, H., Meltzer, S. J. Risk factors for progression of Barrett's esophagus to high grade dysplasia and esophageal adenocarcinoma. Sci Rep. 10 (1), 4899 (2020).
  7. Schuchert, M. J., Luketich, J. D. Management of Barrett's esophagus. Oncology (Williston Park). 21 (11), 1382-1389 (2007).
  8. Kashyap, M. K., et al. Genomewide mRNA profiling of esophageal squamous cell carcinoma for identification of cancer biomarkers. Cancer Biol Ther. 8 (1), 36-46 (2009).
  9. Kashyap, M. K., et al. Overexpression of periostin and lumican in esophageal squamous cell carcinoma. Cancers (Basel). 2 (1), 133-142 (2010).
  10. Zhu, Z. J., et al. Untargeted metabolomics analysis of esophageal squamous cell carcinoma discovers dysregulated metabolic pathways and potential diagnostic biomarkers. J Cancer. 11 (13), 3944-3954 (2020).
  11. Wang, H., et al. DNA methylation markers in esophageal cancer: an emerging tool for cancer surveillance and treatment. Am J Cancer Res. 11 (11), 5644-5658 (2021).
  12. Wu, B. L., et al. MiRNA profile in esophageal squamous cell carcinoma: downregulation of miR-143 and miR-145. World J Gastroenterol. 17 (1), 79-88 (2011).
  13. Meng, X. R., Lu, P., Mei, J. Z., Liu, G. J., Fan, Q. X. Expression analysis of miRNA and target mRNAs in esophageal cancer. Braz J Med Biol Res. 47 (9), 811-817 (2014).
  14. Churko, J. M., Mantalas, G. L., Snyder, M. P., Wu, J. C. Overview of high throughput sequencing technologies to elucidate molecular pathways in cardiovascular diseases. Circ Res. 112 (12), 1613-1623 (2013).
  15. Dalman, D. A., Nimishakavi, G., Duan, Z. H. Fold change and p-value cutoffs significantly alter microarray interpretations. BMC Bioinformatics. 13, 11 (2012).
  16. Gentleman, R. C., et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5 (10), 80 (2004).
  17. Barrett, T., et al. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 37, D885-D890 (2009).
  18. Kume, H., et al. Discovery of colorectal cancer biomarker candidates by membrane proteomic analysis and subsequent verification using selected reaction monitoring (SRM) and tissue microarray (TMA) analysis. Mol Cell Proteomics. 13 (6), 1471-1484 (2014).
  19. Jin, G., Wong, S. T. C. Chapter 3 - Proteomics-Based Theranostics. , (2014).
  20. Del Campo, M., et al. Facilitating the validation of novel protein biomarkers for dementia: an optimal workflow for the development of sandwich immunoassays. Front Neurol. 6, 202 (2015).
  21. McDermaid, A., Monier, B., Zhao, J., Liu, B., Ma, Q. Interpretation of differential gene expression results of RNA-seq data: review and integration. Brief Bioinform. 20 (6), 2044-2054 (2019).
  22. McInnes, L., Healy, J., Saul, N., Großberger, L. UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software. 3 (29), 861 (2018).
  23. Xu, G., et al. Upregulated expression of MMP family genes is associated with poor survival in patients with esophageal squamous cell carcinoma via regulation of proliferation and epithelial-mesenchymal transition. Oncol Rep. 44 (1), 29-42 (2020).
  24. Chen, Y. K., et al. Plasma matrix metalloproteinase 1 improves the detection and survival prediction of esophageal squamous cell carcinoma. Sci Rep. 6, 30057 (2016).
  25. Han, F., Zhang, S., Zhang, L., Hao, Q. The overexpression and predictive significance of MMP-12 in esophageal squamous cell carcinoma. Pathol Res Pract. 213 (12), 1519-1522 (2017).
  26. Kita, Y., et al. Expression of osteopontin in oesophageal squamous cell carcinoma. Br J Cancer. 95 (5), 634-638 (2006).
  27. Chen, F. F., Zhang, S. R., Peng, H., Chen, Y. Z., Cui, X. B. Integrative genomics analysis of hub genes and their relationship with prognosis and signaling pathways in esophageal squamous cell carcinoma. Mol Med Rep. 20 (4), 3649-3660 (2019).
  28. Tong, T., et al. Overexpression of Aurora-A contributes to malignant development of human esophageal squamous cell carcinoma. Clin Cancer Res. 10 (21), 7304-7310 (2004).
  29. Du, R., et al. Bioinformatics and experimental validation of an AURKA/TPX2 axis as a potential target in esophageal squamous cell carcinoma. Oncol Rep. 49 (6), 116 (2023).
  30. Zhang, H. J., et al. Overexpression of cyclin-dependent kinase 1 in esophageal squamous cell carcinoma and its clinical significance. FEBS Open Bio. 11 (11), 3126-3141 (2021).
  31. Ma, S., et al. Identification of PTK6, via RNA sequencing analysis, as a suppressor of esophageal squamous cell carcinoma. Gastroenterology. 143 (3), 675-686 (2012).
  32. Chen, Y. F., et al. Downregulated expression of PTK6 is correlated with poor survival in esophageal squamous cell carcinoma. Med Oncol. 31 (12), 317 (2014).
  33. Tao, Y., et al. Identification of distinct gene expression profiles between esophageal squamous cell carcinoma and adjacent normal epithelial tissues. Tohoku J Exp Med. 226 (4), 301-311 (2012).
  34. Kashyap, M. K., et al. Evaluation of protein expression pattern of stanniocalcin 2, insulin-like growth factor-binding protein 7, inhibin beta A and four and a half LIM domains 1 in esophageal squamous cell carcinoma. Cancer Biomark. 12 (1), 1-9 (2013).
  35. Wei, X., Zhang, H. Four and a half LIM domains protein 1 can be as a double-edged sword in cancer progression. Cancer Biol Med. 17 (2), 270-281 (2020).
  36. Pawar, H., et al. Downregulation of cornulin in esophageal squamous cell carcinoma. Acta Histochem. 115 (2), 89-99 (2013).
  37. Hao, Y., et al. Gene expression profiling reveals stromal genes expressed in common between Barrett's esophagus and adenocarcinoma. Gastroenterology. 131 (3), 925-933 (2006).
  38. Rhodes, D. R., et al. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia. 6 (1), 1-6 (2004).
  39. Tungekar, A., et al. ESCC ATLAS: A population wide compendium of biomarkers for Esophageal Squamous Cell Carcinoma. Sci Rep. 8 (1), 12715 (2018).
  40. Thomas, J. K., et al. Pancreatic cancer database: an integrative resource for pancreatic cancer. Cancer Biol Ther. 15 (8), 963-967 (2014).
  41. Essack, M., et al. DDEC: Dragon database of genes implicated in esophageal cancer. BMC Cancer. 9, 219 (2009).
  42. Sharma, L., Kashyap, M. K., Sharma, D. Non-alcoholic Fatty Liver Disease (NAFLD): A systematic review and meta-analysis from an omics perspective. Gene Expression. 22 (2), 79-91 (2023).
  43. Mamber, S. W., Gurel, V., Rhodes, R. G., McMichael, J. Effects of Streptolysin O on extracellular matrix gene expression in normal human epidermal keratinocytes. Dose Response. 9 (4), 554-578 (2011).
  44. Pang, S., et al. Differential expression of long non-coding RNA and mRNA in children with Henoch-Schönlein purpura nephritis. Exp Ther Med. 17 (1), 621-632 (2019).
  45. Tan, P. K., et al. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 31 (19), 5676-5684 (2003).
  46. Rodriguez-Esteban, R., Jiang, X. Differential gene expression in disease: a comparison between high-throughput studies and the literature. BMC Med Genomics. 10 (1), 59 (2017).
  47. Maglott, D., Ostell, J., Pruitt, K. D., Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 35, Database issue D26-D31 (2007).
  48. Gray, K. A., Yates, B., Seal, R. L., Wright, M. W., Bruford, E. A. Genenames.org: the HGNC resources in 2015. Nucleic Acids Res. 43, Database issue D1079-D1085 (2015).
  49. McKusick, V. A. Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet. 80 (4), 588-604 (2007).
  50. Keshava Prasad, T. S., et al. Human Protein Reference Database--2009 update. Nucleic Acids Res. 37, Database issue D767-D772 (2009).
  51. Peri, S., et al. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. 32, Database issue D497-D501 (2004).
  52. Hubbard, T., et al. The Ensembl genome database project. Nucleic Acids Res. 30 (1), 38-41 (2002).
  53. Kanehisa, M., Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28 (1), 27-30 (2000).
  54. Pico, A. R., et al. WikiPathways: pathway editing for the people. PLoS Biol. 6 (7), 184 (2008).
  55. Ashburner, M., et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 25 (1), 25-29 (2000).
  56. Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A., Enright, A. J. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res. 34, Database issue D140-D144 (2006).
  57. MacDonald, J. R., Ziman, R., Yuen, R. K., Feuk, L., Scherer, S. W. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 42, Database issue D986-D992 (2014).
  58. Amaral, M. L., Erikson, G. A., Shokhirev, M. N. BART: bioinformatics array research tool. BMC Bioinformatics. 19, 296 (2018).
  59. Wiese, L., Wiese, I., Lietz, K. Software quality assessment of a web application for biomedical data analysis. 25th International Database Engineering & Applications Symposium. , 84-93 (2021).
  60. Davis, S., Meltzer, P. S. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 23 (14), 1846-1847 (2007).

Tags

Cancer Research Esophageal squamous cell carcinoma DEGs GEO GEO2R transcriptomics and proteomics

Cite this Article

Krishnia, L., Kashyap, M. K. Development of Compendium for Esophageal Squamous Cell Carcinoma. J. Vis. Exp. (206), e65480, doi:10.3791/65480 (2024).
