A detailed protocol of differential expression analysis methods for RNA sequencing was provided: limma, EdgeR, DESeq2.
RNA sequencing (RNA-seq) is one of the most widely used technologies in transcriptomics as it can reveal the relationship between the genetic alteration and complex biological processes and has great value in diagnostics, prognostics, and therapeutics of tumors. Differential analysis of RNA-seq data is crucial to identify aberrant transcriptions, and limma, EdgeR and DESeq2 are efficient tools for differential analysis. However, RNA-seq differential analysis requires certain skills with R language and the ability to choose an appropriate method, which is lacking in the curriculum of medical education.
Herein, we provide the detailed protocol to identify differentially expressed genes (DEGs) between cholangiocarcinoma (CHOL) and normal tissues through limma, DESeq2 and EdgeR, respectively, and the results are shown in volcano plots and Venn diagrams. The three protocols of limma, DESeq2 and EdgeR are similar but have different steps among the processes of the analysis. For example, a linear model is used for statistics in limma, while the negative binomial distribution is used in edgeR and DESeq2. Additionally, the normalized RNA-seq count data is necessary for EdgeR and limma but is not necessary for DESeq2.
Here, we provide a detailed protocol for three differential analysis methods: limma, EdgeR and DESeq2. The results of the three methods are partly overlapping. All three methods have their own advantages, and the choice of method only depends on the data.
RNA-sequencing (RNA-seq) is one of the most widely used technologies in transcriptomics with many advantages (e.g., high data reproducibility), and has dramatically increased our understanding of the functions and dynamics of complex biological processes1,2. Identification of aberrate transcripts under different biological context, which are also known as differentially expressed genes (DEGs), is a key step in RNA-seq analysis. RNA-seq makes it possible to get a deep understanding of pathogenesis related molecular mechanisms and biological functions. Therefore, differential analysis has been regarded as valuable for diagnostics, prognostics and therapeutics of tumors3,4,5. Currently, more open-source R/Bioconductor packages have been developed for RNA-seq differential expression analysis, particularly limma, DESeq2 and EdgeR1,6,7. However, differential analysis requires certain skills with R language and the ability to choose the appropriate method, which is lacking in the curriculum of medical education.
In this protocol, based on the cholangiocarcinoma (CHOL) RNA-seq count data extracted from The Cancer Genome Atlas (TCGA), three of the most known methods (limma8, EdgeR9 and DESeq210) were carried out, respectively, by the R program11 to identify the DEGs between CHOL and normal tissues. The three protocols of limma, EdgeR and DESeq2 are similar but have different steps among the processes of the analysis. For example, the normalized RNA-seq count data is necessary for EdgeR and limma8, 9, whereas DESeq2 uses its own library discrepancies to correct data instead of normalization10. Furthermore, edgeR is specifically suitable for RNA-seq data, while the limma is used for microarrays and RNA-seq. A linear model is adopted by limma to assess the DEGs12, while the statistics in edgeR are based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests9.
In summary, we provide the detailed protocols of RNA-seq differential expression analysis by using limma, DESeq2 and EdgeR, respectively. By referring to this article, users can easily perform the RNA-seq differential analysis and choose the appropriate differential analysis methods for their data.
NOTE: Open the R-studio program and load R file "DEGs.R", the file can be acquired from Supplementary files/Scripts.
1. Downloading and pre-processing of data
2. Differential expression analysis through "limma"
3. Differential expression analysis through "edgeR"
4. Differential expression analysis through "DESeq2"
5. Venn diagram
There are various approaches to visualize the result of differential expression analysis, among which the volcano plot and Venn diagram are particularly used. limma identified 3323 DEGs between the CHOL and normal tissues with the |logFC|≥2 and adj.P.Val <0.05 as thresholds, among which 1880 were down-regulated in CHOL tissues and 1443 were up-regulated (Figure 1a). Meanwhile, edgeR identified the 1578 down-regulated DEGs and 3121 up-regulated DEGs (Figure 1b); DESeq2 identified the 1616 down-regulated DEGs and 2938 up-regulated DEGs (Figure 1c). Comparing the results of these three methods, 1431 up-regulated DEGs and 1531 down-regulated DEGs were overlapped (Figure 2).
Figure 1. Identification of differentially expressed genes (DEGs) between CHOL and normal tissues. (a-c) The volcano plots of all genes acquired by limma, edgeR and DESeq2, respectively, adj p value (-log10) is plotted against the fold change (log2), red points represent the up-regulated DEGs (adjusted p value<0.05 and log |FC|> 2) and the green points represent the down-regulated DEGs (adjusted p value< 0.05 and log |FC|< 2). Please click here to view a larger version of this figure.
Figure 2. Venn diagrams show overlap among the results derived from the limma, edgeR and DESeq2. Please click here to view a larger version of this figure.
Supplementary Files. Please click here to download this File.
Abundant aberrate transcripts in cancers can be easily identified by RNA-seq differential analysis5. However, the application of RNA-seq differential expression analysis is often restricted as it requires certain skills with R language and the capacity to choose appropriate methods. To address this problem, we provide a detailed introduction to the three most known methods (limma, EdgeR and DESeq2) and tutorials for applying the RNA-seq differential expression analysis. This will facilitate the understanding of the similarities and differences across all three methods, enable the selection of a suitable method for individual data, and enable us to understand the complex dynamic biological processes.
Here, we present a detailed protocol for RNA-seq differential expression analysis through limma, edgeR and DESeq2 respectively, in five stages: (i) downloading and pre-processing of data, (ii-iv) differential expression analysis through limma, edgeR and DESeq2, respectively, (v) comparison of the results of these three methods through a Venn diagram.
The three methods have similar and different steps among the processes of the differential expression analysis. A linear model is used for statistics in limma, which is applicable for all gene expression technologies, including microarrays, RNA-seq and quantitative PCR8,13, while edgeR and DESeq2 implement a range of statistical methodologies based on the negative binomial distribution9,10, and edgeR and DESeq2 are suitable for RNA-seq data. In addition, the normalized RNA-seq count data is necessary for EdgeR and limma, whereas DESeq2 uses its own library discrepancies to correct data instead of normalization and the data in DESeq2 must be an integer matrix. The normalization methods include TMM (trimmed mean of M-values), TMMwsp, RLE (relative log expression) and upperquartile, among which TMM is the most commonly used normalization method for RNA-seq data. The results of the three methods showed that DESeq2 and EdgeR obtain more DEGs than limma. The reason for this difference is that edgeR and DESeq2 are based on the negative binomial model, which contributes to large numbers of false positives. On the contrary, limma-voom only uses the variance function and does not show excessive false positives, as is the case with a variance stabilizing transformation followed by linear model analysis with limma14,15,16.
All three methods have their own advantages, and the choice is just dependent on the type of data. For example, if there is microarray data, limma should be given with priority, but when it is the next-generation sequencing data, DESeq2 and EdgeR are preferred9,10,17. In summary, we provide here a detailed protocol for RNA-seq differential expression analysis with R packages limma, edgeR and DESeq2, respectively. The output results from the three methods are overlapping partly, and these differential methods have their respective advantages. Unfortunately, this protocol does not cover the technical details for other data types (e.g., microarray data) and methods (e.g., EBSeq)18.
The authors have nothing to disclose.
This work was supported by the National Natural Science Foundation of China (Grant No. 81860276) and Key Special Fund Projects of National Key R&D Program (Grant No. 2018YFC1003200).