Three Differential Expression Analysis Methods for RNA Sequencing: limma, EdgeR, DESeq2

Shiyi Liu; Zitao Wang; Ronghui Zhu; Feiyan Wang; Yanxiang Cheng; Yeqiang Liu

doi:10.3791/62528

Cancer Research

Three Differential Expression Analysis Methods for RNA Sequencing: limma, EdgeR, DESeq2

Published: September 18, 2021 doi: 10.3791/62528

Shiyi Liu*¹, Zitao Wang*¹, Ronghui Zhu¹, Feiyan Wang², Yanxiang Cheng¹, Yeqiang Liu²

¹Department of Obstetrics and Gynecology, Renmin Hospital of Wuhan University, ²Department of Pathology, Shanghai Skin Disease Hospital, Tongji University School of Medicine

* These authors contributed equally

Summary

A detailed protocol of differential expression analysis methods for RNA sequencing was provided: limma, EdgeR, DESeq2.

Abstract

RNA sequencing (RNA-seq) is one of the most widely used technologies in transcriptomics as it can reveal the relationship between the genetic alteration and complex biological processes and has great value in diagnostics, prognostics, and therapeutics of tumors. Differential analysis of RNA-seq data is crucial to identify aberrant transcriptions, and limma, EdgeR and DESeq2 are efficient tools for differential analysis. However, RNA-seq differential analysis requires certain skills with R language and the ability to choose an appropriate method, which is lacking in the curriculum of medical education.

Herein, we provide the detailed protocol to identify differentially expressed genes (DEGs) between cholangiocarcinoma (CHOL) and normal tissues through limma, DESeq2 and EdgeR, respectively, and the results are shown in volcano plots and Venn diagrams. The three protocols of limma, DESeq2 and EdgeR are similar but have different steps among the processes of the analysis. For example, a linear model is used for statistics in limma, while the negative binomial distribution is used in edgeR and DESeq2. Additionally, the normalized RNA-seq count data is necessary for EdgeR and limma but is not necessary for DESeq2.

Here, we provide a detailed protocol for three differential analysis methods: limma, EdgeR and DESeq2. The results of the three methods are partly overlapping. All three methods have their own advantages, and the choice of method only depends on the data.

Introduction

RNA-sequencing (RNA-seq) is one of the most widely used technologies in transcriptomics with many advantages (e.g., high data reproducibility), and has dramatically increased our understanding of the functions and dynamics of complex biological processes¹^,². Identification of aberrate transcripts under different biological context, which are also known as differentially expressed genes (DEGs), is a key step in RNA-seq analysis. RNA-seq makes it possible to get a deep understanding of pathogenesis related molecular mechanisms and biological functions. Therefore, differential analysis has been regarded as valuable for diagnostics, prognostics and therapeutics of tumors³^,⁴^,⁵. Currently, more open-source R/Bioconductor packages have been developed for RNA-seq differential expression analysis, particularly limma, DESeq2 and EdgeR¹^,⁶^,⁷. However, differential analysis requires certain skills with R language and the ability to choose the appropriate method, which is lacking in the curriculum of medical education.

In this protocol, based on the cholangiocarcinoma (CHOL) RNA-seq count data extracted from The Cancer Genome Atlas (TCGA), three of the most known methods (limma⁸, EdgeR⁹ and DESeq2¹⁰) were carried out, respectively, by the R program¹¹ to identify the DEGs between CHOL and normal tissues. The three protocols of limma, EdgeR and DESeq2 are similar but have different steps among the processes of the analysis. For example, the normalized RNA-seq count data is necessary for EdgeR and limma⁸^, ⁹, whereas DESeq2 uses its own library discrepancies to correct data instead of normalization¹⁰. Furthermore, edgeR is specifically suitable for RNA-seq data, while the limma is used for microarrays and RNA-seq. A linear model is adopted by limma to assess the DEGs¹², while the statistics in edgeR are based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models and quasi-likelihood tests⁹.

In summary, we provide the detailed protocols of RNA-seq differential expression analysis by using limma, DESeq2 and EdgeR, respectively. By referring to this article, users can easily perform the RNA-seq differential analysis and choose the appropriate differential analysis methods for their data.

Subscription Required. Please recommend JoVE to your librarian.

Protocol

NOTE: Open the R-studio program and load R file "DEGs.R", the file can be acquired from Supplementary files/Scripts.

1. Downloading and pre-processing of data

Download the high-throughput sequencing (HTSeq) count data of cholangiocarcinoma (CHOL) from The Cancer Genome Atlas (TCGA). This step can be easily achieved by the following R code.
1. Click Run to install R packages.
2. Click Run to load R packages.
  if(!requireNamespace("BiocManager", quietly=TRUE))
  + install.packages("BiocManager")
  BiocManager::install(c("TCGAbiolinks", "SummarizedExperiment"))
3. Set the working directory.
  library (TCGAbiolinks)
  library(SummarizedExperiment)
  setwd("C:/Users/LIUSHIYI/Desktop")
4. Choose the cancer type.
  cancer <- "TCGA-CHOL"
5. Run the R code from the "GDCquery.R" file to download the data. The file "GDCquery.R" can be acquired from Supplementary files/Scripts:
  source("Supplementary files/Scripts/GDCquery.R")
  head(cnt)
  ##TCGA-3X-AAVA-01A-11R-A41I-07
  ##ENSG00000000003 4262
  ##ENSG00000000005 1
  ##ENSG00000000419 1254
  ##ENSG00000000457 699
  ##ENSG00000000460 239
  ##ENSG00000000938 334
  NOTE: After execution, the CHOLHTSeq count data will be downloaded and named "cnt", where rows represent ensemble gene IDs and columns represent sample IDs. Please notice the numbers at positions 14-15 in the sample IDs; numbers ranging from 01 to 09 indicate tumors and ranging from 10 to 19 indicate normal tissues.
Convert ensemble gene IDs to gene symbols.
1. Import the annotation file into R according to its storage path. The annotation file (gencode.v22.annotation.gtf) can be acquired from Supplementary files.
  gtf_v22 <- rtracklayer::import('Supplementary files/gencode.v22.annotation.gtf')
2. Run the R code from the "gtf_v22.R" file, which can be acquired from Supplementary files/Scripts:
  source("Supplementary files/Scripts/gtf_v22.R")
3. Apply the function "ann" to convert the ensemble gene IDs to gene symbols.
  cnt=ann(cnt,gtf_v22)
Filtering low-expressed genes
1. Click Run to install the R package "edgeR".
  BiocManager::install("edgeR")
2. Click Run to load the R package "edgeR".
  library(edgeR)
3. Run the following R code to keep genes with counts per million (CPM) values greater than one in at least two samples.
  keep <- rowSums(cpm(cnt)>1)>=2
  cnt <- as.matrix(cnt[keep,])
  NOTE: The counts per million (CPM) value is used instead of the read count to eliminate the deviation caused by different sequencing depths.

2. Differential expression analysis through "limma"

Click Run to install the R package "limma".
BiocManager::install("limma")
Click Run to load the R packages "limma", "edgeR".
library(limma)
library(edgeR)
Run the following R code to create the design matrix.
group <- substring(colnames(cnt),14,15) # Extract group information
group [group %in% "01"] <- "Cancer" # set '01' as tumor tissue
group [group %in% "11"] <- "Normal" # set '11' as normal tissue
group <- factor (group, levels = c("Normal","Cancer"))
1. Create the design matrix.
  design <- model.matrix (~group)
  rownames(design) <- colnames(cnt)
2. Create the DGEList object.
  dge <- DGEList(counts = cnt, group = group)
3. Normalize the data.
  dge <- calcNormFactors(dge, method = "TMM")
4. Run the following R code to perform the limma-trend method based differential expression analysis.
  dge
  ##An object of class "DGEList"
  ##$counts
  ##TCGA-3X-AAVA-01A-11R-A41I-07
  ##TSPAN6 4262
  ##DPM1 1254
  ##SCYL3 699
  ##C1orf112 239
  ##FGR 334
5. Calculate the CPM value.
  logdge <- cpm(dge, log=TRUE, prior.count=3)
6. Click Run to fit a linear model to predict the data or infer the relationship between variables.
  fit <- lmFit (logdge, design)
7. Calculate the T value, F value and log-odds based on Bayesian.
  fit <- eBayes(fit, trend=TRUE)
8. Extract the result table.
  res_limma<- as.data.frame(topTable(fit,n=Inf))
  
  head(res_limma)
  ## logFC AveExpr t P.Value adj.P.Val B
  ##RP11-252E2.2 -4.899493 -2.488589 -20.88052 2.386656e-25 4.931786e-21 47.28823
  ##BX842568.1 -4.347930 -2.595205 -20.14532 1.082759e-24 1.118706e-20 45.83656
  ##CTC-537E7.3 -5.154894 -2.143292 -19.59571 3.452354e-24 2.216114e-20 44.72001
  ##RP11-468N14.3 -6.532259 -2.029714 -19.49409 4.289807e-24 2.216114e-20 44.51056
  ##AP006216.5 -4.507051 -2.670915 -19.25649 7.153356e-24 2.956339e-20 44.01704
  ##RP11-669E14.4 -4.107204 -2.828311 -18.93246 1.448209e-23 4.987633e-20 43.33543
  #The result of differential expression analysis is saved in "res_limma", which includes the gene id, log2 fold change value (logFC), the average log2 expression level of the gene in the experiment (AveExpr), the modified t statistic (t), relavent p value (P.Value), the false discovery rate (FDR) corrected p value (adj.P.Val) and the log-odds of differentially expressed genes (B)
  NOTE: The function "calcNormFactors()" of the "edgeR" was used to normalize the data to eliminate the influence caused by sample preparation or library construction and sequencing. In the construction of design matrix, it is necessary to match experimental design (e.g., tissue type: normal or tumor tissues) to sample IDs of the matrix. limma-trend is suitable to data whose sequencing depth is the same, while limma-voom is suitable: (i) when the sample library size is different; (ii) data not normalized by TMM; (iii) there is a lot of "noise" in the data. A positive logFC means that gene is up-regulated in the experiment, while negative number means that gene is down-regulated.
9. Identify the DEGs.
  res_limma$sig <- as.factor(
  ifelse(res_limma$adj.P.Val < 0.05 & abs(res_limma$logFC) > 2,
  ifelse(res_limma$logFC > 2 ,'up','down'),'not')) # The adj.p Value < 0.05 and the |log2FC| >= 2 are thresholds to identify the DEGs
  summary(res_limma$sig)
  ##down not up
  ##1880 17341 1443
10. Output the result table to a file.
  write.csv(res_limma, file = 'result_limma.csv')
11. Click Run to install the R package "ggplot2".
  install.packages("ggplot2")
12. Click Run to load the R package "ggplot2".
  library(ggplot2)
13. Run the R code from the "volcano.R" to create the volcano plot. The file "volcano.R" can be acquired from Supplementary files.
  source("Supplementary files/Scripts/volcano.R")
  volcano(res_limma,"logFC","adj.P.Val",2,0.05)
  NOTE: Genes can be mapped to different positions according to their log2FC and adj-p values, the up regulated DEGs are colored in red, and the down-regulated DEGs are colored in green.
14. Click Export to save the volcano plot.
  NOTE: The volcano plots can be generated and downloaded in different formats (e.g., pdf, TIFF, PNG, JPEG format). Genes can be mapped to different positions according to their log2FC and adj p values, the up-regulated DEGs (log2FC > 2, adj p < 0.05) are colored in red, and the down-regulated DEGs (log2FC < -2, adj p < 0.05) are colored in green, non-DEGs are colored in grey.

3. Differential expression analysis through "edgeR"

Click Run to load the R package "edgeR".
library(edgeR)
Run the following R code to create design matrix.
group <-substring(colnames(cnt),14,15)
group [group %in% "01"] <- "Cancer"
group [group %in% "11"] <- "Normal"
group=factor(group, levels = c("Normal","Cancer"))
design <-model.matrix(~group)
rownames(design) = colnames(cnt)
Click Run to create the DGEList object.
dge <- DGEList(counts=cnt)
Normalize the data.
dge <- calcNormFactors(dge, method = "TMM")
Click Run to estimate the dispersion of gene expression values.
dge <- estimateDisp(dge, design, robust = T)
Click Run to fit model to count data.
fit <- glmQLFit(dge, design)
Conduct a statistical test.
fit <- glmQLFTest(fit)
Extract the result table. The result is saved in "res_edgeR", which includes the log fold change value, log CPM, F, p value and FDR corrected p value.
res_edgeR=as.data.frame(topTags(fit, n=Inf))
head(res_edgeR)
## logFC logCPM F PValue FDR
##GCDH -3.299633 5.802700 458.5991 1.441773e-25 2.979280e-21
##MSMO1 -3.761400 7.521111 407.0416 1.730539e-24 1.787993e-20R
##CL1 -3.829504 5.319641 376.5043 8.652474e-24 5.516791e-20
##ADI1 -3.533664 8.211281 372.6671 1.067904e-23 5.516791e-20
##KCNN2 -5.583794 3.504017 358.6525 2.342106e-23 9.679455e-20
##GLUD1 -3.287447 8.738080 350.0344 3.848408e-23 1.194406e-19
#The result is saved in "res_edgeR", which includes the log fold change value(logFC), log CPM, F, p value and FDR corrected p value
Identify the DEGs.
res_edgeR$sig = as.factor(
ifelse(res_edgeR$FDR < 0.05 & abs(res_edgeR$logFC) > 2,
ifelse(res_edgeR$logFC > 2 ,'up','down'),'not'))
summary(res_edgeR$sig)
##down not up
##1578 15965 3121
Output the result table to a file.
write.csv(res_edgeR, file = 'res_edgeR.csv')
Create the volcano plot.
volcano(res_edgeR,"logFC","FDR",2,0.05)
Click Export to save the volcano plot.

4. Differential expression analysis through "DESeq2"

Click Run to install R packages "DESeq2".
BiocManager::install("DESeq2")
Click Run to load R packages "DESeq2".
library(DESeq2)
Run the following R code to determine the grouping factor.
group <-substring(colnames(cnt),14,15)
group [group %in% "01"] <- "Cancer"
group [group %in% "11"] <- "Normal"
group=factor(group, levels = c("Normal","Cancer"))
Create the DESeqDataSet object.
dds <-DESeqDataSetFromMatrix (cnt, DataFrame(group), design = ~group)
dds
##class: DESeqDataSet
##dim: 20664 45
##metadata(1): version
##assays(1): counts
##rownames(20664): TSPAN6 DPM1 ... RP11-274B21.13 LINC01144
##rowData names(0):
##colnames(45): TCGA-3X-AAVA-01A-11R-A41I-07 ...
##colData names(1): group
Perform the analysis.
dds <- DESeq(dds)
Generate the result table.
res_DESeq2 <- data.frame(results(dds))

head(res_DESeq2)
## baseMean log2FoldChange lfcSE stat pvalue padj
##TSPAN6 4704.9243 -0.8204515 0.3371667 -2.433370 1.495899e-02 2.760180e-02
##DPM1 1205.9087 -0.3692497 0.1202418 -3.070894 2.134191e-03 4.838281e-03
##SCYL3 954.9772 0.2652530 0.2476441 1.071106 2.841218e-01 3.629059e-01
##C1orf112 277.7756 0.7536911 0.2518929 2.992109 2.770575e-03 6.101584e-03
##FGR 345.8789 -0.6423198 0.3712729 -1.730047 8.362180e-02 1.266833e-01
##CFH 27982.3546 -3.8761382 0.5473363 -7.081823 1.422708e-12 1.673241e-11
NOTE: The result is saved in "res_DESeq2", which includes the mean of the normalized read count (baseMean), log fold Change value(log2FoldChange), log fold change standard error (lfcSE), the Wald statistic (stat), original p value (pvalue) and corrected p value (padj)
Identify DEGs.
res_DESeq2$sig = as.factor(
ifelse(res_DESeq2$padj < 0.05 & abs(res_DESeq2$log2FoldChange) > 2,
ifelse(res_DESeq2$log2FoldChange > 2 ,'up','down'),'not'))
summary(res_DESeq2$sig)
##down not up
##1616 16110 2938
Output the result table to a file.
write.csv(res_DESeq2, file = 'res_DESeq2.csv')
Create the volcano plot.
volcano(res_DESeq2,"log2FoldChange","padj",2,0.05)
Click Export to save the volcano plot.

5. Venn diagram

Click Run to install the R package "VennDiagram".
install.packages("VennDiagram")
Click Run to load the R package "VennDiagram".
library (VennDiagram)
Make a Venn diagram of up regulated DEGs.
grid.newpage()
grid.draw(venn.diagram(list(Limma=rownames(res_
limma[res_limma$sig=="up",]),
edgeR=rownames(res_edgeR[res_edgeR$sig=="up",]),
DESeq2=rownames(res_DESeq2[res_DESeq2$sig==
"up",])),
NULL,height = 3,width = 3,units = "in",
col="black",lwd=0.3,fill=c("#FF6666","#FFFF00",
"#993366"),
alpha=c(0.5, 0.5, 0.5),main = "Up-regulated DEGs"))
Click Export to save the Venn diagram.
Make a Venn diagram of down regulated DEGs.
grid.newpage()
grid.draw(venn.diagram(list(Limma=rownames(res_
limma[res_limma$sig=="down",]),
edgeR=rownames(res_edgeR[res_edgeR$sig==
"down",]),
DESeq2=rownames(res_DESeq2[res_DESeq2$sig=="down",])),
NULL,height = 3,width = 3,units = "in",
col="black",lwd=0.3,fill=c("#FF6666","#FFFF00",
"#993366"),
alpha=c(0.5, 0.5, 0.5),main = "Down-regulated DEGs"))
Click Export to save the Venn diagram.

Subscription Required. Please recommend JoVE to your librarian.

Representative Results

There are various approaches to visualize the result of differential expression analysis, among which the volcano plot and Venn diagram are particularly used. limma identified 3323 DEGs between the CHOL and normal tissues with the |logFC|≥2 and adj.P.Val <0.05 as thresholds, among which 1880 were down-regulated in CHOL tissues and 1443 were up-regulated (Figure 1a). Meanwhile, edgeR identified the 1578 down-regulated DEGs and 3121 up-regulated DEGs (Figure 1b); DESeq2 identified the 1616 down-regulated DEGs and 2938 up-regulated DEGs (Figure 1c). Comparing the results of these three methods, 1431 up-regulated DEGs and 1531 down-regulated DEGs were overlapped (Figure 2).

Figure 1. Identification of differentially expressed genes (DEGs) between CHOL and normal tissues. (a-c) The volcano plots of all genes acquired by limma, edgeR and DESeq2, respectively, adj p value (-log10) is plotted against the fold change (log2), red points represent the up-regulated DEGs (adjusted p value<0.05 and log |FC|> 2) and the green points represent the down-regulated DEGs (adjusted p value< 0.05 and log |FC|< 2). Please click here to view a larger version of this figure.

Figure 2. Venn diagrams show overlap among the results derived from the limma, edgeR and DESeq2. Please click here to view a larger version of this figure.

Supplementary Files. Please click here to download this File.

Subscription Required. Please recommend JoVE to your librarian.

Discussion

Abundant aberrate transcripts in cancers can be easily identified by RNA-seq differential analysis⁵. However, the application of RNA-seq differential expression analysis is often restricted as it requires certain skills with R language and the capacity to choose appropriate methods. To address this problem, we provide a detailed introduction to the three most known methods (limma, EdgeR and DESeq2) and tutorials for applying the RNA-seq differential expression analysis. This will facilitate the understanding of the similarities and differences across all three methods, enable the selection of a suitable method for individual data, and enable us to understand the complex dynamic biological processes.

Here, we present a detailed protocol for RNA-seq differential expression analysis through limma, edgeR and DESeq2 respectively, in five stages: (i) downloading and pre-processing of data, (ii-iv) differential expression analysis through limma, edgeR and DESeq2, respectively, (v) comparison of the results of these three methods through a Venn diagram.

The three methods have similar and different steps among the processes of the differential expression analysis. A linear model is used for statistics in limma, which is applicable for all gene expression technologies, including microarrays, RNA-seq and quantitative PCR⁸^,¹³, while edgeR and DESeq2 implement a range of statistical methodologies based on the negative binomial distribution⁹^,¹⁰, and edgeR and DESeq2 are suitable for RNA-seq data. In addition, the normalized RNA-seq count data is necessary for EdgeR and limma, whereas DESeq2 uses its own library discrepancies to correct data instead of normalization and the data in DESeq2 must be an integer matrix. The normalization methods include TMM (trimmed mean of M-values), TMMwsp, RLE (relative log expression) and upperquartile, among which TMM is the most commonly used normalization method for RNA-seq data. The results of the three methods showed that DESeq2 and EdgeR obtain more DEGs than limma. The reason for this difference is that edgeR and DESeq2 are based on the negative binomial model, which contributes to large numbers of false positives. On the contrary, limma-voom only uses the variance function and does not show excessive false positives, as is the case with a variance stabilizing transformation followed by linear model analysis with limma¹⁴^,¹⁵^,¹⁶.

All three methods have their own advantages, and the choice is just dependent on the type of data. For example, if there is microarray data, limma should be given with priority, but when it is the next-generation sequencing data, DESeq2 and EdgeR are preferred⁹^,¹⁰^,¹⁷. In summary, we provide here a detailed protocol for RNA-seq differential expression analysis with R packages limma, edgeR and DESeq2, respectively. The output results from the three methods are overlapping partly, and these differential methods have their respective advantages. Unfortunately, this protocol does not cover the technical details for other data types (e.g., microarray data) and methods (e.g., EBSeq)¹⁸.

Subscription Required. Please recommend JoVE to your librarian.

Disclosures

The manuscript has not been published before and is not being considered for publication elsewhere. All authors have contributed to the creation of this manuscript for important intellectual content and read and approved the final manuscript. We declare there is no conflict of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 81860276) and Key Special Fund Projects of National Key R&D Program (Grant No. 2018YFC1003200).