RESEARCH
Peer reviewed scientific video journal
Video encyclopedia of advanced research methods
Visualizing science through experiment videos
EDUCATION
Video textbooks for undergraduate courses
Visual demonstrations of key scientific experiments
BUSINESS
Video textbooks for business education
OTHERS
Interactive video based quizzes for formative assessments
Products
RESEARCH
JoVE Journal
Peer reviewed scientific video journal
JoVE Encyclopedia of Experiments
Video encyclopedia of advanced research methods
EDUCATION
JoVE Core
Video textbooks for undergraduates
JoVE Science Education
Visual demonstrations of key scientific experiments
JoVE Lab Manual
Videos of experiments for undergraduate lab courses
BUSINESS
JoVE Business
Video textbooks for business education
Solutions
Language
English
Menu
Menu
Menu
Menu
Research Article
Erratum Notice
Important: There has been an erratum issued for this article. View Erratum Notice
Retraction Notice
The article Assisted Selection of Biomarkers by Linear Discriminant Analysis Effect Size (LEfSe) in Microbiome Data (10.3791/61715) has been retracted by the journal upon the authors' request due to a conflict regarding the data and methodology. View Retraction Notice
This protocol allows initial quality control for RNA-seq experiments for wet-lab biologists with limited bioinformatics experience.
Modern approaches in molecular plant science often require bulk RNA-seq experiments, for example, to track global changes in transcriptomes upon treatments or to identify key components of regulatory pathways. Consequently, diverse areas of plant sciences rely on high-quality and reproducible bulk RNA-seq data for scientific advancement. However, from our experience, knowledge about and application of quality control measures in RNA-seq datasets are often lacking. Here, we introduce Rup (RNA-seq usability assessment pipeline) for the quality control of bulk RNA-seq data, for subsequent gene expression analyses, that is stand-alone and readily applicable for wet-lab biologists with basic knowledge of R. Rup helps to discriminate between sequencing data of high-quality, suitable for downstream gene expression experiments, and those unsuitable for general further analysis. Rup includes tests for several commonly encountered problems such as insufficient read numbers or mapping, identification of contaminations, quantification of rRNA fractions in the total RNA-seq data, replicate similarity testing, using real data for demonstration and offering intuitive visualization. Rup provides a suite of tools to identify experimental shortcomings before standardized transcriptome analysis, thereby improving data quality for individual researchers and the field. This enhances confidence in bulk RNA-seq data analysis and provides a foundation for future guidelines defining minimum quality control criteria, thus improving the reliability and transparency of published RNA-seq data. Rup and test data are available at https://github.com/oliverrupp/rup.
Transcriptomics (RNA-seq) experiments comprehensively interrogate transcriptional signatures shaping phenotypes. This approach became irreplaceable in plant molecular genetics to identify single or co-expressed genes and biological processes involved, for example, in development, plant pathogen interaction, or abiotic stress resistance1, 2 ,3. Recent advancements in RNA-seq technology have increased specificity and enabled the detection of different isoforms and variants at single-base resolution, allowing the identification of sequence variations from larger indels to single-nucleotide polymorphisms (SNPs). The data obtained by RNA-seq are characterized by a broad dynamic range, facilitating the detection of both highly abundant and lowly expressed transcripts, and require appropriate experimental conditions for consistency. In addition, RNA-seq datasets are large and thus computationally intensive to analyze and store, entailing efficient data management and extensive computing resources. RNA-seq requires high-quality input at every step of the workflow, as weaknesses in any stage can propagate and compromise the results. Poor RNA integrity or technical issues during library preparation will lead to bias and a reduction of accuracy. Parameters for obtaining high-quality raw reads can vary between next-generation sequencing (NGS) facilities.
According to our experience, we recommend using only RNA with an RNA integrity number (RIN) above 7, which is indicative of largely intact mRNA structure, as input for RNA-seq library preparation. For successful sequencing, approximately 2 µg of total RNA at a concentration of 50–200 ng/µL is required for standard library preparation protocols. RNA purity should be confirmed by an OD260/280 ratio between 1.8 and 2.1 and an OD260/230 ratio greater than 1.5 using a spectrophotometer. Insufficient sequencing depth, low mapping rates, or misalignment to the reference genome may further distort gene expression and splicing quantification. Moreover, inadequate experimental design, such as a lack of discrimination between samples of different tissues or treatments and high variability between replicates, can introduce noise and diminish reproducibility. While many labs analyze transcriptomes routinely, stringent quality controls of the first essential analysis steps are often not reported or may be entirely absent from publications. This may lead to over-interpretation of results derived from transcriptome analysis and, consequently, to irreproducible results.
Here, we provide a workflow for quality control of the initial steps required for high-quality transcriptome analyses of mRNA profiles to measure transcriptional alterations. Our aim is to enable wet-lab biologists with limited knowledge in bioinformatics to evaluate their primary transcriptomics data. Rup (Figure 1) is accessible to researchers who are familiar with basic knowledge of R. Executing the workflow presented here will provide researchers with a detailed understanding of their primary data, including their potential limitations for subsequent analysis. To our knowledge, a practical primary transcriptome data evaluation pipeline in combination with guidelines for distinguishing high- from low-quality data is lacking so far.

Figure 1: Workflow of the RNA-seq in silico quality control pipeline. The input files for quality control are derived from RNA-seq data generated by sequencing plant material, as well as publicly available datasets. The provided R pipeline assesses sequence quality through three main approaches: sequencing quality, mapping quality, and replicate quality. Various quality metrics are computed, and statistical results are visualized using common R packages such as ggplot2 and pheatmap (e.g., bar charts and heatmaps). Please click here to view a larger version of this figure.
Previously reported evaluation pipelines required pre-processed data (e.g., read alignment as .bam files), do not cover all metrics, or are no longer maintained4,5,6. The advantage of the workflow presented here lies in its comprehensiveness by consolidating the most common quality control issues into a single pipeline. Further, we provide examples for high-quality data suitable for all downstream analysis applications, but also examples for low-quality data, and discuss their specific limitations for further analysis. Several previously published reports describe the purpose of RNA-seq analysis tools and provide comparative evaluations of their performance7,8. However, RNA-seq quality control and methodology reporting standards are not standardized and aggravate reproducibility and biologically meaningful interpretations of transcriptomics experiments.
Rup integrates standard high-quality tools, quality control analysis, and visualization of the results. Rup requires raw sequencing reads and an annotated genome as input and runs on Mac OS, Linux, and “Windows Subsystem for Linux” (WSL) on Windows systems. It integrates tools for the quality of the sequencing and quantifies read mapping to a genome, includes rRNA content measurement in the samples, and offers sample correlation to evaluate replicate correlation. The test data were obtained using laser microdissection of the central meristem tissue of two different stages of the plant species Eschscholzia californica, an ultra-low input protocol for library preparation, and sequenced on Novaseq 6000.
Rup can be run as a single script with minimum input files required. The pipeline, as shown here, requires Illumina paired-end RNA sequencing data. An adjusted pipeline for single-end sequencing with the same analysis steps is also deposited in the GitHub repository. Only the genome sequence as a fasta file, the gene model and rRNA annotations as gtf files, and the raw sequencing reads as fastq.gz files are needed. Make sure that the genome and annotation files are provided as genome.fa, annotation.gtf, and rRNA.gtf in a folder specified in the reference_folder variable. When all sequencing files as .fq.gz files are saved in the read_file_folder variable, Rup can be run as follows:
1. Preparations:
BiocManager::install(c("getopt", "ggplot2", "reshape2", "pheatmap", "fastqcr", "Rfastp", "Rsubread", "Rsamtools"))
# define source folder and all derived subfolders for input and output files
source_folder <- "data"
reference_folder <- file.path(source_folder, "reference")
read_file_folder <- file.path(source_folder, "reads")
results_folder <- file.path(source_folder, "results")
# define input reference files
genome_fasta_file <- file.path(reference_folder, "genome.fa")
annotation_file <- file.path(reference_folder, "annotation.gtf")
rrna_file <- file.path(reference_folder, "rRNA.gtf")
# define result subfolders
fastqc_folder <- file.path(results_folder, "fastqc")
trimmed_fastqc_folder <- file.path(results_folder, "trimmed_fastqc")
trimmed_read_folder <- file.path(results_folder, "trimmed")
bam_folder <- file.path(results_folder, "bam")
sorted_bam_folder <- file.path(results_folder, "sorted_bam")
counts_folder <- file.path(results_folder, "counts")
# create results folders
for(folder in c(results_folder, fastqc_folder, trimmed_fastqc_folder, trimmed_read_folder, bam_folder, sorted_bam_folder, counts_folder)) {
if(!dir.exists(folder)) {
dir.create(folder)
}
}
# get sample prefixes from input fastq files
fastq_files <- list.files(read_file_folder, pattern = "*_1.f(ast)?q.gz")
sample_prefixes <- gsub("_1.f(ast)?q.gz", "", fastq_files)
n_threads <- 8 # the number of available CPU cores
bamSortMemory <- "1024" # maximum memory for bam file sorting
MinReadLength <- 25 # minimum read length, should not be less than 25
minFragLength <- 0 # minimum fragment length distribution
maxFragLength <- 300 # maximum fragment length distribution
orientation <- "fr" # read orientation for read mapping ("fr", "rf", "ff")
stranded <- 0 # stranded sequencing
# (0 (unstranded), 1 (stranded) and 2 (reversely stranded))
2. Sequencing quality assessment
NOTE: This step will create a bar plot showing the number of reads before and after trimming each sample.
# load the fastqc library
library(fastqcr)
# run fastqc on all raw fastq files
fastqc(fq.dir=read_file_folder, qc.dir=fastqc_folder, threads=n_threads)
# aggregate the fastqc statistics
qc <- qc_aggregate(fastqc_folder, progress=F)
qc$tot.seq <- as.numeric(qc$tot.seq)
# remove suffixes from sample names
qc$sample = gsub(".f(ast)?q.gz", "" , qc$sample)
# load the rfastp library
library(Rfastp)
# iterate over sample prefixes
for(prefix in sample_prefixes) {
# create output prefix
# !!! Rfastp automatically adds _R1.fastq.gz and _R2.fastq.gz to the prefix
outputPrefix <- file.path(trimmed_read_folder, prefix)
# get input reads based on sample prefix
read1 = file.path(read_file_folder, paste(prefix, "_1.fastq.gz", sep=""))
read2 = file.path(read_file_folder, paste(prefix, "_2.fastq.gz", sep=""))
if(!file.exists(read1)) { read1 = file.path(read_file_folder, paste(prefix, "_1.fq.gz", sep="")) }
if(!file.exists(read2)) { read2 = file.path(read_file_folder, paste(prefix, "_2.fq.gz", sep="")) }
# run fastp trimmig with minimum read length
fastp_stats <- rfastp(read1 = read1,
read2 = read2,
minReadLength = minReadLength,
outputFastq = outputPrefix,
thread = n_threads)
}
# run fastqc on the trimmed reads
fastqc(fq.dir=trimmed_read_folder, qc.dir=trimmed_fastqc_folder, threads=n_threads)
# aggreagate the fastqc statistics
qc_trimmed <- qc_aggregate(trimmed_fastqc_folder, progress=F)
qc_trimmed$tot.seq <- as.numeric(qc_trimmed$tot.seq)
# correct sample names (change the fastp "R1","R2" suffix to "1","2")
qc_trimmed$sample = gsub("_R([12])$", "_\1" , qc_trimmed$sample)
# load the ggplot2 library for plotting
library(ggplot2)
# add the trimming status to the fastqc results
qc$Trimming = "raw"
qc_trimmed$Trimming = "trimmed"
# combine the results into one vector
qc_all = rbind(qc, qc_trimmed)
# plot the read number before and after trimming as barplot
read_number_plot <- ggplot(qc_all, aes(x=sample, y=tot.seq, fill=Trimming)) +
geom_bar(stat="identity", position = "dodge") +
xlab("Samples") + ylab("Number of Reads") +
theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
theme(text = element_text(size = 18)) +
ggtitle("Number of reads before and afer trimming")
print(read_number_plot)

Figure 2: Sequencing and trimming results. Read counts before (red) and after trimming (green). Sample s2_r1 already has a low read count prior to trimming, whereas trimming removed a large fraction of sample s2_r2 reads. All other samples show acceptable loss of reads from trimming. Please click here to view a larger version of this figure.
3. Mapping quality
NOTE: This step calculates bar plots to discriminate single-mapped reads (e.g., transcripts from protein-coding genes) from multi mapped reads (e.g., rRNA reads) and unmapped reads (e.g., contaminations).
# load the Rsubread library for read mapping and read counting
library(Rsubread)
# create a subread index from the reference genome sequence
buildindex(file.path(reference_folder,"subread.index"), genome_fasta_file)
# load the Rsamtools library for bam file sorting and indexing
library(Rsamtools)
# iterate over sample names
for(prefix in sample_prefixes) {
# create the bam output file name
output_bam = file.path(bam_folder, paste(prefix, ".bam", sep=""))
# align the reads to the reference genome index
mapping_stats <- align(index=file.path(reference_folder,"subread.index"),
# create the forward and reverse read file names based on the prefix
# "_R1.fastq.gz" and "_R2.fastq.gz" are forced by fastp
readfile1=file.path(trimmed_read_folder, paste(prefix, "_R1.fastq.gz", sep="")),
readfile2=file.path(trimmed_read_folder, paste(prefix, "_R2.fastq.gz", sep="")),
output_file=output_bam, # set the output file name
type=0, # the reads are RNA-seq
minFragLength=minFragLength, # the minimal allowed fragment length
maxFragLength=maxFragLength, # the minimal allowed fragment length
PE_orientation=orientation, # read orientation
nthreads=n_threads,
# use a GTF annotation file to support the mapping
useAnnotation=TRUE,
annot.ext=annotation_file,
isGTF=TRUE,
nBestLocations=2) # to distinguish between unique and multi-mapping reads
# create the sorted output file name (the .bam suffix will be added automatically)
output_sorted_bam = file.path(sorted_bam_folder, prefix)
# sort and index the bam file
sortBam(output_bam, output_sorted_bam, maxMemory=bamSortMemory, nThreads=n_threads)
indexBam(paste0(output_sorted_bam, ".bam"))
}
# collect all bam files
bam_files <- list.files(bam_folder, pattern = "*.bam$")
# count the number of reads for each gene
gene_feature_counts = featureCounts(file.path(bam_folder, bam_files),
annot.ext = annotation_file, # the gene models
isGTFAnnotationFile = TRUE,
countMultiMappingReads = FALSE, # only count unique reads
strandSpecific = stranded, # for strand specific data
isPairedEnd = TRUE, # paired-end sequencing
# the following paramter allow to control for false-positive read assignments
requireBothEndsMapped = TRUE,
checkFragLength = TRUE,
minFragLength = minFragLength,
maxFragLength = maxFragLength,
nthreads = n_threads)
# count the reads mapping to rRNA genes
rrna_feature_counts = featureCounts(file.path(bam_folder, bam_files),
annot.ext = rrna_file, # the rRNA gene model file
isGTFAnnotationFile = TRUE,
countMultiMappingReads = TRUE, # rRNA reads usually map to multiple loci
fraction = TRUE,
strandSpecific = stranded,
isPairedEnd = TRUE,
nthreads = n_threads)
# load the reshape2 library for data restructuring
library(reshape2)
# get the assignement stats
gene_count_stats <- gene_feature_counts$stat
# set the sample names as column names
colnames(gene_feature_counts$counts) = gsub(".bam", "", colnames(gene_feature_counts$counts))
# set the assignment type as row names
rownames(gene_count_stats) <- gene_count_stats$Status
# select the columns with the assignment statistics
gene_count_stats <- gene_count_stats[,seq(2, ncol(gene_count_stats))]
# set the sample names as column names
colnames(gene_count_stats) <- gsub(".bam", "", colnames(gene_count_stats))
# transform the dataframe for plotting
transformed_stats <- melt(t(gene_count_stats))
# set the column names
colnames(transformed_stats) <- c("Sample", "Group", "Alignments")
# adjust the groupa and sample ordering for plotting
transformed_stats$Group <- factor(transformed_stats$Group,
levels = rev(levels(transformed_stats$Group)[order(levels(transformed_stats$Group))]))
transformed_stats$Sample <- factor(transformed_stats$Sample,
levels = rev(levels(transformed_stats$Sample)[order(as.character(transformed_stats$Sample))]))
# remove assignment classes without any alignments
transformed_stats <- transformed_stats[transformed_stats$Alignments > 0,]
# the statistics refer to proteins coding genes
transformed_stats$Reference = "Genes"
# collect the rrna assignment statistics
rrna_count_stats <- rrna_feature_counts$stat
# set the sample names as column names
colnames(rrna_count_stats) <- gsub(".bam", "", colnames(rrna_count_stats))
# only the number of assigned reads are important in this case
rrna_counts = as.data.frame(t(rrna_count_stats[rrna_count_stats$Status == "Assigned",2:ncol(rrna_count_stats)]))
colnames(rrna_counts) = "Alignments"
# setup names and categories for plotting
rrna_counts$Sample = rownames(rrna_counts)
rrna_counts$Group = "rRNA"
rrna_counts$Reference = "rRNA"
# combine the proteind coding and rRNA gene statistics
mapping_stats = rbind(transformed_stats, rrna_counts)
# plot the assignment statistics
mapping_stats_plot = ggplot(data = mapping_stats, aes(x = Sample, y = Alignments)) +
geom_col(aes(fill = Group), width = 0.7) +
theme_bw() + facet_wrap(~Reference) +
ylab("Number of Alignments") + xlab("Samples") +
ggtitle("Read Mapping Numbers") +
theme(text = element_text(size = 18)) +
theme(axis.text.x = element_text(angle = 90, hjust = 0))
print(mapping_stats_plot)
# get the reads per gene counts
gene_count_matrix = gene_feature_counts$counts
# set the sample names as column names
colnames(gene_count_matrix) = gsub(".bam", "", colnames(gene_count_matrix))
# group and count genes in classes of specific read counts
read_count_classes = data.frame("no_reads"=colSums(gene_count_matrix == 0),
"at_least_1_read"=colSums(gene_count_matrix >= 1 & gene_count_matrix < 10),
"at_least_10_reads"=colSums(gene_count_matrix >= 10 & gene_count_matrix < 100),
"at_least_100_reads"=colSums(gene_count_matrix >= 100 & gene_count_matrix < 1000),
"at_least_1000_reads"=colSums(gene_count_matrix >= 1000))
# setup data.frame for plotting
read_count_classes$Sample = rownames(read_count_classes)
melt_rcc = melt(read_count_classes)
melt_rcc$variable = as.character(melt_rcc$variable)
# sort the gene groups
melt_rcc$variable = factor(melt_rcc$variable, levels=c("no_reads",
"at_least_1_read",
"at_least_10_reads",
"at_least_100_reads",
"at_least_1000_reads"))
# plot the data as bar plot
gene_coverage_plot <- ggplot(melt_rcc, aes(x=Sample, y=value, fill=variable)) +
geom_bar(stat="identity") +
ylab("Number of Genes") + xlab("Samples") +
ggtitle("Number of Reads per Gene") +
guides(fill=guide_legend(title="Number of assigned reads")) +
theme(text = element_text(size = 18)) +
theme(axis.text.x = element_text(angle = 90, hjust = 0))
print(gene_coverage_plot)

Figure 3: Read mapping overview and statistics. Most samples show a high number of assigned reads (left side, brown) and a low number of rRNA reads (right side, blue). Sample s2_r3 has an unusually high amount of multi-mapped reads (left side, green) corresponding to a high rRNA read number (right side). Sample s2_r4 shows a high number of unmapped reads in combination with an expected rRNA read number, suggestive of contamination with reads from another organism. Please click here to view a larger version of this figure.
4. Replicate quality.
# load the pheatmap library
library(pheatmap)
# compute TPM values
Length_kb = gene_feature_counts$annotation$Length / 1000 # get gene lengths in kb
RPK = gene_feature_counts$counts / Length_kb # normalize read counts by gene length
scaling_factors = colSums(RPK) / 1e6 # get scaling factor based on the sum normalized read counts
TPM = RPK / scaling_factors # scale the normalized read counts to 1e6
# plot the sample/replicate correlation heatmap
# the pearson correlation is computed on the log2 transformed TPM values
pheatmap(cor(log2(TPM+1)), fontsize = 18, main="Sample Correlation Heatmap")

Figure 4: Gene read count classification. All genes in a sample are classified into one of five categories based on the number of assigned reads. Shown in red are the number of genes without any reads assigned. Samples with a low number of total assigned reads (s2_r1 to s1_r4) have a higher number of genes with 10 to 100 assigned reads and a lower number of genes with more than 1000 reads. Genes with low expression may be missed in these samples. Please click here to view a larger version of this figure.
The pipeline is completely implemented as an R script and has been tested on Linux and Mac OS operating systems. Windows users can use Windows Subsystem for Linux (WSL). The code and test data are available as a GitHub repository: https://github.com/oliverrupp/rup. The sequencing data are available under the ENA EBI project PRJEB96400.
Ten samples were created artificially from two real samples to exemplify diverse problems that may be encountered during quality control of bulk RNA-seq using Rup for analysis. Sample s2_r1 was designed to exhibit a low total read number, sample s2_r2 contains a large fraction of low-quality reads to be discarded by the trimming process. Sample s2_r3 includes a large fraction of rRNA reads and sample s2_r4 includes contaminant reads that failed to map to the reference genome. The names of samples s1_r5 and s2_r5 were swapped to illustrate low replicate correlation.
Protocol section 2 enables the identification of samples with low read numbers prior to or following trimming (Figure 2). The bar plot shows the lower read number in sample s1_r1 both before and after trimming. Here, the initial read number was low. The low read number of sample s1_r2 after trimming suggests a large amount of adapter sequences, sequencing errors, primer sequences, poly-A/T stretches sequenced that were removed during the trimming process. Degradation of input RNA may also reduce the number of high-quality reads.
Protocol section 3 identifies problems in read assignments. Figure 3 depicts the elevated number of multi-mapped reads in sample s2_r3 (green), as well as the high number of rRNA reads. Contamination of sample s2_r4 is apparent by the large fraction of reads not mapped to the reference genome (pink). These reads do not correlate to rRNA but to sequences from a non-target organism. Figure 4 shows general problems associated with low numbers of assigned reads in transcriptomes. In samples with a low rate of reads mapped uniquely to the reference genome (s2_r1 to r4) only about a third of the genes have more than 100 reads assigned whereas in the other samples, about half of the genes fall into these classes. Reads of genes with very low expression may not be found in the samples s2_r1 to r4, and consequently, comparative expression analysis with these samples will be highly unreliable and should be avoided.
Protocol section 4 can be used to identify replicate outliers. Replicates of the same sample/condition/tissue are expected to show a higher correlation between each other than to replicates from other samples/conditions/tissues. Figure 5 shows a correlation heatmap of the two samples (S1 and S2) with five replicates each. The dendrograms on top and to the side of the plot show two clusters with five replicates in each cluster. The left cluster contains four replicates of sample 1 and one replicate of sample 2 (s2_r5), the right cluster contains four replicates of sample 2 and one replicate of sample 1 (s1_r5). In this case, when the sample names s1_r5 and s2_r5 are swapped back again, each cluster contains all replicates of one sample, which may indicate a replicate labeling mistake. Other reasons for replicates not clustering together could be a lack of differentiation between the samples/conditions/tissues, or clustering of replicates for sequencing quality reasons only. The latter may occur when all replicates with exceptionally low or those with exceptionally high read counts after trimming and mapping form a cluster.

Figure 5: Sample correlation heatmap. The sample correlation heatmap is based on log2 transformed TPM values and shows two distinct clusters of five samples each. Biological/technical replicates are expected to show a higher correlation between each other than those from other tissues/treatments. The left cluster contains four replicates of sample 1 and one replicate of sample 2 (s2_r5), the right cluster contains four replicates of sample 2 and one replicate of sample 1 (s1_r5). The single replicates should be checked for possible sample swapping or batch effects to explain their clustering in the heatmap. Please click here to view a larger version of this figure.
Table 1: Comparison of RNA-seq quality assessment pipelines. For a detailed analysis, see Supplemental File 1. Please click here to download this Table.
Supplemental File 1: Parameter and reference selection. The analyses show the influence of parameter and reference selection on the overall QC results. Please click here to download this File.
The quality of differential gene expression analyses highly depends on two factors: the number of reads sequenced in each sample and replicate16 and the number of replicates per sample17,18. Here, we present the user-friendly Rup pipeline to determine the number of reads suitable for gene expression quantification in each replicate. Different metrics allow researchers to understand why replicates show low assigned read numbers and to identify problems in sample correlation. Although Rup was developed for quality assessment of plant RNA-seq samples, it is equally suitable for other eukaryotic organisms, requiring no further adjustments for non-plant samples (Supplemental File 1).
The first step of Rup determines the total read count before and after read trimming. The number of sequenced reads determines the number of detectable genes, and if the read number is too low, many differentially expressed genes will remain undetected. A large fraction of reads discarded during the quality trimming and filtering process might indicate RNA degradation, which could lead to an underestimation of expression for highly degraded genes. However, whether a sample can be used for downstream applications depends on the target organism and the aim of the study. For example, to capture the expression of most plant genes, in our experience, 30 million to 50 million reads are required, whereas for fungi, only 10 million might be sufficient. Thus, Rup will not define quality thresholds for exclusion of problematic samples but rather provide metrics to identify diverse problems that may be encountered.
The second step (mapping quality) evaluates the accuracy and reliability of read alignment to the reference genome, specifying their suitability for downstream analyses. Not all sequenced reads can be used for gene abundance computation; reads that fail to align to the genome or transcriptome or reads that map to multiple locations on the genome are ignored in most cases19 and do not contribute to the overall read counts. The results of the second step of the pipeline can be used to understand why reads are not used in the abundance computation. A high number of unmapped reads could indicate contamination during RNA extraction (e.g., with plant pathogens or herbivores) or an incomplete reference genome. Incomplete gene models of the reference genome may result in a high number of “no feature” reads. A large fraction of multi-mapped reads could be caused by a high number of rRNA genes in the library indicating insufficient rRNA removal during the sequencing library preparation process (in the case that the multi-mapped reads are truly derived from rRNAs, they can be analyzed further by providing an rRNA annotation file to the pipeline, which then automatically computes the number of possible rRNA reads).
The third, equally important quality metric is the correlation between replicates of the same sample. Generally, the correlation between replicates of the same sample should be higher than the correlation between replicates of different samples. A low correlation between replicates could indicate large biological variability between the replicates, high similarity between the samples/conditions/tissues, other batch effects, or even sample swapping or mislabeling. Rup computes pair-wise correlations between all samples and produces a clustered heatmap of the sample correlations. Additionally, a principal component analysis (PCA) can be computed on the samples to identify possible batch effects that need to be acknowledged in further downstream analyses. Although batch effects can be corrected in downstream analyses, it might be better to remove problematic samples if the measured difference is too high.
Input material has a direct impact on sequencing quality: tissue sampling, RNA extraction, and library preparation are critical steps for minimizing RNA-seq quality issues. Consistent conditions should be maintained, if possible, for example, by using growth chambers. Pest control measures should be carried out in a timely manner, as only healthy individuals should be selected for sampling. It is further advisable to collect samples on the same day and/or at the same time to reduce circadian transcriptome variation. Numerous RNA extraction kits are available, and selecting an appropriate kit for the target species can improve mRNA quality. The library preparation protocols should include steps for polyA+ RNA enrichment to minimize the rRNA fraction.
Rup can be used as an initial quality control step in differential gene expression analysis. This quality control is required for several critical reasons, as it ensures the quality of the input data (identifies low-quality replicates/samples, can set quality thresholds for read depth and mapping rates) and identifies technical issues such as sequencing errors, batch effects, or mislabeling. If the RNA input is of sufficient quality, Rup can assist by trimming low-quality reads or identifying mislabeled samples. However, Rup cannot compensate for poor input RNA quality. In cases of low-quality RNA or erroneous sequencing data, it may be necessary to re-collect samples, adjust the RNA extraction protocol, and/or repeat sequencing. The same applies to samples with a large fraction of unmapped reads that may have been caused by contamination. However, from our experience, not all RNA extractions can be simply repeated due to a lack of material availability. In this case, they are unsuited for some downstream applications: samples with low numbers of single-mapped reads should not be analyzed in differential gene expression analyses, but they still hold information for the analysis of transcript presence. In this case, the absence of transcripts in tissues/treatments/conditions cannot be considered for the analysis and quantification of transcript abundance.
Rup includes some limitations. For example, it does not directly test for RNA degradation as direct measurements of RNA integrity are required before library preparation and sequencing. However, a script to identify RNA degradation using the RSeQC modules geneBodyCoverage.py and tin.py is included in the repository of this pipeline. Furthermore, Rup does not test for GC content and transcript length bias. The reference genome size is currently limited to 4 Gb, and according to our experience, the read mapping module overestimates multi-mapped reads in polyploids, as exemplified in the Supplemental File in a comparison of read mapping to a haploid vs. the diploid genome versions of E. californica, a haploid genome is thus the preferred input for this pipeline. The Rup output quality largely depends on the quality of the reference genome and annotation, such that an incomplete or highly fragmented genome sequence could lead to an overestimation of unmapped reads. Moreover, incomplete gene model annotation leads to an overestimation of unassigned reads. Completeness and duplication of the genome sequence and the gene annotation can be inferred with tools like BUSCO20.
Rup is a stand-alone tool for all initial quality control steps essential in RNA-seq experiments. Unlike RSeQC and RNA-SeQC, preprocessing of the raw sequencing reads is not required. Sequencing quality analysis cannot be done with RNA-SeQC, and sample correlation heatmaps are not calculated by RSeQC and RNA-SeQC (Table 1). Further, the normalization output is in FPKM in RSeQC and output visualization is not implemented as a standard for all modules in RSeQC and RNA-SeQC. Thus, several quality control issues are not tested in the two alternative tools, which include low read number, high fraction of trimmed reads and the identification of replicate outliers. A comparison of Rup, RSeQC and RNA-SeQC is available in Supplemental File 1. Moreover, Rup can be used complementary to RNA-SeQC or RSeQC, for example, the sorted BAM files produced by this pipeline can be used as input for these tools.
Currently, Rup is optimized for a small number of samples, but the most time-consuming steps, like quality trimming and read mapping, can be precomputed, for example, on a computer cluster or a cloud infrastructure. For genomes larger than 4 Gb, this is mandatory, since the Rsubread aligner is limited to genomes smaller than 4 Gb. rRNA annotation is done with barrnap, which identifies the highly conserved rRNA genes. According to our experience, rRNA gene annotation does not require comprehensiveness, because even if some abnormal rRNA genes are missing, a gross overview of rRNA presence in the dataset is sufficient for dataset quality assessment. Future versions of Rup might switch to a different mapping method, like the pseudo-alignment tool salmon21, to decrease runtime. Moreover, more normalization and correction methods, like GC bias or length bias correction, could be added to Rup.
In summary, Rup provides essential information to ensure the reliability and reproducibility of the RNA-seq data analysis. This pipeline comprehensively reports the main RNA-seq quality metrics and produces output files for direct use in downstream analyses. It was designed as a stand-alone tool for researchers with minimal bioinformatics knowledge to assess their primary sequencing data quality with intuitive visualization.
The authors have no conflicts of interest to declare.
We acknowledge technical assistance by the Bioinformatics Core Facility at the professorship of Systems Biology at JLU Giessen and provision of compute resources and general support by the BiGi service center (BMFB grant 031A533) within the de.NBI network. The work presented here was funded by the German Research Foundation (DFG) grant BE2547/24-1 to A.B., and we are also thankful for the support of the Justus Liebig University Giessen, Germany.
| fastqcr | R | 0.1.3 | |
| FUJITSU ESPRIMO D958 Desktop | FUJITSU | the pipeline was developed and tested on Ubuntu Linux 24.02 (32 Gb RAM) | |
| getopt | R | 1.20.4 | |
| ggplot2 | R | 3.5.2 | |
| MacBook Pro | Apple | the pipeline was tested on maxOS 15.4.1 (16 Gb RAM) | |
| pheatmap | R | 1.0.13 | |
| R | 4.4.3 | ||
| reshape2 | R | 1.4.4 | |
| rfastp | bioconductor | 1.16.0 | |
| rsamtools | bioconductor | 2.22.0 | |
| rsubread | bioconductor | 2.20.0 |