-1::1
Simple Hit Counter
Skip to content

Products

Solutions

×
×
Sign In

EN

EN - EnglishCN - 简体中文DE - DeutschES - EspañolKR - 한국어IT - ItalianoFR - FrançaisPT - Português do BrasilPL - PolskiHE - עִבְרִיתRU - РусскийJA - 日本語TR - TürkçeAR - العربية
Sign In Start Free Trial

RESEARCH

JoVE Journal

Peer reviewed scientific video journal

Behavior
Biochemistry
Bioengineering
Biology
Cancer Research
Chemistry
Developmental Biology
View All
JoVE Encyclopedia of Experiments

Video encyclopedia of advanced research methods

Biological Techniques
Biology
Cancer Research
Immunology
Neuroscience
Microbiology
JoVE Visualize

Visualizing science through experiment videos

EDUCATION

JoVE Core

Video textbooks for undergraduate courses

Analytical Chemistry
Anatomy and Physiology
Biology
Cell Biology
Chemistry
Civil Engineering
Electrical Engineering
View All
JoVE Science Education

Visual demonstrations of key scientific experiments

Advanced Biology
Basic Biology
Chemistry
View All
JoVE Lab Manual

Videos of experiments for undergraduate lab courses

Biology
Chemistry

BUSINESS

JoVE Business

Video textbooks for business education

Accounting
Finance
Macroeconomics
Marketing
Microeconomics

OTHERS

JoVE Quiz

Interactive video based quizzes for formative assessments

Authors

Teaching Faculty

Librarians

K12 Schools

Products

RESEARCH

JoVE Journal

Peer reviewed scientific video journal

JoVE Encyclopedia of Experiments

Video encyclopedia of advanced research methods

JoVE Visualize

Visualizing science through experiment videos

EDUCATION

JoVE Core

Video textbooks for undergraduates

JoVE Science Education

Visual demonstrations of key scientific experiments

JoVE Lab Manual

Videos of experiments for undergraduate lab courses

BUSINESS

JoVE Business

Video textbooks for business education

OTHERS

JoVE Quiz

Interactive video based quizzes for formative assessments

Solutions

Authors
Teaching Faculty
Librarians
K12 Schools

Language

English

EN

English

CN

简体中文

DE

Deutsch

ES

Español

KR

한국어

IT

Italiano

FR

Français

PT

Português do Brasil

PL

Polski

HE

עִבְרִית

RU

Русский

JA

日本語

TR

Türkçe

AR

العربية

    Menu

    JoVE Journal

    Behavior

    Biochemistry

    Bioengineering

    Biology

    Cancer Research

    Chemistry

    Developmental Biology

    Engineering

    Environment

    Genetics

    Immunology and Infection

    Medicine

    Neuroscience

    Menu

    JoVE Encyclopedia of Experiments

    Biological Techniques

    Biology

    Cancer Research

    Immunology

    Neuroscience

    Microbiology

    Menu

    JoVE Core

    Analytical Chemistry

    Anatomy and Physiology

    Biology

    Cell Biology

    Chemistry

    Civil Engineering

    Electrical Engineering

    Introduction to Psychology

    Mechanical Engineering

    Medical-Surgical Nursing

    View All

    Menu

    JoVE Science Education

    Advanced Biology

    Basic Biology

    Chemistry

    Clinical Skills

    Engineering

    Environmental Sciences

    Physics

    Psychology

    View All

    Menu

    JoVE Lab Manual

    Biology

    Chemistry

    Menu

    JoVE Business

    Accounting

    Finance

    Macroeconomics

    Marketing

    Microeconomics

Start Free Trial
Loading...
Home
JoVE Journal
Biology
Rup (RNA-seq Usability Assessment Pipeline) – Quality Control for Bulk RNA-seq Experiments ...

Research Article

Rup (RNA-seq Usability Assessment Pipeline) – Quality Control for Bulk RNA-seq Experiments in Eukaryotes

DOI: 10.3791/69253

November 7, 2025

Oliver Rupp1, Le-Han Roessner2, Doudou Kong2, Annette Becker2

1Bioinformatics and Systems Biology,Justus Liebig University, 2Institute of Botany,Justus Liebig University

Cite Watch Download PDF Download Material list

In This Article

Summary Abstract Introduction Protocol Representative Results Discussion Disclosures Acknowledgements Materials References Reprints and Permissions

Erratum Notice

Important: There has been an erratum issued for this article. View Erratum Notice

Retraction Notice

The article Assisted Selection of Biomarkers by Linear Discriminant Analysis Effect Size (LEfSe) in Microbiome Data (10.3791/61715) has been retracted by the journal upon the authors' request due to a conflict regarding the data and methodology. View Retraction Notice

Summary

This protocol allows initial quality control for RNA-seq experiments for wet-lab biologists with limited bioinformatics experience.

Abstract

Modern approaches in molecular plant science often require bulk RNA-seq experiments, for example, to track global changes in transcriptomes upon treatments or to identify key components of regulatory pathways. Consequently, diverse areas of plant sciences rely on high-quality and reproducible bulk RNA-seq data for scientific advancement. However, from our experience, knowledge about and application of quality control measures in RNA-seq datasets are often lacking. Here, we introduce Rup (RNA-seq usability assessment pipeline) for the quality control of bulk RNA-seq data, for subsequent gene expression analyses, that is stand-alone and readily applicable for wet-lab biologists with basic knowledge of R. Rup helps to discriminate between sequencing data of high-quality, suitable for downstream gene expression experiments, and those unsuitable for general further analysis. Rup includes tests for several commonly encountered problems such as insufficient read numbers or mapping, identification of contaminations, quantification of rRNA fractions in the total RNA-seq data, replicate similarity testing, using real data for demonstration and offering intuitive visualization. Rup provides a suite of tools to identify experimental shortcomings before standardized transcriptome analysis, thereby improving data quality for individual researchers and the field. This enhances confidence in bulk RNA-seq data analysis and provides a foundation for future guidelines defining minimum quality control criteria, thus improving the reliability and transparency of published RNA-seq data. Rup and test data are available at https://github.com/oliverrupp/rup.

Introduction

Transcriptomics (RNA-seq) experiments comprehensively interrogate transcriptional signatures shaping phenotypes. This approach became irreplaceable in plant molecular genetics to identify single or co-expressed genes and biological processes involved, for example, in development, plant pathogen interaction, or abiotic stress resistance1, 2 ,3. Recent advancements in RNA-seq technology have increased specificity and enabled the detection of different isoforms and variants at single-base resolution, allowing the identification of sequence variations from larger indels to single-nucleotide polymorphisms (SNPs). The data obtained by RNA-seq are characterized by a broad dynamic range, facilitating the detection of both highly abundant and lowly expressed transcripts, and require appropriate experimental conditions for consistency. In addition, RNA-seq datasets are large and thus computationally intensive to analyze and store, entailing efficient data management and extensive computing resources. RNA-seq requires high-quality input at every step of the workflow, as weaknesses in any stage can propagate and compromise the results. Poor RNA integrity or technical issues during library preparation will lead to bias and a reduction of accuracy. Parameters for obtaining high-quality raw reads can vary between next-generation sequencing (NGS) facilities. 

According to our experience, we recommend using only RNA with an RNA integrity number (RIN) above 7, which is indicative of largely intact mRNA structure, as input for RNA-seq library preparation. For successful sequencing, approximately 2 µg of total RNA at a concentration of 50–200 ng/µL is required for standard library preparation protocols. RNA purity should be confirmed by an OD260/280 ratio between 1.8 and 2.1 and an OD260/230 ratio greater than 1.5 using a spectrophotometer. Insufficient sequencing depth, low mapping rates, or misalignment to the reference genome may further distort gene expression and splicing quantification. Moreover, inadequate experimental design, such as a lack of discrimination between samples of different tissues or treatments and high variability between replicates, can introduce noise and diminish reproducibility. While many labs analyze transcriptomes routinely, stringent quality controls of the first essential analysis steps are often not reported or may be entirely absent from publications. This may lead to over-interpretation of results derived from transcriptome analysis and, consequently, to irreproducible results. 

Here, we provide a workflow for quality control of the initial steps required for high-quality transcriptome analyses of mRNA profiles to measure transcriptional alterations. Our aim is to enable wet-lab biologists with limited knowledge in bioinformatics to evaluate their primary transcriptomics data. Rup (Figure 1) is accessible to researchers who are familiar with basic knowledge of R. Executing the workflow presented here will provide researchers with a detailed understanding of their primary data, including their potential limitations for subsequent analysis. To our knowledge, a practical primary transcriptome data evaluation pipeline in combination with guidelines for distinguishing high- from low-quality data is lacking so far. 

Figure 1
Figure 1: Workflow of the RNA-seq in silico quality control pipeline. The input files for quality control are derived from RNA-seq data generated by sequencing plant material, as well as publicly available datasets. The provided R pipeline assesses sequence quality through three main approaches: sequencing quality, mapping quality, and replicate quality. Various quality metrics are computed, and statistical results are visualized using common R packages such as ggplot2 and pheatmap (e.g., bar charts and heatmaps). Please click here to view a larger version of this figure.

Previously reported evaluation pipelines required pre-processed data (e.g., read alignment as .bam files), do not cover all metrics, or are no longer maintained4,5,6. The advantage of the workflow presented here lies in its comprehensiveness by consolidating the most common quality control issues into a single pipeline. Further, we provide examples for high-quality data suitable for all downstream analysis applications, but also examples for low-quality data, and discuss their specific limitations for further analysis. Several previously published reports describe the purpose of RNA-seq analysis tools and provide comparative evaluations of their performance7,8. However, RNA-seq quality control and methodology reporting standards are not standardized and aggravate reproducibility and biologically meaningful interpretations of transcriptomics experiments. 

Rup integrates standard high-quality tools, quality control analysis, and visualization of the results. Rup requires raw sequencing reads and an annotated genome as input and runs on Mac OS, Linux, and “Windows Subsystem for Linux” (WSL) on Windows systems. It integrates tools for the quality of the sequencing and quantifies read mapping to a genome, includes rRNA content measurement in the samples, and offers sample correlation to evaluate replicate correlation. The test data were obtained using laser microdissection of the central meristem tissue of two different stages of the plant species Eschscholzia californica, an ultra-low input protocol for library preparation, and sequenced on Novaseq 6000. 

Protocol

Rup can be run as a single script with minimum input files required. The pipeline, as shown here, requires Illumina paired-end RNA sequencing data. An adjusted pipeline for single-end sequencing with the same analysis steps is also deposited in the GitHub repository. Only the genome sequence as a fasta file, the gene model and rRNA annotations as gtf files, and the raw sequencing reads as fastq.gz files are needed. Make sure that the genome and annotation files are provided as genome.fa, annotation.gtf, and rRNA.gtf in a folder specified in the reference_folder variable. When all sequencing files as .fq.gz files are saved in the read_file_folder variable, Rup can be run as follows:

1. Preparations:

  1. Install all necessary R packages using Bioconductor.

    BiocManager::install(c("getopt", "ggplot2", "reshape2", "pheatmap", "fastqcr", "Rfastp", "Rsubread", "Rsamtools"))

  2. Set up all necessary files.
    1. Create a source folder that will contain all necessary files. Add the reference genome sequence in fasta file called “reference/genome.fa” to the source folder. Provide the gene model annotation as a gtf file: “reference/annotation.gtf”, Optionally, provide the rRNA gene annotation as gtf file: “reference/rRNA.gtf”.
    2. Add all sequencing reads as fastq files to the “reads” folder. Ensure that all fastq file names follow the same pattern; <SAMPLENAME>_1.fastq.gz and <SAMPLENAME>_2.fastq.gz for the forward and reverse reads.

      # define source folder and all derived subfolders for input and output files
      source_folder <- "data"
      reference_folder <- file.path(source_folder, "reference")
      read_file_folder <- file.path(source_folder, "reads")
      results_folder <- file.path(source_folder, "results")

      # define input reference files
      genome_fasta_file <- file.path(reference_folder, "genome.fa")
      annotation_file <- file.path(reference_folder, "annotation.gtf")
      rrna_file <- file.path(reference_folder, "rRNA.gtf")

      # define result subfolders
      fastqc_folder <- file.path(results_folder, "fastqc")
      trimmed_fastqc_folder <- file.path(results_folder, "trimmed_fastqc")
      trimmed_read_folder <- file.path(results_folder, "trimmed")
      bam_folder <- file.path(results_folder, "bam")
      sorted_bam_folder <- file.path(results_folder, "sorted_bam")
      counts_folder <- file.path(results_folder, "counts")

      # create results folders
      for(folder in c(results_folder, fastqc_folder, trimmed_fastqc_folder, trimmed_read_folder, bam_folder, sorted_bam_folder, counts_folder)) {
       if(!dir.exists(folder)) {
        dir.create(folder)
       }
      }

      # get sample prefixes from input fastq files
      fastq_files <- list.files(read_file_folder, pattern = "*_1.f(ast)?q.gz")
      sample_prefixes <- gsub("_1.f(ast)?q.gz", "", fastq_files)

  3. Set the parameter according to the computing resources.

    n_threads <- 8 # the number of available CPU cores
    bamSortMemory <- "1024" # maximum memory for bam file sorting

  4. Set the parameter according to the sequencing method.
     

    MinReadLength <- 25 # minimum read length, should not be less than 25
    minFragLength <- 0 # minimum fragment length distribution
    maxFragLength <- 300 # maximum fragment length distribution
    orientation <- "fr" # read orientation for read mapping ("fr", "rf", "ff")
    stranded <- 0 # stranded sequencing
             # (0 (unstranded), 1 (stranded) and 2 (reversely stranded))



    NOTE: The minimum read length should not be smaller than 25 bp, because smaller reads may cause the mapping to break. Larger values will result in more uniquely mapped reads, but also in more discarded reads during trimming, depending on the sequencing quality. More details on these parameters are described in the supplements.

2. Sequencing quality assessment

NOTE: This step will create a bar plot showing the number of reads before and after trimming each sample.

  1. Run fastqc9, which is a tool for general quality control of high-throughput sequence data, providing information on the total number of sequenced reads, the read length, mean GC content, sequencing adapter contamination, and overrepresented sequences (e.g., rRNA reads). Analyze the sequencing quality as per base and per sequence quality reports. Run fastqc on the raw sequencing files using the R package fastqcr10 and place all input files in the same directory (read_file_folder). Summarize the output files from fastqc_folder into an R object using the function qc_aggregate.
    NOTE: The results will be saved in the directory fastqc_folder. The .html files created by fastqc in the output folder can be accessed for additional statistics.
     

    # load the fastqc library
    library(fastqcr)
    # run fastqc on all raw fastq files
    fastqc(fq.dir=read_file_folder, qc.dir=fastqc_folder, threads=n_threads)

    # aggregate the fastqc statistics
    qc <- qc_aggregate(fastqc_folder, progress=F)
    qc$tot.seq <- as.numeric(qc$tot.seq)
    # remove suffixes from sample names
    qc$sample = gsub(".f(ast)?q.gz", "" , qc$sample)

  2. With all sample prefixes stored in the vector sample_prefixes, iterate over the prefixes and run the quality trimming tool fastp on all samples. Trim to remove adapter sequences, primers, low-quality bases, poly-A/T sequences and is executed by Fastp11, which is available in the R package Rfastp12. The trimmed fastq files will be stored in the folder trimmed_read_folder.
     

    # load the rfastp library
    library(Rfastp)

    # iterate over sample prefixes
    for(prefix in sample_prefixes) {
     # create output prefix
     # !!! Rfastp automatically adds _R1.fastq.gz and _R2.fastq.gz to the prefix

    outputPrefix <- file.path(trimmed_read_folder, prefix)

    # get input reads based on sample prefix
    read1 = file.path(read_file_folder, paste(prefix, "_1.fastq.gz", sep=""))
    read2 = file.path(read_file_folder, paste(prefix, "_2.fastq.gz", sep=""))
    if(!file.exists(read1)) { read1 = file.path(read_file_folder, paste(prefix, "_1.fq.gz", sep="")) }
    if(!file.exists(read2)) { read2 = file.path(read_file_folder, paste(prefix, "_2.fq.gz", sep="")) }

    # run fastp trimmig with minimum read length
    fastp_stats <- rfastp(read1 = read1,
         read2 = read2,
         minReadLength = minReadLength,
         outputFastq = outputPrefix,
         thread = n_threads)
    }

  3. Run fastqc again on the trimmed reads.
     

    # run fastqc on the trimmed reads
    fastqc(fq.dir=trimmed_read_folder, qc.dir=trimmed_fastqc_folder, threads=n_threads)

    # aggreagate the fastqc statistics
    qc_trimmed <- qc_aggregate(trimmed_fastqc_folder, progress=F)
    qc_trimmed$tot.seq <- as.numeric(qc_trimmed$tot.seq)
    # correct sample names (change the fastp "R1","R2" suffix to "1","2")
    qc_trimmed$sample = gsub("_R([12])$", "_\1" , qc_trimmed$sample)

  4. Evaluate the trimming by collecting the number of reads before and after trimming for all samples. Create a bar plot with the read numbers using ggplot213.
     

    # load the ggplot2 library for plotting
    library(ggplot2)

    # add the trimming status to the fastqc results
    qc$Trimming = "raw"
    qc_trimmed$Trimming = "trimmed"

    # combine the results into one vector
    qc_all = rbind(qc, qc_trimmed)

    # plot the read number before and after trimming as barplot
    read_number_plot <- ggplot(qc_all, aes(x=sample, y=tot.seq, fill=Trimming)) +
    geom_bar(stat="identity", position = "dodge") +
    xlab("Samples") + ylab("Number of Reads") +
    theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
    theme(text = element_text(size = 18)) +
    ggtitle("Number of reads before and afer trimming")
    print(read_number_plot)

Figure 2
Figure 2: Sequencing and trimming results. Read counts before (red) and after trimming (green). Sample s2_r1 already has a low read count prior to trimming, whereas trimming removed a large fraction of sample s2_r2 reads. All other samples show acceptable loss of reads from trimming. Please click here to view a larger version of this figure.

3. Mapping quality

NOTE: This step calculates bar plots to discriminate single-mapped reads (e.g., transcripts from protein-coding genes) from multi mapped reads (e.g., rRNA reads) and unmapped reads (e.g., contaminations).

  1. Use the Rsubread14 package to map the reads to the reference genome. First, build an index of the reference genome fasta file (genome_fasta_file).
    NOTE: This step is performed once for each reference genome.
     

    # load the Rsubread library for read mapping and read counting
    library(Rsubread)


    # create a subread index from the reference genome sequence
    buildindex(file.path(reference_folder,"subread.index"), genome_fasta_file)

  2. Iterate over all samples and align the reads to the reference genome using the align() function of the Rsubread package. This function creates .bam files in the output folder.
     

    # load the Rsamtools library for bam file sorting and indexing
    library(Rsamtools)

    # iterate over sample names
    for(prefix in sample_prefixes) {
    # create the bam output file name
    output_bam = file.path(bam_folder, paste(prefix, ".bam", sep=""))

    # align the reads to the reference genome index
    mapping_stats <- align(index=file.path(reference_folder,"subread.index"),
         # create the forward and reverse read file names based on the prefix
         # "_R1.fastq.gz" and "_R2.fastq.gz" are forced by fastp

         readfile1=file.path(trimmed_read_folder, paste(prefix, "_R1.fastq.gz", sep="")),
         readfile2=file.path(trimmed_read_folder, paste(prefix, "_R2.fastq.gz", sep="")),
         output_file=output_bam, # set the output file name
         type=0, # the reads are RNA-seq
         minFragLength=minFragLength, # the minimal allowed fragment length
         maxFragLength=maxFragLength, # the minimal allowed fragment length
         PE_orientation=orientation, # read orientation
         nthreads=n_threads,
         # use a GTF annotation file to support the mapping
         useAnnotation=TRUE,
         annot.ext=annotation_file,
         isGTF=TRUE,
         nBestLocations=2) # to distinguish between unique and multi-mapping reads

    # create the sorted output file name (the .bam suffix will be added automatically)

    output_sorted_bam = file.path(sorted_bam_folder, prefix)

    # sort and index the bam file
    sortBam(output_bam, output_sorted_bam, maxMemory=bamSortMemory, nThreads=n_threads)
    indexBam(paste0(output_sorted_bam, ".bam"))
    }

  3. Count the reads for each gene using the featureCounts() function from the Rsubread package. Make sure that the annotation file (annotation_file) is in GTF format. Count only reads with a single match to the genome.
     

    # collect all bam files
    bam_files <- list.files(bam_folder, pattern = "*.bam$")

    # count the number of reads for each gene
         gene_feature_counts = featureCounts(file.path(bam_folder, bam_files),
         annot.ext = annotation_file, # the gene models
         isGTFAnnotationFile = TRUE,
         countMultiMappingReads = FALSE, # only count unique reads
         strandSpecific = stranded, # for strand specific data
         isPairedEnd = TRUE, # paired-end sequencing
    # the following paramter allow to control for false-positive read assignments

         requireBothEndsMapped = TRUE,
         checkFragLength = TRUE,
         minFragLength = minFragLength,
         maxFragLength = maxFragLength,
         nthreads = n_threads)

  4. Count the reads mapping to rRNA genes for assessing the rRNA content in the samples. Allow for multimapped reads to be counted here. Specify the rRNA gene loci in a GTF file (rrna_file).
     

    # count the reads mapping to rRNA genes
    rrna_feature_counts = featureCounts(file.path(bam_folder, bam_files),
         annot.ext = rrna_file, # the rRNA gene model file
         isGTFAnnotationFile = TRUE,
         countMultiMappingReads = TRUE, # rRNA reads usually map to multiple loci
         fraction = TRUE,
         strandSpecific = stranded,
         isPairedEnd = TRUE,
         nthreads = n_threads)

  5. Collect the read assignment statistics created by featureCounts. The statistics include the number of alignments for each sample in the different categories (e.g., Assigned, UnMapped, MultiMapped).
     

    # load the reshape2 library for data restructuring
    library(reshape2)

    # get the assignement stats
    gene_count_stats <- gene_feature_counts$stat
    # set the sample names as column names
    colnames(gene_feature_counts$counts) = gsub(".bam", "", colnames(gene_feature_counts$counts))

    # set the assignment type as row names
    rownames(gene_count_stats) <- gene_count_stats$Status
    # select the columns with the assignment statistics
    gene_count_stats <- gene_count_stats[,seq(2, ncol(gene_count_stats))]
    # set the sample names as column names
    colnames(gene_count_stats) <- gsub(".bam", "", colnames(gene_count_stats))

    # transform the dataframe for plotting
    transformed_stats <- melt(t(gene_count_stats))

    # set the column names
    colnames(transformed_stats) <- c("Sample", "Group", "Alignments")

    # adjust the groupa and sample ordering for plotting
    transformed_stats$Group <- factor(transformed_stats$Group,
              levels = rev(levels(transformed_stats$Group)[order(levels(transformed_stats$Group))]))
    transformed_stats$Sample <- factor(transformed_stats$Sample,
              levels = rev(levels(transformed_stats$Sample)[order(as.character(transformed_stats$Sample))]))

    # remove assignment classes without any alignments
    transformed_stats <- transformed_stats[transformed_stats$Alignments > 0,]
    # the statistics refer to proteins coding genes
    transformed_stats$Reference = "Genes"

  6. Collect the statistics of the rRNA assignment.
     

    # collect the rrna assignment statistics
    rrna_count_stats <- rrna_feature_counts$stat
    # set the sample names as column names
    colnames(rrna_count_stats) <- gsub(".bam", "", colnames(rrna_count_stats))

    # only the number of assigned reads are important in this case
    rrna_counts = as.data.frame(t(rrna_count_stats[rrna_count_stats$Status == "Assigned",2:ncol(rrna_count_stats)]))
    colnames(rrna_counts) = "Alignments"

    # setup names and categories for plotting
    rrna_counts$Sample = rownames(rrna_counts)
    rrna_counts$Group = "rRNA"
    rrna_counts$Reference = "rRNA"

  7. Plot the read mapping statistics as a bar plot.
     

    # combine the proteind coding and rRNA gene statistics
    mapping_stats = rbind(transformed_stats, rrna_counts)

    # plot the assignment statistics
    mapping_stats_plot = ggplot(data = mapping_stats, aes(x = Sample, y = Alignments)) +
     geom_col(aes(fill = Group), width = 0.7) +
     theme_bw() + facet_wrap(~Reference) +
     ylab("Number of Alignments") + xlab("Samples") +
     ggtitle("Read Mapping Numbers") +
     theme(text = element_text(size = 18)) +
     theme(axis.text.x = element_text(angle = 90, hjust = 0))
    print(mapping_stats_plot)

  8. Classify genes into groups according to the number of reads assigned to them and plot the results as a bar plot.
     

    # get the reads per gene counts
    gene_count_matrix = gene_feature_counts$counts
    # set the sample names as column names
    colnames(gene_count_matrix) = gsub(".bam", "", colnames(gene_count_matrix))

    # group and count genes in classes of specific read counts
    read_count_classes = data.frame("no_reads"=colSums(gene_count_matrix == 0),
         "at_least_1_read"=colSums(gene_count_matrix >= 1 & gene_count_matrix < 10),
         "at_least_10_reads"=colSums(gene_count_matrix >= 10 & gene_count_matrix < 100),
         "at_least_100_reads"=colSums(gene_count_matrix >= 100 & gene_count_matrix < 1000),
         "at_least_1000_reads"=colSums(gene_count_matrix >= 1000))

    # setup data.frame for plotting
    read_count_classes$Sample = rownames(read_count_classes)
    melt_rcc = melt(read_count_classes)
    melt_rcc$variable = as.character(melt_rcc$variable)

    # sort the gene groups
    melt_rcc$variable = factor(melt_rcc$variable, levels=c("no_reads",
             "at_least_1_read",
             "at_least_10_reads",
             "at_least_100_reads",
             "at_least_1000_reads"))

    # plot the data as bar plot
    gene_coverage_plot <- ggplot(melt_rcc, aes(x=Sample, y=value, fill=variable)) +
    geom_bar(stat="identity") +
    ylab("Number of Genes") + xlab("Samples") +
    ggtitle("Number of Reads per Gene") +
    guides(fill=guide_legend(title="Number of assigned reads")) +
    theme(text = element_text(size = 18)) +
    theme(axis.text.x = element_text(angle = 90, hjust = 0))
    print(gene_coverage_plot)

Figure 3
Figure 3: Read mapping overview and statistics. Most samples show a high number of assigned reads (left side, brown) and a low number of rRNA reads (right side, blue). Sample s2_r3 has an unusually high amount of multi-mapped reads (left side, green) corresponding to a high rRNA read number (right side). Sample s2_r4 shows a high number of unmapped reads in combination with an expected rRNA read number, suggestive of contamination with reads from another organism. Please click here to view a larger version of this figure.

4. Replicate quality.

  1. Ensure that replicates from the same sample or tissue show a higher correlation with each other than with replicates from different samples or tissue. Plot a clustered heatmap of the replicate correlations to visualize the data using either the raw read counts or the Transcripts Per Million (TPM)15 normalized counts.
     

    # load the pheatmap library
    library(pheatmap)

    # compute TPM values
    Length_kb = gene_feature_counts$annotation$Length / 1000 # get gene lengths in kb
    RPK = gene_feature_counts$counts / Length_kb # normalize read counts by gene length
    scaling_factors = colSums(RPK) / 1e6 # get scaling factor based on the sum normalized read counts
    TPM = RPK / scaling_factors # scale the normalized read counts to 1e6

    # plot the sample/replicate correlation heatmap
    # the pearson correlation is computed on the log2 transformed TPM values

    pheatmap(cor(log2(TPM+1)), fontsize = 18, main="Sample Correlation Heatmap")

Figure 4
Figure 4: Gene read count classification. All genes in a sample are classified into one of five categories based on the number of assigned reads. Shown in red are the number of genes without any reads assigned. Samples with a low number of total assigned reads (s2_r1 to s1_r4) have a higher number of genes with 10 to 100 assigned reads and a lower number of genes with more than 1000 reads. Genes with low expression may be missed in these samples. Please click here to view a larger version of this figure.

Representative Results

The pipeline is completely implemented as an R script and has been tested on Linux and Mac OS operating systems. Windows users can use Windows Subsystem for Linux (WSL). The code and test data are available as a GitHub repository: https://github.com/oliverrupp/rup. The sequencing data are available under the ENA EBI project PRJEB96400.

Ten samples were created artificially from two real samples to exemplify diverse problems that may be encountered during quality control of bulk RNA-seq using Rup for analysis. Sample s2_r1 was designed to exhibit a low total read number, sample s2_r2 contains a large fraction of low-quality reads to be discarded by the trimming process. Sample s2_r3 includes a large fraction of rRNA reads and sample s2_r4 includes contaminant reads that failed to map to the reference genome. The names of samples s1_r5 and s2_r5 were swapped to illustrate low replicate correlation.

Protocol section 2 enables the identification of samples with low read numbers prior to or following trimming (Figure 2). The bar plot shows the lower read number in sample s1_r1 both before and after trimming. Here, the initial read number was low. The low read number of sample s1_r2 after trimming suggests a large amount of adapter sequences, sequencing errors, primer sequences, poly-A/T stretches sequenced that were removed during the trimming process. Degradation of input RNA may also reduce the number of high-quality reads.

Protocol section 3 identifies problems in read assignments. Figure 3 depicts the elevated number of multi-mapped reads in sample s2_r3 (green), as well as the high number of rRNA reads. Contamination of sample s2_r4 is apparent by the large fraction of reads not mapped to the reference genome (pink). These reads do not correlate to rRNA but to sequences from a non-target organism. Figure 4 shows general problems associated with low numbers of assigned reads in transcriptomes. In samples with a low rate of reads mapped uniquely to the reference genome (s2_r1 to r4) only about a third of the genes have more than 100 reads assigned whereas in the other samples, about half of the genes fall into these classes. Reads of genes with very low expression may not be found in the samples s2_r1 to r4, and consequently, comparative expression analysis with these samples will be highly unreliable and should be avoided.

Protocol section 4 can be used to identify replicate outliers. Replicates of the same sample/condition/tissue are expected to show a higher correlation between each other than to replicates from other samples/conditions/tissues. Figure 5 shows a correlation heatmap of the two samples (S1 and S2) with five replicates each. The dendrograms on top and to the side of the plot show two clusters with five replicates in each cluster. The left cluster contains four replicates of sample 1 and one replicate of sample 2 (s2_r5), the right cluster contains four replicates of sample 2 and one replicate of sample 1 (s1_r5). In this case, when the sample names s1_r5 and s2_r5 are swapped back again, each cluster contains all replicates of one sample, which may indicate a replicate labeling mistake. Other reasons for replicates not clustering together could be a lack of differentiation between the samples/conditions/tissues, or clustering of replicates for sequencing quality reasons only. The latter may occur when all replicates with exceptionally low or those with exceptionally high read counts after trimming and mapping form a cluster.

Figure 5
Figure 5: Sample correlation heatmap. The sample correlation heatmap is based on log2 transformed TPM values and shows two distinct clusters of five samples each. Biological/technical replicates are expected to show a higher correlation between each other than those from other tissues/treatments. The left cluster contains four replicates of sample 1 and one replicate of sample 2 (s2_r5), the right cluster contains four replicates of sample 2 and one replicate of sample 1 (s1_r5). The single replicates should be checked for possible sample swapping or batch effects to explain their clustering in the heatmap. Please click here to view a larger version of this figure.

Table 1: Comparison of RNA-seq quality assessment pipelines. For a detailed analysis, see Supplemental File 1. Please click here to download this Table.

Supplemental File 1: Parameter and reference selection. The analyses show the influence of parameter and reference selection on the overall QC results. Please click here to download this File.

Discussion

The quality of differential gene expression analyses highly depends on two factors: the number of reads sequenced in each sample and replicate16 and the number of replicates per sample17,18. Here, we present the user-friendly Rup pipeline to determine the number of reads suitable for gene expression quantification in each replicate. Different metrics allow researchers to understand why replicates show low assigned read numbers and to identify problems in sample correlation. Although Rup was developed for quality assessment of plant RNA-seq samples, it is equally suitable for other eukaryotic organisms, requiring no further adjustments for non-plant samples (Supplemental File 1).

The first step of Rup determines the total read count before and after read trimming. The number of sequenced reads determines the number of detectable genes, and if the read number is too low, many differentially expressed genes will remain undetected. A large fraction of reads discarded during the quality trimming and filtering process might indicate RNA degradation, which could lead to an underestimation of expression for highly degraded genes. However, whether a sample can be used for downstream applications depends on the target organism and the aim of the study. For example, to capture the expression of most plant genes, in our experience, 30 million to 50 million reads are required, whereas for fungi, only 10 million might be sufficient. Thus, Rup will not define quality thresholds for exclusion of problematic samples but rather provide metrics to identify diverse problems that may be encountered.

The second step (mapping quality) evaluates the accuracy and reliability of read alignment to the reference genome, specifying their suitability for downstream analyses. Not all sequenced reads can be used for gene abundance computation; reads that fail to align to the genome or transcriptome or reads that map to multiple locations on the genome are ignored in most cases19 and do not contribute to the overall read counts. The results of the second step of the pipeline can be used to understand why reads are not used in the abundance computation. A high number of unmapped reads could indicate contamination during RNA extraction (e.g., with plant pathogens or herbivores) or an incomplete reference genome. Incomplete gene models of the reference genome may result in a high number of “no feature” reads. A large fraction of multi-mapped reads could be caused by a high number of rRNA genes in the library indicating insufficient rRNA removal during the sequencing library preparation process (in the case that the multi-mapped reads are truly derived from rRNAs, they can be analyzed further by providing an rRNA annotation file to the pipeline, which then automatically computes the number of possible rRNA reads).

The third, equally important quality metric is the correlation between replicates of the same sample. Generally, the correlation between replicates of the same sample should be higher than the correlation between replicates of different samples. A low correlation between replicates could indicate large biological variability between the replicates, high similarity between the samples/conditions/tissues, other batch effects, or even sample swapping or mislabeling. Rup computes pair-wise correlations between all samples and produces a clustered heatmap of the sample correlations. Additionally, a principal component analysis (PCA) can be computed on the samples to identify possible batch effects that need to be acknowledged in further downstream analyses. Although batch effects can be corrected in downstream analyses, it might be better to remove problematic samples if the measured difference is too high.

Input material has a direct impact on sequencing quality: tissue sampling, RNA extraction, and library preparation are critical steps for minimizing RNA-seq quality issues. Consistent conditions should be maintained, if possible, for example, by using growth chambers. Pest control measures should be carried out in a timely manner, as only healthy individuals should be selected for sampling. It is further advisable to collect samples on the same day and/or at the same time to reduce circadian transcriptome variation. Numerous RNA extraction kits are available, and selecting an appropriate kit for the target species can improve mRNA quality. The library preparation protocols should include steps for polyA+ RNA enrichment to minimize the rRNA fraction.

Rup can be used as an initial quality control step in differential gene expression analysis. This quality control is required for several critical reasons, as it ensures the quality of the input data (identifies low-quality replicates/samples, can set quality thresholds for read depth and mapping rates) and identifies technical issues such as sequencing errors, batch effects, or mislabeling. If the RNA input is of sufficient quality, Rup can assist by trimming low-quality reads or identifying mislabeled samples. However, Rup cannot compensate for poor input RNA quality. In cases of low-quality RNA or erroneous sequencing data, it may be necessary to re-collect samples, adjust the RNA extraction protocol, and/or repeat sequencing. The same applies to samples with a large fraction of unmapped reads that may have been caused by contamination. However, from our experience, not all RNA extractions can be simply repeated due to a lack of material availability. In this case, they are unsuited for some downstream applications: samples with low numbers of single-mapped reads should not be analyzed in differential gene expression analyses, but they still hold information for the analysis of transcript presence. In this case, the absence of transcripts in tissues/treatments/conditions cannot be considered for the analysis and quantification of transcript abundance.

Rup includes some limitations. For example, it does not directly test for RNA degradation as direct measurements of RNA integrity are required before library preparation and sequencing. However, a script to identify RNA degradation using the RSeQC modules geneBodyCoverage.py and tin.py is included in the repository of this pipeline. Furthermore, Rup does not test for GC content and transcript length bias. The reference genome size is currently limited to 4 Gb, and according to our experience, the read mapping module overestimates multi-mapped reads in polyploids, as exemplified in the Supplemental File in a comparison of read mapping to a haploid vs. the diploid genome versions of E. californica, a haploid genome is thus the preferred input for this pipeline. The Rup output quality largely depends on the quality of the reference genome and annotation, such that an incomplete or highly fragmented genome sequence could lead to an overestimation of unmapped reads. Moreover, incomplete gene model annotation leads to an overestimation of unassigned reads. Completeness and duplication of the genome sequence and the gene annotation can be inferred with tools like BUSCO20.

Rup is a stand-alone tool for all initial quality control steps essential in RNA-seq experiments. Unlike RSeQC and RNA-SeQC, preprocessing of the raw sequencing reads is not required. Sequencing quality analysis cannot be done with RNA-SeQC, and sample correlation heatmaps are not calculated by RSeQC and RNA-SeQC (Table 1). Further, the normalization output is in FPKM in RSeQC and output visualization is not implemented as a standard for all modules in RSeQC and RNA-SeQC. Thus, several quality control issues are not tested in the two alternative tools, which include low read number, high fraction of trimmed reads and the identification of replicate outliers. A comparison of Rup, RSeQC and RNA-SeQC is available in Supplemental File 1. Moreover, Rup can be used complementary to RNA-SeQC or RSeQC, for example, the sorted BAM files produced by this pipeline can be used as input for these tools.

Currently, Rup is optimized for a small number of samples, but the most time-consuming steps, like quality trimming and read mapping, can be precomputed, for example, on a computer cluster or a cloud infrastructure. For genomes larger than 4 Gb, this is mandatory, since the Rsubread aligner is limited to genomes smaller than 4 Gb. rRNA annotation is done with barrnap, which identifies the highly conserved rRNA genes. According to our experience, rRNA gene annotation does not require comprehensiveness, because even if some abnormal rRNA genes are missing, a gross overview of rRNA presence in the dataset is sufficient for dataset quality assessment. Future versions of Rup might switch to a different mapping method, like the pseudo-alignment tool salmon21, to decrease runtime. Moreover, more normalization and correction methods, like GC bias or length bias correction, could be added to Rup.

In summary, Rup provides essential information to ensure the reliability and reproducibility of the RNA-seq data analysis. This pipeline comprehensively reports the main RNA-seq quality metrics and produces output files for direct use in downstream analyses. It was designed as a stand-alone tool for researchers with minimal bioinformatics knowledge to assess their primary sequencing data quality with intuitive visualization.

Disclosures

The authors have no conflicts of interest to declare.

Acknowledgements

We acknowledge technical assistance by the Bioinformatics Core Facility at the professorship of Systems Biology at JLU Giessen and provision of compute resources and general support by the BiGi service center (BMFB grant 031A533) within the de.NBI network. The work presented here was funded by the German Research Foundation (DFG) grant BE2547/24-1 to A.B., and we are also thankful for the support of the Justus Liebig University Giessen, Germany.

Materials

fastqcrR0.1.3
FUJITSU ESPRIMO D958 DesktopFUJITSUthe pipeline was developed and tested on Ubuntu Linux 24.02 (32 Gb RAM)
getoptR1.20.4
ggplot2R3.5.2
MacBook Pro Applethe pipeline was tested on maxOS 15.4.1 (16 Gb RAM)
pheatmapR1.0.13
R4.4.3
reshape2R1.4.4
rfastpbioconductor1.16.0
rsamtoolsbioconductor2.22.0
rsubreadbioconductor2.20.0

References

  1. Kivivirta, K. I., Herbert, D., Roessner, C., De Folter, S., Marsch-Martinez, N., Becker, A. Transcriptome analysis of gynoecium morphogenesis uncovers the chronology of gene regulatory network activity. Plant Physiol. 185 (3), 1076-1090 (2021).
  2. Hohenfeld, C. S., et al. Comparative analysis of infected cassava root transcriptomics reveals candidate genes for root rot disease resistance. Sci Rep. 14 (1), 10587 (2024).
  3. Li, Q., et al. Time-course transcriptomic information unravels the mechanisms of improved drought tolerance by drought-priming in wheat. J Integr Agric. 24 (8), 2902-2919 (2024).
  4. DeLuca, D. S., et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 28 (11), 1530-1532 (2012).
  5. Wang, L., Wang, S., Li, W. RSeQC: Quality control of RNA-seq experiments. Bioinformatics. 28 (16), 2184-2185 (2012).
  6. Zhou, Q., Su, X., Jing, G., Chen, S., Ning, K. RNA-QC-chain: Comprehensive and fast quality control for RNA-seq data. BMC Genomics. 19 (1), 144 (2018).
  7. Deshpande, D., et al. RNA-seq data science: From raw data to effective interpretation. Front Genet. 14, 997383 (2023).
  8. Li, D., Zand, M. S., Dye, T. D., Goniewicz, M. L., Rahman, I., Xie, Z. An evaluation of RNA-seq differential analysis methods. PLoS One. 17 (9), e0264246 (2022).
  9. Kassambara, A. . fastqcr: Quality control of sequencing data. , (2023).
  10. Chen, S., Zhou, Y., Chen, Y., Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 34 (17), i884-i890 (2018).
  11. Wang, W., Luo, J. D., Carroll, T. . Rfastp. , (2025).
  12. Wickham, H. . ggplot2: Elegant graphics for data analysis. , (2016).
  13. Shi, W. . Rsubread. , (2025).
  14. Wagner, G. P., Kin, K., Lynch, V. J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131 (4), 281-285 (2012).
  15. Liu, Y., et al. Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. PLoS One. 8, e66883 (2013).
  16. Schurch, N. J., et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use. RNA. 22 (6), 839-851 (2016).
  17. Liu, Y., Zhou, J., White, K. P. RNA-seq differential expression studies: More sequence or more replication. Bioinformatics. 30 (3), 301-304 (2014).
  18. Deschamps-Francoeur, G., Simoneau, J., Scott, M. S. Handling multi-mapped reads in RNA-seq. Comput Struct Biotechnol J. 18, 1569-1576 (2020).
  19. Seppey, M., Manni, M., Zdobnov, E. M. BUSCO: Assessing genome assembly and annotation completeness. Methods Mol Biol. 1962, 227-245 (2019).
  20. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 14 (4), 417-419 (2017).

Reprints and Permissions

Request permission to reuse the text or figures of this JoVE article
Request Permission
Rup (RNA-seq Usability Assessment Pipeline) – Quality Control for Bulk RNA-seq Experiments in Eukaryotes
JoVE logo
Contact Us Recommend to Library
Research
  • JoVE Journal
  • JoVE Encyclopedia of Experiments
  • JoVE Visualize
Business
  • JoVE Business
Education
  • JoVE Core
  • JoVE Science Education
  • JoVE Lab Manual
  • JoVE Quizzes
Solutions
  • Authors
  • Teaching Faculty
  • Librarians
  • K12 Schools
About JoVE
  • Overview
  • Leadership
Others
  • JoVE Newsletters
  • JoVE Help Center
  • Blogs
  • Site Maps
Contact Us Recommend to Library
JoVE logo

Copyright © 2025 MyJoVE Corporation. All rights reserved

Privacy Terms of Use Policies
WeChat QR code