November 7th, 2025
This protocol allows initial quality control for RNA-seq experiments for wet-lab biologists with limited bioinformatics experience.
We developed innovative bioinformatics tools to simplify, automate, and integrate data analysis from high-throughput experiments. We use high-throughput sequencing, advanced bioinformatics software, and powerful computing infrastructure to enable systematic biological analysis. To begin, install all required R packages using the Bioconductor package manager.
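The installation step could look roughly like the sketch below. It assumes Rsubread is one of the required packages, since it is the only one named in this protocol; the full package list is not given here. The helper names missing_packages and install_missing are hypothetical.

```r
# Packages assumed to be required; only Rsubread is named in this protocol.
required_pkgs <- c("Rsubread")

# Return the subset of packages that cannot be loaded in this R installation.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install anything that is missing through the Bioconductor package manager.
install_missing <- function(pkgs = required_pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) == 0) {
    return(invisible(character(0)))
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(to_install)
  invisible(to_install)
}
```

Calling install_missing() once before the analysis should be enough; it does nothing when every package is already available.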
Create a source folder to organize the input files for the analysis. Add the reference genome sequence in FASTA format as ReferenceGenome.fa to this folder.
Add the gene model annotation file named ReferenceAnnotation.gtf to the same folder. Optionally, include the rRNA gene annotation as a GTF file named ReferenceRRNA.gtf.
Place all sequencing reads as compressed FASTQ files into the folder named Reads. Ensure that each file follows the naming format. Then set the analysis parameters according to the sequencing method used.
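Before starting the analysis, it can help to confirm that the folder is laid out as described. The sketch below makes three assumptions: the Reads folder sits inside the source folder, the compressed FASTQ files end in .fastq.gz or .fq.gz, and the annotation file uses a lowercase .gtf extension. The function name check_inputs is hypothetical, and the required sample naming format is not checked because it is not spelled out here.

```r
# Verify the expected input layout and return the paths that were found.
check_inputs <- function(source_dir) {
  required <- file.path(source_dir,
                        c("ReferenceGenome.fa", "ReferenceAnnotation.gtf"))
  missing <- required[!file.exists(required)]
  if (length(missing) > 0) {
    stop("Missing input file(s): ", paste(basename(missing), collapse = ", "))
  }

  # The rRNA annotation is optional, so its absence is not an error.
  rrna <- file.path(source_dir, "ReferenceRRNA.gtf")

  # Assumed location and extensions for the compressed FASTQ files.
  reads <- list.files(file.path(source_dir, "Reads"),
                      pattern = "\\.(fastq|fq)\\.gz$", full.names = TRUE)
  if (length(reads) == 0) {
    stop("No compressed FASTQ files found in the Reads folder")
  }

  list(genome = required[1],
       annotation = required[2],
       rrna = if (file.exists(rrna)) rrna else NULL,
       reads = reads)
}
```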
To assess the mapping quality of the sequences, use the Rsubread package to build an index of the reference genome from the genome FASTA file. For each sample, use the align function to align the sequencing reads to the reference genome. Store the resulting alignment files in the output folder in BAM format.
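The indexing and alignment steps might be sketched as follows. The sketch assumes single-end reads, one FASTQ file per sample, and an index stored in a subfolder of the output folder. The helper names bam_paths and build_and_align are hypothetical; buildindex and align are the Rsubread functions.

```r
# Derive one BAM path per FASTQ file by swapping the extension.
bam_paths <- function(read_files, output_dir) {
  sample_names <- sub("\\.(fastq|fq)\\.gz$", "", basename(read_files))
  file.path(output_dir, paste0(sample_names, ".bam"))
}

# Build the genome index once, then align every sample in turn.
build_and_align <- function(genome_fasta, read_files, output_dir, threads = 1) {
  index_dir <- file.path(output_dir, "index")
  dir.create(index_dir, recursive = TRUE, showWarnings = FALSE)
  index_base <- file.path(index_dir, "reference")

  Rsubread::buildindex(basename = index_base, reference = genome_fasta)

  bam_files <- bam_paths(read_files, output_dir)
  for (i in seq_along(read_files)) {
    # Single-end alignment; paired-end data would also need readfile2.
    Rsubread::align(index = index_base,
                    readfile1 = read_files[i],
                    output_file = bam_files[i],
                    nthreads = threads)
  }
  bam_files
}
```

The function returns the BAM paths so they can be passed straight to the counting step.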
Now use the featureCounts function to count reads mapped to each gene. The annotation files should be in GTF format. Ensure only reads with a single match to the genome are counted.
Count the reads that map to rRNA genes by using the featureCounts function with the rRNA gene GTF file. Allow multi-mapped reads to be included in this count. Retrieve the read assignment statistics generated by the featureCounts function for each sample.
These statistics include the number of reads categorized as assigned, unmapped, multi-mapped, and others. Collect the statistics for rRNA gene assignments separately. Then generate bar plots visualizing the read mapping statistics from the previous steps.
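The two counting passes differ mainly in how multi-mapped reads are handled, which a sketch like the one below makes explicit. It assumes single-end data by default and the default GTF feature and attribute types of featureCounts. The helper names feature_count_args, count_genes, and count_rrna are hypothetical.

```r
# Assemble the featureCounts arguments shared by both counting passes.
feature_count_args <- function(bam_files, gtf, allow_multimapping,
                               paired_end = FALSE) {
  list(files = bam_files,
       annot.ext = gtf,
       isGTFAnnotationFile = TRUE,
       countMultiMappingReads = allow_multimapping,
       isPairedEnd = paired_end)
}

# Gene-level counts: only reads with a single match to the genome.
count_genes <- function(bam_files, annotation_gtf, paired_end = FALSE) {
  do.call(Rsubread::featureCounts,
          feature_count_args(bam_files, annotation_gtf, FALSE, paired_end))
}

# rRNA counts: multi-mapped reads are allowed.
count_rrna <- function(bam_files, rrna_gtf, paired_end = FALSE) {
  do.call(Rsubread::featureCounts,
          feature_count_args(bam_files, rrna_gtf, TRUE, paired_end))
}
```

Each call returns a list whose counts element holds the gene-by-sample matrix and whose stat element holds the read assignment statistics.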
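The statistics table returned by featureCounts has a Status column followed by one column per sample. The sketch below shows one way to collapse it into the four categories mentioned above and plot them. Grouping every remaining status under "Other" is an assumption, and the function names summarise_stats and plot_stats are hypothetical.

```r
# Collapse the featureCounts status rows into four categories per sample.
summarise_stats <- function(stat) {
  sample_cols <- setdiff(names(stat), "Status")
  counts <- as.matrix(stat[, sample_cols, drop = FALSE])
  status <- as.character(stat$Status)

  category <- rep("Other", length(status))
  category[status == "Assigned"] <- "Assigned"
  category[status == "Unassigned_Unmapped"] <- "Unmapped"
  category[status == "Unassigned_MultiMapping"] <- "MultiMapped"

  groups <- c("Assigned", "Unmapped", "MultiMapped", "Other")
  out <- matrix(0, nrow = length(groups), ncol = ncol(counts),
                dimnames = list(groups, colnames(counts)))
  for (g in groups) {
    out[g, ] <- colSums(counts[category == g, , drop = FALSE])
  }
  out
}

# Stacked bar plot with one bar per sample.
plot_stats <- function(stat) {
  barplot(summarise_stats(stat), legend.text = TRUE, las = 2,
          ylab = "Number of reads")
}
```

Running the same two functions on the statistics from the rRNA counting pass gives the separate rRNA plot.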
Group genes based on the number of reads assigned to them. Plot the classification results as a bar plot. Sample S2R1 showed a low number of reads both before and after trimming.
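Grouping genes by assigned read count could be sketched as below. The bin boundaries (0, 1-10, 11-100, more than 100) are assumptions chosen for illustration; only the 100-read threshold appears in this protocol. The function names classify_genes and plot_gene_classes are hypothetical.

```r
# Count, for each sample, how many genes fall into each read-count bin.
classify_genes <- function(counts,
                           breaks = c(-Inf, 0, 10, 100, Inf),
                           labels = c("0", "1-10", "11-100", ">100")) {
  counts <- as.matrix(counts)
  out <- vapply(seq_len(ncol(counts)), function(j) {
    # Intervals are closed on the right, so exactly 100 reads lands in "11-100".
    as.integer(table(cut(counts[, j], breaks = breaks, labels = labels)))
  }, integer(length(labels)))
  dimnames(out) <- list(labels, colnames(counts))
  out
}

# Grouped bar plot of the classification, one group of bars per sample.
plot_gene_classes <- function(counts) {
  barplot(classify_genes(counts), beside = TRUE, legend.text = TRUE,
          las = 2, ylab = "Number of genes")
}
```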
The trimmed read count of sample S2R2 was visibly reduced compared to its raw read count, indicating removal of low-quality reads during trimming. Mapping identified problems in the read assignments. Sample S2R3 exhibited a high number of multi-mapped reads and an elevated amount of ribosomal RNA reads.
A large fraction of reads in sample S2R4 did not map to the reference genome, suggesting contamination with sequences from a non-target organism. Samples S2R1 through S2R4 showed fewer genes with more than 100 assigned reads. In the correlation heat map, sample S2R5 clustered with the replicates of sample S1 and sample S1R5 clustered with the replicates of sample S2, indicating a likely replicate labeling error.
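A sample correlation heat map of this kind could be produced along the lines of the sketch below. How the correlation was actually computed is not stated here, so Pearson correlation on log2-transformed counts with a pseudocount of 1 is an assumption. The function names sample_correlation and plot_correlation_heatmap are hypothetical.

```r
# Pairwise sample correlation on log-transformed gene counts.
sample_correlation <- function(counts) {
  log_counts <- log2(as.matrix(counts) + 1)
  cor(log_counts, method = "pearson")
}

# Clustered heat map; swapped replicates cluster with the other sample group.
plot_correlation_heatmap <- function(counts) {
  heatmap(sample_correlation(counts), symm = TRUE, scale = "none",
          margins = c(8, 8))
}
```

In a plot like this, a mislabeled replicate tends to sit inside the cluster of the other condition rather than next to replicates carrying the same sample name.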
We address missing quality control for RNA-Seq, ensuring reliable data assessment before downstream gene expression analysis. Our tool integrates multiple quality tests, offering accessible, automated, and reproducible RNA-Seq assessment for biologists. Our tool enables exploring how RNA-Seq quality influences biological interpretation, paving the way for transparent and reproducible transcriptomics.
The protocol integrates automated tools for systematic biological analysis, ensuring reliable data assessment before downstream gene expression analysis.