Here, we present a protocol demonstrating the installation and use of a bioinformatics pipeline to analyze chimeric RNA sequencing data used in the study of in vivo RNA:RNA interactions.
An understanding of the in vivo gene regulatory interactions of small noncoding RNAs (sncRNAs), such as microRNAs (miRNAs), with their target RNAs has been advanced in recent years by biochemical approaches that use cross-linking followed by ligation to capture sncRNA:target RNA interactions through the formation of chimeric RNAs and subsequent sequencing libraries. While datasets from chimeric RNA sequencing provide genome-wide and substantially less ambiguous input than miRNA prediction software, distilling these data into meaningful and actionable information requires additional analyses and may dissuade investigators lacking a computational background. This report provides a tutorial to support entry-level computational biologists in installing and applying a recent open-source software tool: Small Chimeric RNA Analysis Pipeline (SCRAP). Platform requirements, updates, and an explanation of pipeline steps and manipulation of key user-input variables are provided. Reducing a barrier for biologists to gain insights from chimeric RNA sequencing approaches has the potential to springboard discovery-based investigations of regulatory sncRNA:target RNA interactions in multiple biological contexts.
Small noncoding RNAs are highly studied for their post-transcriptional roles in coordinating expression from suites of genes in diverse processes such as differentiation and development, signal processing, and disease1,2,3. The ability to accurately determine the target transcripts of gene-regulatory small noncoding RNAs (sncRNAs), including microRNAs (miRNAs), is of importance to studies of RNA biology at both basic and translational levels. Bioinformatic algorithms that exploit anticipated complementarity between the miRNA seed sequence and its potential targets have been frequently used for the prediction of miRNA:target RNA interactions. While these bioinformatic algorithms have been successful, they also can harbor both false positive and false negative results, as has been reviewed elsewhere4,5,6. Recently, several biochemical approaches have been designed and implemented that allow unambiguous and semiquantitative determination of in vivo sncRNA:target RNA interactions by in vivo crosslinking and ensuing incorporation of a ligation step to physically attach the sncRNA to its target to form a single chimeric RNA4,5,7,8,9,10. Subsequent preparation of sequencing libraries from the chimeric RNAs allows assessment of the sncRNA:target RNA interactions by computational processing of the sequencing data. This video provides a tutorial for installing and using a computational pipeline termed small chimeric RNA analysis pipeline (SCRAP), which is designed to allow robust and reproducible analysis of sncRNA:target RNA interactions from chimeric RNA sequencing libraries6.
A goal of this tutorial is to assist investigators in avoiding excessive reliance on purely predictive bioinformatic algorithms by lowering barriers to the analysis of data generated through biochemical approaches providing chimeric molecular readouts of sncRNA:target RNA interactions. This tutorial provides practical steps and tips to guide entry-level computational scientists through the use of a pipeline, SCRAP, developed for analyzing chimeric RNA sequencing data, which can be generated by several existing biochemical protocols, including crosslinking, ligation, and sequencing of hybrids (CLASH) and covalent ligation of endogenous Argonaute-bound RNAs-crosslinking and immunoprecipitation (CLEAR-CLIP)7,9.
The use of SCRAP offers several advantages for the analysis of chimeric RNA sequencing data, compared to other computational pipelines6. One salient advantage is its extensive annotation and the incorporation of call-outs to well-supported and routinely updated bioinformatic scripts within the pipeline, in comparison to alternative pipelines that often rely on custom and/or unsupported scripts for steps in the pipeline. This feature lends stability to SCRAP, making it more worthwhile for researchers to familiarize themselves with the pipeline and to incorporate its use into their workflow. SCRAP has also been demonstrated to outperform alternative pipelines in calling peaks of sncRNA:target RNA interactions and to have cross-platform functionality, as detailed in a prior publication6.
By the end of this tutorial, users will be able to (i) identify the platform requirements for SCRAP and install the SCRAP pipeline, (ii) install reference genomes and set up command line parameters for SCRAP, and (iii) understand peak calling criteria and perform peak calling and peak annotation.
This video will describe in practical detail how researchers studying RNA biology may install and optimally use the computational pipeline, SCRAP, to analyze sncRNA interactions with target RNAs, such as messenger RNAs, in chimeric RNA-sequencing data obtained through one of the discussed biochemical approaches to sequencing library preparation.
SCRAP is a command line utility. Generally, following the guide below, the user will need to (i) download and install SCRAP (https://github.com/Meffert-Lab/SCRAP), (ii) install reference genomes and run SCRAP, and (iii) perform peak calling and annotation.
Further details of the computational steps in this procedure can be found at https://github.com/Meffert-Lab/SCRAP. This article will provide the setup and background information to allow investigators with entry-level computational skills to install, optimize, and use SCRAP on chimeric RNA sequencing library datasets.
NOTE: The protocol will begin with downloading and installing software required to analyze chimeric RNA sequencing libraries using SCRAP.
1. Installation
2. Running SCRAP
3. Peak calling and annotation
4. Visualizing the data
NOTE: All steps for analysis using SCRAP are now completed. For visualizing the data, several approaches are recommended:
Results for sncRNA:target RNA interactions detected by a modified version of SCRAP (SCRAP release 2.0, which implements modifications for rRNA filtering) on previously published sequencing datasets prepared using CLEAR-CLIP9 are shown in Figure 2 and Table 1. Users can appreciate the decrease in the relative fraction of miRNA interactions with intron regions, which occurs following the isolation of high-confidence interactions by peak calling in SCRAP. Additional data from analyses using SCRAP are also available in the initial publication of this pipeline6. Depending upon the experimental approach, filtering of sequencing data from prepared chimeric RNA libraries could be required to reduce artifacts in results. Suboptimal biochemical preparation of the sequencing library and/or suboptimal filtering of the sequencing data have the potential to result in the incorrect inclusion of reads that did not arise from the ligation of sncRNAs and target RNAs bound by Argonaute. These artifactual reads can include primer dimers or adapter dimers, rRNAs, and pre-miRNAs. Table 2 describes possible artifacts that may be detected in results, and potential solutions.
Figure 1: Formatting for data directories. Files containing raw reads for each sequencing library must be provided in the .fastq.gz format. (A) If the libraries are not paired-end, a single .fastq.gz file will be used in analysis. This file should be named 'SAMPLE.fastq.gz' where SAMPLE is the exact sample name provided by the user in the adapter file. The file should be contained within a folder matching the sample name exactly. (B) For paired-end sequencing libraries, two .fastq.gz files will be used. These files should be named 'SAMPLE-R1.fastq.gz' and 'SAMPLE-R2.fastq.gz' and should be located within a folder matching the sample name exactly. All such directories named SAMPLE should be located within the same parent directory, which the user will provide to SCRAP as the "sample directory".
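The directory layout described in Figure 1 can be checked before a run with a short script. The following is a minimal sketch and is not part of SCRAP; the function name `check_sample_directory` and the status strings are hypothetical, and only the file-naming rules stated in the figure legend are assumed.

```python
from pathlib import Path


def check_sample_directory(sample_directory, sample_names):
    """Classify each sample folder as 'single', 'paired', or an error string.

    Hypothetical helper (not part of SCRAP). It only encodes the naming
    rules from the Figure 1 legend:
      single-end: SAMPLE/SAMPLE.fastq.gz
      paired-end: SAMPLE/SAMPLE-R1.fastq.gz and SAMPLE/SAMPLE-R2.fastq.gz
    """
    root = Path(sample_directory)
    report = {}
    for name in sample_names:
        folder = root / name
        if not folder.is_dir():
            report[name] = "missing folder"
            continue
        single = folder / f"{name}.fastq.gz"
        read1 = folder / f"{name}-R1.fastq.gz"
        read2 = folder / f"{name}-R2.fastq.gz"
        if read1.is_file() and read2.is_file():
            report[name] = "paired"
        elif single.is_file():
            report[name] = "single"
        else:
            report[name] = "missing fastq.gz"
    return report
```

Calling `check_sample_directory("path/to/samples", ["SAMPLE"])` with the sample names listed in the adapter file returns one status per sample, so naming mismatches can be corrected before the pipeline is started.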
Figure 2: Proportion of miRNA:target RNA interactions by Target Type and Peak Calling methods. Chimeric sncRNA:target RNA sequencing published data from libraries prepared using CLEAR-CLIP (SRR2413277 – SRR2413295)9 were analyzed using a modified version of SCRAP (SCRAP release 2.0) with rRNA filtering implemented. Pre-miRNAs, tRNAs, and rRNAs were filtered, and distinct peak calling settings were used for 'high-confidence' (minimum 3 reads and 2 libraries) and 'all interactions' (minimum 1 read and 1 library). Interactions were grouped by miRNA family or ungrouped. Relative fractions of chimeric RNA reads for the categories (CDS, 5' UTR, intergenic, intron, 3'UTR) were calculated and graphed.
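The two peak calling settings named in the legend differ only in their minimum read and minimum library thresholds. The sketch below illustrates how such thresholds combine. It is not SCRAP's peak-calling code: it assumes, for illustration only, that each chimeric read has already been assigned to a (sncRNA, target site) pair, and the function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def call_interactions(reads, min_reads, min_libraries):
    """Return the (sncRNA, target site) pairs that pass both thresholds.

    reads: iterable of (library, sncRNA, target_site) tuples, one per
    chimeric read. This tuple layout is an assumption for illustration.
    """
    read_counts = defaultdict(int)   # reads supporting each pair
    libraries = defaultdict(set)     # libraries in which each pair was seen
    for library, sncrna, target_site in reads:
        key = (sncrna, target_site)
        read_counts[key] += 1
        libraries[key].add(library)
    return {
        key
        for key in read_counts
        if read_counts[key] >= min_reads and len(libraries[key]) >= min_libraries
    }


# Thresholds as stated in the Figure 2 legend.
HIGH_CONFIDENCE = {"min_reads": 3, "min_libraries": 2}
ALL_INTERACTIONS = {"min_reads": 1, "min_libraries": 1}
```

For example, `call_interactions(reads, **HIGH_CONFIDENCE)` keeps only pairs supported by at least 3 reads drawn from at least 2 libraries, whereas `call_interactions(reads, **ALL_INTERACTIONS)` keeps every observed pair.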
Target Type | All Interactions: Individual miRNAs | All Interactions: miRNA Families | High-Confidence Interactions: Individual miRNAs | High-Confidence Interactions: miRNA Families |
CDS | 8675 | 8679 | 925 | 1046 |
5’ UTR | 338 | 338 | 38 | 43 |
Intergenic | 2230 | 2230 | 320 | 339 |
Intron | 9522 | 9519 | 382 | 406 |
3’ UTR | 6814 | 6813 | 548 | 644 |
Total Interactions: | 31033 | 31034 | 4219 | 4597 |
Table 1: Chimeric Read Counts of miRNA:target RNA Interactions by Target Type and Peak Calling Method. Chimeric sncRNA:target RNA sequencing data published from libraries prepared using CLEAR-CLIP (SRR2413277 – SRR2413295)9 were analyzed using a modified version of SCRAP (SCRAP release 2.0) with rRNA filtering implemented. Pre-miRNAs, tRNAs, and rRNAs were filtered, and distinct peak calling settings were used for high-confidence (minimum 3 reads and 2 libraries) and all (minimum 1 read and 1 library) interactions, grouped by miRNA family or ungrouped. For each condition, counts of total detected miRNA:target RNA interactions in which the target RNA interaction was mapped to the category of coding sequence (CDS), 5' untranslated region (5' UTR), intergenic region, intron, or 3' untranslated region (3' UTR) are listed.
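The shift in target-type composition between the two peak calling settings can be recomputed directly from the counts in Table 1. The sketch below normalizes each column over the five listed categories. This choice is an assumption, made because the five listed categories do not sum to the "Total Interactions" row, and the exact normalization used for Figure 2 is not stated here.

```python
# Counts copied from Table 1 (keys: peak calling setting and grouping).
TABLE_1 = {
    "all, individual miRNAs": {
        "CDS": 8675, "5' UTR": 338, "Intergenic": 2230, "Intron": 9522, "3' UTR": 6814,
    },
    "all, miRNA families": {
        "CDS": 8679, "5' UTR": 338, "Intergenic": 2230, "Intron": 9519, "3' UTR": 6813,
    },
    "high-confidence, individual miRNAs": {
        "CDS": 925, "5' UTR": 38, "Intergenic": 320, "Intron": 382, "3' UTR": 548,
    },
    "high-confidence, miRNA families": {
        "CDS": 1046, "5' UTR": 43, "Intergenic": 339, "Intron": 406, "3' UTR": 644,
    },
}


def relative_fractions(counts):
    """Fraction of each target type among the listed categories (assumed denominator)."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


if __name__ == "__main__":
    for condition, counts in TABLE_1.items():
        fractions = relative_fractions(counts)
        print(condition, {k: round(v, 3) for k, v in fractions.items()})
```

Under this normalization, the intron share for individual miRNAs falls from roughly 35% of all interactions to roughly 17% of high-confidence interactions, consistent with the decrease noted in the text.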
Potential Contaminant | Detected As | Causes | Potential Solutions |
Primer Dimers | Interactions detected between miRNAs whose sequence matches the 5’ end of an amplification primer and a target RNA whose sequence matches the remainder of the primer. | Improper size separation (i.e. gel extraction) of PCR product following amplification. | Most primer dimers will be disregarded by SCRAP following adapter removal due to their small length. If they persist, consider adding primer sequences to a filter. |
rRNAs | Interactions between arbitrary miRNAs and known rRNAs or lncRNAs Gm26917 and Gm42418 | Ineffective isolation (i.e. immunoprecipitation and gel separation) of Argonaute complexes. | rRNA filtering is often necessary when rRNA contamination is abundant. |
tRNAs and pre-miRNAs | Interactions between tRNA fragments that are degradation products of the same tRNA or 5p and 3p miRNAs produced from the same pre-miRNA. | Low abundance of true sncRNA:target RNA chimeras or low tissue Argonaute expression. | tRNA filtering and pre-miRNA filtering. |
Table 2: Potential contaminant sequencing reads and solutions.
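Two of the signatures in Table 2 are simple enough to screen for in a results table after a run. The following sketch flags rRNA-like targets and 5p/3p pairs from the same precursor. It is a post hoc check, not SCRAP's filtering code; the function name and its three arguments are hypothetical, and it assumes miRNA names carry the usual `-5p`/`-3p` arm suffixes.

```python
# lncRNAs listed in Table 2 as markers of rRNA contamination.
RRNA_LIKE_GENES = {"Gm26917", "Gm42418"}
ARM_SUFFIXES = ("-5p", "-3p")


def flag_artifact(sncrna, target_gene, target_biotype):
    """Return 'rRNA', 'pre-miRNA', or None for one detected interaction.

    All three arguments are hypothetical column values from a results
    table; adapt them to the annotation fields actually available.
    """
    if target_biotype == "rRNA" or target_gene in RRNA_LIKE_GENES:
        return "rRNA"
    # Opposite arms of one precursor, e.g. miR-9-5p paired with miR-9-3p.
    if sncrna.endswith(ARM_SUFFIXES) and target_gene.endswith(ARM_SUFFIXES):
        same_precursor = sncrna[:-3] == target_gene[:-3]
        if same_precursor and sncrna != target_gene:
            return "pre-miRNA"
    return None
```

Interactions flagged in this way suggest that the corresponding filtering step named under "Potential Solutions" may be worth enabling or tightening.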
This protocol on the use of the SCRAP pipeline for analysis of sncRNA:target RNA interactions is designed to assist investigators who are entering into computational analysis. Completion of the tutorial is expected to guide investigators with entry-level or greater computational experience through the steps required for installation and use of this pipeline and its application to analyze data gained from chimeric RNA sequencing libraries. Steps critical to the completion of this protocol include correct reference installation and running of SCRAP, which can be time intensive and can be the source of errors, particularly if care was not taken during installation of dependencies using Anaconda or the typing of command line arguments.
Here, the particular focus has been on tips and steps for practical use of the SCRAP pipeline for analysis of chimeric sncRNA:target RNA sequencing libraries. SCRAP has been found to outperform other chimeric RNA analysis platforms in the detection of sncRNA:target RNA interactions6,13. This may be due to the peak calling feature of SCRAP, which was developed specifically to detect the features (e.g., 3' shouldering) that are observed as a result of biochemical steps involved in the formation of the chimeric RNAs. Other peak calling methods for distinct biochemical approaches, such as those used downstream of chromatin immunoprecipitation sequencing (ChIP-seq) applications, have been developed to detect peaks in data that are symmetrically distributed around a mean and typically do not perform as well in detecting the peak features of chimeric sncRNA:target RNA libraries. Users may, however, wish to test the use of other computational pipelines that could work better for their needs, particularly if their data do not fit this description.
While SCRAP has minimal hardware requirements, SCRAP runtime scales poorly with dataset size. Investigators who are beyond the novice level, or who have extensive numbers of datasets or datasets with high sequencing coverage, may wish to use SCRAP in a manner that can speed the analysis steps. Since large datasets (usually >1 billion reads) require enhanced file storage capabilities and read/write speeds for data, running SCRAP on a High-Performance Computing (HPC) cluster may be desired for analysis of larger datasets. A SCRAP optimization, which should provide parallelization and improved performance, will be made available on GitHub (https://github.com/Meffert-Lab/). This updated version of SCRAP (release 2.0) also has improved filters for rRNA and other contaminants.
As with any interface, users may inevitably encounter difficulties when using the command line interface. The most common of these include misspellings, incorrect paths, and package installation/versioning. Investigators are advised to exercise caution and avoid typos when writing command line arguments and to reproduce paths to files or folders exactly (use of a 'tab' autocompletion can help with this). Dependencies for SCRAP are managed via Anaconda so that investigators are less likely to encounter issues with package installation or version updates.
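Because mistyped paths are among the most common sources of errors, it can help to confirm that every path exists before launching a long run. The sketch below is a generic pre-flight check, not part of SCRAP; the function name and the example labels are hypothetical and simply mirror the kinds of inputs mentioned in this tutorial (sample directory, adapter file).

```python
from pathlib import Path


def find_missing_paths(arguments):
    """Return the subset of {label: path} whose path does not exist.

    arguments: dict mapping a human-readable label to a path string.
    The labels are hypothetical and are used only for reporting.
    """
    return {
        label: path
        for label, path in arguments.items()
        if not Path(path).expanduser().exists()
    }
```

For instance, `find_missing_paths({"sample directory": "~/data/samples", "adapter file": "~/data/adapters.txt"})` (example paths only) returns an empty dictionary when both paths resolve, and otherwise names the entries that need to be corrected.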
The authors have nothing to disclose.
We thank members of the Meffert laboratory for helpful discussions, including BH Powell and WT Mills IV for critical feedback on describing the installation and implementation of the pipeline. This work was supported by a Braude Foundation award, the Maryland Stem Cell Research Fund Launch Program, the Blaustein Endowment for Pain Research and Education award, and NINDS R01NS103974 and NIMH R01MH129292 to M.K.M.
Genomes | UCSC Genome browser | N/A | https://genome.ucsc.edu/ or https://www.ncbi.nlm.nih.gov/data-hub/genome/ |
Linux | Linux | Ubuntu 20.04 or 22.04 LTS recommended | |
Mac | Apple | macOS (>11) | |
Platform setup | GitHub | N/A | https://github.com/Meffert-Lab/SCRAP/blob/main/PLATFORM-SETUP.md |
SCRAP pipeline | GitHub | N/A | https://github.com/Meffert-Lab/SCRAP |
Unix shell | Unix operating system | bash >=5.0 | |
Unix shell | Unix operating system | zsh (5.9 recommended) | |
Windows | Windows | WSL Ubuntu 20.04 or 22.04 LTS |