Computational Analysis Tutorial for Chimeric Small Noncoding RNA: Target RNA Sequencing Libraries

Published: December 01, 2023
doi: 10.3791/65779

Summary

Here, we present a protocol demonstrating the installation and use of a bioinformatics pipeline to analyze chimeric RNA sequencing data used in the study of in vivo RNA:RNA interactions.

Abstract

An understanding of the in vivo gene regulatory interactions of small noncoding RNAs (sncRNAs), such as microRNAs (miRNAs), with their target RNAs has been advanced in recent years by biochemical approaches that use cross-linking followed by ligation to capture sncRNA:target RNA interactions through the formation of chimeric RNAs and subsequent sequencing libraries. While datasets from chimeric RNA sequencing provide genome-wide and substantially less ambiguous input than miRNA prediction software, distilling these data into meaningful and actionable information requires additional analyses and may dissuade investigators lacking a computational background. This report provides a tutorial to support entry-level computational biologists in installing and applying a recent open-source software tool: Small Chimeric RNA Analysis Pipeline (SCRAP). Platform requirements, updates, an explanation of pipeline steps, and guidance on manipulating key user-input variables are provided. Reducing a barrier for biologists to gain insights from chimeric RNA sequencing approaches has the potential to springboard discovery-based investigations of regulatory sncRNA:target RNA interactions in multiple biological contexts.

Introduction

Small noncoding RNAs are highly studied for their post-transcriptional roles in coordinating expression from suites of genes in diverse processes such as differentiation and development, signal processing, and disease1,2,3. The ability to accurately determine the target transcripts of gene-regulatory small noncoding RNAs (sncRNAs), including microRNAs (miRNAs), is of importance to studies of RNA biology at both basic and translational levels. Bioinformatic algorithms that exploit anticipated complementarity between the miRNA seed sequence and its potential targets have been frequently used for the prediction of miRNA:target RNA interactions. While these bioinformatic algorithms have been successful, they also can harbor both false positive and false negative results, as has been reviewed elsewhere4,5,6. Recently, several biochemical approaches have been designed and implemented that allow unambiguous and semiquantitative determination of in vivo sncRNA:target RNA interactions by in vivo crosslinking and ensuing incorporation of a ligation step to physically attach the sncRNA to its target to form a single chimeric RNA4,5,7,8,9,10. Subsequent preparation of sequencing libraries from the chimeric RNAs allows assessment of the sncRNA:target RNA interactions by computational processing of the sequencing data. This video provides a tutorial for installing and using a computational pipeline termed small chimeric RNA analysis pipeline (SCRAP), which is designed to allow robust and reproducible analysis of sncRNA:target RNA interactions from chimeric RNA sequencing libraries6.

A goal of this tutorial is to assist investigators in avoiding excessive reliance on purely predictive bioinformatic algorithms by lowering barriers to the analysis of data generated through biochemical approaches that provide chimeric molecular readouts of sncRNA:target RNA interactions. This tutorial provides practical steps and tips to guide entry-level computational scientists through the use of a pipeline, SCRAP, developed for analyzing chimeric RNA sequencing data, which can be generated by several existing biochemical protocols, including crosslinking, ligation, and sequencing of hybrids (CLASH) and covalent ligation of endogenous Argonaute-bound RNAs-crosslinking and immunoprecipitation (CLEAR-CLIP)7,9.

The use of SCRAP offers several advantages for the analysis of chimeric RNA sequencing data, compared to other computational pipelines6. One salient advantage is its extensive annotation and the incorporation of call-outs to well-supported and routinely updated bioinformatic scripts within the pipeline, in comparison to alternative pipelines that often rely on custom and/or unsupported scripts for steps in the pipeline. This feature lends stability to SCRAP, making it more worthwhile for researchers to familiarize themselves with the pipeline and to incorporate its use into their workflow. SCRAP has also been demonstrated to outperform alternative pipelines in calling peaks of sncRNA:target RNA interactions and to have cross-platform functionality, as detailed in a prior publication6.

By the end of this tutorial, users will be able to (i) know platform requirements for SCRAP and install SCRAP pipelines, (ii) install reference genomes and set up command line parameters for SCRAP, and (iii) understand peak calling criteria and perform peak calling and peak annotation.

This video will describe in practical detail how researchers studying RNA biology may install and optimally use the computational pipeline, SCRAP, to analyze sncRNA interactions with target RNAs, such as messenger RNAs, in chimeric RNA-sequencing data obtained through one of the discussed biochemical approaches to sequencing library preparation.

SCRAP is a command line utility. Generally, following the guide below, the user will need to (i) download and install SCRAP (https://github.com/Meffert-Lab/SCRAP), (ii) install reference genomes and run SCRAP, and (iii) perform peak calling and annotation.

Further details of the computational steps in this procedure can be found at https://github.com/Meffert-Lab/SCRAP. This article will provide the setup and background information to allow investigators with entry-level computational skills to install, optimize, and use SCRAP on chimeric RNA sequencing library datasets.

Protocol

NOTE: The protocol will begin with downloading and installing software required to analyze chimeric RNA sequencing libraries using SCRAP.

1. Installation

  1. Before installing SCRAP, install the dependencies Git and Miniconda on the machine to be used for the analyses. Git is likely already installed. On macOS, for example, verify this by running which git, which prints the directory containing the git utility if it is installed. Check whether Miniconda is installed by running which conda. If nothing is returned, install Miniconda. Miniconda requires 400 MB of disk space to install.
    1. There are a few methods to install Miniconda, and they differ by platform. Refer to the PLATFORM-SETUP markdown file on the Meffert Lab GitHub repository (https://github.com/Meffert-Lab/SCRAP/blob/main/PLATFORM-SETUP.md), which gives further instructions for installing on Windows, macOS, and Ubuntu. Ubuntu has its own default package manager (apt). In the example shown here, on macOS, use the command brew install miniconda to install Miniconda using an existing package manager, brew.
      NOTE: Homebrew, invoked as brew, is an open-source software package management system that simplifies the installation of software on Apple's operating system, macOS.
    2. If conda is being installed for the first time, run conda init for the particular shell that is in use. In the example here, the shell in use is zsh. Then, close and re-open the shell. If conda was successfully installed, the prompt of the new terminal session will show that the base environment is active.
  2. Download the SCRAP source and install its dependencies.
    1. The preferred method for obtaining SCRAP source is using Git. Access this by running git clone https://github.com/Meffert-Lab/SCRAP to obtain the latest copy of the source code.
    2. Install mamba, an improved package solver for conda, and install all the dependencies for SCRAP from SCRAP_environment.yml to its own conda environment using the following commands:
      conda install -n base conda-forge::mamba
      mamba env create -f SCRAP/SCRAP_environment.yml -n SCRAP
  3. Next, run the reference installation for SCRAP. The arguments used in the reference installation will be specific to the organism whose sncRNA-mRNA interactions are being analyzed.
    bash SCRAP/bin/Reference_Installation.sh -r full/path/to/SCRAP/ -m hsa -g hg38 -s human
    1. Provide the directory of the SCRAP source folder for reference installation. Installation steps will then be performed using the files within the fasta and annotation folders. List the full path without any shorthand. End with a slash.
    2. Refer to the tables in README.md for the correct miRBase species abbreviations. The up-to-date reference genomes can be found at https://genome.ucsc.edu/ or https://www.ncbi.nlm.nih.gov/data-hub/genome/. In this example, hg38 will be used for the human GRCh38 genome.
    3. The currently included species for annotation are human, mouse, and worm. View the corresponding species.annotation.bed files in the annotation directory in the SCRAP source folder. If the use of a different species for analysis is desired, provide an annotation.bed file that follows the same naming scheme, species.annotation.bed.
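The checks described in section 1 can be sketched as a short shell script. This is a minimal sketch and not part of SCRAP: have_tool and is_full_dir_path are hypothetical helper names introduced here for illustration, and the path test only inspects the form of the string (leading slash, trailing slash), assuming a Unix-style path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the named tool is on PATH
# (the same question that `which git` and `which conda` answer).
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeed only for a full path that ends with a slash,
# the form step 1.3.1 asks for (no shorthand such as ~ or a relative path).
is_full_dir_path() {
  case "$1" in
    /*/) return 0 ;;
    *)   return 1 ;;
  esac
}

# Report on the two dependencies named in step 1.1.
for tool in git conda; do
  if have_tool "$tool"; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: not found; install it before continuing"
  fi
done
```

For example, is_full_dir_path "/Users/me/SCRAP/" succeeds, whereas "~/SCRAP/" and "/Users/me/SCRAP" are both rejected (the directory names here are placeholders).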

2. Running SCRAP

  1. Now that the dependencies and SCRAP are installed, run the script SCRAP.sh:
    bash SCRAP/bin/SCRAP.sh -d full/path/to/CLASH_Human/ -a full/path/to/CLASH_Human/CLASH_Human_Adapters.txt -p no -f yes -r full/path/to/SCRAP/ -m hsa -g hg38
    1. List the entire path to the sample directories without any shorthand. Format the sample directories with the folder name matching the sample name exactly, as shown in Figure 1.
    2. Note that the path listed is the path to the directory that contains all the sample folders, not the path to any individual sample folder or a sample file (refer to the command line in step 2.1).
    3. Next, list the entire path to the adapter file. Ensure that the sample names in the adapter file match the previously mentioned folder names and file names (refer to the command line in step 2.1).
    4. Indicate whether the samples are paired-end and whether or not filtering for pre-miRNAs and/or tRNAs will be performed. Add a filter for rRNA cleaning if desired (refer to the command line in step 2.1).
      NOTE: The users may or may not decide to use these filters depending on the sample types and experimental goals. Depending upon the experimental design, pre-miRNAs, tRNAs, and rRNAs can consume available sequencing depth for real sncRNA:target RNA chimeras and users can employ filters to exclude them. However, users may want to avoid such filtering in certain circumstances (e.g., mapping sncRNA targets to the mitochondrial genome, which contains mitochondrial rRNAs).
    5. Next, list the entire path to the reference directory, the miRBase abbreviation, and the reference genome abbreviation (refer to the command line in step 2.1).
      NOTE: The script may take a few hours to complete, depending on the dataset size and the CPU of the computer being used.
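Because mistyped paths are a common source of errors, it can help to assemble the step 2.1 command from its parts and print it for review before running it. The following is a minimal sketch that reuses only the flags and values shown in step 2.1; the function name scrap_command and the example locations /data/CLASH_Human/ and /opt/SCRAP/ are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the step 2.1 command without running it.
# $1 = parent directory containing all sample folders (full path, trailing slash)
# $2 = adapter file (full path)
# $3 = SCRAP source directory (full path, trailing slash)
scrap_command() {
  printf 'bash %sbin/SCRAP.sh -d %s -a %s -p no -f yes -r %s -m hsa -g hg38\n' \
    "$3" "$1" "$2" "$3"
}

# Example locations are placeholders; substitute the real ones.
scrap_command /data/CLASH_Human/ /data/CLASH_Human/CLASH_Human_Adapters.txt /opt/SCRAP/
```

Once the printed line has been checked against the folder layout, it can be copied into the terminal and run.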

3. Peak calling and annotation

  1. Once SCRAP is finished running, check that the output includes, among other files, a SAMPLE.aligned.unique.bam file. This is a binary file containing alignments of target RNAs onto the user-provided reference genome.
  2. Now perform peak calling by running Peak_Calling.sh.
    bash SCRAP/bin/Peak_Calling.sh -d CLASH_Human/ -a CLASH_Human/CLASH_Human_Adapters.txt -c 3 -l 2 -f no -r SCRAP/ -m hsa -g hg38
    NOTE: Peak calling is a feature of SCRAP designed to allow researchers to readily evaluate the most robust and reproducible small noncoding RNA:target RNA interactions within their chimeric RNA libraries. This feature can, for example, aid researchers in identifying interactions to select for further investigation. Steps 3.2.2 and 3.2.3 below describe how the user sets the criteria that define the stringency with which a peak is called: the number of unique interactions, or sequencing reads, that must have occurred for the peak to be called, and the number of libraries in which the interaction must have occurred.
    1. Again, list the full paths to the directory containing the sample folders, and the adapter file (refer to the command line in step 3.2).
    2. Next, set the minimum number of sequencing reads required for a peak to be called (refer to the command line in step 3.2).
    3. Set the minimum number of distinct sequencing libraries that must contain a peak for it to be called (refer to the command line in step 3.2).
      NOTE: The choice of values for both 3.2.2 and 3.2.3 will depend upon the nature of the samples sequenced and the number of samples or sample types. Here, at least 3 chimeric sequencing reads in a sample are required to call a peak, and the peak must be supported by at least 2 samples. An investigator evaluating a dataset in which there are many sequencing library replicates for a given condition, for example, might decide to require the presence of the reads in a greater number of sample sequencing libraries.
    4. Indicate whether sncRNAs of the same family must contribute to the same peak. For example, since miRNAs of the same family share seed sequences, these miRNAs can bind shared and overlapping sets of gene targets; a user might want to identify the full impact of a family on these targets by assessing their collective peaks (refer to the command line in step 3.2).
    5. Next, indicate the full path to the reference directory, the miRBase abbreviation, and the reference genome abbreviation (refer to the command line in step 3.2).
  3. Once peak calling is complete, run peak annotation.
    bash SCRAP/bin/Peak_Annotation.sh -p CLASH_Human/peaks.bed -r SCRAP/ -s human
    1. List the full path to the resulting peaks.bed (or peaks.family.bed) file from peak calling, the full path to the reference directory, and the desired species for annotation.
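The effect of the two thresholds set in steps 3.2.2 and 3.2.3 can be illustrated with a toy filter. This sketch does not reproduce how Peak_Calling.sh computes peaks; it assumes a hypothetical three-column, tab-separated summary (peak name, supporting reads, number of libraries containing the peak) and simply applies the values 3 and 2 used in step 3.2. The function name filter_peaks and the peak names are invented for illustration.

```shell
#!/usr/bin/env bash
# Toy illustration only: keep peaks that meet both thresholds.
# $1 = minimum number of reads, $2 = minimum number of libraries
# stdin = tab-separated lines: peak name, supporting reads, libraries with the peak
filter_peaks() {
  awk -F '\t' -v min_reads="$1" -v min_libs="$2" \
    '$2 >= min_reads && $3 >= min_libs { print $1 }'
}

# Hypothetical peaks: peakB has too few reads and peakC too few libraries,
# so only peakA passes with thresholds of 3 reads and 2 libraries.
printf 'peakA\t5\t3\npeakB\t2\t2\npeakC\t4\t1\n' | filter_peaks 3 2
```

Lowering both thresholds to 1 keeps all three toy peaks, which corresponds to the 'all interactions' setting described for Figure 2.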

4. Visualizing the data

NOTE: All steps for analysis using SCRAP are now completed. For visualizing the data, the following steps are recommended:

  1. Merge all the .bam (binary SAM) files that are to be visualized together (samtools merge).
  2. Sort the resulting merged .bam file (samtools sort). Alignments are sorted by genomic coordinate, which samtools requires before it can index the file.
  3. Index the sorted .bam file (samtools index). A BAI (BAM index) file is generated to permit visualization in the Integrative Genomics Viewer (IGV).
  4. Finally, open the resulting sorted .bam file, together with its .bai index file, in IGV.
    NOTE: sncRNA:target RNA interactions of interest may be prioritized for follow-up in a number of investigation-specific ways. One generic initial approach is to assess the interactions for which peaks are supported by the most chimeric sequencing reads. Interactions of interest may also be visualized using the DuplexFold Web Server from the RNAstructure package by inputting the sequences of both the sncRNA and the target RNA from the detected interaction11. For each peak, the chromosome (first column) and genomic coordinates (start: second column; end: third column) can be found within the peaks.bed.species.annotation.txt file generated in peak annotation. For miRNAs in particular, while reproducible and functional interactions can lack extensive seed-matched binding (e.g., interactions may use 3' compensatory binding), the presence of seed-matched sites in a cognate binding motif of the target RNA can nonetheless be assessed as a validating feature of functionally important detected interactions4,12. Ancillary data processing could include comparisons of differential read coverage between peaks in distinct biological conditions and, potentially, assessment of clustering of regulated genes into pathways using a pathway analysis tool.
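Steps 4.1 to 4.3 can be sketched as three samtools commands. The helper below only prints the commands so they can be reviewed first; running them assumes samtools is installed and on the PATH. The function name igv_prep_commands, the output prefix, and the input file names are hypothetical, and the sketch assumes file names without spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the samtools commands for steps 4.1 to 4.3.
# $1 = output prefix; remaining arguments = the .bam files to merge
igv_prep_commands() {
  prefix="$1"
  shift
  echo "samtools merge ${prefix}.merged.bam $*"
  echo "samtools sort -o ${prefix}.sorted.bam ${prefix}.merged.bam"
  echo "samtools index ${prefix}.sorted.bam"
}

# Placeholder file names; substitute the SAMPLE.aligned.unique.bam files of interest.
igv_prep_commands combined sample1.aligned.unique.bam sample2.aligned.unique.bam
```

With these placeholder names, the final command writes the index as combined.sorted.bam.bai next to the sorted file, and both files are then opened in IGV as in step 4.4.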

Representative Results

Results for sncRNA:target RNA interactions detected by a modified version of SCRAP (SCRAP release 2.0, which implements modifications for rRNA filtering) on previously published sequencing datasets prepared using CLEAR-CLIP9 are shown in Figure 2 and Table 1. Users can appreciate the decrease in the relative fraction of miRNA interactions with intron regions that occurs following the isolation of high-confidence interactions by peak calling in SCRAP. Additional data from analyses using SCRAP are also available in the initial publication of this pipeline6. Depending upon the experimental approach, filtering of sequencing data from prepared chimeric RNA libraries could be required to reduce artifacts in results. Suboptimal biochemical preparation of the sequencing library and/or suboptimal filtering of the sequencing data have the potential to result in the incorrect inclusion of reads that did not arise from the ligation of sncRNAs and target RNAs bound by Argonaute. These artifactual reads can include primer dimers or adapter dimers, rRNAs, and pre-miRNAs. Table 2 describes possible artifacts that may be detected in results, and potential solutions.

Figure 1: Formatting for data directories. Files containing raw reads for each sequencing library must be provided in the .fastq.gz format. (A) If the libraries are not paired-end, a single .fastq.gz file will be used in analysis. This file should be named 'SAMPLE.fastq.gz' where SAMPLE is the exact sample name provided by the user in the adapter file. The file should be contained within a folder matching the sample name exactly. (B) For paired-end sequencing libraries, two .fastq.gz files will be used. These files should be named 'SAMPLE-R1.fastq.gz' and 'SAMPLE-R2.fastq.gz' and should be located within a folder matching the sample name exactly. All such directories named SAMPLE should be located within the same parent directory, which the user will provide to SCRAP as the "sample directory".
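The naming rules in Figure 1 can be checked before a run with a small shell function. This is a minimal sketch, not part of SCRAP: sample_layout_ok is a hypothetical helper, and it tests only that the expected file names exist inside a folder named after the sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when PARENT/SAMPLE/ holds the read files
# named as described in Figure 1.
# $1 = parent sample directory
# $2 = sample name (as written in the adapter file)
# $3 = "yes" for paired-end libraries, anything else for single-end
sample_layout_ok() {
  if [ "$3" = "yes" ]; then
    [ -f "$1/$2/$2-R1.fastq.gz" ] && [ -f "$1/$2/$2-R2.fastq.gz" ]
  else
    [ -f "$1/$2/$2.fastq.gz" ]
  fi
}
```

A call such as sample_layout_ok /data/CLASH_Human/ SAMPLE no (with a placeholder directory and sample name) returns success only if /data/CLASH_Human/SAMPLE/SAMPLE.fastq.gz exists, so it can be used in an if statement before launching SCRAP.sh.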

Figure 2: Proportion of miRNA:target RNA interactions by Target Type and Peak Calling methods. Chimeric sncRNA:target RNA sequencing published data from libraries prepared using CLEAR-CLIP (SRR2413277 – SRR2413295)9 were analyzed using a modified version of SCRAP (SCRAP release 2.0) with rRNA filtering implemented. Pre-miRNAs, tRNAs, and rRNAs were filtered, and distinct peak calling settings were used for 'high-confidence' (minimum 3 reads and 2 libraries) and 'all interactions' (minimum 1 read and 1 library). Interactions were grouped by miRNA family or ungrouped. Relative fractions of chimeric RNA reads for the categories (CDS, 5' UTR, intergenic, intron, 3' UTR) were calculated and graphed.

Target type           All Interactions                       High-Confidence Interactions
                      Individual miRNAs    miRNA Families    Individual miRNAs    miRNA Families
CDS                   8675                 8679              925                  1046
5' UTR                338                  338               38                   43
Intergenic            2230                 2230              320                  339
Intron                9522                 9519              382                  406
3' UTR                6814                 6813              548                  644
Total Interactions    31033                31034             4219                 4597

Table 1: Chimeric Read Counts of miRNA:target RNA Interactions by Target Type and Peak Calling Method. Chimeric sncRNA:target RNA sequencing data published from libraries prepared using CLEAR-CLIP (SRR2413277 – SRR2413295)9 were analyzed using a modified version of SCRAP (SCRAP release 2.0) with rRNA filtering implemented. Pre-miRNAs, tRNAs, and rRNAs were filtered, and distinct peak calling settings were used for high-confidence (minimum 3 reads and 2 libraries) and all (minimum 1 read and 1 library) interactions, grouped by miRNA family or ungrouped. For each condition, counts of total detected miRNA:target RNA interactions in which the target RNA interaction was mapped to the category of coding sequence (CDS), 5' untranslated region (5' UTR), intergenic region, intron, or 3' untranslated region (3' UTR) are listed.
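The decrease in the intron fraction after peak calling can be checked from the counts in Table 1. The sketch below is an illustration rather than the calculation used for Figure 2: it assumes the fraction is taken over the sum of the five listed categories (the 'Total Interactions' row is larger than that sum, so it presumably includes interactions outside these categories), and intron_share is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of the five listed categories that map to introns.
# Arguments, in order: CDS, 5' UTR, intergenic, intron, 3' UTR counts.
intron_share() {
  awk -v cds="$1" -v utr5="$2" -v inter="$3" -v intron="$4" -v utr3="$5" \
    'BEGIN { printf "%.1f\n", 100 * intron / (cds + utr5 + inter + intron + utr3) }'
}

intron_share 8675 338 2230 9522 6814   # all interactions, individual miRNAs: prints 34.5
intron_share 925 38 320 382 548        # high-confidence, individual miRNAs: prints 17.3
```

Under this assumption, the intron share for individual miRNAs falls by roughly half once only high-confidence peaks are retained.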

Potential contaminant: Primer dimers
  Detected as: Interactions detected between miRNAs whose sequence matches the 5' end of an amplification primer and a target RNA whose sequence matches the remainder of the primer.
  Causes: Improper size separation (i.e., gel extraction) of PCR product following amplification.
  Potential solutions: Most primer dimers will be disregarded by SCRAP following adapter removal due to their small length. If they persist, consider adding primer sequences to a filter.

Potential contaminant: rRNAs
  Detected as: Interactions between arbitrary miRNAs and known rRNAs or lncRNAs Gm26917 and Gm42418.
  Causes: Ineffective isolation (i.e., immunoprecipitation and gel separation) of Argonaute complexes.
  Potential solutions: rRNA filtering is often necessary when rRNA contamination is abundant.

Potential contaminant: tRNAs and pre-miRNAs
  Detected as: Interactions between tRNA fragments that are degradation products of the same tRNA or 5p and 3p miRNAs produced from the same pre-miRNA.
  Causes: Low abundance of true sncRNA:target RNA chimeras or low tissue Argonaute expression.
  Potential solutions: tRNA filtering and pre-miRNA filtering.

Table 2: Potential contaminant sequencing reads and solutions.

Discussion

This protocol on the use of the SCRAP pipeline for analysis of sncRNA:target RNA interactions is designed to assist investigators who are new to computational analysis. Completion of the tutorial is expected to guide investigators with entry-level or greater computational experience through the steps required for installation and use of this pipeline and its application to analyze data gained from chimeric RNA sequencing libraries. Steps critical to the completion of this protocol include correct reference installation and running of SCRAP, which can be time intensive and can be a source of errors, particularly if care is not taken during installation of dependencies using Anaconda or when typing command line arguments.

Here, the particular focus has been on tips and steps for practical use of the SCRAP pipeline for analysis of chimeric sncRNA:target RNA sequencing libraries. SCRAP has been found to outperform other chimeric RNA analysis platforms in the detection of sncRNA:target RNA interactions6,13. This may be due to the peak calling feature of SCRAP, which was developed specifically to detect the features (e.g., 3' shouldering) that are observed as a result of biochemical steps involved in the formation of the chimeric RNAs. Peak calling methods developed for distinct biochemical approaches, such as those used downstream of chromatin immunoprecipitation sequencing (ChIP-seq), are designed to detect peaks in data that are symmetrically distributed around a mean and typically do not perform as well in detecting the peak features of chimeric sncRNA:target RNA libraries. Users may, however, wish to test the use of other computational pipelines that could work better for their needs, particularly if their data do not fit this description.

While SCRAP has minimal hardware requirements, SCRAP runtime scales poorly with dataset size. Investigators who are beyond the novice level, or who have extensive numbers of datasets or datasets with high sequencing coverage, may wish to use SCRAP in a manner that can speed the analysis steps. Since large datasets (usually, > 1 billion reads) require enhanced file storage capabilities and read/write speeds for data, running SCRAP on a High-Performance Computing (HPC) cluster may be desired for analysis of larger datasets. A SCRAP optimization, which should provide parallelization and improved performance, will be made available on GitHub (https://github.com/Meffert-Lab/). This updated version of SCRAP (release 2.0) also has improved filters for rRNA and other contaminants.

As with any interface, users may inevitably encounter difficulties when using the command line interface. The most common of these include misspellings, incorrect paths, and package installation/versioning. Investigators are advised to exercise caution and avoid typos when writing command line arguments and to reproduce paths to files or folders exactly (use of a 'tab' autocompletion can help with this). Dependencies for SCRAP are managed via Anaconda so that investigators are less likely to encounter issues with package installation or version updates.

Disclosures

The authors have nothing to disclose.

Acknowledgements

We thank members of the Meffert laboratory for helpful discussions, including BH Powell and WT Mills IV, for critical feedback on describing the installation and implementation of the pipeline. This work was supported by a Braude Foundation award, the Maryland Stem Cell Research Fund Launch Program, the Blaustein Endowment for Pain Research and Education award, and NINDS R01NS103974 and NIMH R01MH129292 to M.K.M.

Materials

Genomes UCSC Genome browser N/A https://genome.ucsc.edu/ or https://www.ncbi.nlm.nih.gov/data-hub/genome/
Linux Linux Ubuntu 20.04 or 22.04 LTS recommended
Mac Apple Mac OSX (>11)
Platform setup GitHub N/A https://github.com/Meffert-Lab/SCRAP/blob/main/PLATFORM-SETUP.md
SCRAP pipeline GitHub N/A https://github.com/Meffert-Lab/SCRAP
Unix shell Unix operating system bash >=5.0
Unix shell Unix operating system zsh (5.9 recommended)
Windows Windows WSL Ubuntu 20.04 or 22.04 LTS

References

  1. Morris, K. V., Mattick, J. S. The rise of regulatory RNA. Nature Reviews Genetics. 15 (6), 423-437 (2014).
  2. Li, X., Jin, D. S., Eadara, S., Caterina, M. J., Meffert, M. K. Regulation by noncoding RNAs of local translation, injury responses, and pain in the peripheral nervous system. Neurobiology of Pain (Cambridge, Mass.). 13, 100119 (2023).
  3. Shi, J., Zhou, T., Chen, Q. Exploring the expanding universe of small RNAs. Nature Cell Biology. 24 (4), 415-423 (2022).
  4. Broughton, J. P., Lovci, M. T., Huang, J. L., Yeo, G. W., Pasquinelli, A. E. Pairing beyond the seed supports microRNA targeting specificity. Molecular Cell. 64 (2), 320-333 (2016).
  5. Grosswendt, S., et al. Unambiguous identification of miRNA:target site interactions by different types of ligation reactions. Molecular Cell. 54 (6), 1042-1054 (2014).
  6. Mills, W. T., Eadara, S., Jaffe, A. E., Meffert, M. K. SCRAP: a bioinformatic pipeline for the analysis of small chimeric RNA-seq data. RNA. 29 (1), 1-17 (2023).
  7. Helwak, A., Kudla, G., Dudnakova, T., Tollervey, D. Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell. 153 (3), 654-665 (2013).
  8. Hoefert, J. E., Bjerke, G. A., Wang, D., Yi, R. The microRNA-200 family coordinately regulates cell adhesion and proliferation in hair morphogenesis. Journal of Cell Biology. 217 (6), 2185-2204 (2018).
  9. Moore, M. J., Zhang, C., Gantman, E. C., Mele, A., Darnell, J. C., Darnell, R. B. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nature Protocols. 9 (2), 263-293 (2014).
  10. Bjerke, G. A., Yi, R. Integrated analysis of directly captured microRNA targets reveals the impact of microRNAs on mammalian transcriptome. RNA. 26 (3), 306-323 (2020).
  11. Reuter, J. S., Mathews, D. H. RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics. 11 (1), 129 (2010).
  12. Moore, M. J., et al. miRNA-target chimeras reveal miRNA 3′-end pairing as a major determinant of Argonaute target specificity. Nature Communications. 6 (1), 8864 (2015).
  13. Travis, A. J., Moody, J., Helwak, A., Tollervey, D., Kudla, G. Hyb: a bioinformatics pipeline for the analysis of CLASH (crosslinking, ligation and sequencing of hybrids) data. Methods (San Diego, Calif.). 65 (3), 263-273 (2014).

Cite this article
Eadara, S., Li, X., Eiss, E. A., Meffert, M. K. Computational Analysis Tutorial for Chimeric Small Noncoding RNA: Target RNA Sequencing Libraries. J. Vis. Exp. (202), e65779, doi:10.3791/65779 (2023).
