mirMachine: A One-Stop Shop for Plant miRNA Annotation

H. Busra Cagirici; Taner Z. Sen; Hikmet Budak

doi:10.3791/62430

Biology

mirMachine: A One-Stop Shop for Plant miRNA Annotation

Published: May 1, 2021 doi: 10.3791/62430

H. Busra Cagirici¹, Taner Z. Sen¹, Hikmet Budak²

¹U.S. Department of Agriculture - Agricultural Research Service, Western Regional Research Center, Crop Improvement and Genetics Research Unit, CA, USA, ²Montana BioAgriculture Inc., Missoula, MT, USA

Summary

Herein, we present a new and fully automated miRNA pipeline, mirMachine that 1) can identify known and novel miRNAs more accurately and 2) is fully automated and freely available. Users can now execute a short submission script to run the fully automated mirMachine pipeline.

Abstract

Of different types of noncoding RNAs, microRNAs (miRNAs) have arguably been in the spotlight over the last decade. As post-transcriptional regulators of gene expression, miRNAs play key roles in various cellular pathways, including both development and response to a/biotic stress, such as drought and diseases. Having high-quality reference genome sequences enabled identification and annotation of miRNAs in several plant species, where miRNA sequences are highly conserved. As computational miRNA identification and annotation processes are mostly error-prone processes, homology-based predictions increase prediction accuracy. We developed and have improved the miRNA annotation pipeline, SUmir, in the last decade, which has been used for several plant genomes since then.

This study presents a fully automated, new miRNA pipeline, mirMachine (miRNA Machine), by (i) adding an additional filtering step on the secondary structure predictions, (ii) making it fully automated, and (iii) introducing new options to predict either known miRNA based on homology or novel miRNAs based on small RNA sequencing reads using the previous pipeline. The new miRNA pipeline, mirMachine, was tested using The Arabidopsis Information Resource, TAIR10, release of the Arabidopsis genome and the International Wheat Genome Sequencing Consortium (IWGSC) wheat reference genome v2.

Introduction

Advances in next generation sequencing technologies have widened the understanding of RNA structures and regulatory elements, revealing functionally important non-coding RNAs (ncRNAs). Among different types of ncRNAs, microRNAs (miRNAs) constitute a fundamental regulatory class of small RNAs with a length between 19 and 24 nucleotides in plants¹^,². Since the discovery of the first miRNA in the nematode Caenorhabditis elegans³, the presence and the functions of miRNAs have been studied extensively in animal and plant genomes as well⁴^,⁵^,⁶. miRNAs function by targeting mRNAs for cleavage or translational repression⁷. Accumulating evidence has also shown that miRNAs are involved in a wide range of biological processes in plants including growth and development⁸, self-biogenesis⁹, and several biotic and abiotic stress responses¹⁰.

In plants, miRNAs are initially processed from long primary transcripts called pri-miRNAs¹¹. These pri-miRNAs generated by RNA polymerase II inside the nucleus are long transcripts forming an imperfect fold-back structure¹². The pri-miRNAs later undergo a cleavage process to produce endogenous single-stranded (ss) hairpin precursors of miRNAs called pre-miRNAs¹¹. The pre-miRNA forms a hairpin-like structure wherein a single strand folds into a double-stranded structure to excise an miRNA duplex (miRNA/miRNA*)¹³. Dicer-like protein cuts both strands of the miRNA/miRNA* duplex, leaving 2-nucleotide 3'-overhangs¹⁴^,¹⁵. The miRNA duplex is methylated inside the nucleus, which protects the 3'-end of the miRNA from degradation and uridylation activity¹⁶^,¹⁷. A helicase unwinds the methylated miRNA duplex after export and exposes the mature miRNA to the RNA-induced silencing complex (RISC) in the cytosol¹⁸. One strand of the duplex is mature miRNA incorporated into RISC , whereas the other strand, miRNA*, is degraded. The miRNA-RISC complex binds to the target sequence leading to either mRNA degradation in case of full complementarity or translational repression in case of partial complementarity¹³.

Based on the expression and biogenesis features, guidelines for miRNA annotation have been described¹⁵^,¹⁹. With the defined guidelines, Lucas and Budak developed the SUmir pipeline to perform a homology-based in silico miRNA identification in plants⁹. The SUmir pipeline was composed of two scripts: SUmirFind and SUmirFold. SUmirFind performs similarity searches against known miRNA datasets through National Center for Biotechnology Information (NCBI) Basic Local Alignment Search tool (BLAST) screening with modified parameters to include hits with only 2 or fewer mismatches and to avoid bias towards shorter hits (blastn-short -ungapped -penalty -1 -reward 1). SUmirFold evaluates the secondary structure of the putative miRNA sequences from BLAST²⁰ results using UNAfold²¹. SUmirFold differentiates miRNAs from small interfering RNAs by the identification of the characteristics of hairpin structure. Moreover, it differentiates miRNAs from other ssRNAs such as tRNA and rRNA by the parameters, minimum fold energy index > 0.67 and GC content of 24-71%. This pipeline has been recently updated by adding two additional steps to (i) increase sensitivity, (ii) increase annotation accuracy, and (iii) provide genomic distribution of the predicted miRNA genes²². Given the high conservation of plant miRNA sequences²³, this pipeline was originally designed for homology-based miRNA prediction. Novel miRNAs, however, could not be accurately identified with this bioinformatics analysis as it heavily relied on sequence conservation of miRNAs between closely related species.

This paper presents a new and fully automated miRNA pipeline, mirMachine that 1) can identify known and novel miRNAs more accurately (for example, the pipeline now uses sRNA-seq-based novel miRNA predictions as well as homology-based miRNA identification) and 2) is fully automated and freely available. The outputs have also included the genomic distributions of the predicted miRNAs. mirMachine was tested for both homology-based and sRNA-seq-based predictions in wheat and Arabidopsis genomes. Although initially released as free software, UNAfold became a commercial software in the last decade. With this upgrade, the secondary structure prediction tool was switched from UNAfold to RNAfold so that mirMachine can be freely available. Users can now execute a short submission script to run the fully automated mirMachine pipeline (examples are provided at https://github.com/hbusra/mirMachine.git).

Subscription Required. Please recommend JoVE to your librarian.

Protocol

1. Software dependencies and installation

Install software dependencies from their home site or using conda.
1. Download and install Perl, if it is not already installed, from its home site (https://www.perl.org/get.html).
  NOTE: Represented results were predicted using Perl v5.32.0.
2. Download Blast+, an alignment program, from its home site (https://www.ncbi.nlm.nih.gov/books/NBK279671/) as an executable and as source code.
  NOTE: Represented results were predicted using the BLAST 2.6.0+.
3. Install precompiled package of RNAfold from https://www.tbi.univie.ac.at/RNA/.
4. Alternatively, install these softwares using the following conda: i) conda install -c bioconda blast; ii) conda install -c bioconda viennarna.

2. The mirMachine setup and testing

Download the latest version of the mirMachine scripts and the mirMachine submission script from GitHub, https://github.com/hbusra/mirMachine.git, and then set the scripts path into the PATH.
Use the test data provided at the GitHub to make sure that the mirMachine along with all its dependencies have been downloaded correctly.
Run the mirMachine on the test data shown below.
bash mirMachine_submit.sh -f iwgsc_v2_chr5A.fasta -i mature_high_conf_v22_1.fa.filtered.fasta -n 10
NOTE: Set the -n option to 10 as the test data contains only one chromosome of the wheat genome. At defaults, the -n option is set to 20.
Control the hairpins.tbl.out.tbl output files for the predicted mature miRNAs, their predicted precursors, and their locations on the chromosomes.
Check the log files for the program outputs and warnings.

3. Homology-based miRNA identification

Run the mirMachine using the bash script shown below:
bash mirMachine_submit.sh -f $genome_file -i $input_file -m $mismatches -n $number_of_hits
Check the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.tbl.out.tbl for the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.fsa for the pre-miRNA FASTA sequences. Find the output file named $input_file.results.tbl.hairpins.log for the hairpin log file.

4. Novel miRNA identification

Preprocess the sRNA-seq FASTQ files into proper FASTA format. Trim adaptors if needed. Do not trim low-quality reads; instead, remove them. Remove reads containing N. Convert the FASTQ file into FASTA file ($input_file).
Run the mirMachine using the bash script shown below.
bash mirMachine_submit.sh -f $genome_file -i $input_file -n $number_of_hits -sRNAseq -lmax $lmax -lmin $lmin -rpm $rpm
NOTE: $mismatches was set to 0 for sRNA-seq based predictions.
Check the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.tbl.out.tbl for the predicted miRNAs. Find the output file named $input_file.results.tbl.hairpins.fsa for the pre-miRNA FASTA sequences. Find the output file named $input_file.results.tbl.hairpins.log for the hairpin log file.

5. Advance parameters

NOTE: The defaults are defined for all the parameters except for the genome file and the input miRNA file.

Set the -db option to a blast database to skip the building reference database within the pipeline.
Set the -m option to the number of mismatches allowed.
NOTE: At defaults, -m option was set to 1 for homology-based predictions and 0 for the sRNA-seq-based predictions.
Set the -n to the number of hits to eliminate after alignment (default to 20). Change this based on the species.
Use the -long to assess the secondary structures for the suspect list.
Use the -s to activate the novel miRNA prediction based on sRNA-seq data.
Set the -lmax option to the maximum length of the sRNA-seq reads to include in the screening.
Set the -lmax option to the minimum length of the sRNA-seq reads to include in the screening.
Use the -rpm option to set the Reads Per Million (RPM) threshold.
NOTE: For advanced parameters like the length of pri-miRNAs/pre-miRNAs, experienced users are encouraged to modify the scripts for their research of interest. Additionally, if the users intend to skip some steps or prefer to use modified outputs, the submission script can be modified by simply adding # at the beginning of the lines to skip those lines.

Subscription Required. Please recommend JoVE to your librarian.

Representative Results

The miRNA pipeline, mirMachine, described above was applied to the test data for the fast evaluation of the performance of the pipeline. Only the high-confidence plant miRNAs deposited at miRBase v22.1 were screened against the chromosome 5A of IWGSC wheat RefSeq genome v2²⁴. mirMachine_find returned 312 hits for the nonredundant list of 189 high-confidence miRNAs with a maximum of 1 mismatch allowed (Table 1). mirMachine_fold classified 49 of them as putative miRNAs depending on the secondary structure evaluation. The highest represented group of miRNAs was miR9666 with a total of 18 miRNAs identified (Figure 1). Some miRNAs shared the same mature miRNA, but processed from a different pre-miRNA sequence. These miRNAs were renamed by the miRNA family name followed by a unique number, e.g., miR156-5p-1 and miR156-5p-2. Among the 49 putative miRNAs, 20 non-redundant mature miRNA sequences were identified. Some miRNAs can be transcribed from more than one locus resulting in a higher number of miRNAs represented. In the test data, miR9666-3p-5 was represented twice: one on the sense strand (at 602887137) and the other on the antisense strand (at 542053079). All locations are provided in the GitHub under the TestData output file named mature_high_conf_v22_1.fa.filtered.fasta.results.tbl. hairpins.tbl.out.tbl.

Expression evidence in one plant genome is sufficient, given the conservation of miRNAs in plants; however, a high-confidence miRNA dataset only provides a limited amount of data. Therefore, it is the user's preference to use the high-confidence and/or experimentally validated miRNAs as the reference dataset and skip the expression validation step, or to use all plant miRNAs available as the reference dataset and look for the expression evidence afterwards. Here, as the high-confidence miRNAs were used as the reference set, which had been validated experimentally in one of the plant genomes, the expression validation step was skipped for the test data.

mirMachine was benchmarked using monocot and dicot plants including Arabidopsis thaliana (Arabidopsis, TAIR10 release) and Triticum aestivum (wheat, IWGSC RefSeq v2). The performance of the homology-based and the sRNA-seq-based predictions was evaluated, and the results were compared with the miRDP2²⁵, an NGS-based miRNA prediction tool. Homology-based predictions were executed using the non-redundant list of plant mature miRNA sequences deposited at the miRbase v22²⁶. sRNA-seq-based predictions were executed using the publicly available datasets; GSM2094927 for Arabidopsis and GSM1294661 for the wheat. In addition to raw results, the homology-based predictions were filtered for the expression evidence of mature miRNA and miRNA star sequences using the same sRNA-seq datasets.

Figure 2 shows the performance of each tool and the mirMachine settings on the two species. Sensitivity was calculated as the total number of known miRNAs identified divided by the total number of miRNAs identified. The results showed that mirMachine outperformed miRDP2 in terms of sensitivity and the true positive predictions in the Arabidopsis data. For the wheat data, mirMachine homology-based prediction, supported by expression evidence, provided better sensitivity than miRDP2. For both the genomes, miRDP2 predicted higher number of true positives compared to mirMachine sRNA-seq and homology-based predictions with expression evidence. It should be noted that miRDP2 lowers the expression threshold (RPM, reads per million) from 10 to 1 for the prediction of known miRNAs, resulting in higher true positive predictions. In general, the mirMachine can be used for the identification of both novel and known miRNAs. One advantage of the mirMachine is its ability to predict genome-wide distribution of the putative miRNAs without a limitation of specific tissues and conditions. Finally, the mirMachine is user-friendly and provides flexibility to adjust parameters such as number of hits, mismatches, length of miRNAs, and RPMs for specific research purposes. Taken together, the mirMachine provides accurate predictions for the putative miRNAs in the transcriptomes and the genomes of the plants.

Figure 1: The distribution of miRNA families identified from the chromosome 5A of the IWGSC wheat reference genome v2. Data labels show the miRNA family and the number of miRNAs belonging to each miRNA family. Abbreviations: miRNA = microRNA; IWGSC = International Wheat Genome Sequencing Consortium. Please click here to view a larger version of this figure.

Figure 2: Performance assessment of the mirMachine. Comparisons of the sensitivity and the total number of known miRNAs predicted (true positives) are shown for the mirMachine with homology-based and sRNA-seq-based predictions and the miRDP2 software. Abbreviation: miRNA = microRNA. Please click here to view a larger version of this figure.

Genome	Genome Size	Reference miRNA dataset	mirMachine_find hits	mirMAchine_fold hits	# of miRNA families
Test data	~0.7 Gb	189	312	49	9
Chr5A

Table 1: Statistics of the mirMachine. Test data are from the chromosome 5A of the IWGSC wheat reference genome v2. Abbreviations: miRNA = microRNA; IWGSC = International Wheat Genome Sequencing Consortium.

Subscription Required. Please recommend JoVE to your librarian.

Discussion

Our miRNA pipeline, SUmir, has been used for the identification of many plant miRNAs for the last decade. Here, we developed a new, fully automated, and freely available miRNA identification and annotation pipeline, mirMachine. Furthermore, a number of miRNA identification pipelines including, but not limited to the previous pipeline, were dependent on UNAfold software²¹, which became a commercial software over time, although once being freely available. This new and fully automated mirMachine is no longer dependent on the UNAfold; instead, the freely available RNAfold from the ViennaRNA package²⁷ is used for secondary structure prediction. Additionally, all scripts for the mirMachine were gathered in a bash script with adjustable parameters to make mirMachine a fully automated and freely available miRNA prediction and annotation tool.

The mirMachine benefited from the characteristics of plant miRNAs and their biogenesis. As opposed to animal pre-miRNAs, plant pre-miRNAs are variable in length and structural features¹⁵. Consequently, a criterion has been set for the identification of plant miRNAs depending on the characteristics of the miRNAs and their biogenesis¹⁵. No cut-off was set for the pre-miRNA length as the length of plant pre-miRNAs can vary remarkably and could be hundreds of nucleotides long. Instead, pri-miRNA structure folding, which was limited to ~700 bp in length, was first evaluated. Later, pre-miRNA sequence was predicted from the candidate pri-miRNA sequences and evaluated for proper folding statistics.

Many plant genomes, especially agronomically important cereals such as wheat and barley, possess highly repetitive genomes²⁸^,²⁹^,³⁰. Other than the high-repeat content, polyploidy is observed in some of these plants²⁴, introducing additional complexities to the in silico identification and characterization of the miRNA structures. The repeats are a major source for the production of siRNAs³¹, which resemble miRNAs in their mature forms; however, they differ in biogenesis and function³²^,³³. It is extremely difficult to eliminate siRNAs from the candidate miRNA lists. In fact, the most widely used miRNA database, the miRBase²⁶, has been reported to contain large numbers of siRNAs annotated falsely as miRNAs³⁴^,³⁵. Based on the differences in their biogenesis, the mirMachine filters the small RNAs that form a perfect pair with the antisense strand as siRNAs and places those sequences into the suspect table. Additionally, the mirMachine has the -n option, which defines the maximum number of hits to filter the candidate RNAs as siRNAs.

Expression evidence is required to validate all the miRNAs predicted in silico. As miRNAs are highly conserved among plant genomes, expression evidence in one of the plant genomes should be sufficient to confirm the validity of the predicted miRNA. The use of high-confidence, mature miRNA sequences in the initial screening process has the advantage of providing expression evidence for all the predicted miRNAs; however, the short list of initial miRNA dataset limits the prediction of a comprehensive set of miRNAs in a genome. Alternatively, a full set of plant miRNAs deposited in the miRBase database can be used as an initial dataset instead of filtering for high-confidence miRNAs. Users are advised to look for expression evidence through expressed sequence tags, miRNA microarrays, or small RNA sequencing data for at least one of the plant genomes if any expression data are not available for the species of interest.

Homology-based miRNA predictions can help elucidate genome-wide distribution of the known family of miRNAs. These miRNAs are likely to be expressed in certain tissues and conditions. A drawback of homology-based predictions is the lack of ability to identify novel miRNA families. In contrast, sRNA-seq-based predictions could identify novel miRNAs with a cost of a high number of false positives. Therefore, the choice of the best approach is up to the users and the research of interest. The mirMachine presented here can help identification of the miRNAs based on either homology to known miRNAs or sRNA sequencing.

Subscription Required. Please recommend JoVE to your librarian.

Materials

Name	Company	Catalog Number	Comments
https://www.ncbi.nlm.nih.gov/books/NBK279671/			Blast+
https://github.com/hbusra/mirMachine.git			mirMachine submission script
https://www.perl.org/get.html			Perl
https://www.tbi.univie.ac.at/RNA/			RNAfold
Arabidopsis TAIR10
Triticum aestivum (wheat, IWGSC RefSeq v2)