Next-generation sequencing has become an important tool in molecular biology. Various protocols to investigate genomic, transcriptomic and epigenomic features across virtually all species and tissues have been devised. For most of these experiments, one of the first crucial steps of bioinformatic analysis is the mapping of reads to reference genomes.
Numerous high-throughput sequencing studies have focused on detecting conventionally spliced mRNAs in RNA-seq data. However, non-standard RNAs arising through gene fusion, circularization or trans-splicing are often neglected. We introduce a novel, unbiased algorithm to detect splice junctions from single-end cDNA sequences. In contrast to other methods, our approach accommodates multi-junction structures. Our method compares favorably with competing tools for conventionally spliced mRNAs and, with a gain of up to 40% of recall, systematically outperforms them on reads with multiple splits, trans-splicing and circular products. The algorithm is integrated into our mapping tool segemehl (http://www.bioinf.uni-leipzig.de/Software/segemehl/).
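To make the distinction between conventional, circular and trans-spliced junctions concrete, the following minimal sketch classifies two consecutive segments of one split read by their mapping coordinates. It is an illustrative heuristic only, not the segemehl algorithm; the `Segment` type, the `classify_split` function and the labels `linear`, `backsplice` and `trans` are hypothetical names introduced here.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One mapped part of a split read (hypothetical helper type)."""
    chrom: str
    strand: str  # "+" or "-"
    start: int   # 0-based, inclusive
    end: int     # exclusive


def classify_split(first: Segment, second: Segment) -> str:
    """Classify two segments that are consecutive in read order.

    Simplifying assumption: overlapping segments on the same strand
    are also reported as "backsplice".
    """
    # Different chromosome or strand: candidate fusion or trans-splicing event.
    if first.chrom != second.chrom or first.strand != second.strand:
        return "trans"
    # On the plus strand the second segment must lie downstream of the first;
    # on the minus strand the genomic order is reversed.
    if first.strand == "+":
        colinear = first.end <= second.start
    else:
        colinear = second.end <= first.start
    return "linear" if colinear else "backsplice"
```

A read with more than two segments could be handled by applying `classify_split` to each pair of neighbouring segments, which is one simple way to represent the multi-junction structures mentioned above.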
Eulimnogammarus verrucosus is an amphipod endemic to the unique ecosystem of Lake Baikal and serves as an emerging model in ecotoxicological studies. We report here on a survey sequencing of its genome as a first step to establish sequence resources for this species. From a single lane of paired-end sequencing data, we estimated the genome size as nearly 10 Gb and we obtained an overview of the repeat content. At least two-thirds of the genome are non-unique DNA, and a third of the genomic DNA is composed of just five families of repetitive elements, including low-complexity sequences. Attempts to use off-the-shelf assembly tools failed on the available low-coverage data both before and after removal of highly repetitive components. Using a seed-based approach we nevertheless assembled short contigs covering 33 pre-microRNAs and the homeodomain-containing exon of nine Hox genes. The absence of clear evidence for paralogs implies that a genome duplication did not contribute to the large genome size. We furthermore report the assembly of the mitochondrial genome using a new, guided "crystallization" procedure. The initial results presented here set the stage for a more complete sequencing and analysis of this large genome.
The chromosome 9p21 (Chr9p21) locus of coronary artery disease has been identified in the first surge of genome-wide association and is the strongest genetic factor of atherosclerosis known today. Chr9p21 encodes the long non-coding RNA (ncRNA) antisense non-coding RNA in the INK4 locus (ANRIL). ANRIL expression is associated with the Chr9p21 genotype and correlated with atherosclerosis severity. Here, we report on the molecular mechanisms through which ANRIL regulates target-genes in trans, leading to increased cell proliferation, increased cell adhesion and decreased apoptosis, which are all essential mechanisms of atherogenesis. Importantly, trans-regulation was dependent on Alu motifs, which marked the promoters of ANRIL target genes and were mirrored in ANRIL RNA transcripts. ANRIL bound Polycomb group proteins that were highly enriched in the proximity of Alu motifs across the genome and were recruited to promoters of target genes upon ANRIL over-expression. The functional relevance of Alu motifs in ANRIL was confirmed by deletion and mutagenesis, reversing trans-regulation and atherogenic cell functions. ANRIL-regulated networks were confirmed in 2280 individuals with and without coronary artery disease and functionally validated in primary cells from patients carrying the Chr9p21 risk allele. Our study provides a molecular mechanism for pro-atherogenic effects of ANRIL at Chr9p21 and suggests a novel role for Alu elements in epigenetic gene regulation by long ncRNAs.
Prokaryotic transcripts constitute almost always uninterrupted intervals when mapped back to the genome. Split reads, i.e., RNA-seq reads consisting of parts that only map to discontiguous loci, are thus disregarded in most analysis pipelines. There are, however, some well-known exceptions, in particular, tRNA splicing and circularized small RNAs in Archaea as well as self-splicing introns. Here, we reanalyze a series of published RNA-seq data sets, screening them specifically for non-contiguously mapping reads. We recover most of the known cases together with several novel archaeal ncRNAs associated with circularized products. In Eubacteria, only a handful of interesting candidates were obtained beyond a few previously described group I and group II introns. Most of the atypically mapping reads do not appear to correspond to well-defined, specifically processed products. Whether this diffuse background is, at least in part, an incidental by-product of prokaryotic RNA processing or whether it consists entirely of technical artifacts of reverse transcription or amplification remains unknown.
Telomerase is a ribonucleoprotein (RNP) enzyme essential for telomere maintenance and chromosome stability. While the catalytic telomerase reverse transcriptase (TERT) protein is well conserved across eukaryotes, telomerase RNA (TR) is extensively divergent in size, sequence, and structure. This diversity prohibits TR identification from many important organisms. Here we report a novel approach for TR discovery that combines in vitro TR enrichment from total RNA, next-generation sequencing, and a computational screening pipeline. With this approach, we have successfully identified TR from Strongylocentrotus purpuratus (purple sea urchin) from the phylum Echinodermata. Reconstitution of activity in vitro confirmed that this RNA is an integral component of sea urchin telomerase. Comparative phylogenetic analysis against vertebrate TR sequences revealed that the purple sea urchin TR contains vertebrate-like template-pseudoknot and H/ACA domains. While lacking a vertebrate-like CR4/5 domain, sea urchin TR has a unique central domain critical for telomerase activity. This is the first TR identified from the previously unexplored invertebrate clade and provides the first glimpse of TR evolution in the deuterostome lineage. Moreover, our TR discovery approach is a significant step toward the comprehensive understanding of telomerase RNP evolution.
The discovery of a living coelacanth specimen in 1938 was remarkable, as this lineage of lobe-finned fish was thought to have become extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues show the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
The Gram-negative plant-pathogenic bacterium Xanthomonas campestris pv. vesicatoria (Xcv) is an important model to elucidate the mechanisms involved in the interaction with the host. To gain insight into the transcriptome of the Xcv strain 85-10, we took a differential RNA sequencing (dRNA-seq) approach. Using a novel method to automatically generate comprehensive transcription start site (TSS) maps we report 1421 putative TSSs in the Xcv genome. Genes in Xcv exhibit a poorly conserved -10 promoter element and no consensus Shine-Dalgarno sequence. Moreover, 14% of all mRNAs are leaderless and 13% of them have unusually long 5'-UTRs. Northern blot analyses confirmed 16 intergenic small RNAs and seven cis-encoded antisense RNAs in Xcv. Expression of eight intergenic transcripts was controlled by HrpG and HrpX, key regulators of the Xcv type III secretion system. More detailed characterization identified sX12 as a small RNA that controls virulence of Xcv by affecting the interaction of the pathogen and its host plants. The transcriptional landscape of Xcv is unexpectedly complex, featuring abundant antisense transcripts, alternative TSSs and clade-specific small RNAs.
High-throughput sequencing methods allow whole transcriptomes to be sequenced fast and cost-effectively. Short RNA sequencing provides not only quantitative expression data but also an opportunity to identify novel coding and non-coding RNAs. Many long transcripts undergo post-transcriptional processing that generates short RNA sequence fragments. Mapped back to a reference genome, they form distinctive patterns that convey information on both the structure of the parent transcript and the modalities of its processing. The miR-miR* pattern from microRNA precursors is the best-known, but by no means singular, example.
Small non-coding RNAs (ncRNAs) such as microRNAs, snoRNAs and tRNAs are a diverse collection of molecules with several important biological functions. Current methods for high-throughput sequencing for the first time offer the opportunity to investigate the entire ncRNAome in an essentially unbiased way. However, there is a substantial need for methods that allow a convenient analysis of these overwhelmingly large data sets. Here, we present DARIO, a free web service for studying short read data from small RNA-seq experiments. It provides a wide range of analysis features, including quality control, read normalization, ncRNA quantification and prediction of putative ncRNA candidates. The DARIO web site can be accessed at http://dario.bioinf.uni-leipzig.de/.
Fast seed-based alignment heuristics such as BLAST and BLAT have become indispensable tools in comparative genomics for all studies aiming at the evolutionary relations of proteins, genes, and non-coding RNAs. This is true in particular for the large mammalian genomes. The sensitivity and specificity of these tools, however, crucially depend on parameters such as seed sizes or maximum expectation values. In settings that require high sensitivity, the number of short local match fragments easily becomes intractable. Fragment chaining is then a powerful means to quickly connect, score, and rank the fragments and thereby improve the specificity.
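The idea of fragment chaining can be sketched with a small dynamic program that selects the best-scoring set of colinear, non-overlapping local matches. This is a minimal quadratic-time illustration under simplifying assumptions (no gap penalties, fragments of positive length, a single strand), not the method of the abstract above; the `Fragment` layout and the function name `chain_fragments` are hypothetical.

```python
from typing import List, Tuple

# (query_start, query_end, ref_start, ref_end, score); end coordinates are exclusive.
Fragment = Tuple[int, int, int, int, float]


def chain_fragments(fragments: List[Fragment]) -> Tuple[float, List[Fragment]]:
    """Return the score and members of the best colinear chain (O(n^2) DP)."""
    # Sorting by query start guarantees that every possible predecessor
    # of a fragment appears before it in the list.
    frags = sorted(fragments, key=lambda f: (f[0], f[2]))
    n = len(frags)
    if n == 0:
        return 0.0, []
    best = [f[4] for f in frags]  # best chain score ending at each fragment
    prev = [-1] * n               # back-pointers for chain reconstruction
    for j in range(n):
        for i in range(j):
            # i may precede j only if it ends before j starts in both sequences.
            if frags[i][1] <= frags[j][0] and frags[i][3] <= frags[j][2]:
                candidate = best[i] + frags[j][4]
                if candidate > best[j]:
                    best[j] = candidate
                    prev[j] = i
    end = max(range(n), key=lambda k: best[k])
    chain = []
    k = end
    while k != -1:
        chain.append(frags[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

Ranking candidate loci by the returned chain score, rather than by the score of individual fragments, is what allows many weak but consistent matches to outweigh a single spurious hit.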
The advent of High Throughput Sequencing (HTS) methods opens new opportunities for the analysis of genomes and transcriptomes. While the sequencing of a whole mammalian genome took several years at the turn of this century, today it is only a matter of weeks. The race towards the thousand-dollar genome is fueled by the - ethically challenging - idea of personalized genomic medicine. However, these methods also allow new and interesting insights into many aspects such as the discovery of novel noncoding RNA classes, structural variants, or alternative splice sites, to name a few. Meanwhile, several methods for HTS have been introduced to the market. Here, an overview of the technologies and the bioinformatics analysis of HTS data is given.
Many aspects of the RNA maturation leave traces in RNA sequencing data in the form of deviations from the reference genomic DNA. This includes, in particular, genomically non-encoded nucleotides and chemical modifications. The latter leave their signatures in the form of mismatches and conspicuous patterns of sequencing reads. Modified mapping procedures focusing on particular types of deviations can help to unravel post-transcriptional modification, maturation and degradation processes. Here, we focus on small RNA sequencing data that is produced in large quantities aimed at the analysis of microRNA expression. Starting from the recovery of many well known modified sites in tRNAs, we provide evidence that modified nucleotides are a pervasive phenomenon in these data sets. Regarding non-encoded nucleotides we concentrate on CCA tails, which surprisingly can be found in a diverse collection of transcripts including sub-populations of mature microRNAs. Although small RNA sequencing libraries alone are insufficient to obtain a complete picture, they can inform on many aspects of the complex processes of RNA maturation.
A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (~1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.
Genome sequencing of Helicobacter pylori has revealed the potential proteins and genetic diversity of this prevalent human pathogen, yet little is known about its transcriptional organization and noncoding RNA output. Massively parallel cDNA sequencing (RNA-seq) has been revolutionizing global transcriptomic analysis. Here, using a novel differential approach (dRNA-seq) selective for the 5' end of primary transcripts, we present a genome-wide map of H. pylori transcriptional start sites and operons. We discovered hundreds of transcriptional start sites within operons, and opposite to annotated genes, indicating that complexity of gene expression from the small H. pylori genome is increased by uncoupling of polycistrons and by genome-wide antisense transcription. We also discovered an unexpected number of approximately 60 small RNAs including the epsilon-subdivision counterpart of the regulatory 6S RNA and associated RNA products, and potential regulators of cis- and trans-encoded target messenger RNAs. Our approach establishes a paradigm for mapping and annotating the primary transcriptomes of many living species.
Interferon beta has been approved for the treatment of multiple sclerosis (MS). It is believed that immunomodulatory rather than antiviral activity of interferon beta is responsible for disease amelioration. The impact of interferon beta on the chemoattraction of immune cells has not been fully addressed.
Small polydispersed circular DNA (spcDNA) belongs to the extrachromosomal pool of DNA and is composed of heterogeneous DNA circles. Whether spcDNA has a special function is currently unclear, but its occurrence has been suggested to be linked to genetic instability. In this study we investigated whether human lymphocytes from healthy volunteers also harbour spcDNA and whether spcDNA is present in all permanent cell lines from human normal and malignant tissues. Moreover, we were interested to see whether spcDNA contains sequences of mobile genetic elements. Our results show that spcDNA is present in all samples investigated, yet the amount is lower in normal lymphocytes than in cancer cell lines (5.4 vs. 17.8%). Alu sequences were present in 12/16 cancer cell lines whereas LINE-1 (L1) sequences were present in 15 of them. Six tumor cell lines also contained telomeric sequences. In contrast, spcDNA of normal lymphocytes contains Alu and L1 sequences only in 3/16 cases and no telomeric sequences at all. Our findings suggest a direct dependency of the amount of Alu and L1 sequences on that of spcDNA. Besides these repetitive sequences, sequencing of spcDNA revealed in most cases chromosomal sequences of almost all chromosomes without an increased frequency of single regions. We suggest that the whole spcDNA including retrotranspositional elements and telomeric sequences may play a role in chromosomal rearrangements and genomic instability.
MicroRNA-offset-RNAs (moRNAs) were recently detected as a highly abundant class of small RNAs in a basal chordate. Using short read sequencing data, we show here that moRNAs are also produced from human microRNA precursors, albeit at quite low expression levels. The expression levels of moRNAs are unrelated to those of the associated microRNAs. Surprisingly, microRNA precursors that also show moRNAs are typically evolutionarily old, comprising more than half of the microRNA families that were present in early Bilateria, while evidence for moRNAs was found only for a relatively small fraction of microRNA families of recent origin.
Vault RNAs (vtRNAs) are small, about 100 nt long, polymerase III transcripts contained in the vault particles of eukaryotic cells. Presumably due to their enigmatic function, they have received little attention compared with most other noncoding RNA (ncRNA) families. Their poor sequence conservation makes homology search a complex and tedious task even within vertebrates. Here we report on a systematic and comprehensive analysis of this rapidly evolving class of ncRNAs in deuterostomes, providing a comprehensive collection of computationally predicted vtRNA genes. We find that all previously described vtRNAs are located at a conserved genomic locus linked to the protocadherin gene cluster, an association that is conserved throughout gnathostomes. Lineage-specific expansions to small vtRNA gene clusters are frequently observed in this region. A second vtRNA locus is syntenically conserved across eutherian mammals. The vtRNAs at the two eutherian loci exhibit substantial differences in their promoter structures, explaining their differential expression patterns in several human cancer cell lines. In teleosts, expression of several paralogous vtRNA genes, most but not all located at the syntenically conserved protocadherin locus, was verified by reverse transcriptase-polymerase chain reaction.
With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While mismatches are the most frequent error type in Illumina reads, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
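What it means for a matching model to tolerate indels as well as mismatches can be illustrated with a plain semi-global edit distance, in which the read may start and end anywhere in the reference. This is a textbook dynamic program with unit costs, shown only to make the error model concrete; it is not the enhanced-suffix-array search used by segemehl, and the function name `best_semiglobal_edit_distance` is a hypothetical choice.

```python
from typing import Tuple


def best_semiglobal_edit_distance(read: str, reference: str) -> Tuple[int, int]:
    """Return (distance, end) of the best match of read inside reference.

    distance counts mismatches, insertions and deletions with unit cost;
    end is the exclusive reference position where the best match ends.
    """
    # Row for the empty read prefix: it matches anywhere at zero cost,
    # so the alignment may start at any reference position.
    prev = [0] * (len(reference) + 1)
    for i, r in enumerate(read, start=1):
        cur = [i] + [0] * len(reference)
        for j, g in enumerate(reference, start=1):
            cur[j] = min(
                prev[j - 1] + (r != g),  # match or mismatch
                prev[j] + 1,             # extra base in the read
                cur[j - 1] + 1,          # base missing from the read
            )
        prev = cur
    # The alignment may also end anywhere: take the minimum of the last row.
    end = min(range(len(reference) + 1), key=lambda j: prev[j])
    return prev[end], end
```

A read with a single missing base is thus reported with distance 1, whereas a mismatch-only model would have to explain the same read with several substitutions downstream of the deletion.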
Canonical microRNAs are excised from their hairpin-shaped precursors by Dicer. In order to find possible exceptions to this rule and to identify additional substrates for Dicer processing we re-evaluate the small RNA sequencing data of the Dicer knockdown experiment in MCF-7 cells originally published by Friedländer et al. [Friedländer et al., 2012, Nucleic Acids Res 40:37-52]. While the well-known non-Dicer mir-451 is not sufficiently expressed in these experiments, there are several additional Dicer-independent microRNAs, among them the important tumor suppressor mir-663a. We recover previously described examples of non-miRNA Dicer substrates such as tRNA-Gln and several snoRNAs. Interestingly, sdRNAs derived from box C/D snoRNAs are Dicer-independent, while those derived from box H/ACA snoRNAs are often Dicer-dependent. Several pol-III transcripts, in particular the vault RNAs and the great ape specific snaRs, are processed by Dicer, while the small RNAs originating from Y RNAs seem to be Dicer-independent.
Burkitt lymphoma is a mature aggressive B-cell lymphoma derived from germinal center B cells. Its cytogenetic hallmark is the Burkitt translocation t(8;14)(q24;q32) and its variants, which juxtapose the MYC oncogene with one of the three immunoglobulin loci. Consequently, MYC is deregulated, resulting in massive perturbation of gene expression. Nevertheless, MYC deregulation alone seems not to be sufficient to drive Burkitt lymphomagenesis. By whole-genome, whole-exome and transcriptome sequencing of four prototypical Burkitt lymphomas with immunoglobulin gene (IG)-MYC translocation, we identified seven recurrently mutated genes. One of these genes, ID3, mapped to a region of focal homozygous loss in Burkitt lymphoma. In an extended cohort, 36 of 53 molecularly defined Burkitt lymphomas (68%) carried potentially damaging mutations of ID3. These were strongly enriched at somatic hypermutation motifs. Only 6 of 47 other B-cell lymphomas with the IG-MYC translocation (13%) carried ID3 mutations. These findings suggest that cooperation between ID3 inactivation and IG-MYC translocation is a hallmark of Burkitt lymphomagenesis.
Telomerase is a ribonucleoprotein with an intrinsic telomerase RNA (TER) component. Within yeasts, TER is remarkably large and presents little similarity in secondary structure to vertebrate or ciliate TERs. To better understand the evolution of fungal telomerase, we identified 74 TERs from Pezizomycotina and Taphrinomycotina subphyla, sister clades to budding yeasts. We initially identified TER from Neurospora crassa using a novel deep-sequencing-based approach, and homologous TER sequences from available fungal genome databases by computational searches. Remarkably, TERs from these non-yeast fungi have many attributes in common with vertebrate TERs. Comparative phylogenetic analysis of highly conserved regions within Pezizomycotina TERs revealed two core domains nearly identical in secondary structure to the pseudoknot and CR4/5 within vertebrate TERs. We then analyzed N. crassa and Schizosaccharomyces pombe telomerase reconstituted in vitro, and showed that the two RNA core domains in both systems can reconstitute activity in trans as two separate RNA fragments. Furthermore, the primer-extension pulse-chase analysis affirmed that the reconstituted N. crassa telomerase synthesizes TTAGGG repeats with high processivity, a common attribute of vertebrate telomerase. Overall, this study reveals the common ancestral cores of vertebrate and fungal TERs, and provides insights into the molecular evolution of fungal TER structure and function.
Cytosine DNA methylation is one of the major epigenetic modifications and influences gene expression, developmental processes, X-chromosome inactivation, and genomic imprinting. Aberrant methylation is furthermore known to be associated with several diseases including cancer. The gold standard to determine DNA methylation on genome-wide scales is bisulfite sequencing: DNA fragments are treated with sodium bisulfite resulting in the conversion of unmethylated cytosines into uracils, whereas methylated cytosines remain unchanged. The resulting sequencing reads thus exhibit asymmetric bisulfite-related mismatches and suffer from an effective reduction of the alphabet size in the unmethylated regions, rendering the mapping of bisulfite sequencing reads computationally much more demanding. As a consequence, currently available read mapping software often fails to achieve high sensitivity and in many cases requires unrealistic computational resources to cope with large real-life datasets.
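The asymmetry of bisulfite-related mismatches can be made concrete with a small in silico sketch: unmethylated cytosines are converted (and read as T after amplification), so a T in the read over a C in the reference is compatible, while a C in the read over a T in the reference is a genuine mismatch. The function names `bisulfite_convert` and `bisulfite_mismatches` are hypothetical, only the forward strand is modelled, and the read and reference are assumed to be already aligned and of equal length.

```python
from typing import Set


def bisulfite_convert(seq: str, methylated: Set[int]) -> str:
    """Simulate bisulfite treatment of the forward strand.

    Unmethylated C becomes T; C at a position listed in `methylated` is kept.
    """
    return "".join(
        "T" if base == "C" and pos not in methylated else base
        for pos, base in enumerate(seq)
    )


def bisulfite_mismatches(read: str, reference: str) -> int:
    """Count mismatches, tolerating read T over reference C (asymmetric)."""
    return sum(
        1
        for r, g in zip(read, reference)
        if r != g and not (r == "T" and g == "C")
    )
```

Because every unmethylated C looks like a T, converted reads from such regions effectively use a three-letter alphabet, which increases the number of candidate mapping positions and explains the higher computational cost noted above.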