JoVE Visualize
Stop Reading. Start Watching.
Find video protocols related to scientific articles indexed in PubMed.
A new rhesus macaque assembly and annotation for next-generation sequencing analyses.
Biol. Direct
PUBLISHED: 07-18-2014
The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses.
Genomic features of a bumble bee symbiont reflect its host environment.
Appl. Environ. Microbiol.
PUBLISHED: 04-18-2014
Here, we report the genome of one gammaproteobacterial member of the gut microbiota, for which we propose the name "Candidatus Schmidhempelia bombi," that was inadvertently sequenced alongside the genome of its host, the bumble bee, Bombus impatiens. This symbiont is a member of the recently described bacterial order Orbales, which has been collected from the guts of diverse insect species; however, "Ca. Schmidhempelia" has been identified exclusively with bumble bees. Metabolic reconstruction reveals that "Ca. Schmidhempelia" lacks many genes for a functioning NADH dehydrogenase I, all genes for the high-oxygen cytochrome o, and most genes in the tricarboxylic acid (TCA) cycle. "Ca. Schmidhempelia" has retained NADH dehydrogenase II, the low-oxygen specific cytochrome bd, anaerobic nitrate respiration, mixed-acid fermentation pathways, and citrate fermentation, which may be important for survival in low-oxygen or anaerobic environments found in the bee hindgut. Additionally, a type 6 secretion system, a Flp pilus, and many antibiotic/multidrug transporters suggest complex interactions with its host and other gut commensals or pathogens. This genome has signatures of reduction (2.0 megabase pairs) and rearrangement, as previously observed for genomes of host-associated bacteria. A survey of wild and laboratory B. impatiens revealed that "Ca. Schmidhempelia" is present in 90% of individuals and, therefore, may provide benefits to its host.
Unique features of the loblolly pine (Pinus taeda L.) megagenome revealed through sequence annotation.
Genetics
PUBLISHED: 03-22-2014
The largest genus in the conifer family Pinaceae is Pinus, with over 100 species. The size and complexity of their genomes (~20-40 Gb, 2n = 24) have delayed the arrival of a well-annotated reference sequence. In this study, we present the annotation of the first whole-genome shotgun assembly of loblolly pine (Pinus taeda L.), which comprises 20.1 Gb of sequence. The MAKER-P annotation pipeline combined evidence-based alignments and ab initio predictions to generate 50,172 gene models, of which 15,653 are classified as high confidence. Clustering these gene models with 13 other plant species resulted in 20,646 gene families, of which 1554 are predicted to be unique to conifers. Among the conifer gene families, 159 are composed exclusively of loblolly pine members. The gene models for loblolly pine have the highest median and mean intron lengths of 24 fully sequenced plant genomes. Conifer genomes are full of repetitive DNA, with the most significant contributions from long-terminal-repeat retrotransposons. In-depth analysis of the tandem and interspersed repetitive content yielded a combined estimate of 82%.
Sequencing and assembly of the 22-Gb loblolly pine genome.
Genetics
PUBLISHED: 03-22-2014
Conifers are the predominant gymnosperm. The size and complexity of their genomes have presented formidable technical challenges for whole-genome shotgun sequencing and assembly. We employed novel strategies that allowed us to determine the loblolly pine (Pinus taeda) reference genome sequence, the largest genome assembled to date. Most of the sequence data were derived from whole-genome shotgun sequencing of a single megagametophyte, the haploid tissue of a single pine seed. Although that constrained the quantity of available DNA, the resulting haploid sequence data were well-suited for assembly. The haploid sequence was augmented with multiple linking long-fragment mate pair libraries from the parental diploid DNA. For the longest fragments, we used novel fosmid DiTag libraries. Sequences from the linking libraries that did not match the megagametophyte were identified and removed. Assembly of the sequence data was aided by condensing the enormous number of paired-end reads into a much smaller set of longer "super-reads," rendering subsequent assembly with an overlap-based assembly algorithm computationally feasible. To further improve the contiguity and biological utility of the genome sequence, additional scaffolding methods utilizing independent genome and transcriptome assemblies were implemented. The combination of these strategies resulted in a draft genome sequence of 20.15 billion bases, with an N50 scaffold size of 66.9 kbp.
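The N50 scaffold size quoted above is a standard contiguity statistic: the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. The Python sketch below shows one common way to compute it. The function name n50 and the toy scaffold lengths are illustrative assumptions, not values or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (290 bases in total, so half is 145).
toy_scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(toy_scaffolds))  # 80 + 70 = 150 >= 145, so this prints 70
```

Because the statistic depends only on the sorted lengths, it rewards a few long scaffolds and says nothing about whether those scaffolds are correctly assembled.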
Kraken: ultrafast metagenomic sequence classification using exact alignments.
Genome Biol.
PUBLISHED: 03-03-2014
Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.
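As a rough illustration of classification by exact k-mer matching, the Python sketch below indexes every k-mer of a few reference sequences and labels a read by the reference that shares the most unambiguous k-mers with it. It is a toy under stated assumptions and not Kraken's implementation: the reference sequences, the labels taxon_A and taxon_B, the value k = 4 and the simple majority vote are all hypothetical choices made for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the set of labels whose reference contains it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k):
    """Label a read by majority vote over its exactly matching k-mers.

    Only k-mers found in a single reference vote (a simplification
    assumed for this sketch). Returns None if nothing matches.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        labels = index.get(kmer)
        if labels and len(labels) == 1:
            votes[next(iter(labels))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy references; real databases hold whole genomes.
references = {"taxon_A": "ACGTACGTTGCA", "taxon_B": "TTGGCCAATTGG"}
index = build_index(references, k=4)
print(classify("CGTACGTT", index, k=4))  # taxon_A
print(classify("GGCCAATT", index, k=4))  # taxon_B
print(classify("AAAAAAAA", index, k=4))  # None
```

Each k-mer lookup is a single dictionary access, which is why exact matching can be much faster than computing alignments.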
The MaSuRCA genome assembler.
Bioinformatics
PUBLISHED: 08-29-2013
Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer super-reads. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced mazurka).
Open access to tree genomes: the path to a better forest.
Genome Biol.
PUBLISHED: 06-24-2013
An open-access culture and a well-developed comparative-genomics infrastructure must be developed in forest trees to derive the full potential of genome sequencing in this diverse group of plants that are the dominant species in much of the earth's terrestrial ecosystems.
GAGE-B: an evaluation of genome assemblers for bacterial organisms.
Bioinformatics
PUBLISHED: 05-10-2013
A large and rapidly growing number of bacterial organisms have been sequenced by the newest sequencing technologies. Cheaper and faster sequencing technologies make it easy to generate very high coverage of bacterial genomes, but these advances mean that DNA preparation costs can exceed the cost of sequencing for small genomes. The need to contain costs often results in the creation of only a single sequencing library, which in turn introduces new challenges for genome assembly methods.
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions.
Genome Biol.
PUBLISHED: 04-25-2013
TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.
EDGE-pro: Estimated Degree of Gene Expression in Prokaryotic Genomes.
Evol. Bioinform. Online
PUBLISHED: 03-10-2013
The expression levels of bacterial genes can be measured directly using next-generation sequencing (NGS) methods, offering much greater sensitivity and accuracy than earlier, microarray-based methods. Most bioinformatics software for estimating levels of gene expression from NGS data has been designed for eukaryotic genomes, with algorithms focusing particularly on detection of splicing patterns. These methods do not perform well on bacterial genomes.
Insights into the loblolly pine genome: characterization of BAC and fosmid sequences.
PLoS ONE
PUBLISHED: 01-01-2013
Despite their prevalence and importance, the genome sequences of loblolly pine, Norway spruce, and white spruce, three ecologically and economically important conifer species, are just becoming available to the research community. Following the completion of these large assemblies, annotation efforts will be undertaken to characterize the reference sequences. Accurate annotation of these ancient genomes would be aided by a comprehensive repeat library; however, few studies have generated enough sequence to fully evaluate and catalog their non-genic content. In this paper, two sets of loblolly pine genomic sequence, 103 previously assembled BACs and 90,954 newly sequenced and assembled fosmid scaffolds, were analyzed. Together, this sequence represents 280 Mbp (roughly 1% of the loblolly pine genome) and one of the most comprehensive studies of repetitive elements and genes in a gymnosperm species. A combination of homology and de novo methodologies was applied to identify both conserved and novel repeats. Similarity analysis estimated a repetitive content of 27% that included both full and partial elements. When combined with the de novo investigation, the estimate increased to almost 86%. Over 60% of the repetitive sequence consists of full or partial LTR (long terminal repeat) retrotransposons. Through de novo approaches, 6,270 novel, full-length transposable element families and 9,415 sub-families were identified. Among those 6,270 families, 82% were annotated as single-copy. Several of the novel, high-copy families are described here, with the largest, PtPiedmont, comprising 133 full-length copies. In addition to repeats, analysis of the coding region reported 23 full-length eukaryotic orthologous proteins (KOGs) and another 29 novel or orthologous genes. These discoveries, along with other genomic resources, will be used to annotate conifer genomes and address long-standing questions about gymnosperm evolution.
Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies.
Brief. Bioinformatics
PUBLISHED: 12-23-2011
Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.
Repetitive DNA and next-generation sequencing: computational challenges and solutions.
Nat. Rev. Genet.
PUBLISHED: 11-29-2011
Repetitive DNA sequences are abundant in a broad range of species, from bacteria to mammals, and they cover nearly half of the human genome. Repeats have always presented technical challenges for sequence alignment and assembly programs. Next-generation sequencing projects, with their short read lengths and high data volumes, have made these challenges more difficult. From a computational perspective, repeats create ambiguities in alignment and assembly, which, in turn, can produce biases and errors when interpreting results. Simply ignoring repeats is not an option, as this creates problems of its own and may mean that important biological phenomena are missed. We discuss the computational problems surrounding repeats and describe strategies used by current bioinformatics systems to solve them.
Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering.
Nucleic Acids Res.
PUBLISHED: 11-18-2011
Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.
FLASH: fast length adjustment of short reads to improve genome assemblies.
Bioinformatics
PUBLISHED: 09-07-2011
Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome.
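The idea of overlapping and merging a read pair can be sketched in a few lines of Python. When the fragment is shorter than twice the read length, the end of the forward read and the start of the reverse-complemented mate cover the same bases, so the pair can be joined into one longer read. The sketch below is a simplified illustration, not FLASH's algorithm: the function names, the min_overlap and max_mismatch_ratio parameters and their defaults, and the rule of keeping the forward read's bases at mismatches are all assumptions made for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a read pair into one sequence, or return None.

    read2 is given as sequenced (from the opposite strand), so it is
    reverse-complemented first. Every candidate overlap length is
    scored by its mismatch ratio and the lowest ratio wins.
    """
    mate = reverse_complement(read2)
    best = None  # (mismatch_ratio, overlap_length)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / overlap
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, overlap)
    if best is None:
        return None
    # Keep read1 in full and append the part of the mate beyond the overlap.
    return read1 + mate[best[1]:]


# Hypothetical 29-base fragment sequenced with 20-base paired-end reads.
fragment = "ACGTTGCATGCCGATAGCTAGGCTAACGT"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
print(merge_pair(read1, read2) == fragment)  # True
```

A production tool would also use base quality values to resolve mismatches inside the overlap, which this sketch omits.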
Two new complete genome sequences offer insight into host and tissue specificity of plant pathogenic Xanthomonas spp.
J. Bacteriol.
PUBLISHED: 07-22-2011
Xanthomonas is a large genus of bacteria that collectively cause disease on more than 300 plant species. The broad host range of the genus contrasts with stringent host and tissue specificity for individual species and pathovars. Whole-genome sequences of Xanthomonas campestris pv. raphani strain 756C and X. oryzae pv. oryzicola strain BLS256, pathogens that infect the mesophyll tissue of the leading models for plant biology, Arabidopsis thaliana and rice, respectively, were determined and provided insight into the genetic determinants of host and tissue specificity. Comparisons were made with genomes of closely related strains that infect the vascular tissue of the same hosts and across a larger collection of complete Xanthomonas genomes. The results suggest a model in which complex sets of adaptations at the level of gene content account for host specificity and subtler adaptations at the level of amino acid or noncoding regulatory nucleotide sequence determine tissue specificity.
Detection of lineage-specific evolutionary changes among primate species.
BMC Bioinformatics
PUBLISHED: 05-25-2011
Comparison of the human genome with other primates offers the opportunity to detect evolutionary events that created the diverse phenotypes among the primate species. Because the primate genomes are highly similar to one another, methods developed for analysis of more divergent species do not always detect signs of evolutionary selection.
TopHat-Fusion: an algorithm for discovery of novel fusion transcripts.
Genome Biol.
PUBLISHED: 05-19-2011
TopHat-Fusion is an algorithm designed to discover transcripts representing fusion gene products, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome. TopHat-Fusion is an enhanced version of TopHat, an efficient program that aligns RNA-seq reads without relying on existing annotation. Because it is independent of gene annotation, TopHat-Fusion can discover fusion products deriving from known genes, unknown genes and unannotated splice variants of known genes. Using RNA-seq data from breast and prostate cancer cell lines, we detected both previously reported and novel fusions with solid supporting evidence. TopHat-Fusion is available at http://tophat-fusion.sourceforge.net/.
Complete Columbian mammoth mitogenome suggests interbreeding with woolly mammoths.
Genome Biol.
PUBLISHED: 04-01-2011
Late Pleistocene North America hosted at least two divergent and ecologically distinct species of mammoth: the periglacial woolly mammoth (Mammuthus primigenius) and the subglacial Columbian mammoth (Mammuthus columbi). To date, mammoth genetic research has been entirely restricted to woolly mammoths, rendering their genetic evolution difficult to contextualize within broader Pleistocene paleoecology and biogeography. Here, we take an interspecific approach to clarifying mammoth phylogeny by targeting Columbian mammoth remains for mitogenomic sequencing.
Improving pan-genome annotation using whole genome multiple alignment.
BMC Bioinformatics
PUBLISHED: 03-14-2011
Rapid annotation and comparison of genomes from multiple isolates (pan-genomes) are becoming commonplace due to advances in sequencing technology. Genome annotations can contain inconsistencies and errors that hinder comparative analysis even within a single species. Tools are needed to compare and improve annotation quality across sets of closely related genomes.
Bacillus anthracis comparative genome analysis in support of the Amerithrax investigation.
Proc. Natl. Acad. Sci. U.S.A.
PUBLISHED: 03-07-2011
Before the anthrax letter attacks of 2001, the developing field of microbial forensics relied on microbial genotyping schemes based on a small portion of a genome sequence. Amerithrax, the investigation into the anthrax letter attacks, applied high-resolution whole-genome sequencing and comparative genomics to identify key genetic features of the letters' Bacillus anthracis Ames strain. During systematic microbiological analysis of the spore material from the letters, we identified a number of morphological variants based on phenotypic characteristics and the ability to sporulate. The genomes of these morphological variants were sequenced and compared with that of the B. anthracis Ames ancestor, the progenitor of all B. anthracis Ames strains. Through comparative genomics, we identified four distinct loci with verifiable genetic mutations. Three of the four mutations could be directly linked to sporulation pathways in B. anthracis and more specifically to the regulation of the phosphorylation state of Spo0F, a key regulatory protein in the initiation of the sporulation cascade, thus linking phenotype to genotype. None of these variant genotypes were identified in single-colony environmental B. anthracis Ames isolates associated with the investigation. These genotypes were identified only in B. anthracis morphotypes isolated from the letters, indicating that the variants were not prevalent in the environment, not even the environments associated with the investigation. This study demonstrates the forensic value of systematic microbiological analysis combined with whole-genome sequencing and comparative genomics.
Genome assembly has a major impact on gene content: a comparison of annotation in two Bos taurus assemblies.
PLoS ONE
PUBLISHED: 03-04-2011
Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12-20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6-15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.
Mugsy: fast multiple alignment of closely related whole genomes.
Bioinformatics
PUBLISHED: 12-09-2010
The relative ease and low cost of current generation sequencing technologies have led to a dramatic increase in the number of sequenced genomes for species across the tree of life. This increasing volume of data requires tools that can quickly compare multiple whole-genome sequences, millions of base pairs in length, to aid in the study of populations, pan-genomes, and genome evolution.
COMBREX: a project to accelerate the functional annotation of prokaryotic genomes.
Nucleic Acids Res.
PUBLISHED: 11-21-2010
COMBREX (http://combrex.bu.edu) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists and a mechanism for experimental biochemists to bid for the validation of those predictions. Small grants are available to support successful bids.
Do-it-yourself genetic testing.
Genome Biol.
PUBLISHED: 10-07-2010
We developed a computational screen that tests an individual's genome for mutations in the BRCA genes, despite the fact that both are currently protected by patents.
Genome sequence of the dioxin-mineralizing bacterium Sphingomonas wittichii RW1.
J. Bacteriol.
PUBLISHED: 09-10-2010
Pollutants such as polychlorinated biphenyls and dioxins pose a serious threat to human and environmental health. Natural attenuation of these compounds by microorganisms provides one promising avenue for their removal from contaminated areas. Over the past 2 decades, studies of the bacterium Sphingomonas wittichii RW1 have provided a wealth of knowledge about how bacteria metabolize chlorinated aromatic hydrocarbons. Here we describe the finished genome sequence of S. wittichii RW1 and major findings from its annotation.
Quake: quality-aware detection and correction of sequencing errors.
Genome Biol.
PUBLISHED: 09-07-2010
We introduce Quake, a program to detect and correct errors in DNA sequencing reads. Using a maximum likelihood approach incorporating quality values and nucleotide specific miscall rates, Quake achieves the highest accuracy on realistically simulated reads. We further demonstrate substantial improvements in de novo assembly and SNP detection after using Quake. Quake can be used for any size project, including more than one billion human reads, and is freely available as open source software from http://www.cbcb.umd.edu/software/quake.
Recent advances in RNA sequence analysis.
F1000 Biol Rep
PUBLISHED: 08-19-2010
The latest high-throughput DNA sequencing technology can now be applied on a large scale to capture the complete set of mRNA transcripts in a cell, using a technique called RNA-seq. Although RNA-seq is only 2 years old, it has rapidly swept through the field of genomics, and it is now being used to analyze the transcriptomes of organisms ranging from bacteria to primates. The depth of sequencing allows researchers to quantify the level of expression of genes, to discover alternative isoforms in eukaryotic species, and even to characterize the operon structure of bacterial genomes.
Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis.
PLoS Biol.
PUBLISHED: 07-27-2010
A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (~1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.
The genome of woodland strawberry (Fragaria vesca).
Nat. Genet.
PUBLISHED: 06-09-2010
The woodland strawberry, Fragaria vesca (2n = 2x = 14), is a versatile experimental plant system. This diminutive herbaceous perennial has a small genome (240 Mb), is amenable to genetic transformation and shares substantial sequence identity with the cultivated strawberry (Fragaria × ananassa) and other economically important rosaceous plants. Here we report the draft F. vesca genome, which was sequenced to ×39 coverage using second-generation technology, assembled de novo and then anchored to the genetic linkage map into seven pseudochromosomes. This diploid strawberry sequence lacks the large genome duplications seen in other rosids. Gene prediction modeling identified 34,809 genes, with most being supported by transcriptome mapping. Genes critical to valuable horticultural traits including flavor, nutritional value and flowering time were identified. Macrosyntenic relationships between Fragaria and Prunus predict a hypothetical ancestral Rosaceae genome that had nine chromosomes. New phylogenetic analysis of 154 protein-coding genes suggests that assignment of Populus to Malvidae, rather than Fabidae, is warranted.
Assembly of large genomes using second-generation sequencing.
Genome Res.
PUBLISHED: 05-27-2010
Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.
Clustering metagenomic sequences with interpolated Markov models.
BMC Bioinformatics
PUBLISHED: 05-13-2010
Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects.
Related JoVE Video
Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification.
BMC Genomics
PUBLISHED: 05-10-2010
Bacterial pathogens often show significant intraspecific variations in ecological fitness, host preference and pathogenic potential to cause infectious disease. The species of Listeria monocytogenes, a facultative intracellular pathogen and the causative agent of human listeriosis, consists of at least three distinct genetic lineages. Two of these lineages predominantly cause human sporadic and epidemic infections, whereas the third lineage has never been implicated in human disease outbreaks despite its overall conservation of many known virulence factors.
Related JoVE Video
Between a chicken and a grape: estimating the number of human genes.
Genome Biol.
PUBLISHED: 05-05-2010
Many people expected the question "How many genes in the human genome?" to be resolved with the publication of the genome sequence in 2001, but estimates continue to fluctuate.
Related JoVE Video
Detection and correction of false segmental duplications caused by genome mis-assembly.
Genome Biol.
PUBLISHED: 03-10-2010
Diploid genomes with divergent chromosomes present special problems for assembly software as two copies of especially polymorphic regions may be mistakenly constructed, creating the appearance of a recent segmental duplication. We developed a method for identifying such false duplications and applied it to four vertebrate genomes. For each genome, we corrected mis-assemblies, improved estimates of the amount of duplicated sequence, and recovered polymorphisms between the sequenced chromosomes.
Related JoVE Video
Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.
Nat. Biotechnol.
PUBLISHED: 02-02-2010
High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.
Related JoVE Video
Searching for SNPs with cloud computing.
Genome Biol.
PUBLISHED: 09-30-2009
As DNA sequencing outpaces improvements in computer speed, there is a critical need to accelerate tasks like alignment and SNP calling. Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp. Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85. Crossbow is available from http://bowtie-bio.sourceforge.net/crossbow/.
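The figures quoted in the abstract above imply a rough cost per CPU-hour. The arithmetic below is a minimal sketch that uses only the stated numbers (320 CPUs, three hours, about $85); the variable names are illustrative, and the result is approximate because the abstract gives the price only as "about $85".

```python
# Figures taken from the abstract; the dollar amount is approximate.
cpus = 320          # size of the rented cluster
hours = 3           # wall-clock time for the analysis
cost_usd = 85.0     # "about $85"

cpu_hours = cpus * hours                  # total compute consumed
cost_per_cpu_hour = cost_usd / cpu_hours  # rough price per CPU-hour

print(f"{cpu_hours} CPU-hours at roughly ${cost_per_cpu_hour:.3f} per CPU-hour")
```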
Related JoVE Video
2009 Swine-origin influenza A (H1N1) resembles previous influenza isolates.
PLoS ONE
PUBLISHED: 06-03-2009
In April 2009, novel swine-origin influenza viruses (S-OIV) were identified in patients from Mexico and the United States. The viruses were genetically characterized as a novel influenza A (H1N1) strain originating in swine, and within a very short time the S-OIV strain spread across the globe via human-to-human contact.
Related JoVE Video
How to map billions of short reads onto genomes.
Nat. Biotechnol.
PUBLISHED: 05-12-2009
Mapping the vast quantities of short sequence fragments produced by next-generation sequencing platforms is a challenge. What programs are available and how do they work?
Related JoVE Video
Insignia: a DNA signature search web server for diagnostic assay development.
Nucleic Acids Res.
PUBLISHED: 05-05-2009
Insignia is a web application for the rapid identification of unique DNA signatures. DNA signatures are distinct nucleotide sequences that can be used to detect the presence of certain organisms and to distinguish those organisms from all other species. These signatures can be used as the basis for diagnostic assays to detect and genotype microbes in both environmental and clinical samples. Insignia identifies an exhaustive set of accurate DNA signatures for any set of target genomes, and screens these signatures against a comprehensive background that includes all sequenced bacteria and viruses, the human genome, and many other animals and plants. Identified signatures may be browsed by genomic location or proximal genes, filtered by composition, viewed in a genome browser or directly downloaded. Integrated PCR primer design is also provided for each signature. The Insignia website (http://insignia.cbcb.umd.edu) is free and open to all users and there is no login requirement. In addition, the source code for the computational pipeline is freely available.
Related JoVE Video
Efficient oligonucleotide probe selection for pan-genomic tiling arrays.
BMC Bioinformatics
PUBLISHED: 03-30-2009
Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome.
Related JoVE Video
Genome sequence of the Wolbachia endosymbiont of Culex quinquefasciatus JHB.
J. Bacteriol.
PUBLISHED: 03-24-2009
Wolbachia species are endosymbionts of a wide range of invertebrates, including mosquitoes, fruit flies, and nematodes. The wPip strains can cause cytoplasmic incompatibility in some strains of the Culex mosquito. Here we describe the genome sequence of a Wolbachia strain that was discovered in the whole-genome sequencing data for the mosquito Culex quinquefasciatus strain JHB.
Related JoVE Video
TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics
PUBLISHED: 03-16-2009
A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or reads, can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
Related JoVE Video
Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.
Genome Biol.
PUBLISHED: 03-04-2009
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source (http://bowtie.cbcb.umd.edu).
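The Burrows-Wheeler indexing mentioned in the abstract above can be illustrated with a toy exact-match search. This is a minimal sketch and not Bowtie's implementation: the function names `build_fm_index` and `find_exact` are hypothetical, the suffix array is built naively (fine for a short string, impractical for a genome), and the quality-aware backtracking that permits mismatches is left out entirely.

```python
def build_fm_index(text):
    """Build a toy Burrows-Wheeler index for `text` (illustrative only)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in text strictly smaller than c.
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, first, occ


def find_exact(pattern, sa, first, occ):
    """Return sorted start positions of exact matches via backward search."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, first, occ = build_fm_index("GATTACAGATTACA")
hits = find_exact("TTAC", sa, first, occ)
print(hits)
```

Each character of the read narrows the range of matching suffixes in one step, so search time depends on the read length rather than the genome length. A practical index would also sample the occurrence table and suffix array to save memory, which this sketch does not attempt.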
Related JoVE Video
OperonDB: a comprehensive database of predicted operons in microbial genomes.
Nucleic Acids Res.
PUBLISHED: 03-04-2009
The fast pace of bacterial genome sequencing and the resulting dependence on highly automated annotation methods has driven the development of many genome-wide analysis tools. OperonDB, first released in 2001, is a database containing the results of a computational algorithm for locating operon structures in microbial genomes. OperonDB has grown from 34 genomes in its initial release to more than 500 genomes today. In addition to increasing the size of the database, we have re-designed our operon finding algorithm and improved its accuracy. The new database is updated regularly as additional genomes become available in public archives. OperonDB can be accessed at: http://operondb.cbcb.umd.edu.
Related JoVE Video
Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models.
Nat. Methods
PUBLISHED: 03-02-2009
Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data, that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.
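Composition-based classification of the kind described above can be sketched with a much simpler model than the one the abstract names. The example below uses a fixed-order Markov model with add-one smoothing, not an interpolated Markov model, and it is not Phymm's code; the function names, the order of 3, and the tiny training strings are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(genome, order=3):
    """Count which base follows each length-`order` context in `genome`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(genome)):
        counts[genome[i - order:i]][genome[i]] += 1
    return counts


def log_score(read, counts, order=3, alphabet="ACGT"):
    """Log-probability of `read` under the model, with add-one smoothing."""
    score = 0.0
    for i in range(order, len(read)):
        ctx = counts.get(read[i - order:i], {})
        total = sum(ctx.values())
        score += math.log((ctx.get(read[i], 0) + 1) / (total + len(alphabet)))
    return score


def classify(read, models, order=3):
    """Assign the read to the model that gives it the highest score."""
    return max(models, key=lambda name: log_score(read, models[name], order))


# Hypothetical toy "genomes"; real training data would be complete genomes.
models = {
    "at_rich": train_markov("ATATATATTTAAATATATAT" * 5),
    "gc_rich": train_markov("GCGCGGCCGCGCGGGCCCGC" * 5),
}
print(classify("ATATTTAA", models), classify("GCGGCC", models))
```

An interpolated model would blend predictions from several context lengths instead of committing to a single order, which matters when longer contexts are rarely observed in the training genome.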
Related JoVE Video
The complete genome sequence of Bacillus anthracis Ames "Ancestor".
J. Bacteriol.
PUBLISHED: 02-04-2009
The pathogenic bacterium Bacillus anthracis has become the subject of intense study as a result of its use in a bioterrorism attack in the United States in September and October 2001. Previous studies suggested that B. anthracis Ames Ancestor, the original Ames fully virulent plasmid-containing isolate, was the ideal reference. This study describes the complete genome sequence of that original isolate, derived from a sample kept in cold storage since 1981.
Related JoVE Video
The genome of the blood fluke Schistosoma mansoni.
Nature
PUBLISHED: 01-18-2009
Schistosoma mansoni is responsible for the neglected tropical disease schistosomiasis that affects 210 million people in 76 countries. Here we present analysis of the 363 megabase nuclear genome of the blood fluke. It encodes at least 11,809 genes, with an unusual intron size distribution, and new families of micro-exon genes that undergo frequent alternative splicing. As the first sequenced flatworm, and a representative of the Lophotrochozoa, it offers insights into early events in the evolution of the animals, including the development of a body pattern with bilateral symmetry, and the development of tissues into organs. Our analysis has been informed by the need to find new drug targets. The deficits in lipid metabolism that make schistosomes dependent on the host are revealed, and the identification of membrane receptors, ion channels and more than 300 proteases provide new insights into the biology of the life cycle and new targets. Bioinformatics approaches have identified metabolic chokepoints, and a chemogenomic screen has pinpointed schistosome proteins for which existing drugs may be active. The information generated provides an invaluable resource for the research community to develop much needed new control tools for the treatment and eradication of this important and neglected disease.
Related JoVE Video
A whole-genome assembly of the domestic cow, Bos taurus.
Genome Biol.
PUBLISHED: 01-07-2009
The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods.
Related JoVE Video
Thousands of missed genes found in bacterial genomes and their analysis with COMBREX.
Biol. Direct
The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST.
Related JoVE Video
Mis-assembled "segmental duplications" in two versions of the Bos taurus genome.
PLoS ONE
We analyzed the whole genome sequence coverage in two versions of the Bos taurus genome and identified all regions longer than five kilobases (Kbp) that are duplicated within chromosomes with >99% sequence fidelity in both copies. We call these regions High Fidelity Duplications (HFDs). The two assemblies were Btau 4.2, produced by the Human Genome Sequencing Center at Baylor College of Medicine, and UMD Bos taurus 3.1 (UMD 3.1), produced by our group at the University of Maryland. We found that Btau 4.2 has a far greater number of HFDs, 3111 versus only 69 in UMD 3.1. Read coverage analysis shows that 39 million base pairs (Mbp) of sequence in HFDs in Btau 4.2 appear to be a result of a mis-assembly and therefore cannot be qualified as segmental duplications. UMD 3.1 has only 0.41 Mbp of sequence in HFDs that are due to a mis-assembly.
Related JoVE Video
Fast gapped-read alignment with Bowtie 2.
Nat. Methods
As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.
Related JoVE Video
Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.
Nat Protoc
Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time.
Related JoVE Video
Diamund: Direct Comparison of Genomes to Detect Mutations.
Hum. Mutat.
DNA sequencing has become a powerful method to discover the genetic basis of disease. Standard, widely-used protocols for analysis usually begin by comparing each individual to the human reference genome. When applied to a set of related individuals, this approach reveals millions of differences, most of which are shared among the individuals and unrelated to the disease being investigated. We have developed a novel algorithm for variant detection, one that compares DNA sequences directly to one another, without aligning them to the reference genome. When used to find de novo mutations in exome sequences from family trios, or to compare normal and diseased samples from the same individual, the new method, Diamund, produces a dramatically smaller list of candidate mutations than previous methods, without losing sensitivity to detect the true cause of a genetic disease. We demonstrate our results on several example cases, including two family trios in which it correctly found the disease-causing variant while excluding thousands of harmless variants that standard methods had identified.
Related JoVE Video

What is Visualize?

JoVE Visualize is a tool created to match the last 5 years of PubMed publications to methods in JoVE's video library.

How does it work?

We use abstracts found on PubMed and match them to JoVE videos to create a list of 10 to 30 related methods videos.

Video X seems to be unrelated to Abstract Y...

In developing our video relationships, we compare around 5 million PubMed articles to our library of over 4,500 methods videos. In some cases the language used in the PubMed abstracts makes matching that content to a JoVE video difficult. In other cases, there happens not to be any content in our video library that is relevant to the topic of a given abstract. In these cases, our algorithms are trying their best to display videos with relevant content, which can sometimes result in matched videos with only a slight relation.