A number of statistical phylogenetic methods have been developed to infer conserved functional sites or regions in proteins. Many methods, e.g. Rate4Site, apply the standard phylogenetic models to infer site-specific substitution rates and totally ignore the spatial correlation of substitution rates in protein tertiary structures, which may reduce their power to identify conserved functional patches in protein tertiary structures when the sequences used in the analysis are highly similar. The 3D sliding window method has been proposed to infer conserved functional patches in protein tertiary structures, but the window size, which reflects the strength of the spatial correlation, must be predefined and is not inferred from data. We recently developed GP4Rate to solve these problems under the Bayesian framework. Unfortunately, GP4Rate is computationally slow. Here we present an intuitive web server, FuncPatch, to perform a fast approximate Bayesian inference of conserved functional patches in protein tertiary structures.
Many bacteria carry two or more chromosome-like replicons. This occurs in pathogens such as Vibrio cholerae and Brucella abortus as well as in many N2-fixing plant symbionts including all isolates of the alfalfa root-nodule bacteria Sinorhizobium meliloti. Understanding the evolution and role of this multipartite genome organization will provide significant insight into these important organisms; yet this knowledge remains incomplete, in part, because technical challenges of large-scale genome manipulations have limited experimental analyses. The distinct evolutionary histories and characteristics of the three replicons that constitute the S. meliloti genome (the chromosome (3.65 Mb), pSymA megaplasmid (1.35 Mb), and pSymB chromid (1.68 Mb)) make this a good model to examine this topic. We transferred essential genes from pSymB into the chromosome, and constructed strains that lack pSymB as well as both pSymA and pSymB. This is the largest reduction (45.4%, 3.04 megabases, 2866 genes) of a prokaryotic genome to date and the first removal of an essential chromid. Strikingly, strains lacking pSymA and pSymB (ΔpSymAB) lost the ability to utilize 55 of 74 carbon sources and various sources of nitrogen, phosphorus and sulfur, yet the ΔpSymAB strain grew well in minimal salts media and in sterile soil. This suggests that the core chromosome is sufficient for growth in a bulk soil environment and that the pSymA and pSymB replicons carry genes with more specialized functions such as growth in the rhizosphere and interaction with the plant. These experimental data support a generalized evolutionary model, in which non-chromosomal replicons primarily carry genes with more specialized functions. These large secondary replicons increase the organism's niche range, which offsets their metabolic burden on the cell (e.g. pSymA). Subsequent co-evolution with the chromosome then leads to the formation of a chromid through the acquisition of functions core to all niches (e.g. pSymB).
Previous studies have found that DNA-flanking low-complexity regions (LCRs) have an increased substitution rate. Here, the substitution rate was confirmed to increase in the vicinity of LCRs in several primate species, including humans. This effect was also found among human sequences from the 1000 Genomes Project. A strong correlation was found between average substitution rate per site and distance from the LCR, as well as the proportion of genes with gaps in the alignment at each site and distance from the LCR. Along with substitution rates, dN/dS ratios were also determined for each site, and the proportion of sites undergoing negative selection was found to have a negative relationship with distance from the LCR.
Yersinia pestis has caused at least three human plague pandemics. The second (Black Death, 14-17th centuries) and third (19-20th centuries) have been genetically characterised, but there is only a limited understanding of the first pandemic, the Plague of Justinian (6-8th centuries). To address this gap, we sequenced and analysed draft genomes of Y. pestis obtained from two individuals who died in the first pandemic.
Linking environmental, socioeconomic and health datasets provides new insights into the potential associations between climate change and human health and wellbeing, and underpins the development of decision support tools that will promote resilience to climate change, and thus enable more effective adaptation. This paper outlines the challenges and opportunities presented by advances in data collection, storage, analysis, and access, particularly focusing on "data mashups". These data mashups are integrations of different types and sources of data, frequently using open application programming interfaces and data sources, to produce enriched results that were not necessarily the original reason for assembling the raw source data. As an illustration of this potential, this paper describes a recently funded initiative to create such a facility in the UK for use in decision support around climate change and health, and provides examples of suitable sources of data and the purposes to which they can be directed, particularly for policy makers and public health decision makers.
In the 19th century, there were several major cholera pandemics in the Indian subcontinent, Europe, and North America. The causes of these outbreaks and the genomic strain identities remain a mystery. We used targeted high-throughput sequencing to reconstruct the Vibrio cholerae genome from the preserved intestine of a victim of the 1849 cholera outbreak in Philadelphia, part of the second cholera pandemic. This O1 biotype strain has 95 to 97% similarity with the classical O395 genome, differing by 203 single-nucleotide polymorphisms (SNPs), lacking three genomic islands, and probably having one or more tandem cholera toxin prophage (CTX) arrays, which potentially affected its virulence. This result highlights archived medical remains as a potential resource for investigations into the genomic origins of past pandemics.
A whole-genome sequencing technique developed to identify fast neutron-induced deletion mutations revealed that iap1-1 is a new allele of EDS5 (eds5-5). RPS2-AvrRpt2-initiated effector-triggered immunity (ETI) was compromised in iap1-1/eds5-5 with respect to in planta bacterial levels and the hypersensitive response, while intra- and intercellular free salicylic acid (SA) accumulation was greatly reduced, suggesting that SA contributes as both an intracellular signaling molecule and an antimicrobial agent in the intercellular space during ETI. During the compatible interaction between wild-type Col-0 and virulent Pseudomonas syringae pv. tomato (Pst), little intercellular free SA accumulated, which led to the hypothesis that Pst suppresses intercellular SA accumulation. When Col-0 was inoculated with a coronatine-deficient strain of Pst, high levels of intercellular SA accumulation were observed, suggesting that Pst suppresses intercellular SA accumulation using its phytotoxin coronatine. This work suggests that accumulation of SA in the intercellular space is an important component of basal/PAMP-triggered immunity as well as ETI to pathogens that colonize the intercellular space.
A critical question in biology is the identification of functionally important amino acid sites in proteins. Because functionally important sites are under stronger purifying selection, site-specific substitution rates tend to be lower than usual at these sites. A large number of phylogenetic models have been developed to estimate site-specific substitution rates in proteins and the extraordinarily low substitution rates have been used as evidence of function. Most of the existing tools, e.g. Rate4Site, assume that site-specific substitution rates are independent across sites. However, site-specific substitution rates may be strongly correlated in the protein tertiary structure, since functionally important sites tend to be clustered together to form functional patches. We have developed a new model, GP4Rate, which incorporates the Gaussian process model with the standard phylogenetic model to identify slowly evolved regions in protein tertiary structures. GP4Rate uses the Gaussian process to define a nonparametric prior distribution of site-specific substitution rates, which naturally captures the spatial correlation of substitution rates. Simulations suggest that GP4Rate can potentially estimate site-specific substitution rates with a much higher accuracy than Rate4Site and tends to report slowly evolved regions rather than individual sites. In addition, GP4Rate can estimate the strength of the spatial correlation of substitution rates from the data. By applying GP4Rate to a set of mammalian B7-1 genes, we found a highly conserved region which coincides with experimental evidence. GP4Rate may be a useful tool for the in silico prediction of functionally important regions in the proteins with known structures.
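As a rough illustration of how a Gaussian process prior can encode spatial correlation between sites, the sketch below builds a covariance matrix from residue 3D coordinates. The squared-exponential kernel, the function name `se_covariance`, the `length_scale` and `variance` parameters, and the toy coordinates are all assumptions made for illustration; the abstract does not state which kernel or parameterization GP4Rate uses.

```python
import math


def se_covariance(coords, length_scale, variance=1.0):
    """Covariance between sites from their 3D coordinates.

    Assumption: a squared-exponential kernel, chosen only for illustration.
    Sites that are close in space get a covariance near `variance`;
    distant sites get a covariance near zero.
    """
    n = len(coords)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between sites i and j.
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            cov[i][j] = variance * math.exp(-d2 / (2.0 * length_scale ** 2))
    return cov


# Hypothetical coordinates (in angstroms) for three residues.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (30.0, 0.0, 0.0)]
cov = se_covariance(coords, length_scale=8.0)
```

In this sketch a larger `length_scale` corresponds to stronger spatial correlation, which is the kind of quantity the abstract says can be estimated from the data rather than fixed in advance.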
The investigation of extremophile plant species growing in their natural environment offers certain advantages, chiefly that plants adapted to severe habitats have a repertoire of stress tolerance genes that are regulated to maximize plant performance under physiologically challenging conditions. Accordingly, transcriptome sequencing offers a powerful approach to address questions concerning the influence of natural habitat on the physiology of an organism. We used RNA sequencing of Eutrema salsugineum, an extremophile relative of Arabidopsis thaliana, to investigate the extent to which genetic variation and controlled versus natural environments contribute to differences between transcript profiles.
DNA sequencing of ancient permafrost samples can be used to reconstruct past plant, animal and bacterial communities. In this study, we assess the small-scale reproducibility of taxonomic composition obtained from sequencing four molecular markers (mitochondrial 12S ribosomal DNA (rDNA), prokaryote 16S rDNA, mitochondrial cox1 and chloroplast trnL intron) from two soil cores sampled 10 cm apart. In addition, sequenced control reactions were used to produce a contaminant library that was used to filter similar sequences from sample libraries. Contaminant filtering resulted in the removal of 1% of reads or 0.3% of operational taxonomic units. We found similar richness, overlap, abundance and taxonomic diversity from the 12S, 16S and trnL markers from each soil core. Jaccard dissimilarity across the two soil cores was highest for metazoan taxa detected by the 12S and cox1 markers. Taxonomic community distances were similar for each marker across the two soil cores when the chi-squared metric was used; however, the 12S and cox1 markers did not cluster well when the Goodall similarity metric was used. A comparison of plant macrofossil vs. read abundance corroborates previous work that suggests eastern Beringia was dominated by grasses and forbs during cold stages of the Pleistocene, a habitat that is restricted to isolated sites in the present-day Yukon.
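For readers unfamiliar with the Jaccard dissimilarity mentioned above, a minimal sketch follows. It compares the sets of taxa detected in two samples; the function name and the taxon lists are hypothetical examples, not data from the study.

```python
def jaccard_dissimilarity(taxa_a, taxa_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of taxon names."""
    a, b = set(taxa_a), set(taxa_b)
    union = a | b
    if not union:
        # Two empty samples are treated as identical (a convention chosen here).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical taxa detected in two soil cores.
core_1 = ["Poa", "Salix", "Carex"]
core_2 = ["Poa", "Carex", "Equus"]
dissimilarity = jaccard_dissimilarity(core_1, core_2)
```

A value of 0 means the two cores share all detected taxa, and a value of 1 means they share none.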
DNA microarrays have become ubiquitous in biological and medical research. The most difficult problem that needs to be solved is the design of DNA oligonucleotides that (i) are highly specific, that is, bind only to the intended target, (ii) cover the highest possible number of genes, that is, all genes that allow such unique regions, and (iii) are computed fast. None of the existing programs meet all these criteria.
Bioinformatics includes a suite of methods that are cheap and approachable, many of which are easily accessible without any specialized bioinformatic training. Yet, despite this, bioinformatic tools are under-utilized by immunologists. Herein, we review a representative set of publicly available, easy-to-use bioinformatic tools using our own research on an under-annotated human gene, SCARA3, as an example. SCARA3 shares an evolutionary relationship with the class A scavenger receptors, but preliminary research showed that it was divergent enough that its function remained unclear. In our quest for more information about this gene - did it share gene sequence similarities to other scavenger receptors? Did it contain conserved protein domains? Where was it expressed in the human body? - we discovered the power and informative potential of publicly available bioinformatic tools designed with the novice in mind, which allowed us to hypothesize on the regulation, structure, and function of this protein. We argue that these tools are largely applicable to many facets of immunology research.
A number of statistical phylogenetic methods have been proposed to identify type-I functional divergence in duplicate genes by detecting heterogeneous substitution rates in phylogenetic trees. A common disadvantage of the existing methods is that autocorrelation of substitution rates along sequences is not modeled. This reduces the power of existing methods to identify regions under functional divergence.
• The internal transcribed spacer (ITS) of the nuclear ribosomal DNA region is a widely used species marker for plants and fungi. Recent metagenomic studies using next-generation sequencing, however, generate only partial ITS sequences. Here we compare the performance of partial and full-length ITS sequences with several classification methods. • We compiled a full-length ITS data set and created short fragments to simulate the read lengths commonly recovered from current next-generation sequencing platforms. We compared recovery, erroneous recovery, and coverage for the following methods: best BLAST hit classification, MEGAN classification, and automated phylogenetic assignment using the Statistical Assignment Program (SAP). • We found that summarizing results with more inclusive taxonomic ranks increased recovery and reduced erroneous recovery. The similarity-based methods BLAST and MEGAN performed consistently across most fragment lengths. Using a phylogeny-based method, SAP runs with queries 400 bp or longer worked best. Overall, BLAST had the highest recovery rates and MEGAN had the lowest erroneous recovery rates. • A high-throughput ITS classification method should be selected, taking into consideration read length, an acceptable tradeoff between maximizing the total number of classifications and minimizing the number of erroneous classifications, and the computational speed of the assignment method.
Technological advances in DNA recovery and sequencing have drastically expanded the scope of genetic analyses of ancient specimens to the extent that full genomic investigations are now feasible and are quickly becoming standard. This trend has important implications for infectious disease research because genomic data from ancient microbes may help to elucidate mechanisms of pathogen evolution and adaptation for emerging and re-emerging infections. Here we report a reconstructed ancient genome of Yersinia pestis at 30-fold average coverage from Black Death victims securely dated to episodes of pestilence-associated mortality in London, England, 1348-1350. Genetic architecture and phylogenetic analysis indicate that the ancient organism is ancestral to most extant strains and sits very close to the ancestral node of all Y. pestis commonly associated with human infection. Temporal estimates suggest that the Black Death of 1347-1351 was the main historical event responsible for the introduction and widespread dissemination of the ancestor to all currently circulating Y. pestis strains pathogenic to humans, and further indicates that contemporary Y. pestis epidemics have their origins in the medieval era. Comparisons against modern genomes reveal no unique derived positions in the medieval organism, indicating that the perceived increased virulence of the disease during the Black Death may not have been due to bacterial phenotype. These findings support the notion that factors other than microbial genetics, such as environment, vector dynamics and host susceptibility, should be at the forefront of epidemiological discussions regarding emerging Y. pestis infections.
The LysR protein PcaQ regulates the expression of genes encoding products relevant to the degradation of the aromatic acid protocatechuate (3,4-dihydroxybenzoate), and we have previously defined a PcaQ DNA-binding site located upstream of the target pcaDCHGB operon in Sinorhizobium meliloti. In this work, we show that PcaQ also regulates the expression of the S. meliloti smb20568-smb20787-smb20786-smb20785-smb20784 gene cluster, which is predicted to encode an ABC transport system. ABC transport systems have not been shown before to transport protocatechuate, and we have designated this gene cluster pcaMNVWX. The transcriptional start site of pcaM was mapped, and the predicted PcaQ DNA-binding site was located at -73 to -58 relative to this site. Results from electrophoretic mobility shift assays with purified PcaQ and from expression assays indicated that PcaQ activates expression of the transport system in the presence of protocatechuate. To investigate this transport system further, we generated a deletion mutant of pcaM (a gene predicted to encode the substrate-binding protein) and introduced a polar insertion mutation into pcaN, a gene that is predicted to encode a permease. These mutants grew poorly on protocatechuate, presumably because they fail to transport protocatechuate. Genome analyses revealed PcaQ-like DNA-binding sites encoded upstream of ABC transport systems in other members of the α-proteobacteria, and thus it appears likely that these systems are involved in the uptake of protocatechuate.
Low-complexity regions (LCRs) within protein sequences are often considered to evolve neutrally even though recent studies reported evidence for selection acting on some of them. Because of their widespread distribution among eukaryote genomes and the potential deleterious effect of expansion/contraction of some of them in humans, low-complexity sequences are of major interest, and numerous studies have attempted to describe their dynamics between genomes as well as the factors correlated to their variation and to assess their selective value. However, due to the scarcity of individual genomes within a species, most of the analyses so far have been performed at the species level with the implicit assumption that the variation both in composition and size within species is too small relative to the between-species divergence to affect the conclusions of the analysis. Here we used the available genomes of 14 Plasmodium falciparum isolates to assess the relationship between low-complexity sequence variation and factors such as nucleotide polymorphism across strains, sequence composition, and protein expression. We report that more than half of the 7,711 low-complexity sequences found within aligned coding sequences are variable in size among strains. Across strains, we observed an increasing density of polymorphic sites toward the LCR boundaries. This observation strongly suggests the joint effects of lowered selective constraints on low-complexity sequences and a mutagenic effect of these simple sequences.
Indirect tests have detected recombination in mitochondrial DNA (mtDNA) from many animal lineages, including mammals. However, it is possible that features of the molecular evolutionary process without recombination could be incorrectly inferred by indirect tests as being due to recombination. We have identified one such example, which we call "patchy-tachy" (PT), where different partitions of sequences evolve at different rates, that leads to an excess of false positives for recombination inferred by indirect tests. To explore this phenomenon, we characterized the false positive rates of six widely used indirect tests for recombination using simulations of general models for mtDNA evolution with PT but without recombination. All tests produced 30-99% false positives for recombination, although the conditions that produced the maximal level of false positives differed between the tests. To evaluate the degree to which conditions that exacerbate false positives are found in published sequence data, we turned to 20 animal mtDNA data sets in which recombination is suggested by indirect tests. Using a model where different regions of the sequences were free to evolve at different rates in different lineages, we demonstrated that PT is prevalent in many data sets in which recombination was previously inferred using indirect tests. Taken together, our results argue that PT without recombination is a viable alternative explanation for detection of widespread recombination in animal mtDNA using indirect tests.
The discovery of antibiotics more than 70 years ago initiated a period of drug innovation and implementation in human and animal health and agriculture. These discoveries were tempered in all cases by the emergence of resistant microbes. This history has been interpreted to mean that antibiotic resistance in pathogenic bacteria is a modern phenomenon; this view is reinforced by the fact that collections of microbes that predate the antibiotic era are highly susceptible to antibiotics. Here we report targeted metagenomic analyses of rigorously authenticated ancient DNA from 30,000-year-old Beringian permafrost sediments and the identification of a highly diverse collection of genes encoding resistance to β-lactam, tetracycline and glycopeptide antibiotics. Structure and function studies on the complete vancomycin resistance element VanA confirmed its similarity to modern variants. These results show conclusively that antibiotic resistance is a natural phenomenon that predates the modern selective pressure of clinical antibiotic use.
For decades proteins were thought to interact in a "lock and key" system, which led to the definition of a paradigm linking stable three-dimensional structure to biological function. As a consequence, any non-structured peptide was considered to be nonfunctional and to evolve neutrally. Surprisingly, the most commonly shared peptides between eukaryotic proteomes are low-complexity sequences that in most conditions do not present a stable three-dimensional structure. However, because these sequences evolve rapidly and because the size variation of a few of them can have deleterious effects, low-complexity sequences have been suggested to be the target of selection. Here we review evidence that supports the idea that these simple sequences should not be considered just "junk" peptides and that selection drives the evolution of many of them.
Bacterial gene content variation during the course of evolution has been widely acknowledged and its pattern has been actively modeled in recent years. Gene truncation or gene pseudogenization also plays an important role in shaping bacterial genome content. Truncated genes could also arise from small-scale lateral gene transfer events. Unfortunately, information on truncated genes has not been considered in any existing mathematical models of gene content variation. In this study, we developed a model to incorporate truncated genes. Maximum-likelihood estimates (MLEs) of the new model reveal fast rates of gene insertions/deletions on recent branches, suggesting a fast turnover of many recently transferred genes. The estimates also suggest that many truncated genes are in the process of being eliminated from the genome. Furthermore, we demonstrate that ignoring truncated genes in the estimation does not lead to a systematic bias but rather has a more complicated effect. Analysis using the new model not only provides more accurate estimates of gene gains/losses (or insertions/deletions), but also reduces any concern of a systematic bias from applying simplified models to bacterial genome evolution. Although not a primary purpose, the model incorporating truncated genes could potentially be used for phylogeny reconstruction using gene family content.
The functional effects of most amino acid replacements accumulated during molecular evolution are unknown, because most are not observed naturally and the possible combinations are too numerous. We created 168 single mutations in wild-type Escherichia coli isopropylmalate dehydrogenase (IMDH) that match the differences found in wild-type Pseudomonas aeruginosa IMDH. 104 mutant enzymes performed similarly to E. coli wild-type IMDH, one was functionally enhanced, and 63 were functionally compromised. The transition from E. coli IMDH, or an ancestral form, to the functional wild-type P. aeruginosa IMDH requires extensive epistasis to ameliorate the combined effects of the deleterious mutations. This result stands in marked contrast with a basic assumption of molecular phylogenetics, that sites in sequences evolve independently of each other. Residues that affect function are scattered haphazardly throughout the IMDH structure. We screened for compensatory mutations at three sites, all of which lie near the active site and all of which are among the least active mutants. No compensatory mutations were found at two sites, indicating that a single site may engage in compound epistatic interactions. One complete and three partial compensatory mutations of the third site are remote and lie in a different domain. This demonstrates that epistatic interactions can occur between distant (>20Å) sites. Phylogenetic analysis shows that incompatible mutations were fixed in different lineages.
Low complexity and homopolymer sequences within coding regions are known to evolve rapidly. While their expansion may be deleterious, there is increasing evidence for a functional role associated with these amino acid sequences. Homopolymer sequences are thought to evolve mostly through replication slippage and, therefore, they may be expected to be longer in regions with relaxed selective constraint. Within the coding sequences of eukaryotes, alternatively spliced exons are known to evolve under relaxed constraints in comparison to those exons that are constitutively spliced because they are not included in all of the mature mRNA of a gene. This relaxed exposure to selection leads to faster rates of evolution for alternatively spliced exons in comparison to constitutively spliced exons. Here, we have tested the effect of splicing on the structure (composition, length) of homopolymer sequences in relation to the splicing pattern in which they are found. We observed a significant relationship between alternative splicing and homopolymer sequences with alternatively spliced genes being enriched in number and length of homopolymer sequences. We also observed lower codon diversity and longer homocodons, suggesting a balance between slippage and point mutations linked to the constraints imposed by selection.
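To make the notion of a homopolymer sequence concrete, the sketch below scans a protein sequence for runs of a single repeated residue and reports their position and length. The function name, the `min_len` threshold of 3, and the example sequence are assumptions for illustration; the abstract does not specify how homopolymers were defined or detected in the study.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (residue, start, length) for each run of identical residues.

    `min_len` is a hypothetical threshold; runs shorter than it are ignored.
    Start positions are 0-based.
    """
    runs = []
    pos = 0
    for residue, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((residue, pos, length))
        pos += length
    return runs


# Hypothetical peptide with a poly-Q run and a poly-S run.
runs = homopolymer_runs("MAQQQQLSSSA")
```

Counting and measuring such runs per exon would then allow a comparison between alternatively and constitutively spliced exons of the kind described above.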
Barcoding is an initiative to define a standard fragment of DNA to be used to assign unknown sequences to existing known species groups that have been pre-identified externally (by a taxonomist). Several methods have been described that attempt to place this assignment into a Bayesian statistical framework. Here we describe an algorithm that makes use of segregating sites and we examine how well these methods perform in the absence of an interspecific barcoding gap. When a barcoding gap exists, that is when the data are clearly delimited, most methods perform well. Here we have used data from the Drosophila genus because this genus includes sibling species and the species relationships within this genus, while complex, are arguably better understood than in any other group. The results show that the Bayesian methods perform well even in the absence of a barcoding gap. The sequences from Drosophila are correctly identified, and only when the degree of incomplete lineage sorting is extreme, in simulations or within the Drosophila species, do they fail in their identifications; even then, the "correct" species has a high posterior probability.
The availability of complete genome sequences for 12 Drosophila species provides an unprecedented resource for large-scale studies of genome evolution. In this study, we looked for correlated shifts in the patterns of genome and proteome evolution within the genus Drosophila. Specifically, we asked if the nucleotide composition of the Drosophila willistoni genome--which is significantly less GC rich than the other 11 sequenced Drosophila genomes--is reflected in an altered pattern of amino acid substitutions in the encoded proteins. Our results show that this is indeed the case: There are large and highly significant asymmetries in the patterns of amino acid substitution between D. willistoni and Drosophila melanogaster, and they are in the direction predicted by the nucleotide biases. The implication of this result, combined with previous studies on long-term proteome evolution, is that substitutional biases at the DNA level can be a major factor in determining both the long-term and the short-term directions of proteome evolution.
Lateral gene transfer (LGT) and gene rearrangement are essential for shaping bacterial genomes during evolution. Separate attention has been focused on understanding the process of lateral gene transfer and the process of gene translocation. However, little is known about how gene translocation affects laterally transferred genes. Here we have examined gene translocations and lateral gene transfers in closely related genome pairs. The results reveal that translocated genes undergo elevated rates of evolution and gene translocation tends to take place preferentially in recently acquired genes. Translocated genes have a high probability to be truncated, suggesting that translocation followed by truncation/deletion might play an important role in the fast turnover of laterally transferred genes. Furthermore, more recently acquired genes have a higher proportion of genes on the leading strand, suggesting a strong strand bias of lateral gene transfer.
As a consequence of alternative splicing, a gene's exons will have different frequencies of inclusion into mature mRNA and different patterns of expression. These differences affect their patterns of evolutionary divergence. Using the recently reannotated genome of Drosophila melanogaster and the genome sequences of four closely related species of the melanogaster subgroup, we investigated the effect of alternative splicing, inclusion level (defined as the number of transcripts an exon is found in), and expression pattern on exon evolution across divergence times ranging from 1 to 12.5 Ma. Genes undergoing alternative splicing have a broader pattern of expression associated with a lower divergence rate in comparison with genes with a single annotated protein isoform. Within genes undergoing alternative splicing, we report a significant effect of inclusion level on exon evolution, as alternatively spliced exons are less conserved than constitutively spliced exons. More generally, there are significant negative correlations between inclusion level and exon evolutionary rates that can be associated with relaxation of selection. A significant effect of expression pattern on evolution rates is also observed. Overall, we found that similar selective factors such as the expression level and the pattern of expression are affecting both gene and exon evolution.
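The inclusion level defined above (the number of transcripts an exon is found in) can be computed directly from a gene's annotated transcripts. The sketch below is a minimal illustration; the function name, the transcript and exon identifiers, and the rule that an exon present in every transcript counts as constitutive are assumptions, not details taken from the study.

```python
from collections import Counter


def inclusion_levels(transcripts):
    """Count, for each exon, how many transcripts of the gene contain it.

    `transcripts` maps a transcript ID to the list of exon IDs it includes.
    """
    counts = Counter()
    for exons in transcripts.values():
        # Use a set so an exon listed twice in one transcript counts once.
        counts.update(set(exons))
    return dict(counts)


# Hypothetical gene with three annotated transcripts.
transcripts = {
    "tx1": ["e1", "e2", "e3"],
    "tx2": ["e1", "e3"],
    "tx3": ["e1", "e3", "e4"],
}
levels = inclusion_levels(transcripts)

# Assumption: exons found in every transcript are treated as constitutive.
constitutive = sorted(e for e, n in levels.items() if n == len(transcripts))
```

Exons with a lower count than the number of transcripts would be the alternatively spliced ones in this simplified view.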
The class A scavenger receptors are a subclass of a diverse family of proteins defined based on their ability to bind modified lipoproteins. The 5 members of this family are strikingly variable in their protein structure and function, raising the question as to whether it is appropriate to group them as a family based on their ligand binding abilities.
Thellungiella salsuginea is an important model plant due to its natural tolerance to abiotic stresses including salt, cold, and water deficits. Microarray and metabolite profiling have shown that Thellungiella undergoes stress-responsive changes in transcript and organic solute abundance when grown under controlled environmental conditions. However, few reports assess the capacity of plants to display stress-responsive traits in natural habitats where concurrent stresses are the norm.
Barcoding is an initiative to define a standard fragment of DNA to be used to assign sequences of unknown origin to existing known species whose sequences are recorded in databases. This is a difficult task when species are closely related and individuals of these species might have more than one origin. Using a previously introduced Bayesian statistical tree-less assignment algorithm based on segregating sites, we examine how it functions in the presence of hidden population subdivision with closely related species using simulations. Not surprisingly, adding samples to the database from a greater proportion of the species range leads to a consistently higher number of accurate results. Without such samples, query sequences that originate from outside of the sampled range are easily misinterpreted as coming from other species. However, we show that even the addition of a single sample from a different subpopulation is sufficient to greatly increase the probability of placement of unknown queries into the correct species group. This study highlights the importance of broad sampling, even with five reference samples per species, in the creation of a reference database.
Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross-validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC ran significantly faster than the other tested methods. All methods performed poorly with the shortest 50-100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys.
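The leave-one-out cross-validation procedure mentioned above can be sketched as follows. This is a minimal illustration only: the classifier is a toy nearest-neighbour rule on shared k-mers, swapped in as a stand-in and not BLAST + MEGAN, NBC, or SAP, and the sequences, genus labels, and function names are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(query, references, k=3):
    """Toy stand-in classifier: genus of the most k-mer-similar reference."""
    q = kmers(query, k)
    best = max(references, key=lambda ref: jaccard(q, kmers(ref[0], k)))
    return best[1]


def leave_one_out_error(references, k=3):
    """Hold out each reference in turn, classify it against the rest,
    and return the fraction of held-out sequences assigned the wrong genus."""
    errors = 0
    for i, (seq, genus) in enumerate(references):
        remaining = references[:i] + references[i + 1:]
        if classify(seq, remaining, k) != genus:
            errors += 1
    return errors / len(references)


# Hypothetical reference set: (sequence, genus) pairs, illustrative only.
refs = [
    ("ACGTACGTACGT", "GenusA"),
    ("ACGTACGTACGA", "GenusA"),
    ("TTGGCCTTGGCC", "GenusB"),
    ("TTGGCCTTGGCA", "GenusB"),
]

error_rate = leave_one_out_error(refs)
```

A genus represented by a single reference can never be recovered when that reference is held out, so such singletons always count as errors under this scheme.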