Insects are the most speciose group of animals, but the phylogenetic relationships of many major lineages remain unresolved. We inferred the phylogeny of insects from 1478 protein-coding genes. Phylogenomic analyses of nucleotide and amino acid sequences, with site-specific nucleotide or domain-specific amino acid substitution models, produced statistically robust and congruent results resolving previously controversial phylogenetic relations hips. We dated the origin of insects to the Early Ordovician [~479 million years ago (Ma)], of insect flight to the Early Devonian (~406 Ma), of major extant lineages to the Mississippian (~345 Ma), and the major diversification of holometabolous insects to the Early Cretaceous. Our phylogenomic study provides a comprehensive reliable scaffold for future comparative analyses of evolutionary innovations among insects.
Modern sequencing technology now allows biologists to collect the entirety of molecular evidence for reconstructing evolutionary trees. We introduce a novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size. Our software introduces a nonblocking parallelization of Metropolis-coupled chains, modifications for efficient analyses of data sets comprising thousands of partitions and memory saving techniques. We report on first experiences with Bayesian inferences at the whole-genome level using the SuperMUC supercomputer and simulated data.
Partitioning involves estimating independent models of molecular evolution for different subsets of sites in a sequence alignment, and has been shown to improve phylogenetic inference. Current methods for estimating best-fit partitioning schemes, however, are only computationally feasible with datasets of fewer than 100 loci. This is a problem because datasets with thousands of loci are increasingly common in phylogenetics.
Despite considerable progress in systematics, a comprehensive scenario of the evolution of phenotypic characters in the mega-diverse Holometabola based on a solid phylogenetic hypothesis was still missing. We addressed this issue by de novo sequencing transcriptome libraries of representatives of all orders of holometabolan insects (13 species in total) and by using a previously published extensive morphological dataset. We tested competing phylogenetic hypotheses by analyzing various specifically designed sets of amino acid sequence data, using maximum likelihood (ML) based tree inference and Four-cluster Likelihood Mapping (FcLM). By maximum parsimony-based mapping of the morphological data on the phylogenetic relationships we traced evolutionary transformations at the phenotypic level and reconstructed the groundplan of Holometabola and of selected subgroups.
Phylogenies inferred from different data matrices often conflict with each other necessitating the development of measures that quantify this incongruence. Here, we introduce novel measures that use information theory to quantify the degree of conflict or incongruence among all nontrivial bipartitions present in a set of trees. The first measure, internode certainty (IC), calculates the degree of certainty for a given internode by considering the frequency of the bipartition defined by the internode (internal branch) in a given set of trees jointly with that of the most prevalent conflicting bipartition in the same tree set. The second measure, IC All (ICA), calculates the degree of certainty for a given internode by considering the frequency of the bipartition defined by the internode in a given set of trees in conjunction with that of all conflicting bipartitions in the same underlying tree set. Finally, the tree certainty (TC) and TC All (TCA) measures are the sum of IC and ICA values across all internodes of a phylogeny, respectively. IC, ICA, TC, and TCA can be calculated from different types of data that contain nontrivial bipartitions, including from bootstrap replicate trees to gene trees or individual characters. Given a set of phylogenetic trees, the IC and ICA values of a given internode reflect its specific degree of incongruence, and the TC and TCA values describe the global degree of incongruence between trees in the set. All four measures are implemented and freely available in version 8.0.0 and subsequent versions of the widely used program RAxML.
New sequence data useful for phylogenetic and evolutionary analyses continues to be added to public databases. The construction of multiple sequence alignments and inference of huge phylogenies comprising large taxonomic groups are expensive tasks, both in terms of man hours and computational resources. Therefore, maintaining comprehensive phylogenies, based on representative and up-to-date molecular sequences, is challenging. PUmPER is a framework that can perpetually construct multi-gene alignments (with PHLAWD) and phylogenetic trees (with ExaML or RAxML-Light) for a given NCBI taxonomic group. When sufficient numbers of new gene sequences for the selected taxonomic group have accumulated in GenBank, PUmPER automatically extends the alignment and infers extended phylogenetic trees by using previously inferred smaller trees as starting topologies. Using our framework, large phylogenetic trees can be perpetually updated without human intervention. Importantly, resulting phylogenies are not statistically significantly worse than trees inferred from scratch.
Nucleotide positions in the hypervariable V4 and V9 regions of the small subunit (SSU)-rDNA locus are normally difficult to align and are usually removed before standard phylogenetic analyses. Yet, with next-generation sequencing data, amplicons of these regions are all that are available to answer ecological and evolutionary questions that rely on phylogenetic inferences. With ciliates, we asked how inclusion of the V4 or V9 regions, regardless of alignment quality, affects tree topologies using distinct phylogenetic methods (including PairDist that is introduced here). Results show that the best approach is to place V4 amplicons into an alignment of full-length Sanger SSU-rDNA sequences and to infer the phylogenetic tree with RAxML. A sliding window algorithm as implemented in RAxML shows, though, that not all nucleotide positions in the V4 region are better than V9 at inferring the ciliate tree. With this approach and an ancestral-state reconstruction, we use V4 amplicons from European nearshore sampling sites to infer that rather than being primarily terrestrial and freshwater, colpodean ciliates may have repeatedly transitioned from terrestrial/freshwater to marine environments.
Phylogenies are increasingly used in all fields of medical and biological research. Moreover, because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analyses of large datasets under maximum likelihood. Since the last RAxML paper in 2006, it has been continuously maintained and extended to accommodate the increasingly growing input datasets and to serve the needs of the user community.
The detection of positive selection is widely used to study gene and genome evolution, but its application remains limited by the high computational cost of existing implementations. We present a series of computational optimizations for more efficient estimation of the likelihood function on large-scale phylogenetic problems. We illustrate our approach using the branch-site model of codon evolution.
The Brassicaceae family (mustards or crucifers) includes Arabidopsis thaliana as one of the most important model species in plant biology and a number of important crop plants such as the various Brassica species (e.g. cabbage, canola and mustard). Moreover, the family comprises an increasing number of species that serve as study systems in many fields of plant science and evolutionary research. However, the systematics and taxonomy of the family are very complex and access to scientifically valuable and reliable information linked to species and genus names and its interpretation are often difficult. BrassiBase is a continuously developing and growing knowledge database (http://brassibase.cos.uni-heidelberg.de) that aims at providing direct access to many different types of information ranging from taxonomy and systematics to phylo- and cytogenetics. Providing critically revised key information, the database intends to optimize comparative evolutionary research in this family and supports the introduction of the Brassicaceae as the model family for evolutionary biology and plant sciences. Some features that should help to accomplish these goals within a comprehensive taxonomic framework have now been implemented in the new version 1.1.9. A Phylogenetic Placement Tool should help to identify critical accessions and germplasm and provide a first visualization of phylogenetic relationships. The Cytogenetics Tool provides in-depth information on genome sizes, chromosome numbers and polyploidy, and sets this information into a Brassicaceae-wide context.
The Illumina paired-end sequencing technology can generate reads from both ends of target DNA fragments, which can subsequently be merged to increase the overall read length. There already exist tools for merging these paired-end reads when the target fragments are equally long. However, when fragment lengths vary and, in particular, when either the fragment size is shorter than a single-end read, or longer than twice the size of a single-end read, most state-of-the-art mergers fail to generate reliable results. Therefore, a robust tool is needed to merge paired-end reads that exhibit varying overlap lengths because of varying target fragment lengths.
Phylogenetic relationships of the primarily wingless insects are still considered unresolved. Even the most comprehensive phylogenomic studies that addressed this question did not yield congruent results. To get a grip on these problems, we here analyzed the sources of incongruence in these phylogenomic studies by using an extended transcriptome data set. Our analyses showed that unevenly distributed missing data can be severely misleading by inflating node support despite the absence of phylogenetic signal. In consequence, only decisive data sets should be used which exclusively comprise data blocks containing all taxa whose relationships are addressed. Additionally, we used Four-cluster Likelihood Mapping (FcLM) to measure the degree of congruence among genes of a data set, as a measure of support alternative to bootstrap. FcLM showed incongruent signal among genes, which in our case is correlated neither with functional class assignment of these genes nor with model misspecification due to unpartitioned analyses. The herein analyzed data set is the currently largest data set covering primarily wingless insects, but failed to elucidate their interordinal phylogenetic relationships. Although this is unsatisfying from a phylogenetic perspective, we try to show that the analyses of structure and signal within phylogenomic data can protect us from biased phylogenetic inferences due to analytical artifacts.
Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets.
To quantify known and unknown microorganisms at species-level resolution using shotgun sequencing data, we developed a method that establishes metagenomic operational taxonomic units (mOTUs) based on single-copy phylogenetic marker genes. Applied to 252 human fecal samples, the method revealed that on average 43% of the species abundance and 58% of the richness cannot be captured by current reference genome-based methods. An implementation of the method is available at http://www.bork.embl.de/software/mOTU/.
The advent of modern DNA sequencing technology is the driving force in obtaining complete intra-specific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows to deploy SweeD on cluster systems with queue execution time restrictions, as well as to resume long-running analyses after processor failures. In addition, the user can specify various demographic models via the command-line to calculate their theoretically expected site frequency spectra. Therefore, (in contrast to SweepFinder) the neutral site frequencies can optionally be directly calculated from a given demographic model. We show that an increase of sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the 1000 human Genomes project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.
In population genetics, simulation is a fundamental tool for analyzing how basic evolutionary forces such as natural selection, recombination, and mutation shape the genetic landscape of a population. Forward simulation represents the most powerful, but, at the same time, most compute-intensive approach for simulating the genetic material of a population.
The third Heidelberg Unseminars in Bioinformatics (HUB) was held on 18th October 2012, at Heidelberg University, Germany. HUB brought together around 40 bioinformaticians from academia and industry to discuss the Biggest Challenges in Bioinformatics in a World Café style event.
We report a daily-updated sequenced/species Tree Of Life (sTOL) as a reference for the increasing number of cellular organisms with their genomes sequenced. The sTOL builds on a likelihood-based weight calibration algorithm to consolidate NCBI taxonomy information in concert with unbiased sampling of molecular characters from whole genomes of all sequenced organisms. Via quantifying the extent of agreement between taxonomic and molecular data, we observe there are many potential improvements that can be made to the status quo classification, particularly in the Fungi kingdom; we also see that the current state of many animal genomes is rather poor. To augment the use of sTOL in providing evolutionary contexts, we integrate an ontology infrastructure and demonstrate its utility for evolutionary understanding on: nuclear receptors, stem cells and eukaryotic genomes. The sTOL (http://supfam.org/SUPERFAMILY/sTOL) provides a binary tree of (sequenced) life, and contributes to an analytical platform linking genome evolution, function and phenotype.
Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATés performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.
Remipedes are a small and enigmatic group of crustaceans, first described only 30 years ago. Analyses of both morphological and molecular data have recently suggested a close relationship between Remipedia and Hexapoda. If true, the remipedes occupy an important position in pancrustacean evolution and may be pivotal for understanding the evolutionary history of crustaceans and hexapods. However, it is important to test this hypothesis using new data and new types of analytical approaches. Here, we assembled a phylogenomic data set of 131 taxa, incorporating newly generated 454 expressed sequence tag (EST) data from six species of crustaceans, representing five lineages (Remipedia, Laevicaudata, Spinicaudata, Ostracoda, and Malacostraca). This data set includes all crustacean species for which EST data are available (46 species), and our largest alignment encompasses 866,479 amino acid positions and 1,886 genes. A series of phylogenomic analyses was performed to evaluate pancrustacean relationships. We significantly improved the quality of our data for predicting putative orthologous genes and for generating data subsets by matrix reduction procedures, thereby improving the signal to noise ratio in the data. Eight different data sets were constructed, representing various combinations of orthologous genes, data subsets, and taxa. Our results demonstrate that the different ways to compile an initial data set of core orthologs and the selection of data subsets by matrix reduction can have marked effects on the reconstructed phylogenetic trees. Nonetheless, all eight data sets strongly support Pancrustacea with Remipedia as the sister group to Hexapoda. This is the first time that a sister group relationship of Remipedia and Hexapoda has been inferred using a comprehensive phylogenomic data set that is based on EST data. We also show that selecting data subsets with increased overall signal can help to identify and prevent artifacts in phylogenetic analyses.
A novel result of the current research is the development and implementation of a unique functional phylogenomic approach that explores the genomic origins of seed plant diversification. We first use 22,833 sets of orthologs from the nuclear genomes of 101 genera across land plants to reconstruct their phylogenetic relationships. One of the more salient results is the resolution of some enigmatic relationships in seed plant phylogeny, such as the placement of Gnetales as sister to the rest of the gymnosperms. In using this novel phylogenomic approach, we were also able to identify overrepresented functional gene ontology categories in genes that provide positive branch support for major nodes prompting new hypotheses for genes associated with the diversification of angiosperms. For example, RNA interference (RNAi) has played a significant role in the divergence of monocots from other angiosperms, which has experimental support in Arabidopsis and rice. This analysis also implied that the second largest subunit of RNA polymerase IV and V (NRPD2) played a prominent role in the divergence of gymnosperms. This hypothesis is supported by the lack of 24nt siRNA in conifers, the maternal control of small RNA in the seeds of flowering plants, and the emergence of double fertilization in angiosperms. Our approach takes advantage of genomic data to define orthologs, reconstruct relationships, and narrow down candidate genes involved in plant evolution within a phylogenomic view of species diversification.
Likelihood-based methods for placing short read sequences from metagenomic samples into reference phylogenies have been recently introduced. At present, it is unclear how to align those reads with respect to the reference alignment that was deployed to infer the reference phylogeny. Moreover, the adaptability of such alignment methods with respect to the underlying reference alignment strategies/philosophies has not been explored. It has also not been assessed if the reference phylogeny can be deployed in conjunction with the reference alignment to improve alignment accuracy in this context.
The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that aims to create an innovative, comprehensive, and foundational cyberinfrastructure in support of plant biology research (PSCIC, 2006). iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists on the use of cyberinfrastructure in research and education. Meeting humanitys projected demands for agricultural and forest products and the expectation that natural ecosystems be managed sustainably will require synergies from the application of information technologies. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input, and leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences. iPlant is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs. iPlant is sponsoring community-driven workshops addressing specific scientific questions via analysis tool integration and hypothesis testing. These workshops teach researchers how to add bioinformatics tools and/or datasets into the iPlant cyberinfrastructure enabling plant scientists to perform complex analyses on large datasets without the need to master the command-line or high-performance computational services.
The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are: numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood.
We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.
How will the emerging possibility of inferring ultra-large phylogenies influence our ability to identify shifts in diversification rate? For several large angiosperm clades (Angiospermae, Monocotyledonae, Orchidaceae, Poaceae, Eudicotyledonae, Fabaceae, and Asteraceae), we explore this issue by contrasting two approaches: (1) using small backbone trees with an inferred number of extant species assigned to each terminal clade and (2) using a mega-phylogeny of 55473 seed plant species represented in GenBank. The mega-phylogeny approach assumes that the sample of species in GenBank is at least roughly proportional to the actual species diversity of different lineages, as appears to be the case for many major angiosperm lineages. Using both approaches, we found that diversification rate shifts are not directly associated with the major named clades examined here, with the sole exception of Fabaceae in the GenBank mega-phylogeny. These agreements are encouraging and may support a generality about angiosperm evolution: major shifts in diversification may not be directly associated with major named clades, but rather with clades that are nested not far within these groups. An alternative explanation is that there have been increased extinction rates in early-diverging lineages within these clades. Based on our mega-phylogeny, the shifts in diversification appear to be distributed quite evenly throughout the angiosperms. Mega-phylogenetic studies of diversification hold great promise for revealing new patterns, but we will need to focus more attention on properly specifying null expectation.
Many of the steps in phylogenetic reconstruction can be confounded by “rogue” taxa—taxa that cannot be placed with assurance anywhere within the tree, indeed, whose location within the tree varies with almost any choice of algorithm or parameters. Phylogenetic consensus methods, in particular, are known to suffer from this problem. In this paper, we provide a novel framework to define and identify rogue taxa. In this framework, we formulate a bicriterion optimization problem, the relative information criterion, that models the net increase in useful information present in the consensus tree when certain taxa are removed from the input data. We also provide an effective greedy heuristic to identify a subset of rogue taxa and use this heuristic in a series of experiments, with both pathological examples from the literature and a collection of large biological data sets. As the presence of rogue taxa in a set of bootstrap replicates can lead to deceivingly poor support values, we propose a procedure to recompute support values in light of the rogue taxa identified by our algorithm; applying this procedure to our biological data sets caused a large number of edges to move from “unsupported” to “supported” status, indicating that many existing phylogenies should be recomputed and reevaluated to reduce any inaccuracies introduced by rogue taxa. We also discuss the implementation issues encountered while integrating our algorithm into RAxML v7.2.7, particularly those dealing with scaling up the analyses. This integration enables practitioners to benefit from our algorithm in the analysis of very large data sets (up to 2,500 taxa and 10,000 trees, although we present the results of even larger analyses).
The associations between pathogens and their hosts are complex and can result from a variety of evolutionary processes including codivergence, lateral transfer, or duplication. Papillomaviruses (PVs) are double-stranded DNA viruses ubiquitously present in mammals and are a suitable target for rigorous statistical tests of potential virus-host codivergence. We analyze the evolutionary dynamics of PV diversification by comparing robust phylogenies of PVs and their respective hosts using different statistical approaches to assess topological and branch-length congruence. Mammalian PVs segregated into four diverse major clades that overlapped to varying degrees in terms of their mammalian host lineages. The hypothesis that PVs and hosts evolved independently was globally rejected (P = 0.0001), although only 90 of 207 virus-host associations (43%) were significant in individual tests. Virus-host codivergence accounted roughly for one-third of the evolutionary events required to reconcile PV-host evolutionary histories. When virus-host associations were analyzed locally within each of the four viral clades, numerous independent topological congruencies were identified that were incompatible with respect to the global trees. These results support an evolutionary scenario in which early PV radiation was followed by independent codivergence between viruses within each of the major clades and their hosts. Moreover, heterogeneous groups of closely related PVs infecting non-related hosts suggest several interspecies transmission events. Our results argue thus for the importance of alternative events in PV evolution, in contrast to the prevailing opinion that these viruses show a high degree of host specificity and codivergence.
Verification in phylogenetics represents an extremely difficult subject. Phylogenetic analysis deals with the reconstruction of evolutionary histories of species, and as long as mankind is not able to travel in time, it will not be possible to verify deep evolutionary histories reconstructed with modern computational methods. Here, we focus on two more tangible issues that are related to verification in phylogenetics (i) the inference of support values on trees that provide some notion about the correctness of the tree within narrow limits and, more importantly; (ii) issues pertaining to program verification, especially with respect to codes that rely heavily on floating-point arithmetics. Program verification represents a largely underestimated problem in computational science that can have fatal effects on scientific conclusions.
We present a novel method to encode ambiguously aligned regions in fixed multiple sequence alignments by Pairwise Identity and Cost Scores Ordination (PICS-Ord). The method works via ordination of sequence identity or cost scores matrices by means of Principal Coordinates Analysis (PCoA). After identification of ambiguous regions, the method computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by means of PCoA, and encodes the principal coordinates as ordered integers. Three biological and 100 simulated datasets were used to assess the performance of the new method.
The evolutionary relationships among known Chlamydophila abortus variant strains including the LLG and POS, previously identified as being highly distinct, were investigated based on rRNA secondary structure information. PCR-amplified overlapping fragments of the 16S, 16S-23S intergenic spacer (IS), and 23S domain I rRNAs were subjected to cloning and sequencing. Secondary structure analysis revealed the presence of transitional single nucleotide variations (SNVs), two of which occurred in loops, while seven in stem regions that did not result in compensatory substitutions. Notably, only two SNVs, in 16S and 23S, occurred within evolutionary variable regions. Maximum likelihood and Bayesian phylogeny reconstructions revealed that C. abortus strains could be regarded as representing two distinct lineages, one including the "classical" C. abortus strains and the other the "LLG/POS variant", with the type strain B577(T) possibly representing an intermediate of the two lineages. The two C. abortus lineages shared three unique (apomorphic) characters in the 23S domain I and 16S-23S IS, but interestingly lacked synapomorphies in the 16S rRNA. The two lineages could be distinguished on the basis of eight positions; four of these comprised residues that appeared to be signature or unique for the "classical" lineage, while three were unique for the "LLG/POS variant". The U277 (E. coli numbering) signature character, corresponding to a highly conserved residue of the 16S molecule, and the unique G681 residue, conserved in a functionally strategic region also of 16S, are the most pronounced attributes (autapomorphies) of the "classical" and the "LLG/POS variant" lineages, respectively. Both lineages were found to be descendants of a common ancestor with the Prk/Daruma C. psittaci variant. Compared with the "classical", the "LLG/POS variant" lineage has retained more ancestral features. The current rRNA secondary structure-based analysis and phylogenetic inference reveal new insights into how these two C. abortus lineages have differentiated during their evolution.
The current molecular data explosion poses new challenges for large-scale phylogenomic analyses that can comprise hundreds or even thousands of genes. A property that characterizes phylogenomic datasets is that they tend to be gappy, i.e. can contain taxa with (many and disparate) missing genes. In current phylogenomic analyses, this type of alignment gappyness that is induced by missing data frequently exceeds 90%. We present and implement a generally applicable mechanism that allows for reducing memory footprints of likelihood-based [maximum likelihood (ML) or Bayesian] phylogenomic analyses proportional to the amount of missing data in the alignment. We also introduce a set of algorithmic rules to efficiently conduct tree searches via subtree pruning and re-grafting moves using this mechanism.
The constant accumulation of sequence data poses new computational and methodological challenges for phylogenetic inference, since multiple sequence alignments grow both in the horizontal (number of base pairs, phylogenomic alignments) as well as vertical (number of taxa) dimension. Put aside the ongoing controversial discussion about appropriate models, partitioning schemes, and assembly methods for phylogenomic alignments, coupled with the high computational cost to infer these, for many organismic groups, a sufficient number of taxa is often exclusively available from one or just a few genes (e.g., rbcL, matK, rDNA). In this paper we address scalability of Maximum-Likelihood-based phylogeny reconstruction with respect to the number of taxa by example of several large nested single-gene rbcL alignments comprising 400 up to 3,491 taxa. In order to test the effect of taxon sampling, we employ an appropriately adapted taxon jackknifing approach. In contrast to standard jackknifing, this taxon subsampling procedure is not conducted entirely at random, but based on drawing subsamples from empirical taxon-groups which can either be user-defined or determined by using taxonomic information from databases. Our results indicate that, despite an unfavorable number of sequences to number of base pairs ratio, i.e., many relatively short sequences, Maximum Likelihood tree searches and bootstrap analyses scale well on single-gene rbcL alignments with a dense taxon sampling up to several thousand sequences. Moreover, the newly implemented taxon subsampling procedure can be beneficial for inferring higher level relationships and interpreting bootstrap support from comprehensive analysis.
Shotgun sequencing of environmental DNA is an essential technique for characterizing uncultivated microbes in situ. However, the taxonomic and functional assignment of the obtained sequence fragments remains a pressing problem.
Phylogenetic bootstrapping (BS) is a standard technique for inferring confidence values on phylogenetic trees that is based on reconstructing many trees from minor variations of the input data, trees called replicates. BS is used with all phylogenetic reconstruction approaches, but we focus here on one of the most popular, maximum likelihood (ML). Because ML inference is so computationally demanding, it has proved too expensive to date to assess the impact of the number of replicates used in BS on the relative accuracy of the support values. For the same reason, a rather small number (typically 100) of BS replicates are computed in real-world studies. Stamatakis et al. recently introduced a BS algorithm that is 1 to 2 orders of magnitude faster than previous techniques, while yielding qualitatively comparable support values, making an experimental study possible. In this article, we propose stopping criteria--that is, thresholds computed at runtime to determine when enough replicates have been generated--and we report on the first large-scale experimental study to assess the effect of the number of replicates on the quality of support values, including the performance of our proposed criteria. We run our tests on 17 diverse real-world DNA--single-gene as well as multi-gene--datasets, which include 125-2,554 taxa. We find that our stopping criteria typically stop computations after 100-500 replicates (although the most conservative criterion may continue for several thousand replicates) while producing support values that correlate at better than 99.5% with the reference values on the best ML trees. Significantly, we also find that the stopping criteria can recommend very different numbers of replicates for different datasets of comparable sizes. Our results are thus twofold: (i) they give the first experimental assessment of the effect of the number of BS replicates on the quality of support values returned through BS, and (ii) they validate our proposals for stopping criteria. Practitioners will no longer have to enter a guess nor worry about the quality of support values; moreover, with most counts of replicates in the 100-500 range, robust BS under ML inference becomes computationally practical for most datasets. The complete test suite is available at http://lcbb.epfl.ch/BS.tar.bz2, and BS with our stopping criteria is included in the latest release of RAxML v7.2.5, available at http://wwwkramer.in.tum.de/exelixis/software.html.
A clear picture of animal relationships is a prerequisite to understand how the morphological and ecological diversity of animals evolved over time. Among others, the placement of the acoelomorph flatworms, Acoela and Nemertodermatida, has fundamental implications for the origin and evolution of various animal organ systems. Their position, however, has been inconsistent in phylogenetic studies using one or several genes. Furthermore, Acoela has been among the least stable taxa in recent animal phylogenomic analyses, which simultaneously examine many genes from many species, while Nemertodermatida has not been sampled in any phylogenomic study. New sequence data are presented here from organisms targeted for their instability or lack of representation in prior analyses, and are analysed in combination with other publicly available data. We also designed new automated explicit methods for identifying and selecting common genes across different species, and developed highly optimized supercomputing tools to reconstruct relationships from gene sequences. The results of the work corroborate several recently established findings about animal relationships and provide new support for the placement of other groups. These new data and methods strongly uphold previous suggestions that Acoelomorpha is sister clade to all other bilaterian animals, find diminishing evidence for the placement of the enigmatic Xenoturbella within Deuterostomia, and place Cycliophora with Entoprocta and Ectoprocta. The work highlights the implications that these arrangements have for metazoan evolution and permits a clearer picture of ancestral morphologies and life histories in the deep past.
The presence of rogue taxa (rogues) in a set of trees can frequently have a negative impact on the results of a bootstrap analysis (e.g., the overall support in consensus trees). We introduce an efficient graph-based algorithm for rogue taxon identification as well as an interactive webservice implementing this algorithm. Compared with our previous method, the new algorithm is up to 4 orders of magnitude faster, while returning qualitatively identical results. Because of this significant improvement in scalability, the new algorithm can now identify substantially more complex and compute-intensive rogue taxon constellations. On a large and diverse collection of real-world data sets, we show that our method yields better supported reduced/pruned consensus trees than any competing rogue taxon identification method. Using the parallel version of our open-source code, we successfully identified rogue taxa in a set of 100 trees with 116 334 taxa each. For simulated data sets, we show that when removing/pruning rogue taxa with our method from a tree set, we consistently obtain bootstrap consensus trees as well as maximum-likelihood trees that are topologically closer to the respective true trees.
Aligning short DNA reads to a reference sequence alignment is a prerequisite for detecting their biological origin and analyzing them in a phylogenetic context. With the PaPaRa tool we introduced a dedicated dynamic programming algorithm for simultaneously aligning short reads to reference alignments and corresponding evolutionary reference trees. The algorithm aligns short reads to phylogenetic profiles that correspond to the branches of such a reference tree. The algorithm needs to perform an immense number of pairwise alignments. Therefore, we explore vector intrinsics and GPUs to accelerate the PaPaRa alignment kernel.
In the age of whole-genome population genetics, so-called genomic scan studies often conclude with a long list of putatively selected loci. These lists are then further scrutinized to annotate these regions by gene function, corresponding biological processes, expression levels, or gene networks. Such annotations are often used to assess and/or verify the validity of the genome scan and the statistical methods that have been used to perform the analyses. Furthermore, these results are frequently considered to validate "true-positives" if the identified regions make biological sense a posteriori. Here, we show that this approach can be potentially misleading. By simulating neutral evolutionary histories, we demonstrate that it is possible not only to obtain an extremely high false-positive rate but also to make biological sense out of the false-positives and construct a sensible biological narrative. Results are compared with a recent polymorphism data set from Drosophila melanogaster.
We have developed a unified format for phylogenetic placements, that is, mappings of environmental sequence data (e.g., short reads) into a phylogenetic tree. We are motivated to do so by the growing number of tools for computing and post-processing phylogenetic placements, and the lack of an established standard for storing them. The format is lightweight, versatile, extensible, and is based on the JSON format, which can be parsed by most modern programming languages. Our format is already implemented in several tools for computing and post-processing parsimony- and likelihood-based phylogenetic placements and has worked well in practice. We believe that establishing a standard format for analyzing read placements at this early stage will lead to a more efficient development of powerful and portable post-analysis tools for the growing applications of phylogenetic placement.
Related JoVE Video
Journal of Visualized Experiments
What is Visualize?
JoVE Visualize is a tool created to match the last 5 years of PubMed publications to methods in JoVE's video library.
How does it work?
We use abstracts found on PubMed and match them to JoVE videos to create a list of 10 to 30 related methods videos.
Video X seems to be unrelated to Abstract Y...
In developing our video relationships, we compare around 5 million PubMed articles to our library of over 4,500 methods videos. In some cases the language used in the PubMed abstracts makes matching that content to a JoVE video difficult. In other cases, there happens not to be any content in our video library that is relevant to the topic of a given abstract. In these cases, our algorithms are trying their best to display videos with relevant content, which can sometimes result in matched videos with only a slight relation.