Insects are the most speciose group of animals, but the phylogenetic relationships of many major lineages remain unresolved. We inferred the phylogeny of insects from 1478 protein-coding genes. Phylogenomic analyses of nucleotide and amino acid sequences, with site-specific nucleotide or domain-specific amino acid substitution models, produced statistically robust and congruent results resolving previously controversial phylogenetic relationships. We dated the origin of insects to the Early Ordovician [~479 million years ago (Ma)], of insect flight to the Early Devonian (~406 Ma), of major extant lineages to the Mississippian (~345 Ma), and the major diversification of holometabolous insects to the Early Cretaceous. Our phylogenomic study provides a comprehensive and reliable scaffold for future comparative analyses of evolutionary innovations among insects.
Myriapods had been considered closely allied to hexapods (insects and relatives). However, analyses of molecular sequence data have consistently placed Myriapoda either as a sister group of Pancrustacea, comprising crustaceans and hexapods, and thereby supporting the monophyly of Mandibulata, or retrieved Myriapoda as a sister group of Chelicerata (spiders, ticks, mites and allies). In addition, the relationships among the four myriapod groups (Pauropoda, Symphyla, Diplopoda, Chilopoda) are unclear. To resolve the phylogeny of myriapods and their relationship to other main arthropod groups, we collected transcriptome data from the symphylan Symphylella vulgaris, the centipedes Lithobius forficatus and Scolopendra dehaani, and the millipedes Polyxenus lagurus, Glomeris pustulata and Polydesmus angustus by 454 sequencing. We concatenated a multiple sequence alignment that contained 1550 orthologous single copy genes (1,109,847 amino acid positions) from 55 euarthropod and 14 outgroup taxa. The final selected alignment included 181 genes and 37,425 amino acid positions from 55 taxa, with eight myriapods and 33 other euarthropods. Bayesian analyses robustly recovered monophyletic Mandibulata, Pancrustacea and Myriapoda. Most analyses support a sister group relationship of Symphyla with respect to a clade comprising Chilopoda and Diplopoda. Inclusion of additional sequence data from nine myriapod species resulted in an alignment with poor data density, but broader taxon average. With this dataset we inferred Diplopoda+Pauropoda as closest relatives (i.e., Dignatha) and recovered monophyletic Helminthomorpha. Molecular clock calculations suggest an early Cambrian emergence of Myriapoda ~513 million years ago and a late Cambrian divergence of myriapod classes. This implies a marine origin of the myriapods and independent terrestrialization events during myriapod evolution.
Despite considerable progress in systematics, a comprehensive scenario of the evolution of phenotypic characters in the mega-diverse Holometabola based on a solid phylogenetic hypothesis was still missing. We addressed this issue by de novo sequencing transcriptome libraries of representatives of all orders of holometabolan insects (13 species in total) and by using a previously published extensive morphological dataset. We tested competing phylogenetic hypotheses by analyzing various specifically designed sets of amino acid sequence data, using maximum likelihood (ML) based tree inference and Four-cluster Likelihood Mapping (FcLM). By maximum parsimony-based mapping of the morphological data on the phylogenetic relationships we traced evolutionary transformations at the phenotypic level and reconstructed the groundplan of Holometabola and of selected subgroups.
Phylogenetic relationships of the primarily wingless insects are still considered unresolved. Even the most comprehensive phylogenomic studies that addressed this question did not yield congruent results. To get a grip on these problems, we here analyzed the sources of incongruence in these phylogenomic studies by using an extended transcriptome data set. Our analyses showed that unevenly distributed missing data can be severely misleading by inflating node support despite the absence of phylogenetic signal. In consequence, only decisive data sets should be used which exclusively comprise data blocks containing all taxa whose relationships are addressed. Additionally, we used Four-cluster Likelihood Mapping (FcLM) to measure the degree of congruence among genes of a data set, as a measure of support alternative to bootstrap. FcLM showed incongruent signal among genes, which in our case is correlated neither with functional class assignment of these genes nor with model misspecification due to unpartitioned analyses. The herein analyzed data set is the currently largest data set covering primarily wingless insects, but failed to elucidate their interordinal phylogenetic relationships. Although this is unsatisfying from a phylogenetic perspective, we try to show that the analyses of structure and signal within phylogenomic data can protect us from biased phylogenetic inferences due to analytical artifacts.
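The decisiveness criterion described above can be illustrated with a minimal sketch. It assumes that "containing all taxa whose relationships are addressed" can be read as "at least one sequenced representative of every group of interest"; the function names and the example data are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Set


def is_decisive(gene_taxa: Set[str], groups: Dict[str, Set[str]]) -> bool:
    """Return True if the gene has at least one taxon from every group.

    Assumption: one representative per group is enough for the gene to be
    informative about the relationships among the groups.
    """
    return all(gene_taxa & members for members in groups.values())


def decisive_genes(presence: Dict[str, Set[str]],
                   groups: Dict[str, Set[str]]) -> List[str]:
    """Keep only genes (data blocks) that cover every group of interest.

    presence maps a gene name to the set of taxa sequenced for that gene.
    """
    return sorted(g for g, taxa in presence.items() if is_decisive(taxa, groups))


if __name__ == "__main__":
    # Hypothetical groups and gene coverage, for illustration only.
    groups = {
        "Protura": {"protura_sp"},
        "Collembola": {"collembola_sp1", "collembola_sp2"},
        "Diplura": {"diplura_sp"},
    }
    presence = {
        "gene_A": {"protura_sp", "collembola_sp2", "diplura_sp"},
        "gene_B": {"collembola_sp1", "diplura_sp"},  # lacks Protura
    }
    print(decisive_genes(presence, groups))
```

Genes that miss an entire group are dropped, so apparent support for the relationships among the groups cannot arise from blocks that carry no signal about them.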
Character matrices with extensive missing data are frequently used in phylogenomics, with potentially detrimental effects on the accuracy and robustness of tree inference. Therefore, many investigators select taxa and genes with high data coverage. A drawback of these selections is their exclusive reliance on data coverage without consideration of the actual signal in the data, so they might not deliver optimal data matrices in terms of potential phylogenetic signal. To circumvent this problem, we have developed a heuristic, implemented in software called mare, which (1) assesses the information content of genes in supermatrices using a measure of potential signal combined with data coverage and (2) reduces supermatrices with a simple hill climbing procedure to submatrices with high total information content. We conducted simulation studies using matrices of 50 taxa x 50 genes with heterogeneous phylogenetic signal among genes and data coverage between 10 and 30%.
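A hill climbing reduction of this kind might look roughly as follows. This is a sketch under stated assumptions, not the procedure implemented in mare: the objective function (mean information content raised to a power `alpha`, multiplied by the number of retained cells) and the parameters `alpha`, `min_taxa` and `min_genes` are hypothetical choices made only for illustration.

```python
from typing import List, Sequence, Tuple


def score(info: Sequence[Sequence[float]], rows: List[int], cols: List[int],
          alpha: float = 3.0) -> float:
    """Hypothetical objective: (mean information) ** alpha * number of cells.

    info[t][g] is the information content of gene g for taxon t,
    with 0.0 standing for missing data.
    """
    if not rows or not cols:
        return 0.0
    n_cells = len(rows) * len(cols)
    mean = sum(info[r][c] for r in rows for c in cols) / n_cells
    return (mean ** alpha) * n_cells


def reduce_matrix(info: Sequence[Sequence[float]], alpha: float = 3.0,
                  min_taxa: int = 4, min_genes: int = 1
                  ) -> Tuple[List[int], List[int]]:
    """Drop one taxon or gene per step while the objective strictly improves."""
    rows = list(range(len(info)))
    cols = list(range(len(info[0])))
    best = score(info, rows, cols, alpha)
    while True:
        candidates = []
        if len(rows) > min_taxa:
            candidates += [([x for x in rows if x != r], cols) for r in rows]
        if len(cols) > min_genes:
            candidates += [(rows, [x for x in cols if x != c]) for c in cols]
        if not candidates:
            break
        new_rows, new_cols = max(
            candidates, key=lambda rc: score(info, rc[0], rc[1], alpha))
        new_score = score(info, new_rows, new_cols, alpha)
        if new_score <= best:
            break  # local optimum reached
        rows, cols, best = new_rows, new_cols, new_score
    return rows, cols


if __name__ == "__main__":
    # Hypothetical 5 taxa x 3 genes matrix; the third gene is almost empty.
    info = [
        [0.9, 0.8, 0.0],
        [0.9, 0.8, 0.0],
        [0.9, 0.8, 0.0],
        [0.9, 0.8, 0.1],
        [0.9, 0.8, 0.0],
    ]
    print(reduce_matrix(info))
```

Raising the mean to a power above 1 rewards dense, informative submatrices, while the cell count keeps the search from collapsing to a single cell. Like any hill climbing procedure, it can stop at a local optimum.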
About 2800 mitochondrial genomes of Metazoa are present in NCBI RefSeq today, two thirds belonging to vertebrates. Metazoan phylogeny was recently challenged by large scale EST approaches (phylogenomics), stabilizing classical nodes while simultaneously supporting new sister group hypotheses. The use of mitochondrial data in deep phylogeny analyses was often criticized because of high substitution rates on nucleotides, large differences in amino acid substitution rate between taxa, and biases in nucleotide frequencies. Nevertheless, mitochondrial genome data might still be promising as it allows for a larger taxon sampling, while presenting a smaller amount of sequence information. We present the most comprehensive analysis of bilaterian relationships based on mitochondrial genome data. The analyzed data set comprises more than 650 mitochondrial genomes that have been chosen to represent a profound sample of the phylogenetic as well as sequence diversity. The results are based on high quality amino acid alignments obtained from a complete reannotation of the mitogenomic sequences from NCBI RefSeq database. However, the results failed to give support for many otherwise undisputed high-ranking taxa, like Mollusca, Hexapoda, Arthropoda, and suffer from extreme long branches of Nematoda, Platyhelminthes, and some other taxa. In order to identify the sources of misleading phylogenetic signals, we discuss several problems associated with mitochondrial genome data sets, e.g. the nucleotide and amino acid landscapes and a strong correlation of gene rearrangements with long branches.
Remipedes are a small and enigmatic group of crustaceans, first described only 30 years ago. Analyses of both morphological and molecular data have recently suggested a close relationship between Remipedia and Hexapoda. If true, the remipedes occupy an important position in pancrustacean evolution and may be pivotal for understanding the evolutionary history of crustaceans and hexapods. However, it is important to test this hypothesis using new data and new types of analytical approaches. Here, we assembled a phylogenomic data set of 131 taxa, incorporating newly generated 454 expressed sequence tag (EST) data from six species of crustaceans, representing five lineages (Remipedia, Laevicaudata, Spinicaudata, Ostracoda, and Malacostraca). This data set includes all crustacean species for which EST data are available (46 species), and our largest alignment encompasses 866,479 amino acid positions and 1,886 genes. A series of phylogenomic analyses was performed to evaluate pancrustacean relationships. We significantly improved the quality of our data for predicting putative orthologous genes and for generating data subsets by matrix reduction procedures, thereby improving the signal to noise ratio in the data. Eight different data sets were constructed, representing various combinations of orthologous genes, data subsets, and taxa. Our results demonstrate that the different ways to compile an initial data set of core orthologs and the selection of data subsets by matrix reduction can have marked effects on the reconstructed phylogenetic trees. Nonetheless, all eight data sets strongly support Pancrustacea with Remipedia as the sister group to Hexapoda. This is the first time that a sister group relationship of Remipedia and Hexapoda has been inferred using a comprehensive phylogenomic data set that is based on EST data. We also show that selecting data subsets with increased overall signal can help to identify and prevent artifacts in phylogenetic analyses.
Enormous molecular sequence data have been accumulated over the past several years and are still exponentially growing with the use of faster and cheaper sequencing techniques. There is high and widespread interest in using these data for phylogenetic analyses. However, the amount of data that one can retrieve from public sequence repositories is virtually impossible to tame without dedicated software that automates processes. Here we present a novel bioinformatics pipeline for downloading, formatting, filtering and analyzing public sequence data deposited in GenBank. It combines some well-established programs with numerous newly developed software tools (available at http://software.zfmk.de/).
Molecular sequences not only allow the reconstruction of phylogenetic relationships among species, but also provide information on the approximate divergence times. Whereas the fossil record dates the origin of most multicellular animal phyla during the Cambrian explosion less than 540 million years ago (mya), molecular clock calculations usually suggest much older dates. Here we used a large multiple sequence alignment derived from Expressed Sequence Tags and genomes comprising 129 genes (37,476 amino acid positions) and 117 taxa, including 101 arthropods. We obtained consistent divergence time estimates applying relaxed Bayesian clock models with different priors and multiple calibration points. While the influence of substitution rates, missing data, and model priors was negligible, the clock model had a significant effect. A log-normal autocorrelated model was selected on the basis of cross-validation. We calculated that arthropods emerged ~600 mya. Onychophorans (velvet worms) and euarthropods split ~590 mya, Pancrustacea and Myriochelata ~560 mya, Myriapoda and Chelicerata ~555 mya, and Crustacea and Hexapoda ~510 mya. Endopterygote insects appeared ~390 mya. These dates are considerably younger than most previous molecular clock estimates and in better agreement with the fossil record. Nevertheless, a Precambrian origin of arthropods and other metazoan phyla is still supported. Our results also demonstrate the applicability of large datasets of random nuclear sequences for approximating the timing of multicellular animal evolution.
Martialinae are pale, eyeless and probably hypogaeic predatory ants. Morphological character sets suggest a close relationship to the ant subfamily Leptanillinae. Recent analyses based on molecular sequence data suggest that Martialinae are the sister group to all extant ants. However, by comparing molecular studies and different reconstruction methods, the position of Martialinae remains ambiguous. While this sister group relationship was well supported by Bayesian partitioned analyses, Maximum Likelihood approaches could not unequivocally resolve the position of Martialinae. By re-analysing a previously published molecular data set, we show that the Maximum Likelihood approach is highly appropriate to resolve deep ant relationships, especially between Leptanillinae, Martialinae and the remaining ant subfamilies. Based on improved alignments, alignment masking, and tree reconstructions with a sufficient number of bootstrap replicates, our results strongly reject a placement of Martialinae at the first split within the ant tree of life. Instead, we suggest that Leptanillinae are a sister group to all other extant ant subfamilies, whereas Martialinae branch off as a second lineage. This assumption is backed by approximately unbiased (AU) tests, additional Bayesian analyses and split networks. Our results demonstrate clear effects of improved alignment approaches, alignment masking and data partitioning. We hope that our study illustrates the importance of thorough, comprehensible phylogenetic analyses using the example of ant relationships.
Arthropods were the first animals to conquer land and air. They encompass more than three quarters of all described living species. This extraordinary evolutionary success is based on an astoundingly wide array of highly adaptive body organizations. A lack of robustly resolved phylogenetic relationships, however, currently impedes the reliable reconstruction of the underlying evolutionary processes. Here, we show that phylogenomic data can substantially advance our understanding of arthropod evolution and resolve several conflicts among existing hypotheses. We assembled a data set of 233 taxa and 775 genes from which an optimally informative data set of 117 taxa and 129 genes was finally selected using new heuristics and compared with the unreduced data set. We included novel expressed sequence tag (EST) data for 11 species and all published phylogenomic data augmented by recently published EST data on taxonomically important arthropod taxa. This thorough sampling reduces the chance of obtaining spurious results due to stochastic effects of undersampling taxa and genes. Orthology prediction of genes, alignment masking tools, and selection of most informative genes due to a balanced taxa-gene ratio using new heuristics were established. Our optimized data set robustly resolves major arthropod relationships. We obtained strong support for a sister group relationship of onychophorans and euarthropods and strong support for a close association of tardigrades and Cycloneuralia. Within pancrustaceans, our analyses yielded paraphyletic crustaceans and monophyletic hexapods and robustly resolved monophyletic endopterygote insects. However, our analyses also showed a remarkable sensitivity to methods of analysis for a few deep splits that were recently thought to be resolved, for example, the position of myriapods.
Methods of alignment masking, which refers to the technique of excluding alignment blocks prior to tree reconstructions, have been successful in improving the signal-to-noise ratio in sequence alignments. However, the lack of formally well defined methods to identify randomness in sequence alignments has prevented a routine application of alignment masking. In this study, we compared the effects on tree reconstructions of the most commonly used profiling method (GBLOCKS), which uses a predefined set of rules in combination with alignment masking, with a new profiling approach (ALISCORE) based on Monte Carlo resampling within a sliding window, using different data sets and alignment methods. While the GBLOCKS approach excludes variable sections above a certain threshold whose choice is left arbitrary, the ALISCORE algorithm is free of a priori rating of parameter space and therefore more objective.
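The general idea of Monte Carlo resampling within a sliding window can be sketched for a single pair of aligned sequences. This is an illustration only and does not reproduce ALISCORE's scoring scheme: the match-count score, the within-window shuffling used as the null model, and the parameters `window`, `replicates`, `quantile` and `seed` are all assumptions made for this example, and gaps are treated like any other character.

```python
import random
from typing import List


def random_like_windows(seq_a: str, seq_b: str, window: int = 8,
                        replicates: int = 200, quantile: float = 0.95,
                        seed: int = 0) -> List[bool]:
    """Flag windows whose similarity is no better than a shuffled null.

    For each window start, the observed number of matching positions is
    compared with the match counts obtained after shuffling the window of
    seq_a. A window is flagged True (random-like) if the observed count
    does not exceed the chosen quantile of that null distribution.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    flags = []
    for start in range(len(seq_a) - window + 1):
        win_a = list(seq_a[start:start + window])
        win_b = seq_b[start:start + window]
        observed = sum(x == y for x, y in zip(win_a, win_b))
        null = []
        for _ in range(replicates):
            shuffled = win_a[:]
            rng.shuffle(shuffled)
            null.append(sum(x == y for x, y in zip(shuffled, win_b)))
        null.sort()
        threshold = null[min(int(quantile * replicates), replicates - 1)]
        flags.append(observed <= threshold)
    return flags


if __name__ == "__main__":
    # Hypothetical pair: a conserved block followed by a divergent block.
    a = "ACGTACGT" + "AAAAAAAA"
    b = "ACGTACGT" + "CCCCCCCC"
    flags = random_like_windows(a, b)
    print(flags[0], flags[-1])
```

Positions covered only by flagged windows would be candidates for masking. Because the cutoff comes from resampling rather than from a user-set variability threshold, the sketch mirrors the contrast with rule-based profiling drawn in the abstract.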
FASconCAT is user-friendly software that rapidly concatenates different kinds of sequence data into one supermatrix file. Output files are either in FASTA, PHYLIP or NEXUS format and are directly loadable in phylogenetic programs like PAUP*, RAxML or MrBayes. FASconCAT can handle FASTA, PHYLIP and CLUSTAL formatted input files in one single run. It provides useful information about each input file and the concatenated supermatrix. For example, the program provides the range information of each concatenated gene (partition) and delivers a check list of all concatenated sequences (taxa). Information about the base composition of single input files and the resulting supermatrix is supplied for nucleotide data. For given structure strings (e.g. secondary structures) it displays single unpaired (loop) and paired (stem) positions after the concatenation process. Optionally, FASconCAT generates NEXUS files of concatenated sequences, either with MrBayes commands directly executable in PAUP* and MrBayes, or without any specific commands. If preferred, FASconCAT writes output files in PHYLIP format with relaxed (unlimited characters) or restricted taxon names (up to ten characters), while sequences are printed in non-interleaved format. FASconCAT is implemented in Perl and freely available from http://software.zfmk.de. It runs on UNIX and MS Windows operating systems.
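The core operation, concatenating per-gene alignments into a supermatrix while recording each gene's partition range, can be sketched in a few lines. FASconCAT itself is written in Perl; the Python below is an independent illustration, not its implementation. Filling taxa that are absent from a gene with a missing-data character (`N` by default) is an assumption of this sketch, as are the function names.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {taxon name: sequence}."""
    seqs: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line
    return seqs


def concatenate(genes: List[Tuple[str, Dict[str, str]]], missing: str = "N"
                ) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Concatenate aligned genes; return the supermatrix and partition ranges.

    Ranges are 1-based and inclusive. Taxa lacking a gene are padded with
    the missing-data character (an assumption of this sketch).
    """
    taxa = sorted({taxon for _, aln in genes for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    partitions = []
    position = 1
    for gene, aln in genes:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, missing * length)
        partitions.append((gene, position, position + length - 1))
        position += length
    return supermatrix, partitions


if __name__ == "__main__":
    # Hypothetical two-gene example.
    gene1 = parse_fasta(">A\nACGT\n>B\nAC-T\n")
    gene2 = parse_fasta(">A\nGG\n>C\nTT\n")
    matrix, parts = concatenate([("gene1", gene1), ("gene2", gene2)])
    for taxon, seq in matrix.items():
        print(taxon, seq)
    print(parts)
```

The partition ranges are the kind of per-gene range information the abstract mentions, which downstream programs need for partitioned analyses.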
Whenever different data sets arrive at conflicting phylogenetic hypotheses, only testable causal explanations of sources of errors in at least one of the data sets allow us to critically choose among the conflicting hypotheses of relationships. The large (28S) and small (18S) subunit rRNAs are among the most popular markers for studies of deep phylogenies. However, some nodes supported by these data are suspected of being artifacts caused by peculiarities of the evolution of these molecules. Arthropod phylogeny is an especially controversial subject dotted with conflicting hypotheses which are dependent on data set and method of reconstruction. We assume that phylogenetic analyses based on these genes can be improved further i) by enlarging the taxon sample, ii) by employing more realistic models of sequence evolution incorporating non-stationary substitution processes, and iii) by considering covariation and pairing of sites in rRNA genes.
The phylogeny of insects, one of the most spectacular radiations of life on earth, has received considerable attention. However, the evolutionary roots of one intriguing group of insects, the twisted-wing parasites (Strepsiptera), remain unclear despite centuries of study and debate. Strepsiptera exhibit exceptional larval developmental features, consistent with a predicted step from direct (hemimetabolous) larval development to complete metamorphosis that could have set the stage for the spectacular radiation of metamorphic (holometabolous) insects. Here we report the sequencing of a Strepsiptera genome and show that the analysis of sequence-based genomic data (comprising more than 18 million nucleotides from nearly 4,500 genes obtained from a total of 13 insect genomes), along with genomic metacharacters, clarifies the phylogenetic origin of Strepsiptera and sheds light on the evolution of holometabolous insect development. Our results provide overwhelming support for Strepsiptera as the closest living relatives of beetles (Coleoptera). They demonstrate that the larval developmental features of Strepsiptera, reminiscent of those of hemimetabolous insects, are the result of convergence. Our analyses solve the long-standing enigma of the evolutionary roots of Strepsiptera and reveal that the holometabolous mode of insect development is more malleable than previously thought.
In this study, we investigated the relationships among insect orders with a main focus on Polyneoptera (lower Neoptera: roaches, mantids, earwigs, grasshoppers, etc.), and Paraneoptera (thrips, lice, bugs in the wide sense). The relationships between and within these groups of insects are difficult to resolve because only a few informative molecular and morphological characters are available. Here, we provide the first phylogenomic expressed sequence tag data (EST: short sub-sequences from a copy DNA (cDNA) sequence encoding proteins) for stick insects (Phasmatodea) and webspinners (Embioptera) to complete published EST data. As recent EST datasets are characterized by a heterogeneous distribution of available genes across taxa, we use different rationales to optimize the data matrix composition. Our results suggest a monophyletic origin of Polyneoptera and Eumetabola (Paraneoptera + Holometabola). However, we identified artefacts of tree reconstruction (human louse Pediculus humanus assigned to Odonata (damselflies and dragonflies) or Holometabola (insects with a complete metamorphosis); mayfly genus Baetis nested within Neoptera), which were most probably rooted in a data matrix composition bias due to the inclusion of sequence data of entire proteomes. Until entire proteomes are available for each species in phylogenomic analyses, this potential pitfall should be carefully considered.