Molecular Phylogenetics - Genomes - NCBI Bookshelf
Once the divergences between all pairs of samples have been determined, the resulting triangular matrix of differences is submitted to some form of statistical cluster analysis, and the resulting dendrogram is examined to see whether the samples cluster in the way that would be expected from current ideas about the taxonomy of the group.
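The clustering step can be illustrated with UPGMA, one common choice of cluster analysis; the text does not say which algorithm is meant, so this is only a sketch under that assumption. The function name `upgma`, the sample labels and the distance values are all hypothetical, and the result is a topology-only Newick string without branch lengths.

```python
def upgma(names, dist):
    """Cluster samples with UPGMA and return a topology-only Newick string.

    names: list of sample labels.
    dist:  symmetric matrix (list of lists) of pairwise differences.
    """
    n = len(names)
    # Each cluster id maps to (newick_string, number_of_samples_in_cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Store only the upper triangle, keyed by (smaller_id, larger_id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)          # closest pair of clusters
        name_a, size_a = clusters.pop(a)
        name_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average keeps every original sample equally weighted.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop all distances involving the two clusters that were merged.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(merged)
        clusters[next_id] = ("(" + name_a + "," + name_b + ")", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical difference matrix for four samples.
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(labels, matrix)
print(tree)
```

The nesting of the parentheses in the printed string is the dendrogram that would then be compared with the expected taxonomy.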
Any group of haplotypes that are all more similar to one another than any of them is to any other haplotype may be said to constitute a clade, which may be represented visually, as the accompanying figure demonstrates. Statistical techniques such as bootstrapping and jackknifing help provide reliability estimates for the positions of haplotypes within the evolutionary trees.
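As a rough illustration of the bootstrapping mentioned above, the sketch below builds one bootstrap replicate by resampling alignment columns with replacement. The function name and the toy alignment are assumptions made for the example; in a real analysis a tree would be rebuilt from each replicate and the frequency of each clade across replicates would be reported as its support.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sample name to an aligned sequence (equal lengths).
    rng:       a random.Random instance, passed in so results are reproducible.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    # Draw column indices with replacement; the same columns are used for
    # every sample so that homologous positions stay together.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


# Hypothetical toy alignment of three haplotypes.
toy = {"hap1": "ACGTACGT", "hap2": "ACGTACGA", "hap3": "ACGAACGA"}
replicate = bootstrap_alignment(toy, random.Random(1))
print(replicate)
```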
In general, closely related organisms have a high degree of similarity in the molecular structure of these substances, while the molecules of organisms distantly related often show a pattern of dissimilarity.
Conserved sequences, such as mitochondrial DNA, are expected to accumulate mutations over time and, assuming a constant rate of mutation, provide a molecular clock for dating divergence. Molecular phylogeny uses such data to build a "relationship tree" that shows the probable evolution of various organisms. With the invention of Sanger sequencing, it became possible to isolate and identify these molecular structures. The most common approach is the comparison of homologous sequences for genes, using sequence alignment techniques to identify similarity.
Another application of molecular phylogeny is in DNA barcoding, wherein the species of an individual organism is identified using small sections of mitochondrial DNA or chloroplast DNA. Another application of the techniques that make this possible can be seen in the very limited field of human genetics, such as the ever-more-popular use of genetic testing to determine a child's paternity, as well as the emergence of a new branch of criminal forensics focused on evidence known as genetic fingerprinting.
Molecular phylogenetic analysis
There are several methods available for performing a molecular phylogenetic analysis. A phylogenetic analysis typically consists of five major steps. The first stage comprises sequence acquisition. The second consists of performing a multiple sequence alignment, which is the fundamental basis for constructing a phylogenetic tree.
The third stage involves choosing among models of DNA and amino acid substitution, of which several exist. The fourth stage consists of various methods of tree building, including distance-based and character-based methods.
The normalized Hamming distance and the Jukes-Cantor correction formulas provide the degree of divergence and the probability that a nucleotide changes to another, respectively. UPGMA is a simple method; however, it is less accurate than the neighbor-joining approach.
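The two distance formulas named above can be sketched as follows: the normalized Hamming distance is the proportion p of differing sites, and the Jukes-Cantor correction converts p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The function names and the example sequences are illustrative assumptions; sites with gaps or ambiguous characters are simply skipped here, which is one possible convention among several.

```python
import math


def p_distance(seq_a, seq_b):
    """Normalized Hamming distance: proportion of compared sites that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only sites where both sequences have an unambiguous nucleotide.
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        # The correction is undefined once sequences look saturated.
        raise ValueError("p must be below 0.75 for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical pair of aligned sequences differing at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
d = jukes_cantor(p)
print(p, d)
```

The corrected distance is always at least as large as p, because it allows for multiple substitutions having occurred at the same site.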
Finally, the last step comprises evaluating the trees.
A second possibility is to weight the models with respect to their probability of being the generating model given the data. For practical purposes, this posterior probability can be approximated by Akaike weights [96]. The difficulty here is that model averaging requires analyzing the data even for models that, a posteriori, turn out to have extremely small probabilities or weights.
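The Akaike weights mentioned above can be computed from the AIC score of each candidate model: each model's relative likelihood is exp(-Δ/2), where Δ is its AIC minus the smallest AIC, and the weights are these values normalized to sum to one. The sketch below uses made-up AIC scores and an assumed function name, purely for illustration.

```python
import math


def akaike_weights(aic_scores):
    """Convert a dict of model name -> AIC score into Akaike weights."""
    best = min(aic_scores.values())
    # Relative likelihood of each model compared with the best-scoring one.
    relative = {model: math.exp(-0.5 * (aic - best))
                for model, aic in aic_scores.items()}
    total = sum(relative.values())
    return {model: value / total for model, value in relative.items()}


# Hypothetical AIC scores for three substitution models fitted to one alignment.
scores = {"JC69": 2010.0, "HKY85": 2000.0, "GTR": 2002.0}
weights = akaike_weights(scores)
for model, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(model, round(weight, 4))
```

Note that every candidate model still has to be fitted before its weight is known, which is the cost the text refers to.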
This may be seen as a waste of resources (computing time and storage space).
Integrated Bayesian approaches
Mixture models can work within the framework of maximum likelihood, but the treatment of the weight factors is complicated.
A sound alternative is to resort to a fully Bayesian approach. The advantage of RJ-MCMC samplers is that they allow estimating the phylogeny while integrating over the uncertainty pertaining to the parameters of the substitution model and even integrating over the model itself.
Mixture models are available in BayesPhylogenies [37] for nucleotide models. The CAT model recently proved successful in a number of empirical and simulation studies in avoiding the artifact known as long-branch attraction. This model is freely available in the PhyloBayes software (see Table 1). All these models assume that each site evolves independently.
The independence assumption greatly simplifies the computations, but is also highly unrealistic. Models that describe the evolution of doublets in RNA genes, triplets in codon models, or other models with local or context dependencies exist, but complete dependence models are still in their infancy and, so far, have only been implemented in a Bayesian framework.
One particularly interesting feature of this approach is that complete dependence models incorporate information about the three-dimensional (3D) structure of proteins and therefore permit the explicit modeling of structural constraints or of any other site-interdependence pattern. The incorporation of 3D structures also allows the establishment of a direct relationship between evolution at the DNA level and at the phenotypic level.
This link between genotype and phenotype is established via a proxy that plays the role of a fitness function which, in retrospect, can be used to predict amino-acid sequences compatible with a given target structure, that is, to help in protein design. In addition, while examples of adaptive evolution at the morphological level abound, from Darwin's finches in the Galapagos to cichlid fishes in the East African lakes, the role of natural selection in shaping the evolution of genomes is much more controversial.
First, the neutral theory of molecular evolution asserts that much of the variation at the DNA level is due to the random fixation of mutations with no selective advantage.
Second, a compelling body of evidence suggests that most of the genomic complexities have emerged by nonadaptive processes. A number of statistical approaches exist either to test neutrality at the population level or to detect positive Darwinian evolution at the species level. A shortcoming of neutrality tests is their dependence on a demographic model and their sensitivity to processes of molecular evolution such as among-site rate variation.
They also do not model alternative hypotheses that would permit distinguishing negative selection from adaptive evolution. The development of demographic models based on Poisson random fields and composite likelihoods makes it possible both to estimate the strength of selection and to assess the impact of a variety of scenarios on allele frequency spectra [9].
But demographic singularities such as bottlenecks can still generate spurious signatures of positive selection. When effective population sizes are no longer a concern, for instance in studies at or above the species level, the detection of positive selection in protein-coding genes usually relies on codon models; methods based on amino-acid models also exist.
Codon models permit distinguishing between synonymous substitutions, which are likely to be neutral, and nonsynonymous substitutions, which are directly exposed to the action of selection. If synonymous and nonsynonymous substitutions accumulate at the same rate, then the protein-coding gene is likely to evolve neutrally. Alternatively, if nonsynonymous substitutions accumulate more slowly than synonymous substitutions, the likely explanation is that nonsynonymous substitutions are deleterious, which suggests the action of purifying selection.
Conversely, the accumulation of nonsynonymous substitutions faster than synonymous substitutions suggests the action of positive selection. An extension exists to detect selection in noncoding regions, and a promising phylogenetic hidden Markov (phylo-HMM) model permits detection of selection in overlapping genes.
The most intuitive methods, called counting methods, work in three steps. Counting methods are, however, not optimal in the sense that most work on pairs of sequences and therefore, just like neighbor-joining, fail to account for all the information contained in an alignment.
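To make the pairwise nature of counting methods concrete, the sketch below classifies codon differences between two aligned coding sequences as synonymous or nonsynonymous using the standard genetic code. It is deliberately crude and is not one of the published counting methods: it only tallies codons that differ at exactly one position, and it neither counts synonymous and nonsynonymous sites nor corrects for multiple substitutions. All names and the example sequences are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (third position varying fastest); "*" marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of single-site codon differences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a) - 2, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if sum(x != y for x, y in zip(codon_a, codon_b)) != 1:
            continue  # skip codons with more than one difference
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical sequences: GCT->GCC keeps alanine, AAA->AGA changes lysine to arginine.
print(count_codon_differences("ATGGCTAAA", "ATGGCCAGA"))
```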
In addition, simulations suggest that counting methods can be sensitive to a variety of biases such as unequal transition and transversion rates, or uneven base or codon frequencies. Counting methods that incorporate these biases generally perform better than those that do not, but the maximum likelihood method still appears more robust to severe biases. In addition, the maximum likelihood method, which accounts for all the information in a data set, has good power and good accuracy in detecting positive selection.
Branch models were then developed, quickly followed by site models and by branch-site models. All these approaches, as implemented in PAML, rely on likelihood ratio tests to detect adaptive evolution. Simulations show that some of these tests are conservative, so that detection of adaptive evolution should be safe as long as convergence of the analyses is carefully checked, including in large-scale analyses.
If the model allowing adaptive evolution explains the data significantly better than the null model, then an empirical Bayes approach can be used to identify which sites are likely to evolve adaptively. The empirical Bayes approach relies on estimates of the model parameters, which can have large sampling errors in small data sets. Because these sampling errors can cause the empirical Bayes site identification to be unreliable, a Bayes empirical Bayes approach was proposed and was shown to have good power and low false-positive rates.
Full Bayesian approaches that allow for uncertain parameter estimates were also proposed. Yet simulations showed that they did not improve further on Bayes empirical Bayes estimates, so that the computational overhead incurred by full Bayes methods may not be necessary in this case.
One particular case where a Bayesian approach is nonetheless required is distinguishing the signature of adaptive evolution from that of recombination, as these two processes can leave similar signals in DNA sequences. The codon model with recombination implemented in OmegaMap [48] can then be used to tease apart these two processes.
In spite of over four decades of history, molecular dating has only recently seen new developments.
One of the reasons for this slow progress is that, unlike the other parts of phylogenetic analysis, divergence times are parameters that cannot be estimated directly. Only sitewise likelihood values and distances between pairs of sequences are identifiable, that is, directly estimable. As a result, time durations and, likewise, divergence times cannot be estimated without making an additional assumption on the rates of evolution.
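The confounding of rate and time described above can be shown with simple arithmetic. Under a strict clock, the distance between two sequences is d = 2rt, where r is the rate per lineage and t the time since divergence, so only the product rt is determined by the data; a time can be obtained only once a rate is assumed or calibrated. The sketch below uses hypothetical function names and made-up numbers.

```python
def calibrate_rate(distance, known_time):
    """Rate per lineage (substitutions/site/year) from a pair with a known split time."""
    if known_time <= 0:
        raise ValueError("known_time must be positive")
    # Both lineages accumulate changes, hence the factor of two.
    return distance / (2.0 * known_time)


def divergence_time(distance, rate):
    """Divergence time in years under a strict clock with the given rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical calibration: a pair at distance 0.10 known to have split 5 million years ago.
rate = calibrate_rate(0.10, 5e6)
# Dating a second, uncalibrated pair at distance 0.04 with the same rate.
print(rate, divergence_time(0.04, rate))

# The same distance is equally compatible with a rate twice as fast and half the time.
print(divergence_time(0.04, 2 * rate))
```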
The simplest assumption is to posit that rates are constant in time, which is known as the molecular clock hypothesis. This hypothesis can be tested, for instance, with PAUP or PAML, by means of a likelihood ratio test that compares a constrained model (clock) with an unconstrained model (no clock). The systematic test of the molecular clock assumption on recent data shows that this hypothesis is too often untenable.
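The likelihood ratio test of the clock can be sketched as follows. The test statistic is twice the difference in log-likelihood between the unconstrained and the clock-constrained model, compared with a chi-square distribution; for n sequences the usual degrees of freedom are n - 2, the difference between 2n - 3 branch lengths and n - 1 node ages. The log-likelihood values below are made up, the function names are assumptions, and in practice the two values would come from a program such as PAUP or PAML. The chi-square tail probability is computed with closed-form series for integer degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2k: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df = 2k+1: erfc(sqrt(x/2)) plus k correction terms.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def clock_lrt(lnl_clock, lnl_no_clock, n_sequences):
    """Return (statistic, df, p_value) for the molecular clock likelihood ratio test."""
    if n_sequences < 3:
        raise ValueError("at least three sequences are needed")
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_sequences - 2
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical log-likelihoods for an alignment of six sequences.
statistic, df, p_value = clock_lrt(-2010.5, -2000.0, 6)
print(statistic, df, p_value)
```

A small p-value means the clock-constrained model fits significantly worse, so the clock hypothesis would be rejected for that data set.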
The most recent work has then focused on relaxing this assumption, and three different directions have emerged. A first possibility is to relax the clock globally on the phylogeny, but to assume that the hypothesis still holds locally for closely related species.
Recent developments of these local clock models now allow the use of multiple calibration points and of multiple genes, the automatic placement of the clocks on the tree, and the estimation of the number of local clocks. PAML can be used for most of these computations. However, local clock models still tend to underestimate rapid rate change.
The second possibility to relax the global clock assumption is to assume that rates of evolution evolve in an autocorrelated manner along lineages and to minimize the amount of rate change over the entire phylogeny. The most popular approach in the plant community is Sanderson's penalized likelihood, implemented in r8s [55].
This approach performs well on data sets for which the actual fossil dates are known but still tends to underestimate the actual amount of rate change. Bayesian methods now appear as the emerging approach to estimating divergence times. Taking inspiration from Sanderson's pioneering work, Thorne et al.
These prior distributions can actually be interpreted as penalty functions [45], and they can have simple or more complicated forms. The Multidivtime program [45-47] is extremely quick to analyze data thanks to the use of a multivariate normal approximation of the likelihood surface. It assumes that rates of evolution change following a stationary lognormal prior distribution. Further work suggested that it might not always be the best performing rate prior, but these latter studies had two potential shortcomings.
One potential limitation of the Bayesian approach described so far is its dependence on one single tree topology, which must be either known ahead of time or estimated by other means.
Recently, Drummond et al. As a result, their approach is able to estimate the most probable tree (given the data and the substitution model), the divergence times and the position of the root, even without any outgroup or without resorting to a nonreversible model of substitution. In addition, when the focus is on estimating divergence times, a recent analysis suggests that this uncorrelated model of rate change could outperform the methods described above in accommodating rapid rate change among lineages.
Implemented in BEAST, this approach offers a variety of substitution models and prior distributions and presents a graphical user interface that will appeal to numerous researchers [39].
Although low coverage limits the complete assembly of many genome projects, it still allows quick access to draft genomes for a growing number of species.
Non-synonymous substitutions occur at a slower rate than synonymous ones. This is because a mutation that results in a change in the amino acid sequence of a protein might be deleterious to the organism, so the accumulation of non-synonymous mutations in the population is reduced by the processes of natural selection (see Box). This means that when gene sequences in two species are compared, there are usually fewer non-synonymous than synonymous substitutions.
The molecular clock for mitochondrial genes is faster than that for genes in the nuclear genome. This is probably because mitochondria lack many of the DNA repair systems that operate on nuclear genes (see Section). Despite these complications, molecular clocks have become an immensely valuable adjunct to tree reconstruction, as we will see in the next section when we look at some typical molecular phylogenetics projects.
The Applications of Molecular Phylogenetics
Molecular phylogenetics has grown in stature since the start of the s, largely because of the development of more rigorous methods for tree building, combined with the explosion of DNA sequence information obtained initially by PCR analysis and more recently by genome projects.
The importance of molecular phylogenetics has also been enhanced by the successful application of tree reconstruction and other phylogenetic techniques to some of the more perplexing issues in biology. In this final section we will survey some of these successes.
Examples of the use of phylogenetic trees
First, we will consider two projects that illustrate the various ways in which conventional tree reconstruction is being used in modern molecular biology.
DNA phylogenetics has clarified the evolutionary relationships between humans and other primates
Darwin was the first biologist to speculate on the evolutionary relationships between humans and other primates. His view - that humans are closely related to the chimpanzee, gorilla and orangutan - was controversial when it was first proposed and fell out of favor, even among evolutionists, in the following decades.
Indeed, biologists were among the most ardent advocates of an anthropocentric view of our place in the animal world Goodman, From studies of fossils, paleontologists had concluded prior to that chimpanzees and gorillas are our closest relatives but that the relationship was distant, the split, leading to humans on the one hand and chimpanzees and gorillas on the other, having occurred some 15 million years ago. The first detailed molecular data, obtained by immunological studies in the s Goodman, ; Sarich and Wilson, confirmed that humans, chimpanzees and gorillas do indeed form a single clade see Box This was one of the first attempts to apply a molecular clock to phylogenetic data and the result was, quite naturally, treated with some suspicion.
In fact, an acrimonious debate opened up between paleontologists, who believed in the ancient split indicated by the fossil evidence, and biologists, who had more confidence in the recent date suggested by the molecular data. As more and more molecular data were obtained, the difficulties in establishing the exact pattern of the evolutionary events that led to humans, chimpanzees and gorillas became apparent.
Comparisons of the mitochondrial genomes of the three species by restriction mapping Section 5. This makes it difficult to establish relationships unambiguously. The solution to the problem has been to make comparisons between as many different genes as possible and to target those loci that are expected to show the greatest amount of dissimilarity.
By14 different molecular datasets had been obtained, including sequences of variable loci such as pseudogenes and non-coding sequences Ruvolo, Analysis of these datasets confirmed that the chimpanzee is the closest relative to humans, with our lineages diverging 4. The gorilla is a slightly more distant cousin, its lineage having diverged from the human-chimp one between 0.
The demonstration in the early s that HIV-1 is responsible for AIDS was quickly followed by speculation about the origin of the disease. Speculation centered around the discovery that similar immunodeficiency viruses are present in primates such as the chimpanzee, sooty mangabey, mandrill and various monkeys.
These simian immunodeficiency viruses (SIVs) are not pathogenic in their normal hosts, but it was thought that if one had been transferred to humans, then within this new species the virus might have acquired new properties, such as the ability to cause disease and to spread rapidly through the population.
Retrovirus genomes accumulate mutations relatively quickly because reverse transcriptase, the enzyme that copies the RNA genome contained in the virus particle into the DNA version that integrates into the host genome (see Section 2.), lacks an efficient proofreading activity. This means that the molecular clock runs rapidly in retroviruses, and genomes that diverged quite recently display sufficient nucleotide dissimilarity for a phylogenetic analysis to be carried out.
Even though the evolutionary period we are interested in is less than years, HIV and SIV genomes contain sufficient data for their relationships to be inferred by phylogenetic analysis. The starting point for this phylogenetic analysis is RNA extracted from virus particles. Comparison between virus DNA sequences has resulted in the reconstructed tree shown in the accompanying figure. This tree has a number of interesting features.
First, it shows that different samples of HIV-1 have slightly different sequences, the samples as a whole forming a tight cluster, almost a star-like pattern, that radiates from one end of the unrooted tree.
This star-like topology implies that the global AIDS epidemic began with a very small number of viruses, perhaps just one, which have spread and diversified since entering the human population. The closest relative to HIV-1 among primates is the SIV of chimpanzees, the implication being that this virus jumped across the species barrier between chimps and humans and initiated the AIDS epidemic.
However, this epidemic did not begin immediately. It appears that HIV-2 was transferred to the human population independently of HIV-1, and from a different simian host.
Figure caption: ZR59 is positioned near the root of the star-like pattern formed by genomes of this type. Based on Wain-Hobson.
The RNA was highly fragmented and only a short DNA sequence could be obtained, but this was sufficient for the sequence to be placed on the phylogenetic tree (see Figure). This sequence, called ZR59, attaches to the tree by a short branch that emerges from near the center of the HIV-1 radiation.
The positioning indicates that the ZR59 sequence represents one of the earliest versions of HIV-1 and shows that the global spread of HIV-1 was already underway by A later and more comprehensive analysis of HIV-1 sequences has suggested that the spread began in the period between andwith a best estimate of Korber et al. Pinning down the date in this way has enabled epidemiologists to begin an investigation of the historic and social conditions that might have been responsible for the start of the AIDS epidemic.
Molecular phylogenetics as a tool in the study of human prehistory
Now we will turn our attention to the use of molecular phylogenetics in intraspecific studies. We could choose any one of several different organisms to illustrate the approaches and applications of intraspecific studies, but many people look on Homo sapiens as the most interesting organism, so we will investigate how molecular phylogenetics is being used to deduce the origins of modern humans and the geographic patterns of their recent migrations in the Old and New Worlds.
Intraspecific studies require highly variable genetic loci
In any application of molecular phylogenetics, the genes chosen for analysis must display variability in the organisms being studied.
If there is no variability then there is no phylogenetic information. This presents a problem in intraspecific studies because the organisms being compared are all members of the same species and so share a great deal of genetic similarity, even if the species has split into populations that interbreed only intermittently. This means that the DNA sequences that are used in the phylogenetic analysis must be the most variable ones that are available.
In humans there are three main possibilities. Multiallelic genes, such as members of the HLA family Section 5. Cells do not appear to have any repair mechanism for reversing the effects of replication slippage, so new microsatellite alleles are generated relatively frequently. Mitochondrial DNA which, as mentioned in Section The mitochondrial DNA variants present in a single species are called haplotypes.
It is important to note that it is not the potential for change that is critical to the application of these loci in phylogenetic analysis, it is the fact that different alleles or haplotypes of the locus coexist in the population as a whole.
The loci are therefore polymorphic (see Box).
The origins of modern humans - out of Africa or not?
It seems reasonably certain that the origin of humans lies in Africa because it is here that all of the oldest pre-human fossils have been found. The paleontological evidence reveals that hominids first moved outside of Africa over 1 million years ago, but these were not modern humans, they were an earlier species called Homo erectus. These were the first hominids to become geographically dispersed, eventually spreading to all parts of the Old World.
The events that followed the dispersal of Homo erectus are controversial. From comparisons using fossil skulls and bones, paleontologists have concluded that the Homo erectus populations that became located in different parts of the Old World gave rise to the modern human populations of those areas by a process called multiregional evolution (Figure). There may have been a certain amount of interbreeding between humans from different geographic regions, but, to a large extent, these various populations remained separate throughout their evolutionary history.