13 | Phylogenetics II Flashcards
What are the types of character-based methods for inferring a phylogeny?
Give some examples
parsimony
statistical methods
- maximum likelihood, eg RAxML, PhyML, IQ-Tree, FastTree
- Bayesian inference, eg MrBayes, BEAST
What is the basic concept and what are the basic steps of character-based methods of phylogeny inference?
many trees are computed and evaluated
- starting tree (computed using quick and dirty methods)
- improve it by branch swapping / tree rearrangement
- keep best tree(s), according to optimality criterion
In character-based methods, which criteria can be used to decide which tree to keep?
keep best tree(s), according to optimality criterion
- MP: single best tree = tree with the fewest changes
- ML: single best tree = tree with highest likelihood
- Bayesian: keep multiple best trees
In character-based methods, how can the starting tree be improved upon?
improve it by branch swapping / tree rearrangement
- NNI (nearest neighbor interchange)
- SPR (subtree pruning and regrafting)
- TBR (tree bisection and reconnection)
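A minimal sketch of the simplest rearrangement, NNI, assuming a tree is written as nested pairs and that only the internal branch separating (A,B) from (C,D) is rearranged. The function name and the encoding are illustrative assumptions, not taken from any phylogenetics program.

```python
def nni_neighbors(tree):
    """Return the two NNI neighbours around the central branch of ((a,b),(c,d)).

    a, b, c, d may be leaf names or whole subtrees; NNI swaps one subtree
    from each side of the internal branch.
    """
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]


start = (("A", "B"), ("C", "D"))
neighbors = nni_neighbors(start)
```

Each internal branch therefore gives two alternative topologies to evaluate; SPR and TBR reach more distant trees in a single step.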
How is the principle of parsimony applied to phylogenetics?
the “best” phylogeny = the one that requires the fewest changes, or mutations, along the branches
Developed before molecular data was available
- consider all (many) possible rooted phylogenies
- for each column of the alignment, determine how many changes would be required for each of the tree topologies
- add up the number of changes the data require for each of the trees
- the tree requiring the fewest changes is the “most parsimonious” tree = the best tree under the MP criterion
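The column-by-column counting can be sketched with Fitch's small-parsimony algorithm, one common way to get the minimum number of changes on a fixed tree (the notes do not name a specific algorithm). The nested-tuple tree encoding, the function names and the toy alignment are assumptions for illustration.

```python
def fitch_changes(tree, column):
    """Return (state_set, changes) for one alignment column on a binary tree.

    tree   : a leaf name (str) or a pair (left, right) of subtrees
    column : dict mapping leaf name -> character in this column
    """
    if isinstance(tree, str):                 # leaf: state set is the observed base
        return {column[tree]}, 0
    left_set, left_cost = fitch_changes(tree[0], column)
    right_set, right_cost = fitch_changes(tree[1], column)
    common = left_set & right_set
    if common:                                # children can share a state: no new change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Add up the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_changes(tree, column)[1]
    return total


# Hypothetical four-taxon alignment and the three possible topologies
alignment = {"a": "AAG", "b": "AAG", "c": "CAT", "d": "CAT"}
trees = {
    "((a,b),(c,d))": (("a", "b"), ("c", "d")),
    "((a,c),(b,d))": (("a", "c"), ("b", "d")),
    "((a,d),(b,c))": (("a", "d"), ("b", "c")),
}
scores = {name: parsimony_score(t, alignment) for name, t in trees.items()}
most_parsimonious = min(scores, key=scores.get)
```

With this toy data the tree grouping a with b needs 2 changes and the other two need 4 each, so ((a,b),(c,d)) is the most parsimonious tree.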
What are the advantages and disadvantages of maximum parsimony for inferring trees?
advantages
- provides exact mapping of characters along branches
- can be used for non-molecular characters (morphology, RFLPs)
disadvantages
- uses an unrealistic model of substitutions, does not correct for multiple substitutions
- non-probabilistic: it is difficult to evaluate results in a statistical framework
- ignores branch lengths
What are the statistical phylogenetics methods and when did they arise?
ML (Joe Felsenstein)
* 1981: Evolutionary trees from DNA sequences: A maximum likelihood approach
* 1985: Confidence Limits on Phylogenies: An Approach Using the Bootstrap
Bayesian approaches
* 1996: Bayesian approaches for phylogenetic inference
What can maximum likelihood be used for?
1981
* program dnaml, could deal with 15 sequences of length 60 bp
since then
* tested and extended and used to infer phylogenies
* implementations for large data sets are available
* statistical framework
- can also be used to test phylogenetic hypotheses: tree topology, divergence times, models of evolution, rate heterogeneity, etc.
What is likelihood? Give an example with a coin toss
Likelihood: probability of the data, given the model P(D|M)
- Data: “head” in a coin toss
- Model about how the data was generated: fair or loaded coin
the chosen model affects the probability of the data!
* model 1: the coin is fair - (likelihood: 0.5)
* model 2: it’s a two headed coin (likelihood: 1)
* model 3: it’s a two tailed coin (likelihood: 0)
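The coin example can be written out directly. This is a small sketch in which each model is reduced to a single assumed parameter, the probability of heads; the names are illustrative.

```python
def likelihood_of_heads(p_heads, n_heads=1):
    """P(data | model): probability of seeing n_heads heads in a row."""
    return p_heads ** n_heads


# Each model is summarised by its probability of heads
models = {"fair coin": 0.5, "two-headed coin": 1.0, "two-tailed coin": 0.0}
likelihoods = {name: likelihood_of_heads(p) for name, p in models.items()}
best_model = max(likelihoods, key=likelihoods.get)
```

The data stay the same (one head); only the model changes, and with it the likelihood.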
What is the likelihood in the context of phylogenetics?
Likelihood: probability of the data, given the model P(D|M)
- Data: an MSA
- Model about how the data was generated: substitution model (eg JC), tree topology T and branch lengths t.
How do we calculate the likelihood of a phylogeny? Use the example alignment ‘GC’ under K2P, and pretend we have a time machine and know that both bases evolved from A
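= the probability of the alignment GC, given the K2P model and the following tree:
ancestor A, with one branch leading to G (top) and one leading to C (bottom)
top branch (A to G):
L = probability of a transition along a branch of length t
≈ alpha * t (for a short branch)
bottom branch (A to C):
L = probability of a transversion along a branch of length t
≈ beta * t (for a short branch)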
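As a rough sketch, the two branch probabilities can be multiplied to give the likelihood of this one-column alignment, assuming the two branches evolve independently. The values of alpha, beta and t are hypothetical, and rate * t is only a short-branch approximation of the K2P probabilities.

```python
PURINES = {"A", "G"}


def change_prob(ancestor, tip, alpha, beta, t):
    """Short-branch approximation: rate * t for a change from ancestor to tip."""
    if ancestor == tip:
        raise ValueError("this sketch only covers branches with a change")
    is_transition = (ancestor in PURINES) == (tip in PURINES)
    return (alpha if is_transition else beta) * t


alpha, beta, t = 2.0, 1.0, 0.01                      # hypothetical rates and branch length
top = change_prob("A", "G", alpha, beta, t)          # transition: alpha * t
bottom = change_prob("A", "C", alpha, beta, t)       # transversion: beta * t
likelihood_known_ancestor = top * bottom
```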
Likelihood of a phylogeny
what is K2P
- model of sequence evolution (Kimura 2-parameter)
transition:
alpha: purine to purine / pyrimidine to pyrimidine
transversion:
beta: purine to pyrimidine / pyrimidine to purine
- πA, πC, πG, πT are the base frequencies (K2P assumes they are all equal, 0.25 each)
Likelihood of a phylogeny
with alignment GC, K2P …
- what if we don’t know the ancestral position?
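sum four likelihoods, one for each possible ancestral base, each weighted by the frequency of that base
e.g. probability of a mutation along a branch of length t:
L = πA·α(β)·t + πC·α(β)·t + πG·α(β)·t + πT·α(β)·t
α(β) = use α if the change from that ancestral base is a transition, β if it is a transversion
α = transition rate, e.g. purine to purine
β = transversion rate, e.g. purine to pyrimidine
π = base frequencies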
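A sketch of the sum over ancestral bases for the alignment GC with two branches of length t. Beyond the notes, it assumes that the probability of no change on a short branch is 1 - (alpha + 2*beta)*t (one possible transition plus two possible transversions), and it uses hypothetical values for alpha, beta and t.

```python
PURINES = {"A", "G"}
BASES = "ACGT"


def branch_prob(x, y, alpha, beta, t):
    """First-order (short-branch) approximation of P(x -> y) under K2P."""
    if x == y:
        # assumption: what is left after one transition and two transversions
        return 1.0 - (alpha + 2.0 * beta) * t
    is_transition = (x in PURINES) == (y in PURINES)
    return (alpha if is_transition else beta) * t


def site_likelihood(tip1, tip2, alpha, beta, t, pi=0.25):
    """Sum over the four possible ancestral bases, weighted by base frequency."""
    return sum(
        pi * branch_prob(anc, tip1, alpha, beta, t) * branch_prob(anc, tip2, alpha, beta, t)
        for anc in BASES
    )


alpha, beta, t = 2.0, 1.0, 0.01       # hypothetical values
likelihood_unknown_ancestor = site_likelihood("G", "C", alpha, beta, t)
```

With these numbers the ancestors C and G contribute most, because each needs only one transversion, whereas A and T each need a transition and a transversion.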
Likelihood of a phylogeny - steps?
- generate a tree topology and branch lengths
- calculate the likelihood for each site (column) and multiply to get the likelihood for all sites
- modify the branch lengths in order to optimize the likelihood of that topology
- select the next topology using a heuristic, modify branch lengths (repeat)
- choose the topology and branch lengths with the highest likelihood
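The per-site multiplication and the branch-length optimization can be sketched for the smallest possible case: two sequences joined by one branch. For simplicity this uses the JC model rather than K2P, a grid search rather than a real optimizer, and a hypothetical pair of sequences. It does not cover the topology search.

```python
import math


def jc69_prob(x, y, d):
    """JC69 probability of seeing y given x after d expected substitutions per site."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def log_likelihood(seq1, seq2, d):
    """Sum of per-site log-likelihoods, i.e. the log of the product over sites."""
    return sum(math.log(0.25 * jc69_prob(x, y, d)) for x, y in zip(seq1, seq2))


seq1 = "ACGTACGTAC"                                  # hypothetical alignment
seq2 = "ACGTACGTTT"

# Try many branch lengths and keep the one with the highest likelihood
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: log_likelihood(seq1, seq2, d))

# Closed-form JC69 distance for comparison
p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
analytic_d = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Logs are summed instead of multiplying raw probabilities because the product over many sites quickly becomes too small to represent as a floating-point number.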
Probabilistic (statistical) inference approaches
ML vs Bayesian
Maximum likelihood: P(Data|Tree)
* result of ML analysis: one tree with certain branch lengths that has the highest likelihood of having generated the data
* but what about different trees with likelihoods that are just a little lower than the best tree?
only the single most likely tree is returned
Bayesian inference: P(Tree|Data)
* Bayesian inference considers many trees with similarly high likelihoods
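The difference can be illustrated with a toy calculation, assuming three candidate trees with made-up log-likelihoods and a flat prior over trees. Programs such as MrBayes and BEAST sample trees by MCMC rather than enumerating them like this, so this only shows the idea of P(Tree|Data) being proportional to P(Data|Tree) * P(Tree).

```python
import math

# Hypothetical log-likelihoods ln P(Data | Tree) for three topologies
log_likelihoods = {"tree1": -1000.0, "tree2": -1000.5, "tree3": -1010.0}

# ML: report only the single best tree
ml_tree = max(log_likelihoods, key=log_likelihoods.get)

# Bayesian idea: with a flat prior, the posterior is the normalised likelihood
max_ll = max(log_likelihoods.values())               # subtract to avoid underflow
weights = {t: math.exp(ll - max_ll) for t, ll in log_likelihoods.items()}
total = sum(weights.values())
posterior = {t: w / total for t, w in weights.items()}
```

Here ML returns tree1 alone, while the posterior still gives tree2 more than a third of the probability, which is the information ML discards.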