13 | Phylogenetics II Flashcards
What are the types of character-based methods for inferring a phylogeny?
Give some examples
parsimony
statistical methods
- maximum likelihood, eg RAxML, PhyML, IQ-Tree, FastTree
- Bayesian inference, eg MrBayes, BEAST
What are the basic concept and the basic steps of character-based methods of phylogeny inference?
many trees are computed and evaluated
- starting tree (computed using quick and dirty methods)
- improve it by branch swapping / tree rearrangement
- keep best tree(s), according to optimality criterion
In character-based methods, which criteria can be used to decide which tree to keep?
keep best tree(s), according to optimality criterion
- MP: single best tree = tree with the fewest changes
- ML: single best tree = tree with highest likelihood
- Bayesian: keep multiple best trees
In character based methods, how can the starting tree be improved upon?
improve it by branch swapping / tree rearrangement
- NNI (nearest neighbor interchange)
- SPR (subtree pruning and regrafting)
- TBR (tree bisection and reconnection)
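A minimal sketch of the smallest rearrangement, NNI, assuming one internal branch is written as ((a,b),(c,d)), where a, b, c, d are the four subtrees hanging off that branch. The function name and the nested-tuple representation are illustrative choices, not taken from any particular program.

```python
def nni_neighbors(a, b, c, d):
    """Given the four subtrees around one internal branch, ((a,b),(c,d)),
    return the two alternative arrangements reachable by a single NNI."""
    return [((a, c), (b, d)), ((a, d), (b, c))]


# Each internal branch of a tree has exactly two NNI neighbours.
for neighbor in nni_neighbors("A", "B", "C", "D"):
    print(neighbor)
```

SPR and TBR explore larger neighbourhoods of the current tree, at a higher computational cost per step.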
How is the principle of parsimony applied to phylogenetics?
the “best” phylogeny = the one that requires the fewest changes, or mutations, along the branches
Developed before molecular data was available
- consider all (many) possible rooted phylogenies
- for each column of the alignment, determine how many changes would be required for each of the tree topologies
- add up the number of changes required for each of the trees
- the one requiring the fewest changes is the “most parsimonious” tree = the best tree under the MP criterion
What are the advantages and disadvantages of maximum parsimony for inferring trees?
advantages
- provides exact mapping of characters along branches
- can be used for non-molecular characters (morphology, RFLPs)
disadvantages
- uses an unrealistic model of substitutions, does not correct for multiple substitutions
- non-probabilistic: it is difficult to evaluate results in a statistical framework
- ignores branch lengths
What are the statistical phylogenetics methods and when did they arise
ML (Joe Felsenstein)
* 1981: Evolutionary trees from DNA sequences: A maximum likelihood approach
* 1985: Confidence Limits on Phylogenies: An Approach Using the Bootstrap
Bayesian approaches
* 1996 Bayesian approaches for phylogenetic inference
What can maximum likelihood be used for
1981
* program dnaml, could deal with 15 sequences of length 60bp
since then
* tested and extended and used to infer phylogenies
* implementations for large data sets are available
* statistical framework
- can also be used to test phylogenetic hypotheses: tree topology, divergence times, models of evolution, rate heterogeneity, etc
What is likelihood - give an example with a coin toss
Likelihood: probability of the data, given the model P(D|M)
- Data: “head” in a coin toss
- Model about how the data was generated: fair or loaded coin
the chosen model affects the probability of the data!
* model 1: the coin is fair - (likelihood: 0.5)
* model 2: it’s a two headed coin (likelihood: 1)
* model 3: it’s a two tailed coin (likelihood: 0)
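The three coin models can be checked with a few lines of code. This is a small sketch; the function name and the "H"/"T" encoding of tosses are assumptions made for illustration.

```python
def likelihood(tosses, p_head):
    """P(D|M): probability of a sequence of independent tosses,
    e.g. "H" or "HHT", under a coin that shows heads with probability p_head."""
    result = 1.0
    for toss in tosses:
        result *= p_head if toss == "H" else 1.0 - p_head
    return result


models = {"fair coin": 0.5, "two-headed coin": 1.0, "two-tailed coin": 0.0}
for name, p_head in models.items():
    print(name, likelihood("H", p_head))
```

The data ("H") stay the same; only the model changes, and with it the likelihood.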
What is the likelihood in the context of phylogenetics?
Likelihood: probability of the data, given the model P(D|M)
- Data: an MSA
- Model about how the data was generated: substitution model (eg JC), tree topology T and branch lengths t.
How do we calculate the likelihood of a phylogeny? with example of alignment ‘GC’ and with K2P and pretend we have a time machine and know they evolved from A
= the probability of the alignment GC, given K2P model and following tree
G A C
top branch:
L = probability of a transition along a branch of length t
= alpha * t
bottom branch:
L = probability of a transversion along a branch of length t
= beta * t
Likelihood of a phylogeny
what is K2P
- model of sequence evolution (Kimura 2P)
transition:
alpha: purine to purine / pyrimidine to pyrimidine
transversion
beta: purine to pyrimidine / pyrimidine to purine
- πA, πC, πG, πT are the base frequencies
Likelihood of a phylogeny
with alignment GC, K2P …
- what if we don’t know the ancestral position?
four different likelihoods with each base as ancestral position
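weight each by the frequency of its ancestral base and add them up:
L = πA·L(A) + πC·L(C) + πG·L(G) + πT·L(T)
L(X) = likelihood of the tree with X as the ancestral base, built from α·t (transition) or β·t (transversion) for each branch of length t
α = transition rate, eg purine to purine
β = transversion rate, eg purine to pyrimidine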
π = base frequencies
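A sketch of the GC example with an unknown ancestor. It makes three assumptions that are not in the cards: both branches have the same length t, all base frequencies are 0.25 (as K2P assumes), and the closed-form K2P substitution probabilities are used instead of the short-branch approximations α·t and β·t (for small t the two agree). The values of alpha, beta and t are made up.

```python
import math

PURINES = {"A", "G"}


def k2p_prob(x, y, t, alpha, beta):
    """Probability that base x has become base y after a branch of length t
    under K2P (alpha: transition rate, beta: rate of each transversion)."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    if x == y:
        return 0.25 + 0.25 * e1 + 0.5 * e2   # no net change
    if (x in PURINES) == (y in PURINES):
        return 0.25 + 0.25 * e1 - 0.5 * e2   # transition
    return 0.25 - 0.25 * e1                  # transversion


def site_likelihood(tip1, tip2, t, alpha, beta):
    """Sum over the four possible ancestral bases, each weighted by its
    base frequency (0.25 for every base under K2P)."""
    return sum(
        0.25 * k2p_prob(x, tip1, t, alpha, beta) * k2p_prob(x, tip2, t, alpha, beta)
        for x in "ACGT"
    )


# Hypothetical parameter values, chosen only for illustration.
print(site_likelihood("G", "C", t=0.1, alpha=2.0, beta=1.0))
```

With a known ancestor A, the likelihood would be the single term k2p_prob("A", "G", ...) * k2p_prob("A", "C", ...): one transition and one transversion.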
Likelihood of a phylogeny - steps?
- generate a tree topology and branch lengths
- calculate the likelihood for each site (column) and multiply to get the likelihood for all sites
- modify the branch lengths in order to optimize the likelihood of that topology
- select the next topology using a heuristic, modify branch lengths (repeat)
- choose the topology and branch lengths with the highest likelihood
Probabilistic (statistical) inference approaches
ML vs Bayesian
Maximum likelihood: P(Data|Tree)
* result of ML analysis: one tree with certain branch lengths that has the highest likelihood of having generated the data
* but what about different trees with likelihoods that are just a little lower than the best tree?
only the single most likely tree is returned
Bayesian inference: P(Tree|Data)
* Bayesian inference considers many trees with similar high likelihoods
Bayesian inference
basic concept, phylogenetic example
- update beliefs about random events in light of evidence about those events
phylogenetic example
* data: the alignment
* use a “best guess” (the prior) regarding the evolution of these sequences (tree topology, branch lengths, substitution model, etc)
* use Bayes’ Rule to update the prior probability and obtain a posterior probability of a model & parameters
Bayesian inference
What is posterior probability? equation?
P(𝛉|Data)
f(𝛉|D) = 1/z * f(𝛉)f(D|𝛉)
- 𝛉: unknown model & parameter
- D: data
- 1/z: normalizing constant (ensures that f(𝛉|D) integrates to 1 and is a proper statistical distribution)
- f(𝛉): prior probability
- f(D|𝛉): probability of the data, given the model & parameter
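The equation can be tried out on the coin-toss models, where 𝛉 takes only three values and z is a simple sum. A sketch, assuming equal prior probabilities of 1/3 for the three coins (the prior values are an illustrative choice, not part of the cards).

```python
def posterior(priors, likelihoods):
    """f(theta|D) = 1/z * f(theta) * f(D|theta) for a discrete set of models."""
    unnormalized = {m: priors[m] * likelihoods[m] for m in priors}
    z = sum(unnormalized.values())  # normalizing constant
    return {m: value / z for m, value in unnormalized.items()}


# Assumed priors: all three coins equally plausible before seeing data.
priors = {"fair": 1 / 3, "two-headed": 1 / 3, "two-tailed": 1 / 3}
# Likelihood of observing one "head" under each model.
likelihoods = {"fair": 0.5, "two-headed": 1.0, "two-tailed": 0.0}

print(posterior(priors, likelihoods))
```

After seeing one head, the two-tailed coin is ruled out and the two-headed coin is twice as probable as the fair one.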
Bayesian inference
How does posterior probability for phylogeny inference work? Is it feasible?
- computing the posterior probability exactly is computationally too expensive
- approximate by sampling trees from the posterior probability distribution (MCMC)
- visits trees with a high probability more often
- periodically sample from this sequence of trees
- generate 10,000,000 iterations (trees), save every 1000th tree = 10,000 trees!
- can take weeks (or months)
- result is (a consensus of) the sampled trees
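A toy sketch of the sampling idea, assuming only three candidate topologies with made-up unnormalized posterior weights (6 : 3 : 1). A real analysis also proposes new branch lengths and model parameters; here a simple Metropolis step just jumps between topologies, and every `thin`-th state is saved.

```python
import random
from collections import Counter


def metropolis_sample(weights, iterations, thin, seed=0):
    """Sample states in proportion to their (unnormalized, positive) weights."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    samples = []
    for i in range(1, iterations + 1):
        # Symmetric proposal: pick one of the other topologies at random.
        proposal = rng.choice([s for s in states if s != current])
        # Accept with probability min(1, weight ratio).
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        if i % thin == 0:
            samples.append(current)
    return samples


# Hypothetical weights, for illustration only.
weights = {"((A,B),C)": 6.0, "((A,C),B)": 3.0, "((B,C),A)": 1.0}
samples = metropolis_sample(weights, iterations=100_000, thin=10)
print(Counter(samples))
```

The normalizing constant is never needed: only ratios of weights enter the acceptance step, which is what makes the approach feasible.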
Bayesian inference
pros and cons
cons
* philosophy is still somewhat controversial
* we must specify a prior, which can be tricky
* models / number of parameters need to be selected carefully! (results / running time)
* non-convergence possible
pros
* natural interpretation of probability: P(M|D)
* good for estimation of divergence times
* support for branches is inherent to the analysis
Bayesian confidence values?
proportion of sampled trees in which these groups were observed
– “support values” – generally anything over 0.7 is considered good
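The support value of a group is a simple proportion. A sketch, assuming each sampled tree is stored as a set of its clades (each clade a frozenset of tip names); this representation and the four example trees are made up for illustration.

```python
def clade_support(sampled_trees, clade):
    """Proportion of sampled trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(1 for tree in sampled_trees if clade in tree)
    return hits / len(sampled_trees)


# Four hypothetical sampled trees on the tips A, B, C, D.
sampled = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AC"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
]
print(clade_support(sampled, {"A", "B"}))
```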
How can we assess confidence in clades of NJ, MP, ML trees?
we need support values for branches on NJ, MP, ML trees
- resample the alignment with replacement
- the original data are the columns of the multiple alignment
- sample X number of new alignments (“pseudo-sample”, “bootstrap data set”), of the same length
- compute a phylogeny for each pseudo-sample
- count how many times a bipartition (group) appears
- label nodes from the original (best) tree with bootstrap proportions
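The resampling step can be sketched as follows, assuming the alignment is stored as a dict of name -> aligned sequence (the function name and the small example alignment are illustrative). Tree building for each pseudo-sample is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the alignment has, with replacement,
    and rebuild every sequence from the drawn columns."""
    n_cols = len(next(iter(alignment.values())))
    drawn = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in drawn) for name, seq in alignment.items()}


msa = {
    "seqA": "GTCGACTCC",
    "seqB": "GTAGACTAC",
    "seqC": "GCGGCCATC",
    "seqD": "GCTGCCAGC",
}
pseudo_sample = bootstrap_alignment(msa, random.Random(1))
for name, seq in pseudo_sample.items():
    print(name, seq)
```

Because columns are drawn with replacement, some original columns appear several times in a pseudo-sample and others not at all.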
What is bootstrapping
- bootstrapping measures confidence for single nodes (not the entire tree!)
- it is a measure of repeatability (not of accuracy!)
- usually: ≥ 70% = “support”
- problematic for huge data sets: bootstrap values decrease with increasing no. of characters & taxa, and/or in the presence of unstable taxa
- use modified bootstrap (https://doi.org/10.1038/s41586-018-0043-0) or no bootstrap values
How can we evaluate a dataset (MSA)?
MSA
- columns & distinct patterns per sample
- % invariant sites
- entropy, treeness, etc
phylogeny
- branch length statistics
other
- % unique topologies among MP trees
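Some of the MSA statistics are easy to compute by hand or in code. A sketch, assuming the alignment is a list of equal-length strings and that "entropy" means the Shannon entropy (in bits) of a single column; the function names and the small example alignment are illustrative.

```python
import math
from collections import Counter


def msa_columns(sequences):
    """Turn a list of aligned sequences into a list of column strings."""
    return ["".join(col) for col in zip(*sequences)]


def percent_invariant(sequences):
    """Percentage of columns in which all sequences share the same character."""
    cols = msa_columns(sequences)
    invariant = sum(1 for col in cols if len(set(col)) == 1)
    return 100.0 * invariant / len(cols)


def distinct_patterns(sequences):
    """Number of different column patterns in the alignment."""
    return len(set(msa_columns(sequences)))


def column_entropy(column):
    """Shannon entropy (bits) of the characters in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


seqs = ["GTCGACTCC", "GTAGACTAC", "GCGGCCATC", "GCTGCCAGC"]
print(percent_invariant(seqs), distinct_patterns(seqs))
```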
Evaluate data set (MSA)
- methods will always find at least one tree; even random data will have some ‘best’ tree
- test for phylogenetic signal in the data
- no signal (e.g., slow evolution)
- non-phylogenetic signal, systematic bias (e.g., variation across lineages in evolutionary rates, nt composition, etc. )
- can lead to highly supported but wrong tree!
- example: mutational saturation: similarity between sequences = similarity in nucleotide frequencies: phylogenetic signal is lost
What is meant by comparing trees?
- same alignment, different methods
- same alignment, posterior trees of Bayesian analyses
- same genomes/samples, different loci
- …
summarize: consensus methods
compare:
* visual methods (e.g., overlay topologies)
* quantitative methods
Comparing tree topologies?
2 types of distances? describe briefly
most widely used? problems?
alternative approaches?
Robinson Foulds distance
* number of partitions that are in one tree but not the other
Quartet distance
* number of quartets that are different between two trees
RF: most widely used, but has some problems
* statistical interpretation of distances?
* low resolution; small changes ➜ large distance
* …
alternative approaches?
* example: focus on evolutionary relationships in rooted trees (Kendall & Colijn, MBE 2016)
- compare MRCA of each pair of tips
- use just topology or also branch lengths
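The Robinson Foulds distance described above can be sketched as a set operation. This assumes each unrooted tree is given by its non-trivial bipartitions, each written as one side of the split; the helper names and the five-taxon example are illustrative.

```python
def canonical_splits(clades, taxa):
    """Represent every bipartition by the side that lacks the reference taxon,
    so that the same split always gets the same representation."""
    taxa = frozenset(taxa)
    reference = min(taxa)
    splits = set()
    for clade in clades:
        side = frozenset(clade)
        if reference in side:
            side = taxa - side
        splits.add(side)
    return splits


def rf_distance(clades1, clades2, taxa):
    """Number of bipartitions found in one tree but not in the other."""
    return len(canonical_splits(clades1, taxa) ^ canonical_splits(clades2, taxa))


taxa = "ABCDE"
tree1 = [{"A", "B"}, {"D", "E"}]  # ((A,B),C,(D,E))
tree2 = [{"A", "C"}, {"D", "E"}]  # ((A,C),B,(D,E))
print(rf_distance(tree1, tree2, taxa))
```

Swapping just B and C already changes the distance by 2, which illustrates the low resolution mentioned above: a small change removes one split and adds another.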
describing trees
some terms
* ancestor, ancestral lineage, basal group,
(monophyletic) clade, diversification, homolog,
ingroup, lineage, outgroup, sister group, node,
species tree, gene tree, paralog, ortholog, root,
duplication event, speciation event, …
?
EXAM QUESTION
describe process of NJ and ML and name and describe one method to test validity of found clades (2022)
NJ
- distance-based
- Like a cookbook recipe
- very fast greedy heuristic
- use MSAs to make distance matrix
- compute only a single tree
ML
- character-based
- calculate likelihood of a phylogeny = probability of a dataset (MSA) given a model (substitution matrix and a tree)
- evaluate many different trees and pick the optimal one
test validity: bootstrapping
- create pseudo-samples of the same length as the original MSA by resampling its columns with replacement
- compute a phylogeny for each pseudo-sample
- count how many times a bipartition (group) appears
- label nodes from the original (best) tree with bootstrap proportions
EXAM QUESTION
For which of the tree building methods we saw do multiple substitutions pose a larger difficulty? Explain your answer. (2019)
Which phylogeny reconstruction method does not correct for multiple substitutions and how does it present a problem (2020)
Maximum Parsimony (MP)
MP disadvantages
- uses an unrealistic model of substitutions
- does not correct for multiple substitutions
divergent sequences = more multiple substitutions -> MP doesn’t work here –> doesn’t make sense with today’s data
lit:
- inherent assumption of slow rate of evolution (so slow that multiple hits are negligible).
- molecular phylogenetics moving towards resolving deep phylogenies (typically involve sequences with multiple subs at the same site)
- –> MP has gradually faded away, like an old soldier.
Explain MP with an example
We have this MSA:
123456789
seqA GTCGACTCC
seqB GTAGACTAC
seqC GCGGCCATC
seqD GCTGCCAGC
the possible trees are:
((A,B),C),D; ((A,C),B),D; ((B,C),A),D;
we go column by column, checking how many mutations would be needed for each different possible tree of Nts
- 1st column: skip cos all the same
- 2nd column: ((T,T),C),C; = 1 mut / ((T,C),T),C; = 2 mut / ((T,C),T),C; = 2 mut
- etc
add up the total mutations for each topology and choose the one with the lowest value
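The column-by-column counting can be automated. The sketch below uses Fitch's small-parsimony algorithm, which is one standard way to count the minimum number of changes on a fixed tree; the cards do not name an algorithm, so this choice, the nested-tuple tree format and the function names are assumptions.

```python
def fitch(tree, states):
    """Return (possible ancestral bases, number of changes) for one column.
    tree: a tip name or a pair (left, right); states: tip name -> base."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared base: one change is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Add up the changes over all columns of the alignment."""
    n_cols = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_cols)
    )


msa = {"A": "GTCGACTCC", "B": "GTAGACTAC", "C": "GCGGCCATC", "D": "GCTGCCAGC"}
trees = {
    "((A,B),C),D": ((("A", "B"), "C"), "D"),
    "((A,C),B),D": ((("A", "C"), "B"), "D"),
    "((B,C),A),D": ((("B", "C"), "A"), "D"),
}
for name, tree in trees.items():
    print(name, parsimony_score(tree, msa))
# ((A,B),C),D needs 9 changes, the other two need 12 each
```

Columns 2, 5 and 7 favour ((A,B),C),D (1 change instead of 2), while columns 3 and 8 cost 3 changes on every tree and columns 1, 4, 6 and 9 cost none.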