13 | Phylogenetics II Flashcards by Stevie Davies

What are the types of character-based methods for inferring a phylogeny?

Give some examples

parsimony

statistical methods
- maximum likelihood, eg RAxML, PhyML, IQ-Tree, FastTree
- Bayesian inference, eg MrBayes, BEAST

What is the basic concept and what are the basic steps of character-based methods of phylogeny inference?

many trees are computed and evaluated

starting tree (computed using quick and dirty methods)
improve it by branch swapping / tree rearrangement
keep best tree(s), according to optimality criterion

In character-based methods, which criteria can be used to decide which tree to keep?

keep best tree(s), according to optimality criterion
- MP: single best tree = tree with the fewest changes
- ML: single best tree = tree with highest likelihood
- Bayesian: keep multiple best trees

In character-based methods, how can the starting tree be improved upon?

improve it by branch swapping / tree rearrangement
- NNI (nearest neighbor interchange)
- SPR (subtree pruning and regrafting)
- TBR (tree bisection and reconnection)

How is the principle of parsimony applied to phylogenetics?

the “best” phylogeny = the one that requires the fewest changes, or mutations, along the branches

Developed before molecular data was available

consider all (many) possible rooted phylogenies
for each column of the alignment, determine how many changes would be required for each of the tree topologies
add up the number of changes the data require for each of the trees
the one requiring the fewest changes: the “most parsimonious” tree = the best tree under the MP criterion

What are the advantages and disadvantages of maximum parsimony for inferring trees?

advantages
- provides exact mapping of characters along branches
- can be used for non-molecular characters (morphology, RFLPs)

disadvantages
- uses an unrealistic model of substitutions, does not correct for multiple substitutions
- non-probabilistic: it is difficult to evaluate results in a statistical framework
- ignores branch lengths

What are the statistical phylogenetics methods and when did they arise?

ML (Joe Felsenstein)
* 1981: Evolutionary trees from DNA sequences: A maximum likelihood approach
* 1985: Confidence Limits on Phylogenies: An Approach Using the Bootstrap

Bayesian approaches
* 1996 Bayesian approaches for phylogenetic inference

What can maximum likelihood be used for?

1981
* program dnaml: could deal with 15 sequences of length 60 bp

since then
* tested and extended and used to infer phylogenies
* implementations for large data sets are available
* statistical framework
- can also be used to test phylogenetic hypotheses: tree topology, divergence times, models of evolution, rate heterogeneity, etc

What is likelihood - give an example with a coin toss

Likelihood: probability of the data, given the model P(D|M)
- Data: “head” in a coin toss
- Model about how the data was generated: fair or loaded coin

the chosen model affects the probability of the data!
* model 1: the coin is fair - (likelihood: 0.5)
* model 2: it’s a two headed coin (likelihood: 1)
* model 3: it’s a two tailed coin (likelihood: 0)
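
The three coin models can be written out as a minimal Python sketch. The names `p_heads` and `likelihood` are illustrative and not part of the flashcards; the data is a single observed head, as in the card.

```python
# Probability of "head" under each model of how the coin works.
p_heads = {
    "fair coin": 0.5,
    "two-headed coin": 1.0,
    "two-tailed coin": 0.0,
}

def likelihood(data, p_head):
    """P(data | model) for a string of independent tosses ('H' or 'T')."""
    result = 1.0
    for toss in data:
        result *= p_head if toss == "H" else 1.0 - p_head
    return result

# Same data ("H"), three different models, three different likelihoods.
likelihoods = {model: likelihood("H", p) for model, p in p_heads.items()}
for model, value in likelihoods.items():
    print(f"{model}: {value}")
```

The same function also handles longer toss sequences, since independent tosses simply multiply.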

What is the likelihood in the context of phylogenetics?

Likelihood: probability of the data, given the model P(D|M)
- Data: an MSA
- Model about how the data was generated: substitution model (eg JC), tree topology T and branch lengths t.

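How do we calculate the likelihood of a phylogeny? Example: the alignment ‘GC’ under K2P, pretending we have a time machine and know both sequences evolved from A

= the probability of the alignment GC, given the K2P model and the following tree

A (ancestor)
- top branch, length t: A -> G
- bottom branch, length t: A -> C

top branch:
L = probability of a transition (A -> G) along a branch of length t
≈ alpha * t (for small t)

bottom branch:
L = probability of a transversion (A -> C) along a branch of length t
≈ beta * t (for small t)

likelihood of the tree = product of the two branch probabilities ≈ (alpha * t) * (beta * t)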

Likelihood of a phylogeny

what is K2P

model of sequence evolution (Kimura two-parameter)

transition:
alpha: purine to purine / pyrimidine to pyrimidine

transversion:
beta: purine to pyrimidine / pyrimidine to purine

πA, πC, πG, πT are the base frequencies (all equal, 0.25 each, under K2P)

Likelihood of a phylogeny
with alignment GC, K2P …

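what if we don’t know the ancestral base?

four different likelihoods, one for each base as the ancestral state, each weighted by that base's frequency and then summed

eg with P_X = likelihood of the alignment when X is the ancestral base (each change along a branch of length t contributing αt for a transition or βt for a transversion):
L = π_A P_A + π_C P_C + π_G P_G + π_T P_T

α = transition rate, eg pur to pur
β = transversion rate, eg pur to pyr
π = base frequencies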
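
A minimal Python sketch of this calculation for the ‘GC’ alignment. It is an illustration under stated assumptions: it uses the standard closed-form K2P substitution probabilities (which reduce to roughly αt and βt for short branches), takes α as the transition rate and β as the rate of each of the two possible transversions, and uses made-up parameter values. The function names are hypothetical.

```python
import math

PURINES = {"A", "G"}

def k2p_prob(x, y, alpha, beta, t):
    """P(base x is replaced by base y along a branch of length t) under K2P."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    if x == y:                                  # no net change
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (x in PURINES) == (y in PURINES):        # transition
        return 0.25 + 0.25 * e1 - 0.5 * e2
    return 0.25 - 0.25 * e1                     # transversion

def likelihood_known_ancestor(anc, alpha, beta, t):
    """Likelihood of the alignment 'GC': one branch ends in G, the other in C."""
    return k2p_prob(anc, "G", alpha, beta, t) * k2p_prob(anc, "C", alpha, beta, t)

def likelihood_unknown_ancestor(alpha, beta, t):
    """Sum over the four possible ancestral bases, each with frequency 0.25."""
    return sum(0.25 * likelihood_known_ancestor(x, alpha, beta, t) for x in "ACGT")

# Assumed example values, not from the flashcards.
alpha, beta, t = 2.0, 1.0, 0.05
print(likelihood_known_ancestor("A", alpha, beta, t))
print(likelihood_unknown_ancestor(alpha, beta, t))
```

With a known ancestor A the result is the product of one transition and one transversion probability; with an unknown ancestor the four ancestor-specific likelihoods are averaged with weights 0.25.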

Likelihood of a phylogeny - steps?

generate a tree topology and branch lengths
calculate the likelihood for each site (column) and multiply to get the likelihood for all sites
modify the branch lengths in order to optimize the likelihood of that topology
select the next topology using a heuristic, modify branch lengths (repeat)
choose the topology and branch lengths with the highest likelihood
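
The "multiply over sites, then optimize branch lengths" steps can be sketched for the simplest possible case: two sequences joined by a single branch of length d, under the JC69 model. This is a toy illustration, not how RAxML or IQ-Tree work internally; the sequences are made up, a plain grid search stands in for a real optimizer, and there is no topology search because two sequences allow only one tree. Site likelihoods are multiplied by summing their logarithms, which avoids numerical underflow.

```python
import math

def site_likelihood(a, b, d):
    """JC69 likelihood of one alignment column (bases a and b) at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e      # P(base unchanged after distance d)
    p_diff = 0.25 - 0.25 * e      # P(change to one specific other base)
    return 0.25 * (p_same if a == b else p_diff)   # 0.25 = base frequency

def log_likelihood(seq1, seq2, d):
    """Product over sites, computed as a sum of log site likelihoods."""
    return sum(math.log(site_likelihood(a, b, d)) for a, b in zip(seq1, seq2))

seq1 = "ACGTACGTAC"   # made-up example sequences
seq2 = "ACGTTCGTAA"

# Try many branch lengths and keep the one with the highest likelihood.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: log_likelihood(seq1, seq2, d))

# For two sequences, the JC69 optimum is also known in closed form.
p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
analytic_d = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(best_d, analytic_d)
```

For real data sets the same idea is applied to every branch of every candidate topology, which is why heuristics are needed.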

Probabilistic (statistical) inference approaches

ML vs Bayesian

Maximum likelihood: P(Data|Tree)
* result of ML analysis: one tree with certain branch lengths that has the highest likelihood of having generated the data
* but what about different trees with likelihoods that are just a little lower than the best tree?
only the single most likely tree is returned

Bayesian inference: P(Tree|Data)
* Bayesian inference considers many trees with similar high likelihoods

Bayesian inference
basic concept, phylogenetic example

update beliefs about random events in light of evidence about those events

phylogenetic example
* data: the alignment
* use a “best guess” regarding the evolution of these sequences (tree topology, branch lengths, substitution model, etc)
* use Bayes’ Rule to update the prior probability and obtain a posterior probability of a model & parameters

Bayesian inference

What is posterior probability? equation?

P(𝛉|Data)

f(𝛉|D) = 1/z * f(𝛉)f(D|𝛉)

𝛉: unknown model & parameter
D: data
1/z: normalizing constant (ensures that f(𝛉|D) integrates to 1 and is a proper statistical distribution)
f(𝛉): prior probability
f(D|𝛉): probability of the data, given the model & parameter
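
As a small worked illustration of the formula, the sketch below applies it to a discrete toy problem: three coin models (fair, two-headed, two-tailed) and one observed head. The uniform prior and the names `posterior`, `prior` and `likelihood` are assumptions made for the example; with a finite set of models, z is just the sum of prior times likelihood.

```python
def posterior(prior, likelihood):
    """f(theta | D) = (1 / z) * f(theta) * f(D | theta) over a finite set of models."""
    unnormalized = {m: prior[m] * likelihood[m] for m in prior}
    z = sum(unnormalized.values())              # normalizing constant
    return {m: value / z for m, value in unnormalized.items()}

# Assumed uniform prior over the three models.
prior = {"fair": 1 / 3, "two-headed": 1 / 3, "two-tailed": 1 / 3}
# Likelihood of the data (one head) under each model.
likelihood = {"fair": 0.5, "two-headed": 1.0, "two-tailed": 0.0}

post = posterior(prior, likelihood)
for model, value in post.items():
    print(f"{model}: {value:.3f}")
```

After seeing one head, the two-tailed model is ruled out and the two-headed model becomes twice as probable as the fair one.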

Bayesian inference

How does posterior probability for phylogeny inference work? Is it feasible?

computationally too expensive
approximate by sampling trees from the posterior probability distribution (MCMC)
- visits trees with a high probability more often
- periodically sample from this sequence of trees
  - generate 10,000,000 iterations (trees), save every 1000th tree = 10,000 trees!
can take weeks (or months)
result is (a consensus of) the sampled trees
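
The sampling idea can be sketched with a toy Metropolis sampler in Python. This is not what MrBayes or BEAST do internally: the "trees" are three labels with made-up unnormalized posterior weights, and the proposal simply picks one of the three at random. It only illustrates that high-probability states are visited more often and that the chain is thinned by saving every n-th state.

```python
import random

def metropolis(weights, n_iter, thin, seed=0):
    """Sample states in proportion to their (unnormalized) weights."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    samples = []
    for i in range(1, n_iter + 1):
        proposal = rng.choice(states)                 # symmetric proposal
        ratio = weights[proposal] / weights[current]
        if rng.random() < ratio:                      # always accepted if ratio >= 1
            current = proposal
        if i % thin == 0:                             # save every thin-th state
            samples.append(current)
    return samples

# Hypothetical unnormalized posterior weights for three candidate trees.
weights = {"tree1": 7.0, "tree2": 2.0, "tree3": 1.0}
samples = metropolis(weights, n_iter=100_000, thin=10)
freqs = {tree: samples.count(tree) / len(samples) for tree in weights}
print(freqs)
```

The sampler never needs the normalizing constant, only ratios of weights, which is what makes the approach feasible when the full posterior cannot be computed.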

Bayesian inference

pros and cons

cons
* philosophy is still somewhat controversial
* we must specify a prior, which can be tricky
* models / number of parameters need to be selected carefully! (results / running time)
* non-convergence possible

pros
* natural interpretation of probability: P(M|D)
* good for estimation of divergence times
* support for branches is inherent to the analysis

Bayesian confidence values?

proportion of sampled trees in which these groups were observed

– “support values” – generally anything over 0.7 is considered good
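
Counting these proportions is straightforward once each sampled tree is reduced to its set of groups. In the sketch below each tree is represented as a set of clades (frozensets of taxon names); this representation and the four sampled trees are made up for illustration.

```python
from collections import Counter

def clade_support(sampled_trees):
    """Proportion of sampled trees that contain each clade."""
    counts = Counter(clade for tree in sampled_trees for clade in tree)
    return {clade: n / len(sampled_trees) for clade, n in counts.items()}

AB, CD, AC, ABC = (frozenset(s) for s in ("AB", "CD", "AC", "ABC"))

# Four hypothetical trees sampled from the posterior, as sets of clades.
sampled_trees = [
    {AB, CD},     # ((A,B),(C,D))
    {AB, CD},     # ((A,B),(C,D))
    {AB, ABC},    # (((A,B),C),D)
    {AC, ABC},    # (((A,C),B),D)
]

support = clade_support(sampled_trees)
for clade, value in support.items():
    print(sorted(clade), value)
```

Here the clade (A,B) appears in three of the four sampled trees, so its support value is 0.75.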

How can we assess confidence in clades of NJ, MP, ML trees?

we need support values for branches on NJ, MP, ML trees

resample the alignment with replacement
- the original data are the columns of the multiple alignment
- sample X number of new alignments (“pseudo-sample”, “bootstrap data set”), of the same length
compute a phylogeny for each pseudo-sample
count how many times a bipartition (group) appears
label nodes from the original (best) tree with bootstrap proportions
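
The resampling step can be sketched in Python as follows. The toy alignment, the function name `bootstrap_alignment` and the choice of 100 pseudo-samples are assumptions for the example; tree building for each pseudo-sample is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Draw columns with replacement to build a pseudo-sample of the same length."""
    n_cols = len(next(iter(alignment.values())))
    cols = rng.choices(range(n_cols), k=n_cols)      # sampling with replacement
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Toy alignment: the same columns are drawn for every sequence.
alignment = {
    "seqA": "GTCGACTCC",
    "seqB": "GTAGACTAC",
    "seqC": "GCGGCCATC",
    "seqD": "GCTGCCAGC",
}

rng = random.Random(42)
pseudo_samples = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(pseudo_samples[0])
```

Because columns are drawn with replacement, some original columns appear several times in a pseudo-sample and others not at all; the columns themselves are never altered.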

What is bootstrapping

bootstrapping measures confidence for single nodes (not the entire tree!)
it is a measure of repeatability (not of accuracy!)
usually: 70% or more = “support”
problematic for huge data sets: bootstrap values decrease
- with increasing no. of characters & taxa, and/or in the presence of unstable taxa
- use a modified bootstrap (https://doi.org/10.1038/s41586-018-0043-0) or no bootstrap values

How can we evaluate a dataset (MSA)?

MSA
- columns & distinct patterns per sample
- % invariant sites
- entropy, treeness, etc

phylogeny
- branch length statistics

other
- % unique topologies among MP trees
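
The MSA-level statistics listed above (number of columns, distinct column patterns, % invariant sites) are simple to compute. A minimal Python sketch, with a made-up toy alignment and an illustrative function name:

```python
def msa_summary(seqs):
    """Basic descriptive statistics for a list of aligned sequences."""
    columns = list(zip(*seqs))                           # one tuple per column
    invariant = sum(len(set(col)) == 1 for col in columns)
    return {
        "columns": len(columns),
        "distinct_patterns": len(set(columns)),
        "pct_invariant": 100.0 * invariant / len(columns),
    }

seqs = ["GTCGACTCC", "GTAGACTAC", "GCGGCCATC", "GCTGCCAGC"]
summary = msa_summary(seqs)
print(summary)
```

For this toy alignment, 4 of the 9 columns are invariant and 7 distinct column patterns occur.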

Evaluate data set (MSA)
- methods will always find at least one tree; even random data will have some ‘best’ tree
- test for phylogenetic signal in the data
  - no signal (e.g., slow evolution)
  - non-phylogenetic signal, systematic bias (e.g., variation across lineages in evolutionary rates, nucleotide composition, etc.)
    - can lead to highly supported but wrong tree!
    - example: mutational saturation: similarity between sequences = similarity in nucleotide frequencies: phylogenetic signal is lost

What is meant by comparing trees?

same alignment, different methods
same alignment, posterior trees of Bayesian analyses
same genomes/samples, different loci
…

summarize: consensus methods
compare:
* visual methods (e.g., overlay topologies)
* quantitative methods

Comparing tree topologies? 2 types of distances? describe briefly most widely used? problems? alternative approaches?

Robinson-Foulds (RF) distance
* number of partitions that are in one tree but not the other

Quartet distance
* number of quartets that are different between two trees

RF: most widely used, but has some problems
* statistical interpretation of distances?
* low resolution; small changes ➜ large distance
* …

alternative approaches?
* example: focus on evolutionary relationships in rooted trees (Kendall & Colijn, MBE 2016)
  - compare MRCA of each pair of tips
  - use just topology or also branch lengths
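
A minimal Python sketch of the Robinson-Foulds distance, assuming trees are given as nested tuples of taxon names (a representation chosen for this example). Each internal branch is turned into a bipartition of the taxa, trivial bipartitions that separate a single taxon are ignored, and the distance is the number of bipartitions found in exactly one of the two trees.

```python
def leaf_sets(tree, out):
    """Return the leaves below `tree`; record the leaf set of every internal node."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(leaves)
    return leaves

def bipartitions(tree):
    """Non-trivial bipartitions (splits) induced by the internal branches."""
    clades = set()
    all_leaves = leaf_sets(tree, clades)
    splits = set()
    for clade in clades:
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(frozenset([clade, other]))
    return splits

def rf_distance(tree1, tree2):
    """Number of bipartitions present in one tree but not in the other."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))
```

The two example trees share the split ABC|DE and differ in one split each (AB|CDE versus AC|BDE), so their distance is 2.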

Describing trees: some terms
* ancestor, ancestral lineage, basal group, (monophyletic) clade, diversification, homolog, ingroup, lineage, outgroup, sister group, node, species tree, gene tree, paralog, ortholog, root, duplication event, speciation event, ... ?

EXAM QUESTION describe process of NJ and ML and name and describe one method to test validity of found clades (2022)

NJ
- distance-based
- like a cookbook recipe
- very fast greedy heuristic
- uses the MSA to make a distance matrix
- computes only a single tree

ML
- character-based
- calculate the likelihood of a phylogeny = probability of a dataset (MSA) given a model (substitution model and a tree)
- evaluate many different trees and pick the optimal one

test validity: bootstrapping
- create pseudo-samples of the same length as the original MSA by resampling its columns with replacement
- compute a phylogeny for each pseudo-sample
- count how many times a bipartition (group) appears
- label nodes from the original (best) tree with bootstrap proportions

EXAM QUESTION For which of the tree building methods we saw do multiple substitutions pose a larger difficulty? Explain your answer. (2019)
Which phylogeny reconstruction method does not correct for multiple substitutions, and how does that present a problem? (2020)

Maximum Parsimony (MP)

MP disadvantages
- uses an unrealistic model of substitutions
- does not correct for multiple substitutions

divergent sequences = more multiple substitutions -> MP doesn't work well here -> doesn't make sense with today's data

lit:
- inherent assumption of a slow rate of evolution (so slow that multiple hits are negligible)
- molecular phylogenetics is moving towards resolving deep phylogenies (which typically involve sequences with multiple substitutions at the same site)
- -> MP has gradually faded away, like an old soldier

Explain MP with an example

We have this MSA:

     123456789
seqA GTCGACTCC
seqB GTAGACTAC
seqC GCGGCCATC
seqD GCTGCCAGC

the possible trees are:
((A,B),C),D;
((A,C),B),D;
((B,C),A),D;

we go column by column, checking how many mutations each possible tree would need to explain the nucleotides
- 1st column: skip because all are the same
- 2nd column: ((T,T),C),C; = 1 mut / ((T,C),T),C; = 2 mut / ((T,C),T),C; = 2 mut
- etc

add up the total mutations for each topology and choose the one with the lowest value
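
The column-by-column counting can be automated. The Python sketch below uses Fitch's algorithm, a standard way to count the minimum number of changes on a given tree (the flashcards do not name the counting method, so this choice is an assumption). Trees are written as nested tuples of sequence names, and all function names are illustrative.

```python
def fitch(tree, column):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):                       # leaf: look up its base
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, column)
    right_states, right_changes = fitch(right, column)
    shared = left_states & right_states
    if shared:                                      # children agree: no new change
        return shared, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_score(tree, msa):
    """Sum the minimum number of changes over all alignment columns."""
    n_cols = len(next(iter(msa.values())))
    total = 0
    for i in range(n_cols):
        column = {name: seq[i] for name, seq in msa.items()}
        total += fitch(tree, column)[1]
    return total

msa = {
    "seqA": "GTCGACTCC",
    "seqB": "GTAGACTAC",
    "seqC": "GCGGCCATC",
    "seqD": "GCTGCCAGC",
}
trees = {
    "((A,B),C),D": ((("seqA", "seqB"), "seqC"), "seqD"),
    "((A,C),B),D": ((("seqA", "seqC"), "seqB"), "seqD"),
    "((B,C),A),D": ((("seqB", "seqC"), "seqA"), "seqD"),
}

scores = {label: parsimony_score(tree, msa) for label, tree in trees.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

For this alignment the sketch counts 9 changes for ((A,B),C),D and 12 for each of the other two topologies, so ((A,B),C),D is the most parsimonious tree.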