13 | Phylogenetics II Flashcards

1
Q

What are the types of character-based methods for inferring a phylogeny?

Give some examples

A

parsimony

statistical methods
- maximum likelihood, eg RAxML, PhyML, IQ-Tree, FastTree
- Bayesian inference, eg MrBayes, BEAST

2
Q

What is the basic concept and what are the basic steps of character-based methods of phylogeny inference?

A

many trees are computed and evaluated

  • starting tree (computed using quick and dirty methods)
  • improve it by branch swapping / tree rearrangement
  • keep best tree(s), according to optimality criterion
3
Q

In character-based methods, which optimality criteria can be used to decide which tree to keep?

A

keep best tree(s), according to optimality criterion
- MP: single best tree = tree with the fewest changes
- ML: single best tree = tree with highest likelihood
- Bayesian: keep multiple best trees

4
Q

In character-based methods, how can the starting tree be improved upon?

A

improve it by branch swapping / tree rearrangement
- NNI (nearest neighbor interchange)
- SPR (subtree pruning and regrafting)
- TBR (tree bisection and reconnection)
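
A minimal Python sketch of the simplest rearrangement, NNI, combined with the "keep the best tree" step. The representation (a nested tuple holding the four subtrees around one internal branch), the function names and the toy score are assumptions made for illustration; real programs rearrange every internal branch of much larger trees.

```python
def nni_neighbours(tree):
    """tree = ((a, b), (c, d)): the four subtrees around one internal branch.
    NNI swaps one subtree from each side, giving two alternative trees."""
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]

def hill_climb(start, score):
    """Move to the best NNI neighbour as long as it improves the score
    (lower is better here, as under the parsimony criterion)."""
    current = start
    while True:
        best = min(nni_neighbours(current), key=score)
        if score(best) >= score(current):
            return current          # no neighbour is better: stop
        current = best

def toy_score(tree):
    """Hypothetical optimality score: 0 if A and B are grouped, else 1."""
    return 0 if any(set(side) == {"A", "B"} for side in tree) else 1

start = (("A", "C"), ("B", "D"))    # "quick and dirty" starting tree
print(hill_climb(start, toy_score))
```

SPR and TBR follow the same loop but propose more neighbours per step, which makes the search broader and slower.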

5
Q

How is the principle of parsimony applied to phylogenetics?

A

the “best” phylogeny = the one that requires the fewest changes, or mutations, along the branches

Developed before molecular data was available

  1. consider all (many) possible rooted phylogenies
  2. for each column of the alignment, determine how many changes would be required for each of the tree topologies
  3. add up the number of changes the data require for each of the trees
  4. the tree requiring the fewest changes is the “most parsimonious” tree = the best tree under the MP criterion
6
Q

What are the advantages and disadvantages of maximum parsimony for inferring trees?

A

advantages
- provides exact mapping of characters along branches
- can be used for non-molecular characters (morphology, RFLPs)

disadvantages
- uses an unrealistic model of substitutions, does not correct for multiple substitutions
- non-probabilistic: it is difficult to evaluate results in a statistical framework
- ignores branch lengths

7
Q

What are the statistical phylogenetics methods and when did they arise?

A

ML (Joe Felsenstein)
* 1981: Evolutionary trees from DNA sequences: A maximum likelihood approach
* 1985: Confidence Limits on Phylogenies: An Approach Using the Bootstrap

Bayesian approaches
* 1996 Bayesian approaches for phylogenetic inference

8
Q

What can maximum likelihood be used for

A

1981
* program dnaml could deal with 15 sequences of length 60 bp

since then
* tested, extended and used to infer phylogenies
* implementations for large data sets are available
* statistical framework
- can also be used to test phylogenetic hypotheses: tree topology, divergence times, models of evolution, rate heterogeneity, etc

9
Q

What is likelihood - give an example with a coin toss

A

Likelihood: probability of the data, given the model P(D|M)
- Data: “head” in a coin toss
- Model about how the data was generated: fair or loaded coin

the chosen model affects the probability of the data!
* model 1: the coin is fair - (likelihood: 0.5)
* model 2: it’s a two headed coin (likelihood: 1)
* model 3: it’s a two tailed coin (likelihood: 0)
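
The coin example above can be sketched in a few lines of Python. The dictionary `models` and the function `likelihood` are illustrative names, not part of the lecture; the sketch also assumes that tosses are independent, so the likelihood of several tosses is the product of the single-toss probabilities.

```python
# Likelihood = P(data | model). Each model is summarised by one number:
# its probability of producing "head" on a single toss.
models = {"fair": 0.5, "two-headed": 1.0, "two-tailed": 0.0}

def likelihood(tosses, p_head):
    """Probability of an observed string of independent tosses ('H' or 'T')."""
    result = 1.0
    for toss in tosses:
        result *= p_head if toss == "H" else 1.0 - p_head
    return result

for name, p_head in models.items():
    print(name, likelihood("H", p_head))
```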

10
Q

What is the likelihood in the context of phylogenetics?

A

Likelihood: probability of the data, given the model P(D|M)
- Data: an MSA
- Model about how the data was generated: substitution model (eg JC), tree topology T and branch lengths t.

11
Q

How do we calculate the likelihood of a phylogeny? Use the alignment ‘GC’ and the K2P model, and pretend we have a time machine and know that both bases evolved from A.

A

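= the probability of the alignment GC, given the K2P model and the following tree

A → G (top branch)
A → C (bottom branch)

top branch (A → G, purine to purine):
L = probability of a transition along a branch of length t
≈ alpha * t (for a short branch)

bottom branch (A → C, purine to pyrimidine):
L = probability of a transversion along a branch of length t
≈ beta * t (for a short branch)

the two branches evolve independently, so the likelihood of the tree is the product of the two branch probabilities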
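
As a hedged illustration, the sketch below compares the products alpha * t and beta * t with the exact K2P probabilities, which are not given on the card: the closed-form expressions are the standard ones for Kimura's model, with alpha the transition rate and beta the rate of each of the two possible transversions. The numeric values of alpha, beta and t are arbitrary.

```python
import math

def k2p_transition(alpha, beta, t):
    """Exact K2P probability of a transition (e.g. A -> G) after time t."""
    return (0.25 + 0.25 * math.exp(-4 * beta * t)
            - 0.5 * math.exp(-2 * (alpha + beta) * t))

def k2p_transversion(beta, t):
    """Exact K2P probability of one specific transversion (e.g. A -> C)."""
    return 0.25 - 0.25 * math.exp(-4 * beta * t)

alpha, beta, t = 2.0, 1.0, 0.01          # arbitrary illustrative values
top = k2p_transition(alpha, beta, t)     # branch A -> G
bottom = k2p_transversion(beta, t)       # branch A -> C
print("transition:  ", top, "vs alpha * t =", alpha * t)
print("transversion:", bottom, "vs beta * t =", beta * t)
print("likelihood of GC given ancestor A:", top * bottom)
```

For short branches the two columns agree closely; for long branches every probability approaches 1/4, which is why the linear form only works as a short-branch approximation.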

12
Q

Likelihood of a phylogeny

what is K2P

A
  • model of sequence evolution (Kimura two-parameter, K2P)

transition
alpha: purine to purine / pyrimidine to pyrimidine

transversion
beta: purine to pyrimidine / pyrimidine to purine

  • πA, πC, πG, πT are the base frequencies (all equal to 1/4 under K2P)
13
Q

Likelihood of a phylogeny
with alignment GC, K2P …

  • what if we don’t know the ancestral position?
A

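four different likelihoods, one for each base as the ancestral position, weighted by the base frequencies and added up

eg for two branches of length t leading from the ancestor to G and to C:
L = πA·P(A→G)·P(A→C) + πC·P(C→G)·P(C→C) + πG·P(G→G)·P(G→C) + πT·P(T→G)·P(T→C)

P(x→y) = probability of the change x to y along a branch of length t: ≈ αt for a transition, ≈ βt for a transversion, and 1 minus the probabilities of all changes if x = y
α = transition, eg pur to pur
β = transversion, eg pur to pyr
π = base frequencies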
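
A small Python sketch of this sum over the four possible ancestral bases. It uses the exact K2P probabilities (standard closed forms, not given on the card) and equal base frequencies by default; the function names and the values of alpha, beta and t are assumptions for illustration.

```python
import math

PURINES = {"A", "G"}

def k2p_prob(x, y, alpha, beta, t):
    """Exact K2P probability that base x becomes base y along a branch of length t."""
    ts = (0.25 + 0.25 * math.exp(-4 * beta * t)
          - 0.5 * math.exp(-2 * (alpha + beta) * t))
    tv = 0.25 - 0.25 * math.exp(-4 * beta * t)
    if x == y:
        return 1.0 - ts - 2.0 * tv           # no net change
    if (x in PURINES) == (y in PURINES):
        return ts                             # transition
    return tv                                 # transversion

def site_likelihood(left, right, alpha, beta, t, freqs=None):
    """Sum over the unknown ancestral base x of pi_x * P(x->left) * P(x->right)."""
    freqs = freqs or {base: 0.25 for base in "ACGT"}
    return sum(freqs[x] * k2p_prob(x, left, alpha, beta, t)
               * k2p_prob(x, right, alpha, beta, t) for x in "ACGT")

print(site_likelihood("G", "C", alpha=2.0, beta=1.0, t=0.1))
```

Summing `site_likelihood` over all 16 possible pairs of tip bases gives 1, a quick check that the terms form a proper probability distribution.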

14
Q

Likelihood of a phylogeny - steps?

A
  1. generate a tree topology and branch lengths
  2. calculate the likelihood for each site (column) and multiply to get the likelihood for all sites
  3. modify the branch lengths in order to optimize the likelihood of that topology
  4. select the next topology using a heuristic, modify branch lengths (repeat)
  5. choose the topology and branch lengths with the highest likelihood
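
Steps 2 and 3 can be sketched for the simplest possible case: two sequences joined by a single branch, under the Jukes-Cantor (JC) model. This is only an illustration under those assumptions; the topology search of steps 1, 4 and 5 is left out, and a plain grid search stands in for the numerical optimisers that real programs use. The two sequences are seqA and seqB from the MP example card.

```python
import math

def jc_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences separated by a branch of
    length d (expected substitutions per site) under Jukes-Cantor."""
    same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * d / 3.0)
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        # site likelihood = base frequency * P(a -> b | d); site likelihoods
        # are multiplied, so their logarithms are added
        log_l += math.log(0.25 * (same if a == b else diff))
    return log_l

seq_a = "GTCGACTCC"
seq_b = "GTAGACTAC"

# step 3: try many branch lengths and keep the one with the highest likelihood
grid = [i / 1000 for i in range(1, 2000)]
best_d = max(grid, key=lambda d: jc_log_likelihood(seq_a, seq_b, d))

# for two sequences the optimum is also known in closed form (JC distance)
p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(best_d, closed_form)
```

Working with log-likelihoods avoids numerical underflow when thousands of small site likelihoods are multiplied.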
15
Q

Probabilistic (statistical) inference approaches

ML vs Bayesian

A

Maximum likelihood: P(Data|Tree)
* result of ML analysis: one tree with certain branch lengths that has the highest likelihood of having generated the data
* but what about different trees with likelihoods that are just a little lower than the best tree?
only the single most likely tree is returned

Bayesian inference: P(Tree|Data)
* Bayesian inference considers many trees with similar high likelihoods

16
Q

Bayesian inference
basic concept, phylogenetic example

A
  • update beliefs about random events in light of evidence about those events

phylogenetic example
* data: the alignment
* use a “best guess” (the prior) regarding the evolution of these sequences (tree topology, branch lengths, substitution model, etc)
* use Bayes’ Rule to update the prior probability and obtain a posterior probability of a model & parameters

17
Q

Bayesian inference

What is posterior probability? equation?

A

P(𝛉|Data)

f(𝛉|D) = 1/z * f(𝛉)f(D|𝛉)

  • 𝛉: unknown model & parameter
  • D: data
  • 1/z: normalizing constant (ensures that f(𝛉|D) integrates to 1 and is a proper statistical distribution)
  • f(𝛉): prior probability
  • f(D|𝛉): probability of the data, given the model & parameter
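
A minimal Python sketch of this formula for a discrete set of models, reusing the three coin models from the likelihood card. The prior values are made up for illustration; `z` is computed as the sum of prior times likelihood over all models, so that dividing by it makes the posterior sum to 1.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of models:
    f(theta | D) = 1/z * f(theta) * f(D | theta)."""
    unnormalised = {m: prior[m] * likelihood[m] for m in prior}
    z = sum(unnormalised.values())        # makes the posterior sum to 1
    return {m: value / z for m, value in unnormalised.items()}

# assumed prior beliefs about the coin (illustrative values)
prior = {"fair": 0.8, "two-headed": 0.1, "two-tailed": 0.1}
# likelihood of the data "head" under each model
likelihood_of_head = {"fair": 0.5, "two-headed": 1.0, "two-tailed": 0.0}

post = posterior(prior, likelihood_of_head)
print(post)
```

After observing one head, the two-tailed model drops to a posterior probability of zero and the two-headed model gains weight relative to its prior.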
18
Q

Bayesian inference

How does posterior probability for phylogeny inference work? Is it feasible?

A
  • computing the posterior probability exactly is computationally too expensive
  • approximate by sampling trees from the posterior probability distribution (MCMC)
    • visits trees with a high probability more often
    • periodically sample from this sequence of trees
      • generate 10,000,000 iterations (trees), save every 1000th tree = 10,000 trees!
  • can take weeks (or months)
  • result is (a consensus of) the sampled trees
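
The sampling idea can be sketched with a toy Metropolis sampler in Python. Everything specific here is an assumption for illustration: only three candidate topologies, made-up unnormalised posterior weights, and a proposal that picks any topology at random. Real programs such as MrBayes or BEAST propose small changes to topology, branch lengths and model parameters instead.

```python
import random
from collections import Counter

# made-up unnormalised posterior weights (prior * likelihood) for three topologies
weights = {"((A,B),(C,D))": 6.0, "((A,C),(B,D))": 3.0, "((A,D),(B,C))": 1.0}

def metropolis(weights, iterations, sample_every, seed=1):
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    samples = []
    for i in range(1, iterations + 1):
        proposal = rng.choice(states)                 # symmetric proposal
        # accept with probability min(1, weight ratio); only ratios are
        # needed, so the normalising constant never has to be computed
        if rng.random() < min(1.0, weights[proposal] / weights[current]):
            current = proposal
        if i % sample_every == 0:                     # save every n-th tree
            samples.append(current)
    return samples

samples = metropolis(weights, iterations=200_000, sample_every=10)
freq = {tree: n / len(samples) for tree, n in Counter(samples).items()}
print(freq)
```

The sampled frequencies approximate the normalised posterior (here roughly 0.6, 0.3 and 0.1), which is what the consensus of the sampled trees summarises.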
19
Q

Bayesian inference

pros and cons

A

cons
* philosophy is still somewhat controversial
* we must specify a prior, which can be tricky
* models / number of parameters need to be selected carefully! (results / running time)
* non-convergence possible

pros
* natural interpretation of probability: P(M|D)
* good for estimation of divergence times
* support for branches is inherent to the analysis

20
Q

Bayesian confidence values?

A

proportion of sampled trees in which these groups were observed

– “support values” – generally anything over 0.7 is considered good

21
Q

How can we assess confidence in clades of NJ, MP, ML trees?

A

we need support values for branches on NJ, MP, ML trees

  • resample the alignment with replacement
    • the original data are the columns of the multiple alignment
    • sample X number of new alignments (“pseudo-sample”, “bootstrap data set”), of the same length
  • compute a phylogeny for each pseudo-sample
  • count how many times a bipartition (group) appears
  • label nodes from the original (best) tree with bootstrap proportions
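
A hedged Python sketch of these steps. The column resampling is the actual bootstrap step; `closest_pair` is only a toy stand-in for a real tree-building method (NJ, MP or ML) so that the example runs on its own, and all function names are illustrative. The alignment is the one from the MP example card.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """One pseudo-sample: columns drawn with replacement,
    same number of columns as the original alignment."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Toy stand-in for tree inference: reports the two most similar
    sequences as the only inferred group."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    def differences(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))
    return {frozenset(min(pairs, key=differences))}

def bootstrap_proportions(alignment, infer_groups, replicates=200, seed=0):
    """Fraction of pseudo-samples in which each group was inferred."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(infer_groups(resample_columns(alignment, rng)))
    return {group: n / replicates for group, n in counts.items()}

msa = {"seqA": "GTCGACTCC", "seqB": "GTAGACTAC",
       "seqC": "GCGGCCATC", "seqD": "GCTGCCAGC"}
support = bootstrap_proportions(msa, closest_pair)
for group, proportion in support.items():
    print(sorted(group), proportion)
```

In a real analysis `infer_groups` would run the same method that produced the original tree and return all of its bipartitions.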
22
Q

What is bootstrapping

A
  • bootstrapping measures confidence for single nodes (not the entire tree!)
  • it is a measure of repeatability (not of accuracy!)
  • usually: ≥ 70% = “support”
  • problematic for huge data sets: bootstrap values decrease
    • with increasing no. of characters & taxa, and/or in the presence of unstable taxa
    • use modified bootstrap values (https://doi.org/10.1038/s41586-018-0043-0) or no bootstrap values
23
Q

How can we evaluate a dataset (MSA)?

A

MSA
- columns & distinct patterns per sample
- % invariant sites
- entropy, treeness, etc

phylogeny
- branch length statistics

other
- % unique topologies among MP trees

Evaluate data set (MSA)
- methods will always find at least one tree; even random data will have some ‘best’ tree
- test for phylogenetic signal in the data
- no signal (e.g., slow evolution)
- non-phylogenetic signal, systematic bias (e.g., variation across lineages in evolutionary rates, nt composition, etc. )
- can lead to highly supported but wrong tree!
- example: mutational saturation: similarity between sequences = similarity in nucleotide frequencies: phylogenetic signal is lost

24
Q

What is meant by comparing trees?

A
  • same alignment, different methods
  • same alignment, posterior trees of Bayesian analyses
  • same genomes/samples, different loci

summarize: consensus methods
compare:
* visual methods (e.g., overlay topologies)
* quantitative methods

25
Q

Comparing tree topologies?

2 types of distances? describe briefly

most widely used? problems?

alternative approaches?

A

Robinson Foulds distance
* number of partitions that are in one tree but not the other

Quartet distance
* number of quartets that are different between two trees

RF: most widely used, but has some problems
* statistical interpretation of distances?
* low resolution; small changes ➜ large distance
* …
alternative approaches?
* example: focus on evolutionary relationships in rooted trees (Kendall & Colijn, MBE 2016)
- compare MRCA of each pair of tips
- use just topology or also branch lengths
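
The Robinson Foulds distance described above can be sketched in Python. The tree representation is an assumption made to keep the example short: each tree is given as the list of clades of a rooted drawing, and each clade is converted to a bipartition of the taxa. The two five-taxon trees are made-up examples.

```python
def bipartitions(clades, taxa):
    """Non-trivial splits of a tree. Each split is stored as the side that
    does not contain a fixed reference taxon, so the same split always
    gets the same representation."""
    taxa = frozenset(taxa)
    reference = min(taxa)
    splits = set()
    for clade in clades:
        side = frozenset(clade)
        if reference in side:
            side = taxa - side
        if 1 < len(side) < len(taxa) - 1:     # skip splits that isolate one leaf
            splits.add(side)
    return splits

def robinson_foulds(clades1, clades2, taxa):
    """Number of splits found in exactly one of the two trees."""
    return len(bipartitions(clades1, taxa) ^ bipartitions(clades2, taxa))

taxa = "ABCDE"
tree1 = [{"A", "B"}, {"A", "B", "C"}, {"D", "E"}]   # (((A,B),C),(D,E))
tree2 = [{"A", "C"}, {"A", "B", "C"}, {"D", "E"}]   # (((A,C),B),(D,E))
print(robinson_foulds(tree1, tree2, taxa))
```

Swapping just B and C already changes one of the two internal splits in each tree, which illustrates how a small rearrangement can produce a comparatively large distance.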

26
Q

describing trees

some terms
* ancestor, ancestral lineage, basal group,
(monophyletic) clade, diversification, homolog,
ingroup, lineage, outgroup, sister group, node,
species tree, gene tree, paralog, ortholog, root,
duplication event, speciation event, …

?

A
27
Q

EXAM QUESTION

describe process of NJ and ML and name and describe one method to test validity of found clades (2022)

A

NJ
- distance-based
- Like a cookbook recipe
- very fast greedy heuristic
- use MSAs to make distance matrix
- compute only a single tree

ML
- character-based
- calculate likelihood of a phylogeny = probability of a dataset (MSA) given a model (substitution model, tree topology and branch lengths)
- evaluate many different trees and pick the optimal one

test validity: bootstrapping
- create pseudo-samples of the same length as the original MSA by resampling its columns with replacement
- compute a phylogeny for each pseudo-sample
- count how many times a bipartition (group) appears
- label nodes from the original (best) tree with bootstrap proportions

28
Q

EXAM QUESTION

For which of the tree building methods we saw do multiple substitutions pose a larger difficulty? Explain your answer. (2019)

Which phylogeny reconstruction method does not correct for multiple substitutions and how does it present a problem (2020)

A

Maximum Parsimony (MP)

MP disadvantages
- uses an unrealistic model of substitutions
- does not correct for multiple substitutions

more divergent sequences = more multiple substitutions -> MP does not work well here -> poorly suited to today’s data

lit:
- inherent assumption of slow rate of evolution (so slow that multiple hits are negligible).
- molecular phylogenetics moving towards resolving deep phylogenies (typically involve sequences with multiple subs at the same site)
- -> MP has gradually faded away, like an old soldier.

29
Q

Explain MP with an example

A

We have this MSA:

123456789

seqA GTCGACTCC

seqB GTAGACTAC

seqC GCGGCCATC

seqD GCTGCCAGC

the possible trees are:

(((A,B),C),D); (((A,C),B),D); (((B,C),A),D);

we go column by column, checking how many mutations each possible tree would need to explain the nucleotides in that column
- 1st column: skip, because all bases are the same
- 2nd column: (((T,T),C),C); = 1 mut / (((T,C),T),C); = 2 mut / (((T,C),T),C); = 2 mut
- etc

add up the total mutations for each topology and choose the one with the lowest value
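
The column-by-column counting can be sketched in Python with Fitch's algorithm, a standard way to count the minimum number of changes on a given tree. The card itself does not name an algorithm, so this choice, the nested-tuple tree representation and the function names are assumptions for illustration.

```python
def fitch(tree, column):
    """Return (possible ancestral bases, minimum number of changes) for one
    alignment column. `tree` is a nested tuple of sequence names and
    `column` maps each sequence name to its base."""
    if isinstance(tree, str):                      # a tip
        return {column[tree]}, 0
    left_states, left_changes = fitch(tree[0], column)
    right_states, right_changes = fitch(tree[1], column)
    common = left_states & right_states
    if common:                                     # children agree: no change
        return common, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_score(tree, msa):
    """Add up the changes over all columns of the alignment."""
    length = len(next(iter(msa.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in msa.items()})[1]
               for i in range(length))

msa = {"seqA": "GTCGACTCC", "seqB": "GTAGACTAC",
       "seqC": "GCGGCCATC", "seqD": "GCTGCCAGC"}
trees = {
    "(((A,B),C),D)": ((("seqA", "seqB"), "seqC"), "seqD"),
    "(((A,C),B),D)": ((("seqA", "seqC"), "seqB"), "seqD"),
    "(((B,C),A),D)": ((("seqB", "seqC"), "seqA"), "seqD"),
}
scores = {name: parsimony_score(tree, msa) for name, tree in trees.items()}
print(scores)
```

For this alignment the sketch counts 9 changes for (((A,B),C),D) and 12 for each of the other two topologies, so the first tree is the most parsimonious.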