lecture 3 Flashcards
phylograms and chronograms (but not cladograms) have
branch lengths that contain useful information.
phylogenetic methods
- goal: need to estimate phylogeny.
- approach:
1. collect data (evidence)
2. align data (homology)
3. find the “best” tree (phylogenetic analysis)
types of data
morphology and molecular.
phylogenetic methods
distance-based methods, maximum parsimony, maximum likelihood, and Bayesian inference.
distance based methods steps
- step 1: convert data to a measure of genetic distance between each pair of sequences.
- step 2: calculate a tree using the table of genetic distances.
distance based methods process
- sequence alignment
- distances between sequences -> distance tables
- calculate unrooted tree
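A minimal sketch of step 1 (alignment -> distance table), assuming the simplest distance measure, the proportion of differing sites (p-distance). The toy alignment and the function names `p_distance` and `distance_table` are made up for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned sites that differ, skipping positions with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    diffs = sum(1 for x, y in pairs if x != y)
    return diffs / len(pairs)

def distance_table(alignment):
    """Pairwise distance table as a nested dict: table[name1][name2]."""
    names = list(alignment)
    table = {n: {n: 0.0} for n in names}
    for i, j in combinations(names, 2):
        d = p_distance(alignment[i], alignment[j])
        table[i][j] = d
        table[j][i] = d
    return table

# Hypothetical toy alignment (already aligned, equal lengths)
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACGAACGTTA",
    "D": "TCGAACGTTA",
}
table = distance_table(alignment)
for name in alignment:
    print(name, [round(table[name][other], 2) for other in alignment])
```

Step 2 would hand this table to a clustering algorithm such as UPGMA or neighbor joining; that part is not shown here.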
distance based methods pros
ultra-fast and can handle a large number of species.
distance based methods cons
replaces sequences with distances, has problems when distances are tied, and no model of evolution.
maximum parsimony
find the tree that explains the observed data with the minimal number of changes, i.e. choose the tree with the fewest steps (substitutions).
maximum parsimony steps
- step 1: propose a tree and compute parsimony score (least number of changes for a given tree).
- step 2: repeat step 1 until the tree with the minimum number of changes is found.
- changes: informative vs uninformative
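The steps above can be sketched for four taxa, where only three unrooted topologies exist and all of them can be scored. This assumes the Fitch algorithm for counting changes (the cards do not name a scoring algorithm), trees written as nested tuples, and a made-up alignment.

```python
from collections import Counter

def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):                 # leaf: its observed state, no change
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Step 1: sum the minimum number of changes over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total

def is_informative(column):
    """Informative site: at least two states, each found in at least two taxa."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2

# Hypothetical toy alignment
alignment = {"A": "AAGCT", "B": "AAGTT", "C": "GGGCA", "D": "GGATA"}

# Step 2: score every candidate tree and keep the minimum
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {label: parsimony_score(t, alignment) for label, t in trees.items()}
best = min(scores, key=scores.get)
informative = [is_informative(col) for col in zip(*alignment.values())]
print(scores, best, informative)
```

Site 3 is uninformative: it costs one change on every tree, so it cannot help choose between them.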
maximum parsimony pros
simple to understand.
maximum parsimony cons
easily trapped on a local optimum, no branch lengths, too many equally parsimonious trees, can be statistically inconsistent (can converge on the wrong tree, in which case more data only increases support for the incorrect result).
bootstrap
calculating support for relationships. how strong is the signal in the data?
- statistical procedure: random re-sampling of the data, with replacement.
- pseudoreplicates
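A rough sketch of the resampling step: each pseudoreplicate is built by drawing alignment columns at random, with replacement, until it has as many columns as the original. The `a_b_closest` check is only a toy stand-in for a real tree search, and the alignment is made up.

```python
import random

def resample_columns(alignment, rng):
    """Build one pseudoreplicate by sampling columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def bootstrap_support(alignment, has_grouping, n_reps=200, seed=1):
    """Fraction of pseudoreplicates in which the grouping is recovered."""
    rng = random.Random(seed)
    hits = sum(bool(has_grouping(resample_columns(alignment, rng)))
               for _ in range(n_reps))
    return hits / n_reps

def a_b_closest(aln):
    """Toy stand-in for a tree search: is A closer to B than to C or D?"""
    def diff(x, y):
        return sum(p != q for p, q in zip(aln[x], aln[y]))
    return diff("A", "B") < min(diff("A", "C"), diff("A", "D"))

# Hypothetical toy alignment
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACGAACGTTA",
    "D": "TCGAACGTTA",
}
replicate = resample_columns(alignment, random.Random(0))
support = bootstrap_support(alignment, a_b_closest)
print(replicate["A"], support)
```

In a real analysis each pseudoreplicate would be run through the full tree-building method, and support for a relationship is the share of resulting trees that contain it.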
maximum likelihood
what is the probability of observing a set of data given a hypothesis? the equation of the conditional probability is: likelihood = probability (data | hypothesis) or P(data | tree).
- calculate the likelihood for each hypothesis. look for the hypothesis with the highest likelihood.
- propose new topology, branch lengths, model parameters -> calculate scores; repeat.
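A minimal sketch of "calculate the likelihood for each hypothesis and keep the highest," reduced to the simplest case: two sequences, one branch length `d`, and the 1-rate (Jukes-Cantor) model. The sequences and the grid of candidate values are assumptions for illustration; real programs also search over topologies and model parameters.

```python
import math

def jc69_log_likelihood(seq1, seq2, d):
    """log P(data | branch length d) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e      # site ends in the same base
    p_diff = 0.25 - 0.25 * e      # site ends in one specific different base
    log_l = 0.0
    for x, y in zip(seq1, seq2):
        # 0.25 = probability of the starting base (equal base frequencies)
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l

# Hypothetical pair of aligned sequences (2 differences in 20 sites)
seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACCTACGT"

# Propose many branch lengths, score each, keep the best
candidates = [i / 1000 for i in range(1, 1001)]
best_d = max(candidates, key=lambda d: jc69_log_likelihood(seq1, seq2, d))

# Known closed-form estimate for this model, used here only as a check
p = sum(x != y for x, y in zip(seq1, seq2)) / len(seq1)
analytic_d = -0.75 * math.log(1 - 4 * p / 3)
print(best_d, round(analytic_d, 4))
```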
substitution models (max. likelihood)
model the rates of change among nucleotide bases.
- 1 rate: all changes equal
- 2 rates: transitions vs. transversions
- 6 rates: a unique rate for each type of change (general time-reversible model, GTR)
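The three schemes can be written as relative rate matrices. This is a sketch that assumes equal base frequencies (full models also include base frequencies), and the numeric rates are made up.

```python
BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}   # purine<->purine, pyrimidine<->pyrimidine

def rate_matrix(rate_for_pair):
    """Build a 4x4 relative rate matrix; each row sums to zero."""
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = rate_for_pair(frozenset((x, y)))
        q[x][x] = -sum(q[x].values())   # diagonal balances the off-diagonal rates
    return q

def distinct_rates(q):
    """Number of different off-diagonal rates in the matrix."""
    return len({q[x][y] for x in BASES for y in BASES if x != y})

# 1 rate: all changes equal
one_rate = rate_matrix(lambda pair: 1.0)

# 2 rates: transitions vs. transversions (hypothetical 2:1 ratio)
two_rate = rate_matrix(lambda pair: 2.0 if pair in TRANSITIONS else 1.0)

# 6 rates: one per pair of bases (hypothetical values)
pair_rates = {
    frozenset("AC"): 1.0, frozenset("AG"): 4.0, frozenset("AT"): 0.8,
    frozenset("CG"): 1.2, frozenset("CT"): 5.0, frozenset("GT"): 1.5,
}
six_rate = rate_matrix(lambda pair: pair_rates[pair])

print(distinct_rates(one_rate), distinct_rates(two_rate), distinct_rates(six_rate))
```

Because each rate is tied to an unordered pair of bases, A->G and G->A share a rate, which is what makes these matrices time-reversible under equal base frequencies.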
maximum likelihood pros
statistically consistent (converges on the correct tree as the amount of data increases, provided the model is adequate).
maximum likelihood cons
slow; relies on “hill climbing,” which makes it easy to get trapped on a local max.
hill climbing
the objective of maximum parsimony and maximum likelihood is to find the “maximum” solution.
- propose new: topology (ML and MP), branch lengths (ML only), model parameters (ML only) -> calculate score; repeat.
is there really an optimal tree?
can’t enumerate all possible trees for 10+ species.
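To see how fast the number of trees grows, the count of unrooted bifurcating trees for n species is the double factorial (2n - 5)!! = 3 x 5 x ... x (2n - 5). A short sketch; the function name is made up.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: 3 * 5 * ... * (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20, 50):
    print(n, count_unrooted_trees(n))
```

Ten species already give about two million trees, and the count grows faster than exponentially from there, which is why searches are heuristic.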
Bayesian phylogenetics
instead of trying to find the “maximum” solution, we summarize the entire posterior distribution using a simulation technique called Markov Chain Monte Carlo (MCMC).
MP and ML vs Bayesian
- provides a single “point estimation” vs provides a probability distribution
- requires bootstrapping vs probabilities are provided
- can get trapped on local max vs can escape local max
- no uncertainty in estimates vs estimate uncertainty for each parameter.
Bayes’ theorem
Pr(tree | data) = (Pr(data | tree) * Pr(tree)) / Pr(data)
posterior probability = (likelihood * prior probability) / marginal likelihood
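A small worked example of the formula, assuming only three candidate trees with equal priors. The likelihood values are made up, and the marginal likelihood here sums over trees only (no branch lengths or model parameters).

```python
def posterior(likelihood, prior):
    """Pr(tree | data) = Pr(data | tree) * Pr(tree) / Pr(data)."""
    marginal = sum(likelihood[t] * prior[t] for t in likelihood)   # Pr(data)
    return {t: likelihood[t] * prior[t] / marginal for t in likelihood}

# Hypothetical likelihoods Pr(data | tree) and equal priors Pr(tree)
likelihood = {"tree1": 6e-5, "tree2": 3e-5, "tree3": 1e-5}
prior = {"tree1": 1 / 3, "tree2": 1 / 3, "tree3": 1 / 3}

post = posterior(likelihood, prior)
for tree, p in post.items():
    print(tree, round(p, 3))
```

With equal priors the posterior is just the likelihoods rescaled to sum to 1, so the tree with the highest likelihood also has the highest posterior probability.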
marginal likelihood
summation over all trees and, for each tree, integration over all possible combinations of branch length and substitution model parameter values.
prior distribution
probability assumed before observing data.
likelihood
L = P(data | tree), same as one used in ML.
posterior distribution
combination of the prior and likelihood.
how do you put a prior on a phylogenetic tree?
give all trees equal prior probability.
Start at a random location and follow these two MCMC rules:
- if the proposed step takes you uphill, you automatically take the step.
- if the proposed step takes you downhill, you take the step only some of the time, at “random” (the steeper the drop, the less likely the step).
- approximates the posterior distribution.
- convergence - how long to run the MCMC analysis?
- are you sampling from the “global peak”?
- have you collected enough samples to summarize?
- burn in - how many of the initial samples are useless?
- you begin at a random spot; when did you reach the “goal”?
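The two rules can be sketched with a toy "tree space" of five trees arranged in a ring, each with a made-up unnormalised posterior weight. This assumes the Metropolis form of the downhill rule (accept with probability equal to the ratio of new to current weight) and a symmetric proposal that moves to a neighbouring tree.

```python
import random
from collections import Counter

def metropolis(weights, n_steps, burn_in, seed=0):
    """Sample indices in proportion to weights using the two MCMC rules."""
    rng = random.Random(seed)
    n = len(weights)
    current = rng.randrange(n)                 # start at a random location
    samples = []
    for step in range(n_steps):
        proposal = (current + rng.choice((-1, 1))) % n   # neighbouring "tree"
        ratio = weights[proposal] / weights[current]
        # uphill (ratio >= 1): always accept; downhill: accept with probability ratio
        if ratio >= 1 or rng.random() < ratio:
            current = proposal
        if step >= burn_in:                    # discard the burn-in samples
            samples.append(current)
    return samples

# Hypothetical unnormalised posterior weights for five trees
weights = [1.0, 2.0, 6.0, 3.0, 0.5]
samples = metropolis(weights, n_steps=200_000, burn_in=1_000)

visits = Counter(samples)
freq = [visits[i] / len(samples) for i in range(len(weights))]
target = [w / sum(weights) for w in weights]
print([round(f, 3) for f in freq])
print([round(t, 3) for t in target])
```

The share of visits to each tree approximates its posterior probability; the chain never needs the normalising constant, only ratios of weights.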
tree space
each dot is a tree visited during the MCMC. the number of times a tree is visited is proportional to the posterior probability of the tree.
phylogenetic methods:
- distance: convert data to distance values and calculate the tree.
- maximum parsimony: the tree that explains the data with the least amount of evolutionary change.
- maximum likelihood: the tree with the highest probability of generating the observed data.
- Bayesian: a probability distribution of trees based on prior knowledge and current data.