7 - Trees and distance methods Flashcards
What are four different methods for inferring phylogeny?
- Distance matrix methods: pairwise distances between all sequences in alignment
- Parsimony-based methods
- Maximum likelihood methods
- Bayesian methods
What are the two steps to distance matrix analysis of phylogeny?
- Calculate the pairwise distances between all species from a trimmed multiple alignment to make a distance matrix
- Infer the phylogenetic tree from the distance matrix by an algorithmic method (e.g. NJ, UPGMA) or by an optimality criterion method (e.g. least squares)
How do you calculate p (Hamming distance)?
observed changes / # positions in sequence
or
p = 1 - (proportion of identical sites) = 1 - identity
This is the observed proportion of differences; it is corrected with a substitution model to estimate the tree distance (D)
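The calculation above can be sketched in a few lines of Python. This is a minimal sketch, assuming two already-aligned sequences of equal length with no gaps; the function name p_distance is made up for illustration.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the alignment columns where the two sequences differ.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# 2 differing sites out of 10, so p = 0.2 and identity = 0.8
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))
```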
List the different models for calculating tree distance
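- Jukes-Cantor one-parameter model (all substitutions occur at equal rates)
D = -(3/4) ln(1 - (4/3)p)
- Kimura 2-parameter model (transition rate not equal to transversion rate)
D = (1/2) ln(1 / (1 - 2P - Q)) + (1/4) ln(1 / (1 - 2Q))
where P is the observed proportion of transitions and Q is the observed proportion of transversions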
These models can be combined with a gamma distribution of rates across sites, which describes the proportions of sites with slow, medium and fast substitution (evolution) rates.
The shape of the distribution is governed by the shape parameter, alpha (α).
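The Jukes-Cantor and Kimura 2-parameter corrections can be sketched as follows. This is a minimal sketch, assuming p, P and Q have already been measured from the alignment; the function names are made up for illustration, and the range checks are an added safeguard because the logarithms are undefined for very divergent sequences.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance from the observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P: float, Q: float) -> float:
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return 0.5 * math.log(1 / a) + 0.25 * math.log(1 / b)


# The corrected distance is larger than the observed p = 0.2 (about 0.233).
print(jukes_cantor(0.2))
# With transversions twice as frequent as transitions, K2P reduces to Jukes-Cantor.
print(kimura_2p(0.2 / 3, 0.4 / 3))
```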
What happens when you ignore among-site rate variation in finding tree distance (D)?
UNDERestimation of actual distance.
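A small numerical check illustrates the underestimation. This is a sketch, assuming the standard gamma-corrected form of the Jukes-Cantor distance, D = (3/4) α [(1 - (4/3)p)^(-1/α) - 1]; the function names and the example values p = 0.2 and α = 0.5 are chosen only for illustration.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance assuming every site evolves at the same rate."""
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_gamma_distance(p: float, alpha: float) -> float:
    """Jukes-Cantor distance with gamma-distributed rates (shape alpha) across sites."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1 - 4 * p / 3) ** (-1 / alpha) - 1)


p = 0.2
print(jc_distance(p))                   # about 0.233, rate variation ignored
print(jc_gamma_distance(p, alpha=0.5))  # about 0.322, strong rate variation
```

For the same observed p, the distance that ignores rate variation is the smaller of the two, and the gap widens as α gets smaller.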
Give two methods for inferring a tree from the distance matrix
Algorithmic methods
- Unweighted pair group method with arithmetic mean (UPGMA)
- Neighbour joining
- BIONJ and WEIGHBOR
Optimality criterion-based methods
- Minimum evolution
- Fitch-Margoliash
- Least-squares
Describe the UPGMA method for tree reconstruction
Assumes the rate of evolution is constant in all organismal lineages, so that distance (D) is a linear function of time (T); that is, it assumes a molecular clock.
It assumes the distances are ultrametric, which they typically are not.
Starts by clustering the pair of taxa with the smallest distance and placing their branching point at half that distance. The distances from the new cluster to every other taxon are then averaged, the next smallest distance is found and its branching point is calculated, and so on until all taxa are joined.
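The clustering procedure can be sketched as below. This is a minimal sketch, not a production implementation: the function name upgma, the input format (a dict of pairwise distances keyed by taxon pairs) and the nested-tuple output are all assumptions made for illustration, and ties are broken by whichever pair is encountered first.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    Returns (tree, root_height), where tree is a nested tuple of labels.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    # Each active cluster maps its label to (subtree, number of taxa).
    clusters = {name: (name, 1) for name in names}
    height = 0.0
    while len(clusters) > 1:
        # Find the pair of active clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Under a molecular clock the branching point sits at half the distance.
        height = d[frozenset((a, b))] / 2
        new = f"({a},{b})"
        # Distance to the new cluster is the size-weighted average of its parts,
        # which equals the plain average over all original taxon pairs.
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = ((tree_a, tree_b), size_a + size_b)
    (tree, _), = clusters.values()
    return tree, height


# A small ultrametric example: A and B are closest, then C, then D.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree, root_height = upgma(["A", "B", "C", "D"], distances)
print(tree, root_height)
```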
How do distance matrix programs treat columns that contain gap sites?
They delete them.
Replacing them with ? or '-'
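Deleting gap-containing columns can be sketched as follows. This is a minimal sketch, assuming the alignment is a list of equal-length strings and that '-' and '?' both mark gap or missing sites; the function name drop_gap_columns is made up for illustration.

```python
def drop_gap_columns(alignment, gap_chars="-?"):
    """Return the alignment with every column containing a gap symbol removed."""
    # zip(*alignment) walks the alignment column by column.
    kept = [
        column
        for column in zip(*alignment)
        if not any(ch in gap_chars for ch in column)
    ]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the kept columns back into rows (one string per sequence).
    return ["".join(row) for row in zip(*kept)]


# Columns 2 and 3 contain '?' or '-', so only columns 1, 4 and 5 survive.
print(drop_gap_columns(["AC-GT", "ACAGA", "T?AGT"]))
```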
List four problems with UPGMA
- Assumes the data reflects an ultrametric tree
- Tends to move more divergent sequences deeper into the tree (long branch attraction artefact)
LBA is one of the biggest pitfalls in molecular phylogeny
What is long branch attraction?
Long branch attraction (LBA) causes species to appear more closely related in a phylogeny than they really are. It arises when lineages evolve FASTER than the rest, so that the same mutations or traits occur in them independently (convergent evolution). These shared traits can be misinterpreted as being shared due to common ancestry.
UPGMA is especially bad at this.
Describe the neighbour-joining method of tree reconstruction
Unlike UPGMA, NJ does not require a molecular clock, only additive distances (i.e. that distances between taxa can be represented by a tree structure)
- This allows rates to vary in different lineages
- An algorithm that seeks out neighbours (closest pairs of sequences)
- Starts with a star-tree and sequentially pairs up taxa to minimize the total length implied by the tree (lowest score is best)
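The pairing procedure can be sketched as below. This is a minimal sketch of the standard neighbour-joining criterion, Q(a, b) = (n - 2) d(a, b) - sum of d(a, k) - sum of d(b, k), where the lowest Q is joined first; the function name, the edge-list output and the internal node labels node0, node1, ... are assumptions made for illustration, and ties are broken by whichever pair is encountered first.

```python
def neighbour_joining(names, matrix):
    """Build an unrooted tree by neighbour joining.

    names:  list of taxon labels.
    matrix: symmetric list of lists of pairwise distances.
    Returns a list of (node, neighbour, branch_length) edges.
    """
    nodes = list(names)
    d = {(a, b): matrix[i][j]
         for i, a in enumerate(nodes) for j, b in enumerate(nodes)}
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        total = {a: sum(d[a, b] for b in nodes) for a in nodes}
        # Pick the pair with the lowest Q score.
        pairs = ((a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:])
        a, b = min(pairs, key=lambda ab: (n - 2) * d[ab] - total[ab[0]] - total[ab[1]])
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from a and b to the new internal node u may differ,
        # which is how NJ allows rates to vary between lineages.
        len_a = d[a, b] / 2 + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a, b] - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        rest = [c for c in nodes if c not in (a, b)]
        for c in rest:
            d[u, c] = d[c, u] = (d[a, c] + d[b, c] - d[a, b]) / 2
        d[u, u] = 0.0
        nodes = rest + [u]
    a, b = nodes
    edges.append((a, b, d[a, b]))
    return edges


# A five-taxon additive distance matrix used as an illustration.
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for edge in neighbour_joining(["a", "b", "c", "d", "e"], example):
    print(edge)
```

In this example a and b are joined first with unequal branch lengths (2 and 3), something UPGMA cannot produce because it places both taxa at the same height.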