12 | Phylogenetics I Flashcards
What do we need / need to ask in order to get from sequences to a phylogeny?
Sequences
–>
MSA:
- which sequences, which MSA method?
- alignment/data appropriate for question?
- use the entire alignment?
–>
Algorithm/software to infer phylogeny from MSA
- which method?
- can we use entire alignment or need to remove or mask something?
–>
Phylogeny
- gene trees / species trees
- statistical support?
- biological interpretation?
What is an optimal alignment for phylogenetics?
And what does this mean in more detail?
what is an optimal alignment ?
evolutionary optimal!
= aligned residues are homologous,
share a common ancestry
–> positional homology
MSA, in the context of evolutionary analysis:
a hypothesis about the positional homology of
residues in homologous sequences
Define positional homology and phylogenetic signal
Positional homology
- aligned residues share a common ancestral residue in the ancestral sequences
- changes in the columns correspond to mutations
- these contain the phylogenetic signal
What three ways could you describe alignment regions in regards to how they influence a phylogeny, and how should each be treated?
positionally homologous
–> contain the phylogenetic signal
uninformative
- highly divergent, many gaps
- correctly or incorrectly aligned
- contain no/little phylogenetic signal
–> not necessary to exclude
incorrectly aligned
- positional homology violated
- e.g., non-homologous sequences, misalignment
- leads to incorrect result
–> should be excluded for best results
Removing / masking sequences:
What are the criteria for this?
What are the advantages?
Disadvantages?
trimming non-phylogenetic signal from alignments
criteria:
- gaps
- BLOSUM score per region?
–> different approaches
advantages:
assumed to improve accuracy of:
- tree topology
- branch lengths
- test for selection,…
disadvantages:
- might also inadvertently remove phylogenetic signal
- can also lead to decreased accuracy
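The gap criterion can be sketched as a simple column filter. This is only an illustration: it assumes the alignment is a list of equal-length strings with "-" as the gap character, and the function name `trim_gappy_columns` and the 50% threshold are hypothetical choices, not taken from any particular trimming tool.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    if not alignment:
        return []
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences in an MSA must have the same length")
    # Keep a column only if the share of '-' characters is at or below the threshold
    keep = [
        col for col in range(length)
        if sum(seq[col] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


msa = ["ATG-CA", "AT--CA", "ATGTCA", "A---CG"]
print(trim_gappy_columns(msa))
```

A stricter threshold removes more columns, which is exactly where the risk of discarding real phylogenetic signal comes from.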
Anatomy of a phylogeny
What is the end of a branch called?
tip, leaf, terminal node/vertex
Anatomy of a phylogeny
Name the 4 parts
- tip (leaf, terminal node/vertex)
- branch (edge)
- internal node
- clade
Cladogram vs phylogram?
cladogram: branch lengths meaningless
phylogram: branch lengths proportional to amount of inferred evolutionary change
What is an unrooted tree?
Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions about ancestry.
How can you root a tree?
using an outgroup
(also possible in similar way with paralog(s))
using “midpoint rooting”
What is an unresolved tree?
- we don’t know the relationship of all branches
- multifurcating / non-binary (a polytomy)
due to networks or incompatible gene trees
What is a polytomy?
hard/soft?
polytomy:
unresolved node
- hard polytomy: rapid divergence
- soft polytomy: binary branching pattern not known, due to insufficient or conflicting data
What is a gene tree?
What does it depict? Which events?
Phylogeny depicting the evolution of homologous sequences
events:
- speciation
- duplication
- loss
- horizontal transfer,
- hybridization
- introgression
- incomplete lineage sorting
- …
phylogeny: a hypothesis that depicts the historical relationships among entities in a branching diagram –> for gene tree those entities are functional domains, gene sequences, or genomic regions (not genomes or organisms!)
Define ortholog
diverged after a speciation event
(last common ancestor is a speciation node)
Define paralog
diverged after a duplication event
(last common ancestor is a duplication node)
Gene loss: how can this occur?
frequent loss (pseudogenization, physical loss)
Define In-/Out-paralogy
paralogous genes arising from lineage-specific duplication(s) after/
before a given speciation event.
What can you use a gene tree as?
use a gene tree as a gene tree
* evolution of genes and gene function
* depending on the scope and question, requires orthologs or orthologs & paralogs, and always the most comprehensive set available
use a gene tree as a proxy for a species tree
* evolution of organisms: systematics, conservation biology, historical perspective
* requires strict orthologs (e.g., rRNA) or methods specifically designed to accommodate paralogs
How are phylogenies inferred?
Phylogenetic inference = Statistical inference
sequences evolve along trees via stochastic processes
hypotheses (statistical models!) about these stochastic processes are used to estimate the evolutionary history from sequence data
Describe two types of substitution models for inferring phylogenies + examples
Substitution models:
models of amino acid replacement
- pre-computed
- eg PAM, BLOSUM, WAG, many more
models of DNA replacement
- different general models exist
- rates & base composition: often estimated from the data
- eg Jukes Cantor
What rates does Jukes Cantor have?
simplest one: equal frequencies, same mutation rates
subst rates: a=b=c=d=e=f (all same rate)
base frequencies: πA = πC = πG = πT
What is rate variation and how is it commonly modelled?
(aka rate heterogeneity)
- rate variation across sites
- some evolve slower, some proportionally faster
- among regions/sites, within genes: conserved domains/motifs / first vs. third codon positions / non-coding vs. coding
- among genes: slow and fast evolving genes!
commonly modeled using the gamma distribution
What is the gamma distribution and what is it used for?
The gamma distribution
- models rate heterogeneity over alignment columns
- is implemented in many software packages
- is determined by the shape parameter “alpha”
- alpha < 1: strong among-site variation
- higher alpha: lower rate heterogeneity (shape looks more like a normal distribution)
you get very different shapes depending on alpha
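To get a feel for the effect of alpha, one can draw per-site rates from a gamma distribution whose mean is fixed at 1 (shape = alpha, scale = 1/alpha), so that the variance is 1/alpha. The sketch below uses only the standard library; the function names, the seed and the number of sites are arbitrary assumptions for illustration.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw per-site relative rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape = alpha and scale = 1/alpha keep the mean rate at 1
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


for alpha in (0.5, 1.0, 5.0):
    rates = sample_site_rates(alpha, 20000)
    mean_rate = sum(rates) / len(rates)
    print(f"alpha = {alpha}: mean = {mean_rate:.2f}, variance = {variance(rates):.2f}")
```

With small alpha most sites get rates near zero and a few get very high rates; with large alpha the rates cluster around 1.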
How do you select a model?
especially important for statistical methods for phylogenetic estimation! (eg in 1st lab on phylogenetics)
statistical tests
- which is the best evolutionary model? (LRT, AIC, BIC)
* is there rate heterogeneity?
* do different alignment regions evolve under
different evolutionary models?
* software: ProtTest, ModelTest, most frequently used: PartitionFinder, …
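AIC and BIC both trade model fit (log-likelihood) against the number of free parameters; the model with the lowest score is preferred. A minimal sketch follows. The log-likelihood values and parameter counts are made-up numbers for illustration, not results from any real data set, and only the substitution-model parameters are counted, assuming the tree and branch lengths are shared by both models.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model name -> (log-likelihood, number of free parameters)
fits = {"JC": (-5230.0, 0), "JC+G": (-5105.0, 1)}
n_sites = 1000  # assumed alignment length

for name, (lnl, k) in fits.items():
    print(name, "AIC =", aic(lnl, k), "BIC =", round(bic(lnl, k, n_sites), 1))

# Lower is better for both criteria
best = min(fits, key=lambda m: aic(*fits[m]))
print("preferred by AIC:", best)
```

In this invented example the gamma parameter improves the likelihood by far more than its penalty, so the model with rate heterogeneity wins.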
What are the two general approaches to computing a phylogeny?
Which is most often done in practice and why?
distance methods - most often done in practice because very fast
character-based methods
Explain basic concept of distance methods for computing a phylogeny and give an example of a method.
- Like a cookbook recipe
- very fast greedy heuristic
- use MSAs to make distance matrix
- compute only a single tree
eg neighbor joining (NJ)
What are character based methods for computing a phylogeny? Give some examples
optimality based approaches: evaluate many different trees and pick the optimal one
can be broken down further into:
- parsimony
- statistical methods
- eg maximum likelihood (ML)
- eg Bayesian inference
What do you need to compute a distance matrix?
- MSA
- substitution matrix / model
- for AAs: substitution matrix eg BLOSUM, PAM
- for NTs: available models eg JC
- pairwise differences: Pdiff(t) = observed proportion of pairwise differences
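The observed proportion of differences can be computed directly from two rows of the MSA. A small sketch, assuming "-" marks gaps and that columns with a gap in either sequence are skipped (one common convention; the function name `p_distance` is a hypothetical choice):

```python
def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with a gap in either sequence are not compared
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


# 1 difference in 10 compared sites
print(p_distance("ACGTACGTAC", "ACGTACGTAA"))
```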
Formula for number of substitutions per site over time t - JC difference?
K(t) = -3/4 × ln(1 - 4/3 × Pdiff(t))
What is the pairwise difference if two sequences are 97% similar? and 70%?
What are the resulting K(t) values?
What can be said when comparing these two results?
97% similar –> Pdiff(t) = 3% = 0.03
-0.75 × ln(1 - (4/3 × 0.03)) = 0.0306
70% similar –> Pdiff(t) = 30% = 0.3
-0.75 × ln(1 - (4/3 × 0.3)) = 0.3831
low Pdiff(t): K(t) is almost the same as Pdiff(t) (0.0306 vs 0.03)
high Pdiff(t): K(t) is clearly larger than Pdiff(t) (0.3831 vs 0.3), because more multiple substitutions are hidden
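The same calculation as a short script, using the natural logarithm and the exact factor 4/3 (hand calculations with the rounded value 1.33 can differ in the last digit). The function name `jc_distance` is just an illustrative choice.

```python
import math


def jc_distance(p_diff):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if not 0 <= p_diff < 0.75:
        # At p = 0.75 the sequences look random and the correction diverges
        raise ValueError("JC correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p_diff)


for p in (0.03, 0.30):
    k = jc_distance(p)
    print(f"Pdiff = {p:.2f}  K = {k:.4f}  K/Pdiff = {k / p:.3f}")
```

The ratio K/Pdiff grows with divergence, which shows how much of the true change is hidden by multiple hits.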
What do DNA substitution models generally correct for?
Why?
For multiple substitutions at same site.
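the more two sequences have diverged, the higher the chance that the same position was hit by more than one substitution
–> one observed difference (or even an identical residue) could hide multiple substitutions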
this is why differences are corrected to be higher than observed
What is the next step to get a phylogeny once we have a matrix of distances?
we match distances to topology & branch lengths
true evolutionary distances would fit a single tree
BUT stochastic error & inappropriate models usually result in deviation from true evolutionary distances ➡ distances don’t fit the tree exactly, need to be approximated!
What are the advantages and disadvantages of a distance method
Advantages
- simple
- very fast greedy heuristic
- it often works correctly
Disadvantages
- probably won’t find a tree that exactly matches the distances
–> branch lengths on neighbour joining tree should be taken with grain of salt
- NJ can produce negative branch lengths, which don’t make sense biologically
- you lose data
- you only get one tree
Neighbor Joining
outline
initialize
- matrix of pairwise distances under a substitution model
- star phylogeny
progressively add nodes until the tree is fully resolved
- compute pairwise average & minimal distances
- merge least distant tips into a new node
- compute distances from new node to all remaining tips
- repeat last two steps until only two nodes remain
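The outline above can be sketched in a few lines. This version uses the standard NJ selection criterion Q(a,b) = (n-2)·d(a,b) - sum(d(a,·)) - sum(d(b,·)), which corrects the raw distance of a pair by each tip's total distance to all others. The function name, the use of Newick strings as node labels and the 5-taxon matrix are illustrative assumptions, not part of the lecture material.

```python
def neighbor_joining(names, dist):
    """Return an unrooted NJ tree as a Newick string.

    names: list of tip labels; dist: symmetric matrix as a list of lists.
    """
    # Dict-of-dicts so nodes can be added by label; labels of merged nodes
    # are their Newick sub-strings, which keeps the bookkeeping short
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair with the smallest Q value
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the two merged nodes to the new internal node
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:g},{b}:{len_b:g})"
        # Distances from the new node to all remaining nodes
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    # Two nodes remain: connect them; the root position here is arbitrary
    return f"({a}:{d[a][b]:g},{b}:0);"


names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(names, dist))
```

Ties in Q are broken by taking the first pair found, so a different input order can give a different (equally valid) join order.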
Newick format
What are the different punctuation marks and what do they mean?
( ) encloses 2 sisters / a group / a clade
, shown between two sisters
: follows a leaf or node and indicates distance backwards to next node
; means end of tree
Newick format
give basic example and describe it
((gene1:dist1,gene2:dist2):dist3,gene3:dist4)clade1:dist5;
gene1 and gene2 are sisters
gene1+gene2 and gene3 are sister groups
the distance from clade1 to the next node backwards is dist5
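To make the punctuation rules concrete, here is a minimal recursive parser sketch for this basic form of Newick (no quoted labels, no comments). The numeric branch lengths stand in for dist1 to dist5 and are arbitrary; `parse_newick` and `tip_names` are hypothetical helper names.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":             # '(' opens a group of sisters
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":      # ',' separates sisters
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while text[pos] not in ",():;":  # optional label of the leaf or node
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":             # ':' introduces the branch length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def tip_names(node):
    """List the leaf labels below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in tip_names(child)]


tree = parse_newick("((gene1:0.1,gene2:0.2):0.05,gene3:0.3)clade1:0.0;")
print(tree["name"], tip_names(tree))
```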
EXAM QUESTION
In the lectures, we went over the concept of guide trees on two occasions. Describe the use of guide trees in each of those contexts. (2019)
2 topics where guide tree was mentioned and how (2020)
MSA and distance matrices for phylogenies eg NJ?
How can you root a tree using an outgroup?
Rooting using an outgroup :
- outgroup: a sequence that is more distantly related to each of the ingroup species than these are to each other
- requires prior / independent information
- root is placed on the branch where the outgroup joins
- (also possible in similar way with paralog(s))
What is midpoint rooting?
Rooting a tree using “midpoint rooting”
- place the root halfway between the two most distant taxa (calculate distance from tip to tip)
- assumes a molecular clock (tree has constant rates of evolution)
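Midpoint rooting can be sketched as: find the longest tip-to-tip path, then walk along it until half of its length is reached. The example below assumes the unrooted tree is given as an adjacency dict with branch lengths, has at least two tips and a unique longest path; the tree and all names are invented for illustration.

```python
def path_lengths(tree, start):
    """Distance and predecessor of every node, measured from start."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        node = stack.pop()
        for neighbor, length in tree[node].items():
            if neighbor not in dist:
                dist[neighbor] = dist[node] + length
                prev[neighbor] = node
                stack.append(neighbor)
    return dist, prev


def midpoint(tree):
    """Return (node_a, node_b, offset): the root lies on branch a-b, offset from a."""
    tips = [n for n in tree if len(tree[n]) == 1]
    # Find the two most distant tips
    best = None
    for tip in tips:
        dist, prev = path_lengths(tree, tip)
        far = max(tips, key=lambda t: dist[t])
        if best is None or dist[far] > best[0]:
            best = (dist[far], far, dist, prev)
    total, far, dist, prev = best
    half = total / 2
    # Walk back from the far tip until the next step would cross the halfway point
    node = far
    while dist[prev[node]] > half:
        node = prev[node]
    parent = prev[node]
    return parent, node, half - dist[parent]


# Unrooted example tree: tips A-D, internal nodes X and Y
tree = {
    "A": {"X": 1.0}, "B": {"X": 2.0}, "C": {"Y": 2.0}, "D": {"Y": 6.0},
    "X": {"A": 1.0, "B": 2.0, "Y": 1.0},
    "Y": {"C": 2.0, "D": 6.0, "X": 1.0},
}
print(midpoint(tree))
```

Here B and D are the most distant tips (distance 9), so the root is placed on the long branch leading to D, 4.5 from each of them. If D simply evolved faster than the rest, this root would be wrong, which is why the molecular clock assumption matters.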