12 | Phylogenetics I Flashcards
What do we need / need to ask in order to get from sequences to a phylogeny?
Sequences
–>
MSA:
- which sequences, which MSA method?
- alignment/data appropriate for question?
- use the entire alignment?
–>
Algorithm/software to infer phylogeny from MSA
- which method?
- can we use entire alignment or need to remove or mask something?
–>
Phylogeny
- gene trees / species trees
- statistical support?
- biological interpretation?
What is an optimal alignment for phylogenetics?
And what does this mean in more detail?
What is an optimal alignment?
evolutionary optimal!
= aligned residues are homologous,
share a common ancestry
–> positional homology
MSA, in the context of evolutionary analysis:
a hypothesis about the positional homology of
residues in homologous sequences
Define positional homology and phylogenetic signal
Positional homology
- aligned residues share a common ancestral residue in the ancestral sequences
- changes in the columns correspond to mutations
- these contain the phylogenetic signal
What three ways could you describe alignment regions in regards to how they influence a phylogeny, and how should each be treated?
positionally homologous
–> contain the phylogenetic signal
uninformative
- highly divergent, many gaps
- correctly or incorrectly aligned
- contain no/little phylogenetic signal
–> not necessary to exclude
incorrectly aligned
- positional homology violated
- e.g., non-homologous sequences, misalignment
- leads to incorrect result
–> should be excluded for best results
Removing / masking sequences:
What are the criteria for this?
What are the advantages?
Disadvantages?
trimming non-phylogenetic signal from alignments
criteria:
- gaps
- BLOSUM score per region?
–> different approaches
advantages:
assumed to improve accuracy of:
- tree topology
- branch lengths
- tests for selection, …
disadvantages:
- might also inadvertently remove phylogenetic signal
- can also lead to decreased accuracy
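A gap-based trimming criterion can be sketched as follows. This is only a minimal illustration, not how any particular trimming tool works: the function name `trim_gappy_columns`, the 50% threshold, and the toy alignment are all assumptions made for the example.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds the threshold.

    `alignment` is a list of equal-length aligned sequences (strings),
    with "-" marking a gap. The threshold is an arbitrary illustrative choice.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    # Keep a column only if its share of gap characters is small enough.
    keep = [
        col for col in range(length)
        if sum(seq[col] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment (made up): column 4 is 75% gaps and gets removed.
toy_alignment = [
    "ATG-CA",
    "ATG-CT",
    "AT--CA",
    "ATGACA",
]
trimmed = trim_gappy_columns(toy_alignment, max_gap_fraction=0.5)
```

Note the trade-off from the card above: a stricter threshold removes more noise but may also discard columns that still carry phylogenetic signal.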
Anatomy of a phylogeny
What is the end of a branch called?
tip, leaf, terminal node/vertex
Anatomy of a phylogeny
Name the 4 parts
- tip (leaf, terminal node/vertex)
- branch (edge)
- internal node
- clade
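The four parts can be made concrete with a toy tree. The sketch below assumes a very simple representation (a string is a tip, a tuple is an internal node); the tree and the helper names `tips`, `clades`, and `count_branches` are invented for illustration.

```python
# A rooted toy tree as nested tuples: a string is a tip, a tuple is an internal node.
toy_tree = (("human", "chimp"), ("mouse", "rat"))


def tips(node):
    """Return the tip (leaf) names below a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in tips(child)]


def clades(node):
    """Return the tip set under every internal node; each set is one clade."""
    if isinstance(node, str):
        return []
    result = [frozenset(tips(node))]
    for child in node:
        result.extend(clades(child))
    return result


def count_branches(node):
    """Count branches (edges): one per parent-child connection."""
    if isinstance(node, str):
        return 0
    return sum(1 + count_branches(child) for child in node)


all_tips = tips(toy_tree)
all_clades = clades(toy_tree)
n_branches = count_branches(toy_tree)
```

Here the tree has four tips, three internal nodes (including the root), and six branches.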
Cladogram vs phylogram?
cladogram: branch lengths meaningless
phylogram: branch lengths proportional to amount of inferred evolutionary change
What is an unrooted tree?
Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions about ancestry.
How can you root a tree?
using an outgroup
(also possible in similar way with paralog(s))
using “midpoint rooting”
What is an unresolved tree?
- we don’t know the relationship of all branches
- multifurcating / non-binary (a polytomy)
due to networks or incompatible gene trees
What is a polytomy?
hard/soft?
polytomy:
unresolved node
- hard polytomy: true (near-)simultaneous divergence into more than two lineages, i.e. rapid divergence
- soft polytomy: binary branching pattern not known, due to insufficient or conflicting data
What is a gene tree?
What does it depict? Which events?
Phylogeny depicting the evolution of homologous sequences
events:
- speciation
- duplication
- loss
- horizontal transfer
- hybridization
- introgression
- incomplete lineage sorting
- …
phylogeny: a hypothesis that depicts the historical relationships among entities in a branching diagram –> for gene tree those entities are functional domains, gene sequences, or genomic regions (not genomes or organisms!)
Define ortholog
diverged after a speciation event
(last common ancestor is a speciation node)
Define paralog
diverged after a duplication event
(last common ancestor is a duplication node)
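The two definitions reduce to one check: find the last common ancestor of the two genes and look at its event type. The sketch below assumes a gene tree whose internal nodes are already labelled as speciation or duplication; the tree, gene names, and function names are hypothetical.

```python
# Hypothetical labelled gene tree: internal nodes are (event, left, right),
# tips are gene names. One duplication precedes the human/mouse speciation.
toy_gene_tree = (
    "duplication",
    ("speciation", "human_A", "mouse_A"),
    ("speciation", "human_B", "mouse_B"),
)


def tip_names(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return tip_names(left) | tip_names(right)


def lca_event(node, gene1, gene2):
    """Return the event label at the last common ancestor of two distinct tips."""
    if isinstance(node, str):
        raise ValueError("gene1 and gene2 must be two distinct tips")
    event, left, right = node
    # If both genes sit inside one child, the LCA is further down.
    for child in (left, right):
        if {gene1, gene2} <= tip_names(child):
            return lca_event(child, gene1, gene2)
    if {gene1, gene2} <= tip_names(node):
        return event
    raise ValueError("both genes must be tips of the tree")


def classify(tree, gene1, gene2):
    """Orthologs if the LCA is a speciation node, paralogs if it is a duplication node."""
    labels = {"speciation": "orthologs", "duplication": "paralogs"}
    return labels[lca_event(tree, gene1, gene2)]


pair_same_copy = classify(toy_gene_tree, "human_A", "mouse_A")
pair_across_copies = classify(toy_gene_tree, "human_A", "mouse_B")
```

In this toy tree, `human_A` and `mouse_A` are orthologs, while `human_A` and `mouse_B` are paralogs even though they come from different species.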
Gene loss: how can this occur?
frequent loss (pseudogenization, physical loss)
Define In-/Out-paralogy
in-paralogs: paralogous genes arising from lineage-specific duplication(s) after a given speciation event
out-paralogs: paralogous genes arising from duplication(s) before that speciation event
What can you use a gene tree as?
use a gene tree as a gene tree
* evolution of genes and gene function
* depending on the scope and question, requires orthologs or orthologs & paralogs, and always the most comprehensive set available
use a gene tree as a proxy for a species tree
* evolution of organisms: systematics, conservation biology, historical perspective
* requires strict orthologs (e.g., rRNA) or methods specifically designed to accommodate paralogs
How are phylogenies inferred?
Phylogenetic inference = Statistical inference
sequences evolve along trees via stochastic processes
hypotheses (statistical models!) about these stochastic processes are used to estimate the evolutionary history from sequence data
Describe two types of substitution models for inferring phylogenies + examples
Substitution models:
models of amino acid replacement
- pre-computed
- eg PAM, BLOSUM, WAG, many more
models of DNA (nucleotide) substitution
- different general models exist
- rates & base composition: often estimated from the data
- e.g., Jukes-Cantor
What rates does Jukes Cantor have?
simplest one: equal frequencies, same mutation rates
subst rates: a=b=c=d=e=f (all same rate)
base frequencies: πA = πC = πG = πT = 1/4
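As a worked illustration, the Jukes-Cantor assumptions give a rate matrix with one shared off-diagonal rate, and the standard correction d = -3/4 · ln(1 - 4p/3) that turns the observed proportion of differing sites p into an estimated number of substitutions per site. The function names and the two toy sequences below are assumptions for the example.

```python
import math

BASES = "ACGT"


def jc69_rate_matrix(mu=1.0):
    """Rate matrix under Jukes-Cantor: every off-diagonal entry is the same.

    Scaled so that the total rate of leaving any base is `mu`.
    """
    rate = mu / 3.0
    return [[-mu if i == j else rate for j in range(4)] for i in range(4)]


def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring gaps and ambiguous characters."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned (equal length)")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a in BASES and b in BASES]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance; undefined for p >= 0.75 (saturation)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


q_matrix = jc69_rate_matrix()
p = p_distance("ACGTACGTAC", "ACGTACGTAT")  # toy sequences, 1 difference in 10 sites
d = jc69_distance(p)
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.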
What is rate variation and how is it commonly modelled?
(aka rate heterogeneity)
- rate variation across sites
- some evolve slower, some proportionally faster
- among regions/sites, within genes: conserved domains/motifs / first vs. third codon positions / non-coding vs. coding
- among genes: slow and fast evolving genes!
commonly modeled using the gamma distribution
What is the gamma distribution and what is it used for?
The gamma distribution
- models rate heterogeneity over alignment columns
- is implemented in many software packages
- is determined by the shape parameter “alpha”
- alpha < 1: strong among-site variation
- higher alpha: lower rate heterogeneity (looks more like normal dist)
you get very different shapes depending on alpha
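The effect of alpha can be checked numerically. The sketch below assumes the common convention of fixing the mean rate at 1 (shape alpha, scale 1/alpha), so the variance across sites is 1/alpha; the function name, the seed, and the number of sites are arbitrary choices for the example.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw per-site relative rates from a gamma distribution with mean 1.

    Shape = alpha and scale = 1/alpha give mean 1 and variance 1/alpha.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


# Low alpha: strong among-site variation. High alpha: rates cluster around 1.
rates_low_alpha = sample_site_rates(alpha=0.5, n_sites=10000)
rates_high_alpha = sample_site_rates(alpha=5.0, n_sites=10000)

var_low_alpha = statistics.pvariance(rates_low_alpha)
var_high_alpha = statistics.pvariance(rates_high_alpha)
```

With alpha = 0.5 most sites evolve very slowly and a few evolve very fast; with alpha = 5 the rates are much more similar to each other.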
How do you select a model?
especially important for statistical methods for phylogenetic estimation! (eg in 1st lab on phylogenetics)
statistical tests
- which is the best evolutionary model? (LRT, AIC, BIC)
* is there rate heterogeneity?
* do different alignment regions evolve under different evolutionary models?
* software: ProtTest, ModelTest, most frequently used: PartitionFinder, …
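The information criteria can be computed by hand from a model's log-likelihood: AIC = 2k - 2 lnL and BIC = k · ln(n) - 2 lnL, where k is the number of free parameters and n the number of alignment sites; the lowest score wins. The log-likelihoods, parameter counts, and alignment length below are made-up numbers for illustration, not results from any dataset. The parameter counts include only substitution-model parameters, since branch lengths are shared by both models on the same tree and do not change the ranking.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: penalises parameters more as n grows."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits on the same alignment and tree: (log-likelihood, free parameters).
candidates = {
    "JC": (-5230.0, 0),
    "JC+G": (-5105.0, 1),  # one extra parameter: the gamma shape alpha
}
n_sites = 1200  # assumed alignment length

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
```

In this toy case both criteria prefer the model with gamma rate heterogeneity, because the gain in likelihood far outweighs the penalty for one extra parameter.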