W2 L3 Phylogeny P3 TH Flashcards
Multiple sequence alignment
Goal: determine character homology
* Practical: insert gaps to make sequences line up
(hypotheses about site homologies resulting from historical insertion-deletion events)
Aligning protein-coding sequences
Various algorithms to remove ambiguous sites, e.g.
- SOAP: align using multiple settings
- GBLOCKS: score for contiguous conserved positions, lack of gaps, conserved flanking positions
- BMGE: gap frequencies, entropy
- TrimAl: gap frequencies, AA similarity, consistency across alignments
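As a rough illustration of the gap-frequency criterion mentioned above, the sketch below drops alignment columns in which too many sequences have a gap. It is not the actual algorithm of GBLOCKS, BMGE, or TrimAl; the function name `trim_gappy_columns`, the 0.5 threshold, and the toy alignment are assumptions made for this example.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose fraction of gaps ('-') is at most max_gap_fraction.

    `alignment` is a list of equal-length aligned sequences (hypothetical input format).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    n_seqs = len(alignment)
    keep = []
    for i in range(length):
        gap_fraction = sum(seq[i] == "-" for seq in alignment) / n_seqs
        if gap_fraction <= max_gap_fraction:
            keep.append(i)

    # Rebuild every sequence from the retained columns only.
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy alignment (made up): column 4 has gaps in 2 of 3 sequences and is removed.
aln = ["ATG-CA", "ATGGCA", "A-G-CA"]
trimmed = trim_gappy_columns(aln, max_gap_fraction=0.5)
print(trimmed)
```

Real trimming tools combine several criteria (conservation, entropy, flanking positions), so a single gap threshold like this is only the simplest case.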
Model selection
- All models are wrong, but some are useful. Model choice is a trade-off between overfitting and underfitting
- Adding parameters to a model improves the fit but also increases the variance of the estimates
Aiming for a model that fits
DNA sequences of extant species reflect the evolutionary processes that have acted on them
* Parameters of the model of sequence evolution specify in a statistical way how past changes have led to the present diversity of DNA sequences
* Not all genes and groups of organisms have had the same history
* Sensible to evaluate many models and use the one that best suits the dataset in question
* Aim: identify model that yields good trade-off between the fit of the data to the model and the number of parameters that need to be fitted
* Fit can be measured with log-likelihood
* Score many models (bonus for good fit, penalty for many parameters)
* Pick the one with best score
Akaike Information Criterion (and related)
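The scoring idea above (bonus for good fit, penalty for many parameters) can be sketched with the standard formulas AIC = 2k - 2 lnL, AICc = AIC + 2k(k+1)/(n-k-1), and BIC = k ln(n) - 2 lnL, where k is the number of free parameters, lnL the log-likelihood, and n the sample size (here taken to be the number of alignment sites, which is an assumption). The model labels and log-likelihood values below are made up for illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def aicc(log_likelihood, n_params, n_sites):
    """AIC with small-sample correction (requires n_sites > n_params + 1)."""
    correction = 2 * n_params * (n_params + 1) / (n_sites - n_params - 1)
    return aic(log_likelihood, n_params) + correction


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: penalty grows with log of sample size."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical candidates: label -> (log-likelihood, number of free parameters).
candidates = {
    "simple": (-5200.0, 1),
    "medium": (-5100.0, 5),
    "rich": (-5098.0, 10),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy comparison the richest model has the highest log-likelihood, but its small gain in fit does not outweigh the penalty for five extra parameters, so the intermediate model gets the best (lowest) AIC.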
Other methods for model selection
- Decision theory: penalty for models that yield branch lengths deviating from those of other methods in the comparison
- Bayesian model selection: pairwise model comparison with Bayes factors
- Hierarchical likelihood ratio tests (hLRT): much used in the past, prone to serious overparameterization
Protein models
Empirical transition matrices
* These are estimated from large databases beforehand and implemented in software
* Many different matrices available, use model testing procedure to select suitable one
Phylogenetic uncertainty
A phylogenetic tree is often seen as a point estimate: the single best result
* However, every result has a “standard deviation”
* some relationships better supported by data than others
* often several trees with nearly identical likelihood
* Could thus be informative to obtain statistics indicating support for relationships
Bootstrapping
Resample the alignment columns (sites) with replacement until the original alignment length is reached
- Run an ML analysis on each bootstrap replicate
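The resampling step can be sketched as follows: draw column indices with replacement until the pseudo-alignment is as long as the original, keeping each column intact across all taxa. The function name `bootstrap_alignment`, the taxon names, and the toy sequences are assumptions for this example; the ML tree search on each replicate is left to dedicated software and is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of `alignment` (dict: taxon -> aligned sequence).

    Columns are sampled with replacement; the replicate has the original length.
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")

    # The same sampled column indices are applied to every taxon,
    # so each resampled site stays a complete alignment column.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        taxon: "".join(seq[i] for i in columns)
        for taxon, seq in alignment.items()
    }


# Toy alignment (made up) and a seeded generator for reproducibility.
aln = {
    "taxon_A": "ATGGCATC",
    "taxon_B": "ATGACATC",
    "taxon_C": "TTGACGTC",
}
rng = random.Random(42)
replicates = [bootstrap_alignment(aln, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be analysed in the same way as the original alignment, and the resulting trees compared to see how often each relationship is recovered.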
Interpreting branch support
Outgroup rooting
Include related taxon or taxa from outside the group of interest
Outgroup choice often treated as an afterthought, but should be part of experimental design
Alternative rooting methods
- Molecular clock model (strict or relaxed)
- Midpoint rooting
- Non-reversible models
No need for outgroups: tree is automatically rooted
Accuracy often (but not always) lower
If unsure, compare different methods and choose the one whose assumptions are least likely to be violated for the dataset being studied