12 | Phylogenetics I Flashcards by Stevie Davies

What do we need / need to ask in order to get from sequences to a phylogeny?

Sequences

–>

MSA:
- which sequences, which MSA method?
- alignment/data appropriate for question?
- use the entire alignment?

–>

Algorithm/software to infer phylogeny from MSA
- which method?
- can we use entire alignment or need to remove or mask something?

–>

Phylogeny
- gene trees / species trees
- statistical support?
- biological interpretation?

What is an optimal alignment for phylogenetics?

And what does this mean in more detail?

what is an optimal alignment?

evolutionary optimal!

= aligned residues are homologous,
share a common ancestry
–> positional homology

MSA, in the context of evolutionary analysis:
a hypothesis about the positional homology of
residues in homologous sequences

Define positional homology and phylogenetic signal

Positional homology
- aligned residues share a common ancestral residue in the ancestral sequences
- changes in the columns correspond to mutations
- these contain the phylogenetic signal

In what three ways can alignment regions be described with regard to how they influence a phylogeny, and how should each be treated?

positionally homologous
–> contain the phylogenetic signal

uninformative
- highly divergent, many gaps
- correctly or incorrectly aligned
- contain no/little phylogenetic signal
–> not necessary to exclude

incorrectly aligned
- positional homology violated
- e.g., non-homologous sequences, misalignment
- leads to incorrect result
–> should be excluded for best results

Removing / masking sequences:

What are the criteria for this?

What are the advantages?

Disadvantages?

trimming non-phylogenetic signal from alignments

criteria:
- gaps
- BLOSUM score per region?
–> different approaches

advantages:
assumed to improve accuracy of:
- tree topology
- branch lengths
- tests for selection, …

disadvantages:
- might also inadvertently remove phylogenetic signal
- can also lead to decreased accuracy
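
A minimal sketch of gap-based trimming, one of the criteria listed above. The function name `trim_gappy_columns`, the toy alignment and the 50% threshold are illustrative assumptions, not taken from the lecture or from any particular trimming tool.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds the threshold.

    `alignment` is a list of equal-length aligned sequences (strings).
    The 0.5 default is an arbitrary illustrative cut-off.
    """
    n_seqs = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        if sum(seq[col] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy MSA: the fourth column is 75% gaps and gets removed.
msa = ["ATG-CA", "ATGTCA", "A-G-CA", "ATG-C-"]
print(trim_gappy_columns(msa))
```

Note that the removed column still contained one real residue, which illustrates the disadvantage above: trimming can also discard signal.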

Anatomy of a phylogeny

What is the end of a branch called?

tip, leaf, terminal node/vertex

Anatomy of a phylogeny

Name the 4 parts

tip (leaf, terminal node/vertex)
branch (edge)
internal node
clade

Cladogram vs phylogram?

cladogram: branch lengths meaningless
phylogram: branch lengths proportional to amount of inferred evolutionary change

What is an unrooted tree?

Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions about ancestry.

How can you root a tree?

using an outgroup
(also possible in similar way with paralog(s))

using “midpoint rooting”

What is an unresolved tree?

we don’t know the relationship of all branches
multifurcating / non-binary (a polytomy)

due to networks or incompatible gene trees

What is a polytomy?

hard/soft?

polytomy:
unresolved node

hard polytomy: rapid divergence
soft polytomy: binary branching pattern not known, due to insufficient or conflicting data

What is a gene tree?

What does it depict? Which events?

Phylogeny depicting the evolution of homologous sequences

events:
- speciation
- duplication
- loss
- horizontal transfer
- hybridization
- introgression
- incomplete lineage sorting
- …

phylogeny: a hypothesis that depicts the historical relationships among entities in a branching diagram –> for gene tree those entities are functional domains, gene sequences, or genomic regions (not genomes or organisms!)

Define ortholog

diverged after a speciation event
(last common ancestor is a speciation node)

Define paralog

diverged after a duplication event
(last common ancestor is a duplication node)

Gene loss: how can this occur?

frequent loss (pseudogenization, physical loss)

Define In-/Out-paralogy

in-paralogs: paralogous genes arising from lineage-specific duplication(s) after a given speciation event
out-paralogs: paralogous genes arising from duplication(s) before a given speciation event

What can you use a gene tree as?

use a gene tree as a gene tree
* evolution of genes and gene function
* depending on the scope and question, requires orthologs or orthologs & paralogs, and always the most comprehensive set available

use a gene tree as a proxy for a species tree
* evolution of organisms: systematics, conservation biology, historical perspective
* requires strict orthologs (e.g., rRNA) or methods specifically designed to accommodate paralogs

How are phylogenies inferred?

Phylogenetic inference = Statistical inference

sequences evolve along trees via stochastic processes

hypotheses (statistical models!) about these stochastic processes are used to estimate the evolutionary history from sequence data

Describe two types of substitution models for inferring phylogenies + examples

Substitution models:

models of amino acid replacement
- pre-computed
- eg PAM, BLOSUM, WAG, many more

models of DNA replacement
- different general models exist
- rates & base composition: often estimated from the data
- eg Jukes Cantor

What rates does Jukes Cantor have?

simplest one: equal frequencies, same mutation rates

subst rates: a=b=c=d=e=f (all same rate)

base frequencies: πA = πC = πG = πT

What is rate variation and how is it commonly modelled?

(aka rate heterogeneity)

rate variation across sites
- some evolve slower, some proportionally faster
- among regions/sites, within genes: conserved domains/motifs / first vs. third codon positions / non-coding vs. coding
- among genes: slow and fast evolving genes!

commonly modeled using the gamma distribution

What is the gamma distribution and what is it used for?

The gamma distribution
- models rate heterogeneity over alignment columns
- is implemented in many software packages
- is determined by the shape parameter “alpha”
- alpha < 1: strong among-site variation
- higher alpha: lower rate heterogeneity (looks more like normal dist)

you get very different shapes depending on alpha
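
A small sketch of how alpha changes the spread of site rates. It assumes the usual convention that the gamma distribution is scaled to a mean rate of 1 (shape = alpha, scale = 1/alpha), so the variance is 1/alpha. The function name `sample_site_rates`, the seed and the number of sites are illustrative choices.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw per-site relative rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape = alpha, scale = 1/alpha, so mean = 1 and variance = 1/alpha
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20000)
    print(
        f"alpha={alpha}: mean={statistics.mean(rates):.2f}, "
        f"variance={statistics.variance(rates):.2f}"
    )
```

With a small alpha most sites get rates near zero and a few get very high rates; with a larger alpha the rates cluster around 1.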

How do you select a model?

especially important for statistical methods for phylogenetic estimation! (eg in 1st lab on phylogenetics)

statistical tests
- which is the best evolutionary model? (LRT, AIC, BIC)
* is there rate heterogeneity?
* do different alignment regions evolve under different evolutionary models?
* software: ProtTest, ModelTest; most frequently used: PartitionFinder, …
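
A rough sketch of how AIC and BIC rank candidate models, using the standard definitions AIC = 2k - 2 lnL and BIC = k ln(n) - 2 lnL. The log-likelihoods, parameter counts and alignment length below are made-up illustrative numbers, and only substitution-model parameters are counted (a simplification that assumes both models are fitted on the same tree).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model name -> (log-likelihood, free model parameters)
candidates = {"JC": (-2510.0, 0), "JC+G": (-2480.0, 1)}
n_sites = 1000  # hypothetical alignment length

for name, (lnl, k) in candidates.items():
    print(name, round(aic(lnl, k), 1), round(bic(lnl, k, n_sites), 1))

best = min(candidates, key=lambda m: aic(*candidates[m]))
print("preferred by AIC:", best)
```

In this toy case the extra gamma parameter improves the likelihood by far more than the penalty, so the model with rate heterogeneity is preferred.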

What are the two general approaches to computing a phylogeny? Which is most often done in practice and why?

distance methods
- most often done in practice because very fast

character-based methods

Explain basic concept of distance methods for computing a phylogeny and give an example of a method.

- like a cookbook recipe
- very fast greedy heuristic
- use the MSA to make a distance matrix
- compute only a single tree
- eg neighbor joining (NJ)

What are character based methods for computing a phylogeny? Give some examples

optimality-based approaches: evaluate many different trees and pick the optimal one

can be broken down further into:
- parsimony
- statistical methods
* eg maximum likelihood (ML)
* eg Bayesian inference

What do you need to compute a distance matrix?

- MSA
- substitution matrix / model
* for AAs: substitution matrix, eg BLOSUM, PAM
* for NTs: available models, eg JC
- pairwise differences: P_diff(t) = observed proportion of pairwise differences
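
A minimal sketch of the first step, computing P_diff for every pair of aligned sequences. Skipping columns where either sequence has a gap is one common convention and is an assumption here; the function names and the toy alignment are illustrative.

```python
def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, ignoring columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def p_distance_matrix(alignment):
    """Pairwise P_diff for a dict of name -> aligned sequence."""
    names = list(alignment)
    return {
        x: {y: p_distance(alignment[x], alignment[y]) for y in names}
        for x in names
    }


# Toy alignment (hypothetical sequences)
toy_msa = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "ACGAAC-TTA"}
for name, row in p_distance_matrix(toy_msa).items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

These raw proportions are what a substitution model such as JC then corrects.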

Formula for number of substitutions per site over time t - JC difference?

K(t) = -3/4 × ln(1 - 4/3 × P_diff(t))

(ln = natural logarithm)

What is the pairwise difference if two sequences are 97% similar? and 70%? What are the resulting K(t) values? What can be said when comparing these two results?

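97% similar --> P_diff(t) = 3% = 0.03
K(t) = -0.75 × ln(1 - (4/3 × 0.03)) ≈ 0.0306 (0.0305 if 4/3 is rounded to 1.33)

70% similar --> P_diff(t) = 30% = 0.3
K(t) = -0.75 × ln(1 - (4/3 × 0.3)) ≈ 0.3831 (0.3819 if 4/3 is rounded to 1.33)

Comparison:
- low P_diff(t): K(t) stays very close to the observed difference (0.0306 vs 0.03)
- high P_diff(t): K(t) is clearly larger than the observed difference (0.3831 vs 0.3), because more multiple substitutions have to be corrected for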
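
The two worked values can be checked with a short sketch of the JC correction. The function name `jc_distance` is an illustrative choice; the exact fraction 4/3 is used rather than the rounded 1.33.

```python
import math


def jc_distance(p_diff):
    """Jukes-Cantor corrected distance (substitutions per site)."""
    if not 0 <= p_diff < 0.75:
        # at or above 0.75 the logarithm is undefined (saturation)
        raise ValueError("P_diff must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p_diff)


for p in (0.03, 0.30):
    k = jc_distance(p)
    print(f"P_diff = {p:.2f} -> K = {k:.4f} (correction = {k - p:.4f})")
```

The correction grows with P_diff, which is why closely related sequences barely change while divergent ones are pushed noticeably upward.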

What do DNA substitution models generally correct for? Why?

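For multiple substitutions at the same site.
- the more two sequences have diverged, the more likely it is that a site has changed more than once (including changes back to the original base)
- such multiple substitutions show up as one difference or none, so the observed differences underestimate the true number of substitutions
--> this is why distances are corrected to be higher than the observed differences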

What is the next step to get a phylogeny once we have a matrix of distances?

we match distances to topology & branch lengths
- true evolutionary distances would fit a single tree
- BUT stochastic error & inappropriate models usually result in deviation from true evolutionary distances
➡ distances don’t fit the tree exactly and need to be approximated!

What are the advantages and disadvantages of a distance method

Advantages
- simple
- very fast greedy heuristic
- it often works correctly

Disadvantages
- probably won't find a tree that exactly matches the distances --> branch lengths on a neighbor joining tree should be taken with a grain of salt
- NJ: can produce negative branch lengths, which don't make sense biologically
- you lose data (the alignment is reduced to pairwise distances)
- you only get one tree

Neighbor Joining outline

initialize
- matrix of pairwise distances under a substitution model
- star phylogeny

progressively add nodes until the tree is fully resolved
- compute pw average & minimal distances
- merge least distant tips into a new node
- compute distances from new node to all remaining tips
- repeat these steps until only two nodes remain
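
A compact sketch of the outline above, using the standard NJ selection criterion Q(i,j) = (n-2)·d(i,j) - sum_i - sum_j to pick the pair to merge. The function name, the edge-list output format and the internal node names `node0`, `node1`, … are illustrative assumptions; the toy matrix is additive, so NJ recovers its branch lengths exactly.

```python
def neighbor_joining(distances):
    """Return the unrooted NJ tree as a list of (node, node, branch_length).

    `distances` is a dict of dicts of pairwise distances between tips.
    Tip names are assumed not to clash with the generated 'nodeN' names.
    """
    d = {x: dict(row) for x, row in distances.items()}  # work on a copy
    edges = []
    n_internal = 0
    while len(d) > 2:
        n = len(d)
        nodes = list(d)
        # sum of distances from each node to all others
        totals = {x: sum(d[x][y] for y in nodes if y != x) for x in nodes}
        # pick the pair with the smallest Q value
        best = None
        for i, x in enumerate(nodes):
            for y in nodes[i + 1:]:
                q = (n - 2) * d[x][y] - totals[x] - totals[y]
                if best is None or q < best[0]:
                    best = (q, x, y)
        _, a, b = best
        new = f"node{n_internal}"
        n_internal += 1
        # branch lengths from the merged pair to the new internal node
        len_a = 0.5 * d[a][b] + (totals[a] - totals[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, new, len_a), (b, new, len_b)]
        # distances from the new node to every remaining node
        new_row = {
            c: 0.5 * (d[a][c] + d[b][c] - d[a][b])
            for c in nodes if c not in (a, b)
        }
        del d[a], d[b]
        for c, row in d.items():
            row.pop(a, None)
            row.pop(b, None)
            row[new] = new_row[c]
        d[new] = new_row
    a, b = d  # the last two nodes are joined directly
    edges.append((a, b, d[a][b]))
    return edges


names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy = {x: dict(zip(names, row)) for x, row in zip(names, rows)}
for edge in neighbor_joining(toy):
    print(edge)
```

On real data the distances are rarely additive, so the resulting branch lengths only approximate the input matrix.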

Newick format: what are the different punctuation marks and what do they mean?

( ) encloses 2 sisters / a group / a clade
, separates two sisters
: follows a leaf or node and gives the distance backwards to the next node
; marks the end of the tree

Newick format: give a basic example and describe it

((gene1:dist1,gene2:dist2):dist3,gene3:dist4)clade1:dist5;

- gene1 and gene2 are sisters
- gene1+gene2 and gene3 are sister groups
- the distance from clade1 to the next node backwards is dist5
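
A minimal recursive parser can make the role of each punctuation mark concrete. This is a sketch for simple Newick strings only (no quoted names or comments, and a malformed string may raise an error); the function name, the nested-dict output and the numeric distances in the example are illustrative assumptions.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {'name': str, 'length': float or None, 'children': list}.
    """
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":            # '(' opens a group of sisters
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":     # ',' separates sisters
                pos += 1
                children.append(parse_node())
            pos += 1                    # skip the closing ')'
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]          # label after a leaf or a ')'
        length = None
        if text[pos] == ":":            # ':' introduces the branch length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":                # ';' ends the tree
        raise ValueError("Newick string must end with ';'")
    return root


# Same shape as the card's example, with hypothetical numeric distances
tree = parse_newick("((gene1:0.1,gene2:0.2):0.05,gene3:0.3)clade1:0.0;")
print(tree["name"], [child["name"] for child in tree["children"]])
```

The unnamed first child of `clade1` is the gene1+gene2 group, whose own children are the two sisters.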

EXAM QUESTION
In the lectures, we went over the concept of a guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)
2 topics where guide tree was mentioned and how (2020)

MSA and distance matrices for phylogenies, eg NJ?

How can you root a tree using an outgroup?

Rooting using an outgroup:
- outgroup: a sequence that is more distantly related to each of the ingroup species than these are to each other
- requires prior / independent information
- root is placed on the branch where the outgroup joins
- (also possible in similar way with paralog(s))

What is midpoint rooting?

Rooting a tree using “midpoint rooting”
- place the root halfway between the two most distant taxa (calculate distance from tip to tip)
- assumes a molecular clock (tree has constant rates of evolution)
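
A small sketch of locating the midpoint on an unrooted tree given as a list of branches. The function name `midpoint`, the edge-list representation and the toy tree are illustrative assumptions; the brute-force search over all tip pairs is only meant for small examples.

```python
from collections import defaultdict


def midpoint(edges):
    """Locate the midpoint root of an unrooted tree.

    `edges` is a list of (node, node, branch_length). Returns (a, b, offset):
    the root lies on branch a-b, `offset` away from a.
    """
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    tips = [node for node in adj if len(adj[node]) == 1]

    def paths_from(start):
        # depth-first walk that records the path to every node
        paths = {start: [start]}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in paths:
                    paths[nxt] = paths[node] + [nxt]
                    stack.append(nxt)
        return paths

    def path_length(path):
        return sum(adj[a][b] for a, b in zip(path, path[1:]))

    # all tip-to-tip paths; keep the longest one
    candidates = []
    for tip in tips:
        paths = paths_from(tip)
        candidates.extend(paths[other] for other in tips if other != tip)
    longest = max(candidates, key=path_length)

    # walk along the longest path until half its length is reached
    half = path_length(longest) / 2
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        if walked + adj[a][b] >= half:
            return a, b, half - walked
        walked += adj[a][b]


# Toy tree: tips A-D, internal nodes x and y (hypothetical branch lengths)
toy_edges = [
    ("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 1.0),
    ("C", "y", 4.0), ("D", "y", 1.5),
]
print(midpoint(toy_edges))
```

Here B and C are the most distant tips (7.0 apart), so the root is placed 3.5 from each, on the long branch leading to C. If C simply evolved faster than the other tips, this placement would be misleading, which is the molecular clock assumption in action.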