12 | Phylogenetics I Flashcards

1
Q

What do we need / need to ask in order to get from sequences to a phylogeny?

A

Sequences

–>

MSA:
- which sequences, which MSA method?
- alignment/data appropriate for question?
- use the entire alignment?

–>

Algorithm/software to infer phylogeny from MSA
- which method?
- can we use the entire alignment, or do we need to remove or mask something?

–>

Phylogeny
- gene trees / species trees
- statistical support?
- biological interpretation?

2
Q

What is an optimal alignment for phylogenetics?

And what does this mean in more detail?

A

an optimal alignment is evolutionarily optimal!

= aligned residues are homologous,
share a common ancestry
–> positional homology

MSA, in the context of evolutionary analysis:
a hypothesis about the positional homology of
residues in homologous sequences

3
Q

Define positional homology and phylogenetic signal

A

Positional homology
- aligned residues share a common ancestral residue in the ancestral sequences
- changes in the columns correspond to mutations
- these changes contain the phylogenetic signal: shared differences that trace the branching history

4
Q

In what three ways can alignment regions be described with regard to how they influence a phylogeny, and how should each be treated?

A

positionally homologous
–> contain the phylogenetic signal

uninformative
- highly divergent, many gaps
- correctly or incorrectly aligned
- contain no/little phylogenetic signal
–> not necessary to exclude

incorrectly aligned
- positional homology violated
- e.g., non-homologous sequences, misalignment
- leads to incorrect result
–> should be excluded for best results

5
Q

Removing / masking sequences:

What are the criteria for this?

What are the advantages?

Disadvantages?

A

trimming non-phylogenetic signal from alignments

criteria:
- gaps
- BLOSUM score per region?
–> different approaches

advantages:
assumed to improve accuracy of:
- tree topology
- branch lengths
- tests for selection, …

disadvantages:
- might also inadvertently remove phylogenetic signal
- can also lead to decreased accuracy
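
The gap criterion can be sketched in a few lines. This is only an illustration, not how any particular trimming tool works: the function name `trim_gappy_columns`, the 50% threshold and the toy alignment are all assumptions made for the example.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds the threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    The threshold is an arbitrary illustrative choice.
    """
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    n_cols = len(seqs[0])
    if any(len(s) != n_cols for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if the share of gap characters is small enough.
    keep = [
        i for i in range(n_cols)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Made-up toy alignment: columns 3, 6 and 7 are 75% gaps.
toy_alignment = {
    "seqA": "MK-LV--A",
    "seqB": "MKTLV--A",
    "seqC": "MR-LI-GA",
    "seqD": "MK-LVT-A",
}
trimmed = trim_gappy_columns(toy_alignment, max_gap_fraction=0.5)
for name, seq in trimmed.items():
    print(name, seq)
```

Note that the three removed columns each contain one real residue, which is exactly the risk named under disadvantages: trimming can discard signal along with noise.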

6
Q

Anatomy of a phylogeny

What is the end of a branch called?

A

tip, leaf, terminal node/vertex

7
Q

Anatomy of a phylogeny

Name the 4 parts

A
  • tip (leaf, terminal node/vertex)
  • branch (edge)
  • internal node
  • clade
8
Q

Cladogram vs phylogram?

A

cladogram: branch lengths meaningless
phylogram: branch lengths proportional to amount of inferred evolutionary change

9
Q

What is an unrooted tree?

A

Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions about ancestry.

10
Q

How can you root a tree?

A

using an outgroup
(also possible in similar way with paralog(s))

using “midpoint rooting”

11
Q

What is an unresolved tree?

A
  • we don’t know the relationship of all branches
  • multifurcating / non-binary (a polytomy)

due to networks or incompatible gene trees

12
Q

What is a polytomy?

hard/soft?

A

polytomy:
unresolved node

  • hard polytomy: a true multifurcation, caused by rapid (near-simultaneous) divergence into more than two lineages
  • soft polytomy: binary branching pattern not known, due to insufficient or conflicting data
13
Q

What is a gene tree?

What does it depict? Which events?

A

Phylogeny depicting the evolution of homologous sequences

events:
- speciation
- duplication
- loss
- horizontal transfer
- hybridization
- introgression
- incomplete lineage sorting
- …

phylogeny: a hypothesis that depicts the historical relationships among entities in a branching diagram –> for gene tree those entities are functional domains, gene sequences, or genomic regions (not genomes or organisms!)

14
Q

Define ortholog

A

diverged after a speciation event
(last common ancestor is a speciation node)

15
Q

Define paralog

A

diverged after a duplication event
(last common ancestor is a duplication node)

16
Q

Gene loss: how can this occur?

A

frequent loss (pseudogenization, physical loss)

17
Q

Define In-/Out-paralogy

A

in-paralogs: paralogous genes arising from lineage-specific duplication(s) after a given speciation event

out-paralogs: paralogous genes arising from duplication(s) before a given speciation event

18
Q

What can you use a gene tree as?

A

use a gene tree as a gene tree
* evolution of genes and gene function
* depending on the scope and question, requires orthologs or orthologs & paralogs, and always the most comprehensive set available

use a gene tree as a proxy for a species tree
* evolution of organisms: systematics, conservation biology, historical perspective
* requires strict orthologs (e.g., rRNA) or methods specifically designed to accommodate paralogs

19
Q

How are phylogenies inferred?

A

Phylogenetic inference = Statistical inference

sequences evolve along trees via stochastic processes

hypotheses (statistical models!) about these stochastic processes are used to estimate the evolutionary history from sequence data

20
Q

Describe two types of substitution models for inferring phylogenies + examples

A

Substitution models:

models of amino acid replacement
- pre-computed
- eg PAM, BLOSUM, WAG, many more

models of DNA (nucleotide) substitution
- different general models exist
- rates & base composition: often estimated from the data
- eg Jukes-Cantor (JC)

21
Q

What rates does Jukes-Cantor have?

A

simplest model: equal base frequencies, one substitution rate for all changes

substitution rates: a = b = c = d = e = f (all the same rate)

base frequencies: πA = πC = πG = πT
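
A minimal sketch of what "all rates equal" means in practice. The rate `alpha`, the time `t` and the function names are illustrative assumptions; the closed-form probabilities are the standard Jukes-Cantor result for a per-change rate `alpha`.

```python
import math

BASES = "ACGT"


def jc_rate_matrix(alpha=1.0):
    """Instantaneous rate matrix Q: every off-diagonal rate equals alpha.

    Each diagonal entry is -3 * alpha so that every row sums to zero.
    """
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def jc_transition_probability(same_base, t, alpha=1.0):
    """Probability of observing the same base (or one specific other base) after time t."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * decay if same_base else 0.25 - 0.25 * decay


Q = jc_rate_matrix(alpha=1.0)
for base, row in zip(BASES, Q):
    print(base, row)

# After a long time every base is equally likely, whatever the starting base.
print(jc_transition_probability(True, t=100.0), jc_transition_probability(False, t=100.0))
```

Because there is a single rate and the base frequencies are fixed at 1/4, the model has no free parameters to estimate from the data.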

22
Q

What is rate variation and how is it commonly modelled?

A

(aka rate heterogeneity)

  • rate variation across sites
    • some evolve slower, some proportionally faster
    • among regions/sites, within genes: conserved domains/motifs / first vs. third codon positions / non-coding vs. coding
  • among genes: slow and fast evolving genes!

commonly modeled using the gamma distribution

23
Q

What is the gamma distribution and what is it used for?

A

The gamma distribution
- models rate heterogeneity over alignment columns
- is implemented in many software packages
- is determined by the shape parameter “alpha”
- alpha < 1: strong among-site variation
- higher alpha: lower rate heterogeneity (the distribution looks more like a normal distribution)

you get very different shapes depending on alpha
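
The effect of alpha can be seen by sampling per-site rates. This is a rough sketch using only the standard library; fixing the mean rate at 1 (scale = 1/alpha) is a common convention assumed here, and the function name, seed and sample size are arbitrary.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative per-site rates from a gamma distribution with mean 1.

    shape = alpha and scale = 1 / alpha give mean 1 and variance 1 / alpha.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, n_sites=20000)
    slow_sites = sum(r < 0.1 for r in rates) / len(rates)
    print(
        f"alpha = {alpha}: mean = {statistics.mean(rates):.2f}, "
        f"variance = {statistics.variance(rates):.2f}, "
        f"share of very slow sites = {slow_sites:.2f}"
    )
```

With a small alpha most sites are nearly invariant and a few evolve very fast; with a large alpha the rates cluster around the mean.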

24
Q

How do you select a model?

A

especially important for statistical methods for phylogenetic estimation! (eg in 1st lab on phylogenetics)

statistical tests
- which is the best evolutionary model? (LRT, AIC, BIC)
- is there rate heterogeneity?
- do different alignment regions evolve under different evolutionary models?
- software: ProtTest, ModelTest, most frequently used: PartitionFinder, …
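
How AIC and BIC rank models can be sketched as follows. The log-likelihoods and the alignment length are invented numbers for illustration only. The parameter counts cover only the substitution model; branch lengths would add the same number to every model on a fixed tree and would not change the ranking.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: penalises parameters more for long alignments."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits of three nucleotide models to the same alignment (made-up lnL).
candidates = {
    "JC": {"lnL": -2450.0, "k": 0},
    "HKY": {"lnL": -2390.0, "k": 4},
    "GTR": {"lnL": -2388.5, "k": 8},
}
n_sites = 1000  # assumed alignment length

scores = {
    name: (aic(fit["lnL"], fit["k"]), bic(fit["lnL"], fit["k"], n_sites))
    for name, fit in candidates.items()
}
best_aic = min(scores, key=lambda name: scores[name][0])
best_bic = min(scores, key=lambda name: scores[name][1])
for name, (a, b) in scores.items():
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")
print("best by AIC:", best_aic, "| best by BIC:", best_bic)
```

In this made-up case the most complex model has the highest likelihood but is not selected, because its small gain does not justify the extra parameters.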

25
Q

What are the two general approaches to computing a phylogeny?

Which is most often done in practice and why?

A

distance methods - most often done in practice because very fast

character-based methods

26
Q

Explain basic concept of distance methods for computing a phylogeny and give an example of a method.

A
  • Like a cookbook recipe
  • very fast greedy heuristic
  • use MSAs to make distance matrix
  • compute only a single tree

eg neighbor joining (NJ)

27
Q

What are character based methods for computing a phylogeny? Give some examples

A

optimality based approaches: evaluate many different trees and pick the optimal one

can be broken down further into:

  • parsimony
  • statistical methods
    • eg maximum likelihood (ML)
    • eg Bayesian inference
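
The "evaluate many trees and pick the optimal one" idea can be illustrated with parsimony on four taxa, where only three topologies exist. This sketch uses the Fitch algorithm to count the minimum number of changes per column; the alignment and the tuple-based tree encoding are assumptions made for the example.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one column.

    tree: a tip name (str) or a 2-tuple of subtrees.
    states: dict mapping tip name -> character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-column minimum number of changes over the whole alignment."""
    n_cols = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_cols)
    )


# Made-up alignment and the three possible groupings of four taxa.
toy = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTAT"}
topologies = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {label: parsimony_score(tree, toy) for label, tree in topologies.items()}
best = min(scores, key=scores.get)
print(scores, "-> most parsimonious:", best)
```

Maximum likelihood and Bayesian inference follow the same search-and-score logic, but score trees with a substitution model instead of a change count.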
28
Q

What do you need to compute a distance matrix?

A
  • MSA
  • substitution matrix / model
    • for AAs: substitution matrix eg BLOSUM, PAM
    • for NTs: available models eg JC
  • pairwise differences: Pdiff(t) = observed proportion of pairwise differences
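
A small sketch of the first step, computing Pdiff for every pair of sequences in an MSA. Skipping columns where either sequence has a gap is one possible convention and is an assumption here, as are the function names and the toy sequences.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Observed proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def p_distance_matrix(alignment):
    """Symmetric matrix (dict of dicts) of pairwise p-distances."""
    names = list(alignment)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(alignment[a], alignment[b])
    return matrix


# Made-up nucleotide alignment.
toy = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAT", "s3": "ACGAAC-TTT"}
D = p_distance_matrix(toy)
for name, row in D.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

These raw proportions would then be converted to evolutionary distances with a substitution model before building the tree.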
29
Q

Formula for number of substitutions per site over time t - JC distance?

A

K(t) = -3/4 × log(1 - 4/3 × Pdiff(t))
(log = natural logarithm)

30
Q

What is the pairwise difference if two sequences are 97% similar? and 70%?

What are the resulting K(t) values?

What can be said when comparing these two results?

A

97% similar –> Pdiff(t) = 3% = 0.03
-0.75 × log(1-(1.33 × 0.03)) = 0.0305

70% similar –> Pdiff(t) = 30% = 0.3
-0.75 × log(1-(1.33 × 0.3)) = 0.3819

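low Pdiff(t): K(t) is almost identical to Pdiff(t), because few sites have been hit more than once

high Pdiff(t): K(t) is clearly larger than Pdiff(t), because the correction for multiple substitutions grows with divergence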
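
The two cases can be recomputed with a short function. This is a sketch with an assumed function name. It uses the exact factor 4/3, which gives about 0.0306 and 0.3831; rounding 4/3 to 1.33 shifts the last digits slightly.

```python
import math


def jc_distance(p_diff):
    """Jukes-Cantor corrected distance (substitutions per site) from an observed p-distance."""
    if not 0 <= p_diff < 0.75:
        # At 0.75 two sequences are as different as random ones; the distance is undefined.
        raise ValueError("p_diff must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4 / 3) * p_diff)


for p in (0.03, 0.30):
    k = jc_distance(p)
    print(f"Pdiff = {p:.2f}  K = {k:.4f}  K / Pdiff = {k / p:.3f}")
```

The ratio K / Pdiff makes the comparison explicit: it stays close to 1 for similar sequences and grows as they diverge.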

31
Q

What do DNA substitution models generally correct for?

Why?

A

For multiple substitutions at same site.

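with only four possible bases, the chance that the same site is hit by more than one substitution is high
–> one observed difference may stand for multiple substitutions (and a back mutation leaves no visible difference at all)
this is why distances are corrected to be higher than the observed differences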

32
Q

What is the next step to get a phylogeny once we have a matrix of distances?

A

we match distances to topology & branch lengths

true evolutionary distances would fit a single tree

BUT stochastic error & inappropriate models usually result in deviation from true evolutionary distances ➡ distances don’t fit the tree exactly, need to be approximated!

33
Q

What are the advantages and disadvantages of a distance method?

A

Advantages
- simple
- very fast greedy heuristic
- it often works correctly

Disadvantages
- probably won’t find a tree that exactly matches the distances
–> branch lengths on neighbour joining tree should be taken with grain of salt
- NJ can produce negative branch lengths, which don’t make sense biologically
- you lose information (the alignment is reduced to pairwise distances)
- you only get one tree

34
Q

Neighbor Joining

outline

A

initialize
- matrix of pairwise distances under a substitution model
- star phylogeny
progressively add nodes until the tree is fully resolved
- compute pairwise average & minimal distances
- merge least distant tips into a new node
- compute distances from new node to all remaining tips
- repeat last two steps until only two nodes remain
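
The outline above can be sketched as a compact implementation. It follows the standard neighbor-joining criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the summed distance from i to all other nodes. The function name, the use of Newick substrings as node labels and the example distances are illustrative choices, not a reference implementation.

```python
def neighbor_joining(names, dist):
    """Build an unrooted NJ tree and return it as a Newick string.

    names: list of tip labels; dist: dict of dicts with symmetric pairwise distances.
    Internal nodes are labelled by the Newick text of the clade they subtend.
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Find the pair with the smallest Q value (ties: first pair found).
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the two merged nodes to their new parent
        # (these can come out negative for noisy distances).
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        # Distances from the new node to all remaining nodes.
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    # Two nodes remain, joined by one branch; where the string is "rooted" is arbitrary.
    return f"({a}:{d[a][b]:.4f},{b}:0.0000);"


# Made-up distances for five taxa.
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
taxa = ["a", "b", "c", "d", "e"]
matrix = {t: {u: 0.0 for u in taxa} for t in taxa}
for (t, u), value in pairs.items():
    matrix[t][u] = matrix[u][t] = float(value)

nj_tree = neighbor_joining(taxa, matrix)
print(nj_tree)
```

The function commits to one merge per iteration and never revisits it, which is why the method is fast and why it returns only a single tree.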

35
Q

Newick format

What are the different punctuation marks and what do they mean?

A

( ) encloses 2 sisters / a group / a clade
, shown between two sisters
: follows a leaf or node and indicates distance backwards to next node
; means end of tree

36
Q

Newick format

give basic example and describe it

A

((gene1:dist1,gene2:dist2):dist3,gene3:dist4)clade1:dist5;

gene1 and gene2 are sisters
gene1+gene2 and gene3 are sister groups
the distance from clade1 to the next node backwards is dist5
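
To see how the punctuation maps onto tree structure, the example can be generated from a nested data structure. The dictionary layout (`name`, `length`, `children`), the function name and the numeric distances standing in for dist1 to dist5 are assumptions for illustration.

```python
def newick(node):
    """Serialise a nested dict into Newick text (without the final semicolon)."""
    children = node.get("children", [])
    # Parentheses enclose the children of a node, commas separate sisters.
    text = "(" + ",".join(newick(child) for child in children) + ")" if children else ""
    text += node.get("name", "")
    # A colon introduces the distance back to the parent node.
    if node.get("length") is not None:
        text += f":{node['length']}"
    return text


tree = {
    "name": "clade1",
    "length": 0.5,
    "children": [
        {
            "length": 0.3,
            "children": [
                {"name": "gene1", "length": 0.1},
                {"name": "gene2", "length": 0.2},
            ],
        },
        {"name": "gene3", "length": 0.4},
    ],
}
newick_string = newick(tree) + ";"  # the semicolon ends the tree
print(newick_string)
```

The unnamed inner node holds gene1 and gene2 as sisters, and its own distance (0.3 here) plays the role of dist3.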

37
Q

EXAM QUESTION

In the lectures, we went over the concept of a guide tree on two occasions. Describe the use of guide trees in each of those contexts. (2019)

2 topics where guide tree was mentioned and how (2020)

A

MSA and distance matrices for phylogenies eg NJ?

38
Q

How can you root a tree using an outgroup?

A

Rooting using an outgroup:

  • outgroup: a sequence that is more distantly related to each of the ingroup species than these are to each other
  • requires prior / independent information
  • root is placed on the branch where the outgroup joins
  • (also possible in similar way with paralog(s))
39
Q

What is midpoint rooting?

A

Rooting a tree using “midpoint rooting”
- place the root halfway between the two most distant taxa (calculate distance from tip to tip)
- assumes a molecular clock (tree has constant rates of evolution)
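
Finding the midpoint can be sketched on a small unrooted tree stored as an adjacency dictionary. The tree, its branch lengths and the function names are invented for the example. The function only reports where the root would go and does not re-root the tree.

```python
def distances_from(adjacency, start):
    """Return {node: (distance from start, path from start)} by walking the tree."""
    found = {start: (0.0, [start])}
    stack = [start]
    while stack:
        node = stack.pop()
        dist, path = found[node]
        for neighbor, length in adjacency[node].items():
            if neighbor not in found:
                found[neighbor] = (dist + length, path + [neighbor])
                stack.append(neighbor)
    return found


def midpoint(adjacency):
    """Return (node_a, node_b, offset): the root lies on branch a-b, offset away from a."""
    tips = [node for node, nbrs in adjacency.items() if len(nbrs) == 1]
    best = (-1.0, None)
    for tip in tips:
        reach = distances_from(adjacency, tip)
        for other in tips:
            if reach[other][0] > best[0]:
                best = reach[other]
    total, path = best
    # Walk along the longest tip-to-tip path until half its length is reached.
    walked = 0.0
    for a, b in zip(path, path[1:]):
        length = adjacency[a][b]
        if walked + length >= total / 2:
            return a, b, total / 2 - walked
        walked += length


# Made-up unrooted tree: tips A-D, internal nodes X and Y.
edges = [("A", "X", 1.0), ("B", "X", 3.0), ("X", "Y", 1.0), ("C", "Y", 2.0), ("D", "Y", 5.0)]
adjacency = {}
for u, v, length in edges:
    adjacency.setdefault(u, {})[v] = length
    adjacency.setdefault(v, {})[u] = length

root_position = midpoint(adjacency)
print(root_position)
```

Here the most distant tips are B and D (distance 9), so the root falls 4.5 from each of them, on the long branch leading to D. If D simply evolves faster than the rest, this placement is wrong, which is the weakness of the clock assumption.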