Substitution models - phylogenetics Flashcards

1
Q

What is the pairwise distance?

A

P-distance = number of differences between two sequences / total number of sites compared (i.e. the proportion of sites at which the sequences differ). It is the simplest distance used to fill the distance matrices for UPGMA and NJ.
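
A minimal Python sketch of this calculation, assuming the two sequences are already aligned and of equal length. The function name is made up for illustration, and gaps are simply treated like any other character, which real tools usually handle more carefully.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the sites where the two sequences carry different characters.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Two differences over ten sites.
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))  # 0.2
```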

2
Q

Why does the p-distance alone not represent the evolutionary distance between two sequences?

A

The p-distance only counts the differences we can observe today. It does not take into account earlier changes at a site that were overwritten by later ones (multiple hits).

For example, if a site shows A in one sequence and T in the other, it counts as 1 difference, but the actual history may have been A to C to G to A to T = 4 changes, meaning 4 evolutionary events.

This means that the true distance is almost always underestimated, increasingly so for divergent sequences, and we have to correct for this.

3
Q

How can we correct for the p-distance that understates the evolutionary distance between two sequences?

A

The p-distance is corrected by applying an evolutionary model that accounts for the unobserved changes.

These models start from the observed differences and estimate the real distance using a model (a hypothesis) of the substitution process.

This requires that we make some assumptions about the substitution process.

So when you construct a tree, you should use the substitution model that best fits your alignment data.
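
As an illustration of such a correction, here is a small sketch of the standard Jukes-Cantor distance formula, d = -3/4 x ln(1 - 4/3 x p). The formula is not stated on this card and is added here as an assumption about which correction is meant. The function name is made up.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    # Under JC69 the expected p-distance saturates at 0.75, so the
    # correction is undefined at or beyond that value.
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 correction needs 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


for p in (0.05, 0.3, 0.6):
    print(p, round(jc69_distance(p), 4))
```

The corrected distance is always at least as large as p, and the gap widens as the sequences become more divergent.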

4
Q

Which substitution models are there?

A

The Jukes-Cantor model (JC69).

The Kimura model (K80).

The Felsenstein model (F81).

The Hasegawa, Kishino and Yano model (HKY85).

The general time reversible model (GTR).

5
Q

Explain the JC69 model.

A

The JC69 model is the simplest of the models and it assumes:

  • The substitution rate is the same at all sites and equal for all nucleotide changes.
  • Nucleotide frequencies are all the same (0.25).

This leaves only one parameter, the substitution rate alpha.

The model gives:
- a substitution matrix that holds the probabilities of all substitutions (all equal under JC69), where each row adds up to 1
- a stationary distribution of nucleotide frequencies.

From the matrix you can find the probability of a substitution from one time step to the next, and from that the stationary distribution.

6
Q

What is a stationary distribution?

A

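A stationary distribution is the set of nucleotide frequencies that the substitution process settles on after running for a long time. Once reached, the frequencies no longer change, even though substitutions still occur.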

7
Q

How do you find the probability of a substitution from t to t+1 under the JC69 model?

A

JC69 assumes that all frequencies and rates are equal.

The nucleotide probabilities at time t+1 equal the probabilities at time t multiplied by the substitution probability matrix P:

[pA(t+1) pC(t+1) pG(t+1) pT(t+1)] = [pA(t) pC(t) pG(t) pT(t)] x P

Each entry at t+1 is the row vector at t multiplied by the matching column of P.

It describes the probability of change over a short period of time.
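
The row-vector-times-matrix step can be sketched in plain Python as follows. This assumes a discrete-time JC69 matrix with probability alpha for each specific substitution and 1 - 3 x alpha for no change. The value alpha = 0.01 and the function names are illustrative only.

```python
NUCLEOTIDES = "ACGT"


def jc69_step_matrix(alpha):
    """One-step JC69 matrix: alpha off the diagonal, 1 - 3*alpha on it."""
    if not 0.0 <= alpha <= 1.0 / 3.0:
        raise ValueError("alpha must lie between 0 and 1/3")
    return [[1.0 - 3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def step(p, matrix):
    """Multiply the row vector p by the matrix, one column at a time."""
    return [sum(p[i] * matrix[i][j] for i in range(4)) for j in range(4)]


# Start with a site that is certainly A and advance one time step.
p_t = [1.0, 0.0, 0.0, 0.0]
p_next = step(p_t, jc69_step_matrix(0.01))
print(dict(zip(NUCLEOTIDES, p_next)))
```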

8
Q

How do you solve for the stationary distribution under the JC69 model?

A

When you solve for the stationary distribution you solve for the nucleotide frequencies that no longer change over time. That is, you assume that the probabilities of A, C, G and T are the same at t and t+1.

When you multiply the stationary distribution by the probability matrix you get the same stationary distribution back, so pA(t+1) = pA(t), and likewise for C, G and T. Under the Jukes-Cantor model the solution is 0.25 for each nucleotide.

stationary distribution = stationary distribution x probability matrix.
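
One way to check this numerically is to keep multiplying any starting distribution by the probability matrix until it stops changing. The sketch below assumes a discrete-time JC69 matrix with an illustrative alpha of 0.05. The helper names are made up.

```python
def jc69_matrix(alpha):
    """One-step JC69 matrix: alpha off the diagonal, 1 - 3*alpha on it."""
    return [[1.0 - 3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def multiply(p, matrix):
    """Row vector times matrix."""
    size = len(p)
    return [sum(p[i] * matrix[i][j] for i in range(size)) for j in range(size)]


def iterate_to_stationary(p, matrix, tolerance=1e-12, max_steps=100000):
    """Apply the matrix repeatedly until the distribution stops changing."""
    for _ in range(max_steps):
        following = multiply(p, matrix)
        if max(abs(a - b) for a, b in zip(p, following)) < tolerance:
            return following
        p = following
    raise RuntimeError("did not converge")


matrix = jc69_matrix(0.05)
# Starting from "certainly A" still ends close to 0.25 for every nucleotide.
print(iterate_to_stationary([1.0, 0.0, 0.0, 0.0], matrix))
# Multiplying the stationary distribution by the matrix gives it back.
print(multiply([0.25, 0.25, 0.25, 0.25], matrix))
```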

9
Q

Describe the Kimura model

A

The K80 model is a bit more complex than the Jukes-Cantor model.

It assumes different substitution rates for transitions (A-G, C-T) and transversions (A-C, A-T, G-C, G-T), which gives two parameters in the substitution rate matrix.

It assumes that the nucleotide frequencies are equal and it does not model variation in rate across sites in the sequence.
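
To make the transition/transversion split concrete, the sketch below counts both kinds of difference between two aligned sequences and plugs the proportions P (transitions) and Q (transversions) into the usual Kimura two-parameter distance, d = -1/2 x ln(1 - 2P - Q) - 1/4 x ln(1 - 2Q). That formula is not on the card and is added here as an assumption. The sequences are expected to contain only uppercase A, C, G and T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def k80_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if is_transition(a, b):
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_a)    # proportion of transition differences
    q = transversions / len(seq_a)  # proportion of transversion differences
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# One transition (A-G) and one transversion (A-T) over ten sites.
print(round(k80_distance("AAAAAAAAAA", "GAAAAAAAAT"), 4))
```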

10
Q

Describe the Felsenstein model (F81)

A

All substitution rates are equal.

It allows unequal nucleotide frequencies.

11
Q

Describe the HKY85 model of substitutions

A

It gives different rates to transitions and transversions through the parameter kappa, the transition/transversion rate ratio.

The base frequencies can also be specified, so the model can accommodate situations where the frequencies are unequal.

12
Q

Describe the general time reversible model of substitution

A

Allows a different rate for each of the 6 substitution types (A-C, A-G, A-T, C-G, C-T, G-T).

Allows unequal base frequencies.

It is time reversible: at stationarity, the overall amount of change from nucleotide i to nucleotide j equals the amount of change from j to i (frequency of i x rate i to j = frequency of j x rate j to i). The process therefore looks the same forwards and backwards in time.
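
The reversibility condition can be checked on a small example. The sketch below builds an unnormalized GTR rate matrix in the common parameterization rate(i to j) = exchangeability(i,j) x frequency(j), which is an assumption not stated on the card, and then tests that frequency(i) x rate(i to j) equals frequency(j) x rate(j to i). The exchangeability and frequency values are invented.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def gtr_rate_matrix(exchangeabilities, frequencies):
    """Rate matrix with q[i][j] = s_ij * pi_j; each row sums to zero."""
    q = {i: {j: 0.0 for j in NUCLEOTIDES} for i in NUCLEOTIDES}
    for i, j in combinations(NUCLEOTIDES, 2):
        s = exchangeabilities[(i, j)]  # one symmetric value per pair
        q[i][j] = s * frequencies[j]
        q[j][i] = s * frequencies[i]
    for i in NUCLEOTIDES:
        q[i][i] = -sum(q[i][j] for j in NUCLEOTIDES if j != i)
    return q


def is_reversible(q, frequencies, tolerance=1e-12):
    """Check pi_i * q_ij == pi_j * q_ji for every pair of nucleotides."""
    return all(
        abs(frequencies[i] * q[i][j] - frequencies[j] * q[j][i]) < tolerance
        for i, j in combinations(NUCLEOTIDES, 2)
    )


# Hypothetical values: six exchangeabilities and unequal base frequencies.
exchange = {("A", "C"): 1.0, ("A", "G"): 4.0, ("A", "T"): 1.0,
            ("C", "G"): 1.0, ("C", "T"): 4.0, ("G", "T"): 1.0}
freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}

rates = gtr_rate_matrix(exchange, freqs)
print(rates["A"]["G"], rates["G"]["A"])  # the raw rates differ
print(is_reversible(rates, freqs))       # but the balance condition holds
```

Note that the individual rates A to G and G to A are not equal here. It is the frequency-weighted flow in each direction that matches.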

13
Q

There is one thing that none of the substitution models can model on their own. What is it, and how do we correct for it?

A

Different substitution rates at different sites in the sequences.

Not all parts of a gene evolve at the same rate.

To correct for it, we let the rate at each site follow a gamma distribution.
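
A rough sketch of what gamma-distributed site rates look like, assuming the common convention that the distribution has mean 1 (shape alpha, scale 1/alpha), so that a single shape parameter controls how unequal the rates are. Phylogenetics software typically uses a few discrete rate categories instead of drawing random rates. The sampling below is only for illustration.

```python
import random


def sample_site_rates(n_sites, shape, seed=0):
    """Draw one relative rate per site from a gamma with mean 1."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1/shape gives mean 1.
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


for shape in (0.5, 5.0):
    rates = sample_site_rates(20000, shape)
    slow = sum(1 for r in rates if r < 0.1) / len(rates)
    mean = sum(rates) / len(rates)
    print(f"shape={shape}: mean rate {mean:.2f}, share of slow sites {slow:.2f}")
```

A small shape value gives many nearly invariant sites and a few fast ones. A large shape value gives rates that are all close to 1.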

14
Q

What is the drawback of using the gamma distribution?

A

It does not allow for correlation of rates along the genome or for changes in rate over time.

In reality, rates are often correlated between neighbouring sites and often change over time within a gene.

15
Q

What information can we get from looking at a phylogenetic tree?

A

Phylogenetic trees help us retrace history and relationships between individuals or groups.

16
Q

When is it appropriate to reconstruct a phylogenetic tree?

A

The phylogenetic reconstruction of a tree is done after you have:
- gathered your data
- assembled a dataset
- performed a multiple sequence alignment (MSA)
- checked the quality of the MSA

17
Q

What are the basic assumptions for phylogenetic reconstructions?

A

  • The sequences in a tree share a common ancestor.
  • Mutations have accumulated since the common ancestor.
  • Mutations are relatively rare. If A and B are more similar to each other in an alignment than to C, then A and B probably share a common ancestor that C does not have.
18
Q

Define the following terms:
- Paraphyletic group
- monophyletic group
- Polyphyletic group
- Homoplasies

A

Monophyletic group = A common ancestor and all of its descendants.

Paraphyletic group = A common ancestor and some, but not all, of its descendants.

Polyphyletic group = Organisms grouped together because of a similar characteristic, often one they converged on, where the group does not include their most recent common ancestor.

Homoplasies = Characteristics that are similar but were not inherited from the same ancestor. Homoplasy commonly arises through convergent evolution.

19
Q

Define the following terms:
- Polytomy
- Bifurcating tree
- Orthologues
- Paralogues

A

Polytomy = A node with more than three branches connecting to it (more than two descendant branches).

Bifurcating tree = A tree without polytomies; the tree is fully resolved.

Orthologues = Genes in different taxa that descend from a single gene in their common ancestor and were separated by a speciation event. They often, but not always, have the same function.

Paralogues = Genes that are the product of a gene duplication event, typically two copies within the same genome.

20
Q

What are the distance based methods for retrieving a phylogenetic tree from multiple sequence alignment data?

A

UPGMA and Neighbor joining

21
Q

What algorithms for retrieving a phylogenetic tree are there?

A

Maximum parsimony
Maximum likelihood
Bayesian statistics

UPGMA
NJ

22
Q

What does the UPGMA algorithm do?

A

The UPGMA algorithm takes a matrix of pairwise distances computed from a sequence alignment of n taxa. It repeatedly joins the closest clusters and derives branch lengths from the distances, producing a rooted phylogenetic tree of the taxa.

23
Q

What are the steps of the UPGMA algorithm?

A

1. Create distance matrix

2. Join taxa with smallest distance to cluster

3. Calculate branch length from joined taxa to new node

4. Calculate distances from new node

5. Repeat until done.
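
The steps above can be sketched as a short Python function. This is a teaching sketch under assumptions, not a reference implementation: the input format (a dict keyed by frozenset pairs), the output format (nested tuples `(left, right, node_height)`) and the example distances are all invented for illustration. The distance update is weighted by cluster size so that it remains an average over all original taxa.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree.

    names: list of taxon names.
    distances: dict mapping frozenset({name1, name2}) to a distance.
    Returns a nested tuple (left, right, node_height); leaves are names.
    """
    # Each active cluster id maps to (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    index = {name: i for i, name in enumerate(names)}
    d = {}
    for pair, value in distances.items():
        a, b = pair
        d[frozenset((index[a], index[b]))] = value
    next_id = len(names)
    while len(clusters) > 1:
        # Step 2: pick the two clusters with the smallest distance.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Step 3: the new node sits at half the joined distance.
        height = d[frozenset((a, b))] / 2.0
        # Step 4: size-weighted average distance to each remaining cluster.
        for c in clusters:
            d[frozenset((next_id, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b, height), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


example = {
    frozenset(("A", "B")): 2.0, frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0, frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0, frozenset(("C", "D")): 10.0,
}
print(upgma(["A", "B", "C", "D"], example))
```

The branch length from a child to its parent node is the parent's height minus the child's height (0 for a leaf).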

24
Q

How do you calculate the new distance matrix after you’ve joined the taxa with the smallest distance in UPGMA?

A

Once two elements have been joined into a new operational unit, called a cluster, you calculate the distance from the cluster to each remaining element as the average of the distances between the members of the cluster and that element.

For a cluster u made of the two taxa a and b, you take the average of their distances to the remaining element c:

D(u,c) = (D(a,c) + D(b,c)) / 2

When the joined clusters contain different numbers of taxa, the average is weighted by cluster size.

25
Q

How do you calculate the branch length between each of two joined taxa and the new node they form?

A

The distance between the joined elements / 2

26
Q

What is the drawback of the UPGMA algorithm?

A

The result is an ultrametric rooted tree.

Ultrametric means that the distance from the root is the same for all taxa. This assumption makes the algorithm sensitive to unequal rates among lineages. If some taxa evolve faster than others, UPGMA can give the wrong topology.

27
Q

What is the difference between UPGMA and NJ?

A

The NJ algorithm is also based on a distance matrix of n taxa, but unlike UPGMA it allows for unequal rates of evolution, so branch lengths are proportional to the amount of change. The result is an unrooted tree.

This algorithm is slightly more complex and is slower than the UPGMA.

28
Q

What is divergence when we talk about the NJ algorithm?

A

A key concept of the NJ algorithm is divergence: the net divergence of a taxon is the sum of its distances to all other taxa, a measure of total branch length in the neighbor-joining process.

29
Q

What are the steps in the NJ algorithm?

A
  1. Create the distance matrix (D).
  2. Create the Q matrix.
  3. Join the pair of taxa with the smallest Q value into a new cluster.
  4. Calculate the branch lengths from the joined taxa to the new node.
  5. Calculate the distances from the new node to the other taxa in a new distance matrix.
  6. Repeat until done.
30
Q

What is the difference between D matrix and Q matrix in NJ algorithm?

A

The distance matrix (D) is not used directly as in UPGMA. Instead it is used to create a new matrix (Q) in which each pairwise distance is adjusted for the net divergence of the two taxa.

Divergence in this case is a measure of total branch length in the neighbor joining process.

31
Q

How do we estimate branch length from the joined taxa to the new node in the NJ algorithm?

A

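If you have joined c and d into the new cluster u:

D(c,u) = 0.5 x D(c,d) + (Rc - Rd) / (2(n-2))

D(d,u) = D(c,d) - D(c,u)

Here Rc and Rd are the sums of all distances from c and from d, and n is the number of taxa in the current matrix. In NJ the branch length starts from the original distance divided by two, just like in UPGMA, but it is then adjusted for the difference in net divergence between the two taxa, so the two branches can differ in length.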
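
A small sketch of this branch-length calculation, using a made-up five-taxon distance matrix stored as a dict of dicts. The function names and the numbers are illustrative assumptions.

```python
TAXA = "abcde"
ROWS = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {x: dict(zip(TAXA, row)) for x, row in zip(TAXA, ROWS)}


def net_divergence(dist, taxon):
    """R for one taxon: the sum of its distances to all other taxa."""
    return sum(dist[taxon][other] for other in dist if other != taxon)


def nj_branch_lengths(dist, c, d):
    """Branch lengths from c and d to the new node that joins them."""
    n = len(dist)
    if n < 3:
        raise ValueError("the formula needs at least three taxa")
    r_c = net_divergence(dist, c)
    r_d = net_divergence(dist, d)
    length_c = 0.5 * dist[c][d] + (r_c - r_d) / (2 * (n - 2))
    length_d = dist[c][d] - length_c
    return length_c, length_d


# Joining a and b: Ra = 31, Rb = 34, so a gets the shorter branch.
print(nj_branch_lengths(dist, "a", "b"))  # (2.0, 3.0)
```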

31
Q

How do you calculate the Q matrix from the D matrix?

A

Q(x,y) = (n-2) x D(x,y) - Rx - Ry

where Rx is the sum of all distances from x to the other taxa, Ry is the same for y, and n is the number of taxa in the current matrix.
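
As a worked example, the sketch below computes Q for every pair in a made-up five-taxon distance matrix and picks the pair with the smallest value. The matrix and the function name are illustrative assumptions.

```python
from itertools import combinations

TAXA = "abcde"
ROWS = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {x: dict(zip(TAXA, row)) for x, row in zip(TAXA, ROWS)}


def q_matrix(dist):
    """Q value for every unordered pair of taxa."""
    n = len(dist)
    # R for each taxon; the zero self-distance does not change the sum.
    r = {x: sum(dist[x].values()) for x in dist}
    return {(x, y): (n - 2) * dist[x][y] - r[x] - r[y]
            for x, y in combinations(dist, 2)}


q = q_matrix(dist)
best_pair = min(q, key=q.get)
print(best_pair, q[best_pair])  # ('a', 'b') -50
```

In this example d and e have the smallest raw distance (3), but a and b have the smallest Q value, so NJ joins a and b first.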

32
Q

How do you calculate the new distance matrix after joining two elements to a new cluster in NJ?

A

Similar to UPGMA, but we also subtract the original distance between the joined taxa before halving:

D(u,e) = (D(c,e) + D(d,e) - D(c,d)) / 2
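
A sketch of this update step on a made-up five-taxon distance matrix, joining a and b into a new node called u. The helper name and the data are illustrative assumptions.

```python
TAXA = "abcde"
ROWS = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {x: dict(zip(TAXA, row)) for x, row in zip(TAXA, ROWS)}


def nj_join(dist, c, d, new_name):
    """Return a new distance table where c and d are replaced by new_name."""
    rest = [x for x in dist if x not in (c, d)]
    # Distances among the untouched taxa are copied unchanged.
    new = {x: {y: dist[x][y] for y in rest} for x in rest}
    new[new_name] = {new_name: 0.0}
    for e in rest:
        value = (dist[c][e] + dist[d][e] - dist[c][d]) / 2
        new[new_name][e] = value
        new[e][new_name] = value
    return new


reduced = nj_join(dist, "a", "b", "u")
print(reduced["u"])  # distances from u to c, d and e: 7, 7 and 6
```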

32
Q

What is maximum parsimony?

A

Maximum parsimony is an optimality criterion.
The principle is to find the tree that minimizes the number of evolutionary changes. It rests on the assumption that evolution takes the shortest path, so the tree that requires the fewest changes (the shortest tree) is taken to be the best estimate.

This tree must be found among all possible trees. For a small number of taxa (fewer than 10) an exhaustive search is possible, but for larger numbers of taxa a heuristic search must be done. The method is sensitive to unequal rates among lineages, which can cause long-branch attraction.
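
To show how the number of changes on one candidate tree can be counted, here is a sketch of Fitch's algorithm, a standard way to score a fixed tree under parsimony. The card does not name it, so treat it as one possible scoring method. Trees are written as nested pairs of leaf names, and the small alignment is invented.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    states: dict mapping each leaf name to its nucleotide at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # No shared state: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "GAT", "D": "GCT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 3
print(parsimony_score(tree_2, alignment))  # 5
```

Under the parsimony criterion tree_1 is preferred, because it explains the alignment with fewer changes.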

33
Q

What is the difference between maximum likelihood and maximum parsimony?

A

Maximum likelihood is a statistical approach that lends itself to various statistical tests of phylogenies, such as the likelihood ratio test. A likelihood analysis looks for the tree under which the observed data (nucleotide or amino acid sequences) are most probable, given a model of sequence evolution.

Parsimony aims to find the tree that has the lowest number of evolutionary changes.

34
Q

How is the Bayesian approach different from maximum likelihood? What is the difference between a likelihood and a posterior probability?

A

Likelihood: The probability of observing the data given that the model is true.
P(D|M).

Posterior probability: The probability of the model given the data, obtained by combining the likelihood with the prior probability.
P(M|D). For this you need the prior probability P(M) (what you believed about the model before seeing the data).

Maximum likelihood finds the tree under which our data are most probable, given parameters such as an evolutionary model. Bayesian inference instead works with the posterior probability, which also includes our prior knowledge.

35
Q

What is MCMC used for (especially in Bayesian statistics)?

A

Markov Chain Monte Carlo is used to sample from the posterior distribution when doing Bayesian statistics.
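
A toy sketch of the idea, using the Metropolis algorithm on just two hypothetical trees whose unnormalized posterior weights (likelihood x prior) are 3 and 1. Real Bayesian phylogenetics samples trees, branch lengths and model parameters with far more elaborate proposals. The names and numbers below are invented.

```python
import random


def metropolis(weights, n_steps, seed=0):
    """Sample states in proportion to their unnormalized posterior weight.

    weights: dict mapping each state to a positive weight.
    Returns the fraction of steps spent in each state.
    """
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    counts = {state: 0 for state in states}
    for _ in range(n_steps):
        proposal = rng.choice(states)  # symmetric proposal
        ratio = weights[proposal] / weights[current]
        # Accept with probability min(1, ratio); otherwise stay put.
        if rng.random() < ratio:
            current = proposal
        counts[current] += 1
    return {state: count / n_steps for state, count in counts.items()}


# Only the ratio of the weights matters, so P(D) is never needed.
print(metropolis({"tree_1": 3.0, "tree_2": 1.0}, 50000))
```

The visiting frequencies approximate the posterior probabilities (about 0.75 and 0.25 here) without ever computing the normalizing constant.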

36
Q

Given some observed data, the model M has been shown to have likelihood x. What is needed to get the posterior probability?

A

A likelihood is the probability of the observed data given the model, P(D|M).

The posterior probability is the probability of the model given the data, P(M|D).

You need the prior P(M), and the total probability of the data P(D) to normalize:

P(M|D) = P(D|M) x P(M) / P(D)
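
A small numeric sketch of Bayes' rule for a handful of competing models, assuming the candidate models are the only possibilities so that P(D) is the sum of likelihood x prior over them. The likelihoods and priors are invented.

```python
def posterior_probabilities(likelihoods, priors):
    """Combine P(D|M) and P(M) into P(M|D) for each model M."""
    joint = {m: likelihoods[m] * priors[m] for m in likelihoods}
    evidence = sum(joint.values())  # P(D), summed over the candidate models
    return {m: value / evidence for m, value in joint.items()}


likelihoods = {"M1": 0.02, "M2": 0.01}

# With equal priors the model with the higher likelihood wins.
print(posterior_probabilities(likelihoods, {"M1": 0.5, "M2": 0.5}))
# A strong prior for M2 can outweigh its lower likelihood.
print(posterior_probabilities(likelihoods, {"M1": 0.2, "M2": 0.8}))
```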

37
Q

How is an optimality criterion used in phylogenetics? Why are such methods much slower than clustering algorithms such as UPGMA?

A

An optimality criterion gives every candidate tree a score (e.g. the number of changes in parsimony or the likelihood in ML), and the method searches for the tree with the best score.

Clustering methods are faster because they build a single tree directly from the distance matrix in a fixed number of steps, follow fewer rules and make fewer assumptions. UPGMA, for example, just keeps connecting the most similar groups (shortest distance).

Optimality criterion methods are slower because they have to score many candidate trees, and each score takes work: in ML you fit a model of evolution and calculate the likelihood for every tree examined. In return they offer more robust and statistically grounded approaches for inferring trees and estimating branch lengths and model parameters.
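
Part of the cost comes from the sheer number of candidate trees. The number of distinct unrooted bifurcating trees for n taxa is (2n - 5)!!, the product of the odd numbers up to 2n - 5. This is a standard result that the card does not state. A short sketch, with a made-up function name:

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Ten taxa already give about two million trees, which is why exhaustive searches are limited to small data sets and heuristics are used beyond that.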

38
Q

What is Bootstrapping used for and how is it done?

A

Bootstrapping is used to assess how much confidence you can have in each split (branch) of your derived tree. The alignment columns are resampled with replacement many times, a tree is derived from each resampled alignment, and you count in how many of these trees each split occurs. A split that occurs in a high proportion of the replicates is well supported by the data; a split that occurs rarely is poorly supported.
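
The resampling step can be sketched as below. The tree-building step is left as a caller-supplied function, here called `build_splits`, which is a hypothetical placeholder for whatever method (NJ, parsimony, ML) is used and is assumed to return the set of splits found in one tree.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (same length as input)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def split_support(alignment, build_splits, n_replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each split appears.

    build_splits: hypothetical function that takes an alignment and
    returns an iterable of hashable splits from the tree it infers.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        for split in build_splits(replicate):
            counts[split] = counts.get(split, 0) + 1
    return {split: count / n_replicates for split, count in counts.items()}


alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
print(bootstrap_alignment(alignment, random.Random(1)))
```

Every taxon is resampled with the same column indices, so each column of a replicate is still a real column from the original alignment.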