Substitution models - phylogenetics Flashcards
What is the pairwise distance?
p-distance = number of sites that differ between two sequences / total number of sites compared (i.e. the proportion of sites at which the sequences differ). The pairwise p-distances fill the distance matrix used by distance-based methods such as UPGMA and NJ.
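The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two sequences are already aligned to the same length and that every character (including a gap, if present) is compared as-is; the function name `p_distance` is hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the sites where the two characters are not identical.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Two differences (sites 5 and 10) out of 10 sites.
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))  # 0.2
```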
Why does the p-distance alone not represent the evolutionary distance between two sequences?
The p-distance only counts the sites that differ today; it does not account for earlier changes at a site that were overwritten by later ones (multiple hits).
For example, if a site shows A in one sequence and T in the other, it is counted as one difference, but the actual history might have been A to C to G to A to T, which is 4 substitution events.
This means that the true evolutionary distance is usually underestimated, and we have to correct for this.
How can we correct the p-distance when it understates the evolutionary distance between two sequences?
The p-distance is corrected by applying an evolutionary (substitution) model that accounts for the unobserved changes.
These models start from the observed differences and estimate the real distance using a model (a hypothesis) of the substitution process.
This requires that we make some assumptions about the substitution process.
So when you construct a tree, you should use the substitution model that best fits your alignment data.
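As one concrete case of such a correction, the simplest model, Jukes-Cantor (JC69), converts an observed p-distance p into an estimated distance d = -3/4 x ln(1 - 4/3 x p). The sketch below only illustrates that formula; the function name `jc69_distance` is hypothetical, and the formula is undefined once p reaches 0.75 (the level expected for two unrelated random sequences).

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 correction needs 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


for p in (0.05, 0.2, 0.5):
    # The corrected distance is always at least as large as p,
    # and the gap widens as the sequences diverge.
    print(p, round(jc69_distance(p), 4))
```

For p = 0.2 the corrected distance is about 0.233, so roughly one substitution in seven is estimated to be hidden by multiple hits.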
What substitution models are there?
The Jukes-Cantor model (JC69)
The Kimura model (K80)
The Felsenstein model (F81)
The Hasegawa, Kishino and Yano model (HKY85)
The general time reversible model (GTR)
Explain the JC69 model.
The JC69 model is the simplest of the models and it assumes:
- The substitution rate is the same at every site and for every pair of nucleotides.
- Nucleotide frequencies are the same (0.25)
This gives only one parameter, the substitution rate alpha.
The model creates:
- a substitution probability matrix that holds the probability of every substitution (the same value for all of them under JC69); each row adds up to 1
- a stationary distribution of nucleotide frequencies.
From these you can find the probability of a substitution from one time point to the next, and the stationary distribution.
What is a stationary distribution?
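A stationary distribution gives the nucleotide frequencies that are reached after substitution has been going on for a long time. Once the sequence is at these frequencies, further substitutions no longer change them.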
How do you find the probability of a substitution from t to t+1 under the JC69 model?
JC69 assumes that all frequencies and rates are equal.
So the nucleotide probabilities at time t+1 equal the probabilities at time t multiplied by the substitution probability matrix.
[pA(t+1) pC(t+1) pG(t+1) pT(t+1)] = [pA(t) pC(t) pG(t) pT(t)] x probability matrix (each entry on the left is the row vector multiplied by the corresponding column of the matrix).
It describes the probability of change over a short period of time.
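The update from t to t+1 can be sketched as a row vector multiplied by a 4x4 matrix. This is an illustration under stated assumptions: the per-step substitution probability `alpha = 0.01` is an arbitrary example value, the nucleotide order is A, C, G, T, and the helper names are hypothetical.

```python
def jc69_step_matrix(alpha: float) -> list[list[float]]:
    """Per-step probability matrix: alpha off the diagonal, 1 - 3*alpha on it."""
    if not 0.0 <= alpha <= 1.0 / 3.0:
        raise ValueError("alpha must lie between 0 and 1/3")
    return [[1.0 - 3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def step(p: list[float], matrix: list[list[float]]) -> list[float]:
    """Row vector times matrix: entry j uses column j of the matrix."""
    return [sum(p[i] * matrix[i][j] for i in range(4)) for j in range(4)]


P = jc69_step_matrix(0.01)
p0 = [1.0, 0.0, 0.0, 0.0]   # the site is certainly an A at time t
p1 = step(p0, P)            # probabilities of A, C, G, T at time t+1
print(p1)                   # approximately [0.97, 0.01, 0.01, 0.01]
```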
How do you solve for the stationary distribution under the JC69 model?
When you solve for the stationary distribution, you solve for the nucleotide frequencies at which the distribution no longer changes over time. This means you assume that the probabilities of A, C, G and T are the same at t and t+1.
This means that when you multiply the stationary distribution by the probability matrix you get the same stationary distribution back: pA(t+1) = pA(t), and likewise for C, G and T. Under the Jukes-Cantor model, the stationary distribution is 0.25 for each nucleotide.
stationary distribution = stationary distribution x probability matrix.
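Both statements can be checked numerically. The sketch below assumes an example per-step substitution probability of 0.01 and nucleotide order A, C, G, T (helper names are hypothetical): it first confirms that 0.25 for each nucleotide is returned unchanged, then shows that a site starting as a certain A drifts towards the same frequencies after many steps.

```python
def jc69_step_matrix(alpha: float) -> list[list[float]]:
    """Per-step probability matrix: alpha off the diagonal, 1 - 3*alpha on it."""
    return [[1.0 - 3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def step(p: list[float], matrix: list[list[float]]) -> list[float]:
    """Row vector times matrix."""
    return [sum(p[i] * matrix[i][j] for i in range(4)) for j in range(4)]


P = jc69_step_matrix(0.01)

# 1) The stationary distribution maps onto itself.
pi = [0.25, 0.25, 0.25, 0.25]
pi_next = step(pi, P)

# 2) Any starting point converges to it when the step is repeated.
p = [1.0, 0.0, 0.0, 0.0]
for _ in range(2000):
    p = step(p, P)

print(pi_next)
print(p)
```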
Describe the Kimura model
The K80 model is a bit more complex than the Jukes-Cantor model.
This is because it assumes different substitution rates for transitions (purine to purine or pyrimidine to pyrimidine: A-G, C-T) and transversions (purine to pyrimidine or the reverse: A-C, A-T, G-C, G-T), which gives two parameters in the substitution rate matrix.
It assumes that the nucleotide frequencies are equal, and it does not model variation in rate across different sites in the sequence.
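To make the transition/transversion split concrete, the sketch below counts both kinds of difference between two aligned sequences and applies the standard K80 distance formula, d = -1/2 x ln(1 - 2P - Q) - 1/4 x ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. It assumes uppercase sequences containing only A, C, G and T; the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_proportions(seq_a: str, seq_b: str) -> tuple[float, float]:
    """Return (P, Q): proportions of transitions and transversions."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and equally long")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
        if same_class:
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n


def k80_distance(P: float, Q: float) -> float:
    """Kimura two-parameter corrected distance."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


# One transition (A-G at site 1) and one transversion (A-T at site 5).
P, Q = ts_tv_proportions("ACGTACGTAC", "GCGTTCGTAC")
print(P, Q, round(k80_distance(P, Q), 4))
```

When transitions make up exactly one third of the differences (P = p/3, Q = 2p/3), the formula gives the same value as the Jukes-Cantor correction, which is a quick sanity check on the implementation.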
Describe the Felsenstein model (F81)
All substitution types share one rate parameter (transitions and transversions are not distinguished).
It allows the nucleotide frequencies to be unequal.
Describe the HKY85 model of substitutions
It gives different rates to transitions and transversions through the parameter kappa, the transition/transversion rate ratio.
The model lets you specify the base frequencies and can therefore accommodate situations where the frequencies are unequal.
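One common way to write the HKY85 rate matrix is to set the rate from nucleotide i to j to kappa x (frequency of j) for transitions and to (frequency of j) for transversions, with each diagonal entry chosen so that the row sums to 0. The sketch below builds that matrix; it is unscaled (not normalised to one expected substitution per unit time), and the frequencies and kappa value are made-up example numbers.

```python
NUCLEOTIDES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky85_rate_matrix(kappa: float, freqs: dict[str, float]) -> dict[str, dict[str, float]]:
    """Unscaled HKY85 rate matrix as a nested dict: matrix[i][j]."""
    if abs(sum(freqs[n] for n in NUCLEOTIDES) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    matrix = {}
    for i in NUCLEOTIDES:
        row = {}
        for j in NUCLEOTIDES:
            if i == j:
                continue
            weight = kappa if (i, j) in TRANSITIONS else 1.0
            row[j] = weight * freqs[j]
        # Diagonal entry makes the row of the rate matrix sum to zero.
        row[i] = -sum(row.values())
        matrix[i] = row
    return matrix


freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # example values only
Q = hky85_rate_matrix(kappa=2.0, freqs=freqs)
for i in NUCLEOTIDES:
    print(i, [round(Q[i][j], 3) for j in NUCLEOTIDES])
```

Setting kappa to 1 removes the transition/transversion distinction (the F81 case), and additionally setting all frequencies to 0.25 makes every off-diagonal rate equal (the JC69 case).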
Describe the general time reversible model of substitution
Allows a different rate for each of the 6 substitution types (A-C, A-G, A-T, C-G, C-T, G-T).
Allows for unequal base frequencies.
It is time reversible: in a time-reversible model, the overall flow of substitutions from one nucleotide to another (the frequency of the first nucleotide multiplied by its rate of change to the second) is the same as the flow in the reverse direction.
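The reversibility condition can be checked directly: for every pair of nucleotides i and j, frequency(i) x rate(i to j) should equal frequency(j) x rate(j to i). The sketch below builds an unscaled GTR matrix from six exchangeability values and four base frequencies and tests that condition. All numbers are invented for illustration, and the function names are hypothetical.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def gtr_rate_matrix(exchange: dict[str, float], freqs: dict[str, float]) -> dict[str, dict[str, float]]:
    """Unscaled GTR matrix: rate(i -> j) = exchange[pair] * freqs[j]."""
    matrix = {i: {} for i in NUCLEOTIDES}
    for i in NUCLEOTIDES:
        for j in NUCLEOTIDES:
            if i != j:
                pair = "".join(sorted(i + j))  # "CA" and "AC" share one value
                matrix[i][j] = exchange[pair] * freqs[j]
        matrix[i][i] = -sum(matrix[i].values())  # rows sum to zero
    return matrix


def is_time_reversible(matrix, freqs, tol: float = 1e-12) -> bool:
    """Check freqs[i] * q[i][j] == freqs[j] * q[j][i] for every pair."""
    return all(
        abs(freqs[i] * matrix[i][j] - freqs[j] * matrix[j][i]) <= tol
        for i, j in combinations(NUCLEOTIDES, 2)
    )


exchange = {"AC": 1.0, "AG": 4.0, "AT": 0.8, "CG": 1.2, "CT": 5.0, "GT": 1.0}
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
Q = gtr_rate_matrix(exchange, freqs)

print(Q["A"]["C"], Q["C"]["A"])       # the raw rates differ
print(is_time_reversible(Q, freqs))   # True: the flows balance
```

Note that the raw rates A to C and C to A are not equal when the base frequencies differ; it is the frequency-weighted flow that balances.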
There is one thing that none of the substitution models alone can account for. What is it, and how do we correct for it?
Different substitution rates at different sites of the sequences.
Not all parts of a gene evolve at the same rate.
To correct for it, we let the rate at each site follow a gamma distribution.
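The idea can be illustrated by drawing a relative rate for each site from a gamma distribution with mean 1, so that the shape parameter alone controls how unequal the sites are. This is a sketch using Python's standard `random.gammavariate`; the function name `sample_site_rates`, the shape values and the seed are assumptions for illustration. Tree-building software usually works with a discretised gamma with a few rate categories rather than random draws.

```python
import random
import statistics


def sample_site_rates(n_sites: int, shape: float, seed: int = 0) -> list[float]:
    """Draw one relative rate per site from Gamma(shape, scale=1/shape), mean 1."""
    if shape <= 0:
        raise ValueError("shape must be positive")
    rng = random.Random(seed)
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


for shape in (0.5, 5.0):
    rates = sample_site_rates(20000, shape)
    # Small shape: many nearly invariant sites plus a few fast ones.
    # Large shape: all sites evolve at close to the average rate.
    print(shape, round(statistics.mean(rates), 2), round(statistics.variance(rates), 2))
```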
What is the drawback of using the gamma distribution?
It does not allow for correlations in rate along the genome or for changes in rate over time.
In reality, rate variation is often correlated along the genome, and the rate of a gene often changes over time.
What information can we get from looking at a phylogenetic tree?
Phylogenetic trees help us retrace history and relationships between individuals or groups.
When is it appropriate to reconstruct a phylogenetic tree?
The phylogenetic reconstruction of a tree is done after you have:
- gathered your data
- assembled a dataset
- performed a multiple sequence alignment (MSA)
- checked the quality of the MSA