Substitution models - phylogenetics Flashcards
What is the pairwise distance?
P-distance = number of differences between two sequences / the total number of sites in the sequence (i.e. the proportion of differences between the sequences). These pairwise distances fill the distance matrix used by UPGMA and NJ.
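A minimal sketch of the calculation, assuming the two sequences are already aligned to the same length. The function name p_distance is only illustrative, and gap characters are compared like any other symbol here, which real tools may handle differently.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the sites where the two aligned characters are not identical.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Two differences (sites 5 and 10) out of 10 sites.
print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))
```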
Why does the p-distance alone not represent the evolutionary distance between two sequences?
The p-distance only counts the differences that are visible today. It does not take into account earlier changes at a site that were overwritten by later ones.
For example, if a site shows A in one sequence and T in the other, the p-distance counts one difference, but the actual history may have been A to C to G to A to T = 4 changes, meaning that the number of evolutionary events = 4.
This means that the true distance is almost always underestimated and we have to correct for this.
How can we correct for the p-distance that understates the evolutionary distance between two sequences?
The correction of the p-distance is done by applying different evolutionary models that correct for the unobserved differences.
These models start from the observed differences and estimate the real distance using a model (a hypothesis) of the substitution process.
This requires that we make some assumptions about the substitution process.
So when you are constructing a tree you should use the substitution model that best fits your alignment data.
What substitution models are there?
The Jukes-Cantor model (JC69).
The Kimura model (K80)
The Felsenstein model (F81)
The Hasegawa, Kishino and Yano model (HKY85)
The general time reversible model (GTR).
Explain the JC69 model.
The JC69 model is the simplest of the models and it assumes:
- Substitution rates are constant across all sites, and the rate is equal for all nucleotide changes.
- Nucleotide frequencies are the same (0.25 each).
This gives only one parameter, the substitution rate alpha.
The model creates:
- a substitution probability matrix that holds the probabilities for all substitutions (the same for all of them under JC69), where each row adds up to 1
- a stationary distribution of nucleotide frequencies.
From this you can find the probability of a substitution from one time point to the next, and the stationary distribution.
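Linking JC69 back to the p-distance: because the model has a single rate, the correction has a well-known closed form, d = -3/4 x ln(1 - 4p/3), where p is the observed p-distance. The sketch below assumes that formula; the function name is illustrative. The correction is undefined once p reaches 0.75, the level expected for unrelated sequences.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed p-distance p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# The corrected distance is larger than the observed proportion,
# and the gap widens as the sequences become more divergent.
for p in (0.05, 0.2, 0.5):
    print(p, jc69_distance(p))
```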
What is a stationary distribution?
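A stationary distribution gives the nucleotide frequencies that no longer change from one time step to the next, even though substitutions keep occurring. It is the distribution the frequencies settle into after substitution has gone on for a long time.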
How do you find the probability for a substitution from t to t+1 under the JC69 model?
The JC assumes that all frequencies and rates are equal.
So the nucleotide probabilities at time t+1 equal the probabilities at time t multiplied by the substitution probability matrix.
[pA(t+1) pC(t+1) pG(t+1) pT(t+1)] = [pA(t) pC(t) pG(t) pT(t)] x probability matrix (each entry on the left is the row vector times the matching column of the matrix).
It describes the probability of change over a short period of time.
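The multiplication can be sketched in a few lines. This assumes a one-step JC69 matrix with probability alpha for each specific change and 1 - 3 x alpha for no change; the value alpha = 0.01 and the function names are made up for illustration.

```python
NUCLEOTIDES = "ACGT"


def jc69_step_matrix(alpha: float) -> list:
    """4x4 one-step matrix: alpha off the diagonal, 1 - 3*alpha on it."""
    return [[1 - 3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def step(probs: list, matrix: list) -> list:
    """Row vector times matrix: entry j sums probs[i] * matrix[i][j] down column j."""
    return [sum(probs[i] * matrix[i][j] for i in range(4)) for j in range(4)]


matrix = jc69_step_matrix(0.01)   # hypothetical alpha
at_t = [1.0, 0.0, 0.0, 0.0]       # the site is A with certainty at time t
at_t_plus_1 = step(at_t, matrix)
for nucleotide, prob in zip(NUCLEOTIDES, at_t_plus_1):
    print(nucleotide, round(prob, 4))
```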
How do you solve for the stationary distribution under the JC69 model?
When you solve for the stationary distribution you solve for the nucleotide frequencies that stay unchanged over a long period of time, meaning that you assume the probabilities of A, C, G and T are the same at t and t+1.
This means that when you multiply the stationary distribution by the probability matrix you get the same stationary distribution back, so pA(t+1) = pA(t) and likewise for C, G and T. Under the Jukes-Cantor model the stationary distribution is 0.25 for each nucleotide.
stationary distribution = stationary distribution x probability matrix.
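One way to check this numerically is to keep multiplying by the one-step matrix until the probabilities stop changing. The sketch below assumes the same illustrative JC69 matrix (alpha off the diagonal, 1 - 3 x alpha on it) with a made-up alpha of 0.01; it is a demonstration, not the algebraic solution.

```python
def jc69_step_matrix(alpha: float) -> list:
    """4x4 one-step matrix: alpha off the diagonal, 1 - 3*alpha on it."""
    return [[1 - 3 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def step(probs: list, matrix: list) -> list:
    """Row vector times matrix."""
    return [sum(probs[i] * matrix[i][j] for i in range(4)) for j in range(4)]


def iterate_to_stationary(start: list, matrix: list,
                          tol: float = 1e-12, max_steps: int = 100_000) -> list:
    """Apply the matrix repeatedly until one more step changes nothing (within tol)."""
    probs = list(start)
    for _ in range(max_steps):
        nxt = step(probs, matrix)
        if max(abs(a - b) for a, b in zip(nxt, probs)) < tol:
            return nxt
        probs = nxt
    raise RuntimeError("did not converge")


matrix = jc69_step_matrix(0.01)                       # hypothetical alpha
stationary = iterate_to_stationary([1.0, 0.0, 0.0, 0.0], matrix)
print([round(x, 6) for x in stationary])              # each entry is close to 0.25
```

Starting from any other distribution, such as [0.7, 0.1, 0.1, 0.1], should lead to the same result, which is what makes it the stationary distribution.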
Describe the Kimura model
The K80 model is a bit more complex than the Jukes-Cantor model.
This is because it assumes different substitution rates for transitions (A-G, C-T) and transversions (A-C, A-T, G-C, G-T), which gives two parameters in the substitution rate matrix.
It assumes that the nucleotide frequencies are equal, and it does not model variation in rate across different sites in the sequence.
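Because K80 treats the two kinds of change separately, its distance correction counts them separately too. The sketch below assumes the standard K80 distance formula, d = -1/2 x ln(1 - 2P - Q) - 1/4 x ln(1 - 2Q), where P is the proportion of sites with a transition and Q the proportion with a transversion. It also assumes aligned, uppercase sequences containing only A, C, G and T; the function name is illustrative.

```python
import math

# Transitions are purine<->purine (A, G) or pyrimidine<->pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k80_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and the same length")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    p_ts = transitions / len(seq_a)    # P in the formula
    q_tv = transversions / len(seq_a)  # Q in the formula
    term_1 = 1 - 2 * p_ts - q_tv
    term_2 = 1 - 2 * q_tv
    if term_1 <= 0 or term_2 <= 0:
        raise ValueError("sequences are too divergent for the K80 correction")
    return -0.5 * math.log(term_1) - 0.25 * math.log(term_2)


# One transition (A-G) and one transversion (A-C) in 10 sites.
print(k80_distance("AAAAAAAAAA", "GAAAAAAAAC"))
```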
Describe the Felsenstein model
All substitution rates are equal.
It allows for variation in nucleotide frequencies.
Describe the HKY85 model of substitutions
It gives different rates to transitions and transversions through the parameter kappa, the transition-to-transversion rate ratio.
The model lets you specify the base frequencies and can therefore accommodate situations where the frequencies are unequal.
Describe the general time reversible model of substitution
Allows a different substitution rate for each of the 6 substitution types.
Allows for unequal base frequencies.
It is time reversible: in a time-reversible model the overall flow from one nucleotide to another (the frequency of the first nucleotide times its rate of change to the second) is the same as the flow in the reverse direction, so the process looks the same forwards and backwards in time.
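A small sketch of how such a matrix can be built and checked. It assumes the common parameterisation in which the rate from x to y is a symmetric exchangeability for the pair times the frequency of y, written as a continuous-time rate matrix whose rows sum to 0. All numbers and names below are hypothetical.

```python
from itertools import combinations

NUC = "ACGT"


def gtr_rate_matrix(exchange: dict, freqs: dict) -> dict:
    """Rate x->y = exchangeability of the pair * frequency of y; rows sum to 0."""
    q = {x: {y: 0.0 for y in NUC} for x in NUC}
    for x, y in combinations(NUC, 2):
        s = exchange[x + y]          # one shared value per unordered pair
        q[x][y] = s * freqs[y]
        q[y][x] = s * freqs[x]
    for x in NUC:
        q[x][x] = -sum(q[x][y] for y in NUC if y != x)
    return q


def is_time_reversible(q: dict, freqs: dict, tol: float = 1e-12) -> bool:
    """Check freq(x) * rate(x->y) == freq(y) * rate(y->x) for every pair."""
    return all(abs(freqs[x] * q[x][y] - freqs[y] * q[y][x]) < tol
               for x, y in combinations(NUC, 2))


# Hypothetical values: six exchangeabilities and four unequal base frequencies.
exchange = {"AC": 1.0, "AG": 4.0, "AT": 1.0, "CG": 1.0, "CT": 4.0, "GT": 1.0}
freqs = {"A": 0.1, "C": 0.2, "G": 0.4, "T": 0.3}
q = gtr_rate_matrix(exchange, freqs)
print(q["A"]["G"], q["G"]["A"])        # the raw rates differ
print(is_time_reversible(q, freqs))    # but the frequency-weighted flows match
```

Setting the two transition exchangeabilities to kappa and the other four to 1, as in the example values, corresponds to the HKY85 special case.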
There is one thing that none of the substitution models alone can model for, what is it? How do we correct for it?
Different rates of substitution at different sites of the sequences.
Not all parts of a gene evolve at the same rate.
To correct for it we use the gamma distribution to model rate variation across sites.
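To get a feel for what this means, the sketch below draws one relative rate per site from a gamma distribution with mean 1, so that a small shape value gives many slow sites and a few fast ones. The function name, the shape value 0.5 and the seed are illustrative assumptions; tree-building programs usually approximate the distribution with a few discrete rate categories rather than drawing random rates.

```python
import random


def draw_site_rates(n_sites: int, shape: float, seed=None) -> list:
    """Draw one relative substitution rate per site from a gamma with mean 1."""
    rng = random.Random(seed)
    # gammavariate(alpha, beta) has mean alpha * beta,
    # so beta = 1 / shape keeps the average rate at 1.
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n_sites)]


rates = draw_site_rates(20_000, shape=0.5, seed=1)
mean_rate = sum(rates) / len(rates)
slow_sites = sum(1 for r in rates if r < 0.1)
print(round(mean_rate, 2), slow_sites)
```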
What is the drawback of using the Gamma distribution?
It does not allow for correlations in rate along the genome or changes in rate across time.
Rate variation is often correlated along the genome, and the rate often changes over time in genes.
What information can we get from looking at a phylogenetic tree?
Phylogenetic trees help us retrace history and relationships between individuals or groups.
When is it appropriate to reconstruct a phylogenetic tree?
The phylogenetic reconstruction of a tree is done after you have:
- gathered your data
- assembled a dataset
- performed MSA (multiple sequence alignment)
- checked quality of MSA
What are the basic assumptions for phylogenetic reconstructions?
- The sequences in a tree share a common ancestor
- Mutations are accumulated from the common ancestor
- Mutations are relatively rare. If A and B are more similar to each other in an alignment than either is to C, then A and B probably have a common ancestor that C does not have.
Define the following terms:
- Paraphyletic group
- monophyletic group
- Polyphyletic group
- Homoplasies
Monophyletic groups = an ancestral lineage and all the descendants of that lineage. A group of organisms that consists of a common ancestor and all its descendants.
Paraphyletic group = includes a common ancestor and some, but not all, of its descendants.
Polyphyletic group = organisms grouped on a similar characteristic that they have converged on; the group does not include the most recent common ancestor of its members.
Homoplasies = characteristics that are similar but have not been inherited from the same ancestor. Convergent evolution is one process that produces homoplasies.
Define the following terms:
- Polytomy
- Bifurcating tree
- Orthologues
- Paralogues
Polytomy = A polytomy is a node with more than three branches connecting to it.
Bifurcating tree = A tree without polytomous nodes, the tree is fully resolved.
Orthologues = genes in different taxa that descend from a single gene in their common ancestor and were separated by a speciation event. They often, but not always, keep the same function.
Paralogues = genes that are the product of a gene duplication event of an original gene, typically found in the same genome.
What are the distance based methods for retrieving a phylogenetic tree from multiple sequence alignment data?
UPGMA and Neighbor joining
What algorithms for retrieving a phylogenetic tree are there?
Maximum parsimony
Maximum likelihood
Bayesian statistics
UPGMA
NJ
What does the UPGMA algorithm do?
The UPGMA algorithm looks at a matrix of pairwise distances from a sequence alignment of n taxa and, based on those distances, creates a rooted phylogenetic tree of the taxa with branch lengths.
What are the steps of the UPGMA algorithm?
1.Create distance matrix
2.Join taxa with smallest distance to cluster
3.Calculate branch length from joined taxa to new node
4.Calculate distances from new node
5.Repeat steps 2-4 until done.
How do you calculate the new distance matrix after you’ve joined the taxa with the smallest distance in UPGMA?
Once two elements have been joined into a new operational unit called a cluster, you calculate the distance from the cluster to each remaining element as the average of the distances between the cluster's members and that element.
Since the cluster contains the joined elements, you take the average of the distances from both joined elements to the remaining element.
D(u,c) = (D(a,c) + D(b,c)) / 2
If a or b is itself a cluster, each distance is weighted by the number of taxa in that cluster, so that every original taxon counts equally.
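The whole procedure can be sketched as below. This is a minimal illustration, not a reference implementation: the input format (a dict keyed by frozenset pairs of taxon labels), the function name and the example distances are all assumptions. The distance update weights each joined cluster by its number of taxa, which reduces to the simple average above when both are single taxa.

```python
from itertools import combinations


def upgma(dist: dict) -> str:
    """dist maps frozenset({x, y}) -> distance for every pair of taxon labels.
    Returns the rooted tree as a Newick string with branch lengths."""
    d = dict(dist)
    labels = set()
    for pair in d:
        labels |= pair
    size = {x: 1 for x in labels}        # number of taxa in each cluster
    height = {x: 0.0 for x in labels}    # node height; tips sit at 0
    newick = {x: x for x in labels}
    active = set(labels)
    while len(active) > 1:
        # Step 2: find the pair of active clusters with the smallest distance.
        a, b = min(combinations(sorted(active), 2),
                   key=lambda pair: d[frozenset(pair)])
        u = f"{a}+{b}"
        # Step 3: the new node sits at half the distance between a and b.
        h = d[frozenset((a, b))] / 2
        newick[u] = (f"({newick[a]}:{h - height[a]:.3f},"
                     f"{newick[b]}:{h - height[b]:.3f})")
        size[u], height[u] = size[a] + size[b], h
        active -= {a, b}
        # Step 4: size-weighted average distance from u to every other cluster.
        for c in active:
            d[frozenset((u, c))] = (size[a] * d[frozenset((a, c))]
                                    + size[b] * d[frozenset((b, c))]) / size[u]
        active.add(u)
    (root,) = active
    return newick[root] + ";"


# Hypothetical distances between four taxa.
example = {
    frozenset("AB"): 2.0, frozenset("AC"): 4.0, frozenset("BC"): 4.0,
    frozenset("AD"): 10.0, frozenset("BD"): 10.0, frozenset("CD"): 7.0,
}
print(upgma(example))
```

In the example, A and B join first, then C joins them, and the distance from the cluster of three to D is (2 x 10 + 1 x 7) / 3 = 9 rather than the unweighted (10 + 7) / 2 = 8.5.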