Substitution matrices Flashcards
Why are not all missense mutations equal?
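Amino acids differ in their properties (e.g. polar vs non-polar), so swapping one for a similar residue is less disruptive than swapping it for a dissimilar one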
Alignment of sequences
Gaps inserted to maximise alignment
Haemoglobin vs Myoglobin
Similar but not the same
Slide proteins along to identify similarities
Plot matches
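Plotting matches can be sketched as a simple text dot plot. This is a minimal illustration only: the function name dot_plot and the toy sequences are hypothetical, not real haemoglobin or myoglobin data.

```python
def dot_plot(a: str, b: str) -> list[str]:
    """Return one row per residue of `a`; '*' marks a match with a residue of `b`."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


# Toy sequences (hypothetical): the second is the first shifted by one residue.
rows = dot_plot("VLSPAD", "LSPADK")
for residue, row in zip("VLSPAD", rows):
    print(residue, row)
```

A run of matches along a diagonal shows a region of similarity; a diagonal that sits off the main diagonal shows that one sequence is shifted relative to the other.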
Insertion of gaps
Inserting gaps into a sequence brings more amino acids into register, giving more matches and a greater overall identity
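The effect of a gap on identity can be shown with a small sketch. The sequences below are hypothetical toy peptides, and percent_identity is an illustrative helper that counts matches over all alignment columns (other definitions of identity exist).

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if the residues agree and neither is a gap.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


# Hypothetical toy sequences, aligned without gaps.
ungapped_a = "VLSPADKTNV"
ungapped_b = "VLSADKTNVK"

# The same two sequences with one gap inserted in the second (and a trailing gap in the first).
gapped_a = "VLSPADKTNV-"
gapped_b = "VLS-ADKTNVK"

print(f"without gaps: {percent_identity(ungapped_a, ungapped_b):.1f}%")
print(f"with gaps:    {percent_identity(gapped_a, gapped_b):.1f}%")
```

Without the gap only the first three residues line up; with it, the residues after the missing P fall back into register.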
Forming an empirical substitution matrix from an alignment
V: G, A, A, L, L, K, I, P, K, Q, T, A, F, D
= 3A, 2L, 2K, 1G, 1I, 1P, 1Q, 1T, 1F, 1D
– total 14 substitutions from 20 occurrences
Valine 70% substituted
21% are A; 14% each are L, K; 7% each are the others
L: N, K, K, K, E, V, A, V, I, S, H, F, I, I, I, H, M, S, M
= 4I, 3K, 2H, 2V, 2M, 2S, 1F, 1N, 1E, 1A
- total 19 substitutions from 31 occurrences
Leucine 61% substituted
21% are I; 16% are K; 11% each are H, V, M, S; 5% each are the others
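The tallies above can be reproduced with a short sketch. The observed residues and occurrence counts are taken from the two examples; the function name substitution_profile is an illustrative choice.

```python
from collections import Counter


def substitution_profile(observed: str, occurrences: int):
    """Tally the residues seen in place of one amino acid.

    Returns (number of substitutions, fraction of occurrences substituted,
    share of each replacement residue among the substitutions).
    """
    counts = Counter(r.strip() for r in observed.split(","))
    total = sum(counts.values())
    shares = {res: n / total for res, n in counts.most_common()}
    return total, total / occurrences, shares


valine = "G, A, A, L, L, K, I, P, K, Q, T, A, F, D"
leucine = "N, K, K, K, E, V, A, V, I, S, H, F, I, I, I, H, M, S, M"

for name, observed, occurrences in [("V", valine, 20), ("L", leucine, 31)]:
    total, fraction, shares = substitution_profile(observed, occurrences)
    print(f"{name}: {total} substitutions from {occurrences} occurrences "
          f"({fraction:.0%} substituted)")
    for residue, share in shares.items():
        print(f"  {residue}: {share:.0%}")
```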
Point accepted mutations (PAM) matrices
- Take pairs of closely related orthologous sequences whose evolutionary relationship is known
- Repeat what was just shown for all amino acids, to compile an empirical substitution matrix
- PAM1 = the PAM matrix for 1 accepted point mutation per 100 residues (about 1% divergence); the number measures sequence change, not millions of years
- Likewise PAM50, PAM500 (extrapolated from PAM1 for more divergent sequences)
- Choose the appropriate matrix depending on how divergent the sequences you are aligning are
BLOSUM (BLOcks SUbstitution Matrix) matrices
- Based on the now defunct Blocks aligner and its curated database Blocks+
- From the Blocks+ database, select all the alignments
- Choose a threshold to thin them
- For instance, at a 62% threshold, sequences within a block that are at least 62% identical to one another are clustered and counted as a single sequence
- Substitutions are then counted only between clusters, i.e. between sequences that are less than 62% identical
- Then empirically assemble the substitution matrix from the counts that remain.
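- At a 62% threshold this gives BLOSUM-62
- Likewise BLOSUM-90 (using a 90% threshold)
- Note that the numbers run in the opposite direction to PAM numbers: a higher BLOSUM number suits more similar sequences, whereas a higher PAM number suits more divergent ones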
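The thinning step is usually described as clustering: sequences in a block that reach the identity threshold are grouped and treated as one, so only differences between groups contribute. Below is a minimal sketch of that idea under stated assumptions. It uses single-linkage grouping on a hypothetical toy block; the names identity and cluster_by_identity are illustrative and are not taken from the Blocks software.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length, ungapped sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(seqs: list[str], threshold: float) -> list[list[str]]:
    """Group sequences so that each one joins every cluster containing
    a member it matches at or above the threshold (single linkage)."""
    clusters: list[list[str]] = []
    for s in seqs:
        merged = [s]
        unlinked = []
        for cluster in clusters:
            if any(identity(s, t) >= threshold for t in cluster):
                merged.extend(cluster)    # s links to this cluster: absorb it
            else:
                unlinked.append(cluster)  # no link: leave the cluster as it is
        clusters = unlinked + [merged]
    return clusters


# Hypothetical ungapped block of three sequences.
block = ["AVLKGDEFWT", "AVLKGDEFWS", "AILRGNEYWS"]

for threshold in (0.62, 0.95):
    groups = cluster_by_identity(block, threshold)
    print(f"threshold {threshold:.0%}: {len(groups)} clusters")
```

Raising the threshold merges fewer sequences, so more closely related pairs are counted. That is one way to see why a higher BLOSUM number corresponds to a matrix for more similar sequences.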
BLOSUM-62 substitution matrix for proteins
- Common substitutions score highly
- Rare substitutions score low (negative values)
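As an illustration, the sketch below scores short aligned peptides using a handful of entries from BLOSUM-62. Only the pairs needed for the example are included (the full matrix is 20 x 20), and the peptides and helper names pair_score and alignment_score are hypothetical.

```python
# Excerpt of BLOSUM-62: only the residue pairs used in this example.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4,
    ("W", "W"): 11,
    ("L", "I"): 2,    # common, conservative substitution
    ("L", "V"): 1,
    ("L", "K"): -2,
    ("L", "D"): -4,   # rare substitution
}


def pair_score(a: str, b: str) -> int:
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def alignment_score(a: str, b: str) -> int:
    """Sum the pair scores over an ungapped alignment of equal length."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(alignment_score("LLW", "LIW"))  # L to I in the middle position
print(alignment_score("LLW", "LDW"))  # L to D in the middle position
```

The alignment with the conservative L to I change scores higher than the one with the L to D change, even though both differ at a single position.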
How are mismatches scored?
- On a variable scale - depends on likelihood of mismatch
Likelihood of mismatch is determined from scoring curated datasets, e.g.:
Peptide sequences separated by a known amount of evolutionary change: the Point Accepted Mutation series (PAM1, PAM50, PAM500)
Peptide sequences grouped by their degree of similarity on alignment with Blocks: the BLOSUM series. BLOSUM-62 is the substitution matrix derived after clustering sequences that are at least 62% identical, BLOSUM-90 after clustering at 90%, etc.
Others: Gonnet, Jones-Taylor-Thornton (JTT), Whelan & Goldman (WAG)
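Matrices of this kind typically turn likelihood into a score with a log-odds ratio: how often a pair is observed in the curated alignments, relative to how often it would be expected by chance. The sketch below assumes half-bit units (the scaling used for BLOSUM-62). The frequencies are made-up illustrative numbers, and log_odds_score is a hypothetical helper.

```python
import math


def log_odds_score(observed: float, expected: float) -> int:
    """Rounded log-odds score in half-bit units: 2 * log2(observed / expected)."""
    return round(2 * math.log2(observed / expected))


# Illustrative frequencies only, not taken from any real dataset.
expected_by_chance = 0.007

common = log_odds_score(observed=0.014, expected=expected_by_chance)   # seen twice as often as chance
rare = log_odds_score(observed=0.00175, expected=expected_by_chance)   # seen a quarter as often as chance

print("common substitution:", common)
print("rare substitution:  ", rare)
```

A pair seen more often than chance gets a positive score, one seen less often gets a negative score, and one seen exactly as often as chance scores zero.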