2 - Amino acid scoring matrices and optimal pairwise alignment methods Flashcards
What can you do when a dotplot is cluttered?
- Use a sliding window (e.g. 3 nt words)
- Use a threshold (e.g. at least 2 of 3 neighbouring positions must be identical)
- Use amino acid residues instead of nucleotides!
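The window and threshold filters above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `dotplot` and the parameters `window` and `min_matches` are made up for this example.

```python
def dotplot(seq_a, seq_b, window=3, min_matches=2):
    """Return the (i, j) positions that get a dot.

    A dot is placed at (i, j) when the windows starting at seq_a[i] and
    seq_b[j] agree in at least `min_matches` of their `window` positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


# Strict filter: all 3 positions in the window must match.
strict = dotplot("ACGT", "ACGT", window=3, min_matches=3)
# Looser filter: 2 of 3 positions are enough, so one mismatch is tolerated.
loose = dotplot("ACGT", "ACCT", window=3, min_matches=2)
print(sorted(strict), sorted(loose))
```

Raising `min_matches` or `window` removes more isolated dots, at the cost of hiding short or weak similarities.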
Which type of amino acids experience the most substitutions?
Non-polar neutral amino acids, as they are chemically similar.
Give four scoring matrices for protein similarity
- Unit cost: 1 match = 1 point
- Genetic similarity matrices (codon similarity)
- Chemical similarity (amino acid chemistry)
- Empirical matrices (Dayhoff/PAM or BLOSUM): Real data on relative propensity of interchange between amino acids
What is 1 PAM? And therefore, what are two sequences 5 PAM apart?
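1 PAM = 1 accepted point mutation per 100 residues (PAM stands for point accepted mutation), i.e. about 1% of positions have changed.
Two sequences 5 PAM apart are roughly 95% identical. The bigger the PAM number, the more divergent the sequences.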
What is the BLOSUM matrix?
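It’s similar to PAM, but newer and built from many more observed substitutions, taken from the BLOCKS database (conserved, ungapped regions of multiple alignments of proteins).
BLOSUM30: built from blocks in which sequences sharing 30% identity or more are clustered together, so the counted substitutions come from sequences less than about 30% identical, and so on for other numbers. A higher BLOSUM number means more similar sequences (opposite of the PAM naming scheme!).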
Review probability, starting on page 12 of lecture 2.
How do you denote joint probability? Conditional probability?
Joint: P(A, B) = count(A and B) / total
If A and B are independent: P(A, B) = P(A) x P(B)
Conditional: P(A | B) = count(A and B) / count(B) = P(A, B) / P(B)
What is the multiplication theorem for conditional probability?
P(D,G) = P(D | G) x P(G)
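A small check of the multiplication theorem from counts, as a sketch. The observations below are invented purely for illustration, and the helper names (`p_joint`, `p_g`, `p_d_given_g`) are hypothetical.

```python
from fractions import Fraction

# Hypothetical (D, G) observations, invented for illustration only.
observations = (
    [("D", "G")] * 3 + [("d", "G")] * 2 + [("D", "g")] * 1 + [("d", "g")] * 4
)


def p_joint(obs, d, g):
    """P(D, G): fraction of all observations where both outcomes occur."""
    return Fraction(sum(1 for o in obs if o == (d, g)), len(obs))


def p_g(obs, g):
    """P(G): fraction of all observations with the given second outcome."""
    return Fraction(sum(1 for _, y in obs if y == g), len(obs))


def p_d_given_g(obs, d, g):
    """P(D | G): among observations with G, the fraction that also have D."""
    with_g = [o for o in obs if o[1] == g]
    return Fraction(sum(1 for o in with_g if o[0] == d), len(with_g))


joint = p_joint(observations, "D", "G")                                  # 3/10
product = p_d_given_g(observations, "D", "G") * p_g(observations, "G")   # 3/5 * 1/2
print(joint == product)  # prints: True
```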
Matrices (eg. PAM and BLOSUM) of the interchange between various amino acids are calculated from REAL data.
How is this data usually displayed? How are these calculated?
log-odds matrices
score(A,B) = log( P(A,B) / (P(A) x P(B)) )
Where A and B are the residues at an aligned position, and P(A,B) can be thought of as the joint probability of seeing those two residues at a site in sequences that evolved from a common ancestor over time t.
The probability of seeing A and B together by chance alone (i.e. if they were independent) is the frequency of A (P(A)) times the frequency of B (P(B)).
A positive value means the pair is aligned more often than expected by chance, which points to common ancestry; a negative value means the pair is seen less often than expected by chance.
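The log-odds calculation can be sketched as follows. The frequencies are hypothetical numbers chosen for illustration, and the base-2 logarithm is an assumption (published matrices use various bases and scaling factors).

```python
import math


def log_odds(p_ab, p_a, p_b, base=2.0):
    """Log of the observed pair frequency over the chance expectation."""
    expected = p_a * p_b  # chance of seeing A and B together if independent
    return math.log(p_ab / expected, base)


# Hypothetical frequencies, for illustration only.
conserved = log_odds(p_ab=0.02, p_a=0.1, p_b=0.1)   # seen about twice as often as chance
rare = log_odds(p_ab=0.005, p_a=0.1, p_b=0.1)       # seen about half as often as chance
print(conserved > 0, rare < 0)  # prints: True True
```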
What are two advantages and three disadvantages to the dot matrix method?
Pros
- Good way of visualizing alignments (eg. can see repeat structures)
- Programs can do this
Cons
- Needs visual inspection
- Subjective (devil in the clouds)
- Need something to tell you what the OPTIMAL alignment is
What are two methods for finding the optimal alignment of two sequences?
Give the two used types of the most common method!
Exhaustive: Evaluate all possible alignments and choose the best-scoring one. Impractical for sequences of realistic length, because the number of possible alignments grows exponentially with sequence length.
Dynamic programming algorithm: Time is proportional to N*M, where N and M are the lengths of the target (N) and query (M) sequences (MUCH FASTER)
- Needleman-Wunsch (global)
- Smith-Waterman (local)
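To get a feel for why the exhaustive method is impractical, the sketch below counts the possible global alignments of two sequences, assuming each alignment is a series of columns that either pairs two residues or puts a gap in one sequence. The function name `count_alignments` is made up for this example. The count is compared with N*M, the number of cells dynamic programming has to fill.

```python
def count_alignments(n, m):
    """Number of gapped global alignments of sequences of length n and m.

    Each alignment ends in one of three column types (residue/residue,
    residue/gap, gap/residue), which gives the three-term recurrence below.
    """
    table = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[n][m]


# Compare the exhaustive search space with the N*M cells used by DP.
for length in (2, 5, 10, 20):
    print(length, count_alignments(length, length), length * length)
```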
List the pairwise alignment methods
- Dot plots
- Global alignment (NW), used when sequences are co-linear.
- Local alignment (SW); FASTA and BLAST are fast heuristic local-alignment methods. Can be used for mosaic or repetitive proteins (where co-linearity is not necessarily expected).
What is the Needleman-Wunsch algorithm?
- Global
- Makes a 2D matrix of similarity values
- Builds a new (summed) matrix by adding up elements in a systematic manner
- Traces a path through the summed matrix from top left to bottom right (left to right, top to bottom), following the highest values
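A minimal sketch of global alignment by dynamic programming. Two things here are assumptions rather than part of the notes: the scoring values (match +1, mismatch -1, gap -1), and the orientation, which follows the commonly taught variant that fills the matrix from the top left and traces back from the bottom right (the mirror image of the direction described above).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # prints: (2, 'ACGT', 'A-GT')
```

The two nested loops visit each of the N*M cells once, which is where the N*M running time comes from.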
What is the Smith-Waterman algorithm?
- Local
- Needs mismatches and gaps to score negatively (which NW does not strictly require).
- Stop extending an alignment when the score drops to 0 or below
- The entire matrix (that is created) must be searched for regions with high local similarity
- Keeps cumulative total and no elements are allowed a score less than zero
- Tracing the optimal path starts at the highest score in the matrix
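The points above can be sketched as a small local-alignment routine. The scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration, and the function returns only one best-scoring local alignment.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # No cell may drop below zero: a poor region resets the score.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            # Search the whole matrix for the highest-scoring cell.
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest score until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # prints: (8, 'ACGT', 'ACGT')
```

Only the shared `ACGT` core is reported; the unrelated flanking residues are left out, which is the practical difference from a global alignment.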