Sequence Alignment and Searching Flashcards
Week 3 Lecture 1
What are homologues?
Evolutionarily related proteins
Two types: orthologs and paralogs
How do protein sequences evolve?
- Substitutions due to single-base mutations
- Insertions or deletions of residues - usually in the connecting loops (not the secondary structures)
- Indels make it harder to compare sequences (need to line up the equivalent regions and put gaps where there are indels)
The formula for % sequence identity
(number of identical residues / number of residues in the shorter protein) x 100
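A minimal Python sketch of this formula, assuming the two sequences have already been aligned to the same length and that gaps are written as "-". The function name percent_identity and the example sequences are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two pre-aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count alignment columns where both sequences carry the same residue.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    # Denominator: number of residues (gaps excluded) in the shorter protein.
    shorter = min(
        len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, ""))
    )
    if shorter == 0:
        raise ValueError("sequences contain no residues")
    return 100.0 * identical / shorter


# 5 identical columns, shorter protein has 6 residues: 5 / 6 x 100
print(round(percent_identity("MKT-AYI", "MKTQAFI"), 1))
```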
How do you search sequence databases?
- Do fast scans using approximate methods (BLAST or PSI-BLAST)
- Align proteins carefully using a dynamic programming method
- Scan against sequence profiles/HMMs in secondary databases
- Align query sequences against family relatives
Tuple size
The length of the runs of identical residues that are searched for (at least 3 in a row)
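A rough sketch of how runs of identical residues can be located, assuming a simple lookup table of every k-residue word in the target sequence. The name tuple_matches and the default k = 3 are assumptions for illustration, not the method of any particular program.

```python
from collections import defaultdict


def tuple_matches(query: str, target: str, k: int = 3):
    """Return (i, j) pairs where query[i:i+k] is identical to target[j:j+k]."""
    # Index every k-residue tuple in the target by its start position.
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    # Look up each tuple of the query in the index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


print(tuple_matches("ACDEFG", "XXCDEFY", k=3))
```

Each hit is a starting point on a diagonal of the path matrix; a larger tuple size gives fewer, more specific hits.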
Window (path matrix)
A band on either side of the centre diagonal of the path matrix (the two red bars in the lecture figure). Only cells within a set distance of the diagonal are considered.
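A small sketch of the window idea, assuming the window is given as a maximum distance from the centre diagonal measured in cells. The function name cells_in_window is hypothetical.

```python
def cells_in_window(len_a: int, len_b: int, window: int):
    """List the (i, j) cells of a path matrix within `window` of the centre diagonal."""
    return [
        (i, j)
        for i in range(len_a)
        for j in range(len_b)
        if abs(i - j) <= window  # distance from the diagonal i == j
    ]


# With window=1 only the diagonal and its two neighbouring diagonals are kept.
print(cells_in_window(3, 3, window=1))
```

Restricting the calculation to these cells reduces the work needed, at the cost of missing alignments that stray far from the diagonal.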
Score (path matrix)
The total score of a path through the matrix (watch the YouTube video)
Types of residue substitution matrices
- Identity matrix
- Physicochemical properties matrix
- Evolutionary matrix
Physicochemical properties matrix
Score residue pairs according to similarities in their physicochemical properties
Identity matrix
Simplest scoring scheme - amino acids are either identical (1) or non-identical (0)
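As a sketch, an identity matrix can be written as a Python dictionary keyed by residue pairs. The 20-letter alphabet string and the variable names are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# Score 1 for identical residues and 0 for any non-identical pair.
identity_matrix = {
    (a, b): 1 if a == b else 0
    for a in AMINO_ACIDS
    for b in AMINO_ACIDS
}

print(identity_matrix[("L", "L")], identity_matrix[("L", "I")])
```

Note that leucine and isoleucine score 0 here despite being chemically similar, which is the limitation the other matrix types address.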
Evolutionary matrices
Score residue pairs according to how frequently the mutation is observed to occur in evolution
Dayhoff matrix
- Based on evolutionary relationships: derived by analysing the substitutions observed in closely related sequences (>85% identity)
- The method measures evolutionary distance as the number of point accepted mutations (PAMs)
BLOSUM substitution matrices
- The matrix is derived from analysing substitution patterns in more distant relatives (<85% sequence identity)
- For clusters of related sequences, derive multiple alignments without gaps
- For short regions of related sequences, use the alignments to calculate residue substitution frequencies
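The frequency calculation can be sketched as follows, assuming a toy ungapped block of three sequences. The log-odds step (twice the base-2 logarithm of observed over expected frequency) follows the usual BLOSUM convention, but the clustering and weighting of sequences is left out here, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_frequencies(block):
    """Observed frequency of each unordered residue pair in an ungapped block."""
    counts = Counter()
    for column in zip(*block):  # walk the alignment column by column
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds(q):
    """Scores 2 * log2(observed / expected) for each observed pair."""
    # Background frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * log2(freq / expected)
    return scores


block = ["ACD", "ACE", "SCD"]  # toy block, not real data
q = pair_frequencies(block)
print(q)
print(log_odds(q))
```

A real matrix would be built from many blocks and rounded to integers; this toy block only shows the bookkeeping.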
How do we know which matrix to use?
- Matrices derived from observed substitution data (e.g. BLOSUM) are better than identity matrices or those based on physical properties
- In database searching it may be best to use PAM120 or BLOSUM62
- Various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms
Steps of the Needleman & Wunsch dynamic programming algorithm
- Score the path matrix
- Accumulate scores in the path matrix
- Trace the highest-scoring path in the path matrix
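The three steps above can be sketched in Python as follows. This is a minimal version under stated assumptions: a simple match/mismatch score stands in for a substitution matrix such as PAM250 or BLOSUM62, and the gap penalty is linear. The parameter values (match=1, mismatch=-1, gap=-2) are arbitrary choices for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # Steps 1 and 2: score each cell and accumulate the best score so far.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Step 3: trace the highest-scoring path back from the bottom-right cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

When several paths tie, this traceback prefers the diagonal move, so other equally good alignments may exist.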