Sequence Alignment and Searching Flashcards
Week 3 Lecture 1
What are homologues?
Evolutionarily-related proteins
Two types: orthologs and paralogs
How do protein sequences evolve?
- Substitutions due to single-base mutations
- Insertions or deletions of residues - usually in the connecting loops (not the secondary structures)
- Indels make it harder to compare sequences (need to line up the equivalent regions and put gaps where there are indels)
The formula for % sequence identity
(no. of identical residues / no. of residues in the shorter protein) x 100
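A minimal sketch of the formula above, assuming the two sequences have already been aligned to equal length and that gaps are written as "-". The function name and the gap character are illustrative choices, not part of the lecture.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two aligned sequences, normalised by the shorter protein."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both sequences carry the same residue (gaps never count).
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Protein lengths are residue counts, so gap characters are excluded.
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    shortest = min(len_a, len_b)
    if shortest == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * identical / shortest
```

For example, percent_identity("ACDE", "ACGE") has 3 identical residues out of 4, so it returns 75.0.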
How do you search sequence databases?
- Do fast scans using approximate methods (BLAST or PSI-BLAST)
- Align proteins carefully using a dynamic programming method
- Scan against sequence profiles/HMMs in secondary databases
- Align query sequences against family relatives
Tuple size
The length of a run of identical residues that counts as a match (e.g. at least 3 in a row)
Window (path matrix)
The band marked by the two red bars on either side of the matrix. The window limits the search to cells within a set distance of the centre diagonal.
Score (path matrix)
The total score of a path through the matrix, i.e. the sum of the scores of the residue pairs it aligns (watch YouTube video)
Types of residue substitution matrices
- Identity matrix
- Physicochemical properties matrix
- Evolutionary matrix
Physicochemical properties matrix
Score residue pairs according to similarities in their physicochemical properties
Identity matrix
Simplest scoring scheme - amino acids are either identical (1) or non-identical (0)
Evolutionary matrices
Score residue pairs according to how frequently the mutation is observed to occur in evolution
Dayhoff matrix
- Based on evolutionary relationships: derived by analysing the substitutions observed in closely related sequences (>85% identity)
- The method measures evolutionary distance as the number of point accepted mutations (PAMs)
BLOSUM substitution matrices
- The matrix is derived from analysing substitution patterns in more distant relatives (<85% sequence identity)
- For clusters of related sequences, derive multiple alignments without gaps
- Use these alignments of short, ungapped regions to calculate residue substitution frequencies
How do we know which matrix to use?
- Matrices derived from observed substitution data (e.g. BLOSUM) are better than identity matrices or those based on physical properties
- In database searching it may be best to use PAM120 or BLOSUM62
- Various studies suggest that PAM250 gives the best result when aligning distant proteins using dynamic programming algorithms
Needleman & Wunsch Algorithm steps in dynamic programming
- Score the path matrix
- Accumulate scores in the path matrix
- Trace the highest-scoring path in the path matrix
How do we accumulate scores in the path matrix?
- Start at the bottom right
- Move right to left, accumulating scores
- Move up to the next row and repeat
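The accumulation and trace steps can be sketched as follows. This is a simplified illustration, not the exact scheme from the lecture: it assumes identity scoring (match = 1, mismatch = 0) and a fixed penalty of 1 per gap position, and the function names are hypothetical.

```python
def accumulate_path_matrix(a, b, match=1, mismatch=0, gap=1):
    """Accumulate scores from the bottom-right corner of the path matrix.

    acc[i][j] holds the best score for aligning a[i:] with b[j:].
    """
    n, m = len(a), len(b)
    acc = [[0] * (m + 1) for _ in range(n + 1)]
    # Border cells: the rest of one sequence is aligned against gaps only.
    for i in range(n - 1, -1, -1):
        acc[i][m] = acc[i + 1][m] - gap
    for j in range(m - 1, -1, -1):
        acc[n][j] = acc[n][j + 1] - gap
    # Start at the bottom right, sweep each row right to left, then move up.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            pair = match if a[i] == b[j] else mismatch
            acc[i][j] = max(
                acc[i + 1][j + 1] + pair,  # align a[i] with b[j]
                acc[i + 1][j] - gap,       # a[i] against a gap
                acc[i][j + 1] - gap,       # b[j] against a gap
            )
    return acc


def trace_path(a, b, acc, match=1, mismatch=0, gap=1):
    """Follow the highest-scoring path from the top-left cell to the end."""
    n, m = len(a), len(b)
    i, j = 0, 0
    row_a, row_b = [], []
    while i < n or j < m:
        if i < n and j < m:
            pair = match if a[i] == b[j] else mismatch
            if acc[i][j] == acc[i + 1][j + 1] + pair:
                row_a.append(a[i])
                row_b.append(b[j])
                i += 1
                j += 1
                continue
        if i < n and acc[i][j] == acc[i + 1][j] - gap:
            row_a.append(a[i])
            row_b.append("-")
            i += 1
        else:
            row_a.append("-")
            row_b.append(b[j])
            j += 1
    return "".join(row_a), "".join(row_b)
```

With these assumed scores, aligning "ACDEF" with "ACEF" gives acc[0][0] = 3 (four matches minus one gap), and the trace places the gap opposite the D.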
How does BLAST work?
- A high-scoring segment pair (HSP) is found between two sequences
- The sequences may be related if the HSP score exceeds a cutoff
1. Make a list of significant words from the query sequence
2. Compare the word list to the database and identify exact matches
3. For each word match, extend the alignment using a PAM matrix and dynamic programming. BLAST looks for 2 non-overlapping segments on the same diagonal, which must be within a certain distance of each other before the extension is invoked. Gaps can also be allowed, so that the method joins segments on different diagonals.
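The word-matching steps can be sketched as follows. This is a toy illustration, not BLAST's actual implementation: it assumes a word size of 3, uses exact matches only (as on this card), and the function names and the max_distance value of 40 are hypothetical. The extension step is left out.

```python
from collections import defaultdict
from itertools import combinations


def index_words(sequence, word_size=3):
    """Map every word of length word_size to its start positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_word_hits(query, subject, word_size=3):
    """Return exact word matches as (query_pos, subject_pos, diagonal)."""
    query_index = index_words(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        for q_pos in query_index.get(word, []):
            hits.append((q_pos, s_pos, s_pos - q_pos))
    return hits


def two_hit_seeds(hits, word_size=3, max_distance=40):
    """Pair up non-overlapping hits on the same diagonal within max_distance."""
    by_diagonal = defaultdict(list)
    for q_pos, _s_pos, diagonal in hits:
        by_diagonal[diagonal].append(q_pos)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        for first, second in combinations(sorted(positions), 2):
            # Non-overlapping means the second word starts after the first ends.
            if word_size <= second - first <= max_distance:
                seeds.append((diagonal, first, second))
    return seeds
```

For the query "MKVLAAGIW" and the database sequence "TTMKVQAAGTT", the words MKV and AAG both match on diagonal 2, so they form one candidate pair for extension.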
How do we assess the significance of a sequence match?
- Length - we can get artificially high scores between short sequences
- Composition - if sequences are rich in particular amino acid residues we can get high scores for unrelated proteins
- To assess the significance of a match it is necessary to compare the score with that returned by random or unrelated sequences
- If the database is small or when considering a pair-wise comparison, the sequences can be shuffled to generate random sequences
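A minimal sketch of the shuffling idea, assuming a toy score (the most identities on any ungapped diagonal) in place of a real alignment score. Shuffling keeps length and composition fixed, so the z-score shows how far the real score stands out from chance. The function names, the 200 shuffles and the fixed seed are illustrative choices.

```python
import random
import statistics


def best_diagonal_score(a, b):
    """Toy score: the largest number of identities on any ungapped diagonal."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        matches = sum(
            1
            for i in range(len(a))
            if 0 <= i + offset < len(b) and a[i] == b[i + offset]
        )
        best = max(best, matches)
    return best


def shuffle_z_score(a, b, score=best_diagonal_score, n_shuffles=200, seed=0):
    """Compare the real score with scores against shuffled copies of b."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    real = score(a, b)
    residues = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same length and composition, random order
        background.append(score(a, "".join(residues)))
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        return float("inf") if real > mean else 0.0
    return (real - mean) / spread
```

A real score many standard deviations above the shuffled mean suggests the match is unlikely to be due to length or composition alone.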
S (BLAST)
Score for the pairwise alignment
E-value (BLAST)
Number of expected hits by chance with score S or higher given the size of the database and the length of the alignment
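As an illustration that is not taken from the lecture, the E-value is commonly estimated with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below takes K and lambda as inputs because they depend on the scoring matrix and gap penalties. The function name is hypothetical.

```python
import math


def expected_hits(score, query_length, database_length, k, lam):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * score)
```

In this formula, doubling the database size doubles E for the same score, while E falls exponentially as S rises.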
How do you conduct a Multiple Sequence Alignment?
- Align the most closely related pairs using dynamic programming (DP), then gradually align these groups together, keeping the gaps that appear in earlier alignments fixed
- (or) Add sequences one at a time to a growing multiple alignment