Matching Flashcards
Why are sequences matched?
Protein sequences are similar between related species
Can transfer knowledge about known sequences to new sequences
Alignment
First task after sequencing
Finds best match at DNA and protein level
Difference between protein and DNA mutations
Selective pressure affects DNA and proteins differently
Protein mutations - Can lead to in viable offspring - Highly conserved for protein sequences
DNA mutations - Can have more mutations with no changes in function - Less conserved than proteins
Conserved Sequences
Conserved:Does not change between different generations, species or even kingdoms
More important sequences are generally more conserved
Universally Conserved Sequences
Found in all living things
Proteins for transcription and translation
Conserved from the last “universal common ancestor” of all life
Degeneracy in genetic code
Can change whilst still coding for the same amino acid
Use of conservation in sequence similarity
More similar amino acids are more likely to be substituted
Used to work out how similar two sequences are
Histones
Highly conserved
Involved in DNA replication
BLOSOM 50
Block Substitution Matrix
Using known alignments, calculate likelihood that one amino acid is substituted with another
Generated by eliminating sequences with higher than a given similarity
Find frequency of each amino acid combination
Create ratio of occurrences of each of the pairs
Calculate log-odds of each pair - 2 * log of observed\expected (How likely is a match just random chance?)
Each entry in the matrix corresponds to the probability of one amino acid mutating into another
Gaps in Sequence Matching
Often good matches are found with insertions/deletions from a string
Linear Model - Poor - Small gaps just as likely as large gaps
Instead use edit distance/levenshtein distance
Local Matching/Global Matching
Dynamic programming for inexact string matches
Not possible to exhaustively try all solutions
Build set of optimal partial solutions
Not dynamic or programming
Complexity of Dynamic Programming Approach
Need to compute cost at each point in the forward algorithm so O(n^2)
Only one calculation needed at each column in the backward algorithm so O(n)
Brute force method would require O(2^n)
Needleman Wunsch
Global alignment
Minimum edit distance between two strings
Assumes linear gap cost for insertions/deletions
Cost of substitutions taken from BLOSOM-50 matrix
Matrix with score for best alignment
NW Complexity
Time = Theta(MN) - Not too bad Space = Theta(MN) - DNA has sequences which are millions of bases long so will become an issue
Hirschberg Algorithm
Recursively splits up the matching
Linear in space
Still quadratic in time - Still almost doubles the amount of time needed