Matching Flashcards
Why are sequences matched?
Protein sequences are similar between related species
Can transfer knowledge about known sequences to new sequences
Alignment
First task after sequencing
Finds best match at DNA and protein level
Difference between protein and DNA mutations
Selective pressure affects DNA and proteins differently
Protein mutations - Can lead to in viable offspring - Highly conserved for protein sequences
DNA mutations - Can have more mutations with no changes in function - Less conserved than proteins
Conserved Sequences
Conserved:Does not change between different generations, species or even kingdoms
More important sequences are generally more conserved
Universally Conserved Sequences
Found in all living things
Proteins for transcription and translation
Conserved from the last “universal common ancestor” of all life
Degeneracy in genetic code
Can change whilst still coding for the same amino acid
Use of conservation in sequence similarity
More similar amino acids are more likely to be substituted
Used to work out how similar two sequences are
Histones
Highly conserved
Involved in DNA replication
BLOSOM 50
Block Substitution Matrix
Using known alignments, calculate likelihood that one amino acid is substituted with another
Generated by eliminating sequences with higher than a given similarity
Find frequency of each amino acid combination
Create ratio of occurrences of each of the pairs
Calculate log-odds of each pair - 2 * log of observed\expected (How likely is a match just random chance?)
Each entry in the matrix corresponds to the probability of one amino acid mutating into another
Gaps in Sequence Matching
Often good matches are found with insertions/deletions from a string
Linear Model - Poor - Small gaps just as likely as large gaps
Instead use edit distance/levenshtein distance
Local Matching/Global Matching
Dynamic programming for inexact string matches
Not possible to exhaustively try all solutions
Build set of optimal partial solutions
Not dynamic or programming
Complexity of Dynamic Programming Approach
Need to compute cost at each point in the forward algorithm so O(n^2)
Only one calculation needed at each column in the backward algorithm so O(n)
Brute force method would require O(2^n)
Needleman Wunsch
Global alignment
Minimum edit distance between two strings
Assumes linear gap cost for insertions/deletions
Cost of substitutions taken from BLOSOM-50 matrix
Matrix with score for best alignment
NW Complexity
Time = Theta(MN) - Not too bad Space = Theta(MN) - DNA has sequences which are millions of bases long so will become an issue
Hirschberg Algorithm
Recursively splits up the matching
Linear in space
Still quadratic in time - Still almost doubles the amount of time needed
Differences between global and local matching
Global - Aligning all of the sequences with each other - NW
Local - Only finding the best matching sub sequences - Some sections are accidentally copied - SW
Smith-Waterman
Also dynamic programming approach
Adds option of ignoring a match (cost 0) to NW
Optimal for local matching but slow on long strings
SW repeated and overlapping matches
SW finds the single best match
SW can also be adapted to deal with this
Unsymmetric as looking for repeats of one section within another.
Gaps in local matching
Insertion - Adding another subsequence from elsewhere
Deletion - Jumping bases or treating an exon as an intron
Both of these are rare events but lead to large gaps - Having a linear gap cost is unrealistic
Gap cost function evaluated over all previous possible gap sizes
Time Complexity = O(n^3)
Affine Gap Cost Function
y(g) = -d - (g-1) e d = 12 e = 3 gives a good model Punishes all gaps but larger gaps are punished slightly more Can be O(n^2) Requires 3 matrices
Affine gap cost function - Finite State Automata
Each state represents a matrix
Possible for an approximation of the affine gap cost with a two state model
Inexact matching in practice - Proteins
O(mn) possible - hundreds of amino acids long
Over 200 million sequenced
Difficult to compare sequence with each o ne
Inexact matching in practice - DNA
100000 bases long
O(mn) impossible
Need to do approximate matching
FASTA
FAST ALL (DNA and Protein alphabets)
Not optimal - Use of heuristics so still good
Find series of short non-overlapping subsequences known as words
Match against database of candidate sequences to find regions with same offset
Smith-Waterman against regions found
BLAST mostly used now but FASTA file format still used
BLAST
Basic Local Alignment Search Tool
Not optimal - uses heuristics so still good
Removes low complexity/repeated regions
K-letter exact matches?
Only evaluate most significant matches
Find gaps by combining high scoring matches
Finishing off with Smith Waterman
Need to know how significant matches are
Significance Scoring
BLOSOM50 uses log odds ratio
For local matching need probability of a match given sequences a and b. Can use bayes rule here
EVDs
Extreme Value Distributions
How likely it is that a particular score was random rather than due to a match
More accurate with longer strings
Characterised as a Gumbel distribution