Matching Flashcards by Isaac Dunford

Why are sequences matched?

Protein sequences are similar between related species

Can transfer knowledge about known sequences to new sequences

How well did you know this?

Not at all

Perfectly

Alignment

First task after sequencing

Finds best match at DNA and protein level

How well did you know this?

Not at all

Perfectly

Difference between protein and DNA mutations

Selective pressure affects DNA and proteins differently
Protein mutations - Can lead to in viable offspring - Highly conserved for protein sequences
DNA mutations - Can have more mutations with no changes in function - Less conserved than proteins

How well did you know this?

Not at all

Perfectly

Conserved Sequences

Conserved:Does not change between different generations, species or even kingdoms
More important sequences are generally more conserved

How well did you know this?

Not at all

Perfectly

Universally Conserved Sequences

Found in all living things
Proteins for transcription and translation

Conserved from the last “universal common ancestor” of all life

How well did you know this?

Not at all

Perfectly

Degeneracy in genetic code

Can change whilst still coding for the same amino acid

How well did you know this?

Not at all

Perfectly

Use of conservation in sequence similarity

More similar amino acids are more likely to be substituted

Used to work out how similar two sequences are

How well did you know this?

Not at all

Perfectly

Histones

Highly conserved

Involved in DNA replication

How well did you know this?

Not at all

Perfectly

BLOSOM 50

Block Substitution Matrix
Using known alignments, calculate likelihood that one amino acid is substituted with another

Generated by eliminating sequences with higher than a given similarity
Find frequency of each amino acid combination
Create ratio of occurrences of each of the pairs
Calculate log-odds of each pair - 2 * log of observed\expected (How likely is a match just random chance?)

Each entry in the matrix corresponds to the probability of one amino acid mutating into another

How well did you know this?

Not at all

Perfectly

Gaps in Sequence Matching

Often good matches are found with insertions/deletions from a string
Linear Model - Poor - Small gaps just as likely as large gaps

Instead use edit distance/levenshtein distance
Local Matching/Global Matching

How well did you know this?

Not at all

Perfectly

Dynamic programming for inexact string matches

Not possible to exhaustively try all solutions
Build set of optimal partial solutions

Not dynamic or programming

How well did you know this?

Not at all

Perfectly

Complexity of Dynamic Programming Approach

Need to compute cost at each point in the forward algorithm so O(n^2)
Only one calculation needed at each column in the backward algorithm so O(n)

Brute force method would require O(2^n)

How well did you know this?

Not at all

Perfectly

Needleman Wunsch

Global alignment
Minimum edit distance between two strings
Assumes linear gap cost for insertions/deletions
Cost of substitutions taken from BLOSOM-50 matrix
Matrix with score for best alignment

How well did you know this?

Not at all

Perfectly

NW Complexity

Time = Theta(MN) - Not too bad
Space = Theta(MN) - DNA has sequences which are millions of bases long so will become an issue

How well did you know this?

Not at all

Perfectly

Hirschberg Algorithm

Recursively splits up the matching
Linear in space
Still quadratic in time - Still almost doubles the amount of time needed

How well did you know this?

Not at all

Perfectly

Differences between global and local matching

Study These Flashcards

Global - Aligning all of the sequences with each other - NW

Local - Only finding the best matching sub sequences - Some sections are accidentally copied - SW

Smith-Waterman

Study These Flashcards

Also dynamic programming approach
Adds option of ignoring a match (cost 0) to NW
Optimal for local matching but slow on long strings

SW repeated and overlapping matches

Study These Flashcards

SW finds the single best match
SW can also be adapted to deal with this
Unsymmetric as looking for repeats of one section within another.

Gaps in local matching

Study These Flashcards

Insertion - Adding another subsequence from elsewhere
Deletion - Jumping bases or treating an exon as an intron

Both of these are rare events but lead to large gaps - Having a linear gap cost is unrealistic

Gap cost function evaluated over all previous possible gap sizes

Study These Flashcards

Time Complexity = O(n^3)

Affine Gap Cost Function

Study These Flashcards

y(g) = -d - (g-1) e
d = 12 e = 3 gives a good model
Punishes all gaps but larger gaps are punished slightly more
Can be O(n^2)
Requires 3 matrices

Affine gap cost function - Finite State Automata

Study These Flashcards

Each state represents a matrix

Possible for an approximation of the affine gap cost with a two state model

Inexact matching in practice - Proteins

Study These Flashcards

O(mn) possible - hundreds of amino acids long
Over 200 million sequenced
Difficult to compare sequence with each o ne

Inexact matching in practice - DNA

Study These Flashcards

100000 bases long
O(mn) impossible
Need to do approximate matching

FASTA

FAST ALL (DNA and Protein alphabets) Not optimal - Use of heuristics so still good Find series of short non-overlapping subsequences known as words Match against database of candidate sequences to find regions with same offset Smith-Waterman against regions found BLAST mostly used now but FASTA file format still used

BLAST

Basic Local Alignment Search Tool Not optimal - uses heuristics so still good Removes low complexity/repeated regions K-letter exact matches? Only evaluate most significant matches Find gaps by combining high scoring matches Finishing off with Smith Waterman Need to know how significant matches are

Significance Scoring

BLOSOM50 uses log odds ratio For local matching need probability of a match given sequences a and b. Can use bayes rule here EVDs

Extreme Value Distributions

How likely it is that a particular score was random rather than due to a match More accurate with longer strings Characterised as a Gumbel distribution

Matching Flashcards

(28 cards)