Matching Flashcards

(28 cards)

1
Q

Why are sequences matched?

A

Protein sequences are similar between related species

Can transfer knowledge about known sequences to new sequences

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Alignment

A

First task after sequencing

Finds best match at DNA and protein level

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Difference between protein and DNA mutations

A

Selective pressure affects DNA and proteins differently
Protein mutations - Can lead to in viable offspring - Highly conserved for protein sequences
DNA mutations - Can have more mutations with no changes in function - Less conserved than proteins

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Conserved Sequences

A

Conserved:Does not change between different generations, species or even kingdoms
More important sequences are generally more conserved

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Universally Conserved Sequences

A

Found in all living things
Proteins for transcription and translation

Conserved from the last “universal common ancestor” of all life

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Degeneracy in genetic code

A

Can change whilst still coding for the same amino acid

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Use of conservation in sequence similarity

A

More similar amino acids are more likely to be substituted

Used to work out how similar two sequences are

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Histones

A

Highly conserved

Involved in DNA replication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

BLOSOM 50

A

Block Substitution Matrix
Using known alignments, calculate likelihood that one amino acid is substituted with another

Generated by eliminating sequences with higher than a given similarity
Find frequency of each amino acid combination
Create ratio of occurrences of each of the pairs
Calculate log-odds of each pair - 2 * log of observed\expected (How likely is a match just random chance?)

Each entry in the matrix corresponds to the probability of one amino acid mutating into another

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Gaps in Sequence Matching

A

Often good matches are found with insertions/deletions from a string
Linear Model - Poor - Small gaps just as likely as large gaps

Instead use edit distance/levenshtein distance
Local Matching/Global Matching

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Dynamic programming for inexact string matches

A

Not possible to exhaustively try all solutions
Build set of optimal partial solutions

Not dynamic or programming

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Complexity of Dynamic Programming Approach

A

Need to compute cost at each point in the forward algorithm so O(n^2)
Only one calculation needed at each column in the backward algorithm so O(n)

Brute force method would require O(2^n)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Needleman Wunsch

A

Global alignment
Minimum edit distance between two strings
Assumes linear gap cost for insertions/deletions
Cost of substitutions taken from BLOSOM-50 matrix
Matrix with score for best alignment

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

NW Complexity

A
Time = Theta(MN) - Not too bad
Space = Theta(MN) - DNA has sequences which are millions of bases long so will become an issue
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Hirschberg Algorithm

A

Recursively splits up the matching
Linear in space
Still quadratic in time - Still almost doubles the amount of time needed

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Differences between global and local matching

A

Global - Aligning all of the sequences with each other - NW

Local - Only finding the best matching sub sequences - Some sections are accidentally copied - SW

17
Q

Smith-Waterman

A

Also dynamic programming approach
Adds option of ignoring a match (cost 0) to NW
Optimal for local matching but slow on long strings

18
Q

SW repeated and overlapping matches

A

SW finds the single best match
SW can also be adapted to deal with this
Unsymmetric as looking for repeats of one section within another.

19
Q

Gaps in local matching

A

Insertion - Adding another subsequence from elsewhere
Deletion - Jumping bases or treating an exon as an intron

Both of these are rare events but lead to large gaps - Having a linear gap cost is unrealistic

20
Q

Gap cost function evaluated over all previous possible gap sizes

A

Time Complexity = O(n^3)

21
Q

Affine Gap Cost Function

A
y(g) = -d - (g-1) e
d = 12 e = 3 gives a good model
Punishes all gaps but larger gaps are punished slightly more
Can be O(n^2)
Requires 3 matrices
22
Q

Affine gap cost function - Finite State Automata

A

Each state represents a matrix

Possible for an approximation of the affine gap cost with a two state model

23
Q

Inexact matching in practice - Proteins

A

O(mn) possible - hundreds of amino acids long
Over 200 million sequenced
Difficult to compare sequence with each o ne

24
Q

Inexact matching in practice - DNA

A

100000 bases long
O(mn) impossible
Need to do approximate matching

25
FASTA
FAST ALL (DNA and Protein alphabets) Not optimal - Use of heuristics so still good Find series of short non-overlapping subsequences known as words Match against database of candidate sequences to find regions with same offset Smith-Waterman against regions found BLAST mostly used now but FASTA file format still used
26
BLAST
Basic Local Alignment Search Tool Not optimal - uses heuristics so still good Removes low complexity/repeated regions K-letter exact matches? Only evaluate most significant matches Find gaps by combining high scoring matches Finishing off with Smith Waterman Need to know how significant matches are
27
Significance Scoring
BLOSOM50 uses log odds ratio For local matching need probability of a match given sequences a and b. Can use bayes rule here EVDs
28
Extreme Value Distributions
How likely it is that a particular score was random rather than due to a match More accurate with longer strings Characterised as a Gumbel distribution