Matching Flashcards

1
Q

Why are sequences matched?

A

Protein sequences are similar between related species

Can transfer knowledge about known sequences to new sequences

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Alignment

A

First task after sequencing

Finds best match at DNA and protein level

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Difference between protein and DNA mutations

A

Selective pressure affects DNA and proteins differently
Protein mutations - Can lead to in viable offspring - Highly conserved for protein sequences
DNA mutations - Can have more mutations with no changes in function - Less conserved than proteins

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Conserved Sequences

A

Conserved:Does not change between different generations, species or even kingdoms
More important sequences are generally more conserved

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Universally Conserved Sequences

A

Found in all living things
Proteins for transcription and translation

Conserved from the last “universal common ancestor” of all life

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Degeneracy in genetic code

A

Can change whilst still coding for the same amino acid

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Use of conservation in sequence similarity

A

More similar amino acids are more likely to be substituted

Used to work out how similar two sequences are

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Histones

A

Highly conserved

Involved in DNA replication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

BLOSOM 50

A

Block Substitution Matrix
Using known alignments, calculate likelihood that one amino acid is substituted with another

Generated by eliminating sequences with higher than a given similarity
Find frequency of each amino acid combination
Create ratio of occurrences of each of the pairs
Calculate log-odds of each pair - 2 * log of observed\expected (How likely is a match just random chance?)

Each entry in the matrix corresponds to the probability of one amino acid mutating into another

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Gaps in Sequence Matching

A

Often good matches are found with insertions/deletions from a string
Linear Model - Poor - Small gaps just as likely as large gaps

Instead use edit distance/levenshtein distance
Local Matching/Global Matching

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Dynamic programming for inexact string matches

A

Not possible to exhaustively try all solutions
Build set of optimal partial solutions

Not dynamic or programming

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Complexity of Dynamic Programming Approach

A

Need to compute cost at each point in the forward algorithm so O(n^2)
Only one calculation needed at each column in the backward algorithm so O(n)

Brute force method would require O(2^n)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Needleman Wunsch

A

Global alignment
Minimum edit distance between two strings
Assumes linear gap cost for insertions/deletions
Cost of substitutions taken from BLOSOM-50 matrix
Matrix with score for best alignment

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

NW Complexity

A
Time = Theta(MN) - Not too bad
Space = Theta(MN) - DNA has sequences which are millions of bases long so will become an issue
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Hirschberg Algorithm

A

Recursively splits up the matching
Linear in space
Still quadratic in time - Still almost doubles the amount of time needed

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Differences between global and local matching

A

Global - Aligning all of the sequences with each other - NW

Local - Only finding the best matching sub sequences - Some sections are accidentally copied - SW

17
Q

Smith-Waterman

A

Also dynamic programming approach
Adds option of ignoring a match (cost 0) to NW
Optimal for local matching but slow on long strings

18
Q

SW repeated and overlapping matches

A

SW finds the single best match
SW can also be adapted to deal with this
Unsymmetric as looking for repeats of one section within another.

19
Q

Gaps in local matching

A

Insertion - Adding another subsequence from elsewhere
Deletion - Jumping bases or treating an exon as an intron

Both of these are rare events but lead to large gaps - Having a linear gap cost is unrealistic

20
Q

Gap cost function evaluated over all previous possible gap sizes

A

Time Complexity = O(n^3)

21
Q

Affine Gap Cost Function

A
y(g) = -d - (g-1) e
d = 12 e = 3 gives a good model
Punishes all gaps but larger gaps are punished slightly more
Can be O(n^2)
Requires 3 matrices
22
Q

Affine gap cost function - Finite State Automata

A

Each state represents a matrix

Possible for an approximation of the affine gap cost with a two state model

23
Q

Inexact matching in practice - Proteins

A

O(mn) possible - hundreds of amino acids long
Over 200 million sequenced
Difficult to compare sequence with each o ne

24
Q

Inexact matching in practice - DNA

A

100000 bases long
O(mn) impossible
Need to do approximate matching

25
Q

FASTA

A

FAST ALL (DNA and Protein alphabets)
Not optimal - Use of heuristics so still good
Find series of short non-overlapping subsequences known as words
Match against database of candidate sequences to find regions with same offset
Smith-Waterman against regions found
BLAST mostly used now but FASTA file format still used

26
Q

BLAST

A

Basic Local Alignment Search Tool
Not optimal - uses heuristics so still good
Removes low complexity/repeated regions
K-letter exact matches?
Only evaluate most significant matches
Find gaps by combining high scoring matches
Finishing off with Smith Waterman

Need to know how significant matches are

27
Q

Significance Scoring

A

BLOSOM50 uses log odds ratio
For local matching need probability of a match given sequences a and b. Can use bayes rule here
EVDs

28
Q

Extreme Value Distributions

A

How likely it is that a particular score was random rather than due to a match
More accurate with longer strings
Characterised as a Gumbel distribution