Sequence alignment Flashcards

1
Q

Search for sequence similarity

A
  • one of the most common tasks
  • why? some amino acids may be replaced by other amino acids
    • similar may equal to similar function
  • gaps may occur
    • insertions and deletions [-]
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Algorithms for sequence similarity

A
  • able to find the best alignments
  • deal with gaps and mismatches
    • not all mismatches are the same, some are more important and should hold more weight
    • matches should also have a different weight on the alignment [: identical, . similar]
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Evaluating the best alignment

A
  • scoring/substitution matrix is needed, score for every pair of aminoacids
  • objective function, computational evaluation of the alignment
    • ∑(i=1 ->L) s(a_i, b_i ) − ∑(j=1 ->G) γ+δ(len(j)−1))
    • s score between aminoacids
    • L lenght of alignment
    • G
    • γ fixed penalty
    • δ
  • algorithm maximizes the value of the obj function
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Types of scoring matrices?

A
  • PAM matrix
    • Margaret Dayhoff, 1978
    • calculate the probability in a small evolutionary step (1% of amino acid substitutions) that any amino acid would change or stay the same [PAM1]
    • extended to longer evolutionary periods (240 steps, PAM240)
  • Blosum matrix
    • used BLOCK database, alignments of most conserved regions of protein families
    • only sequences with higher percentage of identity used
    • blocks with 60% identity [Blosum60]
  • for both used the log of the ratio divided by the probability of a casual alignment
    • indicates probability for the pairing not to be casual but because of homology
    • s(a,b)= int(10*log(M(a,b)/C(a,b)))
    • M probability matrix
    • C casual probability matrix
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Dot matrix

A
  • easy way to visualize alignments
    • black dots indicate matches
    • white no matches
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Sequence alignment by dynamic programming

A
  • automatically calculate the path within the dot plot, giving the top score (best alignment)
  • alignments cannot go back, search only three positions (bottom, right and bottom-right diagonal)
  • going from top left to right ensures that we always know the score of the alignment for all possible options. We choose the highest
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Smith & Waterman algorithm

A
  • based on dynamic programming
  • returns best alignment
  • O(n*m), n and m lenghts of sequences
  • can be used both on amino acid and nucleic acid sequences
  • too slow to search in a large database
  • constructs a scoring matrix used to trace-back to possible best alignment
    • any negative region set to 0
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Heuristics approaches for pairwise sequence alignment

A
  • faster than dynamic programming
    • return excellent alignments, no guarantee of optimum
    • based of search for words (k-tuple)
    • useful in large-scale searches
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Heuristics general strategy

A
  • Two main steps
    • lookup table, k-mers indexed and positions saved
    • read second sequence and for every k-mer check were it occurs in table, try to extend alignment
  • kmers can be converted to indegers
    • a^k possible words
    • A=00, C=01, G=10, T=11
    • sequence of coded basis is converted to integer
    • sequence of integers is converted to another integer
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

FASTA

A
  • one sequence by axis at top left
    • growing diagonals
    • identify regions of identity
    • scan regions with scoring matrix and save best initial regions
    • join regions with scores > threshold
    • recalculate and optimize alignments around highest initial scoring region
  • lenght is ktup, number of amino acids considered
    • for nucleic acids is generally 6
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

BLAST

A
  • most popular
  • produces an index of similar words < threshold, not only identical
  • tries to extend locally each “seed”, to obtain a highscoring segment pair
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

BLAT

A
  • similar to Blast but keeps index of entire genome in memory
  • non-overlapping 11-mers
  • designed to quickly find sequences of 95% and greater similarity of length 40 bases or more
  • may miss more divergent or shorter sequence alignments
How well did you know this?
1
Not at all
2
3
4
5
Perfectly