Sequence alignment Flashcards
1
Q
Search for sequence similarity
A
- one of the most common tasks
- why? some amino acids may be replaced by other amino acids
- similar may equal to similar function
- gaps may occur
- insertions and deletions [-]
2
Q
Algorithms for sequence similarity
A
- able to find the best alignments
- deal with gaps and mismatches
- not all mismatches are the same, some are more important and should hold more weight
- matches should also have a different weight on the alignment [: identical, . similar]
3
Q
Evaluating the best alignment
A
- scoring/substitution matrix is needed, score for every pair of aminoacids
- objective function, computational evaluation of the alignment
- ∑(i=1 ->L) s(a_i, b_i ) − ∑(j=1 ->G) γ+δ(len(j)−1))
- s score between aminoacids
- L lenght of alignment
- G
- γ fixed penalty
- δ
- algorithm maximizes the value of the obj function
4
Q
Types of scoring matrices?
A
- PAM matrix
- Margaret Dayhoff, 1978
- calculate the probability in a small evolutionary step (1% of amino acid substitutions) that any amino acid would change or stay the same [PAM1]
- extended to longer evolutionary periods (240 steps, PAM240)
- Blosum matrix
- used BLOCK database, alignments of most conserved regions of protein families
- only sequences with higher percentage of identity used
- blocks with 60% identity [Blosum60]
- for both used the log of the ratio divided by the probability of a casual alignment
- indicates probability for the pairing not to be casual but because of homology
- s(a,b)= int(10*log(M(a,b)/C(a,b)))
- M probability matrix
- C casual probability matrix
5
Q
Dot matrix
A
- easy way to visualize alignments
- black dots indicate matches
- white no matches
6
Q
Sequence alignment by dynamic programming
A
- automatically calculate the path within the dot plot, giving the top score (best alignment)
- alignments cannot go back, search only three positions (bottom, right and bottom-right diagonal)
- going from top left to right ensures that we always know the score of the alignment for all possible options. We choose the highest
7
Q
Smith & Waterman algorithm
A
- based on dynamic programming
- returns best alignment
- O(n*m), n and m lenghts of sequences
- can be used both on amino acid and nucleic acid sequences
- too slow to search in a large database
- constructs a scoring matrix used to trace-back to possible best alignment
- any negative region set to 0
8
Q
Heuristics approaches for pairwise sequence alignment
A
- faster than dynamic programming
- return excellent alignments, no guarantee of optimum
- based of search for words (k-tuple)
- useful in large-scale searches
9
Q
Heuristics general strategy
A
- Two main steps
- lookup table, k-mers indexed and positions saved
- read second sequence and for every k-mer check were it occurs in table, try to extend alignment
- kmers can be converted to indegers
- a^k possible words
- A=00, C=01, G=10, T=11
- sequence of coded basis is converted to integer
- sequence of integers is converted to another integer
10
Q
FASTA
A
- one sequence by axis at top left
- growing diagonals
- identify regions of identity
- scan regions with scoring matrix and save best initial regions
- join regions with scores > threshold
- recalculate and optimize alignments around highest initial scoring region
- lenght is ktup, number of amino acids considered
- for nucleic acids is generally 6
11
Q
BLAST
A
- most popular
- produces an index of similar words < threshold, not only identical
- tries to extend locally each “seed”, to obtain a highscoring segment pair
12
Q
BLAT
A
- similar to Blast but keeps index of entire genome in memory
- non-overlapping 11-mers
- designed to quickly find sequences of 95% and greater similarity of length 40 bases or more
- may miss more divergent or shorter sequence alignments