8. BLAST Flashcards
What pairwise alignment approaches are there?
dynamic programming (local, global)
heuristics
Dynamic programming approaches for pairwise alignment:
What do they guarantee?
Are they feasible?
guarantee the optimal alignment for two sequences
Work for small applications, but not feasible for big datasets - too slow or memory intensive for many applications
However, part of many approaches/programs!
What are two heuristics for alignment that are based on dynamic programming?
BLAST, FASTA.
BLAST is a heuristic.
What are two dynamic programming algorithms which BLAST takes some inspiration from?
Can they be used for genomescale alignments or big datasets?
1st DP for sequence analysis: NeedlemanWunsch (global alignments)
Then: SmithWaterman (local alignments).
too computationally intensive for big datasets
Heuristics make alignments practical for big databases/ datasets eg:
- NCBI
- from NextGeneration Sequencing (NGS).
Heuristic approaches for pairwise alignment (vs DP):
Basic idea to speed up?
exclude unpromising regions to speed up the search
Heuristic approaches for pairwise alignment (vs DP):
Name 4 approaches to allow speeding up?
Name 3 Applications?
Approaches:
- suffix trees
- dotplots
- pre-processing
- using substitution matrices, etc
Applications:
- lots of short NGS reads & reference: read mapping
- very long sequences: genome alignment
- lots of (gene) sequences: database search
MUMmer: genome alignment
stands for?
Maximal Unique Matches between strains: series of matches in the same order, translocations
MUMmer: genome alignment
goal?
data structure used?
goal:
globally compare two (similar) genomes
data structure: suffix tree
MUMmer: genome alignment
use/aim?
approach?
speed?
used for comparing different genomes assemblies to one another, which allows scientists to determine how a genome has changed
approach:
* construct a suffix tree of a reference sequence
* stream a query sequence against the suffix tree
* identify short exact matches between the two sequences
- use these directly
- extend to longer inexact alignments
* very fast!
Sequence alignment starting with:
one sequence vs many sequences
we want to?
database search - find database sequences that are similar (homologous) to the query sequence:
- identify similar (regions between) sequences
- collect related sequences for comparative analysis
- (taxonomically) identify a sequence (fragment)
- …
What are the results of a database search?
- alignment(s) between a query sequence and one or more database sequences
- ranked by statistical significance (E-value)
BLAST
basic steps that you do before BLAST search?
You:
* select query sequence
* select database to be searched
* pre-process / format database
* decide which BLAST program to use
* select parameters for search
* execute BLAST search
* interpret biological significance
BLAST:
basic steps that BLAST does?
- seeding
- extension to a good longer alignment
- evaluation of statistical significance
- presentation of ranked alignments
What BLAST programs are there?
QUERY / DATABASE / PROGRAM nucleotide / nucleotide / blastn nucleotide (translated) / nucleotide (translated) / tblastx nucleotide (translated) / peptide / blastx peptide / peptide / blastp peptide / nucleotide (translated) / tblastn
What BLAST variants are there?
- MEGABLAST: find highly similar DNA sequences
- PSI-BLAST: find distant members of a protein family or build a custom position-specific score matrix
- and many more
How BLAST works: seeding
What are used as seeds and why?
compile short words from query that provide high scores, look for identical matches to these words in the database
the locations of all high scoring word neighbors (word hits) in the db sequences are identified and used as alignment seeds
Why?
- inexact matching is slow, exact matching is fast
- biologically significant matches contain short regions of identities or high-scoring matches
How BLAST works: seeding
What counts as a word hit?
- Query is broken down into short overlapping words - default: 11 nt or 3 (5) aa
- up to 50 high scoring word neighbors with minimum score T are determined for each word in the query
How BLAST works: seeding
eg which scoring matrix?
BLOSUM62
What is the most time-consuming step of BLAST?
extension
BLAST:
What is the name of the algorithm for extension?
two-hit algorithm (1997)