8. BLAST Flashcards
What pairwise alignment approaches are there?
dynamic programming (local, global)
heuristics
Dynamic programming approaches for pairwise alignment:
What do they guarantee?
Are they feasible?
guarantee the optimal alignment for two sequences
Work for small applications, but not feasible for big datasets - too slow or memory intensive for many applications
However, part of many approaches/programs!
What are two heuristics for alignment that are based on dynamic programming?
BLAST, FASTA.
BLAST is a heuristic.
What are two dynamic programming algorithms which BLAST takes some inspiration from?
Can they be used for genomescale alignments or big datasets?
1st DP for sequence analysis: NeedlemanWunsch (global alignments)
Then: SmithWaterman (local alignments).
too computationally intensive for big datasets
Heuristics make alignments practical for big databases/ datasets eg:
- NCBI
- from NextGeneration Sequencing (NGS).
Heuristic approaches for pairwise alignment (vs DP):
Basic idea to speed up?
exclude unpromising regions to speed up the search
Heuristic approaches for pairwise alignment (vs DP):
Name 4 approaches to allow speeding up?
Name 3 Applications?
Approaches:
- suffix trees
- dotplots
- pre-processing
- using substitution matrices, etc
Applications:
- lots of short NGS reads & reference: read mapping
- very long sequences: genome alignment
- lots of (gene) sequences: database search
MUMmer: genome alignment
stands for?
Maximal Unique Matches between strains: series of matches in the same order, translocations
MUMmer: genome alignment
goal?
data structure used?
goal:
globally compare two (similar) genomes
data structure: suffix tree
MUMmer: genome alignment
use/aim?
approach?
speed?
used for comparing different genomes assemblies to one another, which allows scientists to determine how a genome has changed
approach:
* construct a suffix tree of a reference sequence
* stream a query sequence against the suffix tree
* identify short exact matches between the two sequences
- use these directly
- extend to longer inexact alignments
* very fast!
Sequence alignment starting with:
one sequence vs many sequences
we want to?
database search - find database sequences that are similar (homologous) to the query sequence:
- identify similar (regions between) sequences
- collect related sequences for comparative analysis
- (taxonomically) identify a sequence (fragment)
- …
What are the results of a database search?
- alignment(s) between a query sequence and one or more database sequences
- ranked by statistical significance (E-value)
BLAST
basic steps that you do before BLAST search?
You:
* select query sequence
* select database to be searched
* pre-process / format database
* decide which BLAST program to use
* select parameters for search
* execute BLAST search
* interpret biological significance
BLAST:
basic steps that BLAST does?
- seeding
- extension to a good longer alignment
- evaluation of statistical significance
- presentation of ranked alignments
What BLAST programs are there?
QUERY / DATABASE / PROGRAM nucleotide / nucleotide / blastn nucleotide (translated) / nucleotide (translated) / tblastx nucleotide (translated) / peptide / blastx peptide / peptide / blastp peptide / nucleotide (translated) / tblastn
What BLAST variants are there?
- MEGABLAST: find highly similar DNA sequences
- PSI-BLAST: find distant members of a protein family or build a custom position-specific score matrix
- and many more
How BLAST works: seeding
What are used as seeds and why?
compile short words from query that provide high scores, look for identical matches to these words in the database
the locations of all high scoring word neighbors (word hits) in the db sequences are identified and used as alignment seeds
Why?
- inexact matching is slow, exact matching is fast
- biologically significant matches contain short regions of identities or high-scoring matches
How BLAST works: seeding
What counts as a word hit?
- Query is broken down into short overlapping words - default: 11 nt or 3 (5) aa
- up to 50 high scoring word neighbors with minimum score T are determined for each word in the query
How BLAST works: seeding
eg which scoring matrix?
BLOSUM62
What is the most time-consuming step of BLAST?
extension
BLAST:
What is the name of the algorithm for extension?
two-hit algorithm (1997)
BLAST:
What can be discarded before extension and why?
Which segment pairs are extended?
- most database sequences can be discarded before the extension step (if they don’t have any segment pairs)
- segment pairs on the same diagonal and within c cells of each other are extended in both directions
BLAST:
when does extension stop?
extension proceeds until drop-off from highest score is too great:
- X: value of how much the score is allowed to drop off since the last maximum
- if X is reached, the alignment is trimmed back to the point with the maximum score
BLAST: What happens with extended regions once extension has stopped?
extended regions are joined to form a gapped alignment, or high-scoring segment pair (HSP)
What are HSPs and how are they evaluated?
High-scoring segment pair (HSP): extended regions joined to form a gapped alignment
each HSP score is evaluated
- the statistical significance of HSPs are determined
- are two sequence fragments significantly more similar than expected by chance?
- HSPs are ranked by their E(xpect)-value and reported
BLAST: Evaluation
Which scores need to be evaluated?
What scores does BLAST compare them to and how are these obtained?
each HSP has associated score
- but how good is this score?
- is it significantly better than for random sequences?
–> We need to compare the score of an HSP with scores of random sequences of equal length & composition.
We could:
- evaluate empirically
- evaluate analytically
Empirically not done in practice!
In Evaluation of an alignment, BLAST needs to compare the score of an HSP with scores of random sequences of equal length & composition.
Using what distribution are these scores calculated?
derive theoretically
* best scores: follow Gumbel extreme value distribution
* use to compute the probability of obtaining a score equal to or greater than a given score by chance
* Karlin & Altschul, 1990. PNAS
For an HSP, how is the raw score calculated?
S = ∑sij
Paramters eg:
- A substitution matrix (eg BLOSUM62)
- A gap opening penalty (eg -12)
- A gap extension penalty (eg -1)
Consider an alignment raw score S = ∑sij
What is the formula for the alignment bit score?
S’ = [ (λ x S) - ln(K) ] / ln2
Consider an alignment raw score S = ∑sij
What is the formula for the E-value using the raw score?
E = Kmne-λS
Consider an alignment bit score S’ = [ (λ x S) - ln(K) ] / ln2
What is the formula for the E-value using the bit score?
E = mn(2 - S’)
BLAST Score probabilities
What three scores are there?
raw, bit, and E-value
BLAST Score probabilities
What is λ?
normalizing factor for the scoring system
BLAST Score probabilities
What is K?
the K parameter scales the E-value based on the database and sequence lengths.
(alignments starting at different places in two sequences may be highly correlated)
What is the E-Value?
Expect value (E)
* under comparable conditions: expected no. of matches by chance with S’ ≥ S’obs
* can be much smaller or greater than 1
* there may be many or no matches with E «_space;1, depending on homologs in the database
The smaller the E-value, the better the match.
What does the E-Value depend on?
depends on
- alignment score
- length of query (m)
- size of database (n)
What do we need to consider regarding scoring between different databases?
The E-Value cannot be compared across searches of different
databases
What do we know about values E can take?
can be much smaller or greater than 1
there may be many or no matches with E «_space;1, depending on homologs in the database
How is the database size denoted?
What effect does it have on the E-value and why?
n
if n increases, the E-value increases (worse E-value)
–> a sequence hit would get a better E-value when present in a smaller database
Why: large databases increase the chance of false positive hits, the E-value corrects for the higher chance
How is the sequence length denoted?
What effect does it have on the E-value?
What happens with very small sequence lengths?
m
according to equation for E, if m increases, E increases
However, short identical sequence may have a high E-value and may be regarded as “false positive” hits.
This is often seen if one searches for short primer regions, small domain regions etc
What are low-complexity regions?
(LCRs)
alignment statistics require that symbols occur randomly in strings
long substrings of one or a few symbols violate this assumption
When / why should LCRs be filtered?
in homology search
to improve sensitivity and specificity and to avoid misinterpretation / artifacts
How does BLAST deal with LCRs
Masking low-complexity regions:
sequences are pre-processed to identify LCRs.
LCR are masked –>
* they do not contribute to alignment or score
* they appear as X’s in the BLAST alignment
tools:
- DUST for DNA,
- SEG for protein sequences
other types of repeats may also be masked by BLAST
How can you increase sensitivity in BLAST?
Smaller Word Size:
–> requires exact matches of shorter subsequences
Lower T- value (neighborhood word threshold T)
A lower value of T increases probability of finding weak similarities.
decrease –> number of neighboring words will increase
Increase E-value threshold
allowing matches with lower significance to be reported. (However, may also increase false-positive matches)
Relaxing Gap Penalties:
extending or opening gaps more easily - allowing for more flexible alignments. (However, maybe more false positives)
Different substitution matrix:
Choosing a less specific scoring matrix (e.g BLOSUM with lower number) –> better reflecting evolutionary divergence between sequences.
How can you increase specificity in BLAST?
larger Word Size:
allowing more mismatches (However, may reduce sensitivity)
Reduce E-value threshold:
filtering out matches with lower significance. (may reduce sensitivity)
Scoring Matrix: Choosing a more specific scoring matrix (eg BLOSUM with higher number)
Gap Penalties:
extending or opening gaps less easily - stricter alignments
How can you increase the speed of a BLAST search?
Decrease the sensitivity with following paramters:
Word size ??
E-value threshold - raise
Database size: smaller, or prefilter:
Search space size: increase neighborhood word threshold T –> number of neighboring words will drop and thus limit the search space
EXAM QUESTION
4 parameter to increase sensitivity of BLAST (2022)
Default BLAST parameters are a good compromise between speed and sensitivity. List 4 parameters which you can change in a BLAST search in order to increase sensitivity. (2019)
4 BLAST parameter, how to change to increase sensitivity (or specificity (2020)
Word size
T- value
E-value threshold
Gap Penalties:
Different substitution matrix:
Word size
increase sensitivity: larger
increase specificity: smaller
increase speed: smaller
T- value
increase sensitivity: lower
increase specificity: higher
increase speed: higher
Gap Penalties:
increase sensitivity: lower
increase specificity: higher
increase speed: higher
Different substitution matrix:
increase sensitivity: eg BLOSUM with lower number
increase specificity: BLOSUM with higher number
increase speed: BLOSUM with higher number
E-value threshold
increase sensitivity: higher
increase specificity: lower
increase speed: lower
What effect does word size have on sensitivity / specificity / speed?
Word size
increase sensitivity: larger
increase specificity: smaller
increase speed: smaller
What effect does T-value have on sensitivity / specificity / speed?
T- value
increase sensitivity: lower
increase specificity: higher
increase speed: higher
What effect does the E-value threshold have on sensitivity / specificity / speed?
E-value threshold
increase sensitivity: higher
increase specificity: lower
increase speed: lower
What effect do gap penalties have on sensitivity / specificity / speed?
Gap Penalties:
increase sensitivity: lower
increase specificity: higher
increase speed: higher
What effect do different substitution matrices have on sensitivity / specificity / speed?
Different substitution matrix:
increase sensitivity: eg BLOSUM with lower number
increase specificity: BLOSUM with higher number
increase speed: BLOSUM with higher number
What can you say about statistical vs. biological significance of BLAST results?