8. BLAST Flashcards

Question 1

Q

What pairwise alignment approaches are there?

Answer

A

dynamic programming (local, global)

heuristics

Question 2

Q

Dynamic programming approaches for pairwise alignment:

What do they guarantee?

Are they feasible?

Answer

A

guarantee the optimal alignment for two sequences

Work for small applications, but not feasible for big datasets - too slow or memory intensive for many applications

However, part of many approaches/programs!

Question 3

Q

What are two heuristics for alignment that are based on dynamic programming?

Answer

A

BLAST, FASTA.

Question 4

Q

BLAST is a heuristic.

What are two dynamic programming algorithms which BLAST takes some inspiration from?

Can they be used for genome-scale alignments or big datasets?

Answer

A

1st DP for sequence analysis: Needleman-Wunsch (global alignments)

Then: Smith-Waterman (local alignments).

No: both are too computationally intensive for big datasets

Heuristics make alignments practical for big databases/datasets, eg:
- NCBI
- from Next-Generation Sequencing (NGS).
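To see why dynamic programming does not scale, here is a minimal sketch of the Smith-Waterman score computation. The scoring values (match, mismatch, linear gap penalty) are illustrative assumptions, not taken from the cards. The nested loops fill one cell per pair of positions, so time grows with the product of the two sequence lengths.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (toy scores, linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous DP row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current DP row
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below 0
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # shared "ACG" gives 6
```

Every one of the len(a) × len(b) cells is visited, which is what heuristics such as BLAST avoid.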

Question 5

Q

Heuristic approaches for pairwise alignment (vs DP):

Basic idea to speed up?

Answer

A

exclude unpromising regions to speed up the search

Question 6

Q

Heuristic approaches for pairwise alignment (vs DP):

Name 4 approaches to allow speeding up?

Name 3 Applications?

Answer

A

Approaches:
- suffix trees
- dotplots
- pre-processing
- using substitution matrices, etc

Applications:
- lots of short NGS reads & reference: read mapping
- very long sequences: genome alignment
- lots of (gene) sequences: database search

Question 7

Q

MUMmer: genome alignment

stands for?

Answer

A

Maximal Unique Matches between strains: series of matches in the same order, translocations

Question 8

Q

MUMmer: genome alignment

goal?

data structure used?

Answer

A

goal:
globally compare two (similar) genomes

data structure: suffix tree

Question 9

Q

MUMmer: genome alignment

use/aim?

approach?

speed?

Answer

A

used for comparing different genome assemblies to one another, which allows scientists to determine how a genome has changed

approach:
* construct a suffix tree of a reference sequence
* stream a query sequence against the suffix tree
* identify short exact matches between the two sequences
- use these directly
- extend to longer inexact alignments
* very fast!
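The idea of anchoring on exact matches can be sketched as follows. Note the swap: MUMmer uses a suffix tree, whereas this toy version uses a plain k-mer dictionary as a stand-in. Function names and the parameter k are hypothetical.

```python
from collections import defaultdict


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its position."""
    pos = defaultdict(list)
    for i in range(len(seq) - k + 1):
        pos[seq[i:i + k]].append(i)
    return {kmer: p[0] for kmer, p in pos.items() if len(p) == 1}


def maximal_unique_matches(ref, qry, k):
    """Exact matches of length >= k, unique in both sequences, fully extended."""
    ref_u = unique_kmer_positions(ref, k)
    qry_u = unique_kmer_positions(qry, k)
    mums = set()
    for kmer, i in ref_u.items():
        j = qry_u.get(kmer)
        if j is None:
            continue
        start_r, start_q = i, j
        # Extend to the left while characters agree
        while start_r > 0 and start_q > 0 and ref[start_r - 1] == qry[start_q - 1]:
            start_r -= 1
            start_q -= 1
        end_r, end_q = i + k, j + k
        # Extend to the right while characters agree
        while end_r < len(ref) and end_q < len(qry) and ref[end_r] == qry[end_q]:
            end_r += 1
            end_q += 1
        mums.add((start_r, start_q, end_r - start_r))
    return sorted(mums)  # (ref_start, qry_start, length)


print(maximal_unique_matches("ACGTTGCA", "TTACGTTGG", 4))  # [(0, 2, 6)]
```

A suffix tree finds the same kind of matches without fixing k in advance and in time linear in the sequence lengths, which is why it suits whole genomes.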

Question 10

Q

Sequence alignment starting with:

one sequence vs many sequences

we want to?

Answer

A

database search - find database sequences that are similar (homologous) to the query sequence:

identify similar (regions between) sequences
collect related sequences for comparative analysis
(taxonomically) identify a sequence (fragment)
…

Question 11

Q

What are the results of a database search?

Answer

A

alignment(s) between a query sequence and one or more database sequences
ranked by statistical significance (E-value)

Question 12

Q

BLAST

basic steps that you do before BLAST search?

Answer

A

You:
* select query sequence
* select database to be searched
* pre-process / format database
* decide which BLAST program to use
* select parameters for search
* execute BLAST search
* interpret biological significance

Question 13

Q

BLAST:

basic steps that BLAST does?

Answer

A

seeding
extension to a good longer alignment
evaluation of statistical significance
presentation of ranked alignments

Question 14

Q

What BLAST programs are there?

Answer

A

QUERY                   / DATABASE                / PROGRAM
nucleotide              / nucleotide              / blastn
nucleotide (translated) / nucleotide (translated) / tblastx
nucleotide (translated) / peptide                 / blastx
peptide                 / peptide                 / blastp
peptide                 / nucleotide (translated) / tblastn
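The table can be read as a lookup from query type, database type and translation choice to a program name. A minimal sketch (the function and argument names are hypothetical):

```python
def choose_blast_program(query_type, db_type, translated=False):
    """Pick a BLAST program; types are 'nucleotide' or 'peptide'.

    translated only matters for nucleotide vs nucleotide searches,
    where it selects tblastx instead of blastn.
    """
    if query_type == "nucleotide" and db_type == "nucleotide":
        return "tblastx" if translated else "blastn"
    if query_type == "nucleotide" and db_type == "peptide":
        return "blastx"    # query is translated
    if query_type == "peptide" and db_type == "peptide":
        return "blastp"
    if query_type == "peptide" and db_type == "nucleotide":
        return "tblastn"   # database is translated
    raise ValueError("types must be 'nucleotide' or 'peptide'")


print(choose_blast_program("peptide", "nucleotide"))  # tblastn
```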

Question 15

Q

What BLAST variants are there?

Answer

A

MEGABLAST: find highly similar DNA sequences
PSI-BLAST: find distant members of a protein family or build a custom position-specific score matrix
and many more

Question 16

Q

How BLAST works: seeding

What are used as seeds and why?

Answer

A

compile short words from query that provide high scores, look for identical matches to these words in the database

the locations of all high scoring word neighbors (word hits) in the db sequences are identified and used as alignment seeds

Why?
- inexact matching is slow, exact matching is fast
- biologically significant matches contain short regions of identities or high-scoring matches

Question 17

Q

How BLAST works: seeding

What counts as a word hit?

Answer

A

Query is broken down into short overlapping words - default: 11 nt or 3 (5) aa
up to 50 high scoring word neighbors with minimum score T are determined for each word in the query
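A small sketch of word generation and neighborhood scoring. The scoring function and the four-letter alphabet are toy assumptions. A real search would use a substitution matrix such as BLOSUM62 and would not enumerate all candidates this naively.

```python
from itertools import product


def words(query, w):
    """Overlapping words of length w with their query positions."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def toy_score(a, b):
    """Toy substitution score (assumption): +4 for identity, -1 otherwise."""
    return 4 if a == b else -1


def neighbors(word, alphabet, score, T):
    """All words scoring at least T against word, best first."""
    out = []
    for cand in product(alphabet, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, cand))
        if s >= T:
            out.append(("".join(cand), s))
    return sorted(out, key=lambda x: -x[1])


print(words("ACDE", 3))                                # [(0, 'ACD'), (1, 'CDE')]
print(len(neighbors("ACD", "ACDE", toy_score, T=7)))   # 10
```

With T=7 the neighborhood holds the word itself (score 12) plus the nine words with one mismatch (score 7). Raising T shrinks the list, lowering T grows it.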

Question 18

Q

How BLAST works: seeding

eg which scoring matrix?

Question 19

Q

What is the most time-consuming step of BLAST?

Answer

A

extension

Question 20

Q

BLAST:

What is the name of the algorithm for extension?

Answer

A

two-hit algorithm (1997)

Question 21

Q

BLAST:

What can be discarded before extension and why?
Which segment pairs are extended?

Answer

A

most database sequences can be discarded before the extension step (if they don’t have any segment pairs)
segment pairs on the same diagonal and within c cells of each other are extended in both directions
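The diagonal test can be sketched as below. A hit is a pair (query position, database position), and its diagonal is the difference of the two. The requirement that the two hits do not overlap is an assumption added here. The names `word_size` and `window` are hypothetical, with `window` playing the role of c.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_size, window):
    """Pairs of word hits on the same diagonal within `window` positions.

    hits: iterable of (query_pos, db_pos).
    Returns sorted (diagonal, first_query_pos, second_query_pos) tuples.
    """
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(q)       # same diagonal <=> same offset
    seeds = []
    for diag, qs in by_diag.items():
        qs.sort()
        for first, second in zip(qs, qs[1:]):
            dist = second - first
            # Assumption: hits must not overlap and must be close enough
            if word_size <= dist <= window:
                seeds.append((diag, first, second))
    return sorted(seeds)


hits = [(0, 5), (10, 15), (3, 20), (50, 55)]
print(two_hit_seeds(hits, word_size=3, window=20))  # [(5, 0, 10)]
```

The lone hit on diagonal 17 and the distant hit at query position 50 trigger no extension, which is how most database sequences are discarded cheaply.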

Question 22

Q

BLAST:

when does extension stop?

Answer

A

extension proceeds until drop-off from highest score is too great:
- X: value of how much the score is allowed to drop off since the last maximum
- if X is reached, the alignment is trimmed back to the point with the maximum score
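A minimal sketch of the drop-off rule for an ungapped extension to the right. The match/mismatch scores are toy assumptions. BLAST scores with a substitution matrix and also extends to the left.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend rightwards until the score falls x_drop below its maximum.

    Returns (best_score, best_length): the extension is trimmed back
    to the position where the maximum score was reached.
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break                      # drop-off X reached: stop extending
    return best, best_len


print(extend_right("AAAATTTT", "AAAACCCC", 0, 0, x_drop=2))  # (4, 4)
```

A larger X lets the extension cross a short mismatching stretch, as in `extend_right("AAGAA", "AATAA", 0, 0, x_drop=3)`, which returns `(3, 5)`.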

Question 23

Q

BLAST: What happens with extended regions once extension has stopped?

Answer

A

extended regions are joined to form a gapped alignment, or high-scoring segment pair (HSP)

Question 24

Q

What are HSPs and how are they evaluated?

Answer

A

High-scoring segment pair (HSP): extended regions joined to form a gapped alignment

each HSP score is evaluated
- the statistical significance of HSPs is determined
- are two sequence fragments significantly more similar than expected by chance?
- HSPs are ranked by their E(xpect)-value and reported

Question 25

Q

BLAST: Evaluation

Which scores need to be evaluated?

What scores does BLAST compare them to and how are these obtained?

Answer

A

each HSP has associated score
- but how good is this score?
- is it significantly better than for random sequences?

–> We need to compare the score of an HSP with scores of random sequences of equal length & composition.

We could:
- evaluate empirically
- evaluate analytically

Empirical evaluation is not done in practice!

Question 26

Q

In Evaluation of an alignment, BLAST needs to compare the score of an HSP with scores of random sequences of equal length & composition.

Using what distribution are these scores calculated?

Answer

A

derive theoretically
* best scores: follow Gumbel extreme value distribution
* use to compute the probability of obtaining a score equal to or greater than a given score by chance
* Karlin & Altschul, 1990. PNAS
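Under this model the probability of at least one chance hit with score ≥ S is P = 1 - exp(-E), where E = K m n e^(-λS) is the expected number of such hits. A short sketch follows. The values of λ and K in the usage line are illustrative placeholders, not taken from the cards.

```python
import math


def p_value_from_evalue(e_value):
    """P(at least one chance hit) = 1 - exp(-E), computed stably."""
    return -math.expm1(-e_value)


def gumbel_tail(score, m, n, lam, K):
    """P(best chance score >= score) for query length m and database size n."""
    e_value = K * m * n * math.exp(-lam * score)
    return p_value_from_evalue(e_value)


# Placeholder parameters (assumptions for illustration only)
print(gumbel_tail(score=80, m=250, n=1_000_000, lam=0.318, K=0.134))
```

For small E the two quantities are nearly equal (P ≈ E), which is why BLAST can report E-values without losing information.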

Question 27

Q

For an HSP, how is the raw score calculated?

Answer

A

S = ∑s_ij

Parameters eg:
- A substitution matrix (eg BLOSUM62)
- A gap opening penalty (eg -12)
- A gap extension penalty (eg -1)
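A sketch of summing the score over the columns of an alignment. The substitution function is a toy stand-in for a matrix such as BLOSUM62. The gap convention (first gap column costs the opening penalty, each further column the extension penalty) is an assumption, since conventions differ between tools.

```python
def toy_sub(x, y):
    """Toy substitution score (assumption): +5 identity, -4 mismatch."""
    return 5 if x == y else -4


def raw_score(aln_a, aln_b, sub_score, gap_open=-12, gap_extend=-1):
    """Raw score S of two aligned strings of equal length ('-' marks a gap)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have equal length")
    total = 0
    in_gap = False
    for x, y in zip(aln_a, aln_b):
        if x == "-" or y == "-":
            # First gap column pays the opening penalty, later ones extend
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += sub_score(x, y)
            in_gap = False
    return total


print(raw_score("AC--GT", "ACTTGT", toy_sub))  # 5+5-12-1+5+5 = 7
```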

Question 28

Q

Consider an alignment raw score S = ∑s_ij

What is the formula for the alignment bit score?

Answer

A

S’ = [ (λ x S) - ln(K) ] / ln2
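The conversion is a one-liner. The sketch below uses hypothetical argument names `lam` and `K` for λ and K.

```python
import math


def bit_score(raw, lam, K):
    """Normalised bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


# With lam = ln 2 and K = 1 the bit score equals the raw score
print(bit_score(10, lam=math.log(2), K=1.0))
```

Because λ and K are folded in, bit scores from searches with different scoring systems can be compared, whereas raw scores cannot.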

Question 29

Q

Consider an alignment raw score S = ∑s_ij

What is the formula for the E-value using the raw score?

Answer

A

E = K m n e^{-λS}
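As a quick numerical sketch (argument names are hypothetical; m is the query length and n the database size):

```python
import math


def evalue_from_raw(raw, m, n, lam, K):
    """Expected number of chance hits with raw score >= raw."""
    return K * m * n * math.exp(-lam * raw)


# Toy parameters chosen so the result is easy to check by hand:
# 1 * 10 * 100 * 2**-3 = 125
print(evalue_from_raw(3, m=10, n=100, lam=math.log(2), K=1.0))
```

E falls exponentially with the score but grows only linearly with m and n.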

Question 30

Q

Consider an alignment bit score S’ = [ (λ x S) - ln(K) ] / ln2

What is the formula for the E-value using the bit score?

Answer

A

E = mn(2^{- S’})
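The bit-score form follows from the raw-score form by substituting the definition of S’. A short derivation, using 2^{x / ln 2} = e^{x}:

```latex
\begin{align*}
E &= m\,n\,2^{-S'} \\
  &= m\,n\,2^{-(\lambda S - \ln K)/\ln 2} \\
  &= m\,n\,e^{-(\lambda S - \ln K)} \\
  &= K\,m\,n\,e^{-\lambda S}
\end{align*}
```

So both formulas give the same E. The bit score simply absorbs λ and K.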

Question 31

Q

BLAST Score probabilities

What three scores are there?

Answer

A

raw, bit, and E-value

Question 32

Q

BLAST Score probabilities

What is λ?

Answer

A

normalizing factor for the scoring system

Question 33

Q

BLAST Score probabilities

What is K?

Answer

A

K is a constant that scales the search space size (m × n) in the E-value formula.

(alignments starting at different places in two sequences may be highly correlated)

Question 34

Q

What is the E-Value?

Answer

A

Expect value (E)
* under comparable conditions: expected no. of matches by chance with S’ ≥ S’obs
* can be much smaller or greater than 1
* there may be many or no matches with E ≪ 1, depending on homologs in the database

The smaller the E-value, the better the match.

Question 35

Q

What does the E-Value depend on?

Answer

A

depends on
- alignment score
- length of query (m)
- size of database (n)

Question 36

Q

What do we need to consider regarding scoring between different databases?

Answer

A

The E-Value cannot be compared across searches of different databases

Question 37

Q

What do we know about values E can take?

Answer

A

can be much smaller or greater than 1

there may be many or no matches with E ≪ 1, depending on homologs in the database

Question 38

Q

How is the database size denoted?

What effect does it have on the E-value and why?

Answer

A

n

if n increases, the E-value increases (worse E-value)
–> a sequence hit would get a better E-value when present in a smaller database

Why: large databases increase the chance of false positive hits, the E-value corrects for the higher chance
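A small numerical sketch of this effect, using E = mn(2^{-S’}). The bit score, query length and database sizes are made-up illustration values.

```python
def evalue_from_bits(bits, m, n):
    """E-value from a bit score, query length m and database size n."""
    return m * n * 2.0 ** (-bits)


small_db = evalue_from_bits(40, m=100, n=10**9)
large_db = evalue_from_bits(40, m=100, n=2 * 10**9)
print(small_db, large_db, large_db / small_db)  # ratio is 2.0
```

The same hit with the same bit score gets twice the E-value in a database twice as large, because E is linear in n.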

Question 39

Q

How is the sequence length denoted?

What effect does it have on the E-value?

What happens with very small sequence lengths?

Answer

A

m

according to equation for E, if m increases, E increases

However, short sequences can only reach low scores, so even a short identical match may have a high E-value and may be regarded as a “false positive” hit.

This is often seen if one searches for short primer regions, small domain regions etc

Question 40

Q

What are low-complexity regions?

Answer

A

(LCRs)

alignment statistics require that symbols occur randomly in strings

long substrings of one or a few symbols violate this assumption

Question 41

Q

When / why should LCRs be filtered?

Answer

A

in homology search

to improve sensitivity and specificity and to avoid misinterpretation / artifacts

Question 42

Q

How does BLAST deal with LCRs?

Answer

A

Masking low-complexity regions:

sequences are pre-processed to identify LCRs.

LCR are masked –>
* they do not contribute to alignment or score
* they appear as X’s in the BLAST alignment

tools:
- DUST for DNA,
- SEG for protein sequences

other types of repeats may also be masked by BLAST
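The masking idea can be sketched with a sliding-window Shannon entropy filter. This is a simplified stand-in and not the actual DUST or SEG algorithm. The window size and entropy threshold are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy (bits per symbol) of the symbols in s."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="X"):
    """Replace every window whose entropy is below min_entropy by mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            for j in range(i, i + window):
                masked[j] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACDEFGHIKLMN" + "Q" * 12))
```

The diverse prefix is kept, while the run of Q (and the few residues just before it that fall into low-entropy windows) is replaced by X.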

Question 43

Q

How can you increase sensitivity in BLAST?

Answer

A

Smaller Word Size:
–> requires exact matches of shorter subsequences

Lower T-value (neighborhood word threshold T):
A lower value of T increases the probability of finding weak similarities.
decrease T –> number of neighboring words will increase

Increase E-value threshold
allowing matches with lower significance to be reported. (However, may also increase false-positive matches)

Relaxing Gap Penalties:
extending or opening gaps more easily - allowing for more flexible alignments. (However, maybe more false positives)

Different substitution matrix:
Choosing a less specific scoring matrix (e.g BLOSUM with lower number) –> better reflecting evolutionary divergence between sequences.
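As a sketch of how such settings might be passed to the BLAST+ command line, the helper below only assembles an argument list for `blastp` and does not run anything. The file and database names are hypothetical, and the parameter values are illustrative. Which values are accepted depends on the program and matrix, so check the BLAST+ documentation before use.

```python
def build_blastp_command(query, db, word_size=None, threshold=None,
                         evalue=None, matrix=None):
    """Assemble a blastp argument list; options left as None are omitted."""
    cmd = ["blastp", "-query", query, "-db", db]
    options = [
        ("-word_size", word_size),   # smaller: more sensitive
        ("-threshold", threshold),   # neighborhood word threshold T
        ("-evalue", evalue),         # higher: report weaker hits
        ("-matrix", matrix),         # e.g. a lower-numbered BLOSUM
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


print(build_blastp_command("query.fasta", "my_db",
                           word_size=2, evalue=10, matrix="BLOSUM45"))
```

Gap penalties are left out on purpose, because only certain open/extend combinations are supported for each matrix.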

Question 44

Q

How can you increase specificity in BLAST?

Answer

A

Larger Word Size:
requires longer matching words for seeding, so fewer chance hits are extended (However, may reduce sensitivity)

Reduce E-value threshold:
filtering out matches with lower significance. (may reduce sensitivity)

Scoring Matrix: Choosing a more specific scoring matrix (eg BLOSUM with higher number)

Gap Penalties:
extending or opening gaps less easily - stricter alignments

Question 45

Q

How can you increase the speed of a BLAST search?

Answer

A

Decrease the sensitivity with the following parameters:

Word size: larger
E-value threshold: lower
Database size: smaller, or prefilter
Search space size: increase neighborhood word threshold T –> number of neighboring words will drop and thus limit the search space

Question 46

Q

EXAM QUESTION

4 parameters to increase sensitivity of BLAST (2022)

Default BLAST parameters are a good compromise between speed and sensitivity. List 4 parameters which you can change in a BLAST search in order to increase sensitivity. (2019)

4 BLAST parameters, how to change them to increase sensitivity (or specificity) (2020)

Answer

A

Word size

T- value

E-value threshold

Gap Penalties:

Different substitution matrix:

Word size
increase sensitivity: smaller
increase specificity: larger
increase speed: larger

T- value
increase sensitivity: lower
increase specificity: higher
increase speed: higher

Gap Penalties:
increase sensitivity: lower
increase specificity: higher
increase speed: higher

Different substitution matrix:
increase sensitivity: eg BLOSUM with lower number
increase specificity: BLOSUM with higher number
increase speed: BLOSUM with higher number

E-value threshold
increase sensitivity: higher
increase specificity: lower
increase speed: lower

Question 47

Q

What effect does word size have on sensitivity / specificity / speed?

Answer

A

Word size
increase sensitivity: smaller
increase specificity: larger
increase speed: larger

Question 48

Q

What effect does T-value have on sensitivity / specificity / speed?

Answer

A

T- value
increase sensitivity: lower
increase specificity: higher
increase speed: higher

Question 49

Q

What effect does the E-value threshold have on sensitivity / specificity / speed?

Answer

A

E-value threshold
increase sensitivity: higher
increase specificity: lower
increase speed: lower

Question 50

Q

What effect do gap penalties have on sensitivity / specificity / speed?

Answer

A

Gap Penalties:
increase sensitivity: lower
increase specificity: higher
increase speed: higher

Question 51

Q

What effect do different substitution matrices have on sensitivity / specificity / speed?

Answer

A

Different substitution matrix:
increase sensitivity: eg BLOSUM with lower number
increase specificity: BLOSUM with higher number
increase speed: BLOSUM with higher number

Question 52

Q

What can you say about statistical vs. biological significance of BLAST results?