8. BLAST Flashcards

1
Q

What pairwise alignment approaches are there?

A

dynamic programming (local, global)

heuristics

2
Q

Dynamic programming approaches for pairwise alignment:

What do they guarantee?

Are they feasible?

A

guarantee the optimal alignment for two sequences

Work for small applications, but not feasible for big datasets - too slow or memory intensive for many applications

However, part of many approaches/programs!

3
Q

What are two heuristics for alignment that are based on dynamic programming?

A

BLAST, FASTA.

4
Q

BLAST is a heuristic.

What are two dynamic programming algorithms which BLAST takes some inspiration from?

Can they be used for genome-scale alignments or big datasets?

A

1st DP for sequence analysis: Needleman-Wunsch (global alignments)

Then: Smith-Waterman (local alignments).

too computationally intensive for big datasets

Heuristics make alignments practical for big databases/datasets, eg:
- NCBI
- from Next-Generation Sequencing (NGS).
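
A minimal sketch of the Smith-Waterman recurrence, to show why DP gets expensive: every cell of an (m+1) x (n+1) table is filled, so time grows with the product of the two sequence lengths. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are illustrative assumptions, not taken from the cards.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous DP row
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only two rows are kept in memory here because the sketch returns the score alone; recovering the alignment itself would need the full table or a traceback structure.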

5
Q

Heuristic approaches for pairwise alignment (vs DP):

Basic idea to speed up?

A

exclude unpromising regions to speed up the search

6
Q

Heuristic approaches for pairwise alignment (vs DP):

Name 4 approaches to allow speeding up?

Name 3 Applications?

A

Approaches:
- suffix trees
- dotplots
- pre-processing
- using substitution matrices, etc

Applications:
- lots of short NGS reads & reference: read mapping
- very long sequences: genome alignment
- lots of (gene) sequences: database search

7
Q

MUMmer: genome alignment

stands for?

A

MUM = Maximal Unique Match. Between strains, MUMs typically appear as a series of matches in the same order, apart from rearrangements such as translocations.

8
Q

MUMmer: genome alignment

goal?

data structure used?

A

goal:
globally compare two (similar) genomes

data structure: suffix tree

9
Q

MUMmer: genome alignment

use/aim?

approach?

speed?

A

used for comparing different genome assemblies to one another, which allows scientists to determine how a genome has changed

approach:
* construct a suffix tree of a reference sequence
* stream a query sequence against the suffix tree
* identify short exact matches between the two sequences
- use these directly
- extend to longer inexact alignments
* very fast!
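
The "index the reference, stream the query, report exact matches" idea can be sketched as follows. Note the substitution: this uses a plain k-mer dictionary instead of a suffix tree, and it reports all maximal exact matches without checking uniqueness, so it is not MUMmer's actual algorithm. The function name and `k=4` are assumptions for illustration.

```python
from collections import defaultdict


def exact_matches(reference, query, k=4):
    """Return maximal exact matches as (ref_start, query_start, length)."""
    # Pre-process the reference once: k-mer -> list of start positions.
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)

    matches = set()
    # Stream the query against the index.
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            s_i, s_j = i, j
            # Extend the exact match to the left.
            while s_i > 0 and s_j > 0 and reference[s_i - 1] == query[s_j - 1]:
                s_i -= 1
                s_j -= 1
            e_i, e_j = i + k, j + k
            # Extend the exact match to the right.
            while (e_i < len(reference) and e_j < len(query)
                   and reference[e_i] == query[e_j]):
                e_i += 1
                e_j += 1
            matches.add((s_i, s_j, e_i - s_i))
    return sorted(matches)


print(exact_matches("GATTACAGGC", "TTTACAGAA"))
```

A suffix tree gives the same kind of lookup for matches of any length without fixing k in advance, which is one reason a tool for whole genomes would prefer it.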

10
Q

Sequence alignment starting with:

one sequence vs many sequences

we want to?

A

database search - find database sequences that are similar (homologous) to the query sequence:

  • identify similar (regions between) sequences
  • collect related sequences for comparative analysis
  • (taxonomically) identify a sequence (fragment)
11
Q

What are the results of a database search?

A
  • alignment(s) between a query sequence and one or more database sequences
  • ranked by statistical significance (E-value)
12
Q

BLAST

basic steps that you do before BLAST search?

A

You:
* select query sequence
* select database to be searched
* pre-process / format database
* decide which BLAST program to use

* select parameters for search
* execute BLAST search
* interpret biological significance

13
Q

BLAST:

basic steps that BLAST does?

A
  • seeding
  • extension to a good longer alignment
  • evaluation of statistical significance
  • presentation of ranked alignments
14
Q

What BLAST programs are there?

A
QUERY                    / DATABASE                 / PROGRAM
nucleotide               / nucleotide               / blastn
nucleotide (translated)  / nucleotide (translated)  / tblastx
nucleotide (translated)  / peptide                  / blastx
peptide                  / peptide                  / blastp
peptide                  / nucleotide (translated)  / tblastn
15
Q

What BLAST variants are there?

A
  • MEGABLAST: find highly similar DNA sequences
  • PSI-BLAST: find distant members of a protein family or build a custom position-specific score matrix
  • and many more
16
Q

How BLAST works: seeding

What are used as seeds and why?

A

compile short words from query that provide high scores, look for identical matches to these words in the database

the locations of all high scoring word neighbors (word hits) in the db sequences are identified and used as alignment seeds

Why?
- inexact matching is slow, exact matching is fast
- biologically significant matches contain short regions of identities or high-scoring matches

17
Q

How BLAST works: seeding

What counts as a word hit?

A
  • Query is broken down into short overlapping words - default: 11 nt or 3 (5) aa
  • up to 50 high scoring word neighbors with minimum score T are determined for each word in the query
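
A small sketch of how words and their high-scoring neighbors could be enumerated. It assumes a toy scoring function (+2 identical, -1 otherwise) over a 4-letter alphabet rather than BLOSUM62, and the names `query_words`, `neighborhood` and `toy_score` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # toy alphabet; a protein search would use 20 amino acids


def toy_score(a, b):
    """Stand-in for a substitution matrix lookup (assumption)."""
    return 2 if a == b else -1


def query_words(query, w=3):
    """Break the query into overlapping words of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """All words scoring at least `threshold` against `word`, best first."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        score = sum(toy_score(x, y) for x, y in zip(word, cand))
        if score >= threshold:
            hits.append(("".join(cand), score))
    return sorted(hits, key=lambda h: -h[1])


print(query_words("ACGTA"))
print(len(neighborhood("ACG", threshold=3)), len(neighborhood("ACG", threshold=6)))
```

With this scoring, T = 3 admits the word itself plus every single-mismatch variant, while T = 6 admits only the exact word, which mirrors how a lower T enlarges the neighborhood.
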
18
Q

How BLAST works: seeding

eg which scoring matrix?

A

BLOSUM62

19
Q

What is the most time-consuming step of BLAST?

A

extension

20
Q

BLAST:

What is the name of the algorithm for extension?

A

two-hit algorithm (1997)

21
Q

BLAST:

What can be discarded before extension and why?
Which segment pairs are extended?

A
  • most database sequences can be discarded before the extension step (if they don’t have any segment pairs)
  • segment pairs on the same diagonal and within c cells of each other are extended in both directions
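
The "same diagonal, within c cells" test can be sketched like this. Word hits are assumed to be given as (query_pos, db_pos) pairs, the diagonal is taken as db_pos - query_pos, and `window` plays the role of c; the function name and default values are assumptions.

```python
from collections import defaultdict


def two_hit_pairs(hits, window=40, word_size=3):
    """Return (diagonal, first_q, second_q) for hit pairs worth extending."""
    by_diag = defaultdict(list)
    for q_pos, d_pos in hits:
        by_diag[d_pos - q_pos].append(q_pos)  # group hits by diagonal

    pairs = []
    for diag, q_positions in by_diag.items():
        q_positions.sort()
        # Compare each hit with the next hit on the same diagonal.
        for first, second in zip(q_positions, q_positions[1:]):
            dist = second - first
            # Non-overlapping and close enough together.
            if word_size <= dist <= window:
                pairs.append((diag, first, second))
    return sorted(pairs)


hits = [(0, 10), (5, 15), (50, 60), (7, 30)]
print(two_hit_pairs(hits))
```

Database sequences that produce no such pair never reach the extension step, which is where the time saving comes from.
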
22
Q

BLAST:

when does extension stop?

A

extension proceeds until drop-off from highest score is too great:
- X: value of how much the score is allowed to drop off since the last maximum
- if X is reached, the alignment is trimmed back to the point with the maximum score
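
A sketch of the drop-off rule for an ungapped extension to the right. The scoring values, the default X and the function name are illustrative assumptions; here extension stops once the score has fallen more than X below the running maximum.

```python
def extend_right(query, subject, q_start, s_start,
                 x_drop=5, match=2, mismatch=-3):
    """Extend without gaps to the right; return (best_score, best_length)."""
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best:
            best, best_len = score, i      # new maximum: remember where
        elif best - score > x_drop:
            break                          # dropped too far: stop extending
    # Returning best_len trims the alignment back to the maximum.
    return best, best_len


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0))
```

A larger X lets the extension cross a short stretch of mismatches and continue, at the cost of more work per seed.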

23
Q

BLAST: What happens with extended regions once extension has stopped?

A

extended regions are joined to form a gapped alignment, or high-scoring segment pair (HSP)

24
Q

What are HSPs and how are they evaluated?

A

High-scoring segment pair (HSP): extended regions joined to form a gapped alignment

each HSP score is evaluated
- the statistical significance of each HSP is determined
- are two sequence fragments significantly more similar than expected by chance?
- HSPs are ranked by their E(xpect)-value and reported

25
Q

BLAST: Evaluation

Which scores need to be evaluated?

What scores does BLAST compare them to and how are these obtained?

A

each HSP has associated score
- but how good is this score?
- is it significantly better than for random sequences?

–> We need to compare the score of an HSP with scores of random sequences of equal length & composition.

We could:
- evaluate empirically
- evaluate analytically

Empirically not done in practice!

26
Q

In Evaluation of an alignment, BLAST needs to compare the score of an HSP with scores of random sequences of equal length & composition.

Using what distribution are these scores calculated?

A

derive theoretically
* best scores: follow Gumbel extreme value distribution
* use to compute the probability of obtaining a score equal to or greater than a given score by chance
* Karlin & Altschul, 1990. PNAS

27
Q

For an HSP, how is the raw score calculated?

A

S = ∑sij

Parameters, eg:
- A substitution matrix (eg BLOSUM62)
- A gap opening penalty (eg -12)
- A gap extension penalty (eg -1)

28
Q

Consider an alignment raw score S = ∑sij

What is the formula for the alignment bit score?

A

S’ = [ (λ x S) - ln(K) ] / ln2

29
Q

Consider an alignment raw score S = ∑sij

What is the formula for the E-value using the raw score?

A

E = K m n e^(-λS)

30
Q

Consider an alignment bit score S’ = [ (λ x S) - ln(K) ] / ln2

What is the formula for the E-value using the bit score?

A

E = m n 2^(-S’)
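
A short numeric check of the three score cards, reading the formulas as S’ = (λS - ln K) / ln 2, E = K·m·n·e^(-λS) and E = m·n·2^(-S’). The values of λ and K below are illustrative assumptions; in practice they depend on the substitution matrix and gap penalties.

```python
import math

LAMBDA = 0.267  # illustrative value (assumption)
K = 0.041       # illustrative value (assumption)


def bit_score(raw, lam=LAMBDA, k=K):
    """Normalize a raw score S into a bit score S'."""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value_from_raw(raw, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * e^(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


raw, m, n = 100, 250, 50_000_000
bits = bit_score(raw)
print(bits, e_value_from_raw(raw, m, n), e_value_from_bits(bits, m, n))
```

Both E-value routes agree up to floating-point error, because substituting S’ into 2^(-S’) gives K·e^(-λS). Doubling n (or m) doubles E for the same score.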

31
Q

BLAST Score probabilities

What three scores are there?

A

raw, bit, and E-value

32
Q

BLAST Score probabilities

What is λ?

A

normalizing factor for the scoring system

33
Q

BLAST Score probabilities

What is K?

A

K is a scaling factor for the search space size (m x n) in the E-value formula; like λ, it depends on the scoring system.

(alignments starting at different places in two sequences may be highly correlated)

34
Q

What is the E-Value?

A

Expect value (E)
* under comparable conditions: expected no. of matches by chance with S’ ≥ S’obs
* can be much smaller or greater than 1
* there may be many or no matches with E ≪ 1, depending on homologs in the database

The smaller the E-value, the better the match.

35
Q

What does the E-Value depend on?

A

depends on
- alignment score
- length of query (m)
- size of database (n)

36
Q

What do we need to consider regarding scoring between different databases?

A

The E-Value cannot be compared across searches of different databases

37
Q

What do we know about values E can take?

A

can be much smaller or greater than 1

there may be many or no matches with E ≪ 1, depending on homologs in the database

38
Q

How is the database size denoted?

What effect does it have on the E-value and why?

A

n

if n increases, the E-value increases (worse E-value)
–> a sequence hit would get a better E-value when present in a smaller database

Why: large databases increase the chance of false positive hits, the E-value corrects for the higher chance

39
Q

How is the sequence length denoted?

What effect does it have on the E-value?

What happens with very small sequence lengths?

A

m

according to equation for E, if m increases, E increases

However, a very short sequence can only reach a low alignment score, so even an identical match may have a high E-value and be regarded as a “false positive” hit.

This is often seen if one searches for short primer regions, small domain regions etc

40
Q

What are low-complexity regions?

A

(LCRs)

alignment statistics require that symbols occur randomly in strings

long substrings of one or a few symbols violate this assumption

41
Q

When / why should LCRs be filtered?

A

in homology search

to improve sensitivity and specificity and to avoid misinterpretation / artifacts

42
Q

How does BLAST deal with LCRs?

A

Masking low-complexity regions:

sequences are pre-processed to identify LCRs.

LCRs are masked –>
* they do not contribute to alignment or score
* they appear as X’s in the BLAST alignment

tools:
- DUST for DNA,
- SEG for protein sequences

other types of repeats may also be masked by BLAST
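
To make the masking idea concrete, here is a deliberately simple window filter based on Shannon entropy. It is not DUST or SEG; the window length, the entropy cutoff and the function names are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the symbol composition of a string."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.5, mask_char="X"):
    """Replace every residue inside a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)


protein = "MKTLVAGHWE" + "S" * 12 + "QRDNCYFPIL"
print(mask_low_complexity(protein))
```

Because whole windows are masked, a few residues flanking the repeat are masked as well; real filters refine the boundaries more carefully.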

43
Q

How can you increase sensitivity in BLAST?

A

Smaller Word Size:
–> requires exact matches of shorter subsequences

Lower T-value (neighborhood word threshold T):
A lower value of T increases the probability of finding weak similarities.
decrease T –> number of neighboring words will increase

Increase E-value threshold:
allowing matches with lower significance to be reported. (However, may also increase false-positive matches)

Relaxing Gap Penalties:
extending or opening gaps more easily - allowing for more flexible alignments. (However, maybe more false positives)

Different substitution matrix:
Choosing a less specific scoring matrix (e.g BLOSUM with lower number) –> better reflecting evolutionary divergence between sequences.

44
Q

How can you increase specificity in BLAST?

A

Larger Word Size:
requires exact matches of longer subsequences, so fewer chance seeds (However, may reduce sensitivity)

Reduce E-value threshold:
filtering out matches with lower significance. (may reduce sensitivity)

Scoring Matrix: Choosing a more specific scoring matrix (eg BLOSUM with higher number)

Gap Penalties:
extending or opening gaps less easily - stricter alignments

45
Q

How can you increase the speed of a BLAST search?

A

Decrease the sensitivity with the following parameters:

Word size: larger
E-value threshold: lower
Database size: smaller, or prefilter
Search space size: increase neighborhood word threshold T –> number of neighboring words will drop and thus limit the search space

46
Q

EXAM QUESTION

4 parameters to increase sensitivity of BLAST (2022)

Default BLAST parameters are a good compromise between speed and sensitivity. List 4 parameters which you can change in a BLAST search in order to increase sensitivity. (2019)

4 BLAST parameters, how to change them to increase sensitivity (or specificity) (2020)

A

Word size

T-value

E-value threshold

Gap Penalties:

Different substitution matrix:

Word size
increase sensitivity: smaller
increase specificity: larger
increase speed: larger

T-value
increase sensitivity: lower
increase specificity: higher
increase speed: higher

Gap Penalties:
increase sensitivity: lower
increase specificity: higher
increase speed: higher

Different substitution matrix:
increase sensitivity: eg BLOSUM with lower number
increase specificity: BLOSUM with higher number
increase speed: BLOSUM with higher number

E-value threshold
increase sensitivity: higher
increase specificity: lower
increase speed: lower

47
Q

What effect does word size have on sensitivity / specificity / speed?

A

Word size
increase sensitivity: smaller
increase specificity: larger
increase speed: larger

48
Q

What effect does T-value have on sensitivity / specificity / speed?

A

T-value
increase sensitivity: lower
increase specificity: higher
increase speed: higher

49
Q

What effect does the E-value threshold have on sensitivity / specificity / speed?

A

E-value threshold
increase sensitivity: higher
increase specificity: lower
increase speed: lower

50
Q

What effect do gap penalties have on sensitivity / specificity / speed?

A

Gap Penalties:
increase sensitivity: lower
increase specificity: higher
increase speed: higher

51
Q

What effect do different substitution matrices have on sensitivity / specificity / speed?

A

Different substitution matrix:
increase sensitivity: eg BLOSUM with lower number
increase specificity: BLOSUM with higher number
increase speed: BLOSUM with higher number

52
Q

What can you say about statistical vs. biological significance of BLAST results?

A