3 - FASTA, BLAST, sequence similarity statistics and homology detection Flashcards

Question 1

Q

How are FASTA and BLAST different from NW and SW methods?

Answer

A

NW and SW are good for pairwise alignment of two sequences of interest.

But they are far too slow for the millions of pairwise comparisons needed to search a database.

FASTA and BLAST are heuristic methods that make such searches feasible, though they are not guaranteed to find the optimal alignment (that is what makes them heuristics).

Both FASTA and BLAST are nonetheless loosely based on the Smith-Waterman algorithm.
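
To see why exhaustive alignment is slow, here is a minimal sketch of the Smith-Waterman score calculation. The scoring values (match, mismatch, gap) are illustrative assumptions, and a linear gap penalty is used for brevity. Every cell of a len(a) x len(b) table is filled, so the cost grows with the product of the two lengths, and that cost is paid again for every database sequence.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score by full dynamic programming.

    Scoring values are illustrative assumptions, not a real matrix.
    Time is O(len(a) * len(b)); only two rows are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, `smith_waterman_score("TTACGTT", "GGACGGG")` scores only the shared `ACG` core, which is the behaviour the heuristics try to approximate cheaply.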

Question 2

Q

What is the nr database?

Answer

A

The NCBI non-redundant database (non-curated / largest)

Question 3

Q

Give the steps of the FASTA algorithm

Answer

A

Break the query into words of length ktup
For each database sequence entry, search for identical words
Rescan the top 10 regions with BLOSUM or PAM
Trim the ends to include only residues contributing to the highest score. Attempt to join compatible ungapped alignments, applying a gap penalty (initn score).
If the score is > threshold, perform an SW alignment to get the optimized (opt) score
Normalize the opt scores for sequence length (longer sequences tend to have higher opt scores by chance). This gives a Z score
Evaluate statistical significance using the extreme value distribution (E-value)
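
The first two steps (word lookup and finding regions rich in identical words) can be sketched roughly as below. This is a simplified illustration, not the FASTA program's actual code: the function names `ktup_index` and `diagonal_hits` are hypothetical, and it only counts word identities per diagonal without the later rescoring, joining, or SW steps.

```python
from collections import Counter, defaultdict


def ktup_index(query, ktup):
    """Map each word of length ktup to its start positions in the query."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, subject, ktup=2):
    """Count identical words on each diagonal.

    A diagonal is (subject offset - query offset); many hits on one
    diagonal suggest an ungapped region worth rescoring.
    """
    index = ktup_index(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits
```

Calling `diagonal_hits(query, subject).most_common(10)` would give the ten densest diagonals, loosely mirroring the "top 10 regions" that are rescanned with a substitution matrix.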

Question 4

Q

List five types of BLAST

Answer

A

BLASTN
BLASTP
BLASTX (translated nucleotide query against a protein database; tries all 6 reading frames)
TBLASTN (protein query against a nucleotide database translated into protein, for identifying transcripts)
TBLASTX (translated nucleotide query against a translated nucleotide database; used for cross-species gene prediction at the genome or transcript level, searching for genes missed by traditional methods or not yet in a protein database)
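
The "6 reading frames" used by the translated searches are three offsets on each strand. A minimal sketch, assuming the standard genetic code and an uppercase A/C/G/T input (the helper names are hypothetical, and real BLAST handles ambiguity codes and alternative codon tables that this does not):

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINOS)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; trailing 1-2 bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings would then be searched as if it were an ordinary protein query (BLASTX) or database entry (TBLASTN, TBLASTX).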

Question 5

Q

What does E mean in BLAST/FASTA?

Answer

A

The expected number of alignments with a score equal to or greater than that of the alignment you’ve found, occurring by chance in a database search of that size.

Question 6

Q

Give the steps of BLAST

Answer

A

Seeding
- Finds matches of ungapped strings
- Breaks the query into triplet words
- Finds substitutions of each word that have similarity scores >= T (threshold)
- Stores the set of similar words
- Searches database sequences for matching words
Extension
- Tries to extend word matches forwards and backwards to make longer matches, scoring with BLOSUM62 (or another matrix)
- Alignment is trimmed back to the maximum score
Evaluation
- If the score is > S (another threshold), this high-scoring segment pair (HSP) is saved
- Attempts to combine consistent HSPs and calculates a probability (P(N)) for consistent matches
- Calculates significance (E-values) for HSPs using Karlin-Altschul statistics (extreme value distribution etc.)
- Reports significant HSPs
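
The seeding and extension steps can be sketched as below. Several things here are assumptions made for illustration: `toy_score` stands in for a real BLOSUM62 lookup, the `drop_off` cut-off for abandoning an extension is not described in the card, and only rightward extension is shown (leftward works the same way in mirror image).

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(x, y):
    """Stand-in for a BLOSUM62 lookup (assumption): +4 identity, -1 otherwise."""
    return 4 if x == y else -1


def neighbourhood(word, threshold, score=toy_score):
    """All words of the same length scoring >= threshold against `word`."""
    return {
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    }


def extend_right(query, subject, qi, si, drop_off=5, score=toy_score):
    """Ungapped extension to the right from positions qi / si.

    Returns (best score, length at which it was reached), i.e. the
    extension is trimmed back to its maximum score. Extension stops
    once the running score falls more than drop_off below the best.
    """
    total = best = best_len = 0
    for k in range(min(len(query) - qi, len(subject) - si)):
        total += score(query[qi + k], subject[si + k])
        if total > best:
            best, best_len = total, k + 1
        elif best - total > drop_off:
            break
    return best, best_len
```

With a real substitution matrix the neighbourhood would contain conservative substitutions rather than every single-residue change, which is what lets BLAST seed on similar, not just identical, words.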

Question 7

Q

How do you tell if an alignment is significant?

Answer

A

Searching a database

Use the distribution of similarity scores amongst non-hit alignments
Theory for distribution of ungapped alignment score (Karlin and Altschul statistics)

Single pairwise alignments
- Monte Carlo test (e.g. realigning against shuffled versions of one sequence) to calculate a p-value for the real score S
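
A Monte Carlo test for a single pairwise alignment can be sketched as follows, assuming shuffling one sequence is an acceptable null model and using a toy local alignment score (match, mismatch, and gap values are illustrative assumptions). The function names are hypothetical.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman score with a linear gap penalty (assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    curr[j - 1] + gap)
            curr.append(s)
            best = max(best, s)
        prev = curr
    return best


def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimate how often a shuffled b scores at least as well as the real b.

    Shuffling keeps composition and length but destroys residue order.
    The +1 terms stop the estimate from ever being exactly zero.
    """
    rng = random.Random(seed)
    real = local_score(a, b)
    letters = list(b)
    as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_score(a, "".join(letters)) >= real:
            as_good += 1
    return real, (as_good + 1) / (trials + 1)
```

The smallest p-value this can report is 1 / (trials + 1), so more shuffles are needed to resolve very significant scores.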

Question 8

Q

What is the extreme value distribution?

Answer

A

Allows calculation of the probability of a normalized similarity score for ungapped alignments
Gives a P-value and E-value (expect value) for a given database

The extreme value distribution describes the largest value in each set of N independent and identically distributed samples drawn from a population.
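
A small simulation can make this concrete. The sketch below assumes exponentially distributed samples (an illustrative choice, not something the card specifies): the maximum of N such samples is approximately Gumbel (type I extreme value) distributed with location ln(N) and scale 1, so the simulated tail can be compared with the formula.

```python
import math
import random


def sample_maxima(n_samples, repeats, seed=0):
    """Draw `repeats` maxima, each the largest of n_samples Exp(1) values."""
    rng = random.Random(seed)
    return [
        max(rng.expovariate(1.0) for _ in range(n_samples))
        for _ in range(repeats)
    ]


def gumbel_tail(x, location):
    """P(max >= x) for a Gumbel distribution with scale 1."""
    return 1.0 - math.exp(-math.exp(-(x - location)))


def empirical_tail(values, x):
    """Fraction of observed values that are >= x."""
    return sum(v >= x for v in values) / len(values)
```

Comparing `empirical_tail(sample_maxima(200, 2000), math.log(200) + 1)` with `gumbel_tail(math.log(200) + 1, math.log(200))` should give similar numbers, which is the sense in which best-of-many alignment scores follow an extreme value distribution rather than a normal one.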

Question 9

Q

In FASTA and SW, what types of score distribution do you get in the output?

Answer

A

- Expected (E-value)
- Observed (p-value)

Question 10

Q

Give the formula for an expect value (E)

Answer

A

E = k·m·n·e^(-λs)

k: constant
m: query sequence length
n: database sequence length
λ: scaling constant
s: score
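
As a worked illustration of the formula, the sketch below computes E and converts it to a P-value with P = 1 - e^(-E), the usual Karlin-Altschul relation. The parameter values in the usage note are placeholders chosen for illustration, not values taken from the card.

```python
import math


def expect_value(score, m, n, k, lam):
    """E = k * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e)
```

For instance, `expect_value(score=60, m=250, n=5_000_000, k=0.134, lam=0.318)` uses assumed k and λ of roughly the size quoted for ungapped BLOSUM62 scoring. Note two properties of the formula: E doubles when the database length n doubles, and it halves each time the score rises by ln(2)/λ. For small E, `p_value(E)` is almost equal to E.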

(10 cards)