3 - FASTA, BLAST, sequence similarity statistics and homology detection Flashcards
How are FASTA and BLAST different from NW and SW methods?
NW (Needleman-Wunsch) and SW (Smith-Waterman) are good for doing a pairwise alignment between two sequences of interest.
But they are far too slow for the millions of pairwise alignments needed to search a database.
FASTA and BLAST are heuristic methods that can do this, though they are not guaranteed to find the optimal alignment (the definition of a heuristic).
Both FASTA and BLAST are loosely based on the Smith-Waterman algorithm.
What is the nr database?
Non-redundant database (non-curated / largest)
Give the steps of the FASTA algorithm
- Break the query into words of length ktup
- For each database sequence entry, search for identical words
- Rescan the top 10 regions with BLOSUM or PAM (init1 score)
- Trim ends to include only residues contributing to the highest score. Attempt to join together compatible ungapped alignments with a gap penalty (initn score).
- If the score is > threshold, perform an SW alignment to get the optimized score (opt)
- Normalize opt scores for sequence length (longer sequences tend to have higher opt scores, so this must be corrected for). You get a Z score
- Evaluate statistical significance using the extreme value distribution (E-value)
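The first two FASTA steps (word lookup and finding the regions with the most word identities) can be sketched as below. This is a minimal illustration, not the FASTA program itself: the function names are hypothetical, and the rescoring, joining, SW and statistics steps are left out. Word hits that lie on the same diagonal (subject offset minus query offset) belong to the same ungapped region.

```python
from collections import defaultdict

def ktup_index(seq, ktup):
    """Map each word of length ktup to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index

def diagonal_hits(query, subject, ktup=2):
    """Count identical words on each diagonal (subject offset - query offset)."""
    index = ktup_index(query, ktup)
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        # .get avoids creating empty entries for words absent from the query
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)

def top_diagonals(query, subject, ktup=2, keep=10):
    """Return the `keep` diagonals with the most word identities."""
    hits = diagonal_hits(query, subject, ktup)
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)[:keep]

if __name__ == "__main__":
    for diagonal, count in top_diagonals("ACDEFG", "XXACDEFGXX"):
        print(f"diagonal {diagonal}: {count} word identities")
```

A larger ktup gives fewer, more specific hits (faster, less sensitive); a smaller ktup does the opposite.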
List five types of BLAST
- BLASTN
- BLASTP
- BLASTX (translated nucleotide query against a protein database; tries all 6 reading frames)
- TBLASTN (protein query against a nucleotide database translated into protein, for identifying transcripts)
- TBLASTX (translated nucleotide query against a translated nucleotide database; cross-species gene prediction at the genome or transcript level, searching for genes missed by traditional methods or not yet in a protein database)
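The "6 reading frames" used by the translated searches (BLASTX, TBLASTN, TBLASTX) are three offsets on the forward strand plus three on the reverse complement. A small sketch, assuming the standard genetic code and using hypothetical function names; real BLAST handles ambiguity codes and alternative codes, which this does not:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Return the six conceptual translations keyed as +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```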
What does E mean in BLAST/FASTA?
The expected number of alignments with a score equal to or greater than the one you've found that would occur by chance in a database of that size.
Give the steps of BLAST
Seeding
- Find matches of ungapped strings
- Break the query sequence into triplet words
- For each word, find the similar words whose similarity score is >= T (threshold)
- Store the set of similar words
- Search the database sequences for matching words
Extension
- Try to extend word matches forwards and backwards to make longer matches using BLOSUM62 (or another matrix)
- Trim the alignment back to its maximum score
Evaluation
- If the score is > S (another threshold), this high-scoring segment pair (HSP) is saved
- Attempt to combine consistent HSPs and calculate a probability (P(N)) for consistent matches
- Report significant HSPs
- Calculate significance (E-values) for HSPs using Karlin-Altschul statistics (extreme value distribution etc.)
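The seeding and extension stages can be sketched on a toy scale as below. Everything here is an assumption made for illustration: the scoring function is a crude stand-in for BLOSUM62 (identity +5, same made-up residue group +2, otherwise -2), the thresholds and the X-drop cut-off are arbitrary, and the function names are hypothetical. Combining HSPs and computing E-values are not covered.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy similarity groups standing in for a real matrix such as BLOSUM62.
GROUPS = ["ILVM", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]
GROUP_OF = {aa: g for g, members in enumerate(GROUPS) for aa in members}

def score(a, b):
    """Toy substitution score (assumption, not BLOSUM62)."""
    if a == b:
        return 5
    return 2 if GROUP_OF[a] == GROUP_OF[b] else -2

def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold (T) against word."""
    return {
        "".join(cand)
        for cand in product(AMINO_ACIDS, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    }

def seeds(query, subject, w=3, threshold=11):
    """Yield (query_pos, subject_pos) where the subject holds a similar word."""
    lookup = {}
    for i in range(len(query) - w + 1):
        for word in neighborhood(query[i:i + w], threshold):
            lookup.setdefault(word, []).append(i)
    for j in range(len(subject) - w + 1):
        for i in lookup.get(subject[j:j + w], ()):
            yield i, j

def _extend_one_way(query, subject, qi, sj, step, x_drop):
    """Best ungapped gain and its length in one direction, trimmed to the max."""
    best_gain = gain = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += score(query[qi], subject[sj])
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score has dropped too far below the best seen
        qi += step
        sj += step
    return best_gain, best_len

def extend(query, subject, i, j, w=3, x_drop=10):
    """Extend a seed both ways; return (score, query_start, subject_start, length)."""
    seed_score = sum(score(query[i + k], subject[j + k]) for k in range(w))
    right_gain, right_len = _extend_one_way(query, subject, i + w, j + w, 1, x_drop)
    left_gain, left_len = _extend_one_way(query, subject, i - 1, j - 1, -1, x_drop)
    total = seed_score + left_gain + right_gain
    return total, i - left_len, j - left_len, left_len + w + right_len

def high_scoring_pairs(query, subject, s_cutoff=20, w=3, threshold=11):
    """Keep extended seeds whose score exceeds the cut-off S, without duplicates."""
    hsps = set()
    for i, j in seeds(query, subject, w, threshold):
        hsp = extend(query, subject, i, j, w)
        if hsp[0] > s_cutoff:
            hsps.add(hsp)
    return sorted(hsps, reverse=True)

if __name__ == "__main__":
    print(high_scoring_pairs("MKVLAAGI", "PPMKILAAGIPP"))
```

Note that the word MKI in the subject seeds against MKV in the query even though they are not identical; this is the point of using similar words scoring >= T rather than exact matches only.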
How do you tell if an alignment is significant?
Searching a database
- Use the distribution of similarity scores amongst non-hit alignments
- Theory for distribution of ungapped alignment score (Karlin and Altschul statistics)
Single pairwise alignments
- Monte Carlo test: shuffle one of the sequences many times, realign each time, and calculate a p-value for the real score S from the shuffled scores
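A Monte Carlo test for a single pairwise alignment can be sketched as follows. The scoring scheme (match +2, mismatch -1, linear gap -2), the number of shuffles and the function names are all assumptions chosen to keep the example short; a real test would use a substitution matrix and affine gaps.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

def shuffle_p_value(a, b, trials=200, seed=0):
    """Estimate how often a shuffled b scores at least as well as the real b."""
    rng = random.Random(seed)
    real = local_score(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition and length, destroys order
        if local_score(a, "".join(letters)) >= real:
            at_least_as_good += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return real, (at_least_as_good + 1) / (trials + 1)

if __name__ == "__main__":
    seq = "MKVLAAGIVGLPNVGKSTLFNALT"
    real, p = shuffle_p_value(seq, seq)
    print(f"real score {real}, estimated p-value {p:.4f}")
```

The smallest p-value this can report is 1 / (trials + 1), so more shuffles are needed to resolve very small p-values.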
What is the extreme value distribution?
- Allows calculation of the probability of a normalized similarity score for ungapped alignments
- Gives a P-value and E-value (expect value) for a given database
The extreme value distribution is the distribution of the largest value among N independent and identically distributed samples drawn from a population.
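To see the "largest of N samples" idea in action, the toy simulation below draws N values, keeps the maximum, and repeats. It uses exponentially distributed values as a stand-in (an assumption made because their maxima converge quickly to the Gumbel, or type I extreme value, form); these are not real alignment scores, and the function names are hypothetical.

```python
import math
import random

def simulate_maxima(n_samples, trials, seed=0):
    """Largest of n_samples Exp(1) draws, repeated `trials` times."""
    rng = random.Random(seed)
    return [
        max(rng.expovariate(1.0) for _ in range(n_samples))
        for _ in range(trials)
    ]

def gumbel_tail(x, mu, beta=1.0):
    """P(max >= x) under a Gumbel distribution with location mu, scale beta."""
    return 1.0 - math.exp(-math.exp(-(x - mu) / beta))

if __name__ == "__main__":
    n = 200
    maxima = simulate_maxima(n_samples=n, trials=2000)
    mu = math.log(n)  # location of the limiting Gumbel for Exp(1) maxima
    x = mu + 2.0
    observed = sum(m >= x for m in maxima) / len(maxima)
    print(f"observed tail {observed:.3f}, Gumbel prediction {gumbel_tail(x, mu):.3f}")
```

The location grows with log N, which is why searching a larger database raises the score needed for significance.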
In FASTA and SW, what types of score distributions do you get in the output?
- Expected (E-value)
- Observed (p-value)
Give the formula for an expect value (E)
E = k · m · n · e^(-λs)
k: constant
m: query sequence length
n: database sequence length (total length of the sequences searched)
λ: scaling constant
s: alignment score
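A quick numerical sketch of the formula, using the same symbols. The values of k and λ below are illustrative assumptions only (the real ones depend on the scoring matrix and gap penalties), and the function names are hypothetical. Two standard companions are included: the P-value, P = 1 - e^(-E), and the bit score, (λs - ln k) / ln 2, which lets E be written as m · n · 2^(-bit score).

```python
import math

def expect_value(s, m, n, k, lam):
    """E = k * m * n * exp(-lambda * s)."""
    return k * m * n * math.exp(-lam * s)

def p_value_from_e(e):
    """Probability of at least one chance hit with that score: 1 - exp(-E)."""
    return 1.0 - math.exp(-e)

def bit_score(s, k, lam):
    """Normalized score that folds k and lambda into the score itself."""
    return (lam * s - math.log(k)) / math.log(2)

if __name__ == "__main__":
    k, lam = 0.13, 0.32          # illustrative values (assumption)
    m, n = 250, 100_000_000      # query length, total database length
    for s in (40, 60, 80):
        e = expect_value(s, m, n, k, lam)
        print(f"s={s}: E={e:.3g}, P={p_value_from_e(e):.3g}, "
              f"bits={bit_score(s, k, lam):.1f}")
```

E falls exponentially as the score rises and grows linearly with the search space m · n, so the same score is less significant in a larger database.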