L3 Flashcards
Retrieval of biological sequences in databases is based on what?
Similarity
Searching biological sequence databases involves?
Submitting a query sequence and performing a pairwise comparison of the query with every individual sequence in the database
Requirements for implementing algorithms for sequence database searching include
- sensitivity
- selectivity
- speed
Sensitivity
Refers to the ability to find as many correct hits as possible. The correct hits are considered “true positives.”
Selectivity
Also called specificity; refers to the ability to exclude incorrect hits. These incorrect hits are considered “false positives.”
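Worked example: sensitivity and selectivity can be expressed as simple ratios of hit counts. The sketch below assumes the standard definitions (true positives over all real relationships, true negatives over all unrelated sequences); the counts are hypothetical and not taken from any real search.

```python
def sensitivity(tp, fn):
    """Fraction of truly related sequences that the search reports."""
    return tp / (tp + fn)


def selectivity(tn, fp):
    """Fraction of unrelated sequences that the search correctly excludes."""
    return tn / (tn + fp)


# Hypothetical counts for one database search (illustration only)
tp, fn = 90, 10    # related sequences: found vs. missed
tn, fp = 980, 20   # unrelated sequences: excluded vs. wrongly reported

print(sensitivity(tp, fn))  # 0.9
print(selectivity(tn, fp))  # 0.98
```

Loosening the reporting threshold typically moves sequences from "missed" to "found" but also from "excluded" to "wrongly reported", which is the trade-off the cards describe.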
Speed
The time it takes to get results from database searches
An increase in sensitivity leads to
a decrease in selectivity
An increase in speed leads to
a decrease in sensitivity and selectivity
What are the types of algorithms in database searching?
- exhaustive
- heuristic
Exhaustive algorithm
makes use of a rigorous algorithm to find the best or exact solution for a particular problem by examining all mathematical combinations
Heuristic algorithm
a computational strategy to find a near-optimal solution
How do heuristic algorithms take shortcuts?
by reducing the search space according to some criteria
What are the methods used to infer sequence similarity?
Global and Local alignment
Local alignment
Finds domains and short regions of similarity between a pair of sequences, e.g.
- looking for domains within proteins
- looking for regions of genomic DNA that contain introns
Global alignment
Finds the optimal alignment over the entire length of the two sequences under comparison, e.g.
- aligning genes whose sequences are of comparable length
- the entire gene is homologous
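Worked example: the difference between global and local alignment can be sketched with one dynamic-programming routine. This is a minimal illustration assuming Needleman-Wunsch style scoring for the global case and Smith-Waterman style scoring for the local case; the function name and the match/mismatch/gap values are arbitrary choices for illustration, not from the lecture.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the best alignment score of a and b by dynamic programming.

    local=False: global scoring, both sequences aligned end to end.
    local=True:  local scoring, best-scoring pair of sub-regions only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never go below 0
                score = max(score, 0)
                best = max(best, score)
            H[i][j] = score
    return best if local else H[-1][-1]


# A shared "ACGT" region embedded in otherwise dissimilar sequences
x, y = "GGGACGTGGG", "TTTACGTTTT"
print(alignment_score(x, y, local=True))   # 4: only the shared region is scored
print(alignment_score(x, y, local=False))  # lower: the flanking mismatches count too
```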
What does BLAST stand for?
Basic Local Alignment Search Tool
How does BLAST work
It uses heuristics to align a query sequence with all sequences in a database. Its objective is to find high-scoring segments among related sequences.
How does BLAST perform sequence alignment
- Read in the query sequence
- Create a list of words from the query sequence (seeding): 3 residues for protein, 11 for DNA sequences
- Search a sequence database for occurrences of these words
- Score the word matches with a given substitution matrix
- Extend the matched words into a pairwise alignment
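Worked example: the steps above can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it assumes exact word matches only (no substitution-matrix scoring of neighbouring words), a flat match/mismatch score, and a simple drop-off rule to stop extension. All function names and parameter values are hypothetical.

```python
from collections import defaultdict


def query_words(query, w):
    """Seeding: map each length-w word in the query to its start positions."""
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    return words


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word occurs in subject."""
    words = query_words(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, i, j, w, match=1, mismatch=-1, drop=3):
    """Extend a seed in both directions without gaps.

    Extension in a direction stops once the running score falls `drop`
    below the best score seen. Returns (score, query_start, subject_start, length).
    """
    def walk(qi, sj, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + w, j + w, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    total = w * match + left_score + right_score
    return total, i - left_len, j - left_len, w + left_len + right_len


query, subject = "ACGTTGCA", "GGACGTTGCAGG"
seeds = find_seeds(query, subject, w=4)
print(seeds[0])                                          # (0, 2)
print(extend_ungapped(query, subject, *seeds[0], w=4))   # (8, 0, 2, 8)
```

The word size of 4 is used only to keep the toy example small; the cards give 3 for protein and 11 for DNA.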
The resulting contiguous aligned segment pair without gaps is called what?
high-scoring segment pair
Database search programs such as BLAST use
scoring/substitution matrices
Scoring matrices are what
empirical weighting schemes
Possible identities and substitutions are assigned a score based on the?
observed frequencies of such occurrences in alignments of related proteins
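Worked example: one common way to turn observed frequencies into a score is a log-odds ratio, comparing how often a pair is seen in alignments of related proteins with how often it would be expected by chance. The sketch below assumes that form; the frequencies are hypothetical, and the scale factor of 2 (half-bit units) is an assumption for illustration.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Score = scale * log2(observed / expected), rounded to an integer.

    expected is the chance frequency of the pair, freq_a * freq_b.
    """
    expected = freq_a * freq_b
    return round(scale * math.log2(observed_pair_freq / expected))


# Hypothetical frequencies (illustration only)
print(log_odds_score(0.04, 0.1, 0.1))     # 4: pair seen 4x more often than chance
print(log_odds_score(0.0025, 0.1, 0.1))   # -4: pair seen 4x less often than chance
```

Positive scores mark substitutions that occur more often than chance in related proteins; negative scores mark those that occur less often.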
What does BLASTN do
uses nucleotide sequences as queries to search against a nucleotide sequence database
How does BLASTP work
uses protein sequences as queries to search against a protein sequence database. Default word size is 3
How does BLASTX work
uses nucleotide sequences, translated in all six reading frames, as queries to search against a protein sequence database
How does TBLASTN work
uses protein sequences as queries to search against a nucleotide sequence database whose DNA sequences are translated in all six reading frames
How does TBLASTX work
uses nucleotide sequences, translated in all six reading frames, as queries to search against a nucleotide sequence database whose sequences are also translated in all six frames
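Worked example: the translation step that BLASTX, TBLASTN and TBLASTX rely on can be sketched as a six-frame translation (three frames on each strand). This is a minimal illustration assuming the standard genetic code; the function names are hypothetical and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from position 0; a trailing partial codon is dropped,
    and codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the protein translation for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each translated frame can then be compared at the protein level, which is why translated searches can pick up similarity that is hard to see at the nucleotide level.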
What is BLAST used for?
- to detect similarity between sequences of interest.
- to determine whether there are other plausible alignments between query and target sequences
What is the BLAST E-value
it indicates how likely a given sequence match is to have arisen purely by chance (formally, the number of matches with at least that score expected by chance in a database of that size). The lower the E-value, the less likely the database match is a result of random chance.
The significance of HSPs is determined by BLAST using the Karlin-Altschul equation
$E = K m N e^{-\lambda S}$, where S is the alignment score
E stands for
the expectation value
K and λ (lambda) are what?
Karlin-Altschul constants
m stands for
the number of letters (amino acids/nucleotides) in the query
N stands for
the total number of letters (amino acids/nucleotides) in the database
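Worked example: the Karlin-Altschul equation is a one-line calculation once the constants are known. In the sketch below the values of K and λ, the score, and the sequence lengths are all hypothetical, chosen only to show how the terms interact; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(score, m, n, K, lam):
    """E = K * m * N * exp(-lambda * S).

    m: letters in the query, n: total letters in the database,
    K and lam: Karlin-Altschul constants, score: alignment score S.
    """
    return K * m * n * math.exp(-lam * score)


# Hypothetical inputs (illustration only)
K, lam = 0.041, 0.267
m, n = 250, 50_000_000

for s in (40, 80, 160):
    print(s, expect_value(s, m, n, K, lam))
```

E grows linearly with the query and database sizes but falls exponentially with the score, so doubling the database doubles E while a modest increase in score shrinks it sharply.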
If E < 1e-50 (or 1 × 10^-50),
there should be an extremely high confidence that the database match is a result of homologous relationships.
If E is between 1e-50 and 0.01,
the match can be considered a result of homology
If E is between 0.01 and 10,
the match is considered not significant, but may hint at a tentative remote homology relationship.
If E > 10,
the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection with the current method.
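Worked example: the four E-value ranges above can be written as a small lookup. The cards do not say which side of each boundary is inclusive, so the comparisons below are an assumption, and the function name is hypothetical.

```python
def interpret_e_value(e):
    """Map an E-value to the rule-of-thumb interpretation from the cards."""
    if e < 1e-50:
        return "extremely high confidence of homology"
    if e <= 0.01:
        return "homology"
    if e <= 10:
        return "not significant; possible remote homology"
    return "unrelated, or too distant to detect with this method"


for e in (1e-80, 1e-5, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```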