Database searching Flashcards
Which two algorithms are often used to compare a query sequence with a database of sequences?
BLAST and FASTA
How does FASTA work?
FASTA works with k-tuples, which are short sequences of k contiguous residues, typically around 6 nucleotides or 2 amino acids.
FASTA finds all k-tuples in the query sequence and stores them together with their location in the sequence; this is called hashing.
All identical k-tuples in the database sequences are identified.
Each matching k-tuple is extended on both sides, scored with a substitution matrix, until the score reaches a certain threshold.
The highest-scoring region is aligned with the query sequence using dynamic programming.
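The first FASTA steps (hashing the query and finding identical k-tuples) can be sketched in Python. This is a simplified illustration, not the real FASTA code: the names `index_ktuples` and `diagonal_hits` are made up for this card, and counting hits per diagonal (database position minus query position) stands in for the scoring and extension steps.

```python
from collections import Counter, defaultdict


def index_ktuples(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, target, k):
    """Count identical k-tuple matches per diagonal (target pos - query pos)."""
    table = index_ktuples(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], []):
            hits[j - i] += 1
    return hits


# Toy sequences, chosen only for illustration.
query = "ACGTACGT"
target = "TTACGTACGTTT"
hits = diagonal_hits(query, target, k=4)
best_diagonal, count = hits.most_common(1)[0]
print(best_diagonal, count)
```

Many hits on the same diagonal suggest a region where the two sequences line up without gaps, which is the kind of region that would then be extended and aligned.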
What is the E-value?
An E-value is the expected number of times the given score would appear in a random database of the given size.
An E-value of 3.0, for example, indicates that by chance alone you would expect to find 3 random sequence alignments that reach that score. An alignment with an E-value of 3.0 therefore suggests that the database sequence is probably not related to the query sequence.
For sequences that are related, the E-value will be very small, usually around 10^-20 or less.
What is the E-value dependent on?
The E-value is dependent on the length of the query sequence and the size of the database.
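This dependence can be made concrete with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n is the database size in residues and S is the alignment score. The sketch below is only illustrative: the function name `expected_hits` is made up, and the default values of `lam` and `k` are placeholder assumptions, since the real values depend on the scoring system.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.318, k=0.134):
    """Karlin-Altschul E-value; lam and k are illustrative placeholders."""
    return k * query_len * db_len * math.exp(-lam * score)


small_db = expected_hits(score=50, query_len=100, db_len=1_000_000)
large_db = expected_hits(score=50, query_len=100, db_len=2_000_000)
print(small_db, large_db)
```

With the score held fixed, doubling the database size doubles the E-value, while a higher score lowers it exponentially.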
What are low-complexity regions and how can they complicate a homology search?
Low-complexity regions are regions with a highly biased amino acid composition. Self-comparison dot plots can sometimes identify these regions.
Such regions can achieve high alignment scores but obscure biologically significant hits.
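One rough way to see what "highly biased composition" means is to measure the Shannon entropy of a sliding window: a window dominated by one residue has low entropy. The sketch below is an assumption-laden illustration, not how any particular filtering program works; the function names, the window size of 12 and the cutoff of 2.2 bits are arbitrary choices.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, size=12, cutoff=2.2):
    """Start positions of windows whose entropy is below the cutoff (arbitrary)."""
    return [
        i
        for i in range(len(seq) - size + 1)
        if shannon_entropy(seq[i:i + size]) < cutoff
    ]


# Toy protein with a run of glutamines in the middle.
protein = "MKVLAT" + "Q" * 12 + "GHWERT"
flagged = low_complexity_windows(protein)
print(flagged)
```

Windows overlapping the glutamine run are flagged, and masking such positions before a search is one way to stop them from producing misleading high-scoring hits.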
What is a suffix tree?
A suffix tree is a way to find and represent a sequence's suffixes. Here a suffix is taken to be the shortest subsequence starting at a particular position that is unique in the complete sequence, so it can be used to identify that position.
How is a suffix tree constructed?
First, all positions in the sequence are grouped according to their base or amino acid type, which fills the nodes of the first row.
Second, each group is regrouped according to the following base, which gives the second row of nodes.
This procedure is continued, stopping for a group when it contains only one sequence position.
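The grouping procedure can be sketched as follows. This is a simplified illustration that records, for each position, the shortest unique subsequence found (the label of its leaf) rather than building linked node objects. The function name `unique_prefixes` is made up, and appending a terminator character `$` is an assumption added here so that positions near the end of the sequence also become unique; the sequence itself must not contain that character.

```python
from collections import defaultdict


def unique_prefixes(seq, terminator="$"):
    """Return {position: shortest subsequence starting there that is unique}."""
    text = seq + terminator  # assumption: terminator does not occur in seq
    result = {}
    # Each work item: (depth, positions that share their first `depth` characters).
    stack = [(0, list(range(len(text))))]
    while stack:
        depth, positions = stack.pop()
        if len(positions) == 1:
            # The group holds a single position, so stop regrouping it.
            pos = positions[0]
            result[pos] = text[pos:pos + depth]
            continue
        # Regroup by the next character after the shared prefix.
        groups = defaultdict(list)
        for pos in positions:
            groups[text[pos + depth]].append(pos)
        for group in groups.values():
            stack.append((depth + 1, group))
    return result


labels = unique_prefixes("ATCAT")
print(labels)
```

For the toy sequence ATCAT, position 2 is identified by C alone, while positions 0 and 3 both start with AT and need a third character to be told apart.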
What is a protein motif?
Protein motifs are small regions of three-dimensional structure or amino acid sequence shared among different proteins with a common biological property. They are recognizable regions of protein structure that may (or may not) be defined by a unique chemical or biological function.
What is hashing?
Hashing is a way to construct a list of the starting position of all k-tuples (as used in FASTA) that occur in a query sequence. Where a particular k-tuple occurs in a sequence can then be found by looking up the list.
How is a list constructed by hashing?
In a nucleotide sequence, each base is assigned a number: A = 0, C = 1, G = 2, T = 3. In a protein sequence, each amino acid would be assigned a number. Each k-tuple can then be assigned a number by the formula: c_i = e(x_i)4^2 + e(x_{i+1})4^1 + e(x_{i+2})4^0, where e(x) is the number assigned to base x and i is the starting position of the k-tuple. In this example, a k-tuple of k = 3 is used, so the k-tuple ACG gets the number 0*16 + 1*4 + 2*1 = 6.
The list of k-tuples is then rank-ordered by this number.
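A minimal Python sketch of this encoding, reading the formula as a base-4 number in which the first base of the k-tuple carries the highest power of 4. The names `CODE`, `ktuple_number` and `hash_ktuples` are made up for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def ktuple_number(ktuple):
    """Read the k-tuple as a base-4 number, first base most significant."""
    number = 0
    for base in ktuple:
        number = number * 4 + CODE[base]
    return number


def hash_ktuples(seq, k=3):
    """Table indexed by k-tuple number; each entry lists the start positions."""
    table = [[] for _ in range(4 ** k)]
    for i in range(len(seq) - k + 1):
        table[ktuple_number(seq[i:i + k])].append(i)
    return table


table = hash_ktuples("ACGACG", k=3)
print(ktuple_number("ACG"), table[ktuple_number("ACG")])
```

Because the table is indexed by the k-tuple number, it is already ordered by that number, and looking up where a k-tuple occurs is a single index operation.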
How does BLAST work?
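- It splits the query into overlapping words of length W
- It finds the neighborhood words for each word, that is, all words that score at least threshold T against it
- It looks up in a table where these neighborhood words occur in the database sequences; these hits are the seeds S
- It extends each seed S until the score drops more than X below the best score found so far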
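The four steps can be sketched in Python on a toy nucleotide example. This is a simplified illustration, not the BLAST implementation: the function names are made up, the match/mismatch scores replace a real substitution matrix, and only extension to the right is shown (extension to the left works the same way).

```python
from itertools import product

ALPHABET = "ACGT"
MATCH, MISMATCH = 2, -1  # toy scores, not a real substitution matrix


def score(a, b):
    """Ungapped score of two equal-length words."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def neighborhood(word, threshold):
    """All words of the same length scoring at least threshold against word."""
    return [
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if score(word, cand) >= threshold
    ]


def find_seeds(query, target, w, threshold):
    """Seeds as (query position, target position) pairs."""
    positions = {}  # lookup table: word in target -> start positions
    for j in range(len(target) - w + 1):
        positions.setdefault(target[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for neighbor in neighborhood(query[i:i + w], threshold):
            for j in positions.get(neighbor, []):
                seeds.append((i, j))
    return seeds


def extend_right(query, target, i, j, w, x_drop):
    """Extend a seed to the right; stop once the score is more than x_drop below the best."""
    current = best = score(query[i:i + w], target[j:j + w])
    length = best_len = w
    while i + length < len(query) and j + length < len(target):
        current += MATCH if query[i + length] == target[j + length] else MISMATCH
        length += 1
        if current > best:
            best, best_len = current, length
        elif best - current > x_drop:
            break
    return best, best_len


seeds = find_seeds("ACGT", "TTACGTAA", w=3, threshold=6)
print(seeds)
print(extend_right("ACGTAC", "TTACGTACGG", 0, 2, w=3, x_drop=2))
```

Lowering the threshold T enlarges the neighborhood and produces more seeds, which makes the search more sensitive but slower.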