database searching Flashcards
1
Q
BLAST
A
- very fast local search program
- roughly 50x faster than Smith-Waterman (SW)
- local is important
- may only have one section/domain that is homologous
- finds ‘words’ in the query that have matches in the database
- extends each word match in both directions to form a high-scoring segment pair (HSP) that can’t be extended further without the score dropping
- evaluates the statistical significance of each HSP
- heuristic
- tradeoff between accuracy and speed
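The seed-and-extend idea above can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it assumes exact word matches only (real protein BLAST also seeds on similar "neighbourhood" words), ungapped extension, and made-up scoring values. The function names `find_seeds` and `extend_seed` are hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Ungapped extension of one seed; stop once the score falls `drop` below the best.

    Returns (best_score, (query_start, query_end)) with query_end exclusive.
    Scoring values are illustrative assumptions.
    """
    best = w * match
    left = right = 0

    # Extend to the right of the seed.
    run, k = best, 0
    while i + w + k < len(query) and j + w + k < len(subject):
        run += match if query[i + w + k] == subject[j + w + k] else mismatch
        k += 1
        if run > best:
            best, right = run, k
        elif best - run >= drop:
            break

    # Extend to the left of the seed, starting from the best score so far.
    run, k = best, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        run += match if query[i - k - 1] == subject[j - k - 1] else mismatch
        k += 1
        if run > best:
            best, left = run, k
        elif best - run >= drop:
            break

    return best, (i - left, i + w + right)
```

Giving up on an extension once the score drops too far is where the heuristic trades accuracy for speed: an alignment that recovers after a bad stretch is never seen.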
2
Q
search sensitivity
A
- PSI-BLAST = iterative version of BLAST
- increased sensitivity
- accumulates similar sequences
- protein searches are more sensitive than nucleotide searches
- nucleotide searches have no notion of similarity (bases are either the same or not)
3
Q
blast output
A
- expect value and number of identities most important
- gaps also
- short alignments can have high identity but more likely to be due to chance
- longer alignments tend to give higher bit scores
4
Q
bit score
A
- re-scaled version of raw alignment score
- independent of search space size
- for query sequence length n and summed length of all database sequences m, the search space is proportional to N = nm
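The re-scaling can be written out using the standard Karlin-Altschul form, S' = (lambda * S - ln K) / ln 2. A minimal sketch: `lam` and `k` are the statistical parameters of the scoring system, and their values depend on the substitution matrix and gap penalties, so none are assumed here.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)
```

Because lambda and K are folded in, bit scores from searches that use different scoring systems can be compared directly.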
5
Q
e value
A
- expected number of matches with same bit score or better that would be produced by random chance
- takes into account size of search space
- bigger database increases chance of finding a match
- a low E value indicates a real evolutionary relationship between query and match
- acts as the expected number of false positives at that score
- greater than 1 indicates no statistical support for the relationship
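The link between bit score, search space and E value can be sketched as E = n * m * 2^(-S'). This is a simplification: real BLAST uses effective (edge-corrected) lengths rather than the raw n and m, and the function name `e_value` is just for illustration.

```python
def e_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score` in a search space n * m."""
    return query_len * db_len * 2.0 ** (-bit_score)
```

Doubling the database size doubles E for the same bit score, which is why the same alignment looks less significant against a bigger database.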
6
Q
P value
A
- probability of obtaining a match by chance
- different to E value (E can be greater than 1, P cannot)
- nearly identical when E is small, since P = 1 - e^(-E)
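A quick numerical check of how P and E relate, assuming the number of chance hits follows a Poisson distribution so that P = 1 - e^(-E). The helper name `p_from_e` is hypothetical.

```python
import math


def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    return -math.expm1(-e_value)  # 1 - exp(-E), accurate for small E
```

For small E the two are practically equal; for large E, P levels off just below 1 while E keeps growing.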
7
Q
multiple sequence alignments
A
- proteins form families with related sequences, structures and functions
- can learn more by aligning multiple sequences from related family members instead of pairwise alignments
- patterns of conservation reflect structural and functional evolutionary constraints
- e.g. loops - less important for function, less well conserved
- key functional residues often show strong conservation
- e.g. serine proteases - conserved catalytic triad of Ser, His, Asp
8
Q
MSA algorithms
A
- more sequences so slower than pairwise alignments
- use heuristic methods
- e.g. clustal
9
Q
clustal
A
- build guide tree
- perform all pairwise alignments
- group similar sequences
- higher score → neighbouring branches
- guides other pairwise alignments (more information)
- align sequences progressively
- most closely related pairs aligned first
- align next most closely related sequences to existing alignments
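The guide-tree step can be illustrated with a toy merge order. This is a sketch under assumptions, not Clustal's code: it scores pairs with a plain Needleman-Wunsch global alignment using made-up match/mismatch/gap values, and groups them by average-linkage clustering on the scores, which stands in for whichever tree-building method a given Clustal version uses. `nw_score` and `guide_order` are hypothetical names.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch), score only, linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def guide_order(seqs):
    """Return the merge order as (cluster_a, cluster_b) tuples, highest-scoring first.

    `seqs` maps unique names to sequences.
    """
    score = {
        frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def average(c1, c2):
        total = sum(score[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges
```

The returned merge order is the order in which a progressive aligner would combine sequences and partial alignments: closest pair first, then the next closest sequence or group.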
10
Q
MSA programs
A
- clustal omega
- T-coffee
- slow, best for smaller alignments
- align multiple MSAs
- MUSCLE
- fast
- estimates sequence similarity using short words of n residues (k-mers)
- align small words and build up
11
Q
sequence profiles
A
- weight profile or PSSM
- captures information in MSA
- table of residue frequencies at each position of the alignment
- can include gaps
- profile size n x 21 (n alignment positions; 20 amino acids plus gap)
- number of sequences is irrelevant to profile size
- only considers information in that set of sequences
- similar residues not observed in the set are ignored
- can be combined with BLOSUM scores
- used by PSI-BLAST to search for more remote homologues
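A frequency profile of this kind can be built directly from an alignment. A minimal sketch, assuming the MSA is given as equal-length strings with "-" for gaps; the column order in `ALPHABET` is an arbitrary choice for illustration.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap = 21 columns


def frequency_profile(msa):
    """Return an n x 21 table: one row per alignment position, one column per symbol."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append([counts.get(sym, 0) / len(msa) for sym in ALPHABET])
    return profile
```

The table has n rows however many sequences go in, and any residue never seen at a position gets frequency 0.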
12
Q
problems with frequency scores
A
- new sequence can differ from profile derived from existing set
- doesn’t include evolutionary knowledge of conservative substitutions
13
Q
PSSM
A
- position specific scoring matrix
- substitution matrix (like BLOSUM) combined with observed frequencies
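One simple way to turn observed frequencies into position-specific scores is a log-odds calculation with pseudocounts. This sketch is a deliberate simplification: it assumes a uniform background and flat add-one pseudocounts, whereas real PSSMs derive their pseudocounts from a substitution matrix such as BLOSUM. `log_odds_pssm` is a hypothetical name.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_pssm(msa, pseudocount=1.0, background=None):
    """Return one {residue: log2 odds score} dict per alignment column (gaps ignored)."""
    if background is None:
        # Assumption: uniform background frequencies, for illustration only.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in AMINO_ACIDS
        })
    return pssm
```

The pseudocount means an unseen residue gets a low score instead of an impossible one, so a new sequence that differs slightly from the existing set is not ruled out.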
14
Q
PSI-BLAST
A
- position specific iterated blast
- form a PSSM from the MSA of the initial BLAST hits
- include closely related sequences only, to limit noise
- search again with this profile
- add more significant hits to refine profile
- iterate until no more significant hits found (or for x iterations)
- bridges sequence space to find more distantly related homologues
- start with a more conservative (stricter) E value threshold to ensure you get the right sequences
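The iteration itself is just a loop with a convergence check. The sketch below shows only that control flow: `search` and `build_profile` are hypothetical callables supplied by the caller (they stand in for a real BLAST run and PSSM construction), and the default threshold and iteration cap are arbitrary.

```python
def iterate_search(query, search, build_profile, e_threshold=0.001, max_iterations=5):
    """Repeat search -> rebuild profile until no new significant hits appear.

    search(profile) must yield (seq_id, e_value) pairs;
    build_profile(ids) must return whatever `search` accepts as a profile.
    Both are placeholders, not a real BLAST API.
    """
    included = set()
    profile = query  # the first round searches with the plain query
    for _ in range(max_iterations):
        hits = {seq_id for seq_id, e_value in search(profile) if e_value <= e_threshold}
        new_hits = hits - included
        if not new_hits:
            break  # converged: nothing new passed the threshold
        included |= new_hits
        profile = build_profile(sorted(included))
    return included
```

A strict `e_threshold` matters because any false positive admitted in an early round is built into the profile and pulls in more of its relatives in later rounds.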
15
Q
PSI-BLAST
low complexity regions
A
- repeats of a few residues, often a coil
- limited information - unrelated protein sequences brought in
- need to be masked
- SEG program in blast
- run with and without - no clear default
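The idea behind masking can be illustrated with a sliding-window entropy filter. This is not the SEG algorithm, only a rough stand-in for the same idea; the window size, threshold and function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a string."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue inside a low-entropy window with `mask_char`."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)
```

Masked residues add nothing to the alignment score, so a run of a single repeated residue can no longer pull in unrelated proteins that happen to contain the same repeat.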