Bioinformatics 8: advanced searching and multiple alignment Flashcards
Why might you want to filter query sequences?
Statistical models of alignment assume that all matching residues are of equal significance
But this is not the case, e.g. low-complexity regions (poly-A etc.), short-period repeats, generic protein structural features (coiled coils)
Essential in repeat-rich genomes, e.g. human (45% repetitive)
How could you filter query sequences?
Use a ‘masked’ query sequence (less meaningful regions replaced with a null character, e.g. N for nucleotides or X for amino acids)
Masking is done by filtering/masking programs
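A minimal sketch of what masking does, assuming the simplest possible rule: any run of six or more identical residues counts as low complexity. The function name `mask_homopolymers` and the `min_run` threshold are illustrative assumptions; real masking programs use more sophisticated complexity measures and also handle repeats that are not simple runs.

```python
import re

def mask_homopolymers(seq: str, min_run: int = 6, mask_char: str = "N") -> str:
    """Replace every run of at least min_run identical residues with mask_char.

    Toy rule only: it catches poly-A style runs, not short-period repeats.
    """
    # (.) captures one residue; \1{min_run-1,} requires the same residue to repeat
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)

query = "ACGT" + "A" * 10 + "CGT"
print(mask_homopolymers(query))  # the poly-A run is replaced, flanks are kept
```

The masked query keeps its original length, so alignment coordinates reported by a search still refer to positions in the unmasked sequence.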
What range of % identity could be real or could be noise (as suggested by our good friend Doolittle in 1981)?
18-25% (Twilight zone)
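As a rough illustration, % identity can be computed from two already-aligned sequences and checked against the 18-25% range given on this card. This is a sketch under one assumption: identity is counted over columns where neither sequence has a gap. Other definitions use a different denominator, so the numbers are not directly comparable between tools. The function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """% identity over aligned columns where neither sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def in_twilight_zone(pid: float, low: float = 18.0, high: float = 25.0) -> bool:
    """True if the identity falls in the range quoted on the card."""
    return low <= pid <= high

pid = percent_identity("MKTAYIAKQR", "MKLLLLLLLL")  # 2 of 10 columns match
print(pid, in_twilight_zone(pid))
```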
Explain iterative searching (e.g. in BLAST) and how it identifies distantly related sequences
Protein A (query) and Protein C (database) may be distantly related, but the relationship is not detected by a single BLAST search
A third protein, B, is initially detected in the database using Protein A as the query
Protein C is then detected by using Protein B as a query: an iteration
-> PSI-BLAST (Position-Specific Iterated BLAST) is the most widely used
Problems with iterative searching and provided solutions?
1) Number of BLAST searches significantly increases with each iteration
2) Erroneous results in first iteration can bias results
Solutions
1) Sequence profile stores existing matched sequences -> iterate until no new matches found
2) “triage” of sequences after first iteration required
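The A -> B -> C idea, and the "iterate until no new matches found" stopping rule, can be sketched with a toy search. Everything here is an assumption for illustration: the `identity` score, the 0.5 threshold and the tiny database are made up, and each new hit is simply reused as a query. PSI-BLAST itself instead combines the hits into a position-specific scoring matrix and searches with that.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions (toy score; equal-length sequences assumed)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def iterative_search(query: str, database: dict, threshold: float = 0.5) -> dict:
    """Search, then re-search with each new hit, until an iteration finds nothing new.

    Returns {sequence name: iteration at which it was first detected}.
    """
    found = {}
    probes = [query]
    iteration = 0
    while probes:
        iteration += 1
        new_probes = []
        for probe in probes:
            for name, seq in database.items():
                if name not in found and identity(probe, seq) >= threshold:
                    found[name] = iteration
                    new_probes.append(seq)
        probes = new_probes  # an empty list ends the loop: no new matches
    return found

protein_a = "AAAAAAAAAA"
database = {
    "B": "AAAAAACCCC",  # 60% identical to A
    "C": "AACCCCCCCC",  # 20% identical to A, 60% identical to B
    "D": "GGGGGGGGGG",  # unrelated
}
print(iterative_search(protein_a, database))
```

Here C is missed in the first pass and picked up only in the second, via B. The sketch also shows why triage matters: a false hit accepted in the first pass would be reused as a query and could drag in its own unrelated neighbours.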
What is PHI-BLAST?
Pattern-Hit Initiated BLAST
- an extension to PSI-BLAST that uses a pattern (e.g. insulin family motif) to start a search
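To make "using a pattern to start a search" concrete, here is a sketch that converts a PROSITE-style pattern into a Python regular expression and reports where it occurs in a sequence. The pattern used below is a made-up example, not the real insulin family motif, and the helper names are hypothetical. Only the basic syntax is handled: single residues, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, and `(n)` or `(n,m)` repeat counts.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern, e.g. 'C-x(2)-C-[ST]', to a regex."""
    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat
        core, lo, hi = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if lo and hi:
            core += "{%s,%s}" % (lo, hi)
        elif lo:
            core += "{%s}" % lo
        parts.append(core)
    return "".join(parts)

def find_seeds(pattern: str, sequence: str) -> list:
    """Return (start, end) spans where the pattern occurs in the sequence."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.span() for m in regex.finditer(sequence)]

print(prosite_to_regex("C-x(2)-C-[ST]-{P}-x(1,2)-[LIVM]"))
print(find_seeds("C-x(2)-C-[ST]", "MKCAACSLL"))
```

In a pattern-initiated search, only database sequences containing such a match would be considered, with the alignment seeded at the matched span.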
Applications of MSA (Multiple sequence alignment)?
Finding new related sequences
Genome sequence assembly
Phylogeny (highly conserved sequences can help establish evolutionary tree)
Protein structure prediction (conserved domains, motifs etc.)
Purpose of progressive alignment? Overview?
As MSA is very computationally demanding at scale, progressive alignment is used because it is faster yet still effective
Related sequences are clustered by similarity (e.g. by programs like Clustal), creating a ‘guide’ tree
-> sequences are then progressively aligned in the order given by this guide tree, most similar first
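The clustering step can be sketched as follows: compute pairwise distances, then repeatedly merge the two closest clusters. The nested tuples that result give the order in which sequences would be aligned. This is a toy under stated assumptions: distance is 1 minus the fraction of identical positions between equal-length sequences, and merging uses plain average linkage. Real programs estimate distances from pairwise alignments and use more careful tree-building methods.

```python
from itertools import combinations

def distance(a: str, b: str) -> float:
    """1 - fraction of identical positions (toy measure, equal lengths assumed)."""
    return 1.0 - sum(x == y for x, y in zip(a, b)) / len(a)

def guide_tree(seqs: dict):
    """Average-linkage clustering; returns nested tuples describing the merge order."""
    # Each cluster is (tree, member names); every sequence starts as its own cluster
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        def avg_dist(pair):
            (_, m1), (_, m2) = clusters[pair[0]], clusters[pair[1]]
            total = sum(distance(seqs[a], seqs[b]) for a in m1 for b in m2)
            return total / (len(m1) * len(m2))
        # Pick the closest pair of clusters and merge them into one subtree
        i, j = min(combinations(range(len(clusters)), 2), key=avg_dist)
        (t1, m1), (t2, m2) = clusters[i], clusters[j]
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [((t1, t2), m1 + m2)]
    return clusters[0][0]

seqs = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TTGGTTTT",
    "s4": "TTTTTTTT",
}
print(guide_tree(seqs))
```

Reading the result from the innermost tuples outward gives the alignment order: the closest pair is aligned first, and the resulting groups are aligned to each other last.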