Sequence Similarity Searching Flashcards
What is database structure determined by?
The requirements of designers/users
Complete this statement: databases can be local or…?
Remote
Complete this statement: querying can be manual or…?
Automated
What must providers such as NCBI/EBI balance across users?
Demand on computation resources
What does sequence similarity in DNA/proteins suggest?
Common ancestry
What might common ancestry imply?
Common function
What is the name given to homologs separated by a speciation event?
Orthologs
What is the name given to homologs separated by a duplication event?
Paralogs
Paralogs and orthologs are two types of homologous sequence, true or false?
True
What does the alignment or equivalencing of bases enable?
Maximisation of similarity
What could a database query look like?
Could simply be a sequence (DNA/protein)
Could be a logical structure, e.g. human + mitochondrial + HVS2
Why do sequence databases require specialised search tools?
Due to size and similarity
Is quantification of biological similarity easy or difficult?
Can be difficult
What can searching sequence databases for similar sequences predict about novel sequences?
Possible functions
What can alignments of sequences contain?
Mismatches and gaps
How are mismatches and gaps interpreted in sequence alignments?
As substitutions and indels respectively
What do alignment algorithms ideally try to identify about sequences?
The most likely evolutionary ‘path’ between sequences
What are databases?
Searchable collections of information
What does how we search databases depend on?
Database access, design and location
What does the quantification of sequence similarity require?
Alignment
What is the constant gap penalty?
Opening a gap of any size attracts a constant penalty (a)
= -a
What is the proportional gap penalty?
Opening a gap attracts a penalty proportional to its length (L)
= -(aL)
What is the affine gap penalty?
Opening a gap attracts a constant (a), extending it attracts a penalty (b) proportional to the gap’s length (L)
= -(a + bL), where a ≫ b
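The three gap penalty formulas above can be compared side by side. This is a minimal sketch: the function names and the example values of a and b are illustrative assumptions, not values taken from any particular alignment program.

```python
def constant_gap_penalty(length: int, a: float) -> float:
    """Constant: any gap costs -a, whatever its length."""
    return -a if length > 0 else 0.0


def proportional_gap_penalty(length: int, a: float) -> float:
    """Proportional: a gap of length L costs -(a * L)."""
    return -(a * length) if length > 0 else 0.0


def affine_gap_penalty(length: int, a: float, b: float) -> float:
    """Affine: opening costs a, each gap position costs b, giving -(a + b * L)."""
    return -(a + b * length) if length > 0 else 0.0


if __name__ == "__main__":
    # Illustrative values only; a is chosen much larger than b for the affine case.
    for gap_length in (1, 5, 20):
        print(
            gap_length,
            constant_gap_penalty(gap_length, a=10),
            proportional_gap_penalty(gap_length, a=2),
            affine_gap_penalty(gap_length, a=10, b=1),
        )
```

With these numbers, one gap of 20 positions costs -30 under the affine scheme, whereas 20 separate single-position gaps would cost 20 × -11 = -220. This is why the affine scheme favours a few long indels over many short ones.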
What type of gap penalty is generally the most relevant biologically?
Affine
What does the choice of gap penalty depend on?
Software
What do amino acid side chains share?
Chemical properties (acidic/basic etc.)
What is the accepted theory about amino acid substitutions?
Chemically similar amino acids substitute more readily than chemically dissimilar amino acids
What is ‘built into’ amino acid substitution matrices?
Physico-chemical classification of amino acids
In the PAM250 (accepted point mutation) substitution matrix, what do similar amino acids score?
+ve score
In the PAM250 (accepted point mutation) substitution matrix, what do dissimilar amino acids score?
-ve score
What is the PAM250 (accepted point mutation) substitution matrix based on alignments of?
Closely related proteins
What are accepted point mutation substitution matrices extrapolated to?
Large (PAM120) and very large (PAM250) evolutionary distances
What is BLOSUM62 (blocks substitution matrix) based on alignments of?
Gap free alignments of short protein motifs (blocks)
What do the numbers represent in BLOSUM62 (blocks substitution matrix)?
The level of identity at which sequences in the alignments were clustered (BLOSUM62 = 62%)
BLOSUM is continually updated, true or false?
False, it is no longer updated but is still widely used
BLOSUM62 (blocks substitution matrix) has no extrapolation. What does this mean for distant relationships?
More reliable
What is the default amino acid substitution matrix of BLAST?
BLOSUM62 (blocks substitution matrix)
What are the types of BLAST searches and their uses?
Nucleotide query versus nucleotide database, i.e. what gene is this?
Protein query versus protein database, i.e. what protein is this?
Translated nucleotide query versus protein database, i.e. does this DNA sequence code for a known protein?
Protein query versus translated nucleotide database, i.e. can we identify a DNA sequence that might encode this protein?
What are the 4 sections in the results page of a BLAST search?
Search information (including RID)
Graphical summary (conserved domain search)
Results table (hyperlinked to alignments)
Alignments (download links)
Databases contain more information than can be searched practically by observation, true or false?
True
Most databases are relational. What does this mean?
The data are organised into tables with defined inter-relationships
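As a rough illustration of tables with defined inter-relationships, the sketch below builds a tiny in-memory database using Python's standard sqlite3 module. The table names, column names and DEMO accessions are all made up for the example and do not reflect the schema of any real sequence database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organism (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE sequence (
            id          INTEGER PRIMARY KEY,
            organism_id INTEGER NOT NULL REFERENCES organism(id),
            accession   TEXT NOT NULL,
            residues    TEXT NOT NULL
        );
        INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
        INSERT INTO sequence VALUES
            (1, 1, 'DEMO001', 'GATTACA'),
            (2, 1, 'DEMO002', 'ACGTACGT'),
            (3, 2, 'DEMO003', 'TTGACA');
        """
    )
    return conn


def accessions_for(conn: sqlite3.Connection, organism_name: str) -> list[str]:
    """Follow the sequence -> organism relationship with a JOIN."""
    rows = conn.execute(
        "SELECT s.accession FROM sequence AS s "
        "JOIN organism AS o ON o.id = s.organism_id "
        "WHERE o.name = ? ORDER BY s.accession",
        (organism_name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(accessions_for(build_demo_db(), "Homo sapiens"))
```

The organism_id column is what links each sequence row to its organism row; the query uses that link rather than storing the organism name on every sequence.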
Does the manual querying of remote databases require specialist knowledge?
Little
What might automated querying of local databases enable?
Greater throughput and flexibility
Cheaper hardware for databases can increase locally available resources but what does it also make quite costly?
Administration
What do highly similar (>80%) sequences probably share?
A common ancestor, and thus they are probably homologous
Why is it necessary to quantify similarity of sequences?
To distinguish between ‘chance’ and ‘real’ similarity (common ancestry)
What are alignments generally interpreted in terms of?
An explicit model of molecular evolution (substitutions and indels)
Sequences are either homologous or not, true or false?
True, i.e. they either share a common ancestor or they do not
Any two sequences can show similarity simply by ‘chance’, true or false?
True
Is the alignment of pairs of long sequences computationally easy or intensive?
Intensive due to a large number of equivalences
What do dynamic programming algorithms (DPAs) theoretically allow?
Exhaustive identification of optimal alignments
Are DPA methods slow or fast for searching large databases?
Often considered too slow
What can DPA methods find?
Optimal global or local alignments
Give an example of a global algorithm
Needleman-Wunsch
What do global algorithms try to find?
Similarity over the whole sequence
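A global alignment of this kind can be sketched with a small dynamic programming table. The code below is a simplified Needleman-Wunsch with a linear gap penalty; the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, not defaults of any tool.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GATACA"))
```

Because every residue of both sequences must appear in the output, the shorter sequence is padded with gaps rather than being matched only where it fits best.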
Give an example of a local algorithm
Smith-Waterman
What do local alignments try to find?
Just local regions
They do not try to align the whole sequence
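A local alignment can be sketched in the same way. The code below is a simplified Smith-Waterman with a linear gap penalty; the scoring values (match +2, mismatch -1, gap -2) are assumptions for the example. The key differences from a global method are that cell scores are never allowed to fall below zero and that the traceback starts from the highest-scoring cell rather than the corner.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a fresh local alignment here
                h[i - 1][j - 1] + sub,  # extend diagonally
                h[i - 1][j] + gap,      # gap in b
                h[i][j - 1] + gap,      # gap in a
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Only the shared core is reported; the unrelated flanks are ignored.
    print(smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"))
    # → (14, 'GATTACA', 'GATTACA')
```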
Which is more biologically relevant, global or local alignment?
Local alignment
Different algorithms will give similar alignments of the same sequences, true or false?
False, different algorithms can give very different alignments of the same sequences
What are the pros and cons of exhaustive algorithms to find optimal alignments?
Accurate but slow
Genes with shared common ancestors are homologous and so may be what?
Very similar in sequence
Do exhaustive mathematical algorithms find the alignment with the maximum or minimum similarity?
Maximum
What does distinguishing between ‘real’ and ‘chance’ alignments require?
The comparison of some measure of similarity
What do scoring procedures need to take into account?
Biology
How can we measure similarity of sequences?
By scoring the alignment
What do E-values enable?
Comparison between searches and provide an objective threshold for significance
What can protein sequences identify?
More distant evolutionary relationships (depending on scoring matrix)
What can DNA searches find?
Evolutionary close relationships
BLAST is fast, but what might it miss?
Optimal alignments
DNA and protein databases can only be searched effectively using what algorithms?
Heuristic algorithms (speed)
How can we compare results when using different methods for alignment?
Using p values and E values
What does E value stand for?
The expect value
What do E values indicate?
How often a match at a given p value would be expected to occur in the database, by chance (i.e. when the sequences are unrelated)
Should ‘real’ alignments between sequences that do share a common ancestor show more or less similarity than ‘chance’ alignments of sequences that do not?
Greater similarity
What are the pros and cons of maximisation scores?
They ensure the best alignment but are not comparable
Are default scores and comparing alignments good to use?
Generally good, but experimentation is possible
What does the percentage of identically aligned residues allow?
Comparison of different alignments
How is the percentage of identically aligned residues calculated?
The number of matches divided by the length of the alignment, multiplied by 100
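As a worked example of this calculation, the sketch below counts identical columns in two already-aligned sequences. It assumes gaps are written as "-" and that the alignment length includes gap columns; other definitions of the denominator exist.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Matches divided by alignment length, times 100 (gap columns never match)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # 6 identical columns out of 7, i.e. about 85.7%
    print(round(percent_identity("GATTACA", "GAT-ACA"), 1))
```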
When scoring alignments, what is the assumption about nucleotides (G, A, T, C)?
They substitute equally for one another
What type of selection are protein sequences under for structure and function?
Stabilising selection
Are changes between chemically similar amino acids more or less likely to be deleterious?
Less likely
When scoring protein alignments, particular substitutions can have different scores. What does this depend upon?
Their chemical similarity, e.g. LEU to ILE or PRO to TRP
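The two substitutions named above can be looked up in a substitution matrix. The sketch below hard-codes a small excerpt of BLOSUM62 covering only four amino acids (L, I, P, W); the helper name is hypothetical, and a real analysis would load the full 20 × 20 matrix from the alignment software.

```python
# Excerpt of BLOSUM62 for leucine (L), isoleucine (I), proline (P), tryptophan (W).
# Only the pairs needed for this example are listed.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("P", "P"): 7,
    ("W", "W"): 11,
    ("L", "I"): 2,   # chemically similar: positive score
    ("P", "W"): -4,  # chemically dissimilar: negative score
}


def substitution_score(x: str, y: str) -> int:
    """Look up a pair in either order, since the matrix is symmetric."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]  # raises KeyError for pairs not in the excerpt


if __name__ == "__main__":
    print("LEU to ILE:", substitution_score("L", "I"))
    print("PRO to TRP:", substitution_score("P", "W"))
```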
Almost any 2 sequences can be aligned by using what?
Gaps (indels)
How is the quality of an alignment assessed?
Using a scoring matrix
Matches are +ve, mismatches are 0 or -ve, gaps are -ve
What do heavy gap penalties help the discovery of?
Biologically meaningful alignments
What do gaps in a query represent?
Insertion/deletion events, relative to ancestor
How are mismatches (substitutions) in an alignment scored?
Equivalent
G>T = G>C = -1
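A simple nucleotide scoring scheme of this kind can be applied column by column. In the sketch below the mismatch score of -1 follows the card above, while the match score (+1) and gap score (-2) are assumptions for the example; every mismatch is scored the same regardless of which bases are involved.

```python
def score_alignment(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum per-column scores for two aligned sequences (gaps written as '-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical bases
        else:
            total += mismatch   # any substitution, e.g. G>T or G>C
    return total


if __name__ == "__main__":
    # 5 matches, 1 mismatch, 1 gap: 5 - 1 - 2 = 2
    print(score_alignment("GATTACA", "GCT-ACA"))
```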
Algorithms maximise the alignment score, which can mean introducing many indels, but what does biology indicate about indels?
That they should be rare
What are the pros and cons of heuristic algorithms?
They are not guaranteed to find the best alignment but are much faster (by 5-20x)
What is the main problem with DPAs?
They are too slow for large databases
What does sequence database searching involve?
Aligning query to all database sequences and ranking ‘hits’ by similarity/quality
What is the main advantage to using DPAs?
They are guaranteed to find the highest scoring alignment (although this may not reflect biological reality)
What does alignment allow?
Quantification of similarity between sequences
What does the statistical theory of alignment enable?
P values to be calculated
What is a p value?
The probability of observing an alignment that scores at least as highly between two unrelated sequences of similar length and composition
What p value indicates a significant match?
p<0.05
Why are E values greater than 0.01 unlikely to reflect real matches?
Since as good a match would be expected to occur at least 1% of the time, even for unrelated sequences
E values are always calculated in the same way, true or false?
False
How are E values calculated for BLAST?
E = pX
Where X is the total length of all the sequences in the database (treats the database as one large sequence)
How are E values calculated for FASTA?
E = pN
Where N is the number of sequences
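The difference between the two formulas can be seen on a toy database. This sketch uses only the simplified forms given on the cards (E = pX and E = pN); the function names and the numbers are illustrative assumptions, and the statistics used inside real search tools are more involved.

```python
import math


def blast_e_value(p: float, sequence_lengths: list[int]) -> float:
    """E = p * X, where X is the total length of all database sequences."""
    return p * sum(sequence_lengths)


def fasta_e_value(p: float, sequence_lengths: list[int]) -> float:
    """E = p * N, where N is the number of database sequences."""
    return p * len(sequence_lengths)


if __name__ == "__main__":
    # Hypothetical database of three sequences, 1000 residues in total.
    lengths = [300, 450, 250]
    p = 1e-6
    print("BLAST-style E:", blast_e_value(p, lengths))
    print("FASTA-style E:", fasta_e_value(p, lengths))
    assert math.isclose(blast_e_value(p, lengths), 1e-3)
```

The same p value gives different E values under the two schemes, which is why E values from different programs should not be compared directly.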
What is the default word size (W) for BLAST?
3 for proteins
11 for DNA
BLAST searches only for word matches above what?
A threshold, T
What did early versions of BLAST only allow for?
Ungapped alignments (but T allows mismatches)
What happens to high-scoring segment pairs (HSPs) along the BLAST query?
They are reported and ordered by score
What do current versions of BLAST allow which improves its sensitivity?
Gapped alignments
In BLAST searches, all matches above T are extended until what?
The introduction of gaps causes the alignment score to fall quickly
Alignments are scored in order to quantify what?
Similarity
What are the most common heuristic algorithms?
FASTA and BLAST
Initial alignment hits are examined to see if they can be what?
Extended
Generally speaking, what do heuristic methods do?
Break the query into short ‘words’ and look for matches above a threshold level in the subject database
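The word-based idea can be sketched as follows. This simplified version finds exact word matches only; it does not apply a threshold score or extend the hits, both of which real heuristic tools do. The function names and the word size used in the example are assumptions.

```python
def words(seq: str, w: int) -> list[tuple[int, str]]:
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_seeds(query: str, subject: str, w: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index the subject once so each query word is a dictionary lookup.
    index: dict[str, list[int]] = {}
    for j, word in words(subject, w):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, w) for j in index.get(word, [])]


if __name__ == "__main__":
    # The query words TTA and TAC both occur in the subject.
    print(find_seeds("GATTACA", "TTAC", w=3))
    # → [(2, 0), (3, 1)]
```

Each seed marks a place where extension could begin, so only a small fraction of all possible query and subject position pairs needs to be examined in detail.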
What do DPAs find?
A computationally optimal solution to the alignment problem
What do heuristic methods assume?
That high scoring alignments contain short regions of exact matches