Sequence Similarity Searching Flashcards by Emily Harper

What is database structure determined by?

The requirements of designers/users

Complete this statement, databases can be local or…?

Remote

Complete this statement, querying can be manual or…?

Automated

What must providers such as NCBI/EBI balance across users?

Demand on computation resources

What does sequence similarity in DNA/proteins suggest?

Common ancestry

What might common ancestry imply?

Common function

What is the name given to homologs separated by a speciation event?

Orthologs

What is the name given to homologs separated by a duplication event?

Paralogs

Paralogs and orthologs are two types of homologous sequence, true or false?

True

What does the alignment or equivalencing of bases enable?

Maximisation of similarity

What could a database query look like?

Could simply be a sequence (DNA/protein)

Could be a logical structure, e.g. human + mitochondrial + HVS2

Why do sequence databases require specialised search tools?

Due to size and similarity

Is quantification of biological similarity easy or difficult?

Can be difficult

What can searching sequence databases for similar sequences predict about novel sequences?

Possible functions

What can alignments of sequences contain?

Mismatches and gaps

How are mismatches and gaps interpreted in sequence alignments?

As substitutions and indels respectively

What do alignment algorithms ideally try to identify about sequences?

The most likely evolutionary ‘path’ between sequences

What are databases?

Searchable collections of information

What does how we search databases depend on?

Database access, design and location

What does the quantification of sequence similarity require?

Alignment

What is the constant gap penalty?

Opening a gap of any size attracts a constant (a) negative score
= -a

What is the proportional gap penalty?

Opening a gap attracts a penalty proportional to its length (L)
= -(aL)

What is the affine gap penalty?

Opening a gap attracts a constant (a), extending it attracts a penalty (b) proportional to the gap’s length (L)
= -(a + bL), where a ≫ b
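
The three gap penalty schemes on these cards can be compared side by side. The sketch below is illustrative only: the function names are made up for this example, and the formulas are taken exactly as written on the cards (a is the opening constant, b the extension cost, L the gap length).

```python
def constant_gap_penalty(a: float, length: int) -> float:
    """Constant scheme: any gap, whatever its length, scores -a."""
    return -a if length > 0 else 0.0


def proportional_gap_penalty(a: float, length: int) -> float:
    """Proportional scheme: the penalty grows with gap length, -(a * L)."""
    return -(a * length)


def affine_gap_penalty(a: float, b: float, length: int) -> float:
    """Affine scheme as written on the card: -(a + b * L), with a much larger than b."""
    return -(a + b * length) if length > 0 else 0.0


if __name__ == "__main__":
    # Arbitrary example values: opening cost 10, extension cost 1.
    for gap_length in (1, 5, 20):
        print(
            gap_length,
            constant_gap_penalty(10, gap_length),
            proportional_gap_penalty(10, gap_length),
            affine_gap_penalty(10, 1, gap_length),
        )
```

With these example values, one long gap costs far less under the affine scheme than under the proportional one, which is why affine penalties favour a few long indels over many short ones.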

What type of gap penalty is generally the most relevant biologically?

Affine

What does the choice of gap penalty depend on?

Software

What do amino acid side chains share?

Chemical properties (acidic/basic etc.)

What is the accepted theory about amino acid substitutions?

Chemically similar amino acids substitute more readily than chemically dissimilar amino acids

What is 'built into' amino acid substitution matrices?

Physico-chemical classification of amino acids

In the PAM250 (accepted point mutation) substitution matrix, what do similar amino acids score?

+ve score

In the PAM250 (accepted point mutation) substitution matrix, what do dissimilar amino acids score?

-ve score

What is the PAM250 (accepted point mutation) substitution matrix based on alignments of?

Closely related proteins

What are accepted point mutation substitution matrices extrapolated to?

Large (PAM120) and very large (PAM250) evolutionary distances

What is BLOSUM62 (blocks substitution matrix) based on alignments of?

Gap free alignments of short protein motifs (blocks)

What do the numbers represent in BLOSUM62 (blocks substitution matrix)?

The level of identity at which sequences in the block alignments were clustered (BLOSUM62 = 62%)

BLOSUM is continually updated, true or false?

False, it is no longer updated but is still widely used

BLOSUM62 (blocks substitution matrix) has no extrapolation. What does this mean for distant relationships?

More reliable

What is the default amino acid substitution matrix of BLAST?

BLOSUM62 (blocks substitution matrix)

What are the types of BLAST searches and their uses?

Nucleotide query versus nucleotide database (blastn), i.e. what gene is this?
Protein query versus protein database (blastp), i.e. what protein is this?
Translated nucleotide query versus protein database (blastx), i.e. does this DNA sequence code for a known protein?
Protein query versus translated nucleotide database (tblastn), i.e. can we identify a DNA sequence that might encode this protein?

What are the 4 sections in the results page of a BLAST search?

```
Search information (including RID)
Graphical summary (conserved domain search)
Results table (hyperlinked to alignments)
Alignments (download links)
```

Databases contain more information than can be searched practically by observation, true or false?

True

Most databases are relational. What does this mean?

The data are organised into tables with defined inter-relationships
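
As a rough illustration of tables with defined inter-relationships, the sketch below builds a tiny in-memory SQLite database and runs a logical query of the "human + mitochondrial + HVS2" kind mentioned on an earlier card. The table names, column names and rows are invented placeholders, not the schema of any real sequence database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create two related tables; all rows are made-up placeholder data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE sequence (
            id INTEGER PRIMARY KEY,
            organism_id INTEGER REFERENCES organism(id),  -- link between tables
            genome TEXT,
            region TEXT,
            residues TEXT
        );
        INSERT INTO organism VALUES (1, 'human'), (2, 'mouse');
        INSERT INTO sequence VALUES
            (1, 1, 'mitochondrial', 'HVS2', 'ACGT'),
            (2, 1, 'nuclear', 'exon1', 'GGCC'),
            (3, 2, 'mitochondrial', 'HVS2', 'ACGA');
        """
    )
    return conn


def find_sequences(conn, organism, genome, region):
    """Logical query: organism AND genome AND region, joined across both tables."""
    cursor = conn.execute(
        "SELECT s.id, s.residues FROM sequence AS s "
        "JOIN organism AS o ON o.id = s.organism_id "
        "WHERE o.name = ? AND s.genome = ? AND s.region = ? "
        "ORDER BY s.id",
        (organism, genome, region),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(find_sequences(build_demo_db(), "human", "mitochondrial", "HVS2"))
```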

Does the manual querying of remote databases require specialist knowledge?

Little

What might automated querying of local databases enable?

Greater throughput and flexibility

Cheaper hardware for databases can increase locally available resources but what does it also make quite costly?

Administration

What do highly similar (>80%) sequences probably share?

A common ancestor, and thus they are probably homologous

Why is it necessary to quantify similarity of sequences?

To distinguish between 'chance' and 'real' similarity (common ancestry)

What are alignments generally interpreted in terms of?

An explicit model of molecular evolution (substitutions and indels)

Sequences are either homologous or not, true or false?

True, i.e. they either share a common ancestor or they do not

Any two sequences can show similarity simply by 'chance', true or false?

True

Is the alignment of pairs of long sequences computationally easy or intensive?

Intensive due to a large number of equivalences

What do dynamic programming algorithms (DPAs) theoretically allow?

Exhaustive identification of optimal alignments

Are DPA methods slow or fast for searching large databases?

Often considered too slow

What can DPA methods find?

Optimal global or local alignments

Give an example of a global algorithm

Needleman-Wunsch

What do global algorithms try to find?

Similarity over the whole sequence
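
A minimal sketch of global alignment scoring by dynamic programming, in the style of Needleman-Wunsch. It returns only the optimal score (no traceback) and assumes a simple proportional gap penalty with arbitrary example scores (match +1, mismatch -1, gap -1); real tools use substitution matrices and usually affine gaps.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the best global alignment score between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Global alignment: leading gaps are penalised, so the borders are filled
    # with cumulative gap costs.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap   # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    # The bottom-right cell covers both sequences end to end.
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, which is the reason exhaustive methods become slow when a query is compared against every sequence in a large database.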

Give an example of a local algorithm

Smith-Waterman

What do local alignments try to find?

Just local regions | They do not try to align the whole sequence
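
A comparable sketch for local alignment in the style of Smith-Waterman. The key difference from the global case is that cell scores are never allowed to drop below zero, so an alignment can start and end anywhere. The scores used (match +2, mismatch -1, gap -2) are arbitrary example values, and only the best score is returned.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    best = 0
    previous = [0] * (len(b) + 1)  # row i-1 of the scoring table
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets a new local alignment begin at any cell.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    # Only the shared "ACG" region contributes to the score.
    print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```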

Which is the most biologically relevant, global or local alignment?

Local alignment

Different algorithms will give similar alignments of the same sequences, true or false?

False, different algorithms can give very different alignments of the same sequences

What are the pros and cons of exhaustive algorithms to find optimal alignments?

Accurate but slow

Genes with shared common ancestors are homologous and so may be what?

Very similar in sequence

Do exhaustive mathematical algorithms find the alignment with the maximum or minimum similarity?

Maximum

What does distinguishing between 'real' and 'chance' alignments require?

The comparison of some measure of similarity

What do scoring procedures need to take into account?

Biology

How can we measure similarity of sequences?

By scoring the alignment

What do E-values enable?

Comparison between searches; they also provide an objective threshold for significance

What can protein sequences identify?

More distant evolutionary relationships (depending on scoring matrix)

What can DNA searches find?

Evolutionarily close relationships

BLAST is fast, but what might it miss?

Optimal alignments

DNA and protein databases can only be searched effectively using what algorithms?

Heuristic algorithms (speed)

How can we compare results when using different methods for alignment?

Using p values and E values

What does E value stand for?

The expect value

What do E values indicate?

How often a match at a given p value would be expected to occur in the database, by chance (i.e. when the sequences are unrelated)

Should 'real' alignments between sequences that do share a common ancestor show more or less similarity than 'chance' alignments of sequences that do not?

Greater similarity

What are the pros and cons of maximisation scores?

They ensure the best alignment but are not comparable

Are default scores and comparing alignments good to use?

Generally good, but experimentation is possible

What does the percentage of identically aligned residues allow?

Comparison of different alignments

How is the percentage of identically aligned residues calculated?

The number of matches divided by the length of the alignment, multiplied by 100
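
The calculation can be sketched as follows, assuming the two sequences are already aligned, are of equal length, and use "-" for gaps (the function name and gap character are choices made for this example).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap_char: str = "-") -> float:
    """Matches divided by alignment length, multiplied by 100."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as a match only if both residues are identical and not gaps.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap_char
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # 6 matching columns out of 7 aligned columns.
    print(round(percent_identity("GATTACA", "GAT-ACA"), 1))
```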

When scoring alignments, what is the assumption about nucleotides (G, A, T, C)?

They substitute equally for one another

What type of selection are protein sequences under for structure and function?

Stabilising selection

Are changes between chemically similar amino acids more or less likely to be deleterious?

Less likely

When scoring protein alignments, particular substitutions can have different scores. What does this depend upon?

Their chemical similarity, e.g. LEU to ILE or PRO to TRP
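
To make the idea concrete, here is a deliberately simplified toy scoring function. It is not PAM or BLOSUM: the six chemical groupings and the scores (+2 identical, +1 same group, -1 different group) are assumptions chosen purely for illustration.

```python
# Toy grouping of the 20 standard amino acids (one-letter codes); an
# illustrative assumption, not the classification behind any real matrix.
CHEMICAL_CLASSES = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "polar": "STNQ",
    "basic": "KRH",
    "acidic": "DE",
    "special": "CGP",
}
CLASS_OF = {
    residue: name
    for name, members in CHEMICAL_CLASSES.items()
    for residue in members
}


def toy_substitution_score(x: str, y: str) -> int:
    """Score a pair of residues by chemical similarity (toy values)."""
    if x == y:
        return 2
    if CLASS_OF[x] == CLASS_OF[y]:
        return 1   # conservative substitution, e.g. Leu to Ile
    return -1      # chemically dissimilar, e.g. Pro to Trp


if __name__ == "__main__":
    print(toy_substitution_score("L", "I"), toy_substitution_score("P", "W"))
```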

Almost any 2 sequences can be aligned by using what?

Gaps (indels)

How is the quality of an alignment assessed?

Using a scoring matrix | Matches are +ve, mismatches are 0, gaps are -ve
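
A minimal sketch of scoring a finished alignment column by column, following the scheme on this card (matches positive, mismatches zero, gaps negative). The specific values +1, 0 and -2 are example assumptions, and each gap column is penalised independently.

```python
def score_alignment(aligned_a: str, aligned_b: str, match: int = 1,
                    mismatch: int = 0, gap: int = -2, gap_char: str = "-") -> int:
    """Sum per-column scores over two aligned, equal-length sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap_char or y == gap_char:
            total += gap        # indel column
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


if __name__ == "__main__":
    # Six matches and one gap column: 6 * 1 + (-2).
    print(score_alignment("GATTACA", "GAT-ACA"))
```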

What do heavy gap penalties help the discovery of?

Biologically meaningful alignments

What do gaps in a query represent?

Insertion/deletion events, relative to ancestor

How are mismatches (substitutions) in an alignment scored?

Equivalent | G>T = G>C = -1

Algorithms can introduce indels freely to maximise the score, but what does biology indicate about indels?

That they should be rare

What are the pros and cons of heuristic algorithms?

They are not guaranteed to find the best alignment but are much faster (by 5-20x)

What is the main problem with DPAs?

They are too slow for large databases

What does sequence database searching involve?

Aligning query to all database sequences and ranking 'hits' by similarity/quality

What is the main advantage to using DPAs?

They are guaranteed to find the highest scoring alignment (though this may not reflect biological reality)

What does alignment allow?

Quantification of similarity between sequences

What does the statistical theory of alignment enable?

P values to be calculated

What is a p value?

The probability of observing an alignment scoring at least as high between two unrelated sequences of similar length and composition

What p value indicates a significant match?

p<0.05

Why are E values greater than 0.01 unlikely to reflect real matches?

Since as good a match would be expected to occur at least 1% of the time, even for unrelated sequences

E values are always calculated in the same way, true or false?

False

How are E values calculated for BLAST?

E = pX | Where X is the total length of all the sequences in the database (treats the database as one large sequence)

How are E values calculated for FASTA?

E = pN | Where N is the number of sequences
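
The two simplified formulas on these cards can be written out directly. The sketch below only reproduces the relationships E = pX and E = pN as stated here; the function names are hypothetical, and real BLAST and FASTA implementations use more detailed statistics.

```python
def e_value_blast_style(p: float, total_db_length: int) -> float:
    """E = pX, where X is the summed length of all database sequences."""
    return p * total_db_length


def e_value_fasta_style(p: float, num_sequences: int) -> float:
    """E = pN, where N is the number of sequences in the database."""
    return p * num_sequences


if __name__ == "__main__":
    p = 1e-9  # arbitrary example p value
    # The same p value yields different E values under the two definitions,
    # because X (total residues) is much larger than N (sequence count).
    print(e_value_blast_style(p, 2_000_000), e_value_fasta_style(p, 5_000))
```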

What is the default word size (W) for BLAST?

3 for proteins | 11 for DNA

BLAST searches only for word matches above what?

A threshold, T

What did early versions of BLAST only allow for?

Ungapped alignments (but T allows mismatches)

What happens to high-scoring segment pairs (HSPs) along the BLAST query?

They are reported and ordered by score

What do current versions of BLAST allow which improves its sensitivity?

Gapped alignments

In BLAST searches, all matches above T are extended until what?

The introduction of gaps causes the alignment score to fall quickly

Alignments are scored in order to quantify what?

Similarity

What are the most common heuristic algorithms?

FASTA and BLAST

Initial alignment hits are examined to see if they can be what?

Extended

Generally speaking, what do heuristic methods do?

Break the query into short 'words' and look for matches above a threshold level in the subject database
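
The word-based first step can be sketched as below. This is a simplification: it finds exact word matches only, whereas BLAST also accepts similar words scoring above the threshold T, and the later extension step is omitted. The function names are made up for this example.

```python
from collections import defaultdict


def split_into_words(seq: str, w: int):
    """Return every overlapping word of length w with its start position."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_exact_seeds(query: str, subject: str, w: int = 3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    # Index the subject once so each query word is a quick dictionary lookup.
    index = defaultdict(list)
    for pos, word in split_into_words(subject, w):
        index[word].append(pos)

    seeds = []
    for q_pos, word in split_into_words(query, w):
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


if __name__ == "__main__":
    print(find_exact_seeds("ACGTAC", "TTACGTT", w=3))
```

Each seed is a candidate starting point for extension; avoiding a full dynamic programming table for every database sequence is where the speed of heuristic methods comes from.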

What do DPAs find?

A computationally optimal solution to the alignment problem

What do heuristic methods assume?

That high scoring alignments contain short regions of exact matches

Sequence Similarity Searching Flashcards

(112 cards)