Sequence Similarity Searching Flashcards

1
Q

What is database structure determined by?

A

The requirements of designers/users

2
Q

Complete this statement: databases can be local or…?

A

Remote

3
Q

Complete this statement: querying can be manual or…?

A

Automated

4
Q

What must providers such as NCBI/EBI balance across users?

A

Demand on computation resources

5
Q

What does sequence similarity in DNA/proteins suggest?

A

Common ancestry

6
Q

What might common ancestry imply?

A

Common function

7
Q

What is the name given to homologs separated by a speciation event?

A

Orthologs

8
Q

What is the name given to homologs separated by a duplication event?

A

Paralogs

9
Q

Paralogs and orthologs are two types of homologous sequence, true or false?

A

True

10
Q

What does the alignment or equivalencing of bases enable?

A

Maximisation of similarity

11
Q

What could a database query look like?

A

Could simply be a sequence (DNA/protein)

Could be a logical structure, e.g. human + mitochondrial + HVS2

12
Q

Why do sequence databases require specialised search tools?

A

Due to size and similarity

13
Q

Is quantification of biological similarity easy or difficult?

A

Can be difficult

14
Q

What can searching sequence databases for similar sequences predict about novel sequences?

A

Possible functions

15
Q

What can alignments of sequences contain?

A

Mismatches and gaps

16
Q

How are mismatches and gaps interpreted in sequence alignments?

A

As substitutions and indels respectively

17
Q

What do alignment algorithms ideally try to identify about sequences?

A

The most likely evolutionary ‘path’ between sequences

18
Q

What are databases?

A

Searchable collections of information

19
Q

What does how we search databases depend on?

A

Database access, design and location

20
Q

What does the quantification of sequence similarity require?

A

Alignment

21
Q

What is the constant gap penalty?

A

A gap of any size attracts the same constant negative score (a), regardless of its length
= -a

22
Q

What is the proportional gap penalty?

A

Opening a gap attracts a penalty proportional to its length (L)
= -(aL)

23
Q

What is the affine gap penalty?

A

Opening a gap attracts a constant (a), extending it attracts a penalty (b) proportional to the gap’s length (L)
= -(a+bL) where a ≫ b
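The three gap penalty schemes on the preceding cards can be compared side by side. The following is a minimal sketch that uses the formulas exactly as the cards state them; the function names and the example values of a and b are illustrative assumptions, not settings from any particular program.

```python
def constant_gap_penalty(length: int, a: float) -> float:
    """Constant scheme: any gap costs -a, whatever its length."""
    return -a if length > 0 else 0.0


def proportional_gap_penalty(length: int, a: float) -> float:
    """Proportional scheme: the cost grows with gap length L, -(a*L)."""
    return -(a * length)


def affine_gap_penalty(length: int, a: float, b: float) -> float:
    """Affine scheme as written on the card: -(a + b*L), with a much larger than b."""
    return -(a + b * length) if length > 0 else 0.0


if __name__ == "__main__":
    # Assumed example values: a=10 for opening, b=1 for extension.
    for gap_length in (1, 5, 10):
        print(
            gap_length,
            constant_gap_penalty(gap_length, a=10),
            proportional_gap_penalty(gap_length, a=10),
            affine_gap_penalty(gap_length, a=10, b=1),
        )
```

With a much larger than b, the affine scheme makes opening a new gap expensive but lengthening an existing one cheap, so one long indel is preferred over many scattered short ones.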

24
Q

What type of gap penalty is generally the most relevant biologically?

A

Affine

25
Q

What does the choice of gap penalty depend on?

A

Software

26
Q

What do amino acid side chains share?

A

Chemical properties (acidic/basic etc.)

27
Q

What is the accepted theory about amino acid substitutions?

A

Chemically similar amino acids substitute more readily than chemically dissimilar amino acids

28
Q

What is ‘built into’ amino acid substitution matrices?

A

Physico-chemical classification of amino acids

29
Q

In the PAM250 (accepted point mutation) substitution matrix, what do similar amino acids score?

A

+ve score

30
Q

In the PAM250 (accepted point mutation) substitution matrix, what do dissimilar amino acids score?

A

-ve score

31
Q

What is the PAM250 (accepted point mutation) substitution matrix based on alignments of?

A

Closely related proteins

32
Q

What are accepted point mutation substitution matrices extrapolated to?

A

Large (PAM120) and very large (PAM250) evolutionary distances

33
Q

What is BLOSUM62 (blocks substitution matrix) based on alignments of?

A

Gap free alignments of short protein motifs (blocks)

34
Q

What do the numbers represent in BLOSUM62 (blocks substitution matrix)?

A

The level of sequence identity at which sequences in the block alignments were clustered (BLOSUM62 = 62%)

35
Q

BLOSUM is continually updated, true or false?

A

False, it is no longer updated but is still widely used

36
Q

BLOSUM62 (blocks substitution matrix) has no extrapolation. What does this mean for distant relationships?

A

More reliable

37
Q

What is the default amino acid substitution matrix of BLAST?

A

BLOSUM62 (blocks substitution matrix)

38
Q

What are the types of BLAST searches and their uses?

A

Nucleotide query versus nucleotide database, i.e. what gene is this?
Protein query versus protein database, i.e. what protein is this?
Translated nucleotide query versus protein database, i.e. does this DNA sequence code for a known protein?
Protein query versus translated nucleotide database, i.e. can we identify a DNA sequence that might encode this protein?
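The four search types above correspond to the standard BLAST program names blastn, blastp, blastx and tblastn. As a rough sketch, the choice can be written as a lookup on query type and database type; the function name and the string labels "nucleotide" and "protein" are illustrative assumptions.

```python
# Maps (query type, database type) to the BLAST program for that search.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",   # what gene is this?
    ("protein", "protein"): "blastp",         # what protein is this?
    ("nucleotide", "protein"): "blastx",      # translated query vs protein database
    ("protein", "nucleotide"): "tblastn",     # protein query vs translated database
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program name for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} vs {database_type!r}"
        ) from None


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```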

39
Q

What are the 4 sections in the results page of a BLAST search?

A

Search information (including RID)
Graphical summary (conserved domain search)
Results table (hyperlinked to alignments)
Alignments (download links)

40
Q

Databases contain more information than can be searched practically by observation, true or false?

A

True

41
Q

Most databases are relational. What does this mean?

A

The data are organised into tables with defined inter-relationships

42
Q

Does the manual querying of remote databases require specialist knowledge?

A

Little

43
Q

What might automated querying of local databases enable?

A

Greater throughput and flexibility

44
Q

Cheaper hardware for databases can increase locally available resources but what does it also make quite costly?

A

Administration

45
Q

What do highly similar (>80%) sequences probably share?

A

A common ancestor, and thus they are probably homologous

46
Q

Why is it necessary to quantify similarity of sequences?

A

To distinguish between ‘chance’ and ‘real’ similarity (common ancestry)

47
Q

What are alignments generally interpreted in terms of?

A

An explicit model of molecular evolution (substitutions and indels)

48
Q

Sequences are either homologous or not, true or false?

A

True, i.e. they either share a common ancestor or they do not

49
Q

Any two sequences can show similarity simply by ‘chance’, true or false?

A

True

50
Q

Is the alignment of pairs of long sequences computationally easy or intensive?

A

Intensive due to a large number of equivalences

51
Q

What do dynamic programming algorithms (DPAs) theoretically allow?

A

Exhaustive identification of optimal alignments

52
Q

Are DPA methods slow or fast for searching large databases?

A

Often considered too slow

53
Q

What can DPA methods find?

A

Optimal global or local alignments

54
Q

Give an example of a global algorithm

A

Needleman-Wunsch

55
Q

What do global algorithms try to find?

A

Similarity over the whole sequence
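To make the idea of a global dynamic programming alignment concrete, here is a minimal sketch of Needleman-Wunsch scoring. It returns only the optimal score, not the alignment itself, and it assumes a simple scheme (match +1, mismatch -1, proportional gap cost of -2 per position) chosen for illustration rather than taken from the cards.

```python
def needleman_wunsch_score(seq_a: str, seq_b: str,
                           match: int = 1, mismatch: int = -1,
                           gap: int = -2) -> int:
    """Optimal global alignment score of seq_a against seq_b."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Global alignment: leading gaps are penalised, so the first row and
    # column accumulate gap costs.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # The score of aligning both sequences end to end.
    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATACA"))
```

The table has one cell per pair of positions, which is why the cost grows with the product of the two sequence lengths.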

56
Q

Give an example of a local algorithm

A

Smith-Waterman

57
Q

What do local alignments try to find?

A

Just local regions

They do not try to align the whole sequence
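A local alignment differs from a global one in two small ways: a cell score is never allowed to fall below zero, and the result is the best cell anywhere in the table. The sketch below shows Smith-Waterman scoring under an assumed scheme (match +2, mismatch -1, gap -2 per position); the values are illustrative only.

```python
def smith_waterman_score(seq_a: str, seq_b: str,
                         match: int = 2, mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between any regions of seq_a and seq_b."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # First row and column stay at zero: a local alignment may start anywhere.
    score = [[0] * cols for _ in range(rows)]
    best = 0

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # restart: drop a poor prefix
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            best = max(best, score[i][j])

    return best


if __name__ == "__main__":
    # Only the shared "ACGT" region contributes to the score.
    print(smith_waterman_score("CCCCACGTCCCC", "GGGGACGTGGGG"))
```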

58
Q

Which is the most biologically relevant, global or local alignment?

A

Local alignment

59
Q

Different algorithms will give similar alignments of the same sequences, true or false?

A

False, different algorithms can give very different alignments of the same sequences

60
Q

What are the pros and cons of exhaustive algorithms to find optimal alignments?

A

Accurate but slow

61
Q

Genes with shared common ancestors are homologous and so may be what?

A

Very similar in sequence

62
Q

Do exhaustive mathematical algorithms find the alignment with the maximum or minimum similarity?

A

Maximum

63
Q

What does distinguishing between ‘real’ and ‘chance’ alignments require?

A

The comparison of some measure of similarity

64
Q

What do scoring procedures need to take into account?

A

Biology

65
Q

How can we measure similarity of sequences?

A

By scoring the alignment

66
Q

What do E-values enable?

A

Comparison between searches; they also provide an objective threshold for significance

67
Q

What can protein sequences identify?

A

More distant evolutionary relationships (depending on scoring matrix)

68
Q

What can DNA searches find?

A

Evolutionary close relationships

69
Q

BLAST is fast, but what might it miss?

A

Optimal alignments

70
Q

DNA and protein databases can only be searched effectively using what algorithms?

A

Heuristic algorithms (speed)

71
Q

How can we compare results when using different methods for alignment?

A

Using p values and E values

72
Q

What does E value stand for?

A

The expect value

73
Q

What do E values indicate?

A

How often a match at a given p value would be expected to occur in the database, by chance (i.e. when the sequences are unrelated)

74
Q

Should ‘real’ alignments between sequences that do share a common ancestor show more or less similarity than ‘chance’ alignments of sequences that do not?

A

Greater similarity

75
Q

What are the pros and cons of maximisation scores?

A

They ensure the best alignment but are not comparable

76
Q

Are default scores and comparing alignments good to use?

A

Generally good, but experimentation is possible

77
Q

What does the percentage of identically aligned residues allow?

A

Comparison of different alignments

78
Q

How is the percentage of identically aligned residues calculated?

A

The number of matches divided by the length of the alignment, multiplied by 100
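As a worked example of this calculation, the sketch below computes percentage identity for two already-aligned sequences. It assumes gaps are written as "-" and that the alignment length (including gap columns) is the denominator; other conventions for the denominator exist.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap_char: str = "-") -> float:
    """Percentage of alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")

    # A column counts as a match only if the residues agree and neither is a gap.
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap_char
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # 6 matching columns out of 7.
    print(round(percent_identity("GATTACA", "GAT-ACA"), 1))
```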

79
Q

When scoring alignments, what is the assumption about nucleotides (G, A, T, C)?

A

They substitute equally for one another

80
Q

What type of selection are protein sequences under for structure and function?

A

Stabilising selection

81
Q

Are changes between chemically similar amino acids more or less likely to be deleterious?

A

Less likely

82
Q

When scoring protein alignments, particular substitutions can have different scores. What does this depend upon?

A

Their chemical similarity, e.g. LEU to ILE (chemically similar) versus PRO to TRP (chemically dissimilar)
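The Leu/Ile and Pro/Trp examples can be looked up in a substitution matrix. The sketch below hard-codes just the handful of BLOSUM62 entries needed for those two pairs (single-letter codes L, I, P, W); the dictionary layout and function name are illustrative assumptions, and a real search would load the full 20x20 matrix.

```python
# A small excerpt of BLOSUM62, keyed by unordered residue pair.
BLOSUM62_SUBSET = {
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("I", "L"): 2,    # chemically similar: positive score
    ("P", "P"): 7,
    ("W", "W"): 11,
    ("P", "W"): -4,   # chemically dissimilar: negative score
}


def substitution_score(res_a: str, res_b: str, matrix=BLOSUM62_SUBSET) -> int:
    """Look up a symmetric substitution score for two residues."""
    key = tuple(sorted((res_a, res_b)))
    if key not in matrix:
        raise KeyError(f"pair {res_a}/{res_b} is not in this excerpt")
    return matrix[key]


if __name__ == "__main__":
    print(substitution_score("L", "I"), substitution_score("P", "W"))
```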

83
Q

Almost any 2 sequences can be aligned by using what?

A

Gaps (indels)

84
Q

How is the quality of an alignment assessed?

A

Using a scoring matrix

Matches are +ve, mismatches are 0, gaps are -ve

85
Q

What do heavy gap penalties help the discovery of?

A

Biologically meaningful alignments

86
Q

What do gaps in a query represent?

A

Insertion/deletion events, relative to ancestor

87
Q

How are mismatches (substitutions) in an alignment scored?

A

All substitutions are scored equally

G>T = G>C = -1
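Putting the scoring rules together, an existing nucleotide alignment can be scored column by column. This is a minimal sketch: the mismatch score of -1 follows the card above, while the match score of +1 and the per-position gap score of -2 are assumed values for illustration.

```python
def score_alignment(aligned_a: str, aligned_b: str,
                    match: int = 1, mismatch: int = -1,
                    gap: int = -2, gap_char: str = "-") -> int:
    """Sum column scores for two aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap_char or y == gap_char:
            total += gap        # indel
        elif x == y:
            total += match      # identical bases
        else:
            total += mismatch   # every substitution scores the same
    return total


if __name__ == "__main__":
    print(score_alignment("GATTACA", "GAT-ACA"))
```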

88
Q

Algorithms maximise the score of indels but what does biology indicate about indels?

A

That they should be rare

89
Q

What are the pros and cons of heuristic algorithms?

A

They are not guaranteed to find the best alignment but are much faster (by 5-20x)

90
Q

What is the main problem with DPAs?

A

They are too slow for large databases

91
Q

What does sequence database searching involve?

A

Aligning query to all database sequences and ranking ‘hits’ by similarity/quality

92
Q

What is the main advantage to using DPAs?

A

They are guaranteed to find the highest scoring alignment (though this may not reflect biological reality)

93
Q

What does alignment allow?

A

Quantification of similarity between sequences

94
Q

What does the statistical theory of alignment enable?

A

P values to be calculated

95
Q

What is a p value?

A

The probability of observing an alignment scoring at least as high between two unrelated sequences of similar length and composition

96
Q

What p value indicates a significant match?

A

p<0.05

97
Q

Why are E values greater than 0.01 unlikely to reflect real matches?

A

Since as good a match would be expected to occur at least 1% of the time, even for unrelated sequences

98
Q

E values are always calculated in the same way, true or false?

A

False

99
Q

How are E values calculated for BLAST?

A

E = pX

Where X is the total length of all the sequences in the database (treats the database as one large sequence)

100
Q

How are E values calculated for FASTA?

A

E = pN

Where N is the number of sequences
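The two formulas on these cards differ only in what p is multiplied by. The sketch below applies both, as simplified on the cards, to a toy database described by a list of sequence lengths; the function names and the example numbers are assumptions for illustration.

```python
def blast_style_e_value(p: float, total_db_length: int) -> float:
    """E = pX, where X is the summed length of all database sequences."""
    return p * total_db_length


def fasta_style_e_value(p: float, num_sequences: int) -> float:
    """E = pN, where N is the number of sequences in the database."""
    return p * num_sequences


if __name__ == "__main__":
    sequence_lengths = [300, 500, 200]   # toy database of three sequences
    p = 0.001
    print(blast_style_e_value(p, sum(sequence_lengths)))
    print(fasta_style_e_value(p, len(sequence_lengths)))
```

Because the multipliers differ, the same p value gives different E values under the two conventions, which is why E values from different programs are not calculated in the same way.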

101
Q

What is the default word size (W) for BLAST?

A

3 for proteins

11 for DNA

102
Q

BLAST searches only for word matches above what?

A

A threshold, T

103
Q

What did early versions of BLAST only allow for?

A

Ungapped alignments (but T allows mismatches)

104
Q

What happens to high-scoring segment pairs (HSPs) along the BLAST query?

A

They are reported and ordered by score

105
Q

What do current versions of BLAST allow which improves its sensitivity?

A

Gapped alignments

106
Q

In BLAST searches, all matches above T are extended until what?

A

The introduction of gaps causes the alignment score to fall quickly

107
Q

Alignments are scored in order to quantify what?

A

Similarity

108
Q

What are the most common heuristic algorithms?

A

FASTA and BLAST

109
Q

Initial alignment hits are examined to see if they can be what?

A

Extended

110
Q

Generally speaking, what do heuristic methods do?

A

Break the query into short ‘words’ and look for matches above a threshold level in the subject database
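The word-based strategy can be sketched as follows. This is a deliberately simplified version that looks only for exact word matches; BLAST itself also accepts similar words scoring above the threshold T, which is not modelled here. The function names are illustrative assumptions.

```python
def split_into_words(seq: str, word_size: int) -> list:
    """Return every overlapping word of length word_size with its start position."""
    return [
        (i, seq[i:i + word_size]) for i in range(len(seq) - word_size + 1)
    ]


def find_seed_hits(query: str, subject: str, word_size: int) -> list:
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly."""
    # Index the subject once so each query word is a dictionary lookup.
    index = {}
    for pos, word in split_into_words(subject, word_size):
        index.setdefault(word, []).append(pos)

    hits = []
    for query_pos, word in split_into_words(query, word_size):
        for subject_pos in index.get(word, []):
            hits.append((query_pos, subject_pos))
    return hits


if __name__ == "__main__":
    print(find_seed_hits("ACGT", "TTACGTT", word_size=3))
```

Each hit is only a starting point; a heuristic search would then try to extend it into a longer alignment.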

111
Q

What do DPAs find?

A

A computationally optimal solution to the alignment problem

112
Q

What do heuristic methods assume?

A

That high scoring alignments contain short regions of exact matches