Bioinformatics Flashcards

1
Q

Define bioinformatics

A

the application of computers to problems in biology

3
Q

What is the aim of bioinformatics?

A

to enable the understanding and modulation of protein function, based on known protein structure and function

4
Q

How often does DNA data double?

A

every ~18 months

5
Q

How often does structure data double?

A

every ~6 years

6
Q

How big is the human genome?

A

3.2 Gbp

7
Q

What percentage of the human genome is coding?

A
8
Q

What percentage of the human genome is repeated sequences?

A

> 50%

9
Q

What percentage of genes have alternative splicing?

A

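~35% (an early estimate; later transcriptome studies put the figure above 90% for multi-exon human genes)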

10
Q

What is a database?

A

a structured collection of data with some tool enabling it to be ‘queried’

11
Q

What is a databank?

A

a collection of data (normally in simple text files) without an associated query tool

12
Q

What types of databank are there?

A

primary, secondary and meta-databanks

13
Q

What is a primary databank?

A
  • contains sequence data only (DNA or protein)
  • may also have ‘feature’ information (splice sites, signal sequences, disulphides, active sites, etc.)
  • DNA databanks may also contain translations (known or predicted)
14
Q

What is a meta-databank?

A

collections of links between databanks and databases

15
Q

Give some examples of primary databanks

A

GenBank, EMBL, DDBJ, UniProtKB/SwissProt, PIR, PDB, Enzyme

16
Q

What is the implication of imperfect gene-prediction methods?

A

a protein identified from genome data is hypothetical until verified by experiment

17
Q

What information is found in PDB?

A

structural data

18
Q

What information is found in Enzyme?

A

enzyme classifications (EC numbers)

19
Q

What is a secondary databank?

A
  • these contain derived information
  • patterns that characterise a protein family
  • detailed annotation
24
Q

What is the meaning of some characters and symbols used in PROSITE?

A
  • the standard IUPAC one letter code for the amino acids is used
  • the symbol ‘x’ is used for a position where any amino acid is accepted
  • [ALT] stands for Ala or Leu or Thr
  • {AM} stands for any amino acid except Ala and Met
  • each element in a pattern is separated from its neighbour by ‘-’
  • x(3) corresponds to x-x-x
  • x(2,4) corresponds to x-x or x-x-x or x-x-x-x
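
The notation rules above map almost one-to-one onto regular expressions. The following is a minimal sketch, not PROSITE's own tooling: the function name `prosite_to_regex` is hypothetical, and only the elements listed on this card are handled (anchors and other extensions of the real syntax are ignored).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except those listed
        # [ALT] and single letters already mean the same thing in a regex.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

print(prosite_to_regex("[ALT]-{AM}-x(3)"))  # [ALT][^AM].{3}
print(prosite_to_regex("x(2,4)"))           # .{2,4}
```

The resulting string can be passed to `re.search` to test a protein sequence written in the one-letter code.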
25
What is a dotplot?
a graphical method that allows the comparison of two biological sequences and identification of regions of close similarity between them
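
A text-only sketch of the idea, assuming the simplest rule (a mark wherever two residues are identical). The function name `dotplot` and the example sequences are made up for illustration; real tools usually compare windows of residues rather than single positions.

```python
def dotplot(seq_a: str, seq_b: str) -> list[str]:
    """One text row per residue of seq_a; '*' marks an identical residue in seq_b."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]

rows = dotplot("GATTACA", "GATCACA")
print("\n".join(rows))
```

Regions of close similarity show up as runs of `*` along a diagonal; the single substitution (T versus C) breaks the main diagonal at the fourth position.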
26
What is a similarity measure?
a score for how alike two residues (or two sequences) are; the scores are held in similarity matrices, which are used to align sequences of nucleic acids or amino acids
27
What is scored for a match; and what for a mismatch?
1 for a match; 0 for a mismatch
28
What is an example of a more complex scoring system?
a more complicated matrix would give a higher score to transitions (pyrimidine to pyrimidine or purine to purine) than to transversions (pyrimidine to purine or vice versa); the match/mismatch ratio of the matrix sets the target evolutionary distance
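
A possible sketch of such a scoring function for DNA. The numeric scores (2, 1, 0) are illustrative assumptions, not values from the course; the only property taken from the card is that a transition scores higher than a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_score(a: str, b: str, match: int = 2,
                       transition: int = 1, transversion: int = 0) -> int:
    """Score a pair of nucleotides; the default values are illustrative only."""
    if a == b:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion

print(substitution_score("A", "G"))  # transition
print(substitution_score("A", "C"))  # transversion
```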
29
What is the Needleman-Wunsch algorithm?
an algorithm used to align protein or nucleotide sequences; one of the first applications of dynamic programming to compare biological sequences
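
A compact sketch of the dynamic-programming idea, assuming a score of 1 for a match, 0 for a mismatch, and a linear gap penalty of -1 (the gap value is an assumption for illustration). It fills the score table, then traces back one optimal global alignment.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(row_a)
print(row_b)
```

Because every cell depends only on its three neighbours, the table is filled in time proportional to the product of the two sequence lengths.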
30
When were similarity measures first developed?
by Dayhoff in 1978, with the PAM matrices
31
When were similarity measures improved?
by Henikoff and Henikoff in 1992, through the BLOSUM matrices
32
When was dynamic programming introduced for sequence alignment?
introduced by Needleman and Wunsch (global alignment) in 1970 and adapted to local alignment by Smith and Waterman in 1981
37
Give some examples of secondary databanks
PROSITE, PRINTS, BLOCKS, InterPro
38
What is an algorithm?
A complete and precise set of steps that will solve a problem and achieve an identical result whenever given the same set of data to a defined level of accuracy.
39
What is PROSITE?
PROSITE is a protein database consisting of entries describing the protein families, domains and functional sites, as well as amino acid patterns and profiles in them.
40
What is the PROSITE pattern for a protein kinase C phosphorylation site?
[ST]-x-[RK]
41
What is the PROSITE pattern for N-linked glycosylation?
N-{P}-[ST]-{P}
42
What is the PROSITE pattern for the Kringle domain?
[FY]-C-[RH]-[NS]-x(7,8)-[WY]-C
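
As a rough illustration, the three patterns above can be written by hand as regular expressions and scanned against a sequence. The helper `find_motifs` and the test sequence are invented for this sketch. A lookahead is used so that overlapping sites are all reported, and positions are 1-based.

```python
import re

# Hand-translated from the PROSITE patterns on the cards above.
MOTIFS = {
    "PKC phosphorylation site": r"[ST].[RK]",
    "N-linked glycosylation": r"N[^P][ST][^P]",
    "Kringle domain": r"[FY]C[RH][NS].{7,8}[WY]C",
}

def find_motifs(sequence: str) -> dict[str, list[tuple[int, str]]]:
    """Return, per motif, a list of (1-based start position, matched residues)."""
    hits = {}
    for name, regex in MOTIFS.items():
        # The zero-width lookahead lets matches overlap each other.
        hits[name] = [(m.start() + 1, m.group(1))
                      for m in re.finditer(f"(?=({regex}))", sequence)]
    return hits

for name, found in find_motifs("MKNGSAATPRLL").items():
    print(name, found)
```

Short patterns like these match by chance very often, so a hit is only a hint that the site may be present.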
45
Define annotation
a subfield in the general field of genome analysis, which includes anything that can be done with genome sequences by computational means
46
Why might gene-prediction methods be imperfect?
  • a coding region may be missed
  • an incomplete protein may be reported
  • splicing may be predicted incorrectly
  • coding regions may overlap
  • exon assembly (splicing) may be different in different tissues
  • some apparent coding sequences may be defective or not expressed
48
What is meant by heuristics?
approximate fast methods
49
What do heuristic search methods entail?
  • index the database by finding the locations of short 'words'
  • take 'words' from the probe sequence and look them up in the index
  • look for multiple matches and extend them to find likely hits for full alignment
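
The first two steps, and the start of the third, can be sketched as follows. This is a toy illustration, not how any particular search tool is implemented: the names `build_index` and `seed_hits`, the word length, and the sequences are assumptions. Word matches that share a diagonal (database position minus probe position) are counted together; the extension to a full alignment is left out.

```python
from collections import defaultdict

def build_index(database: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Map every word of length k to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def seed_hits(probe: str, index: dict[str, list[tuple[str, int]]], k: int) -> dict[tuple[str, int], int]:
    """Count word matches per (sequence name, diagonal)."""
    diagonals = defaultdict(int)
    for q in range(len(probe) - k + 1):
        for name, pos in index.get(probe[q:q + k], []):
            diagonals[(name, pos - q)] += 1
    return dict(diagonals)

database = {"seq1": "ACGTACGTGA", "seq2": "TTTTTTTT"}
index = build_index(database, k=3)
print(seed_hits("TACGT", index, k=3))
```

A diagonal with several word matches is a likely hit and would be the candidate passed on to the slower, exact alignment step.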
50
How is DNA sequenced?
by the Sanger method: di-deoxy chain termination
51
How does this apply to sequencing entire genomes?
each segment
56
What is fragment assembly?
aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence
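
A toy sketch of one way to do this, assuming error-free fragments from a single strand and a greedy strategy (repeatedly merge the pair with the longest suffix/prefix overlap). The function names and the fragments are invented for illustration; real assemblers must also cope with sequencing errors, repeats, and reverse complements.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(fragments: list[str]) -> str:
    """Merge a non-empty list of fragments, best-overlapping pair first."""
    frags = list(fragments)
    while len(frags) > 1:
        # Pick the ordered pair (i, j) with the largest overlap.
        size, i, j = max((overlap(a, b), i, j)
                         for i, a in enumerate(frags)
                         for j, b in enumerate(frags) if i != j)
        merged = frags[i] + frags[j][size:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags[0]

reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ATTAGACCTGCCGGAATAC
```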
58
What are two approaches to finding genes in eukaryotic sequences?
  1. detect similarity with known coding regions
  2. ab initio methods: make predictions based on typical features
59
What are ESTs?
expressed sequence tags; short subsequences of a cDNA sequence, used to identify gene transcripts and instrumental in gene discovery and in gene-sequence determination
60
What are some typical features used in ab initio methods?
  • initial 5' exon (transcription start point with upstream promoter; ends immediately before a GT splice signal)
  • internal exons (begin after an AG splice signal; end before a GT splice signal)
  • final 3' exon (begins after an AG splice signal; ends with a stop codon and poly-A tail)
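
As a deliberately naive illustration of the internal-exon rule, the sketch below lists every stretch that starts right after an AG and ends right before a GT. The function name, the minimum length, and the example sequence are assumptions; real ab initio gene finders weigh these signals statistically together with codon usage and other features.

```python
import re

def candidate_internal_exons(dna: str, min_len: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) slice bounds of stretches that follow an AG and precede a GT."""
    # An exon candidate starts just after an AG acceptor signal...
    starts = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    # ...and ends just before a GT donor signal.
    ends = [m.start() for m in re.finditer("(?=GT)", dna)]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]

dna = "CCAGATGCCGTAA"
for start, end in candidate_internal_exons(dna):
    print(start, end, dna[start:end])
```

Most candidates produced this way are false positives, which is one reason predicted genes stay hypothetical until verified.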
61
How do computers deal with this information?
machine learning methods; a general class of computer software which learns from examples and is then able to make predictions
62
What are some examples of these methods?
  • artificial neural networks
  • support vector machines
  • decision trees
  • naive Bayesian classifiers
63
What are artificial neural networks (ANNs)?
  • a family of models inspired by biological neural networks
  • used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown
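
The smallest possible example is a single artificial neuron (a perceptron) that learns from labelled examples. This is only a sketch of the learning principle, with made-up training data (the logical AND of two inputs), and the names `train_perceptron` and `predict` are hypothetical; practical networks have many neurons in several layers.

```python
def predict(weights: list[float], bias: float, inputs: tuple[int, ...]) -> int:
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def train_perceptron(samples, epochs: int = 20, rate: float = 1.0):
    """Learn weights from (inputs, target) examples with the perceptron rule."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(examples)
print([predict(weights, bias, inputs) for inputs, _ in examples])  # [0, 0, 0, 1]
```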
65
Explain quality
  • the quality of raw data is only as good as the methods that produce it
  • the quality of annotations is only as good as the curators