Bioinformatics Flashcards

1
Q

Define bioinformatics

A

the application of computers to problems in biology

3
Q

What is the aim of bioinformatics?

A

to understand and modulate protein function, building on known protein structures and functions

4
Q

How often does DNA data double?

A

every ~18 months

5
Q

How often does structure data double?

A

every ~6 years
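
To see what the two doubling times imply, here is a minimal sketch that assumes steady exponential growth (an idealisation, not something the cards claim). `growth_factor` is a hypothetical helper name; the doubling times are the ~18 months (DNA data) and ~6 years (structure data) quoted on these cards.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Factor by which a quantity grows, assuming it doubles at a steady rate."""
    return 2 ** (years / doubling_time_years)

# Over the same six-year window:
dna = growth_factor(6, 1.5)        # four doublings of DNA data
structure = growth_factor(6, 6.0)  # one doubling of structure data
print(dna, structure)
```

Under that assumption, sequence data grows eight times faster than structure data over six years, which is one way to read the gap between known sequences and known structures.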

6
Q

How big is the human genome?

A

3.2 Gbp

7
Q

What percentage of the human genome is coding?

A
8
Q

What percentage of the human genome is repeated sequences?

A

> 50%

9
Q

What percentage of genes have alternative splicing?

A

~35%

10
Q

What is a database?

A

a structured collection of data with some tool enabling it to be ‘queried’

11
Q

What is a databank?

A

a collection of data (normally in simple text files) without an associated query tool

12
Q

What types of databank are there?

A

primary, secondary and meta-databanks

13
Q

What is a primary databank?

A
  • simply contains sequence data (DNA or protein)
  • may also have ‘feature’ information (splice sites, signal sequences, disulphides, active sites, etc.)
  • DNA databanks may also contain translations (known or predicted)
14
Q

What is a meta-databank?

A

collections of links between databanks and databases

15
Q

Give some examples of primary databanks

A

GenBank, EMBL, DDBJ, UniProtKB/SwissProt, PIR, PDB, Enzyme

16
Q

What is the implication of imperfect gene-prediction methods?

A

a protein identified from genome data is hypothetical until verified by experiment

17
Q

What information is found in PDB?

A

structural data

18
Q

What information is found in Enzyme?

A

enzyme classifications (EC numbers)

19
Q

What is a secondary databank?

A
  • these contain derived information
  • patterns that characterise a protein family
  • detailed annotation
24
Q

What is the meaning of some characters and symbols used in PROSITE?

A
  • the standard IUPAC one letter code for the amino acids is used
  • the symbol ‘x’ is used for a position where any amino acid is accepted
  • [ALT] stands for Ala or Leu or Thr
  • {AM} stands for any amino acid except Ala and Met
  • each element in a pattern is separated from its neighbour by ‘-’
  • x(3) corresponds to x-x-x
  • x(2,4) corresponds to x-x or x-x-x or x-x-x-x
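
The rules above map almost directly onto regular expressions. The following is a minimal sketch, assuming only the elements listed on this card (it does not handle other PROSITE syntax such as anchors); `prosite_to_regex` is a hypothetical helper name, not part of any PROSITE tool.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate the PROSITE elements listed above into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        # Separate the element from an optional repeat such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {AM} -> anything except A or M
        # [ALT] and single letters are already valid regex as written.
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

print(prosite_to_regex("[ALT]-x(2,4)-{AM}"))
```

The returned string can be passed to `re.search` or `re.finditer` to scan a protein sequence written in one-letter code.
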
25
Q

What is a dotplot?

A

a graphical method that allows the comparison of two biological sequences and identification of regions of close similarity between them
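
A text-only sketch of the idea, assuming the simplest rule (mark a cell when the two residues are identical, with no window or threshold filtering). `dotplot` is a hypothetical helper name.

```python
def dotplot(seq_a: str, seq_b: str) -> list[str]:
    """One text row per residue of seq_a; '*' marks a residue identical to seq_b's."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]

# Similar regions show up as diagonal runs of '*'.
for row in dotplot("GATTACA", "GATCACA"):
    print(row)
```

Real dotplot programs usually score a sliding window rather than single residues, which suppresses the background of chance matches.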

26
Q

What is a similarity measure?

A

a score for how alike two residues are; similarity matrices hold these scores and are used to align sequences of nucleic acids or amino acids

27
Q

What is scored for a match; and what for a mismatch?

A

1 for a match; 0 for a mismatch

28
Q

What is an example of a more complex scoring system?

A

a more complicated matrix would give a higher score to transitions (pyrimidine to pyrimidine or purine to purine) than to transversions (pyrimidine to purine or vice versa); the match/mismatch ratio of the matrix sets the target evolutionary distance
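
The two schemes can be sketched side by side. The numeric values 2, 1 and 0 in the second function are illustrative assumptions only; the cards say transitions should outscore transversions but give no numbers. Both function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def identity_score(a: str, b: str) -> int:
    """Simplest scheme: 1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0

def transition_aware_score(a: str, b: str, match: int = 2,
                           transition: int = 1, transversion: int = 0) -> int:
    """Score a pair of bases, ranking transitions above transversions."""
    if a == b:
        return match
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition    # purine to purine, or pyrimidine to pyrimidine
    return transversion      # purine to pyrimidine, or vice versa

print(transition_aware_score("A", "G"), transition_aware_score("A", "C"))
```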

29
Q

What is the Needleman-Wunsch algorithm?

A

an algorithm used to align protein or nucleotide sequences; one of the first applications of dynamic programming to compare biological sequences
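
A compact sketch of the dynamic-programming idea follows. It uses the 1/0 match/mismatch scoring from these cards plus a linear gap penalty of -1, which is an assumption (the cards do not give a gap cost). The function name is hypothetical.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0,
                     gap: int = -1) -> tuple[int, str, str]:
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):              # first column: a aligned to gaps only
        score[i][0] = i * gap
    for j in range(1, cols):              # first row: b aligned to gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "ACT"))
```

Each cell is filled from its three neighbours exactly once, so the work grows with the product of the two sequence lengths.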

30
Q

When were similarity measures first carried out?

A

in 1978, by Dayhoff (PAM matrices)

31
Q

When were similarity measures improved?

A

in 1992, by Henikoff and Henikoff, using BLOSUM matrices

32
Q

When was dynamic programming first applied to sequence alignment?

A

introduced by Needleman and Wunsch (global alignment) in 1970 and adapted by Smith and Waterman (local alignment) in 1981

37
Q

Give some examples of secondary databanks

A

PROSITE, PRINTS, BLOCKS, InterPro

38
Q

What is an algorithm?

A

A complete and precise set of steps that will solve a problem and achieve an identical result whenever given the same set of data to a defined level of accuracy.

39
Q

What is PROSITE?

A

PROSITE is a protein database consisting of entries describing the protein families, domains and functional sites, as well as amino acid patterns and profiles in them.

40
Q

What is the PROSITE pattern for a protein kinase C phosphorylation site?

A

[ST]-x-[RK]

41
Q

What is the PROSITE pattern for N-linked glycosylation?

A

N-{P}-[ST]-{P}

42
Q

What is the PROSITE pattern for the Kringle domain?

A

[FY]-C-[RH]-[NS]-x(7,8)-[WY]-C

45
Q

Define annotation

A

a subfield in the general field of genome analysis, which includes anything that can be done with genome sequences by computational means

46
Q

Why might gene-prediction methods be imperfect?

A
  • a coding region may be missed
  • an incomplete protein may be reported
  • splicing may be predicted incorrectly
  • coding regions may overlap
  • exon assembly (splicing) may be different in different tissues
  • some apparent coding sequences may be defective or not expressed
48
Q

What is meant by heuristics?

A

approximate fast methods

49
Q

What does heuristics entail?

A
  • index the database by finding locations of short ‘words’
  • take ‘words’ from the probe sequence and look them up in the index
  • look for multiple matches and extend to find likely hits to full alignment
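
The first two steps above can be sketched as follows. This is a toy illustration, not the implementation of any particular search tool; the word size of 3 and both function names are assumptions, and the final extension step is left out.

```python
from collections import defaultdict

def build_index(database: dict[str, str], word_size: int = 3) -> dict:
    """Map each word of length word_size to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].append((name, i))
    return index

def count_word_hits(probe: str, index: dict, word_size: int = 3) -> dict[str, int]:
    """Look up every probe word in the index and count hits per database sequence."""
    hits = defaultdict(int)
    for i in range(len(probe) - word_size + 1):
        for name, _offset in index.get(probe[i:i + word_size], []):
            hits[name] += 1
    return dict(hits)

index = build_index({"s1": "ACGTAC", "s2": "TTTTTT"})
print(count_word_hits("CGTA", index))
```

Sequences with several word hits are the candidates worth extending into a full alignment; everything else is skipped, which is where the speed comes from.
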
50
Q

How is DNA sequenced?

A

by the Sanger method: dideoxy chain termination

51
Q

How does this apply to sequencing entire genomes?

A

each segment

56
Q

What is fragment assembly?

A

aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence
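
A toy sketch of the idea, using a greedy strategy that repeatedly merges the pair of fragments with the longest exact suffix/prefix overlap. Greedy merging is a simplification chosen for illustration; it ignores sequencing errors, repeats and reverse complements. The function names and the minimum overlap of 3 are assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain; returns the contigs."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break                           # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

print(greedy_assemble(["ATGGCG", "GCGTAC", "TACGGA"]))
```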

58
Q

What are two approaches to finding genes in eukaryotic sequences?

A
  1. detect similarity with known coding regions
  2. ab initio methods; make predictions based on typical features

59
Q

What are ESTs?

A

expressed sequence tags; short subsequences of a cDNA sequence, used to identify gene transcripts and instrumental in gene discovery and in gene-sequence determination

60
Q

What are some typical features used in ab initio methods?

A
  • initial 5' exon (begins at the transcription start point, with an upstream promoter; ends immediately before a GT splice signal)
  • internal exons (begin after an AG splice signal; end immediately before a GT splice signal)
  • final 3' exon (begins after an AG splice signal; ends with a stop codon and poly-A tail)
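
The internal-exon rule can be sketched as a naive scan for AG and GT dinucleotides. This only enumerates candidates; it is an illustration under simplifying assumptions, not a gene finder, since real methods also score reading frames, signal strength and length. The function name and the `min_len` parameter are hypothetical.

```python
def candidate_internal_exons(dna: str, min_len: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) slice bounds that begin right after an AG and end right before a GT."""
    dna = dna.upper()
    # Position just past each AG (possible exon start).
    starts = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    # Position of each GT (possible exon end, exclusive).
    ends = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]

seq = "CCAGATGCCGTAA"
for start, end in candidate_internal_exons(seq):
    print(start, end, seq[start:end])
```

Because AG and GT are so common, the scan over-predicts heavily, which is why such signals are combined with other evidence.
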
61
Q

How do computers deal with this information?

A

machine learning methods; a general class of computer software which learns from examples and is then able to make predictions

62
Q

What are some examples of these methods?

A
  • artificial neural networks
  • support vector machines
  • decision trees
  • naive Bayesian classifiers
63
Q

What are artificial neural networks (ANNs)?

A
  • family of models inspired by biological neural networks
  • used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown

65
Q

Explain quality

A

the quality of raw data is as good as the methods that produce it

the quality of annotations is as good as the curators
