Bioinformatics Flashcards
Define bioinformatics
the application of computers to problems in biology
What is the aim of bioinformatics?
Based on known protein structure and function, to enable understanding and modulation of protein function
How often does DNA data double?
every ~18 months
How often does structure data double?
every ~6 years
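The two doubling times above can be compared with a little arithmetic. This is a minimal sketch that assumes clean exponential growth; the function name and the six-year horizon are illustrative choices, not part of the original cards.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Factor by which a quantity grows after `years`,
    assuming it doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)


# Over six years: DNA data doubles every ~18 months (1.5 years),
# structure data doubles every ~6 years.
dna_growth = growth_factor(6, 1.5)        # four doublings
structure_growth = growth_factor(6, 6)    # one doubling
print(dna_growth, structure_growth)
```

Under these assumptions, sequence data grows about 16-fold in the time structure data takes to double, which is why the gap between known sequences and known structures keeps widening.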
How big is the human genome?
3.2 Gbp
What percentage of the human genome is coding?
What percentage of the human genome is repeated sequences?
> 50%
What percentage of genes have alternative splicing?
~35%
What is a database?
a structured collection of data with some tool enabling it to be ‘queried’
What is a databank?
a collection of data (normally in simple text files) without an associated query tool
What types of databank are there?
primary, secondary and meta-databanks
What is a primary databank?
- simply contain sequence data (DNA or protein)
- may also have ‘feature’ information (splice sites, signal sequences, disulphides, active sites, etc.)
- DNA databanks may also contain translations (known or predicted)
What is a meta-databank?
collections of links between databanks and databases
Give some examples of primary databanks
Genbank, EMBL, DDBJ, UniProtKB/SwissProt, PIR, PDB, Enzyme
What is the implication of imperfect gene-prediction methods?
a protein identified from genome data is hypothetical until verified by experiment
What information is found in PDB?
structural data
What information is found in Enzyme?
enzyme classifications (EC numbers)
What is a secondary databank?
- these contain derived information
- patterns that characterise a protein family
- detailed annotation
What is the meaning of some characters and symbols used in PROSITE?
- the standard IUPAC one letter code for the amino acids is used
- the symbol ‘x’ is used for a position where any amino acid is accepted
- [ALT] stands for Ala or Leu or Thr
- {AM} stands for any amino acid except Ala and Met
- each element in a pattern is separated from its neighbour by ‘-’
- x(3) corresponds to x-x-x
- x(2,4) corresponds to x-x or x-x-x or x-x-x-x
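The notation above maps fairly directly onto regular expressions. The following is a minimal sketch that handles only the symbols listed on this card (full PROSITE syntax also has anchors and other features that are not covered here); the function name `prosite_to_regex` is hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern (subset of the notation) into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            token = "."                        # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            token = core                       # single letter or [ALT]-style set
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


print(prosite_to_regex("[ALT]-x(2,4)-{AM}"))
```

For example, `x(3)` becomes `.{3}` and `{AM}` becomes `[^AM]`, so the resulting string can be passed straight to `re.search`.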
What is a dotplot?
a graphical method that allows the comparison of two biological sequences and identification of regions of close similarity between them
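A dotplot can be sketched in a few lines as a text grid. This is an illustrative version only: the function name, the `window` parameter and the `*`/`.` symbols are assumptions, and real tools add filtering and graphical output.

```python
def dotplot(seq_a: str, seq_b: str, window: int = 1) -> list[str]:
    """Return the rows of a text dotplot.

    Row i, column j holds '*' when the window of seq_a starting at i
    is identical to the window of seq_b starting at j, else '.'.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


for line in dotplot("GATTACA", "GATCACA"):
    print(line)
```

Regions of close similarity show up as diagonal runs of `*`; raising `window` suppresses isolated chance matches.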
What is a similarity measure?
a scoring scheme, usually a similarity matrix, used to score pairs of nucleotides or amino acids when aligning sequences
What is scored for a match; and what for a mismatch?
1 for a match; 0 for a mismatch
What is an example of a more complex scoring system?
a more complicated matrix would give a higher score to transitions (pyrimidine to pyrimidine or purine to purine) than to transversions (pyrimidine to purine or vice versa); the match/mismatch ratio of the matrix sets the target evolutionary distance
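The two scoring schemes above can be written as small functions. This is a sketch for DNA only; the numeric values for transitions and transversions (0.5 and 0.0) are illustrative assumptions, not values from a published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def identity_score(a: str, b: str) -> int:
    """Simplest scheme: 1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0


def transition_aware_score(a: str, b: str,
                           match: float = 1.0,
                           transition: float = 0.5,
                           transversion: float = 0.0) -> float:
    """Score a transition (purine<->purine or pyrimidine<->pyrimidine)
    higher than a transversion (purine<->pyrimidine)."""
    if a == b:
        return match
    pair = {a, b}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return transition if same_class else transversion


print(transition_aware_score("A", "G"), transition_aware_score("A", "C"))
```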
What is the Needleman-Wunsch algorithm?
an algorithm used to align protein or nucleotide sequences; one of the first applications of dynamic programming to compare biological sequences
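A compact sketch of global alignment by dynamic programming follows. It assumes a simple linear gap penalty and illustrative scores (match 1, mismatch -1, gap -1); these parameter values are assumptions, and practical implementations use substitution matrices and often affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table fill takes time proportional to the product of the two sequence lengths, which is the cost that heuristic database-search methods try to avoid.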
When were similarity measures first carried out?
by Dayhoff in 1978
When were similarity measures improved?
by Henikoff and Henikoff in 1992 by use of BLOSUM matrices
When was dynamic programming first applied to sequence alignment?
introduced by Needleman and Wunsch (global alignment) in 1970 and adapted to local alignment by Smith and Waterman in 1981
Give some examples of secondary databanks
PROSITE, PRINTS, BLOCKS, INTERPRO
What is an algorithm?
A complete and precise set of steps that will solve a problem and achieve an identical result whenever given the same set of data to a defined level of accuracy.
What is PROSITE?
PROSITE is a protein database consisting of entries describing the protein families, domains and functional sites, as well as amino acid patterns and profiles in them.
What is the PROSITE pattern for a protein kinase C phosphorylation site?
[ST]-x-[RK]
What is the PROSITE pattern for N-linked glycosylation?
N-{P}-[ST]-{P}
What is the PROSITE pattern for the Kringle domain?
[FY]-C-[RH]-[NS]-x(7,8)-[WY]-C
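Short patterns like these can be scanned for directly with a regular expression. The sketch below hand-translates the protein kinase C pattern `[ST]-x-[RK]` and uses a lookahead so that overlapping sites are all reported; the example sequence and the function name are made up for illustration.

```python
import re

# [ST]-x-[RK] written as a regex; the lookahead allows overlapping matches.
PKC_SITE = re.compile(r"(?=([ST].[RK]))")


def find_sites(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for every site."""
    return [(m.start() + 1, m.group(1)) for m in PKC_SITE.finditer(sequence)]


print(find_sites("MSTRKASAK"))
```

On this toy sequence the scan reports sites starting at positions 2, 3 and 7. Patterns this short match often by chance, so a hit is only a candidate site.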
Define annotation
a subfield in the general field of genome analysis, which includes anything that can be done with genome sequences by computational means
Why might methods be imperfect?
- a coding region may be missed
- an incomplete protein may be reported
- splicing may be predicted incorrectly
- coding regions may overlap
- exon assembly (splicing) may be different in different tissues
- some apparent coding sequences may be defective or not expressed
What is meant by heuristics?
approximate fast methods
What does heuristics entail?
- index the database by finding locations of short ‘words’
- take ‘words’ from the probe sequence and look them up in the index
- look for multiple matches and extend to find likely hits to full alignment
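The first two steps above, plus the counting part of the third, can be sketched as follows. This is an illustration of the word-index idea only, not the implementation of any particular search tool; the function names, the toy database and the word length `k` are assumptions, and the extension to a full alignment is left out.

```python
from collections import defaultdict


def build_index(database: dict[str, str], k: int) -> dict:
    """Map every word of length k to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(index: dict, probe: str, k: int) -> dict:
    """Count word matches per (sequence name, diagonal).

    Matches on the same diagonal (database position minus probe position)
    are consistent with a single ungapped alignment.
    """
    hits = defaultdict(int)
    for qpos in range(len(probe) - k + 1):
        for name, dpos in index.get(probe[qpos:qpos + k], []):
            hits[(name, dpos - qpos)] += 1
    return dict(hits)


db = {"s1": "ACGTACGT", "s2": "TTTTTTTT"}
print(seed_hits(build_index(db, 3), "GTAC", 3))
```

Diagonals with several word matches are the "likely hits" that a real tool would then extend and score.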
How is DNA sequenced?
by the Sanger method: di-deoxy chain termination
How does this apply to sequencing entire genomes?
each segment
What is fragment assembly?
aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence
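A toy version of fragment assembly is a greedy merge of the pair of fragments with the longest suffix-prefix overlap. This sketch assumes error-free reads from a single strand and ignores repeats; the function names and the `min_len` threshold are illustrative assumptions, and real assemblers are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b
    (at least min_len long), or 0 if there is none."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair with the largest overlap until none remains."""
    frags = list(fragments)
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGCGT", "CGTTAC", "TACGGA"]))
```

With these three fragments the overlaps `CGT` and `TAC` chain them into the single sequence `ATGCGTTACGGA`.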
What are two approaches to sequencing eukaryotes?
1. detect similarity with known coding regions
2. ab initio methods; make predictions based on typical features
What are ESTs?
expressed sequence tags; short subsequences of a cDNA sequence, used to identify gene transcripts and instrumental in gene discovery and in gene-sequence determination
What are some typical features used in ab initio methods?
- initial 5' exon (transcription start point with upstream promoter; ends immediately before a GT splice signal)
- internal exons (begin after an AG splice signal; end before a GT splice signal)
- final 3' exon (begins after an AG splice signal; ends with stop codon and poly-A tail)
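The GT and AG splice signals named above are the simplest features to scan for. The sketch below only lists candidate positions; the function name is hypothetical. Because these dinucleotides occur very often by chance, a real ab initio predictor combines them with many other signals.

```python
def splice_signal_candidates(dna: str) -> tuple[list[int], list[int]]:
    """Return 0-based positions of every GT (candidate exon end follows
    just before it) and every AG (candidate exon start follows just after it)."""
    dna = dna.upper()
    gt_positions = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    ag_positions = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return gt_positions, ag_positions


gt, ag = splice_signal_candidates("CCAGATGGTAA")
print(gt, ag)
```

A candidate internal exon would start right after one of the AG positions and stop right before a later GT position.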
How do computers deal with this information?
machine learning methods; a general class of computer software which learns from examples and is then able to make predictions
What are some examples of these methods?
- artificial neural networks
- support vector machines
- decision trees
- naive Bayesian classifiers
What are artificial neural networks (ANNs)?
- family of models inspired by biological neural networks
- used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown
Explain quality
the quality of raw data is as good as the methods that produce it
the quality of annotations is as good as the curators