Bioinformatics Flashcards
What is bioinformatics ?
The application of computers to problems in Biology :
Aiding the Biologist in creating, storing, searching and analyzing biological data (particularly sequences and
structures), presenting it in a way Biologists can use and applying the analysis to make predictions.
In what way is biology overwhelmed with data ?
Human genome ~ 3.2Gbp
Raw data to fill ~ 5 CDs
DNA sequence databanks > 2.482 x 10^12 bp (passed 100Gbp Aug 2005)
i.e. ~579 HUGEs (was 425 in Oct 15, 579 in Oct 16) or ~806 years’ issues of the New York Times
> 703,140,000 entries (Oct 2017)
> 124,500 protein structures (Oct 2017)
How often does DNA data double ?
What about structure data ?
DNA data doubles every ~ 1.5-3 years
Structure data doubles every ~ 6 years
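The doubling times above imply exponential growth. As a rough sketch (the function name `growth_factor` is made up for this card, and a constant doubling time is an assumption), the size after t years is the starting size times 2^(t / doubling time):

```python
def growth_factor(years, doubling_time):
    """Return how many times larger a data set becomes after `years`,
    assuming it doubles once every `doubling_time` years."""
    if doubling_time <= 0:
        raise ValueError("doubling_time must be positive")
    return 2 ** (years / doubling_time)


# Over 6 years: structure data (doubling ~6 years) grows 2x,
# DNA data (doubling ~1.5-3 years) grows somewhere between 4x and 16x.
structure_growth = growth_factor(6, 6)
dna_growth_slow = growth_factor(6, 3)
dna_growth_fast = growth_factor(6, 1.5)
```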
What is the content of the PDB (Protein Data Bank) ?
PDB contains structural data :
> 112,500 x-ray
> 10,500 NMR
~ 52,300 sequences (<95% SeqID)
How big is the human genome ?
How much of it actually codes for protein ?
How much of it is repeated sequences ?
How many genes undergo alternative splicing ?
~ 3.2Gbp, 30x C. elegans or D. melanogaster
< 5% coding
> 50% repeated sequences
~ 35% genes have alternative splicing
What is a database ?
Database : A structured collection of data with some
tool enabling it to be ‘queried’
What is a databank ?
Databank : A collection of data (normally in simple text
files) without an associated query tool
What are the three types of databanks ?
Primary : Raw sequence/structure data, possibly with detailed annotations
Secondary : Derived data - sequence profiles, etc, generally highly annotated
Meta-databanks : collections of links between databanks and
databases
Give examples of primary databanks.
Genbank, EMBL, DDBJ, UniProtKB/Swissprot, PIR
- Simply contain sequence data (DNA or protein)
- May also have ‘feature’ information (splice sites, signal sequences, disulphides, active sites, etc., etc.)
- DNA databanks may also contain translations (known or predicted)
How can we be sure about a protein identified from genome data ?
We can’t !
Since gene-prediction methods are imperfect, a protein identified from genome data is hypothetical until verified by experiment.
What does Enzyme contain ?
Enzyme classifications (EC numbers).
Give examples of secondary databanks.
PROSITE, PRINTS, BLOCKS, INTERPRO
These contain derived information: patterns that characterize a protein family + detailed annotation
How can we carry out similarity measures ?
Simple identity scoring: 1 for a match; 0 for a mismatch
More complex scoring schemes -> account for similarity
between amino acids -> typically derived from analysis of
aligned homologous proteins - which substitutions are observed
First done by Dayhoff (1978) + improved by Henikoff & Henikoff (1992) - BLOSUM matrices
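The difference between the two scoring styles can be sketched in a few lines. The substitution values in `TOY_SCORES` are hypothetical numbers chosen for illustration, not entries from a published Dayhoff or BLOSUM matrix, and the function names are made up for this card:

```python
def identity_score(a, b):
    """Score two aligned, equal-length sequences: 1 per match, 0 per mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


# Hypothetical substitution scores (illustration only, NOT a real matrix).
# Keys are unordered residue pairs, so ('L', 'I') and ('I', 'L') share a score.
TOY_SCORES = {
    frozenset("L"): 4, frozenset("I"): 4, frozenset("D"): 6,
    frozenset("E"): 5, frozenset("K"): 5,
    frozenset("LI"): 2,  # similar hydrophobic residues
    frozenset("DE"): 2,  # similar acidic residues
}


def similarity_score(a, b, scores=TOY_SCORES, default=-1):
    """Score aligned sequences with a substitution table; unlisted pairs get `default`."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(scores.get(frozenset((x, y)), default) for x, y in zip(a, b))
```

With these toy values, "LDK" against "IEK" has only one identical position, yet the substitution table still rewards the L/I and D/E pairs as conservative changes.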
Who introduced dynamic programming (global + local) and when ?
Introduced by Needleman & Wunsch (1970) - global
Formalized by Smith & Waterman (1981) - local
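A minimal sketch of both variants, assuming a simple match/mismatch score and a linear gap penalty (these parameter values and the function name `align_score` are assumptions for illustration; the original papers and real tools use other scoring schemes). It returns only the best score, not the alignment itself:

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), end to end.
    local=True:  local alignment (Smith-Waterman style), best matching region.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]

    if not local:
        # Global: leading gaps are penalized, so fill the first row and column.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)     # local: a region may restart from zero
            H[i][j] = cell
            best = max(best, cell)

    # Global score is the bottom-right cell; local score is the best cell anywhere.
    return best if local else H[-1][-1]
```

The only differences between the two modes are the initialization of the first row and column, the floor at zero, and where the final score is read from.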
How do fast methods (FASTA/BLAST) work ?
Approximate fast methods (‘heuristics’)
Index the database by finding locations of short ‘words’
Take ‘words’ from the probe sequence and look them up in the index
Look for multiple matches and extend them to find likely hits, then compute the full alignment for those hits
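The word-lookup idea above can be sketched as follows. This is a simplified illustration, not the actual FASTA or BLAST algorithm: the function names, the word length `k=3`, and the `min_words` threshold are all assumptions, and real tools add scoring thresholds, extension, and statistics on top:

```python
from collections import Counter, defaultdict


def build_word_index(database, k=3):
    """Map every length-k word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_candidate_hits(query, index, k=3, min_words=2):
    """Look up each query word and count matches per (sequence, diagonal).

    Word matches on the same diagonal (database position minus query position)
    are consistent with one ungapped alignment, so several of them together
    suggest a region worth extending into a full alignment.
    """
    diagonals = Counter()
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            diagonals[(name, dpos - qpos)] += 1
    return {key: n for key, n in diagonals.items() if n >= min_words}


database = {"seq1": "TTGATTACATT", "seq2": "CCCCCCCC"}  # made-up sequences
index = build_word_index(database)
hits = find_candidate_hits("GATTACA", index)
```

Only the sequences and diagonals returned in `hits` would then be passed to a slower, exact alignment step.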