Bioinformatics Flashcards
What is bioinformatics ?
The application of computers to problems in Biology :
Aiding the Biologist in creating, storing, searching and analyzing biological data (particularly sequences and
structures), presenting it in a way Biologists can use and applying the analysis to make predictions.
In what way is biology overwhelmed with data ?
Human genome ~ 3.2Gbp
Raw data to fill ~ 5 CDs
DNA sequence databanks > 2.482 x 10^12 bp (passed 100Gbp Aug 2005)
i.e. ~579 HUGEs (was 425 in Oct 15, 579 in Oct 16) or ~806 years’ issues of the New York Times
> 703,140,000 entries (Oct 2017)
> 124,500 protein structures (Oct 2017)
How often does DNA data double ?
What about structure data ?
DNA data doubles every ~ 1.5-3 years
Structure data doubles every ~ 6 years
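A rough illustration of what these doubling times imply, assuming steady exponential growth (an assumption; real growth is irregular). The function name `growth_factor` is made up for this sketch.

```python
def growth_factor(years: float, doubling_time: float) -> float:
    """Fold increase after `years`, assuming the data doubles every `doubling_time` years."""
    return 2 ** (years / doubling_time)


# Over the same 6-year span:
print(growth_factor(6, 1.5))  # DNA data, fast end of the range: 16.0
print(growth_factor(6, 3))    # DNA data, slow end of the range: 4.0
print(growth_factor(6, 6))    # structure data: 2.0
```

So in the time structure data doubles once, DNA data grows roughly 4 to 16 fold.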
What is the content of the PDB (Protein Data Bank) ?
PDB contains structural data :
> 112,500 x-ray
> 10,500 NMR
~ 52,300 sequences (<95% SeqID)
How big is the human genome ?
How much of it actually codes for protein ?
How much of it is repeated sequences ?
How many genes undergo alternative splicing ?
~ 3.2Gbp (30x C. elegans or D. melanogaster)
< 5% coding
> 50% repeated sequences
~ 35% of genes have alternative splicing
What is a database ?
Database : A structured collection of data with some
tool enabling it to be ‘queried’
What is a databank ?
Databank : A collection of data (normally in simple text
files) without an associated query tool
What are the three types of databanks ?
Primary : Raw sequence/structure data, possibly with detailed annotations
Secondary : Derived data - sequence profiles, etc, generally highly annotated
Meta-databanks : collections of links between databanks and
databases
Give examples of primary databanks.
Genbank, EMBL, DDBJ, UniProtKB/Swissprot, PIR
- Simply contain sequence data (DNA or protein)
- May also have ‘feature’ information (splice sites, signal sequences, disulphides, active sites, etc., etc.)
- DNA databanks may also contain translations (known or predicted)
How can we be sure about a protein identified from genome data ?
We can’t !
Since gene-prediction methods are imperfect, a protein identified from genome data is hypothetical until verified by experiment.
What does Enzyme contain ?
Enzyme classifications (EC numbers).
Give examples of secondary databanks.
PROSITE, PRINTS, BLOCKS, INTERPRO
These contain derived information: patterns that characterize a protein family, plus detailed annotation
How can we carry out similarity measures ?
Simple identity scoring: 1 for a match; 0 for a mismatch
More complex scoring schemes –> account for similarity
between amino acids –> typically derived from analysis of
aligned homologous proteins - which substitutions are observed
First done by Dayhoff (1978) + improved by Henikoff & Henikoff (1992) - BLOSUM matrices
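The two scoring ideas can be sketched as follows. This is a minimal sketch that assumes the sequences are already aligned to equal length. `TOY_MATRIX` holds made-up illustrative values, not entries from a real Dayhoff or BLOSUM matrix, and the function names are hypothetical.

```python
def identity_score(a: str, b: str) -> int:
    """Score two pre-aligned sequences: 1 per match, 0 per mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


# Toy substitution scores (illustrative values only, NOT a real matrix).
# Keys are stored in alphabetical order so lookups are order-independent.
TOY_MATRIX = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6,
    ("I", "L"): 2,    # chemically similar pair: mild reward
    ("D", "L"): -4,   # dissimilar pair: penalty
    ("D", "I"): -3,
}


def similarity_score(a: str, b: str, matrix=TOY_MATRIX) -> int:
    """Sum substitution scores over aligned positions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(matrix[tuple(sorted((x, y)))] for x, y in zip(a, b))


print(identity_score("LID", "LLD"))    # 2
print(similarity_score("LID", "LLD"))  # 4 + 2 + 6 = 12
print(similarity_score("LID", "LDD"))  # 4 - 3 + 6 = 7
```

Identity scoring treats I/L and I/D as equally bad; the similarity scheme distinguishes the conservative substitution from the drastic one.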
Who introduced dynamic programming (global + local) and when ?
Introduced by Needleman & Wunsch (1970) - global
Formalized by Smith & Waterman (1981) - local
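A minimal sketch of both variants, computing only the best score (no traceback). The scoring values (match +1, mismatch -1, linear gap penalty -1) and the name `align_score` are assumptions chosen for illustration; the original papers and real tools use other schemes.

```python
def align_score(a: str, b: str, local: bool = False,
                match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: gaps at the start of either sequence are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # local: a cell never drops below zero
                best = max(best, cell)   # best local score can end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


print(align_score("GATTACA", "GATACA"))                  # 5 (6 matches, 1 gap)
print(align_score("TTTACGTTT", "GGACGGG", local=True))   # 3 (shared "ACG")
```

The only differences between the two are the first row and column, the floor at zero, and where the final score is read from.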
How do fast methods (FASTA/BLAST) work ?
Approximate fast methods (‘heuristics’)
Index the database by finding locations of short ‘words’
Take ‘words’ from the probe sequence and look them up in the index
Look for multiple matches and extend to find likely hits to full alignment
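The index-and-lookup idea can be sketched as below. This is a simplified illustration, not how FASTA or BLAST are actually implemented: it uses exact words only, stops at counting hits, and leaves out the extension step. The names `build_index` and `count_word_hits` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_index(database: dict, k: int) -> dict:
    """Map every word of length k to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def count_word_hits(probe: str, index: dict, k: int) -> dict:
    """Look up each word of the probe and count hits per database sequence."""
    hits = defaultdict(int)
    for pos in range(len(probe) - k + 1):
        for name, _ in index.get(probe[pos:pos + k], []):
            hits[name] += 1
    return dict(hits)


database = {"seq1": "ACGTACGTGA", "seq2": "TTTTGGGGCC"}
index = build_index(database, k=4)
print(count_word_hits("CGTACG", index, k=4))  # {'seq1': 3}
```

Sequences with multiple word hits would then be the candidates passed on for extension into a full alignment.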
How is DNA sequenced ?
How does this apply to entire genomes ? i.e., LONG stretches of DNA
Sequencing : Sanger method (di-deoxy chain termination) + ‘Nextgen’ methods
Whole genome : use shotgun sequencing = shatter the genome into segments < 2000 bp + assemble the fragments computationally (with an algorithm)
What is an algorithm ?
A complete and precise set of steps that will solve a problem and achieve an identical result whenever given the same set of data to a defined level of accuracy.
- Ordered steps
- Repeatable
- Known/defined accuracy
How can we program an algorithm for fragment assembly ?
What problems do we encounter ?
- Work down from a big overlap window to a small one
- Enforce a minimum overlap size
- Fuzzy matches - account for errors in sequencing
- Apply a confidence score
- Problems with sequence repeats
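A minimal greedy sketch of the first two points (work down from a big overlap window, enforce a minimum overlap). It assumes exact matches only, so fuzzy matching and confidence scores are left out, and it makes no attempt to handle repeats or fragments contained inside other fragments. The names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Tries the biggest window first and works down; returns 0 if nothing
    of at least `min_len` is found.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list, min_len: int = 3) -> list:
    """Repeatedly merge the pair of fragments with the largest exact overlap."""
    frags = list(fragments)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags  # nothing left that overlaps enough
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATGCGT", "CGTTAC", "TACGGA"]))  # ['ATGCGTTACGGA']
```

With repeated sequences, several fragments can overlap equally well, and a greedy choice like this one can join the wrong pair; that is the repeat problem listed above.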
Why is it tricky to use bioinformatics to model translation ?
Computers are ideal for boring repetitive tasks like performing translation to a protein sequence, but not all DNA codes for protein: there are also control regions + junk
How can we model translation in prokaryotes ?
Find a start codon –> continue to a stop codon (though not all ORFs are used)
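A minimal sketch of the start-to-stop scan. It assumes ATG as the only start codon and searches the forward strand only, in all three reading frames; real gene finders also consider alternative start codons and the reverse strand. The name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end, sequence) for each ATG...stop ORF on the forward strand.

    `min_codons` is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None                     # look for the next ORF
    return orfs


print(find_orfs("CCATGGCTTGAAA"))  # [(2, 11, 'ATGGCTTGA')]
```

Every ORF reported this way is only a candidate, since not all ORFs are used.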
What are the problems with modeling translation in eukaryotes ?
- Transcription start sites are not obvious
- Introns interrupt the genes and get
spliced out of the mRNA
What are the 2 main approaches for modelling translation in eukaryotes ?
- Detect similarity with known coding regions :
- Regions similar to known proteins
- Regions that map to ‘Expressed Sequence Tags’ (ESTs)
- Ab initio methods :
- Make predictions based on typical features (GT/AG splice signals, sequence composition, etc.)
Initial 5’ exon
- Transcription start point with upstream promoter + ends immediately before a GT splice signal
Internal exons
- Begin after AG ; end before a GT splice signal
Final 3’ exon
- Begin after AG splice signal
- Ends with stop codon and a poly-A tail
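To see why the GT/AG signals alone are weak evidence, here is a naive sketch that lists every stretch beginning with GT and ending with AG as a candidate intron. It is an illustration only, not a real gene predictor; the name `candidate_introns` and the `min_len` cutoff are hypothetical.

```python
def candidate_introns(dna: str, min_len: int = 4) -> list:
    """Return (start, end) spans that begin with GT and end with AG.

    Spans are half-open, so dna[start:end] is the candidate intron.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


seq = "ATGGTAAGCAGTT"
for start, end in candidate_introns(seq):
    print(start, end, seq[start:end])
# 3 8 GTAAG
# 3 11 GTAAGCAG
```

Even this 13-base toy sequence gives two competing candidates, which is why ab initio methods combine the signals with other features such as sequence composition.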
Name 4 useful machine learning methods
- Neural networks
- Support vector machines
- Decision trees
- Naïve Bayesian classifiers
How are neural networks modeled ?
Arrange many interconnected ‘perceptrons’ in multiple interconnected layers.
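A minimal forward-pass sketch of that arrangement, with no training. It assumes a sigmoid activation (the classic perceptron uses a hard threshold instead), and the function names and the hand-picked weights are hypothetical.

```python
import math


def perceptron(inputs: list, weights: list, bias: float) -> float:
    """One unit: weighted sum of the inputs passed through a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))


def layer(inputs: list, weight_rows: list, biases: list) -> list:
    """One layer: every unit sees all of the layer's inputs."""
    return [perceptron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def network(inputs: list, layers: list) -> list:
    """Feed the outputs of each layer into the next.

    `layers` is a list of (weight_rows, biases) pairs, applied in order.
    """
    for weight_rows, biases in layers:
        inputs = layer(inputs, weight_rows, biases)
    return inputs


# Two inputs -> hidden layer of two units -> one output unit.
hidden = ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])
output = ([[1.0, -1.0]], [0.0])
print(network([1.0, 0.0], [hidden, output]))  # one value between 0 and 1
```

In practice the weights and biases are learned from training data rather than chosen by hand.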
Why are translation methods far from perfect ?
- A coding region may be missed
- An incomplete protein may be reported
- Splicing may be predicted incorrectly
- Coding regions may overlap
- Exon assembly (splicing) may be different in different tissues
- Some apparent coding sequences may be defective or not expressed