Bioinformatics Flashcards

Question 1

Q

What is bioinformatics ?

Answer

A

The application of computers to problems in Biology :
Aiding the Biologist in creating, storing, searching and analyzing biological data (particularly sequences and
structures), presenting it in a way Biologists can use and applying the analysis to make predictions.

Question 2

Q

In what way is biology overwhelmed with data ?

Answer

A

Human genome ~ 3.2Gbp
Raw data to fill ~ 5 CDs
DNA sequence databanks > 2.482 x 10^12 bp (passed 100 Gbp Aug 2005)
i.e. ~579 HUGEs (was 425 in Oct 15, 579 in Oct 16) or ~806 years’ issues of the New York Times
> 703,140,000 entries (Oct 2017)
> 124,500 protein structures (Oct 2017)

Question 3

Q

How often does DNA data double ?

What about structure data ?

Answer

A

DNA data doubles every ~ 1.5-3 years

Structure data doubles every ~ 6 years
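
A rough way to see what these doubling times imply, assuming steady exponential growth (a simplification; the function name is just for illustration):

```python
def growth_factor(years, doubling_time):
    """Factor by which a data set grows in `years`, assuming it
    doubles every `doubling_time` years (idealised exponential growth)."""
    return 2 ** (years / doubling_time)


# Over the same 6 years, with DNA data doubling every 3 years
# and structure data doubling every 6 years:
dna_growth = growth_factor(6, 3)        # two doublings
structure_growth = growth_factor(6, 6)  # one doubling
print(dna_growth, structure_growth)
```

The gap between sequence data and structure data widens with every doubling period.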

Question 4

Q

What is the content of the PDB (Protein Data Bank) ?

Answer

A

PDB contains structural data :
> 112,500 x-ray
> 10,500 NMR
~ 52,300 sequences (<95% SeqID)

Question 5

Q

How big is the human genome ?
How much of it actually codes for protein ?
How much of it is repeated sequences ?
How many genes undergo alternative splicing ?

Answer

A

~ 3.2 Gbp (about 30x C. elegans or D. melanogaster)
< 5% coding
> 50% repeated sequences
~ 35% of genes have alternative splicing

Question 6

Q

What is a database ?

Answer

A

Database : A structured collection of data with some tool enabling it to be ‘queried’

Question 7

Q

What is a databank ?

Answer

A

Databank : A collection of data (normally in simple text files) without an associated query tool

Question 8

Q

What are the three types of databanks ?

Answer

A

Primary : Raw sequence/structure data, possibly with detailed annotations
Secondary : Derived data - sequence profiles, etc, generally highly annotated
Meta-databanks : Collections of links between databanks and databases

Question 9

Q

Give examples of primary databanks.

Answer

A

Genbank, EMBL, DDBJ, UniProtKB/Swissprot, PIR

Simply contain sequence data (DNA or protein)
May also have ‘feature’ information (splice sites, signal sequences, disulphides, active sites, etc., etc.)
DNA databanks may also contain translations (known or predicted)

Question 10

Q

How can we be sure about a protein identified from genome data ?

Answer

A

We can’t !
Since gene-prediction methods are imperfect, a protein identified from genome data is hypothetical until verified by experiment.

Question 11

Q

What does Enzyme contain ?

Answer

A

Enzyme classifications (EC numbers).

Question 12

Q

Give examples of secondary databanks.

Answer

A

PROSITE, PRINTS, BLOCKS, INTERPRO

These contain derived information patterns that characterize a protein family + detailed annotation

Question 13

Q

How can we carry out similarity measures ?

Answer

A

Simple identity scoring: 1 for a match; 0 for a mismatch
More complex scoring schemes –> account for similarity between amino acids –> typically derived from analysis of aligned homologous proteins (which substitutions are observed)
First done by Dayhoff (1978) + improved by Henikoff & Henikoff (1992) - BLOSUM matrices
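
Identity scoring can be sketched as follows. This is a minimal illustration that assumes the two sequences are already aligned to the same length, with '-' marking gaps; the function names are hypothetical, and no real substitution matrix (Dayhoff or BLOSUM) is used.

```python
def identity_score(a, b):
    """Identity score for two pre-aligned sequences:
    1 for each matching position, 0 for a mismatch or a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def percent_identity(a, b):
    """Identity score as a percentage of the alignment length."""
    return 100.0 * identity_score(a, b) / len(a)


print(identity_score("ACDE", "ACDF"))    # 3 of 4 positions match
print(percent_identity("ACDE", "ACDF"))
```

A similarity-based scheme would replace the 1/0 rule with a lookup in a substitution matrix, so that a conservative substitution scores better than a drastic one.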

Question 14

Q

Who introduced dynamic programming (global + local) and when ?

Answer

A

Introduced by Needleman & Wunsch (1970) - global

Formalized by Smith & Waterman (1981) - local
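
The difference between the two can be sketched in a few lines. This is a simplified illustration, not the published algorithms in full: it returns only the best score (no traceback), and the match/mismatch/gap values are assumed for the example rather than taken from a real scoring matrix.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch-style global alignment score (linear gap penalty).
    Both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman-style local alignment score.
    Cells are floored at 0, so the best-scoring sub-region wins."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACGT", "AGT"))
print(local_score("TTTACGTTT", "GGGACGGGG"))
```

The only structural differences are the first row/column (gap penalties versus zeros) and the floor at 0 in the local version.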

Question 15

Q

How do fast methods (FASTA/BLAST) work ?

Answer

A

Approximate fast methods (‘heuristics’)
Index the database by finding locations of short ‘words’
Take ‘words’ from the probe sequence and look them up in the index
Look for multiple matches and extend them to find likely hits, which then go on to full alignment
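
The indexing and lookup steps can be sketched as below. This is a toy illustration of the word-index idea only, not how FASTA or BLAST are actually implemented; the function names, the word length k=3 and the example sequences are all assumptions.

```python
from collections import Counter, defaultdict


def build_word_index(sequences, k=3):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def find_seeds(index, probe, k=3):
    """Look up each word of the probe; return name -> list of (probe_pos, db_pos)."""
    hits = defaultdict(list)
    for i in range(len(probe) - k + 1):
        for name, pos in index.get(probe[i:i + k], ()):
            hits[name].append((i, pos))
    return hits


def best_diagonal(seeds):
    """Multiple word hits with the same offset (db_pos - probe_pos) suggest one
    ungapped region worth extending. Returns (offset, number_of_hits)."""
    return Counter(db - pr for pr, db in seeds).most_common(1)[0]


db = {"s1": "ACGTACGTGA", "s2": "TTTTTTTT"}
seeds = find_seeds(build_word_index(db), "CGTAC")
print(best_diagonal(seeds["s1"]))
```

Only the sequences with several hits on one diagonal would be passed on to the slower full alignment.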

Question 16

Q

How is DNA sequenced ?

How does this apply to entire genomes ? i.e, LONG stretches of DNA

Answer

A

Sequencing : Sanger method: di-deoxy chain termination + ‘Nextgen’ methods
Whole genome : use shotgun sequencing = shatter the genome into segments < 2000 bp + assemble the fragments by computer (using an algorithm)

Question 17

Q

What is an algorithm ?

Answer

A

A complete and precise set of steps that will solve a problem and achieve an identical result whenever given the same set of data to a defined level of accuracy.

Ordered steps
Repeatable
Known/defined accuracy

Question 18

Q

How can we program an algorithm for fragment assembly ?

What problems do we encounter ?

Answer

A

Work down from a big overlap window to a small one
Enforce a minimum overlap size
Fuzzy matches - account for errors in sequencing
Apply a confidence score
Problems with sequence repeats
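
A minimal sketch of the first two points (largest overlap first, with a minimum overlap size) is given below. It is a greedy toy assembler under strong assumptions: exact matches only, a single strand, no confidence score, and no handling of repeats. The function names are hypothetical.

```python
def overlap(a, b, min_overlap):
    """Length of the longest suffix of `a` that equals a prefix of `b`.
    Works down from the biggest possible window; returns 0 if the best
    overlap is shorter than `min_overlap`."""
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left that overlaps enough
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

A repeated sequence longer than the fragments breaks this approach, because two different fragments can then overlap a third equally well and the greedy choice may join the wrong pair.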

Question 19

Q

Why is it tricky to use bioinformatics to model translation ?

Answer

A

Computers are ideal for boring repetitive tasks like performing translation to a protein sequence, but not all DNA codes for protein: there are also control regions + junk
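
The mechanical part really is simple, as this sketch shows. It assumes the standard genetic code, reads the DNA coding strand directly in frame 1, and uses 'X' for any codon it cannot interpret; the names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string codon by codon from position 0.
    '*' marks a stop codon; 'X' marks an unrecognised codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


print(translate("ATGGCCTAA"))
```

The function will happily 'translate' any stretch of DNA, coding or not, which is exactly why the hard problem is deciding where the genes are.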

Question 20

Q

How can we model translation in prokaryotes ?

Answer

A

Find a start codon –> continue to a stop codon (though not all ORFs are used)
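
A minimal sketch of that scan, assuming ATG as the only start codon, the three standard stop codons, and the forward strand only (a real ORF finder would also scan the reverse complement). The function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of open reading frames in the three
    forward frames. `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


seq = "CCATGAAATTTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

Every ORF reported this way is only a candidate, since not all ORFs are used.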

Question 21

Q

What are the problems with modeling translation in eukaryotes ?

Answer

A

Transcription start sites are not obvious
Introns interrupt the genes and get spliced out of the mRNA

Question 22

Q

What are the 2 main approaches for modelling translation in eukaryotes ?

Answer

A

Detect similarity with known coding regions :
- Regions similar to known proteins
- Regions that map to ‘Expressed Sequence Tags’ (ESTs)
Ab initio methods
- Make predictions based on typical features (GT/AG splice signals, sequence composition, etc.)
Initial 5’ exon
- Transcription start point with upstream promoter + ends immediately before a GT splice signal
Internal exons
- Begin after AG ; end before a GT splice signal
Final 3’ exon
- Begin after AG splice signal
- Ends with stop codon and a poly-A tail
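
The GT/AG signals used by ab initio methods can be illustrated with a naive scan. This sketch only lists every GT...AG span of a minimum length as a candidate intron; real gene predictors also weigh sequence composition and other features, which are not modelled here. The function names and `min_len` are assumptions.

```python
def candidate_introns(seq, min_len=4):
    """Return (start, end) spans that begin with a GT donor signal and end
    with an AG acceptor signal. `end` is exclusive."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors
            if a + 2 - d >= min_len]


def splice_out(seq, start, end):
    """Remove one candidate intron, joining the flanking exons."""
    return seq[:start] + seq[end:]


# Toy gene: exon 'ATGC' + intron 'GTAAAAAG' + exon 'CCTAA'
gene = "ATGCGTAAAAAGCCTAA"
for start, end in candidate_introns(gene):
    print(gene[start:end], splice_out(gene, start, end))
```

On real sequence GT and AG occur by chance very often, so a scan like this produces many false candidates that the other features must filter.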

Question 23

Q

Name 4 useful machine learning methods

Answer

A

Neural networks
Support vector machines
Decision trees
Naïve Bayesian classifiers

Question 24

Q

How are neural networks modeled ?

Answer

A

Arrange many interconnected ‘perceptrons’ in multiple interconnected layers.
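
As a small illustration of why layers matter, the sketch below wires three step-function perceptrons into two layers to compute XOR, which a single perceptron cannot do. The weights are set by hand for the example; in practice they would be learned from training data.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a step function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


def xor_network(x1, x2):
    """Two-layer network: a hidden layer (OR, AND) feeding one output unit."""
    h_or = perceptron([x1, x2], [1, 1], -0.5)    # fires if either input is 1
    h_and = perceptron([x1, x2], [1, 1], -1.5)   # fires only if both are 1
    return perceptron([h_or, h_and], [1, -1], -0.5)  # OR but not AND


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```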

Question 25

Q

Why are translation methods far from perfect ?

Answer

A

A coding region may be missed
An incomplete protein may be reported
Splicing may be predicted incorrectly
Coding regions may overlap
Exon assembly (splicing) may be different in different tissues
Some apparent coding sequences may be defective or not expressed