Bioinformatics Flashcards

1
Q

What is bioinformatics ?

A

The application of computers to problems in Biology :
Aiding the Biologist in creating, storing, searching and analyzing biological data (particularly sequences and
structures), presenting it in a way Biologists can use and applying the analysis to make predictions.

2
Q

In what way is biology overwhelmed with data ?

A
Human genome ~ 3.2Gbp
Raw data to fill ~ 5 CDs
DNA sequence databanks > 2.482 x 10^12 bp (passed 100Gbp Aug 2005) i.e. ~579 HUGEs (was 425 in Oct 15, 579 in Oct 16 ) or ~806 years’ issues of the New York Times
> 703,140,000 entries (Oct 2017)
> 124,500 protein structures (Oct 2017)
3
Q

How often does DNA data double ?

What about structure data ?

A

DNA data doubles every ~ 1.5-3 years

Structure data doubles every ~ 6 years

4
Q

What is the content of the PDB (Protein Data Bank) ?

A

PDB contains structural data :
> 112,500 x-ray
> 10,500 NMR
~ 52,300 sequences (<95% SeqID)

5
Q

How big is the human genome ?
How much of it actually codes for protein ?
How much of it is repeated sequences ?
How many genes undergo alternative splicing ?

A
~ 3.2 Gbp (about 30x C. elegans or D. melanogaster)
< 5% coding
> 50% repeated sequences
~ 35% of genes undergo alternative splicing
6
Q

What is a database ?

A

Database : A structured collection of data with some tool enabling it to be ‘queried’

7
Q

What is a databank ?

A

Databank : A collection of data (normally in simple text files) without an associated query tool

8
Q

What are the three types of databanks ?

A

Primary : Raw sequence/structure data, possibly with detailed annotations
Secondary : Derived data - sequence profiles, etc, generally highly annotated
Meta-databanks : Collections of links between databanks and databases

9
Q

Give examples of primary databanks.

A

GenBank, EMBL, DDBJ, UniProtKB/Swiss-Prot, PIR

  • Simply contain sequence data (DNA or protein)
  • May also have ‘feature’ information (splice sites, signal sequences, disulphides, active sites, etc., etc.)
  • DNA databanks may also contain translations (known or predicted)
10
Q

How can we be sure about a protein identified from genome data ?

A

We can’t !
Since gene-prediction methods are imperfect, a protein identified from genome data is hypothetical until verified by experiment.

11
Q

What does Enzyme contain ?

A

Enzyme classifications (EC numbers).

12
Q

Give examples of secondary databanks.

A

PROSITE, PRINTS, BLOCKS, INTERPRO

These contain derived information: patterns that characterize a protein family, plus detailed annotation

13
Q

How can we carry out similarity measures ?

A

Simple identity scoring: 1 for a match; 0 for a mismatch
More complex scoring schemes account for similarity between amino acids. They are typically derived from analysis of aligned homologous proteins, i.e. from which substitutions are actually observed.
First done by Dayhoff (1978), then improved by Henikoff & Henikoff (1992) with the BLOSUM matrices
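
The two scoring styles above can be compared with a minimal sketch. It assumes the two sequences are already aligned and of equal length. The names identity_score, similarity_score and TOY_SIMILARITY are hypothetical, and the table values are made up for illustration; they are not real Dayhoff or BLOSUM entries.

```python
# Toy similarity table: illustrative values only, NOT real BLOSUM/PAM entries.
TOY_SIMILARITY = {
    frozenset("IL"): 2,  # both hydrophobic
    frozenset("DE"): 2,  # both acidic
    frozenset("KR"): 2,  # both basic
}


def identity_score(a, b):
    """Identity scoring: 1 per matching position, 0 per mismatch."""
    return sum(1 for x, y in zip(a, b) if x == y)


def similarity_score(a, b, match=4, mismatch=-1):
    """Reward exact matches most, similar residues a little, others not at all."""
    total = 0
    for x, y in zip(a, b):
        if x == y:
            total += match
        else:
            total += TOY_SIMILARITY.get(frozenset((x, y)), mismatch)
    return total


print(identity_score("ILKD", "LLRE"))
print(similarity_score("ILKD", "LLRE"))
```

Identity scoring sees only one matching position here, while the similarity scheme also gives credit for the three conservative substitutions.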

14
Q

Who introduced dynamic programming (global + local) and when ?

A

Introduced by Needleman & Wunsch (1970) - global

Formalized by Smith & Waterman (1981) - local
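
As a rough illustration of how the global and local variants differ, the sketch below fills the dynamic programming table and returns only the score, without traceback. The function name align_score and the match/mismatch/gap values are assumptions chosen for the example; real tools use substitution matrices and more elaborate gap penalties.

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Score of the best global (default) or local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)  # local alignment may restart anywhere
            table[i][j] = cell
            best = max(best, cell)
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else table[-1][-1]


print(align_score("GATTACA", "GATTACA"))
print(align_score("TTTACGTTT", "GGACGGG", local=True))
```

The only differences between the two modes are the initialisation of the first row and column, the floor at zero, and where the final score is read from.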

15
Q

How do fast methods (FASTA/BLAST) work ?

A

Approximate fast methods (‘heuristics’)
Index the database by finding locations of short ‘words’
Take ‘words’ from the probe sequence and look them up in the index
Look for multiple matches and extend the likely hits to a full alignment
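
The word-index idea above can be sketched as follows. This is a simplified illustration, not the actual FASTA or BLAST implementation: build_index, find_candidates, the word length k and the min_hits threshold are all hypothetical names and values, and the final extension step to a full alignment is left out.

```python
from collections import defaultdict


def build_index(database, k=3):
    """Map every k-letter word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_candidates(query, index, k=3, min_hits=2):
    """Count word hits per (sequence, diagonal); keep diagonals with several hits."""
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            # Hits on the same diagonal (dpos - qpos) belong to one ungapped match.
            hits[(name, dpos - qpos)] += 1
    return {key: count for key, count in hits.items() if count >= min_hits}


database = {"s1": "AAACGTACGGTT", "s2": "TTTTTTTT"}
print(find_candidates("CGTACG", build_index(database)))
```

Only the surviving candidates would then be passed to a slower, exact alignment step.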

16
Q

How is DNA sequenced ?

How does this apply to entire genomes ? i.e, LONG stretches of DNA

A

Sequencing : Sanger method: di-deoxy chain termination + ‘Nextgen’ methods
Whole genome : use shotgun sequencing = shatter the genome into segments < 2000 bp, then assemble the fragments computationally (with an algorithm)

17
Q

What is an algorithm ?

A

A complete and precise set of steps that will solve a problem and achieve an identical result whenever given the same set of data to a defined level of accuracy.

  • Ordered steps
  • Repeatable
  • Known/defined accuracy
18
Q

How can we program an algorithm for fragment assembly ?

What problems do we encounter ?

A
  • Work down from a big overlap window to a small one
  • Enforce a minimum overlap size
  • Fuzzy matches - account for errors in sequencing
  • Apply a confidence score
  • Problems with sequence repeats
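
A minimal greedy sketch of the first two points (largest overlap first, minimum overlap enforced) might look like this. It assumes exact matches only, so fuzzy matching and confidence scores are not covered, and the names overlap and greedy_assemble are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0 if below min_len."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):  # big window first
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps enough; leave the remaining contigs separate
        merged = frags[i] + frags[j][size:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Repeats cause trouble for exactly this kind of approach: a repeated segment longer than the overlap window gives several equally good merges, and the greedy choice can join the wrong pair.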
19
Q

Why is it tricky to use bioinformatics to model translation ?

A

Computers are ideal for boring repetitive tasks like performing translation to a protein sequence, but not all DNA codes for protein: control regions + junk

20
Q

How can we model translation in prokaryotes ?

A

Find a start codon –> continue to a stop codon (though not all ORFs are used)
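
The start-to-stop scan can be sketched as below. As simplifying assumptions, it treats ATG as the only start codon and reads the forward strand only; the name find_orfs and the min_codons cutoff are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGACC"))
```

Every ORF reported this way is only a candidate, since not all ORFs are used.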

21
Q

What are the problems with modeling translation in eukaryotes ?

A
  • Transcription start sites are not obvious
  • Introns interrupt the genes and get spliced out of the mRNA
22
Q

What are the 2 main approaches for modelling translation in eukaryotes ?

A
  1. Detect similarity with known coding regions :
    - Regions similar to known proteins
    - Regions that map to ‘Expressed Sequence Tags’ (ESTs)
  2. Ab initio methods
    - Make predictions based on typical features (GT/AG splice signals, sequence composition, etc.)
    Initial 5’ exon
    - Begins at the transcription start point (with upstream promoter) ; ends immediately before a GT splice signal
    Internal exons
    - Begin after an AG splice signal ; end before a GT splice signal
    Final 3’ exon
    - Begins after an AG splice signal
    - Ends with a stop codon and a poly-A tail
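
To show how weak the GT/AG signals are on their own, the sketch below lists every stretch that starts with GT and ends with AG as a candidate intron. It is a toy illustration rather than a real gene finder: candidate_introns and min_len are hypothetical, and sequence composition and reading frame are ignored.

```python
def candidate_introns(dna, min_len=6):
    """Return (start, end) for every GT...AG stretch of at least min_len bases."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    # end is exclusive, so dna[start:end] begins with GT and ends with AG
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


print(candidate_introns("AAGTCCCCAGTT"))
```

On real genomic sequence the number of such pairs is enormous, which is why ab initio methods combine the splice signals with other typical features.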
23
Q

Name 4 useful machine learning methods

A
  • Neural networks
  • Support vector machines
  • Decision trees
  • Naïve Bayesian classifiers
24
Q

How are neural networks modeled ?

A

Arrange many interconnected ‘perceptrons’ in multiple interconnected layers.
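
A forward pass through such layers can be sketched in a few lines. This assumes a sigmoid activation (a classical perceptron uses a hard threshold) and hand-picked weights with no training; perceptron, forward and the example weights are hypothetical.

```python
import math


def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def forward(inputs, layers):
    """Feed inputs through each layer; a layer is a list of (weights, bias) units."""
    activations = inputs
    for layer in layers:
        activations = [perceptron(activations, w, b) for w, b in layer]
    return activations


# Two inputs -> hidden layer of two units -> one output unit (arbitrary weights).
layers = [
    [([1.0, -1.0], 0.0), ([-1.0, 1.0], 0.0)],
    [([2.0, 2.0], -1.0)],
]
print(forward([1.0, 0.0], layers))
```

Each unit in one layer feeds every unit in the next, which is the interconnection the answer refers to.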

25
Q

Why are translation methods far from perfect ?

A
  • A coding region may be missed
  • An incomplete protein may be reported
  • Splicing may be predicted incorrectly
  • Coding regions may overlap
  • Exon assembly (splicing) may be different in different tissues
  • Some apparent coding sequences may be defective or not expressed