Week 4 - Bacterial Genomics Flashcards
Genome
- entire complement of genetic information
* includes genes, regulatory sequences, and noncoding DNA
Genomics
discipline of mapping, sequencing, analyzing, and comparing genomes
Number of prokaryotic genomes sequenced
over 12,000
RNA virus MS2
- first genome sequenced in 1976
* 3,569 nucleotides (single-stranded RNA)
Haemophilus influenzae
- first cellular genome sequenced in 1995
* 1,830,137 bp
Large-scale sequencing projects have led to automated DNA sequencing systems
- based on Sanger method
* radioactivity replaced by fluorescent dye
Sequencing
determines the order of nucleotides in a DNA or RNA molecule
Sanger dideoxy method
- invented by Fred Sanger (Nobel Prize winner)
- 2 sequencing techniques were developed independently in the 1970s. The method developed by Fred Sanger used chemically altered “dideoxy” bases to terminate newly synthesized DNA fragments at specific bases (either A, C, T, or G)
- these fragments can then be size-separated, and the DNA sequence can be read
Purines
adenine
guanine
• two rings
Pyrimidines
cytosine
uracil
thymine
• one ring
Determining the sequence of DNA
1. chain termination or dideoxy method (F. Sanger)
2. shotgun sequencing method
3. second-generation sequencing methods (pyrosequencing)
Dideoxy (Sanger) method - steps
- denaturation
- primer attachment and extension of bases
- termination
- gel electrophoresis
produces a chromatogram - laser detection of fluorochromes and computational sequence analysis
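The steps above can be sketched as a toy simulation. This is only an illustration of the principle, not a model of a real sequencer: it assumes the template is written 3' to 5' (so the new strand is simply its base-by-base complement), assumes one terminated fragment per position, and the function name sanger_read is made up for this example.

```python
import random


def sanger_read(template_3_to_5: str) -> str:
    """Read the newly synthesized strand from chain-terminated fragments.

    Assumption: the template is written 3'->5', so the new strand
    (written 5'->3') is its base-by-base complement.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    new_strand = "".join(complement[base] for base in template_3_to_5)

    # Extension + termination: a ddNTP can stop synthesis after any base,
    # giving one fragment of every possible length.
    fragments = [new_strand[:i] for i in range(1, len(new_strand) + 1)]
    random.shuffle(fragments)  # the reaction tube is an unordered mixture

    # Gel/capillary electrophoresis: separate fragments by size.
    fragments.sort(key=len)

    # Detection: the fluorescent label on the terminal ddNTP of each
    # fragment gives the base at that position.
    return "".join(fragment[-1] for fragment in fragments)
```

For example, sanger_read("TACGGA") reads back the complementary strand ATGCCT, one base per fragment length.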
Sanger reaction mixture
- primer and DNA template
- ddNTPs with fluorochromes
- DNA polymerase
- dNTPs (dATP, dCTP, dGTP, dTTP)
What’s wrong with the Sanger/dideoxy method?
- only good for 500-750bp reactions
- expensive
- takes time
- the human genome is about 3 billion bp
Shotgun sequencing
used to sequence whole genomes
Steps of shotgun sequencing
- DNA is randomly broken up into smaller fragments
- dideoxy method produces reads
- look for overlap of reads
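The "look for overlap of reads" step can be illustrated with a greedy merge. This is a minimal sketch under strong assumptions (error-free reads, all from the same strand, no repeats, no read contained inside another); real assemblers are far more involved, and the names overlap and greedy_assemble are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
```

With the three toy reads ATGCGT, CGTACC and ACCTGA, the overlaps CGT and ACC chain the reads into the single contig ATGCGTACCTGA.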
Whole genome shotgun sequencing
• in whole genome shotgun sequencing the entire genome is sheared randomly into small fragments (appropriately sized for sequencing) and then reassembled
Hierarchical shotgun sequencing
- the genome is first broken into larger segments
- after the order of these segments is deduced, they are further sheared into fragments appropriately sized for sequencing
Pyrosequencing
- each nucleotide is added in turn
- only 1 of 4 will generate a light signal
- the remaining nucleotides are removed enzymatically
- the light signal is recorded on a pyrogram
- sequencing by synthesis
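A pyrogram can be mimicked in a few lines. The sketch below assumes an idealized, perfectly linear light signal (one unit per incorporated nucleotide) and a fixed dispensation order of A, C, G, T; both are simplifications, and the function names are invented for illustration.

```python
def pyrogram(new_strand: str, dispensation: str = "ACGT", cycles: int = 10):
    """Return (nucleotide, signal) pairs for each dispensed nucleotide.

    new_strand is the strand being synthesized, 5'->3'. The signal is the
    number of nucleotides incorporated in that step (idealized).
    """
    signals = []
    pos = 0
    for _ in range(cycles):
        for nt in dispensation:
            count = 0
            # A run of identical bases is incorporated in a single step.
            while pos < len(new_strand) and new_strand[pos] == nt:
                count += 1
                pos += 1
            signals.append((nt, count))
            if pos == len(new_strand):
                return signals
    return signals


def read_pyrogram(signals) -> str:
    """Rebuild the sequence from the recorded peak heights."""
    return "".join(nt * height for nt, height in signals)
```

For the strand GGAT, the G step gives a peak of height 2 and most other steps give no light at all, which is why only one of the four nucleotides produces a signal at any position.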
Advantages of pyrosequencing
- accurate
- parallel processing
- easily automated
- eliminates the need for labeled primers and nucleotides
- no need for gel electrophoresis
Basic idea of pyrosequencing
- visible light is generated and is proportional to the number of incorporated nucleotides
- 1 pmol DNA = 6e11 ATP = 6e9 photons at 560nm
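The arithmetic behind these figures can be checked directly. The sketch assumes one ATP is generated per template molecule for a single incorporated nucleotide; the roughly 1% photon yield is simply the ratio of the two numbers on the card, not an independently sourced value.

```python
AVOGADRO = 6.022e23  # molecules per mole


def molecules_from_pmol(pmol: float) -> float:
    """Convert picomoles to a number of molecules."""
    return pmol * 1e-12 * AVOGADRO


def implied_photons_per_atp(atp: float = 6e11, photons: float = 6e9) -> float:
    """Ratio of the photon and ATP figures quoted on the card."""
    return photons / atp
```

So 1 pmol of template is about 6e11 molecules, and the card's numbers correspond to about one detected photon per 100 ATP.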
Pyrosequencing - 1st method
solid phase
• immobilized DNA
• 3 enzymes
• wash step to remove nucleotides after each addition
Pyrosequencing - 2nd method
liquid phase
• 3 enzymes + apyrase (nucleotide degradation enzyme)
(eliminates need for washing step)
• in the well of a microtiter plate: primed DNA template and 4 enzymes
• nucleotides are added stepwise
• nucleotide-degrading enzymes degrade previous nucleotides
Pyrosequencing disadvantages
- smaller sequences
* nonlinear light response after more than 5-6 identical nucleotides
454 sequencing system
• recent technological advance
• generates data 100x faster than Sanger method
• 454 relies on 2 major advances
- massively parallel liquid handling and pyrosequencing
– light is released each time a base is added to DNA strand
– instrument actually measures release of light
– can only handle short stretches of DNA
Virtually all genomic sequencing projects use
shotgun sequencing
• entire genome is cloned and resultant clones are sequenced
• much of the sequencing is redundant
• generally 7- to 10-fold coverage
- computer algorithms used to look for replicate sequences and assemble them
- occasionally assembly isn’t possible
- closure can be pursued using PCR to target areas of the genome
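Fold coverage translates into a number of reads as coverage x genome size / read length. The sketch below also estimates the fraction of the genome left unsequenced, assuming reads land uniformly at random (a Poisson approximation that is not stated on the card); the function names are hypothetical.

```python
import math


def reads_needed(genome_bp: int, read_bp: int, coverage: float) -> int:
    """Number of reads required to reach the given average fold coverage."""
    return math.ceil(coverage * genome_bp / read_bp)


def fraction_unsequenced(coverage: float) -> float:
    """Expected fraction of bases hit by no read (Poisson assumption)."""
    return math.exp(-coverage)
```

For the 1,830,137 bp Haemophilus influenzae genome with 500 bp reads, 8-fold coverage needs about 29,300 reads. Even then a small fraction of bases is expected to be missed by chance, which is one reason gaps remain and closure by PCR is sometimes needed.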
Closed vs Draft genome
- closed genome relies on manpower
- more expensive
- more information
Annotation
converting raw sequence data into a list of genes present in the genome
Majority of genes encode
proteins
Functional ORF
an open reading frame that encodes a protein
• computer algorithms used to search for ORFs
- look for start/stop codons and Shine-Dalgarno sequences
• ORFs can be compared to ORFs in other genomes
Inaccuracies in some annotations are problematic
as many as 10% of annotated genes are incorrectly annotated
Dideoxy method summary
- chain termination method
* best for small DNA segments
Whole genome shotgun sequencing summary
- sequence human genome
* fragments larger DNA strand to make manageable chunks
Pyrosequencing summary
- sequence by synthesis
* accurate and fast
Bioinformatics
- science that applies powerful computational tools to DNA and protein sequences
- for the purpose of analyzing, storing, and accessing the sequences for comparative purposes
Correlation between genome size and ORFs
• on average a prokaryotic gene is 1,000 bp long
- ~ 1,000 genes per megabase
(1Mbp = 1,000,000 bp)
- as genome size increases, gene content proportionally increases
First complete bacterial genome sequenced in
1995
• now routine and many hundreds of bacterial genomes have been sequenced
“Traditional” sequencing methods are now supplemented by
- “environmental genome sequencing” - sequence DNA from an environmental sample, without isolating and culturing strains first
- “RNA sequencing” - “deep sequencing” of RNA to reveal the frequency of different RNA molecules
Smallest cellular genomes belong to
parasitic or endosymbiotic prokaryotes
• obligate parasites range from 490 kbp (Nanoarchaeum equitans) to 4,400 kbp (Mycobacterium tuberculosis)
• endosymbionts can be smaller (eg 160 kbp genome of Carsonella ruddii)
• estimates suggest that the minimum number of genes for a viable cell is 250-300 genes
Obligate parasites (genome)
from 490 kbp (Nanoarchaeum equitans)
to 4,400 kbp (Mycobacterium tuberculosis)
Endosymbionts (genome)
can be smaller
eg 160 kbp genome of Carsonella ruddii
Estimates suggest the minimum number of genes for a viable cell is
250-300 genes
Largest prokaryotic genomes are comparable to those of some eukaryotes
• Sorangium cellulosum (a bacterium) has the largest prokaryotic genome to date, 12.3 Mbp
• largest archaeal genomes tend to be smaller (~5 Mbp)
Complement of genes in a particular organism defines its biology, but genomes are also molded by
an organism's lifestyle
Many genes can be identified by
sequence similarity to genes found in other organisms (comparative analysis)
Comparative analyses allow for
predictions of metabolic pathways and transport systems
• eg Thermotoga maritima
Escherichia coli
- 4.6 Mbp
* 4405 genes
Streptomyces coelicolor
- 8.7 Mbp
* 7825 genes
Mycoplasma genitalium
- 0.58 Mbp
* 482 genes
Methanococcus jannaschii
- 1.66 Mbp
* 1738 genes
Prochlorococcus marinus
- 1.67 Mbp
* 1696 genes
Anabaena cylindrica
- 6.36 Mbp
* 6132 genes
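The genome sizes and gene counts on the six cards above can be checked against the rule of thumb of roughly 1,000 genes per megabase. The numbers in the sketch are copied from those cards; the variable and function names are just for illustration.

```python
# (genome size in Mbp, number of genes), as listed on the cards
GENOMES = {
    "Escherichia coli": (4.6, 4405),
    "Streptomyces coelicolor": (8.7, 7825),
    "Mycoplasma genitalium": (0.58, 482),
    "Methanococcus jannaschii": (1.66, 1738),
    "Prochlorococcus marinus": (1.67, 1696),
    "Anabaena cylindrica": (6.36, 6132),
}


def genes_per_mbp(size_mbp: float, genes: int) -> float:
    """Gene density in genes per megabase pair."""
    return genes / size_mbp


DENSITIES = {name: genes_per_mbp(size, genes)
             for name, (size, genes) in GENOMES.items()}
```

All six densities fall between about 830 and 1,050 genes per Mbp, so gene content scales roughly in proportion to genome size across a 15-fold size range.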
In addition to the main chromosome, many bacteria also have
stable plasmids - much smaller circular DNA molecules, usually with a few genes
Range of genome sizes
- Mycoplasma genitalium - 0.58 Mbp
- Streptomyces coelicolor - 8.7 Mbp
- Escherichia coli is fairly average - 4.60 Mbp, with a circular chromosome about 1.44 mm in circumference (a diameter of about 0.45 mm if laid out as a circle), while the E. coli cell is only about 4 micrometers long
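The geometry quoted for E. coli is easy to verify. The sketch takes the card's figures at face value (a 1.44 mm chromosome laid out as a perfect circle, a 4 micrometer cell); the function names are made up for this example.

```python
import math


def circle_diameter(circumference: float) -> float:
    """Diameter of a circle with the given circumference (same units)."""
    return circumference / math.pi


def fold_longer(dna_length_mm: float, cell_length_um: float) -> float:
    """How many times longer the DNA is than the cell."""
    return dna_length_mm * 1000 / cell_length_um
```

circle_diameter(1.44) is about 0.46 mm, in line with the card, and fold_longer(1.44, 4) shows the chromosome is about 360 times longer than the cell, which is why it must be tightly folded.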
E. coli normally has a single copy
of its chromosome per cell - or 2 copies when the cell is about to divide
Some bacteria have
multiple copies of the chromosome
• eg cyanobacteria typically have about 10 copies of the chromosome in every cell
• eg a Synechocystis cell is about 3 micrometers in diameter and each cell contains DNA with a total length of about 11 mm
Bacterial DNA is
tightly folded and packed into an irregular structure in the cytoplasm - the nucleoid
The nucleoid
- by weight about 60% DNA, 30% RNA, 10% protein
- RNA and proteins probably help to fold DNA into a compact structure
- with very rare exceptions, no surrounding membrane - in bacteria DNA is freely exposed to the cytoplasm
- BUT the nucleoid is usually attached to the plasma membrane at one point
DNA replication
• starts from a single, defined origin
• is bidirectional
(origin of replication, replication forks (2, theta), 2 new double-stranded circular DNA molecules)
In eukaryotes, replication is initiated at
multiple loci along the chromosome
DNA replication in bacteria can
only start at one point
DNA replication in bacteria takes a minimum of about
30 minutes for replication to be complete (depending on the genome size)
• BUT the mean doubling time for some bacteria is less than this, under optimal conditions - how?
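The usual textbook answer is that a new round of replication begins before the previous round has finished, so several rounds overlap on the same chromosome. The sketch below puts rough numbers on this. The fork speed of 1,000 bp per second is an assumed ballpark figure, not a value from these cards, and the simple ratio is only a crude indicator of how many rounds must overlap.

```python
def replication_minutes(genome_bp: float,
                        fork_rate_bp_per_s: float = 1000.0,
                        forks: int = 2) -> float:
    """Time to copy a circular chromosome with bidirectional forks.

    fork_rate_bp_per_s is an assumption, not a measured value.
    """
    return genome_bp / (fork_rate_bp_per_s * forks) / 60.0


def overlapping_rounds(replication_min: float, doubling_min: float) -> float:
    """Rough number of replication rounds in progress at once."""
    return replication_min / doubling_min
```

For a 4.6 Mbp genome this gives about 38 minutes per round; with a 20 minute doubling time, roughly two rounds must be in progress at any moment.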
Sequences beginning with a START codon followed by a long run of codons before the first STOP codon
are very unlikely to occur by chance
• such a sequence is known as an ORF and is potentially a sequence coding for a protein (a gene)
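How unlikely "by chance" is can be estimated with a simple model. The sketch assumes random DNA with all four bases equally frequent, so 3 of the 64 possible codons are stop codons and each codon independently has a 61/64 chance of not being a stop; real genomes have biased base composition, so this is only an order-of-magnitude guide.

```python
def prob_open_by_chance(n_codons: int) -> float:
    """Probability that n consecutive random codons contain no stop codon.

    Assumes equal base frequencies and three stop codons out of 64.
    """
    return (61 / 64) ** n_codons
```

Under this model a stop-free run of 100 codons has a probability of under 1%, and a run of 300 codons (about 900 bp, close to an average prokaryotic gene) has a probability of less than one in a million.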
Start codons
- ATG
* GTG
Stop codons
- TAA
- TAG
- TGA
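A basic ORF search using the start and stop codons listed above can be sketched as follows. It scans all three reading frames on both strands and is a deliberately naive illustration: it ignores ribosome binding sites, overlapping ORFs and alternative start positions, and the minimum length used is arbitrary. Coordinates for the "-" strand refer to the reverse-complemented sequence.

```python
START = {"ATG", "GTG"}
STOP = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, start, end, sequence) for each ORF found.

    An ORF here runs from the first start codon in a frame to the next
    in-frame stop codon, inclusive; min_codons counts the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START:
                    start = i
                elif start is not None and codon in STOP:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs
```

On the toy sequence CCATGAAATTTTAGCC this finds a single ORF, ATGAAATTTTAG, on the "+" strand.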
The cell recognizes genes in a different way to just Start and Stop codons
- control sequences upstream of the ORF promote binding of RNA polymerase
- hence transcription to RNA followed by translation of the RNA to make protein
- but those control sequences are very hard for us to recognize
Structure of a typical bacterial gene
5' to 3':
• regulatory sequences
• RNA polymerase binding site
• leader sequence (in the RNA: ribosome binding site)
• coding region, ORF (in the RNA: coding region)
• trailer (in the RNA: trailer)
• terminator
Total predicted ORFs in Synechocystis
3186 ORFs predicted in total
• genes can be on either strand of the DNA
Some lessons learned from bacterial genome sequencing
- numbers of genes, relationship to complexity of the organism
- a possible minimum set of genes - identify the common minimal set of genes needed for viability?
- dense packing of genes in bacterial chromosomes
- organization of genes in operons
- evolutionary diversity
- evolutionary relationships
- large number of unknown genes (40-60%)
Some lessons learned from bacterial genome sequencing
1. number of genes, relationship to complexity of the organism
rough correspondence between genome size and complexity of lifestyle
• Mycoplasma genitalium (0.58 Mbp, 482 genes) - parasite with very small cells and simple metabolism
• Streptomyces coelicolor (8.7 Mbp, 7825 genes) - soil bacterium with very versatile metabolism, complex structure (branched network of filaments), sporulation
• Prochlorococcus marinus and Anabaena cylindrica are both cyanobacteria
- Prochlorococcus (1.67 Mbp, 1696 genes) - has small, simple cells
- Anabaena (6.37 Mbp, 6132 genes) - filamentous, multiple cell types
Mycoplasma genitalium (0.58 Mbp, 482 genes)
parasite with very small cells and simple metabolism
Streptomyces coelicolor (8.7 Mbp, 7825 genes)
soil bacterium with very versatile metabolism, complex structure (branched network of filaments), sporulation
Prochlorococcus marinus and Anabaena cylindrica are both
cyanobacteria
- Prochlorococcus (1.67 Mbp, 1696 genes) - has small, simple cells
- Anabaena (6.37 Mbp, 6132 genes) - filamentous, multiple cell types
Some lessons from bacterial genome sequencing
2. a possible minimum set of genes - identify the common minimal set of genes needed for viability?
• Craig Venter’s plan to further strip down the genome of Mycoplasma genitalium to create a minimum living organism of about 300 genes
Some lessons from bacterial genome sequencing
3. Dense packing of genes in the bacterial chromosome
bacteria typically have about 1 gene per 1,100 bases; H. sapiens has about 1 gene per 30,000 bases
• bacteria have dense clustering of genes - very different from eukaryotes
Bacteria typically have about 1 gene per
1,100 bases
Homo sapiens has about 1 gene per
30,000 bases
Some lessons from bacterial genome sequencing
4. organization of genes in operons
- clusters of genes on the same DNA strand with related functions likely to be operons
- genes are co-transcribed (ie 1 mRNA molecule for the whole operon)
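Candidate operons can be flagged from gene coordinates alone by grouping neighbouring genes that lie on the same strand with only a short gap between them. This is a rough sketch: the 50 bp gap threshold is an arbitrary assumption, the gene names and coordinates in the usage note are invented, and coordinates cannot show whether the genes have related functions or are really co-transcribed.

```python
def candidate_operons(genes, max_gap: int = 50):
    """Group adjacent same-strand genes separated by at most max_gap bp.

    genes: list of (name, start, end, strand) tuples sorted by start.
    Returns lists of gene names; single genes are not reported.
    """
    if not genes:
        return []
    groups, current = [], [genes[0]]
    for gene in genes[1:]:
        previous = current[-1]
        same_strand = gene[3] == previous[3]
        close_enough = gene[1] - previous[2] <= max_gap
        if same_strand and close_enough:
            current.append(gene)
        else:
            groups.append(current)
            current = [gene]
    groups.append(current)
    return [[g[0] for g in group] for group in groups if len(group) > 1]
```

For six hypothetical genes, geneA (100-1000, +), geneB (1020-1900, +), geneC (1910-2500, +), geneD (3500-4200, -), geneE (4210-5000, -) and geneF (6000-7000, +), the function reports two candidate clusters: geneA-geneB-geneC and geneD-geneE.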
Some lessons learned from bacterial genome sequencing
5. evolutionary diversity of prokaryotes
why are bacterial genes and genomes so diverse?
probably 2 reasons
a. bacterial metabolic diversity - different bacterial species may have fundamentally different metabolism, hence the need for quite different sets of genes
b. deep evolutionary roots - bacteria have been on the planet much longer than other life forms - hence greater time for evolutionary divergence
Some lessons learned from bacterial genome sequencing
6. evolutionary relationships
comparing related species of pathogenic bacteria - can we track pathogen evolution, and can we identify specific genes that are important for specific pathogenicities?
• Mycobacterium bovis - bovine tuberculosis - 3952 genes
• Mycobacterium tuberculosis - human tuberculosis - 4238 genes
• Mycobacterium leprae - leprosy - 2768 genes
• classical microbiology shows different host range, virulence, and physiology - but what is the genetic basis of the differences? and how are the 2 species related? did M. bovis jump the species barrier from cattle to humans when cattle were domesticated 10,000 - 15,000 years ago?
6. Evolutionary relationships: both organisms (M. bovis and M. tuberculosis) are now completely sequenced - what does comparison of the genomes tell us?
- very closely related >99.95% sequence identity
- nearly all ORFs are conserved, and are in the same order on the chromosome - no rearrangements
- therefore recent divergence
- but M. bovis has a slightly smaller genome, and a series of deletions resulting in about 300 fewer genes. It looks as though M. tuberculosis is closer to the common ancestor - did cows catch TB from us?
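Sequence identity and the gene-count difference are both simple to compute. The sketch below assumes the two sequences have already been aligned to the same length with no gaps, which real genome comparisons cannot assume; the gene counts are the ones quoted on the earlier card.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that match between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)


# Gene counts quoted on the card for the two species
M_TUBERCULOSIS_GENES = 4238
M_BOVIS_GENES = 3952
GENE_DIFFERENCE = M_TUBERCULOSIS_GENES - M_BOVIS_GENES
```

The quoted gene counts differ by 286, consistent with "about 300 fewer genes". At 99.95% identity, two aligned sequences differ at about 1 position in every 2,000.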
Some lessons from bacterial genome sequencing
7. large number of unknown genes (typically 40-60%)
so one of the main lessons from genome sequencing is how much we don’t know about bacterial biology
Summary
- genetic information is stored in the order, or sequence, of nucleotides in DNA
- chain termination sequencing is the standard method for the determination of nucleotide sequence
- dideoxy-chain termination sequencing has been facilitated by the development of cycle sequencing and the use of fluorescent dye detection
- alternative methods are used for special applications, such as pyrosequencing (for resequencing and polymorphism detection) or bisulfite sequencing (to analyze methylated DNA)