Module 7: Microbial Genomics (DNA Sequencing + Analysis) Flashcards
Genome
An organism’s complete set of DNA including ALL of its genes
Genetics
The study of individual genes and their functions
Genomics
The determination and study of complete genome sequences
–> An interdisciplinary field investigating many aspects of genomes
How are genetics and genomics different?
Genetics deals with individual genes whereas genomics deals with the genome as a whole
Genomics is an _________ field that investigates what aspects of genomes? (5)
Interdisciplinary
Investigates Genome…
1) Structure
2) Function
3) Evolution
4) Mapping
5) Editing
Genomic sequences allow us to link genetic characteristics of microbes to their what? (2)
1) Physiological properties
2) Ecological Role
The study of genomes allows us to: (3)
(in an overall sense)
1) Generate hypotheses
2) Investigate Differences (between genomes)
3) Examine evolutionary relationships
Genomic analysis allows for the examination of evolutionary relationships of microbes: this then facilitates...
medical and epidemiological analysis of infectious disease
The establishment of genomics and molecular biology was made possible by: (2)
1) the advent of recombinant DNA techniques
2) Development of DNA sequencing methods
What were the first DNA sequencing methods?
Two different methods arose at roughly the same time:
1) Fred Sanger’s Enzymatic Method (Sanger Sequencing)
2) Walter Gilbert + Alan Maxam’s Chemical Degradation Method
DNA Sequencing
The process of determining the nucleic acid sequence of DNA
Most sequencing methods to this day are variations of what?
Variations of the Sanger Sequencing Method
What is Sanger Sequencing also known as?
Sanger Sequencing =
“Dideoxy Sequencing” + “Chain-Termination Method”
Sanger sequencing takes advantage of the ______________ nature of DNA polymerase
What does this mean?
Takes advantage of the PRIMER-SPECIFIC nature of DNA polymerase
== Sequencing methodology is based upon the principle that DNA polymerase requires a free 3’ OH end (typically from a primer) to begin DNA synthesis
What are the 3 main steps of Sanger sequencing?
1) DNA cloning
2) DNA synthesis
3) Gel Electrophoresis
What does DNA polymerase do in DNA synthesis?
Adds nucleotides together via formation of phosphodiester bonds between free 3’ OH and 5’ PO4 ends
dNTPs
deoxyribonucleotide triphosphate –> The building blocks of DNA that get connected together by DNA polymerase
–> Consists of a 5-carbon deoxyribose sugar with a hydroxyl group on the 3’ carbon, a triphosphate on the 5’ carbon, and a nitrogenous base on the 1’ carbon
ddNTPs
Dideoxyribonucleotide triphosphate
–> Like dNTPs BUT lack a 3’ OH grp and instead have just a hydrogen on the 3’ carbon
What is at the 5’ and 3’ ends of ddNTPs and why is this significant for sanger sequencing?
5’ End = Free phosphate grp
3’ End = Hydrogen atom (no OH)
–> This means that it can ADD to an elongating chain but nothing can be added to IT!
== Important for chain termination in sanger sequencing!
When a ddNTP is incorporated into a growing strand…
Why does this happen?
DNA synthesis stops!
–> Because the added ddNTP has no free 3’OH grp for DNA polymerase to catalyze formation of a new phosphodiester bond with another nucleotide
Why are ddNTPs called “dideoxy” ?
Because they are “deoxygenated” on two carbons (3’ and 2’)
DNA = “deoxy” because it’s “deoxygenated” on one carbon (2’)
Explain phosphodiester bond formation
The free 3’ OH grp of the last added nucleotide to a growing strand reacts with the 5’ PO4 end of a new nucleotide
–> In doing so, the phosphodiester bond forms, the bond from the first (alpha) phosphate to the other 2 phosphate groups is cleaved, and pyrophosphate (PPi) is released (PPi, not water, is the leaving group when a dNTP is added)
In Sanger sequencing, synthesis of new DNA strand is purposefully _________________ by doing what?
DNA strand synthesis is purposefully TRUNCATED / terminated
–> Occurs through incorporation of ddNTP into the growing strand = building block with NO 3’OH
What cannot form with a ddNTP?
A phosphodiester bond at its 3’ end
(Lacks OH grp needed to produce this bond at this carbon)
ddNTPs are also known as…
Chain Terminators
What is the process of Sanger sequencing?
1) DNA of interest cloned into a vector (typically plasmid)
2) Vector is denatured to produce an ssDNA template
3) 4 rxn tubes are prepared, each containing:
ONE of the four LABELED ddNTPs, ssDNA recomb. vector template, primer, dNTPs (regular), DNA polymerase
4) DNA synthesis rxn is left to occur
5) DNA products are DENATURED = Releases the newly synthesized strand from the template strand (from the recomb. vector)
6) Contents of the FOUR rxn tubes are added to individual wells on a PA gel (4 wells, each containing DNA frags w/ a different labeled ddNTP)
7) Electrophoresis (PAGE)
8) Gel bands are visualized (method depends on label used for the ddNTPs)
9) Sequence is put together with FIRST base being at the BOTTOM of the gel and subsequent bases in order moving up the gel
Why must the recombinant vector DNA be denatured for sanger sequencing?
To create an ssDNA template on which DNA polymerase can synthesize a new strand == AND eventually label a base through addition of a ddNTP
–> Essentially, denaturation allows for a desired template to be utilized!
Why is it important for the DNA of interest to be cloned into a vector before the Sanger sequencing process?
The vector is what allows for the DNA to get a primer attached to it to then allow for new strand formation
–> Primers are designed to bind to adjacent vector sequence = ALL of the desired segment can get replicated!
What are the contents of each rxn tube in Sanger sequencing?
1) Designed oligonucleotide primers
2) DNA polymerase
3) ssDNA template of desired DNA (within a recomb. vector)
4) A small proportion of LABELED ddNTPs (radiolabeled or fluorophore)
5) Larger portion of dNTPs (regular)
DNA synthesis during Sanger Sequencing produces products that are…
NOT uniform!
–> Of varying lengths due to different sites of ddNTP incorporation (each fragment terminated with the given ddNTP but the site at which incorporation occurred can differ)
In sanger sequencing, what determines the length of produced fragments?
Site of ddNTP insertion!
–> If ddNTP is incorporated earlier in DNA synthesis, product will have a SHORTER growing strand
–> If ddNTP is incorporated later in DNA synthesis, product will have a LONGER growing strand
In sanger sequencing, what must be done to the DNA products BEFORE loading onto gel? WHY?
DNA products must be DENATURED –> Needed to separate the DNA template strand from the newly synthesized strand
That way electrophoresis will actually separate based upon size of the newly synthesized terminated strands
What was the purpose of PAGE in sanger sequencing?
To separate DNA fragments based upon size == determination of sequence
–> Essentially, to put in order the different fragments from smallest to biggest and thus, put in order the identities of ddNTPs causing termination in each ordered fragment = to put in order the DNA sequence!
From which direction should a PAGE be “read” when determining sequence in sanger sequencing?
Should be read BOTTOM (smallest fragments) to TOP (largest fragments; near wells)
Bottom = More migration = smaller fragment = earlier truncation = earlier in seq.
Top = Less migration = larger fragment = later truncation = later in seq.
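Bottom-to-top reading can be sketched in a few lines of Python. This is a minimal illustration only: `read_gel` is a hypothetical helper name, and the fragment lengths per lane are made-up values, not data from a real gel.

```python
def read_gel(lanes):
    """Read a 4-lane Sanger gel from bottom (shortest) to top (longest).

    lanes maps each ddNTP base to the fragment lengths seen in its lane.
    """
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    bands.sort()  # shortest fragments migrate farthest, so they are read first
    return "".join(base for _, base in bands)


# Hypothetical gel: fragment lengths (bases added after the primer) per lane
example_lanes = {"A": [1, 5], "C": [3], "G": [2, 6], "T": [4]}
print(read_gel(example_lanes))  # AGCTAG (5' to 3' of the newly made strand)
```

Each band contributes one base, so two bands in lane A mean two As in the sequence.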
What must be added to the ddNTPs for sanger sequencing?
A labelling method!
= Radiolabel or fluorescent label
–> Allows for detection of bands during PAGE later on!
What is the primer in sanger sequencing designed for?
Primer is designed to bind to vector DNA directly adjacent to the DNA of interest within the recomb vector
= Sets the free 3’OH right before the DNA of interest = DNA polymerase begins synthesis of the DNA of interest right away!
Primers
A short, single-stranded DNA or RNA sequence that serves as a starting point for DNA synthesis by providing a free 3’OH end to begin adding dNTPs
What type of molecule are primers considered?
Oligonucleotides
(short, single- or double-stranded DNA or RNA molecules)
Sequence obtained via sanger sequencing is ________ equivalent to the template strand
NOT the same as the template strand
It is COMPLEMENTARY to the template strand sequence
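A small Python sketch of why the read is the complement of the template. The function name `synthesized_strand` and the example template are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template_3_to_5):
    """Return the new strand (written 5'->3') built on a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


# Hypothetical template, written 3' -> 5'
print(synthesized_strand("TCGATC"))  # AGCTAG
```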
Why is the ratio of dNTP to ddNTP in rxn mixtures for sanger sequencing important?
Ratio is set up so that there is a smaller amount of ddNTP than dNTPs
== Allows for variable incorporation and thus variable termination! Because there isn’t so much ddNTP, the frequency of its incorporation is lower = more likely to incorporate into different sites at different points in the DNA synthesis process
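The effect of the ratio can be illustrated with a simple probability model in Python. This is a sketch under a loud assumption: every time the tube's base is needed, the polymerase picks the ddNTP with probability equal to `dd_fraction` (real polymerases do not treat dNTPs and ddNTPs exactly equally). The function name and example strand are hypothetical.

```python
def termination_distribution(new_strand, dd_base, dd_fraction):
    """Return {fragment_length: probability} for one reaction tube.

    Assumes the ddNTP is chosen with probability dd_fraction at each
    position where dd_base is required (a simplification).
    """
    distribution = {}
    still_growing = 1.0  # fraction of strands not yet terminated
    for position, base in enumerate(new_strand, start=1):
        if base == dd_base:
            distribution[position] = still_growing * dd_fraction
            still_growing *= 1 - dd_fraction
    return distribution


# ddATP tube, strand being made is AGCTAG (5' -> 3')
print(termination_distribution("AGCTAG", "A", 0.5))  # {1: 0.5, 5: 0.25}
```

With a very high ddNTP fraction almost every strand stops at the first A, so later positions are rarely represented; a low fraction spreads terminations across all sites.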
What do all products of ONE rxn tube in sanger sequencing have in common?
They are ALL terminated by the SAME ddNTP!
(Have the same labeled endpoint ddNTP)
What are the methods of band detection in sanger sequencing?
Radiolabel –> Detected w/ X-ray film exposure
Fluorochrome –> Detected w/ lasers
What data can we get from the bands of a gel produced from sanger sequencing?
1) Order of DNA frags (bands) == Order of the nucleotide sequence
2) # of a given nitrogenous base in the sequence (3 bands in lane A = 3 As in the sequence!)
What parts of the gel from sanger sequencing correspond to the 5’ and 3’ ends of a DNA segment getting sequenced?
5’ End = BOTTOM of gel (smaller frags)
3’ End = TOP of gel (larger frags)
What were some improvements made to Sanger sequencing? (4)
1) Use of thermostable DNA polymerase
2) Use of fluorescent dyes
3) “Base Calling Software” (automation of analysis)
4) Automation of sequencing process
How does use of thermostable DNA polymerase improve sanger sequencing?
By allowing multiple rounds of sequencing to occur from a SINGLE template strand == significantly amplifies amount of product!
How does use of fluorescent dyes/labels improve sanger sequencing?
1) Cheaper
2) Safer
3) Easier
4) More efficient = allows for the process to occur in ONE tube (by labeling each ddNTP type with a DIFFERENT COLOR of label!)
What does “base calling” software do (sanger sequencing)?
Translates raw sequencing signals (like electrical current changes or chromatogram peaks) into nucleotide sequences
–> Automates the identification and ordering of the gel electrophoresis bands (no longer has to be done by a human!)
What is “automated-“ or “cycle-“ sequencing?
A modified form of Sanger sequencing that uses PCR-style thermal cycling so the template strand can be denatured and re-used in each cycle to produce multiple terminated fragments!
(allows for sequencing from a smaller amount of template DNA!)
“orchestrate iterative denaturation, annealing, and extension of DNA fragments. This cyclic modality amplifies the efficiency and resilience of the sequencing reaction”
Primer Walking
The use of repeated rounds of sequencing with primers designed to be complementary to the END of the last (most recent previous) sequenced segment (of the overall DNA of interest)
What is primer walking used for?
used for sequencing DNA of interest that is LONGER than a given sequencing method can measure
Process of primer walking (from start):
1) Clone DNA of interest into vector
2) Denature to produce ssDNA template
3) Add vector complementary primer
4) Sequence DNA (with some method) to yield as long of a sequence as you can
5) Design NEW primer that is complementary to the END of the sequence you just got
6) Treat template with this primer + sequence
7) REPEAT!
== you “walk” through the DNA of interest sequence!
How is the sequence of an ENTIRE desired DNA segment determined from Primer walking?
All sequences collected from repeated rounds display overlapping coverage which allows for us to put the fragments of sequences into order properly + “sum” them together to get the full sequence!
Fragments produced by primer walking display…
OVERLAPPING COVERAGE
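A minimal Python sketch of how overlapping reads from successive rounds can be joined. The reads are made-up, the helper names are hypothetical, and exact-match overlap is a simplifying assumption (real reads contain errors).

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def walk(reads, min_overlap=3):
    """Join reads from successive primer-walking rounds, in the order obtained."""
    assembled = reads[0]
    for read in reads[1:]:
        size = overlap_length(assembled, read, min_overlap)
        if size == 0:
            raise ValueError("read does not overlap the sequence so far")
        assembled += read[size:]  # append only the new, non-overlapping part
    return assembled


# Hypothetical reads from three rounds of primer walking
rounds = ["ATGGCCTTAG", "TTAGGCAACT", "AACTGGA"]
print(walk(rounds))  # ATGGCCTTAGGCAACTGGA
```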
How long can sequences derived from primer walking be?
Each round of primer walking (one sequencing read) yields ~700-1000 BPs of sequence; repeated rounds extend this across the whole DNA of interest
High-Throughput Sequencing
AKA Next-Generation Sequencing
–> New methods of sequencing using post-sanger methods to sequence MANY DNA fragments simultaneously!!
(methods of sequencing that are NOT limited to sequence one DNA segment like in sanger)
What is different about the read length and # of reads between sanger and High-TP methods?
Read Length:
Sanger = Longer reads
High-TP = Shorter reads (25-500 BPs)
# of Reads (in a given period of time):
Sanger = < # of reads
High-TP = > # of reads (100,000s-millions of reads in a shorter period of time)
What contributes to the greater coverage of High-TP methods?
Shorter AND more frequent reads!
Each base is “covered” by a greater number of aligned sequence reads = base calls can be made with a higher degree of confidence
Coverage
the average number of times a given nucleotide position in the genome has been sequenced
(essentially how many fragments have produced the same reads of a base for a given nucleotide site!)
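Coverage can be computed directly once reads are placed on the genome. The sketch below is illustrative only: read start positions, read length, and the tiny genome size are made-up values, and the function names are hypothetical.

```python
def depth_per_position(read_starts, read_length, genome_size):
    """Count how many reads cover each position of a linear genome."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


def average_coverage(depth):
    """Average number of times a position was sequenced."""
    return sum(depth) / len(depth)


depth = depth_per_position([0, 2, 4], read_length=4, genome_size=8)
print(depth)                    # [1, 1, 2, 2, 2, 2, 1, 1]
print(average_coverage(depth))  # 1.5
```

The same average can be estimated without alignment as (number of reads x read length) / genome size: here 3 x 4 / 8 = 1.5.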
Benefits of High-TP methods:
1) Cheaper
2) Faster (more efficient)
3) Allows collection of sequences from MANY molecules simultaneously (> amount of data produced!)
Unlike sanger sequencing, High-TP methods generally do not HAVE to rely on…
1) Gel electrophoresis
2) Cloning of the DNA of interest!
What was the 1st Next-gen. sequencing method?
Pyrosequencing
Pyrosequencing
A high-TP method that sequences DNA in real time by detecting light produced from a rxn driven by the pyrophosphate released during dNTP incorporation
Similarity between pyrosequencing and sanger sequencing
Both are ENZYMATIC methods
–> Both rely upon complementary strand synthesis for sequence determination
Differences between pyrosequencing and sanger sequencing
1) Pyroseq. sequences in real time (as strand is being produced) vs sanger which sequences AFTER strand is produced
2) no cloning of DNA in Pyroseq vs sanger seq. requires DNA to be cloned 1st
3) Pyroseq does NOT use ddNTPs! vs sanger seq relies on labeled ddNTPs
4) Pyroseq involves detection of dNTP incorporation vs sanger seq detects the incorporation of a ddNTP
5) no gel electrophoresis in pyroseq vs sanger seq relies on PAGE for ordering of nucleotides
What is the chemical basis of pyrosequencing?
(what is the process leading to detection of dNTP incorporation?)
During dNTP incorporation…
1) Phosphodiester bond formation rxn occurs
2) Bond formation RELEASES pyrophosphate (PPi)
3) Pyrophosphate (PPi) + Adenosine phosphosulfate (APS) react together via CATALYSIS from ATP-SULFURYLASE to form ATP!
4) ATP drives luciferase rxn to produce LIGHT
5) Light is detected == dNTP was incorporated!
What rxn does PPi undergo?
PPi reacts with APS (catalyzed by ATP-sulfurylase) to produce ATP
In pyrosequencing, how is the next base in a sequence determined?
By adding ONE dNTP at a time to a rxn mix and detecting any light produced
Light produced = the added dNTP was incorporated = next base in seq!
No light produced = the added dNTP was NOT incorporated = NOT the next base in seq!
What does light prod. vs no light prod. mean in pyrosequencing?
Light produced = the added dNTP was incorporated = next base in seq!
No light produced = the added dNTP was NOT incorporated = NOT the next base in seq!
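The one-dNTP-at-a-time logic can be sketched in Python. This is a toy model, not instrument software: `pyrosequence`, the flow order, and the template are assumptions for illustration. The light signal is modeled as the number of bases incorporated in one flow, which is how runs of the same base are usually read; that detail goes beyond the cards.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def pyrosequence(template_3_to_5, flow_order="ACGT", max_cycles=50):
    """Flow one dNTP at a time and record the light signal for each flow.

    Returns the read (5'->3') and a flowgram of (dNTP, signal) pairs,
    where signal 0 means no light and no incorporation.
    """
    position = 0
    read = ""
    flowgram = []
    for _ in range(max_cycles):
        for dntp in flow_order:
            incorporated = 0
            # Keep extending while the next template base pairs with this dNTP
            while (position < len(template_3_to_5)
                   and COMPLEMENT[template_3_to_5[position]] == dntp):
                incorporated += 1
                position += 1
            flowgram.append((dntp, incorporated))
            read += dntp * incorporated
        if position == len(template_3_to_5):
            break
    return read, flowgram


# Hypothetical template, written 3' -> 5'
read, flowgram = pyrosequence("TCGAATC")
print(read)         # AGCTTAG
print(flowgram[:4]) # [('A', 1), ('C', 0), ('G', 1), ('T', 0)]
```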
What is 454-Pyrosequencing?
An improved + adapted version of pyrosequencing using automation and FLOW CELLS
What is the process of 454-pyrosequencing?
1) DNA of interest is collected and then SHEARED (fragmented) = smaller DNA fragments
2) Nucleotide adapters are ligated to the DNA fragments
3) DNA frags w/ adapters are immobilized on agarose beads
4) Oil emulsion PCR is conducted for EACH bead == beads w/ multiple attached copies of a given DNA fragment (each copy is the SAME on a given bead)
5) Beads w/ DNA attached are distributed over a flow cell picotiter plate == one bead in each well
6) Computer signals for ONE specific dNTP to be released + passed over the plate
7) Any beads that incorporate that dNTP into their growing strand produce light
8) Light emitted for each well is detected via CCD camera
9) Detection info sent to computer: computer adds that specific base to the recorded sequence for each well in which light was detected
10) After a given round, APYRASE removes leftover unincorporated dNTPs AND any ATP remaining in the plate
11) Process repeats
Emulsion PCR
+ What is its general process?
PCR conducted within individual water (or aq.) droplets
1) Oil (nonpolar) is added to a sample of DNA fragments (polar)
2) Emulsion forms: DNA frags w/ PCR materials separated into individual aqueous droplets
3) Individual PCR rxns occur in each droplet
Flow Cell Picotiter Plate
A slide with ~2 million wells with volumes of 75 picoliters
Limitation of original form of pyrosequencing
Original form could only sequence 300-500 bases; could only conduct individual short-length reads
What is pyrosequencing mainly used for today?
Its uses largely got replaced by other methods of High-TP sequencing
Used today mainly for:
1) Resequencing
2) Sequencing of genomes for which sequence of a close relative is already known
Names of common High-TP methods (6):
1) Illumina (Solexa) sequencing
2) Nanopore DNA sequencing (MinION)
3) Single-Molecule Real-Time Sequencing (SMRT)
4) DNA nanoball sequencing
5) Sequencing by ligation (SOLiD)
6) Ion semiconductor (Ion Torrent)
Shotgun Sequencing
A technique used for sequencing ENTIRE genomes by shearing DNA into short fragments that are then sequenced + fragment sequences are ordered via computer programs
Shotgun sequencing is NOT ________________________!
–> It is a process for…
Shotgun sequencing is NOT a defined method for generating sequence data
–> It refers to the processing of microbial DNA for genome sequencing
Process of shotgun sequencing:
1) Genomic DNA is collected + SHEARED (fragmented)
2) Fragments are then sequenced by SOME chosen method
2a) Sanger sequencing; fragments individually cloned into vectors and then sequences determined individually
2b) High-TP method; many of the frags are sequenced at once!
3) Sequence data for all fragments is compiled into computer and then algorithms are used to identify regions of sequence overlap
4) Total sequence is determined by putting all fragment sequences in order!
In shotgun sequencing, how is a full sequence determined?
By aligning fragment sequences depending upon identified regions of sequence overlap with other fragments!
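Overlap-based ordering can be sketched with a greedy merge in Python. This is a simple stand-in chosen for illustration, not necessarily the algorithm real assembly programs use; it assumes error-free fragments from one strand, and all names and fragments are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps: leave the remaining contigs separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical fragments in random order
fragments = ["GGCAACTGGA", "ATGGCCTTAG", "CTTAGGCAAC"]
print(greedy_assemble(fragments))  # ['ATGGCCTTAGGCAACTGGA']
```

If some fragments share no overlap, the function returns more than one contig, which mirrors a gap in the assembled genome.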
How much data is needed for FULL coverage of a genome by shotgun sequencing?
WHY?
~10X more data than the actual size of the genome being sequenced!
–> Because total sequences are determined from a RANDOM distribution of fragments
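Why random fragments need so much extra data can be illustrated with a Poisson model in Python. This is an idealized assumption that is not stated on the cards: reads land uniformly at random, so the chance that a given base is missed at average coverage c is about e^(-c).

```python
import math


def fraction_covered(coverage):
    """Expected fraction of the genome hit by at least one read.

    Assumes reads land uniformly at random (Poisson approximation).
    """
    return 1 - math.exp(-coverage)


for c in (1, 5, 10):
    print(f"{c}X coverage -> {fraction_covered(c):.5f} of bases covered")
```

Under this model 1X of data leaves over a third of the genome unsequenced, while ~10X leaves only a few bases per 100,000 uncovered.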
What is a method that can be employed to increase coverage in shotgun sequencing?
Combining data produced by methods/instruments that give SHORT (> precise) and LONG (error prone) reads!
–> The long reads provide SCAFFOLD of the genomic sequence
–> The short reads help correct any errors in the scaffold + fill in any small gaps
Why is shotgun sequencing important?
Allows us to sequence full genomes (via fragmentation) that otherwise wouldn’t have been possible
Has allowed for genomic sequencing of a > # of organisms
–> As such, has provided new insights into microbial variation
What is an example of microbial variation demonstrated by shotgun sequencing?
E. coli strains differ in genome size by as much as 30%!
Pan-Genome
Collection of ALL genes within a species of bacteria
(Collected from multiple representatives of a single species or clade)
The pan-genome of E. coli contains…
contains ALL genes found in EVERY known strain of E. coli!
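In set terms the pan-genome is a union across strains. A minimal Python sketch follows; the strain and gene names are placeholders, and the "shared by all" set (often called the core genome, a term these cards do not use) is included only for contrast.

```python
def pan_genome(strain_genes):
    """Every gene found in at least one strain."""
    return set().union(*strain_genes.values())


def shared_by_all(strain_genes):
    """Genes present in every strain (shown for contrast with the pan-genome)."""
    return set.intersection(*strain_genes.values())


# Placeholder strains and gene names
strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneE"},
}
print(sorted(pan_genome(strains)))     # ['geneA', 'geneB', 'geneC', 'geneD', 'geneE']
print(sorted(shared_by_all(strains)))  # ['geneA']
```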
High-TP methods allow for the production of ___________________ of DNA sequence data BUT a major challenge after this DNA is collected is…
High-TP methods = allow for prod. of LARGE AMOUNTS of DNA sequence data
BUT a major challenge after this collection is ANALYZING all of that data!
Bioinformatics
An interdisciplinary field that uses computers to analyze large amounts of biological data (Ex: DNA + protein sequences)
Bioinformatics is involved with the production of…
Production of methods + software for large-scale DNA sequencing data analysis
Annotation
The use of computer programs + algorithms to PREDICT the beginning and end of ORFs (protein-encoding sequences) in DNA sequences
ORF
Open Reading Frame
== Sequence of DNA that can be translated to produce a polypeptide
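A bare-bones ORF scan in Python shows what annotation software is predicting. This sketch makes several simplifying assumptions: only the given strand is scanned, ATG is the only start codon, and `min_codons` is an arbitrary cutoff. Real annotation tools use more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch in 3 frames.

    start/end are 0-based, end-exclusive; min_codons counts codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence
print(find_orfs("GGATGAAACCCTAGTT"))  # [(2, 14, 'ATGAAACCCTAG')]
```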
From predicted ORFs, what can be further predicted?
What can we do with this further prediction?
Predicted AA sequence can be derived from a predicted ORF
== we can then COMPARE this predicted AA sequence to known protein sequences to predict:
1) What products (class of protein) a given sequence may produce
2) The potential function of the predicted protein
After predicted ORF is converted to predicted AA seq. + compared to known proteins, what does the comparison tell us?
Based upon observed similarities we can predict a new protein’s function!
(> seq. similarity = more likely to share a similar function!)
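The ORF -> AA sequence -> comparison chain can be sketched in Python. Assumptions: the standard genetic code, a made-up "known protein", and sequences that are already aligned and equal in length (real comparisons rely on alignment tools). Function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(orf):
    """Translate an ORF (coding strand, 5'->3') up to the first stop codon."""
    protein = ""
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein


def percent_identity(seq_a, seq_b):
    """Percent of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100 * matches / len(seq_a)


predicted = translate("ATGAAACCCTAG")
print(predicted)                           # MKP
print(percent_identity("MKPLV", "MKPIV"))  # 80.0
```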
What predictions can be made from annotation analysis/comparison?
Example?
Predictions regarding the TYPES/CLASSES of proteins a suggested ORF may produce
Ex: Predicting that a sequence encodes for a TF or transporter (these are classes of proteins)
Bioinformatics aids in target ____________ BUT these predictions must be _______________ via experimentation
Bioinformatics aids in target identification
BUT these predicted targets must be VALIDATED via experimentation
What are the limitations of bioinformatics?
1) Predicted ORFs may not be able to be designated to a general class of protein
2) Identifying biomolecules that predicted proteins interact with CANNOT be reliably predicted
3) Many discovered genes (ORFs) encode for proteins of entirely unknown function
How many predicted genes in a given newly sequenced genome are of unknown function?
How many of these genes are unique?
~1/3 of predicted genes are of unknown function!
~1/4th of those genes are UNIQUE (not found elsewhere)
Functional Genomics
Field of study to determine biological functions of unknown genes determined from DNA sequence data
Most projects of functional genomics utilize what?
Utilize MUTAGENESIS to study function
Why is mutagenesis used to study functionality?
By examining phenotypes of mutants clues can be provided regarding the normal role of a gene + its product
Genomic Library
A collection of cloned DNA fragments that represents the ENTIRE genome of an organism
What are the two main types of genomic libraries?
TRUE genomic library
+
cDNA library
True Genomic Library
Library generated via shearing of a genome and then cloning of those sheared fragments
Process of generating true genomic library:
1) Isolate DNA from a cell culture
2) Shear/fragment DNA with restriction enzyme
3) Digest selected vector with SAME restriction enzyme
4) Ligate DNA fragments + vectors == recombinant vectors
5) Transform recombinant vectors into host cells (E. coli)
6) Select for colonies with clones that have successfully been transformed with recombinant vectors
7) Pool the selected colonies = library formed!
Process of generating cDNA library:
1) Isolate mRNA
2) add reverse transcriptase to isolated mRNA
== mRNA acts as template strand to produce DNA strand by reverse transcriptase
== cDNA is formed!
3) Clone the cDNA into vectors
4) Transform into host cells
5) Select for successfully transformed host cells
6) Pool colonies == cDNA library!
cDNA library
A collection of cloned cDNA fragments
== A collection of only (actively) EXPRESSED genes of an organism
cDNA
Complementary DNA
–> Lacks non-coding regions (introns) and non-actively expressed regions of a genome!
What is the # of clones needed to ensure that every fragment of an original chromosome is present within a library?
N = ln(1 - P) / ln(1 - f)
N = # of clones needed
P = prob. of generating a complete library
f = avg. size of each cloned fragment / total size of genome
== N = ln(1 - P) / ln(1 - (avg. size of each cloned fragment / total size of genome))
The number of clones required to produce a complete library is ultimately dependent on what 2 factors?
1) Size of source genome
2) Size of each individual cloned fragment
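The clone-number formula N = ln(1 - P) / ln(1 - f) is easy to evaluate in Python. In this sketch the function name is hypothetical, and the example values (99% probability, 10 kb inserts, a ~4.6 Mb genome roughly the size of E. coli K-12) are illustrative choices rather than values from the cards.

```python
import math


def clones_needed(probability, fragment_size, genome_size):
    """N = ln(1 - P) / ln(1 - f), rounded up to a whole number of clones."""
    f = fragment_size / genome_size  # fraction of the genome in one clone
    return math.ceil(math.log(1 - probability) / math.log(1 - f))


# Illustrative values: P = 0.99, 10 kb fragments, 4.6 Mb genome
print(clones_needed(0.99, 10_000, 4_600_000))
```

With these inputs the answer is a little over 2,100 clones. Doubling the fragment size roughly halves N, and a larger genome raises it, which is why both factors matter.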