Genome Sequencing Flashcards
Sequencing a genome can be viewed as…
Obtaining the parts list of the cell/organism.
What 4 things are required for DNA synthesis?
DNA polymerase, primer, template DNA and dNTPs.
Describe DNA sequencing by the chain termination method (aka Sanger Sequencing)
- Template DNA, primer, DNA polymerase, dNTPs and ddNTPs are combined in the sequencing reaction; the products are then separated by gel electrophoresis
- Different size fragments are generated during DNA synthesis depending on the location of ddNTP incorporation/termination.
- Reaction stops when ddNTP is added, which helps determine the order of nucleotides
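The chain termination idea can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: the primer is ignored, exactly one fragment is produced per possible stop position, and the function names are made up for this example.

```python
# Toy model of chain termination: every position in the newly made strand
# is a possible stopping point, so we get one fragment per length.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def synthesized_strand(template):
    """Strand made by the polymerase, written 5'->3' (reverse complement of the template)."""
    return "".join(COMPLEMENT[base] for base in reversed(template))

def termination_fragments(template):
    """Return (fragment length, terminal ddNTP base) for every possible stop position."""
    new_strand = synthesized_strand(template)
    return [(i + 1, new_strand[i]) for i in range(len(new_strand))]

def read_sequence(fragments):
    """Sort fragments by size (as electrophoresis does) and read off the terminal bases."""
    return "".join(base for _, base in sorted(fragments))

fragments = termination_fragments("ATGGCATTC")
print(read_sequence(fragments))
```

Sorting by fragment length plays the role of electrophoresis: the shortest fragment reveals the first base added, the next shortest the second, and so on.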
Describe fluorescent-labelled ddNTPs and capillary electrophoresis
- Each of the four ddNTPs carries a different fluorescent dye, so the colour of a fragment shows which ddNTP was incorporated at its end
- As the DNA fragments exit the capillary, they pass through a laser detection system. The laser excites the fluorescent dye attached to the ddNTP at the end of each fragment, causing it to emit light at a specific wavelength corresponding to one of the four bases (A, T, C, or G). The emitted light is detected and recorded as a peak in a chromatogram.
Why is capillary electrophoresis better for fluorescent-labelled ddNTPs?
Sequencing reactions are run on capillary gel electrophoresis (better heat dissipation and resolution, less sample required, more parallel reactions run at a time)
What is automated base calling?
- Form of Sanger sequencing, scanner records coloured images of different sized termination fragments for each fluorescent-labelled ddNTP
- Computer processes fluorescent signals to generate an electropherogram, assigning a base to each peak.
In automated base calling, what is Phred?
- Phil’s Read Editor
- Electropherograms are usually messy, so Phred estimates a probability of error for each base call in the electropherogram
- Error % is based on parameters such as shape of a peak, spacing between peaks, height of a peak.
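The error probability that Phred assigns is usually reported as a quality score, Q = -10 × log10(P). That scale is not stated on the cards above, so treat this as a supplementary sketch; the function names are illustrative.

```python
import math

def phred_quality(error_probability):
    """Convert an error probability to a Phred quality score: Q = -10 * log10(P)."""
    return -10 * math.log10(error_probability)

def error_probability(quality):
    """Convert a Phred quality score back to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)

for q in (10, 20, 30):
    print(f"Q{q}: 1 error expected in {round(1 / error_probability(q))} bases")
```

On this scale, each increase of 10 in Q means a tenfold drop in the chance that the base call is wrong.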
What are automated sequencers?
- All steps from sample loading to base calling are automated
- Sequencing reactions are usually performed manually in 96-well microplates in a thermal cycler (denaturing, annealing, extension)
- Using machines like the “Applied Biosystems 3730xl DNA analyzer”, we can obtain up to 800 bp of sequence/reaction.
When doing Sanger sequencing, why is each reaction limited to about 800 bp of sequence?
The polymerase eventually falls off the template (it has limited processivity)
Describe the Human Genome Sequencing Project, specifically the public and private sectors involved in the project
- Advances in automated sequencing allowed for genomic projects such as the human genome project.
- The public project was formally proposed in 1985 by the NIH and the US Department of Energy as a 15-year, $3 billion plan carried out by international genome sequencing centers
- A private consortium (Celera Genomics) started a second project in 1998, aiming to complete the genome sequence in three years (very profit-driven, even though you can’t patent anything made by nature)
In the Human Genome project, the DNA came from anonymous donors of diverse ethnic backgrounds. Why?
Using several anonymous donors was better than sequencing one randomly chosen person’s DNA because it helped determine how similar human genomes actually are (this was unknown, since the genome hadn’t been sequenced by that point). The finding: humans have very similar genomes, with only some different alleles to account for our different phenotypes.
Describe the shotgun approach to sequencing
The shotgun approach requires breaking the genome into smaller fragments (clones) and sequencing these fragments
Random shearing/sonication in sequencing
The chromosome is broken at random positions into fragments, and each fragment is sequenced independently
You need many copies of the same fragment to perform Sanger sequencing and to accurately see fluorescence. What can be used (in general) to accomplish this?
Cloning vectors
What are the 5 common features of a vector?
- Promoter: constitutive (always on)/inducible
- Multicloning site: unique restriction sites for inserting the gene
- Epitope tag: protein purification/localization
- Origin of replication: determines copy number (also ensures that both daughter cells have the vector)
- Selectable marker: antibiotic resistance (used to identify which E. coli cells actually have the vector)
Phagemid
1 kb insert
Plasmid
up to 10 kb insert
P1 clone
100 kb insert
Bacterial Artificial Chromosome
up to 300 kb insert
What are the steps of hierarchical shotgun sequencing (3 steps)
- Chromosome is fragmented by partial restriction digest or shearing (sonication)
- Clone the unique fragments into BACs (300 kb), PACs (100 kb) and cosmids (50 kb), and transform into E. coli. The resulting DNA library is the full collection of colonies; each colony contains one vector
- Map the correct order of the cloned fragments to select BACs for sequencing, so that the whole genome is represented.
What is the goal when mapping the correct order of BAC clones?
Sequence the minimum number of nucleotides to cover the entire genome to cut costs (i.e. don’t want to sequence multiple BACs containing the same region of the genome)
What are two ways of detecting BACs with overlapping genome sequences?
- BAC library screening by hybridization
- Restriction fingerprinting BAC clones
Describe the steps for BAC library screening by hybridization
- Rapid identification of overlapping clones using a random sequence/probe (single stranded DNA)
1. BAC colonies are robotically transferred to nitrocellulose/nylon membrane and screened with a radioactive probe
2. Probe will only hybridize to BAC colonies with overlapping fragments. Black spots show where the probe is bound (black due to the radioactivity of the probe)
3. The sequence at the end of a clone can be used as a probe in a subsequent screen to look for overlapping fragments: “chromosome walking”
Describe restriction fingerprinting of BAC clones
- Complete restriction digest of BAC clones followed by gel electrophoresis to determine restriction fragment profile for each BAC clone
- Identify BAC clones with common restriction fragments
- Overlapping patterns indicate that two BAC clones share common DNA sequences, allowing researchers to identify overlaps between different clones
- By comparing the overlap of many clones, scientists can begin to determine the relative positions of BAC clones along the genome.
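The fingerprint comparison can be sketched as a set intersection. The clone names and fragment sizes below are invented for illustration, and the exact-match test on sizes is an assumption; real gel bands would need a size tolerance.

```python
from itertools import combinations

# Hypothetical fragment sizes (bp) from a complete restriction digest of each BAC clone.
fingerprints = {
    "BAC_A": {2100, 3400, 5000, 7800, 12500},
    "BAC_B": {5000, 7800, 12500, 4200, 9100},
    "BAC_C": {4200, 9100, 1300, 6600},
}

def overlapping_pairs(fingerprints, min_shared=2):
    """Return (clone1, clone2, shared fragment sizes) for clones sharing enough bands."""
    pairs = []
    for (name1, fp1), (name2, fp2) in combinations(fingerprints.items(), 2):
        shared = fp1 & fp2  # bands present in both digests
        if len(shared) >= min_shared:
            pairs.append((name1, name2, sorted(shared)))
    return pairs

for name1, name2, shared in overlapping_pairs(fingerprints):
    print(f"{name1} overlaps {name2}: shared fragments {shared}")
```

In this made-up data, BAC_A overlaps BAC_B and BAC_B overlaps BAC_C, which suggests the order A, B, C along the genome.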
Describe hierarchical shotgun sequencing after original BAC cloning (3 steps)
A BAC contains up to 300 kb of DNA, which is still very big. The goal is to break these BACs into smaller pieces that are easier to sequence
1. Shear BACs by sonication (unique fragments)
2. Clone the fragments into phagemids (1 kb) or plasmids (2-10 kb) and transform into E. coli (“shotgun library”).
3. Sequence library clones, and assemble genome.
Describe whole-genome shotgun sequencing (WGSS), which was done by Celera, the private company in the human genome project
- DNA extraction
- DNA fragmentation (sonication)
- Clone into vectors, transform bacteria for replication, purify vectors
- Sequence library clones and assemble genome
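The "assemble genome" step can be illustrated with a toy greedy overlap merge. This is a minimal sketch, not how real assemblers work: it assumes error-free reads all from the same strand and no repeats, and the function names and example reads are invented.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_length)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_length)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]))
```

Repeated sequences break the assumption that an overlap identifies a unique join, which is one reason assembly without a physical map is harder for complex genomes.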
What are the advantages and disadvantages of whole-genome shotgun sequencing (WGSS) compared to hierarchical shotgun sequencing (HSS)?
HSS: Easier to assemble genome sequence but have to build physical map (labor intensive)
WGSS: Bypasses physical map (mapping where the BACs are and any overlap), but assembly of the genome is more difficult especially for more complex genomes (like the human genome)
How did Craig Venter (founder of Celera) cheat when doing the human genome project?
Each time a new sequence was found, it was put into the NCBI database. Venter used this public information to help him assemble the entire genome. This shows how profit-driven Celera was.
What is genome coverage?
How many times a genome is sequenced (because nucleotides are resequenced often)
Coverage formula
C (coverage) = LN/G
L: sequence read length in bp (# of bases you get per read)
N: Number of reads sequenced (aka number of clones)
G: Haploid genome length in bp
What is the assumption concerning the genome coverage formula?
Sequencing reads will be randomly distributed in the genome (i.e. the ability to sequence a particular region of the genome does not differ)
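A short sketch of the coverage formula C = LN/G. The second function is an addition not on the cards: under the random-placement assumption, the number of reads covering a base is roughly Poisson distributed, so the expected fraction of bases covered by no read is about e^(-C). The read counts used are arbitrary examples.

```python
import math

def coverage(read_length_bp, num_reads, genome_length_bp):
    """C = L * N / G."""
    return read_length_bp * num_reads / genome_length_bp

def fraction_unsequenced(c):
    """Expected fraction of bases hit by no read, assuming randomly placed reads (Poisson model)."""
    return math.exp(-c)

# Example: 500 bp reads on a 5 Mb genome.
for n_reads in (10_000, 20_000, 80_000):
    c = coverage(500, n_reads, 5_000_000)
    print(f"{n_reads} reads -> {c:.0f}X coverage, ~{fraction_unsequenced(c):.1%} of bases unsequenced")
```

This is why 1X coverage does not mean every base has been read once: with random reads, some regions are read several times and others not at all.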
Given a genome size of 5Mb, what would 1X and 2X coverage be?
1X= 5 Mb
2X = 10 Mb
An insert is usually sequenced from (one/both) end(s)
Both
Since the insert is sequenced from both ends, what are these sequences called?
Paired reads/mate pairs
What are universal primers?
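Primers that anneal to the vector sequence flanking the insert. They are used when sequencing inserts in vectors because we already know the sequence of the vector, so the same primers work for any insert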
Greater length of sequencing reads is better for…
Aligning sequences and better coverage of the genome
- More overlap
How many clones would have to be screened for 1X coverage of a 4 Mb genome with paired reads of 500 bp each?
Each clone gives two reads, so L = 2 x 500 = 1000 bp per clone
N = CG/L = (1)(4x10^6)/1000 = 4000 clones
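The same calculation as a small function, rearranging C = LN/G to N = CG/L. The function name and the reads_per_clone parameter are illustrative; rounding up is an assumption, since a fraction of a clone cannot be screened.

```python
import math

def clones_needed(coverage, genome_length_bp, read_length_bp, reads_per_clone=2):
    """N = C * G / L, where L is the total sequence obtained per clone."""
    bases_per_clone = read_length_bp * reads_per_clone
    return math.ceil(coverage * genome_length_bp / bases_per_clone)

# 1X coverage of a 4 Mb genome with paired 500 bp reads.
print(clones_needed(1, 4_000_000, 500))
```

Setting reads_per_clone=1 gives the answer for inserts sequenced from one end only, which doubles the number of clones.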