Module 3.1 The Human Genome Project Flashcards
Human Genome Project
history
- A large, international scientific effort that generated the first sequence of the human genome and that of selected model organisms
- 20 centers from six countries: US, France, Germany, Japan, China, UK
- Started in 1990; the project was predicted to last fifteen years, with an estimated cost of $3B
- initial plan was to sequence the selected model organisms (yeast, C. elegans, Drosophila and mouse) as well as the human genome, using automated sequencers
- Celera Genomics, a private company, joined the race in 1998
Hierarchical Shotgun Sequencing
(BAC-to-BAC sequencing)
- human genome fragmented into large pieces
- large fragments sorted and organized into a physical map based on their relative positions in the genome
- a subset of overlapping genomic fragments that together represent the whole genome is selected, and each fragment is sequenced by the random shotgun strategy
- shotgun sequencing data is stitched back together to get the sequence of the large fragments.
- sequences are assembled to reconstruct the sequence of the entire genome
BAC libraries of Human Genome
Calculation
BAC libraries: ~150,000bp/fragment
Coverage =
[Insert length (L) x Number of clones (N)] / Genome size (G)
Number of clones (N) needed for 1x coverage: N = G / L
Using BAC libraries:
N = (3 x 10^9) / (1.5 x 10^5) = 2 x 10^4
Using small plasmid libraries:
N = (3 x 10^9) / (1.5 x 10^3) = 2 x 10^6
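The clone-count arithmetic on this card can be checked with a short Python sketch. It takes coverage as N x L / G; the function name `clones_for_coverage` is made up for illustration, and the genome size and insert lengths are the rounded values used on the card.

```python
import math

def clones_for_coverage(genome_size, insert_length, coverage=1.0):
    """Smallest number of clones N such that N * L / G >= coverage."""
    return math.ceil(coverage * genome_size / insert_length)

GENOME_SIZE = 3_000_000_000  # human genome, about 3 x 10^9 bp

bac_clones = clones_for_coverage(GENOME_SIZE, 150_000)    # BAC inserts
plasmid_clones = clones_for_coverage(GENOME_SIZE, 1_500)  # small plasmid inserts

print(bac_clones, plasmid_clones)
```

Passing a larger `coverage` value scales the clone count proportionally, which is why a redundant library needs several times the 1x figure.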
Phred Software Package
assigns a base quality score
base quality score
assesses the probability of an error
- makes it possible to monitor raw data quality and help in determining whether two similar sequences truly overlap.
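The card does not give the formula, but the standard Phred definition relates the quality score Q to the error probability P as Q = -10 x log10(P). A minimal sketch of that conversion follows; the function names are illustrative and are not part of the Phred package itself.

```python
import math

def phred_quality(error_probability):
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(error_probability)

def error_probability(quality):
    """Convert a Phred quality score back into an error probability."""
    return 10 ** (-quality / 10)

# A score of 30 corresponds to 1 expected error in 1,000 base calls.
print(phred_quality(0.001), error_probability(20))
```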
Phrap computer package
systematically assembles the sequencing data using base quality score
Sanger sequencing high-quality read length
600-900 bases
shotgun sequencing
history
- first proposed in 1979 for sequencing genomes 4,000-7,000 bp long
- first genome sequenced was 8000 bp Cauliflower Mosaic Virus (1981) by sequencing 175 individual fragments
EcoRI recognition site
G / AATTC
HindIII recognition site
A / AGCTT
Cauliflower Mosaic Virus Shotgun Sequencing
process
- Make large quantity of the virus
- isolate the DNA
- split DNA into multiple reactions
- in each reaction, you treat DNA samples with one restriction enzyme to get a specific set of fragments
- purify the restriction digested DNA fragments and sequence
- identify overlapping regions by looking for the same sequences in two fragments
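The digestion and overlap steps above can be sketched in Python. This is a toy illustration, not the original 1981 procedure: the 18-base sequence is invented, it is treated as linear, and the helper names `digest` and `longest_shared_run` are hypothetical. Cut positions follow the EcoRI (G/AATTC) and HindIII (A/AGCTT) sites on the cards above.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of a recognition site.

    cut_offset is the number of bases of the site left on the upstream fragment
    (1 for both G/AATTC and A/AGCTT).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments

def longest_shared_run(a, b):
    """Longest stretch of bases found in both fragments (dynamic programming)."""
    best = ""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > len(best):
                    best = a[i - curr[j]:i]
        prev = curr
    return best

toy_dna = "TTGAATTCAAAAGCTTCC"  # invented sequence with one site for each enzyme
eco_fragments = digest(toy_dna, "GAATTC", 1)   # one reaction: EcoRI only
hind_fragments = digest(toy_dna, "AAGCTT", 1)  # another reaction: HindIII only

# Fragments from different reactions share sequence where they overlap.
shared = longest_shared_run(hind_fragments[0], eco_fragments[1])
print(eco_fragments, hind_fragments, shared)
```

Because each enzyme cuts at different positions, a fragment from one reaction spans the cut site of the other, and the shared run is what links the two fragments.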
Whole genome sequencing
main challenge
- if genome has a lot of repeat sequences, it will be hard to identify overlapping regions with high accuracy
- the human genome is about 3 billion bp long, and more than 50% of it consists of repeated sequences
Plasmid cloning fragment limit
1,000 - 30,000 base pairs
hierarchical genome sequencing
Bacterial Artificial Chromosome
(BAC)
- originally derived from the E. coli F (fertility) plasmid.
- able to hold up to 350 kb of DNA
- origin of replication site (ori)
- antibiotic resistance gene
- restriction sites for DNA insertion
- lacZ gene for blue/white colony selection
- present in only one or two copies per cell so able to keep large fragment stable
- Each colony contains particular piece of the genome
coverage
number of times a given nucleotide in a DNA molecule is represented in the library
- quantifies depth or redundancy of representation for a particular genomic region in library
- common QC metric in genomic sequencing
- the genome needs to be covered more than once (with redundancy) to ensure that every region is represented and properly sequenced
hierarchical genome sequencing
BAC libraries for human genome
properties / preparation
- genome fragmented into large pieces, with each piece about 150,000 base pair long
- clone fragments into BAC vectors and generate many different colonies from transformed bacterial cells
- BAC colony = Genomic DNA clone
- pick colonies, grow cells, and preserve cells of each clone in a freezer.
- give each BAC clone unique ID for checking
- fragments contained in clones have different ends
hierarchical genome sequencing
genomic DNA library
entire collection of BAC clones
sequencing coverage
number of times a given position in the DNA is read or sequenced
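As a small illustration of this definition, the sketch below counts how many reads cover each position of a short region. The read coordinates are invented, intervals are half-open (start included, end excluded), and `depth_per_position` is a hypothetical helper name.

```python
def depth_per_position(region_length, reads):
    """Count how many reads cover each position of a region.

    reads is a list of (start, end) pairs using half-open coordinates.
    """
    depth = [0] * region_length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

toy_reads = [(0, 5), (3, 8), (4, 10)]  # invented read placements on a 10 bp region
depth = depth_per_position(10, toy_reads)
mean_coverage = sum(depth) / len(depth)

print(depth, mean_coverage)
```

The mean of the per-position depths equals the total number of sequenced bases divided by the region length, which is the usual way an average coverage figure is quoted.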
hierarchical genome sequencing
restriction fingerprinting
- digest clone DNA fragment with restriction enzymes and analyze fragment size by gel electrophoresis
- clones can be grouped into subsets, each member of which is related to at least one other member by a significant overlap, suggesting that subsets of clones within a group have a high likelihood of originating from a contiguous region of the DNA
DNA fingerprinting
pairwise comparison
- fragment size of each clone is measured by comparing to the markers
- By comparing the fragment length from two clones, you can identify same-size fragments, indicating overlapping fragments between the two clones
- Two clones are considered similar when they have many matching fragment sizes (aka overlapping fragments)
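The pairwise comparison above can be sketched as counting bands whose sizes agree within a measurement tolerance. The 2% tolerance, the fragment sizes, and the function name `shared_bands` are all assumptions chosen for illustration; the card does not say how the original fingerprinting software scored matches.

```python
def shared_bands(sizes_a, sizes_b, tolerance=0.02):
    """Count fragment sizes that match between two clones.

    Two bands match when their sizes differ by at most `tolerance`
    (as a fraction of the larger band). Each band is matched at most once.
    """
    remaining = sorted(sizes_b)
    matches = 0
    for size in sorted(sizes_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                matches += 1
                del remaining[i]  # a band in clone B cannot be reused
                break
    return matches

clone_a = [1200, 3400, 5100, 8000]        # invented band sizes in bp
clone_b = [1210, 3390, 6000, 8050, 9500]  # invented band sizes in bp

print(shared_bands(clone_a, clone_b))
```

A high count relative to the total number of bands suggests that the two clones overlap; a threshold for "many" would have to be chosen for the real data.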
hierarchical genome sequencing
restriction fingerprint
pattern of various-sized fragment gel bands created when DNA clone insert is digested by restriction enzymes
contig
a set of DNA segments or sequences that overlap in a way that provides a connecting representation of a genomic region
- clone version provides a physical map of a set of cloned segments of DNA across a genomic region
- sequence version provides actual DNA sequence of a genomic region.
- defined by the criteria that each member of a particular subset is related to at least one other member by a significant pairwise overlap within the group
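The grouping criterion above (every member overlaps at least one other member of the group) amounts to finding connected groups of clones. A minimal sketch using a union-find structure follows; the clone names and the list of significant overlaps are invented, and `build_contigs` is a hypothetical name.

```python
def build_contigs(clones, overlaps):
    """Group clones so that each group is linked by pairwise overlaps."""
    parent = {clone: clone for clone in clones}

    def find(clone):
        # Follow parent links to the group representative, shortening the path.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    for a, b in overlaps:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for clone in clones:
        groups.setdefault(find(clone), []).append(clone)
    return sorted(sorted(group) for group in groups.values())

toy_clones = ["A", "B", "C", "D", "E", "F"]
toy_overlaps = [("A", "B"), ("B", "C"), ("D", "E")]  # invented significant overlaps

print(build_contigs(toy_clones, toy_overlaps))
```

Clone F has no significant overlap in this toy input, so it ends up as a group of one.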
hierarchical genome sequencing
clone selection for shotgun sequencing
- physical map provides information about the order and the relative positions of BACs along the chromosomes
- clones are selected for sequencing to minimize overlap between adjacent clones.
- clone’s restriction enzyme fragments must be shared with at least one of its neighbors on each side in the contig.
- the goal is to minimize redundancy between the clones within a contig so that the minimum number of BAC clones covers the entire contig.
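Choosing the fewest clones that still cover a contig can be sketched as a greedy interval cover: at each step, pick the clone that starts within the region already covered and reaches furthest. The clone IDs and map positions (in kb) are invented, `minimal_tiling_path` is a hypothetical name, and this sketch does not enforce a minimum overlap length between neighbors.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick a small set of overlapping clones that covers a region.

    clones is a list of (clone_id, start, end) positions on the physical map.
    """
    path = []
    covered_to = region_start
    while covered_to < region_end:
        # Clones that touch the covered part and extend beyond it.
        candidates = [c for c in clones if c[1] <= covered_to and c[2] > covered_to]
        if not candidates:
            raise ValueError("gap in the map at position %d" % covered_to)
        best = max(candidates, key=lambda c: c[2])  # the one reaching furthest
        path.append(best[0])
        covered_to = best[2]
    return path

toy_map = [
    ("b1", 0, 150), ("b2", 50, 200), ("b3", 100, 260),
    ("b4", 140, 300), ("b5", 250, 400), ("b6", 280, 420),
]  # invented clone positions in kb

print(minimal_tiling_path(toy_map, 0, 420))
```

In this toy map three of the six clones are enough, and each selected clone overlaps its neighbor, so the selected set still forms a connected contig.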
shotgun sequencing of BAC clone
process
- DNA insert in the BAC clone released by using restriction enzyme digestion.
- 150kb-long insert randomly fragmented into 1,500bp pieces
- each fragment cloned into separate M13 plasmid vectors (M13 library)
- M13 plasmid will produce many copies of the insert sequence.
- DNA (including M13 plasmid vector with DNA insert) extracted from bacterial cells and subjected to Sanger sequencing
- anneal primer to M13 circular vector and read toward DNA insert to get a read of about 500 to 600 bases from one end of insert
- After sequencing many unique fragments from the M13 library, the reads can be aligned by the sequence overlap between them
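Aligning reads by their overlaps can be illustrated with a toy greedy assembler that repeatedly merges the pair of reads with the longest suffix-to-prefix overlap. This is only a sketch under simplifying assumptions: the reads are invented, error-free, and all on the same strand, and quality scores and repeats are ignored, so it is not how a production assembler works.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Merge reads pairwise, always taking the longest available overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

toy_reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]  # invented reads from one short insert
print(greedy_assemble(toy_reads))
```

Reads that share no overlap are returned unmerged, which mirrors how gaps in read coverage leave an insert in several pieces.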
whole genome shotgun sequencing
method used by Celera Genomics
- Bypass step of building a physical map first, go straight to sequencing genome
- faster and simpler process but more challenging to assemble genome
- Multiple copies of genome are randomly sheared into 2,000 or 10,000 bp pieces and inserted into plasmids for growing in bacteria
- purified plasmids are then subject to Sanger sequencing (pair sequencing)
- the two reads of a pair point in opposite directions and lie about one insert length apart, which was valuable for reconstructing the sequence of the original target fragments
whole genome shotgun sequencing
pair sequencing
process
- anneal primers to the flanking regions of the plasmid vector and read toward the insert from both sides to create paired-end reads, each about 500 bp long (for both 2,000 bp and 10,000 bp inserts)
- mate pair for a 2,000 bp insert: Read 1 (500 bp) + unknown (1,000 bp) + Read 2 (500 bp)
- align all the read pairs by the sequence overlap between reads (at both ends)
- use computer alignment to compare the reads and piece together the sequence information, filling in the missing sequence
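The mate-pair layout above implies a simple piece of arithmetic: the unsequenced middle of an insert is the insert length minus the two reads. The sketch below works this out for the two insert sizes on these cards, assuming 500 bp reads and exact insert lengths (real inserts vary in size); the function names are hypothetical.

```python
def unsequenced_gap(insert_length, read_length=500):
    """Bases in the middle of an insert that neither read of the pair covers."""
    return insert_length - 2 * read_length

def expected_mate_start(read1_start, insert_length, read_length=500):
    """Expected position where the region covered by Read 2 begins,
    given where Read 1 starts on the same strand coordinate system."""
    return read1_start + insert_length - read_length

for insert_length in (2_000, 10_000):
    print(insert_length, unsequenced_gap(insert_length))
```

Knowing the expected distance between the two reads lets an assembler check that a proposed placement of one read is consistent with the placement of its mate.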
whole genome shotgun sequencing
mate pairs
sequence reads from the same clone fragments (Read 1 and Read 2)
hierarchical shotgun sequencing
benefits and drawbacks
Benefits
- relies less heavily on computing power and computer algorithms.
- fingerprinted BAC map made it possible to select clones for sequencing that would ensure comprehensive coverage of the genome and reduce sequencing redundancy.
- challenge of sequence assembly minimized by restricting random shotgun sequencing to individual clones.
- clone based map also enabled the identification of large repeated segments of the genome and simplified the assembly
Drawbacks
- slower than whole genome shotgun sequencing
- labor intensive