Module 2: Reading the Genome Flashcards
Who is the father of DNA sequencing?
Frederick Sanger
Frederick Sanger’s first project, in 1953, was elucidating the structure of ___________.
He won the Nobel Prize in Chemistry in ______ because he showed that:
Insulin
1958; proteins have defined patterns of amino acid residues
Who is Robert W. Holley?
Robert W. Holley won a Nobel Prize in 1968 for deciphering the structure of alanine transfer RNA (tRNA).
*The first attempts at RNA sequencing were made in the 1960s.
How many years did it take researchers to determine the nucleotide sequence of alanine tRNA?
5.5 years!
It took them 3 years to purify 1 g of alanine tRNA from 140 kg of yeast.
Then, it took them 2.5 years to sequence.
After Sanger joined the Medical Research Council in 1962 and worked with researchers such as FRANCIS CRICK, two new techniques transformed the field of sequencing in 1976.
What are they?
Chain Terminator (Sanger and Coulson) - DNA polymerase extends a radioactively labelled primer until a ddNTP is incorporated, and the fragments are separated on a polyacrylamide gel.
Chemical Cleavage (Maxam and Gilbert) - longer radio-labelled DNA is chemically cut into smaller pieces, which are separated on a polyacrylamide gel.
*Sanger sequencing dominated the field.
Sanger sequencing relies on the use of ddNTPs, also known as chain-terminators.
How are ddNTPs different from dNTPs?
ddNTPs are missing the OH on the 3’ carbon.
In a dNTP, this OH reacts with the 5’ phosphate of the next nucleotide to form a PHOSPHODIESTER bond that links two NTs together.
Missing OH = can’t add NT! Synthesis can’t continue.
How long did it take to sequence one nucleotide before the two new techniques were found?
1 month per nucleotide
Briefly describe how non-automated Sanger sequencing worked.
Four tubes were used, each containing DNA polymerase, dNTPs, template, and primers.
Each tube also contained one distinct ddNTP, which randomly terminated synthesis at every position on the template where that base could be incorporated.
The radioactive fragments were then run on a gel, which was exposed to X-ray film for 24 hours and developed.
A couple of days of work would generate 100-500 base pairs of sequence.
What is a base call?
The identity of each base, as derived from analyzing the graph, gel, etc.
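Differentiate the migration direction vs the read direction of a polyacrylamide gel (Sanger sequencing).
Migration direction: top to bottom. The largest fragments stay near the wells and the shortest fragments travel farthest.
Read direction: bottom to top, from the shortest fragment to the largest.
The 5’ end of the base call corresponds to the shortest fragment.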
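The read direction can be sketched in a few lines of Python. This is only an illustration: the four lanes and their fragment lengths are made-up values, and lengths are counted in bases added after the primer.

```python
def read_gel(lanes):
    """Reconstruct the newly synthesized strand (5' to 3') from four ddNTP lanes.

    lanes maps each base to the fragment lengths observed in that lane.
    """
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    bands.sort()  # shortest fragment first, i.e. closest to the primer (5' end)
    return "".join(base for _, base in bands)


# Hypothetical gel: each lane lists the fragment lengths that ended in that ddNTP.
lanes = {
    "A": [1, 5],
    "C": [3],
    "G": [2, 6],
    "T": [4],
}
print(read_gel(lanes))  # AGCTAG
```

Sorting by length is the software equivalent of reading the film from the bottom band upward.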
(T/F) The first DNA genome was sequenced in 1977 and there were improvements occurring in 1986.
True!
*The first genome ever to be sequenced was an RNA genome, in 1976.
*Improvements were made by Leroy Hood, including fluorescent ddNTPs.
How is automated Sanger sequencing different from non-automated Sanger sequencing?
- Use of fluorescent ddNTPs instead of radioactive
- Perform all four reactions in the same tube (reduces cost, time, automated)
- DNA fragments separated by CAPILLARY electrophoresis (more precise, automated)
- Reads up to ~1kb/day
(T/F) The Department of Energy was seeking data to protect the genome from the mutagenic effects of radiation in 1986. Hence, scientists at the NCHGR proposed to sequence the genome in 1988.
True!
The National Center for Human Genome Research was led by Dr. James Watson.
Sequencing the genome was thought to be ________, _______, and ________.
Impractical, impossible, overambitious
What were the three challenges of sequencing the human genome? Describe each briefly.
Challenge #1: Reliability
- Traditional gels provided about 100 bp of sequence each. We would need to run 30 million gels for 1x coverage!
Challenge #2: Availability
- Most clones (the templates to be sequenced) were randomly derived and did not cover the entire genome. A library of clones spanning the entirety of the genome had to be generated.
Challenge #3: Assembly
- BIGGEST CHALLENGE!
- Have to fragment the entire genomic DNA into millions of pieces and must put them back in the correct order
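The 30 million figure in Challenge #1 is simple arithmetic, shown here as a small check. It assumes a 3 billion bp genome and one 100 bp read per gel, as the card does. The function name is just for illustration.

```python
import math

GENOME_SIZE_BP = 3_000_000_000  # ~3 billion bp, as used in these cards
READ_LENGTH_BP = 100            # sequence obtained from one traditional gel


def gels_needed(genome_bp, read_bp, coverage=1):
    """Number of gels needed to read the genome `coverage` times over."""
    return math.ceil(genome_bp * coverage / read_bp)


print(gels_needed(GENOME_SIZE_BP, READ_LENGTH_BP))  # 30000000
```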
What is the International Human Genome Sequencing Consortium (HGP)?
What did they propose?
20 research centres from the UK, USA, France, Germany, China, and Japan came together to form this Consortium.
They proposed to sequence the EUCHROMATIN region of the genome in 15 years with 3 billion dollars.
What were the 5 goals of the HGP Consortium?
- High-resolution genetic map (based on recombination frequencies)
- Physical maps (based on distances) of all human chromosomes and of the DNA of selected model organisms
- Determination of the complete sequence of human DNA and of the DNA of selected model organisms
- Development of capabilities for collecting, storing, distributing, and analyzing the data produced
- Creation of appropriate technologies necessary to achieve these objectives
Who created The Institute for Genomic Research (TIGR)?
Why?
Craig Venter created TIGR.
While at the NIH, he developed Expressed Sequence Tags (ESTs) to identify genes and wanted to patent those genes, but wasn’t allowed.
What was the faster method of sequencing that Craig Venter developed at TIGR?
Whole genome shotgun sequencing.
Which genomes did the HGP consortium sequence in 1996, 1997, and 1998?
1996: Yeast (12 Mb)
1997: E. coli (4.7 Mb)
1998: C. elegans (97 Mb)
What is Celera Genomics?
Why was it founded?
What did they propose?
Celera Genomics is a for-profit genomics company that was founded by Craig Venter to patent genes.
It was founded because Venter hated the way the Human Genome Project was managed. The NIH had rejected funding for his influenza project, and his group was left out of funding to work on the genome project.
Celera Genomics proposed to sequence the human genome within 3 years in 1998!
Celera Genomics sequenced which genome in 1999, and what did this do?
They sequenced the D. melanogaster genome (160 Mb) in 1999.
This progress from Celera Genomics pushed the government project to redouble its efforts.
What was the 20th century’s last great scientific contest?
The race to sequence the human genome!
Public vs Private
What is the difference between sequencing DNA and sequencing genomes?
Sequencing DNA: obtaining the NT sequence of a gene or segment without knowing where it belongs in the genome.
Sequencing genomes: determining the identity of all 3 billion bp in order, from p arm to q arm, for all chromosomes (where does the DNA go?).
(T/F) Hierarchical sequencing used by public and whole-genome shotgun sequencing used by private differ the most in the assembly process.
True!
What were the three main steps of hierarchical sequencing?
- Selecting (the BAC clones aka pre-sequencing)
- Sequencing (the chosen clones)
- Assembling (individually sequenced clones into an overall sequence)
To select clones for sequencing, a library had to be created. How were these created?
How much coverage did they have?
DNA was obtained from anonymous males (so both the X and Y chromosomes were represented).
The DNA was partially digested with RESTRICTION ENZYMES to create large fragments.
These were cloned into BACs and PACs.
This generated 8 libraries with 1.5 million clones.
~65 fold coverage
1) What is coverage?
2) How is it calculated?
3) Why did we need a high coverage for the genome project?
1) It is the relationship between SEQUENCES (BAC insert, sequence read) and a REFERENCE (a specific position, a locus, chromosome 1 or the entire human genome).
It describes HOW OFTEN, on average, a reference sequence is covered by bases from the reads.
2) Coverage = (avg insert size × # of BACs) / haploid genome size
3) We needed an abundance of starting material so that every sequence we wanted to keep would be present in the library.
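As a worked example of the coverage formula, the sketch below plugs in the library size from these cards (1.5 million clones) and a 3 billion bp haploid genome. The 130 kb average insert size is an assumption, chosen only because it reproduces the ~65-fold figure. It is not a number given in the cards.

```python
def library_coverage(avg_insert_bp, n_clones, haploid_genome_bp):
    """Coverage = (avg insert size x number of clones) / haploid genome size."""
    return avg_insert_bp * n_clones / haploid_genome_bp


coverage = library_coverage(
    avg_insert_bp=130_000,            # assumed average BAC insert size
    n_clones=1_500_000,               # clones across the 8 libraries
    haploid_genome_bp=3_000_000_000,  # ~3 billion bp
)
print(coverage)  # 65.0
```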
Describe the clone fingerprinting technique.
Clone fingerprinting technique is used to map overlapping clones.
1) Digest the BAC clones with RESTRICTION ENZYMES
2) Separate the fragments by agarose gel electrophoresis and look for bands in common
If clones share bands (fragments of the same size), they likely share sequence and are therefore overlapping.
*This was done for 300k BAC clones in hierarchical sequencing.
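The band comparison can be sketched as a set intersection. The fragment sizes and the `min_shared` threshold below are hypothetical, and the sketch ignores measurement error in band sizes, which any real comparison would have to tolerate.

```python
def shared_band_count(fingerprint_a, fingerprint_b):
    """Number of fragment sizes (bands) present in both fingerprints."""
    return len(set(fingerprint_a) & set(fingerprint_b))


def likely_overlap(fingerprint_a, fingerprint_b, min_shared=3):
    """Call two clones overlapping if they share at least min_shared bands."""
    return shared_band_count(fingerprint_a, fingerprint_b) >= min_shared


# Hypothetical restriction fragment sizes (bp) for three clones.
clone_1 = [1200, 2500, 3100, 4800, 7000]
clone_2 = [2500, 3100, 4800, 5600, 9100]
clone_3 = [800, 1500, 6400]

print(likely_overlap(clone_1, clone_2))  # True
print(likely_overlap(clone_1, clone_3))  # False
```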
Fingerprint clone contig is a set of __________ clones that correspond to a ________ chromosomal sequence.
Overlapping; linear (contiguous)
Once overlapping clones were identified via fingerprinting, how were their precise locations in the genome determined?
Physical mapping using SEQUENCE-TAGGED-SITES (STSs)!
These are known sequences (~100-500 bp) that are unique in the genome.
PCR is used to test each clone for specific STSs. If a clone is positive for STS X, we know that the clone comes from wherever STS X lies in the genome.
What does it mean when multiple clones are positive for the same STS?
These clones are overlapping because they all contain the STS!
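The STS logic amounts to grouping clones by the markers they test positive for. A minimal sketch, with hypothetical clone and STS names standing in for real PCR results:

```python
from collections import defaultdict


def clones_by_sts(pcr_results):
    """Invert {clone: set of positive STSs} into {STS: set of clones}."""
    index = defaultdict(set)
    for clone, hits in pcr_results.items():
        for sts in hits:
            index[sts].add(clone)
    return dict(index)


def overlapping_pairs(pcr_results):
    """Pairs of clones that are positive for at least one common STS."""
    pairs = set()
    for clones in clones_by_sts(pcr_results).values():
        ordered = sorted(clones)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


# Hypothetical PCR results: which STSs each clone is positive for.
pcr_results = {
    "BAC_1": {"STS_X"},
    "BAC_2": {"STS_X", "STS_Y"},
    "BAC_3": {"STS_Y"},
    "BAC_4": {"STS_Z"},
}
print(sorted(overlapping_pairs(pcr_results)))
# [('BAC_1', 'BAC_2'), ('BAC_2', 'BAC_3')]
```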
The minimal tiling path/golden tiling path is used for …..?
What is it?
Used for choosing clones for sequencing.
It is a set of overlapping DNA clones that covers an entire genomic region with the minimum number of clones required.
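Choosing a minimal set of clones can be pictured as an interval-cover problem. The sketch below uses a greedy approach as an illustration. It is not necessarily how the project chose its clones. The clone names and map coordinates are hypothetical, and each clone's position on the map is assumed to be already known.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover. clones is a list of (name, start, end) tuples."""
    path = []
    covered_to = region_start
    ordered = sorted(clones, key=lambda clone: clone[1])
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting at or before the covered point,
        # keep the one that extends farthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clones with (name, start, end) positions in kb.
clones = [
    ("A", 0, 150), ("B", 50, 200), ("C", 100, 300),
    ("D", 180, 320), ("E", 290, 450), ("F", 310, 400),
]
print(minimal_tiling_path(clones, 0, 450))  # ['A', 'C', 'E']
```

Three of the six clones are enough to span the region, so only those three would go on to be sequenced.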
How are the chosen clones sequenced in hierarchical sequencing?
- Mechanical fragmentation of the BAC insert DNA into ~1kb pieces
- Cloning of the DNA fragments into plasmids (vectors)
- Sequencing
Why are restriction enzymes not used for fragmenting the BAC inserts when sequencing?
Restriction enzymes target a specific sequence, causing all fragments to have the same ends.
We want to generate as many random ends as possible so that overlapping fragments can be linked together during assembly.
What are sequenced clone contigs? How are they different from fingerprint clone contigs?
Sequenced clone contigs are built from SEQUENCED clones whose end sequences overlap.
Fingerprint clone contigs rely on the sizes of DNA fragments in clones to establish their order and do not involve direct DNA sequencing. Sequenced clone contigs have actual sequence info.
*After selecting and sequencing, sequenced clone contigs are made.
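Joining sequences by their overlapping ends can be illustrated with a toy greedy merge. This is a simplified stand-in, not the assembler used by either project. The reads are made up, and the sketch ignores sequencing errors, reverse complements, and repeats.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest end overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to join; remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads with overlapping ends.
reads = ["ATGCGTAC", "GTACCTGA", "CTGATTCG"]
print(greedy_assemble(reads))  # ['ATGCGTACCTGATTCG']
```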