Week 2.3 Genomic Technologies Flashcards
Genomic technology timeline
- **1869** DNA extraction – no idea yet what DNA actually did
- **1880s** Microscopy to study chromosomes – still not sure what they did
- **1913** Genetic map of a Drosophila chromosome – the first genetic map. It was made by looking at how different phenotypes in flies were inherited together (linked) and how they segregated, but there was still no idea what was controlling this linkage.
- **1953** Structure of DNA (the double helix) described by Watson and Crick
- **1970** Site-specific restriction enzymes – start of manipulation of DNA: chop up DNA and get different results
- **1977** Sanger sequencing method – first able to sequence DNA
- **1983** Polymerase chain reaction (PCR)
- **1986** Automated Sanger sequencer
- **1996** Pyrosequencing – commercialised in 2006 in the form of 454 sequencing
- **2006** Solexa (Illumina) and 454 sequencing
Over the last 10 to 15 years, there has been huge progress in the work we can do on DNA.
Who led the public project?
How was the public human genome project conducted?
The public project was led by Eric Lander et al.
It started with genetic maps of humans, built by looking at the way different phenotypes are linked together on the chromosomes.
A hierarchical shotgun method was used (the BAC-by-BAC approach):
they broke the human genome up into big pieces, put these into bacteria as bacterial artificial chromosomes (BACs), and grew colonies of the bacteria.
They sequenced each BAC by the shotgun approach: the fragments were all Sanger sequenced and the reads were then put back together.
It is called shotgun because you hit a random scatter of lots of bits of DNA.
This was thought to be easier for dealing with repeats and heterozygosity.
7.5X coverage of the whole genome, Sanger N50 of 82 kbp.
Public project approach;
Why did the public project take this approach?
Was only one genome sequenced?
Genetic maps –> BAC’s –> Shotgun sequencing
They thought this made it much easier to deal with the whole genome and with errors from heterozygosity. DNA was collected from a number of individuals, so they were not sequencing just one genome but a number of sequences; they took this route to deal with the problem of polymorphism. They had 7.5-fold coverage, i.e. on average 7.5 reads covered each base in the genome, and an N50 of 82 kbp, i.e. half of the assembled bases were in contigs of at least 82 kbp. Thus what they published was still very fragmented.
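The N50 figure above can be worked out from a list of contig lengths. The sketch below is only an illustration of the definition, using a made-up toy assembly rather than any real data; the function name `n50` is our own.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly: five contigs, 300 bases in total.
toy_contigs = [100, 80, 60, 40, 20]
print(n50(toy_contigs))  # 100 + 80 = 180 >= 150, so N50 is 80
```

Note that N50 is weighted by bases, not by the number of contigs: here only two of the five contigs are at least 80 long, yet they hold more than half of the sequence.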
Who led the private project?
What method did they use?
How many individuals did they use?
With hindsight was this a good idea?
The private project was led by Venter et al.
- Venter thought the public approach was too labour intensive
- Instead of going BAC-by-BAC, they would just shotgun sequence the whole genome and then take all their 900 base pair fragments and try to fit them together
- They generated 5.1X shotgun coverage themselves, plus 2.9X taken from the public project (shredded into 550 bp segments)
- They used the genomes of 5 individuals, two males and three females: one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians
With hindsight, this was a bad idea because using several individuals introduces a lot of variability.
Ideally, you would not sequence a heterozygous genome at all: you would sequence a single haploid set of 3.2 billion base pairs rather than the 6.4 billion that are found in every diploid cell.
They didn’t anticipate how far the sheer magnitude of investment in genome sequencing would push the cost down.
- Mate pairs: instead of having only random single reads, they had pairs of reads a known distance apart; insert sizes of 2, 10, and 50 kbp were used. This gave a slightly higher N50 of 86 kbp. There is some debate over how this was achieved.
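Fold coverage is just total sequenced bases divided by genome size. As a rough sketch, the arithmetic behind a figure like 5.1X looks like this. The function names are our own, and treating 550 bp as the read length is an assumption made purely for illustration, not a figure for the project's actual reads.

```python
def coverage(num_reads, read_length, genome_size):
    """Average fold coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return target_coverage * genome_size / read_length


GENOME_SIZE = 3_200_000_000  # haploid human genome, as quoted in these notes
READ_LENGTH = 550            # assumed read length, for illustration only

n = reads_needed(5.1, READ_LENGTH, GENOME_SIZE)
print(f"{n / 1e6:.1f} million reads for 5.1X coverage")
```

Coverage is an average: with random reads some bases are covered many more times than others, and some not at all.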
Public project method
- Genomic DNA is broken up into BAC libraries
- Then those are each shotgun sequenced
- Placed on a map
Private Project
- Shotgun sequenced the whole of the genome,
- then tried to assemble that back together - by using mate-pair method of known lengths
Island analogy – private project approach
Imagine taking an aerial photo of an island from an aeroplane whose camera can only capture a very small area at a time; this is what DNA sequencing is like, because it can only sequence a small fragment each time.
‘Photos’ are taken at random; we just have photos with no idea where they come from.
By overlapping the photos we can attempt to jigsaw them together. Problem: identical regions.
The solution is a pairing technique (Venter project): instead of one camera you have two, so photos can be placed relative to each other because of the known distance between them.
The equivalent in the human genome is that some regions are very repetitive, so it is very hard to place a read of about 900 bases that comes from a repetitive region. But if that read is paired with one from a unique region, we can work out where the repetitive read belongs. This was the technique used by the private project within their whole genome shotgun approach.
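The pairing idea can be sketched on a toy sequence. Everything here is hypothetical: the "genome", the reads and the helper names are invented for illustration, and the insert size is simplified to the distance between the start positions of the two reads. A real assembler does not have the genome to search against; this only shows why a known distance to a unique mate resolves a repeat.

```python
def find_all(genome, read):
    """Return every start position at which read occurs in genome."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def place_with_mate(genome, repeat_read, unique_mate, insert_size, tolerance=2):
    """Keep only the placements of repeat_read whose distance to the
    uniquely placed mate matches the known insert size."""
    mate_positions = find_all(genome, unique_mate)
    if len(mate_positions) != 1:
        raise ValueError("mate must map to exactly one position")
    mate_pos = mate_positions[0]
    return [p for p in find_all(genome, repeat_read)
            if abs(abs(p - mate_pos) - insert_size) <= tolerance]


# Toy genome with the repeat GATTACA present twice.
genome = "AAGCTT" + "GATTACA" + "GGTTCCAGTC" + "GATTACA" + "TTGACG"

print(find_all(genome, "GATTACA"))                       # two candidate placements
print(place_with_mate(genome, "GATTACA", "AAGCTT", 23))  # mate distance picks one
```

On its own the repeat read fits in two places; the unique mate and the known distance leave only one.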
Island analogy – public project approach (Lander project)
Public project approach (Lander project): the BAC approach breaks the island up into smaller areas, much bigger than the individual reads but still smaller than the whole island. These pieces were put into bacterial artificial chromosomes, and by growing colonies of bacteria they were able to amplify specific segments of the human genome. They were then able to do their shotgun sequencing on just these smaller regions of the human genome.
Venter Project (private): the red line depicts the human genome. Some of it is covered by individual Sanger reads that could be assembled together, and there are also pairs of reads at different distances apart from the mate-pair technique, some distances bigger than others. By assembling all of them together they were able to produce a human genome sequence. On average, each base pair had about 7 reads of actual Sanger sequence covering it.
Lander Project (public project): the assembly was based upon the previous assemblies of the individual bacterial artificial chromosomes, and the full assembly was compiled from these longer sequences. Placement of the BACs was assisted by genetic markers that had been identified from linkage maps.
What are contigs?
What letter is used to denote unknown base?
Contigs
>contig001
CTTCACCTTTTAAGGGTA
GGACGTCAGCAATCATGA
ATACTTTTTGAGGAAGTC
AATATATGCGGATTTCTGTC
Contigs are sequenced fragments of the genome in which we know what the sequence is.
Particularly in the public project, they were able to put the contigs into scaffolds: longer sequences in which we know how far apart the contigs are relative to each other, but not exactly what lies between them. ‘N’ is used to denote an unknown base.
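As a small sketch of the contig/scaffold idea, the snippet below joins two contigs with a run of N's standing in for a gap of known length. The helper `build_scaffold`, the scaffold name and the gap length of 10 are hypothetical, chosen only for illustration; the two sequences are the first two lines of the contig shown above.

```python
def build_scaffold(contigs, gaps):
    """Join ordered contigs into one scaffold, filling each gap of known
    length with 'N' (unknown base)."""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap length between each pair of contigs")
    pieces = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        pieces.append("N" * gap)  # we know the distance, not the bases
        pieces.append(contig)
    return "".join(pieces)


contigs = ["CTTCACCTTTTAAGGGTA", "GGACGTCAGCAATCATGA"]
scaffold = build_scaffold(contigs, gaps=[10])  # assumed gap of 10 bases
print(">scaffold001")
print(scaffold)
```

The scaffold is longer than the sum of its contigs, but the extra length carries no sequence information, only spacing.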
Since 2001, how has sequencing improved?
What is the latest version of the human genome?
The human genome assembly has been improved since 2001. The latest version (GRCh38) has an N50 of 67,794,873 bp, so half the genome is in contigs longer than 67 million bp, in contrast with only 82,000 bp in 2001. This has been enabled by new DNA sequencing technologies.
DNA sequencing technologies Moore’s Law chart
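What is Moore’s law?
Moore’s law is the observation that computing power (the number of transistors on a chip) doubles roughly every two years; it is used here as a benchmark for how fast the cost of sequencing falls. It would have cost about $100 million to sequence one genome in 2001; in 2012 it is about $1,000. The cost followed Moore’s law up until 2007, but then next generation sequencing technologies came about that took a completely new approach to DNA sequencing.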
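To see how far the real cost curve departs from a Moore's-law-style trend, here is a rough projection. It assumes the cost simply halves every two years from the $100 million starting point in 2001; the function name and the two-year halving time are our own simplifications.

```python
def moores_law_cost(start_cost, start_year, year, halving_years=2.0):
    """Cost projected by a Moore's-law-style trend: halving every halving_years."""
    return start_cost * 0.5 ** ((year - start_year) / halving_years)


for year in (2001, 2007, 2012):
    projected = moores_law_cost(100_000_000, 2001, year)
    print(year, f"${projected:,.0f}")
```

Under this assumption the projected cost in 2012 would still be roughly $2.2 million, far above the figure of about $1,000 quoted above, which is why the new sequencing technologies are described as a break from the trend.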
Sanger Sequencing
What is the Sanger sequencing method?
Sanger Sequencing: chain termination method
Replicate the DNA as it would be replicated in a human nucleus that was about to divide.
As well as the normal A, G, T and C nucleotides (dNTPs), you add a low concentration of dideoxynucleotides (ddNTPs).
These are slightly modified nucleotides: the ddNTPs lack the 3’-OH group necessary for the next phosphodiester bond in a DNA chain.
As soon as one of these ddNTPs is incorporated, replication of that DNA chain stops, as it is now impossible to add another base (dNTP): chain termination.
If you replicate a fragment of DNA many times, the chains produced will all stop at different points because the ddNTPs are incorporated at random.
Thus, a mix of chains of different lengths is produced.
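A quick simulation shows why random ddNTP incorporation gives a mix of chain lengths. This is a toy model under stated assumptions: every position has the same chance `dd_fraction` of receiving a ddNTP, and the template length, copy number and 5% fraction are arbitrary illustrative values.

```python
import random


def chain_termination_lengths(template_length, n_copies, dd_fraction, seed=0):
    """Simulate copying a template n_copies times; each added base has
    probability dd_fraction of being a chain-terminating ddNTP."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_copies):
        length = 0
        while length < template_length:
            length += 1                     # one more base is incorporated
            if rng.random() < dd_fraction:  # it happened to be a ddNTP
                break                       # no 3'-OH, so the chain stops here
        lengths.append(length)
    return lengths


lengths = chain_termination_lengths(template_length=50, n_copies=1000, dd_fraction=0.05)
print(sorted(set(lengths)))  # the distinct chain lengths that were produced
```

Keeping the ddNTP concentration low matters: if `dd_fraction` were high, almost every chain would stop within the first few bases and the longer fragments would be missing.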
4 Tube Sanger sequencing
How does this work?
How do you copy DNA?
What is used to identify the base?
What was this the first for?
A primer is needed for the DNA polymerase that will be copying the DNA.
Then you start copying the DNA, and you do this in 4 different tubes:
Tube 1: you have ddATP, a modified A base, which you add as well as the normal bases
Tube 2: modified C
Tube 3: modified G
Tube 4: modified T
(In each tube you have a mix of all the normal bases)
In tube 1 the fragments of DNA will all stop at A’s.
In tube 2 they will stop at C’s, in tube 3 at G’s, and in tube 4 at T’s.
It is random which gets incorporated, because you still have all the normal bases, so sometimes the chain will stop at the first matching base, sometimes the third, the fourth and so on. This can go on for about 900 base pairs.
You take the products of the different tubes and run them side by side in a gel with an electric current across it. As DNA is charged, it moves through the gel, and short fragments move faster than long fragments.
Reading up the gel, you can see which base is the final base on each fragment of DNA: each band marks a point where replication terminated in one of the tubes, because the final base was the one present as a dideoxynucleotide (ddNTP) in that tube.
So for the first time you could read the sequence of the DNA straight off the gel. It was a very clever idea.
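The logic of reading a four-lane gel can be sketched as follows. This is an idealised model, assuming every possible termination point shows up as a band; the function names and the example strand are made up. The sequence read off is that of the newly synthesised strand, which is complementary to the template.

```python
def four_tube_bands(new_strand):
    """For each ddNTP tube, list the fragment lengths that end in that base."""
    tubes = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(new_strand, start=1):
        tubes[base].append(position)  # a chain of this length ends in `base`
    return tubes


def read_gel(tubes):
    """Read bands from shortest to longest, noting which lane each band is in."""
    bands = sorted((length, base)
                   for base, lengths in tubes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)


strand = "GATTACA"  # hypothetical newly synthesised strand
tubes = four_tube_bands(strand)
print(tubes)            # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_gel(tubes))  # GATTACA
```

Each fragment length appears in exactly one lane, so sorting the bands by length and noting the lane recovers the sequence one base at a time.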
Capillary method Sanger sequencing
What is different about this?
How are the bases read off?
Capillary method Sanger sequencing: a fluorescent molecule is bound to the terminating nucleotides, so A, T, G and C each have a different colour. This meant you did not have to have 4 tubes: you could have one tube, and the different ddNTPs would show up as different colours. You could read the bases off from the colours.
Template DNA One of the major costs of the Sanger sequencing methods
What do you require for the Sanger chain termination method?
What is the alternative, and what is the cost associated with this?
What did the post-Human Genome Project technologies try to tackle?
What came out in 2004, and at the end of 2005?
Template DNA: one of the major costs of the Sanger sequencing method is that you require thousands of identical copies of the template DNA, produced by cloning. The alternative is PCR with highly specific primers, but this adds to the cost and time of Sanger sequencing. This was something that new technologies tried to overcome. After the human genome project, new technologies began to come in. A new capillary sequencer came out in 2002, but it was not a step change in technology. In 2004 the 454 pyrosequencer came out, and that led to an increase in the output of DNA sequences. At the end of 2005 Solexa/Illumina led to a huge increase in the rate at which we sequence DNA.