Lecture 07: DNA sequencing Flashcards
Human genome facts
Human genome length (nucleotides)?
- ~3.1 Gb (haploid)
Human genome length (metres)?
- ~2 m
Human genome mass?
- 0.000000000003 g (3 × 10^-12 g, i.e. about 3 pg)
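The length and mass figures above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an estimate: it assumes a rise of about 0.34 nm per base pair, an average molar mass of about 650 g/mol per base pair, and a haploid genome of about 3.1 × 10^9 bp. None of these values are from the lecture.

```python
AVOGADRO = 6.022e23           # molecules per mole
RISE_PER_BP_M = 0.34e-9       # assumed rise per base pair of B-DNA, in metres
MASS_PER_BP_G_PER_MOL = 650   # assumed average molar mass of one base pair


def dna_length_m(n_bp):
    """Contour length in metres of n_bp base pairs of double-stranded DNA."""
    return n_bp * RISE_PER_BP_M


def dna_mass_g(n_bp):
    """Mass in grams of n_bp base pairs of double-stranded DNA."""
    return n_bp * MASS_PER_BP_G_PER_MOL / AVOGADRO


haploid_bp = 3.1e9            # assumed haploid human genome size
diploid_bp = 2 * haploid_bp   # DNA content of one diploid cell

print(f"haploid: {dna_length_m(haploid_bp):.2f} m, {dna_mass_g(haploid_bp):.2e} g")
print(f"diploid: {dna_length_m(diploid_bp):.2f} m, {dna_mass_g(diploid_bp):.2e} g")
```

Under these assumptions, ~2 m corresponds to the diploid DNA of one cell, while ~3 pg corresponds to a single haploid copy.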
What can be sequenced?
- Whole genome: de novo sequencing or re-sequencing
- Targeted (SNP, RAD, exome)
- Single individual vs. multiple individuals (pool-seq)
- Multiple taxa (metabarcoding, metagenomics)
- RNA → DNA (transcriptome)
Exome: all the exons of the genome, i.e. the sequences that remain in the RNA after splicing -> all sections that potentially code for proteins
De-novo sequencing - vocab
- Reads: the original sequences from the sequencer; the computer compares them and arranges them according to their similarity (overlap) -> the longer the reads, the better for analysis
- Contigs: contiguous sequences built by joining multiple overlapping reads
- Scaffold: contigs placed in order, with information on how far apart the sequences are from each other
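How overlapping reads become contigs can be illustrated with a toy greedy merge. This is a minimal sketch, not how production assemblers work: it assumes error-free reads from one strand, and the function names and the `min_len` threshold are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "ACCTTAG"]
print(greedy_assemble(reads))                  # one contig
print(greedy_assemble(reads + ["CCCGGGAAA"]))  # the extra read overlaps nothing
```

Reads that share no overlap stay as separate contigs. Ordering such contigs and estimating the gaps between them is the job of scaffolding.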
Notable achievements
- 2008: first human genome sequenced by parallel DNA sequencing
- 2010: 185 low-coverage human genomes, 697 exomes
- 2010: first Pleistocene human genome
- 2014: 48 bird genome assemblies
-> Conclusion: learn bioinformatics and programming, because the data sets keep getting bigger
Technology timeline
- 1975: Sanger sequencing
- Automated Sanger sequencing
- 2005: “Next generation” Sequencing (NGS)
Sanger Sequencing
- First DNA sequencing method, used for about 30 years
- Like most sequencing methods, it is template based
- Start with a single strand of DNA; only one strand is copied because a single primer is used
- You set up four “sequencing reactions”, each of which contains:
- DNA template
- Primers
- Nucleotides
- A small proportion of one 32P-labelled dideoxynucleotide (A, T, G or C)
- The dideoxynucleotides stop extension of the DNA chain
- The chains end up with different lengths -> they can be separated by gel electrophoresis
- Separate your four reactions on four lanes of a large gel
(Later, four different fluorophores were used, so all four reactions of one sample could be run in a single lane)
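The logic of reading a Sanger gel can be sketched in a few lines. This is an idealised toy, assuming every possible termination product is present and perfectly resolved. Fragment lengths are counted from the first base after the primer, and the function names are hypothetical.

```python
def sanger_lanes(synthesised):
    """For each dideoxy reaction (A, C, G, T), list the fragment lengths it produces.

    A chain terminates wherever the dideoxy version of a base is incorporated,
    so the reaction for base X yields one fragment per occurrence of X.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesised, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the four lanes from the shortest fragment to the longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTACA")
print(lanes)            # fragment lengths per lane
print(read_gel(lanes))  # sequence recovered from band order
```

Because the shortest fragments run furthest, reading the bands in order of fragment length across the four lanes gives the sequence of the newly synthesised strand.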
Pre- versus post-NGS
- Sanger sequencing: 384 reads per run, up to ~300,000 bp in total
- Roche 454 sequencing (2005): 300,000 reads per run, up to 20,000,000 bp in total
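The comparison can be turned into per-read numbers with simple arithmetic. The sketch below assumes that the bp figures are totals per run, which is an interpretation of the card rather than something it states.

```python
def mean_read_length(total_bp, n_reads):
    """Average read length if total_bp is spread evenly over n_reads."""
    return total_bp / n_reads


# Figures from the card, read as per-run totals (assumption)
sanger = {"reads": 384, "total_bp": 300_000}
roche_454 = {"reads": 300_000, "total_bp": 20_000_000}

for name, run in (("Sanger", sanger), ("Roche 454", roche_454)):
    length = mean_read_length(run["total_bp"], run["reads"])
    print(f"{name}: about {round(length)} bp per read")

fold = roche_454["total_bp"] / sanger["total_bp"]
print(f"Bases per run increased about {round(fold)}-fold")
```

Under that reading, the jump to NGS came from far more reads per run, not from longer reads.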
Current sequencing technologies: Illumina
- Sample prep
- Bind DNA to flowcell, generate clusters
- Sequencing by synthesis
- Data analysis
Illumina sequencing more detailed
- Ligation of adapters to each end of the DNA molecule
- Single strands are coupled to glass slides via the adapters
- Bridge amplification produces “PCR colonies” (“polonies”), also called clusters
- For the subsequent sequencing, nucleotides are blocked, so no more than one can be incorporated per cycle
- Four fluorescent dyes, one for each base, allow detection via pictures
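The cycle logic of sequencing by synthesis can be sketched as a toy simulation. It assumes a single error-free cluster, ignores strand orientation, and uses made-up dye colours. The colour assignments and function name are hypothetical and not Illumina's actual chemistry.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hypothetical dye colours, one per base (illustration only)
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_COLOUR = {colour: base for base, colour in DYE.items()}


def sequence_by_synthesis(template, cycles):
    """Simulate one cluster: one blocked nucleotide is incorporated per cycle.

    Returns the colours seen in the images and the read called from them.
    """
    colours = []
    for position in range(min(cycles, len(template))):
        incorporated = COMPLEMENT[template[position]]  # only one base can be added
        colours.append(DYE[incorporated])              # take a picture of the cluster
        # in the real chemistry the block and dye are now removed before the next cycle
    read = "".join(BASE_FOR_COLOUR[c] for c in colours)
    return colours, read


colours, read = sequence_by_synthesis("TACGGT", cycles=4)
print(colours)
print(read)
```

The number of cycles sets the read length, which is one reason Illumina reads are short.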
Illumina: considerations
- Cluster density: under-clustered, optimally clustered or over-clustered
- Read length and quality:
* High throughput
* High sequencing quality
* Limited read length (to some extent; up to 2 x 300 bp now possible)
* Assembly is a problem
-> Lowest cost per base, but a full run costs $10,000
Advantages: high throughput and high sequencing quality, relatively cheap
Disadvantages: limited read length, quality declines with higher read lengths
Hi-C sequencing
- Based on Illumina sequencing
- Uses chromatin conformation information
- Allows better scaffolding
Example: Chinese mitten crab
Newer technologies
- PacBio
- Oxford Nanopore
- (Bionano)
Primary focus: increase read length → improved genome assemblies
Pacific Biosciences (PacBio)
- Library preparation: a circular template is made by ligating adapters onto dsDNA
- Add primer and polymerase to the sample
- SMRT Cell with zero-mode waveguides (ZMWs)
- Each template molecule sits in its own ZMW
- Each time the polymerase incorporates a labelled nucleotide, light is emitted
Real-time sequencing
Pacific Biosciences
- Strictly, single molecule reaction monitoring
- No washing: cheap on reagents
- No stop-and-go synthesis as in other systems
- Recent upgrade to 8x more reads per run
- Read length is up to ~40,000 bases
- Initially high error rate
- Highly competitive for long reads
Pacific Biosciences HiFi
- Strictly, single molecule reaction monitoring
- No washing: cheap on reagents
- No stop-and-go synthesis as in other systems
- Repeated sequencing of circularized molecule
- Read length is “only” ~15,000 bases on average
- Very high accuracy
- Excellent for de-novo assembly
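Why repeated sequencing of the same circular molecule raises accuracy can be shown with a toy majority vote. This sketch assumes the passes are already aligned, have equal length and contain only substitution errors. Real consensus calling also has to handle insertions and deletions, and the function name is hypothetical.

```python
from collections import Counter


def consensus(passes):
    """Majority vote per position across aligned passes of the same molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this toy example needs passes of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five noisy passes over the same molecule; each error occurs in only one pass
passes = [
    "ACGTTAGC",
    "ACGATAGC",
    "ACGTTAGC",
    "TCGTTAGC",
    "ACGTTAGG",
]
print(consensus(passes))
```

As long as errors fall at different positions in different passes, they are outvoted. More passes give higher accuracy, at the cost of a shorter molecule for the same amount of sequencing.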
Nanopore technology
Proposed and started in the early 1990s in Santa Cruz and Harvard!
* Based on threading a single strand of DNA through a microscopic hole in a membrane
* Creating an electric field across the membrane causes the DNA to pass through
* Measuring the electrical properties of the hole (in practice, the ionic current through it) should tell you which base is passing through it
* Resolution proved to be a bit of a problem, but accuracy is now also excellent: up to 99.9%
Oxford Nanopore principle
Array of microscaffolds
Each microscaffold supports a membrane and embedded nanopore.
Sensor chip
Each microscaffold corresponds to its own electrode that is connected to a channel in the sensor array chip.
ASIC
Each nanopore channel is controlled and measured individually by the bespoke ASIC. This allows for multiple nanopore experiments to be performed in parallel.
Oxford Nanopore in detail (Picture)
Some applications for Nanopore sequencing
- Genome assembly
- Detection of structural variants, e.g. long (> 1 kb) tandem repeats
- RNA-seq analysis of splice isoforms
- Real-time detection of pathogens (e.g. Ebola in the recent epidemic)
Oxford Nanopore sequencing summary
- Extremely long reads (N50 of 50 kb, even > 4 Mb have been reported)
- Directly accessing original DNA molecules
- Capable of distinguishing modified bases directly
- Comparatively cheap and fast, especially sample preparation
- Small versions portable
- Originally low accuracy, now also > 99%
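The N50 quoted above is the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal sketch of the calculation, using made-up read lengths:

```python
def n50(lengths):
    """Length L such that items of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


read_lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical read lengths in kb
print(n50(read_lengths))
```

The same calculation is used for contigs and scaffolds when describing the quality of an assembly.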
Conclusion
- DNA sequencing throughput has increased massively and continues to do so
- NGS has transformed numerous fields of molecular biology
- Data analysis and storage are the current bottleneck
- New technologies focus on read length
- Long read sequencing for de-novo assemblies, Illumina for re-sequencing