Sequencing DNA Flashcards
primer
provides free 3’ OH for synthesis to begin from (5’-3’ synthesis - requires a free OH for addition of the next nucleotide)
sanger sequencing primer
need to know primer sequence to begin sanger sequencing
many organisms share primer sequences
can then infer the sequence after the primer
sometimes more difficult than this
deoxynucleotides
added to growing chain during in vivo DNA synthesis
a diphosphate (PPi) is released and the nucleotide is covalently added leaving a hydroxyl OH available for the next base to be added (further synthesis)
Terminating ddNT
lack 3’ OH
stop synthesis when they are added to growing chain
sanger dideoxy sequencing
supply mix of 3 deoxy nucleotides and one dideoxynucleotide
(only one of bases is dideoxy)
synthesis reaction will terminate when a dideoxynucleotide is incorporated
measure length of terminated fragment and this tells us where the termination happened
this is not so useful as only gives the first time this base appears
instead supply all 4 dNT and one type of ddNT
radiolabel the NT
run fragments on gel
fluorescent label dideoxy sanger sequencing
radiolabelled ones not so good as they mean handling a radioactive substance
instead use 99% dNT and 1% differently fluorescently tagged (depending on base) ddNT
can then run on gel
measure length and colour of the terminated fragments
this tells us:
-where the terminations occurred in the sequence
-what base was involved
can infer sequence from this
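A minimal sketch of the idea above, assuming a made-up 12-base strand and treating each incorporated nucleotide as a ddNT with 1% probability. The function names (`sanger_fragments`, `read_gel`) are hypothetical and only for illustration; sorting the terminated fragments by length and reading the end base of each recovers the sequence.

```python
import random

def sanger_fragments(synthesised, n_molecules=5000, dd_fraction=0.01, seed=0):
    """Simulate chain termination; return a set of (fragment length, terminal base)."""
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_molecules):
        for position, base in enumerate(synthesised, start=1):
            # With probability dd_fraction the nucleotide added here is a ddNT,
            # which lacks the 3' OH, so this molecule stops growing.
            if rng.random() < dd_fraction:
                fragments.add((position, base))
                break
    return fragments

def read_gel(fragments):
    """Order fragments by length and read the colour (base) at the end of each."""
    return "".join(base for _, base in sorted(fragments))

synthesised = "ATGCCGTAAGCT"  # made-up strand built after the primer
print(read_gel(sanger_fragments(synthesised)))
```

With enough molecules every position ends up represented by at least one terminated fragment, which is why the reaction needs many copies of the template.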
automating sanger dideoxy sequencing
run many reactions in parallel capillaries
automatically recording the fluorescence signal
parallel decoding of fluorescence signals
gives us a sequencing chromatogram
with the fluorescence signal at each location and its corresponding base in sequence
sanger dideoxy error rates
use PCR to amplify DNA samples for sequencing (need millions of copies to produce detectable signal)
PCR can introduce errors (about 1 in 10^4)
-meaning a base is misincorporated about 1 time in 10^4
base call quality (Q score) - a log-scaled score derived from the probability of the call being an error
10 - 1 in 10 - 90% base call accuracy
20 - 1 in 100 - 99%
30 - 1 in 1000 - 99.9%
40 - 1 in 10,000 - 99.99%
50 - 1 in 100,000 - 99.999%
sanger sequencing reads usually 300-1000 bases in length of >Q30 base calls
- so there is a probability of base calls being an error
-and it is slow, serial, expensive
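The Q values listed above follow the standard Phred relation Q = -10 x log10(P), where P is the probability that the call is wrong. A small sketch of the conversion in both directions (function names are just illustrative):

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Quality score for a base call with error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p):,} calls wrong, accuracy {100 * (1 - p):.3f}%")
```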
illumina sequencing
short read sequencing
is NGS
invention of NGS tech (mainly illumina) caused insane drop in cost per megabase of sequencing
similar to sanger - as in it uses terminators
BUT it is REVERSIBLE terminator sequencing
uses fluorescently labelled “reversibly-blocked” nucleotides
allows the sequence to be read one base at a time
-incorporate fluorescently labelled base
-read fluorescence signal (different depending on base)
-remove block on 3’OH
-remove fluorophore
-then incorporate the next fluorescently labelled nucleotide
commercialised by Illumina
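The cycle above can be sketched as a loop, one base per cycle. This is a toy model only: the colour assigned to each base is made up, and the template is given in the order the polymerase meets it.

```python
# Made-up colour assignment; real instruments use their own dye schemes.
COLOUR_OF = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_OF = {colour: base for base, colour in COLOUR_OF.items()}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sequence_by_synthesis(template, n_cycles):
    """Return the colour recorded in each cycle of reversible-terminator sequencing."""
    colours = []
    for cycle in range(min(n_cycles, len(template))):
        incoming = COMPLEMENT[template[cycle]]  # incorporate one blocked, labelled base
        colours.append(COLOUR_OF[incoming])     # read the fluorescence signal
        # The 3' OH block and the fluorophore are removed at this point,
        # which is what lets the next cycle add exactly one more base.
    return colours

def call_bases(colours):
    """Translate the recorded colours back into bases."""
    return "".join(BASE_OF[colour] for colour in colours)

print(call_bases(sequence_by_synthesis("TACGGA", 4)))
```

The number of cycles sets the read length, which is why reads are short and all the same length.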
Illumina set up
sequencing takes place in many flow cells
doesn't require knowledge of a primer sequence
instead uses adapters and primers are used for those adapters
lawn of adapters stuck to surface of slide act as primers to amplify the DNA fragments
BRIDGE AMPLIFICATION
clusters grow clonally from same individual fragment
need to do this as need to make many copies of sequence so signal can be seen by illumina machine detectors
each cluster identified by physical location on the slide
sequence is detected from the order of the colours of fluorescence from the cluster
gain a sequence for each cluster
illumina benefits
don't need to know primer sequences
Illumina NovaSeq can generate up to 3 terabases (3x 10^12) per run
up to 20 billion reads (2x 10^10) per run
150 bases per read each way - 300 total max length
average Q>=30
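Quick arithmetic check on the figures above, assuming each 150-base direction counts as one read. The genome size used for the coverage estimate (about 3.1 Gb for human) is an assumption that is not in these notes.

```python
reads_per_run = 20_000_000_000   # up to 2 x 10^10 reads
bases_per_read = 150             # one direction of a paired read

total_bases = reads_per_run * bases_per_read
print(f"{total_bases:.1e} bases per run")

# Assumed genome size for illustration only (roughly a human genome).
assumed_genome_size = 3_100_000_000
print(f"about {total_bases / assumed_genome_size:.0f}x coverage of one such genome")
```

So the read count and read length together account for the 3 terabase figure.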
Illumina drawbacks
other machines can produce longer reads - which are more useful for genome sequencing
so illumina not as good for that as them
errors occur
sometimes systematic - due to underlying properties of sample sequence
long read technologies
2 main players:
-Pacific Biosciences single molecules long read - PacBio SMRT/ SEQUEL
-Oxford nanopore technologies (ONT). Synthetic nanopores and minION, PromethION instruments
promethion = multiple minION put together
difference from sanger and illumina - no terminators (PacBio uses fluorescently labelled dNT, ONT reads an electrical current)
PacBio SMRT / SEQUEL
single molecule sequencing using
FLUORESCENTLY LABELLED DEOXYNUCLEOTIDES
fluorescent label is on the PPi which is removed from dNT when incorporated into DNA chain
ssDNA input
dsDNA out
process of incorporating this fluorescently labelled dNT releases light which can be detected by the machine
-zero mode waveguide illumination of the polymerase
-real time monitoring of nucleotide incorporation
DNA pol is fixed to bottom of the well
light of diff wavelength released for each base
can detect this peak and infer sequence from the order
outputs of the three types of sequencing so far
sanger - labelled bars
illumina - pictures
PacBio - movie - lots of data to analyse - requires powerful computer
PacBio SMRT data
long - 1kb to 30kb - mean of ~15kb
up to 30 Gbase per run
low quality - Q scores of 10-12 - error rate of 1/10 - 1/15
most errors are insertions and deletions (indels)
but these can be corrected with HiFi
errors are mostly random with reference to underlying sequence so reads tend to correct for each other
PacBio best uses
genome assembly
identifying duplications
identifying splice isoforms in mRNAs
PacBio HiFi libraries
add adapters - no requirement of knowing sample sequence
adapters are in a loop at each end so give a circular molecule when added to end of dsFragment
Add polymerase
Fixed polymerase runs around the circle many times to produce many copies
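Why many passes around the circle help: if errors are random, a per-position majority vote across passes cancels most of them. A minimal sketch under a simplifying assumption: errors here are substitutions only, so passes stay aligned. Real reads have mostly indels and need alignment first. The function names are hypothetical.

```python
import random
from collections import Counter

def noisy_pass(molecule, error_rate, rng):
    """One pass around the circle, with random substitution errors."""
    out = []
    for base in molecule:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def consensus(passes):
    """Majority vote at each position across all passes."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

rng = random.Random(1)
molecule = "".join(rng.choice("ACGT") for _ in range(200))      # made-up insert
passes = [noisy_pass(molecule, 0.1, rng) for _ in range(25)]    # ~Q10 per pass

single_pass_errors = sum(a != b for a, b in zip(passes[0], molecule))
consensus_errors = sum(a != b for a, b in zip(consensus(passes), molecule))
print(single_pass_errors, consensus_errors)
```

The same reasoning applies to overlapping reads from different molecules correcting each other in an assembly.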
Oxford nanopore technology ONT
use real cells as starting point
cells have pores allowing trafficking of molecules in and out of cells
can use pores to traffic DNA molecules through
electrical signal will change depending on sequence of the DNA molecule
run current through pore
current changes depending on properties of the molecule passing through pore (size, charge)
different bases have diff size/charge
single membrane protein nanopore is embedded in a synthetic membrane
current passed through pore
use deviations in current to infer sequence
-single DNA molecule passes through pore occluding current flow
-current is affected differently depending on the sequence of about 6-7 DNA bases in the ssDNA
-the sequence is read by modelling the current using neural network computing
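A toy sketch of why the current depends on a window of bases rather than a single base. Note this swaps the neural network for a plain nearest-value lookup, uses a 3-base window instead of 6-7 to keep the table small, and the current levels are invented numbers.

```python
from itertools import product

K = 3  # toy window size; the notes above give about 6-7 bases for a real pore

# Hypothetical current level (arbitrary units) for every possible k-mer.
LEVEL_OF = {
    "".join(kmer): 60.0 + 0.5 * i
    for i, kmer in enumerate(product("ACGT", repeat=K))
}

def squiggle(dna):
    """Current level recorded as each successive k-mer sits in the pore."""
    return [LEVEL_OF[dna[i:i + K]] for i in range(len(dna) - K + 1)]

def basecall(levels):
    """Match each measurement to the closest k-mer level, then stitch the k-mers."""
    kmers = [min(LEVEL_OF, key=lambda k: abs(LEVEL_OF[k] - level)) for level in levels]
    return kmers[0] + "".join(kmer[-1] for kmer in kmers[1:])

noisy = [level + 0.1 for level in squiggle("GATTACA")]  # small measurement noise
print(basecall(noisy))
```

In this toy every k-mer has its own well-separated level. In real data levels overlap and are noisy, which is where the low raw quality and the need for a trained model come from.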
ONT nanopore sequencing types
minION
-one cell
-500 pores
handheld sequencer
promethion
-up to 48 cells
-3000 pores per cell
production sequencer
ONT data quality
low quality raw data
BUT errors also random so multiple copies correct for each other
long depending on input
mean of about 50kb
longest read so far >2Mb
up to 50 Gbase (minION) and 200 Gbase (promethion) per run
low quality raw data - Q11-12 (1/10 - 1/15 error rate)(most errors are insertions and deletions, indels)
ONT nanopore sequencing best uses
same as PacBio
Genome assembly
Identifying duplications
identifying splice isoforms in mRNAs (can directly sequence RNA different to PacBio)
Genome assembly
telomere to telomere
sequence starts at one telomere and goes through to the other at other end of chromosome
^^ideal genome assembly
problems in getting T to T
rDNA
centromeric satellites
Censat and SDs
SDs (segmental duplications)
RepMask
repetitive units i guess
collapse in assembly to one region??
long read sequencing (pacbio, ont) helped to get T to T of all autosomes and x chromosome
~2022
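A toy illustration of the collapse problem with repeats, assuming idealised error-free reads of one fixed length and made-up flank and repeat sequences. Reads shorter than the repeat look identical whether there are two or three copies, so an assembler cannot tell the copy number. Reads long enough to span the whole array can.

```python
def reads(genome, read_length):
    """Every possible error-free read of one fixed length (an idealised read set)."""
    return {genome[i:i + read_length] for i in range(len(genome) - read_length + 1)}

# Made-up sequences: unique flanks around a tandem repeat.
LEFT, UNIT, RIGHT = "ACCTGA", "GATTACAT", "TTGCAC"
two_copies = LEFT + UNIT * 2 + RIGHT
three_copies = LEFT + UNIT * 3 + RIGHT

short_read = 6                    # shorter than one repeat unit
long_read = 3 * len(UNIT) + 2     # spans all three copies plus a flank base each side

print("short reads identical:", reads(two_copies, short_read) == reads(three_copies, short_read))
print("long reads identical:", reads(two_copies, long_read) == reads(three_copies, long_read))
```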