Week 5 (Long Read and Element) Flashcards
ability to resolve a _________ structure is dependent on the length of the molecules in your library
repetitive
segmental duplications
“low copy repeats”: blocks that range from 1 to 400 kb in length, occur at more than one site within the genome, and typically share a high level (>90%) of sequence identity
segmental duplications make up about ____% of the human genome
5
long read technology
- Oxford nanopore (ONT) (protein nanopores)
- Pacific Biosciences - PacBio (SMRT)
- proximity ligation (assembly)
ONT
Oxford nanopore (protein nanopores)
SMRT
single molecule real time sequencing
__________ is a heptameric protein pore with an inner diameter of a few nanometers
a-hemolysin
the diameter of a-hemolysin is on the same scale as many single molecules, including DNA. Why?
so that a single DNA strand can be threaded (extruded) through the pore in the membrane
ONT (protein nanopore) can be used in real time in the field. 10-20 Gb are read in less than _______
24 hours (standard is 72 hours)
where is a-hemolysin derived from?
it was discovered in Staphylococcus aureus (staph); the pathogen uses this protein pore to penetrate cells in the body
how is DNA extruded through the membrane using a protein nanopore?
the pore sits in the membrane; a tether holds the DNA at the pore and a motor protein moves the DNA through the pore
How do we read the bases as they exit the protein pore?
as the DNA passes through the pore, each base disrupts the ion current in a base-specific way because of its structure, so we can estimate which base is passing through from the change in current
using protein pores, _____ bases are read per second
400
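A rough back-of-the-envelope check on the two numbers above (400 bases per second per pore, 10-20 Gb in under 24 hours). The sketch estimates how many pores would have to be reading at once. The function name `pores_needed` and the assumption that every pore reads nonstop for the whole run are illustrative simplifications, not figures from the lecture.

```python
import math

BASES_PER_SECOND = 400  # per pore, taken from the card above


def pores_needed(target_bases, hours=24, rate=BASES_PER_SECOND):
    """Pores that must read nonstop to reach target_bases within `hours`.

    Assumption: every pore reads continuously at `rate` bases per second.
    """
    bases_per_pore = rate * hours * 3600
    return math.ceil(target_bases / bases_per_pore)


if __name__ == "__main__":
    for gigabases in (10, 20):
        print(gigabases, "Gb ->", pores_needed(gigabases * 10**9), "pores")
```

One pore at 400 bases per second reads about 34.6 Mb in 24 hours, so a 10-20 Gb run only makes sense if hundreds of pores are reading in parallel.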
long read sequencers are important for resolving ________ sequences
repeat
____ Mb is the largest long read that has been read (the largest read is the largest chromosome)
4.2
selective sequencing
the protein nanopore is able to choose only the sequences that we are interested in; it will reject and eject a molecule if it has seen it already and then restart with a new sequence
why is selective sequencing a really great tool?
it will save time and resources
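The accept-or-eject decision described above can be sketched as a toy function. Everything here is hypothetical and for illustration only: the name `decide`, the k-mer matching rule, and the `seen_prefixes` set are assumptions, not how the sequencer's own software works.

```python
def decide(read_prefix, target_kmers, seen_prefixes, k=8):
    """Return 'sequence' to keep reading the molecule or 'eject' to reject it.

    Toy rule (an assumption): eject anything already seen, otherwise keep
    the molecule only if its first bases contain a k-mer from a target.
    """
    if read_prefix in seen_prefixes:
        return "eject"  # already seen: do not spend pore time on it again
    seen_prefixes.add(read_prefix)
    kmers = {read_prefix[i:i + k] for i in range(len(read_prefix) - k + 1)}
    if kmers & target_kmers:
        return "sequence"
    return "eject"


if __name__ == "__main__":
    targets = {"ACGTACGT"}  # hypothetical 8-mer from a region of interest
    seen = set()
    for prefix in ("TTACGTACGTAA", "GGGGGGGGGGGG", "TTACGTACGTAA"):
        print(prefix, "->", decide(prefix, targets, seen))
```

The point of the sketch is that the decision uses only the first bases of the molecule, which is why rejecting early saves time and resources.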
what can protein nanopores read in one read?
- bases sequenced
- bases inserted
- bases deleted
- SNVs
- CpG methylations
centromeres are found in the ________ of the chromosome
middle
telomeres are found at the _____ of the chromosome
end
what is the difference between the average read lengths of Illumina sequencing and ONT (protein pore)?
- Illumina = 150 bp
- ONT = 33-35 kb
what is a major benefit of ONT (protein nanopore)?
read length
what is the average read length of ONT (protein nanopore)?
33-35 kb
in ONT, about _____ bp/sec
400
what type of error occurs in ONT (protein nanopore)? what is the accuracy?
- homopolymers
- accuracy: >99%
homopolymers
types of repeats
SINEs: _____ bp
LINEs: _____ bp
- SINEs: 500 bp
- LINEs: 5,000 bp
what type of technology is needed to read LINEs accurately?
long reads; short reads would be mostly inaccurate for LINEs
ONT resolves the problem of ___________ __________, while Illumina does not because of its short read length
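Why read length matters for repeats can be shown with a simple check: a read resolves a repeat only if it covers the whole repeat plus some unique sequence on both sides. The sizes come from the cards above; the function name `spans_repeat` and the 50 bp flank are assumptions for illustration.

```python
READ_LENGTHS = {"Illumina": 150, "ONT": 33_000}  # bp, from the cards above
REPEAT_LENGTHS = {"SINE": 500, "LINE": 5_000}    # bp, from the cards above


def spans_repeat(read_length, repeat_length, flank=50):
    """True if one read can cover the repeat plus `flank` unique bp per side."""
    return read_length >= repeat_length + 2 * flank


if __name__ == "__main__":
    for tech, read_len in READ_LENGTHS.items():
        for repeat, repeat_len in REPEAT_LENGTHS.items():
            print(tech, repeat, spans_repeat(read_len, repeat_len))
```

With these numbers a 150 bp read spans neither repeat type, while a 33 kb read spans both.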
segmental duplications
nano-wells called zero-mode waveguides (ZMWs)
nano-wells that physically restrict the laser excitation light, so fluorescence is detected only from a tiny volume at the bottom of the well
where is the fluorophore attached in PacBio so the sequence can be read? what is its formal name?
on the phosphate of the nucleotide (phospholinked nucleotide)
the mean read lengths of pacbio are >_____ kb
20
what makes pacbio really accurate?
CCS (circular consensus sequencing) / high-fidelity (HiFi) system; it is able to go over the same strand multiple times and build a consensus
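The idea of going over the same strand multiple times can be sketched as a per-position majority vote. This is a toy version under strong assumptions: the passes are already aligned, have equal length, and contain only substitution errors. The function name `consensus` is made up for this example and is not PacBio's algorithm.

```python
from collections import Counter


def consensus(passes):
    """Most common base at each position across equal-length passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this toy version expects passes of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


if __name__ == "__main__":
    # Five passes over the same molecule, each with at most one random error.
    passes = ["ACGTTAGC", "ACGATAGC", "ACGTTAGC", "ACCTTAGC", "ACGTTAGG"]
    print(consensus(passes))
```

Because the errors fall at different positions in different passes, each one is outvoted. Using an odd number of passes avoids most ties.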
would you rather have systematic error or random error?
random error is better than systematic error because you can overcome random error with depth of sequencing (rerunning it), whereas systematic error will not be overcome
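A small calculation shows why depth helps with random error but not with systematic error. It assumes independent errors and, pessimistically, that all wrong reads agree on the same wrong base. The function name `majority_wrong` is an assumption, and a systematic error is modeled here simply as a per-read error probability of 1.0 at the affected site.

```python
from math import comb


def majority_wrong(p_error, depth):
    """P(more than half of `depth` reads are wrong at one site).

    Assumption: reads err independently with probability p_error.
    """
    return sum(
        comb(depth, k) * p_error**k * (1 - p_error) ** (depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )


if __name__ == "__main__":
    for depth in (1, 5, 15, 31):
        print(depth, majority_wrong(0.10, depth), majority_wrong(1.0, depth))
```

With a 10% random error rate the chance of a wrong majority call shrinks quickly as depth grows, while the systematic case stays at 1.0 no matter how many reads are added.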
what is a Q value?
a Phred-scaled quality score that encodes the probability of error per base (higher Q means fewer errors)
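The standard Phred relationship between a Q value and the per-base error probability P is Q = -10 * log10(P). A minimal sketch of the conversion in both directions (the function names are made up for this example):

```python
import math


def phred_to_error(q):
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score for a per-base error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print("Q" + str(q), "->", phred_to_error(q))
```

An accuracy above 99% therefore corresponds to a Q value above 20, and Q30 means about 1 error in 1,000 bases.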
PacBio reads ~____bp/sec
10
what are common errors in PacBio?
homopolymers and indels
what is a homopolymer?
a run of the same base repeated over and over again (e.g., AAAAAA)
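To make the definition concrete, the sketch below lists the homopolymer runs in a sequence. The function name `homopolymer_runs` and the minimum run length of 3 are arbitrary choices for illustration.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for each run of one base >= min_len."""
    runs, pos = [], 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length  # advance to the start of the next run
    return runs


if __name__ == "__main__":
    print(homopolymer_runs("ACGTTTTTAGG"))  # [(3, 'T', 5)]
```

Runs like the five Ts here are where a sequencer most easily miscounts the number of bases, which shows up as an insertion or deletion.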
what is an indel?
(insertions and deletions) - a genetic mutation that occurs when one or more DNA bases are inserted or deleted from a genome
what is PacBio’s accuracy?
> 99%
what is the common error type in Illumina? Why?
base substitution because it sequences one nucleotide at a time
what is the common error type in ONT? Why?
homopolymers and indels because it is working so fast that it isn’t catching all of the sequence as it records the charge
what is the common error type in PacBio? Why?
homopolymers and indels because it is working so fast that it isn’t catching all of them
what is “proximity ligation: Hi-C”?
it is a library preparation step, genome scaffolding
when a linear molecule is coiled up, is a given point more likely to be close to something a couple of kb away or several Mb away?
it will be closer to something a couple kb away (the closer in linear space, the closer in 3D space it should be)
when DNA is compacted into chromatin (a 3D structure), the DNA that is close together is ______________, trapping sequence interactions across the entire genome and between different chromosomes
cross linked
what does cross linking do?
cross linking traps sequence interaction across the entire genome and between different chromosomes
crosslinked DNA is fragmented with ____________
endonucleases
what is proximity ligation?
after the crosslinked DNA is fragmented, the fragment ends are biotinylated and ligated, creating chimeric junctions between adjacent sequences
what is proximity ligation used for?
to assign contigs to chromosomes and to order and orient them along chromosome-scale scaffolds
_______ is currently the long read technology of choice
PacBio
PacBio increases read lengths and increases throughput = ___________ cost per genome
decreasing
ONT has extremely long reads but relatively few making it ___________
expensive
which technology has no upper limit on the size of template that can be sequenced giving it huge potential?
Oxford nanopore (ONT)
what is the purpose of HiC?
order and orient contigs
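The ordering part can be sketched with a toy greedy chain: contigs that share more Hi-C contacts are assumed to sit closer together on the chromosome. The contact counts, the function name `order_contigs`, and the greedy strategy are all assumptions for illustration. Real scaffolders are more involved, and this sketch does not handle orientation.

```python
def order_contigs(contacts, start):
    """Greedy chain: append the unplaced contig with the most contacts
    to the contig placed last. Ties are broken arbitrarily."""
    contigs = {name for pair in contacts for name in pair}
    order = [start]
    remaining = contigs - {start}
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda c: contacts.get(frozenset((last, c)), 0))
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    # Hypothetical contact counts between pairs of contigs.
    contacts = {
        frozenset(("A", "B")): 90, frozenset(("B", "C")): 80,
        frozenset(("C", "D")): 70, frozenset(("A", "C")): 10,
        frozenset(("B", "D")): 5, frozenset(("A", "D")): 1,
    }
    print(order_contigs(contacts, "A"))  # ['A', 'B', 'C', 'D']
```

The sketch relies on the same idea as the coiled-line card: sequences that are close in linear space are crosslinked together more often, so contact counts fall off with distance.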
Element is a competing ________ read sequencing technology
short
in Element, an ______ is a dye-labeled polymer with multiple nucleotide arms carrying the same nucleotide base
avidite
how are the bases detected in Element sequencing?
fluorescent signals in 4 channels correlate with A, T, C, or G avidites
steps of Element sequencing:
- bind avidite
- wash away unbound avidites
- bases are detected
- remove avidite
- step and block
- remove blocks
- repeat
Element has extremely good per-base accuracy and does well with ____________. This is a great method for reducing error.
homopolymers