Week 2 Flashcards
Information
the letters used to write the sequences
Genomes exist in a double-stranded or single-stranded state
What is information used for?
We use the information in a DNA sequence to align sequences
Contig
Summary sequence of overlapping DNA fragments
Why can we align DNA sequences?
How long does a sequence have to be for us to expect it to be unique?
Information theory and Genetics
The restriction enzyme HaeIII cuts the sequence GGCC. What is the probability that any random four base pair sequence is a HaeIII site?
1/4 chance of getting any given base at each position
1/4 x 1/4 x 1/4 x 1/4 = 1/256
(1/4)^n
n = length of the recognition sequence (here n = 4)
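The HaeIII calculation above can be checked with a few lines of Python. This is a minimal sketch that assumes the four bases are equally frequent and independent; the function name `site_probability` is made up for illustration.

```python
from fractions import Fraction

def site_probability(site: str) -> Fraction:
    """Probability that a random stretch of len(site) bases equals `site`,
    assuming A, C, G and T each occur with probability 1/4."""
    return Fraction(1, 4) ** len(site)

p = site_probability("GGCC")
print(p)         # 1/256
print(float(p))  # 0.00390625
```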
How many distinct permutations in sequence can a stretch of n bases long have?
How many sequences (permutations) are possible for a given length of DNA strand (n bases long) if the bases occur at equal frequency in the genome?
How many pieces of distinct information can a stretch of n bases have (potential)?
4^n
We can expect a sequence to be unique in a genome when the # of permutations is greater than the # of sequences of n bases in the genome
Bacterial genome = 4x10^6 bases
Human genome = 3x10^9 base (single stranded)
What is the # of sequences of n bases in the genome?
What is the number of sequences of n bases in the genome?
Because genomes are so long, the number of sites is approximately the length of the genome (L - n + 1 is about L) if the genome is single stranded
The number of sites in double stranded DNA = 2 x the length of the genome
Palindromic sequences are an exception.
We can expect a sequence to be unique in a genome when the # of permutations is greater than the number of sequences of n bases in the genome.
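When is 4^n > 8 x 10^6 (bacteria, both strands)?
n >= 12: a sequence of 12 or more bases is expected to be unique in a bacterial genome
When is 4^n > 6 x 10^9 (humans, both strands)?
n >= 17: a sequence of 17 or more bases is expected to be unique in the human genome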
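The two thresholds above can be found by direct search. The sketch below assumes equal base frequencies and counts sites on both strands (2 x genome length), as in the notes; `min_unique_length` is an illustrative name.

```python
def min_unique_length(num_sites: int) -> int:
    """Smallest n for which the number of possible sequences (4^n)
    exceeds the number of n-base sites in the genome."""
    n = 1
    while 4 ** n <= num_sites:
        n += 1
    return n

bacterial_sites = 2 * 4 * 10**6  # 8 x 10^6 sites on both strands
human_sites = 2 * 3 * 10**9      # 6 x 10^9 sites on both strands

print(min_unique_length(bacterial_sites))  # 12
print(min_unique_length(human_sites))      # 17
```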
Say I have 9 genomes made of random sequence of 3 x 10^9 bases each, and I search each of these with ATAGACATAGGATACAT (17 bases). How many would you expect to have the sequence?
Sites per genome / possible 17-base sequences = 6 x 10^9 / 4^17 = 6 x 10^9 / 17 x 10^9 = about 1/3
Therefore, 1/3 x 9 = 3.
I expect about 3 of these random genomes to have this sequence
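The same estimate can be written out in Python. This sketch follows the simplification used above: it treats the expected number of occurrences per genome as the chance that a genome contains the sequence. The function name is illustrative.

```python
def expected_hits_per_genome(genome_length: int, n: int,
                             double_stranded: bool = True) -> float:
    """Expected occurrences of one specific n-base sequence in a random genome:
    (number of sites) / (number of possible n-base sequences)."""
    sites = genome_length * (2 if double_stranded else 1)
    return sites / 4 ** n

query = "ATAGACATAGGATACAT"
per_genome = expected_hits_per_genome(3 * 10**9, len(query))

print(round(per_genome, 2))      # 0.35
print(round(9 * per_genome, 1))  # 3.1
```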
How much information is there in DNA?
A, T, G, C: each letter can be arbitrarily represented as:
A=00
G=01
C=10
T=11
the information content at any position in a DNA sequence is 2 bits
smallest unit of information is a bit
The total bits of a sequence n is the addition of the bits at each position.
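A short sketch of the 2-bit encoding described above. The bit assignments are the arbitrary ones listed in these notes, and the helper names are illustrative.

```python
# Arbitrary 2-bit code taken from the notes above.
CODE = {"A": "00", "G": "01", "C": "10", "T": "11"}

def encode(seq: str) -> str:
    """Write a DNA sequence as a string of bits, 2 bits per base."""
    return "".join(CODE[base] for base in seq.upper())

def information_bits(seq: str) -> int:
    """Total information content: the sum of 2 bits at each position."""
    return 2 * len(seq)

print(encode("GGCC"))                         # 01011010
print(information_bits("ATAGACATAGGATACAT"))  # 34
```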
Information theory
trying to pick out an identified object
We have eight objects; each object is assigned a distinct binary number that identifies it
When do I have sufficient information to identify the object?
If you have only one bit of information, searching through the objects narrows the candidates to half of them
Minimum amount of information
Imin = log2 N
I >= Imin is needed for the information to be sufficient
N is the number of objects (here, the number of sites in the genome)
12 base sequence = 24 bits > log2(8 x 10^6) = 22.9
17 base sequence = 34 bits > log2(6 x 10^9) = 32.5
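The Imin comparison can be reproduced as follows. This is a minimal sketch assuming 2 bits per base and N equal to the number of sites on both strands; the function names are illustrative.

```python
import math

def min_information_bits(num_sites: float) -> float:
    """Imin = log2(N): bits needed to single out one of N objects."""
    return math.log2(num_sites)

def is_sufficient(n_bases: int, num_sites: float) -> bool:
    """True when an n-base sequence (2 bits per base) carries at least Imin bits."""
    return 2 * n_bases >= min_information_bits(num_sites)

print(round(min_information_bits(8e6), 1))  # 22.9
print(round(min_information_bits(6e9), 1))  # 32.5
print(is_sufficient(12, 8e6), is_sufficient(11, 8e6))  # True False
print(is_sufficient(17, 6e9), is_sufficient(16, 6e9))  # True False
```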
BLAST: Basic Local Alignment Search Tool
GenBank has about 10^13 bases of sequence in the database. Therefore Imin is about 44 bits, and a sequence of 23 bases (46 bits) or more could be expected to be unique, more often than not, in the database.
BLAST takes your query and breaks it into short words (5-base words). The database is broken into 5-base words as well, and these words are indexed with position information.
BLAST checks whether words from the query match words in the database. If so, the BLAST program creates an alignment of the query to the matching sequences in the database.
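The word-indexing idea can be illustrated with a toy example. This is only a sketch of the concept as described in these notes, not the actual BLAST implementation: the 5-base word size comes from the notes, and the sequences and function names are made up for illustration.

```python
from collections import defaultdict

def index_words(seq: str, w: int = 5) -> dict:
    """Map every w-base word in `seq` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index

def find_seeds(query: str, db_index: dict, w: int = 5) -> list:
    """Return (query_position, database_position) pairs where a word matches exactly."""
    seeds = []
    for q in range(len(query) - w + 1):
        for d in db_index.get(query[q:q + w], []):
            seeds.append((q, d))
    return seeds

database = "TTGACCATAGACATAGGATTT"  # hypothetical database sequence
query = "CATAGACA"                  # hypothetical query

print(find_seeds(query, index_words(database)))
# [(0, 5), (0, 11), (1, 6), (2, 7), (3, 8)]
```

Seeds that fall on the same diagonal (here query positions 0 to 3 against database positions 5 to 8) are the ones an aligner would try to extend into an alignment.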
Two statistics of a BLAST
Score: the higher the score, the better.
E value: the lower the E value, the better.
The E value tells you how many alignments this good you would expect to find by random chance when searching the database. Low E value, low chance.
Cost of DNA sequencing
Has fallen over the past 20 years
large change occurs after massive parallel sequencing is introduced
Sanger
reads per run: 1
Length of read: 1,000 bases
basic method: termination
expensive/accurate
Illumina
length of read: 100 bases
reads per run: 10^9-10^10
basic method: real time incorporation
cheap/accurate
Pacific Biosciences
length of read: 15,000 bases
reads/run: 10^6
real time incorporation
cheap/low accuracy
nanopore
length of read: 100,000 bases
reads per run: 10^6
basic method: current changes
cheap/low accuracy
Real time incorporation: Illumina
Sequencing proceeds as a series of rounds of synthesis; the nucleotides are modified to have a fluorescent group and a blocked 3’OH
After cycle one, the block on the 3’OH and the fluorescent group are removed
The cycle is repeated over and over, taking successive pictures to determine the sequence of the DNA
Real time incorporation: Pacbio
The cell is set up so that you can watch the DNA polymerase as it incorporates labeled dNTPs.
The label is at the very end of the phosphate chain.
As labeled dNTPs are incorporated into the chain, their fluorescence is detected; the label diffuses away shortly after incorporation.
What you observe is the read output.
Nanopore
the sequence that is in the pore alters the amount of current that can flow through the nanopore in a specific manner depending on the sequence.
as each base goes through we see a current change
A set of bases is in the pore and being read at any one time; the current change we are looking at reflects the G coming out and the T going in
Illumina basic steps
DNA molecule has adaptors added.
Sticks to oligos on a glass slide.
Amplified to make a microdot on the slide.
First sequence read.
Reorientation of the DNA molecule.
Second paired sequence read.
Bridge amplification to clone fragments
Pacific Biosciences sequencing video
DNA molecule has adaptors added.
DNA isolated
Ligate adaptors to the DNA, creating a circular template; primer and polymerase are added to the library
DNA and DNA polymerase in a SMRT cell.
The SMRT cell contains Zero-Mode Waveguides (ZMWs)
A single molecule of DNA is immobilized at the bottom of the ZMW (smallest detection unit)
As polymerase incorporates nucleotides, light is emitted (after which the fluorescent tag is cleaved)
Nucleotide light emission is used to measure incorporation
As each ZMW is illuminated from below, the wavelength of the light is too large to allow it to pass through the waveguide
Attenuated light from the excitation beam penetrates the lower 20-30 nm of each ZMW
DNA sequence read.
Circular consensus sequencing mode produces highly accurate long reads (HiFi)
Continuous long read sequencing mode generates the longest possible reads
Nanopore sequencing video
DNA molecule has adaptors added.
Motor molecule binds
Adaptors are important for the binding of a motor molecule; the DNA is released over membranes that have a single pore in them.
The DNA molecules are guided to the nanopore with the help of tethers that grab onto the motor and help position it over the nanopore.
Sequencing directly enables the detection of base modifications and methylation
Motor protein recognizes adaptor sequence bound to DNA and guides DNA to the nanopore
Tether guides DNA through the nanopore
DNA passes through causing a disruption of ionic current measured in signal trace.
Intact DNA sequences are sequenced in real time by the nanopore, regardless of their length (no runtime, fragments)
DNA-enzyme complexes approach the nanopore and the single stranded DNA is pulled through the aperture of the nanopore; the enzyme ratchets the DNA through the nanopore one nucleotide at a time.
The enzyme binds to a single stranded leader at the end of a single strand, unzipping the double stranded DNA.
Speed of the enzyme can be controlled.
Hairpin structure ensures both strands of DNA can be read in one read.
A nanopore
Contig
Assembling a summary sequence from fragments of DNA
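To make the idea of a contig concrete, here is a toy sketch that merges overlapping reads into one summary sequence. It assumes the reads are error-free, already in left-to-right order, and on the same strand, which real assemblers cannot assume; the reads and function names are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if the overlap is shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(reads: list, min_len: int = 3) -> str:
    """Build a contig by appending the non-overlapping part of each read."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError("no overlap: a gap separates these reads")
        contig += read[k:]
    return contig

reads = ["ATAGACAT", "ACATAGGA", "AGGATACAT"]  # hypothetical overlapping reads
print(merge_in_order(reads))  # ATAGACATAGGATACAT
```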
Genome sequencing starts with
All genome sequencing starts with random fragmentation.
In the tube you have a random set of fragments from different genomes of different sizes, starting at different positions.
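Suppose you pick enough DNA to sequence that, on average, every nucleotide is sequenced once (for example, six random sequences of 11 bases in length).
Because the fragments are picked randomly, this is not a complete genome sequence: some bases are sequenced more than once and others not at all.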
Coverage
how many times has a base been sequenced = bases sequenced/length of genome.
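The coverage formula is a single division. In the sketch below, the 66-base genome and the Illumina-scale numbers are illustrative assumptions, chosen to match read counts and lengths mentioned elsewhere in these notes.

```python
def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Coverage = total bases sequenced / length of the genome."""
    return num_reads * read_length / genome_length

# Six reads of 11 bases over a hypothetical 66-base genome: 1x coverage.
print(coverage(6, 11, 66))  # 1.0

# 10^9 reads of 100 bases over a 3 x 10^9 base genome.
print(round(coverage(10**9, 100, 3 * 10**9), 1))  # 33.3
```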
Why do we have gaps?
This is called the zero class
the proportion that will be zero = e^-m
where m is the expected mean = # of times you sequenced the genome (the coverage)
number of bases not sequenced = e^-m x total length of the genome (ss)
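The zero class can be computed directly. This sketch assumes reads land at random positions so that the number of times a base is sequenced follows a Poisson distribution with mean m; the function names are illustrative.

```python
import math

def fraction_unsequenced(m: float) -> float:
    """Zero class of a Poisson distribution with mean coverage m: e^-m."""
    return math.exp(-m)

def expected_unsequenced_bases(m: float, genome_length: float) -> float:
    """Expected number of bases never sequenced at mean coverage m."""
    return fraction_unsequenced(m) * genome_length

print(round(fraction_unsequenced(1), 3))              # 0.368
print(round(expected_unsequenced_bases(10, 3 * 10**9)))  # 136200
```

Even at 10x coverage of a 3 x 10^9 base genome, roughly 136,000 bases are expected to be missed under this model.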
Repetitive sequence also creates gaps in assembly.
High information content is necessary to create a sequence
Scaffold
A scaffold is an ordered array of contigs with gaps remaining.
How do we generate scaffolds?
Paired end reads are used to create a scaffold.
Sanger and Illumina
Given a DNA fragment of some length, we can determine a read of DNA sequence at one end and a read at the other end (the reverse complement).
Distance between them.
These two sequences are linked by a defined distance
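A small sketch of what a paired-end read pair looks like. The fragment, the read length, and the function names are hypothetical; the point is that the second read comes from the opposite strand and that the fragment length ties the two reads together.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment: str, read_length: int):
    """Return (read 1, read 2, fragment length).
    Read 1 is the start of the fragment; read 2 is the reverse complement
    of the end of the fragment, as read from the other strand."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2, len(fragment)

read1, read2, distance = paired_end_reads("ATAGACATAGGATACAT", 5)
print(read1, read2, distance)  # ATAGA ATGTA 17
```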
Is the human genome completely sequenced?
8% is missing
sequences around the centromere
sequences of tandem rDNA (rRNA genes) repeats
Sequences of areas with repeats
Other genomes may remain incompletely sequenced due to the lack of interest/funds.
Reference genomes
Drosophila: genome determined from a specific fly stock.
Human: Was supposed to be determined from a number of individuals but ended up with most from a single white male.
A reference genome allows rapid assembly of the genome sequence for a single individual.
So is it easy, once you have a reference genome, just to align your sequence reads with the reference genome?
No. Comparison with other human genomes shows that a large amount of sequence is missing. A more sophisticated human reference needs to be developed.