Exam 1 Study Flashcards
1871
Friedrich Miescher identified the presence of ‘nuclein’
1953
James Watson and Francis Crick, with Rosalind Franklin and Maurice Wilkins, discover the double helix structure of DNA
1977
Frederick Sanger develops a DNA sequencing technique
1983
Kary Mullis develops the polymerase chain reaction (PCR), a technique used to amplify DNA
1987
The term genomics is first used in the scientific literature
1990
Human Genome Project is launched
2003
Human Genome Project is finished
2007
Illumina “next-generation” sequencer is available
Number of bases in a haploid human cell
3 billion (3 Gb)
How much of the mammalian genome is coding
2%
Mitochondrial DNA inheritance
strictly maternal
Chromatin
DNA with a protein scaffolding
DNA is wrapped around histones
Constitutive heterochromatin
inactive
Centromeres are used by the cell
during cell division to make sure that each daughter cell gets a copy of each chromosome
Centromeres are
highly repetitive
Telomeres are located _____ and do what
at the ends of chromosomes
protect the ends of the chromosomes
Repetitive DNA is
Tandem
Interspersed
Segmental duplications
LINEs are what percent of the genome
17%
around 5-6 Kb
SINEs are what percent of the genome
11%
<500 bp
Cytoplasmic genome
circular
uniparental inheritance
small compared to nuclear
thousands of copies per cell
heteroplasmy
Segmental duplications
low copy repeats
blocks that range from 1 to 400 kb in length
occur at more than one spot in the genome
and typically share a high level of sequence identity
about 5% of the human genome
Centromeres are how many bases
100s Kb to Mb
Telomeres are how many bases
10s Kb
Three parts of DNA or RNA
Pentose sugar
nitrogenous base
Phosphate group attached to the 5’ carbon
DNA uses what sugar
RNA uses what sugar
Deoxyribose
ribose
Purines
Adenine
Guanine
Pyrimidines
Cytosine
Thymine
Uracil
The phosphate group allows
two nucleotides to be linked
creates the stream of information that DNA encodes
5’ to 3’ linkage between a phosphate group of one nucleotide and the 3’ carbon of the next nucleotide’s sugar
phosphodiester bonds
The two ends of the polynucleotide chain are
not the same
5’ end has a phosphate group attached to the 5’ carbon of the pentose sugar
3’ end has a hydroxyl group
The polynucleotide chain has
polarity
5’ to 3’ ends
A-T bond has how many H bonds
2
C-G bond has how many H bonds
3
A-U bond is in
RNA
Watson and Crick investigated the structure of DNA not by collecting new data but by
using all the available information about chemistry of DNA to construct molecular models
DNA Structure-3 main points
double helix
strands are antiparallel
complementary base pairing
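The antiparallel and complementary points can be combined in one small sketch: reading the partner strand 5’ to 3’ means complementing each base and reversing the order. The function name `reverse_complement` is an illustrative choice, and the sketch assumes an input made only of A, C, G, and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    # Reversing accounts for the antiparallel orientation of the two strands.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("ATGC"))  # GCAT
```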
What type of bond is between base pairs
H bonds
Weak enough to be broken so that the separated strands can be used as templates
DNA strands are arranged helically with ___ base pairs between each turn of the helix
10
Raw materials of DNA synthesis
Template
-single stranded DNA
Enzymes
-DNA polymerase
Raw materials (substrate)
-dNTPs
Mg2+ ions
DNA polymerase does what
catalyzes the formation of phosphodiester bonds
joins the 3’-OH group of the last base in the DNA chain to the incoming 5’-phosphate of a dNTP
Synthesis is what direction
5’ to 3’
dNTP is
selected by the DNA polymerase using the opposing base on the template strand
Key features of DNA replication in Eukaryotes
occurs in the nucleus during S phase of the cell cycle
is initiated by RNA primers
Occurs in the 5’ to 3’ direction
semiconservative
Initiated at the same time at many points along the chromosome
heterochromatin replicates later than does euchromatin
All DNA polymerases require a
free 3’ OH
Gyrase makes double-strand breaks to
relieve torsional strain
Helicase breaks
H bonds between bases
SSB proteins protect
free DNA, prevent secondary structure
Packaging of newly replicated DNA
histones must first disassemble to allow DNA synthesis (uses old histones)
Synthesis of new histones is coordinated with DNA Synthesis
then reassembled into new chromosomes
1952
Hershey-Chase experiments are carried out by Alfred Hershey and Martha Chase to demonstrate that DNA, rather than protein, carries our genetic information
How many years from the identification of nuclein to the demonstration of DNA as the genetic material
81 years
How many years from the first sequencing method to HGP start
13 years
2001
first draft of the human genome sequence released 3 Gb
How many years from the first sequenced bacterial genome to the completion of the Human Genome Project
8 years
Celera vs Human Genome Project
HGP clone by clone approach
Celera whole genome shotgun
2007
Solexa 1G sequencer is available
next generation sequencing
Order of bases
base pair
kilobase
megabase
gigabase
terabase
petabase
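Each step in this list is a factor of 1,000. A minimal sketch of converting a raw base count into the largest convenient unit follows; the helper name `format_bases` and the unit abbreviations are illustrative assumptions.

```python
# Each unit is 1,000 times the previous one.
UNITS = ["bp", "kb", "Mb", "Gb", "Tb", "Pb"]

def format_bases(n_bases: float) -> str:
    """Express a base count in the largest unit that keeps the value >= 1."""
    value = float(n_bases)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1,000,
        # or at the last unit if the count is larger than the list covers.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(format_bases(3_000_000_000))  # 3 Gb
```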
Moore’s law
the number of transistors incorporated in a chip will approximately double every 24 months
Number of copies of target
N × 2^c (N = starting number of copies, c = number of cycles)
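Reading the formula as N × 2^c, with N the starting copies and c the number of cycles, the arithmetic can be sketched as below. It assumes perfect doubling in every cycle, which real reactions do not sustain; `pcr_copies` is an illustrative name.

```python
def pcr_copies(n_start: int, cycles: int) -> int:
    """Ideal copy number after `cycles` rounds of PCR: N * 2**c."""
    # Assumes every template is copied in every cycle (100% efficiency).
    return n_start * 2 ** cycles

print(pcr_copies(1, 30))  # 1073741824
```

One starting molecule therefore yields about a billion copies after 30 ideal cycles.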
Requirements for DNA replication
DNA template
DNA polymerase
Nucleotides
Primers
PCR three steps
Denaturation
Annealing
Extension
Keys to PCR success
primer specificity
annealing temp
Mg++ concentration
Limitations of PCR
size
base complexity
secondary structure
Sanger sequencing uses
4 tubes- one for each base
ABI sequencing uses
one tube with 4 fluorescent labels
Key components needed for transcription
DNA template
the raw materials (ribonucleotide triphosphate)
transcription apparatus
What has to happen to the DNA in order for a gene to be transcribed
uncoiling
DNA molecules undergoing transcription exhibit ___
Christmas tree-like structures
Regulatory regions determine
what, when, where, how much
Regulatory promoters are
upstream of core promoter
affect the rate of transcription
mRNAs have a
5’ cap and 3’ poly A tail
Most eukaryotic organisms have
introns
non-coding region of DNA
In eukaryotes, intron size and number is related to
organism complexity
Introns have
regulatory roles and are longer than exons
in order to have collinearity, introns are
spliced out by snRNPs in a spliceosome
All sequences in DNA that are transcribed into a single RNA molecule
a gene
How many bases per codon are needed to distinguish 20 amino acids
3 (4^3 = 64 combinations; 2 bases give only 4^2 = 16)
The genetic code is ____, which means most amino acids are specified by more than one codon
degenerate
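The arithmetic behind degeneracy can be checked directly: with four bases, the shortest word length that covers 20 amino acids is three, which gives 64 codons, more than are needed. The sketch below only counts combinations; the function name `min_codon_length` is an illustrative assumption.

```python
from itertools import product

BASES = "ACGU"  # the four RNA bases

def min_codon_length(n_amino_acids: int, n_bases: int = 4) -> int:
    """Smallest word length k such that n_bases**k >= n_amino_acids."""
    k = 1
    while n_bases ** k < n_amino_acids:
        k += 1
    return k

k = min_codon_length(20)
# Enumerate every possible codon of that length.
codons = ["".join(p) for p in product(BASES, repeat=k)]
print(k, len(codons))  # 3 64
```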
sequencing is a
tool to be applied to address a question
4 basic steps of Illumina Sequencing
1 sample prep
2 cluster generation
3 sequencing
4 data analysis
Library fragment size has
downstream implications for analysis
Patterned flow cells give
faster scan times due to ordered cluster positions
less cluster overlap
more clusters
Sequence by synthesis
one nucleotide is added at a time
Problems with sequence by synthesis
very accurate, but if the dye is not cleaved off, both colors are seen in the next cycle
quality degrades the longer it is
In Illumina sequencing the dye is
covalently bonded to the base
Illumina sequencing is based on
reversible terminator chemistry
Sequencing by synthesis
Types of color coding for sequencing
4 channel-each nucleotide has its own color
2 channel- uses two colors (A is green and pink, G has none, T is green, C is pink)
1 channel-will be discussed later
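The 2-channel scheme above amounts to a lookup table from two on/off signals to a base. The sketch below assumes the signals have already been thresholded to booleans, which is a simplification; the names `TWO_CHANNEL`, `call_base`, and `call_read` are illustrative.

```python
# (green signal, pink signal) -> base, following the 2-channel card above.
TWO_CHANNEL = {
    (True, True): "A",    # both colors
    (False, False): "G",  # no signal
    (True, False): "T",   # green only
    (False, True): "C",   # pink only
}

def call_base(green: bool, pink: bool) -> str:
    """Call one base from a pair of thresholded channel signals."""
    return TWO_CHANNEL[(green, pink)]

def call_read(signals) -> str:
    """Call a read from a sequence of (green, pink) pairs, one per cycle."""
    return "".join(call_base(g, p) for g, p in signals)

print(call_read([(True, True), (False, True), (False, False), (True, False)]))  # ACGT
```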
Error for Illumina Sequencing
Clusters start to condense
less resolution
occurs due to physical properties of SBS
quality differences
SNPs
How much can Illumina NovaSeq X sequence
1600 Gb
How does Ultima Genomics work
like a dvd or cd
wells that are spun around and read by a laser
The ability to resolve a repetitive sequence is dependent on
the length of the molecules in your library
Long Read Technology
Oxford Nanopore (ONT)-protein nanopores
Pacific Biosciences (PacBio)-SMRT
Bionano Genomics-optical maps
proximity ligation-assembly
Nanopore uses what, and can do how much for what price
biological nanopores
10-20 Gb in <24 hours
around 600 dollars
Nanopores have a diameter that is
in the same scale as many single molecules, including DNA
How does nanopore sequencing work
nanopore is embedded in the membrane
-current cannot travel through
-nanopore creates a hole and the current drives things through the pore
-then measure the change in electrical current to determine which nucleotide it is
-each nucleotide has a different structure so it creates a different electrical current
Nanopore can sequence how much
400 bp/sec
nanopore errors
homopolymers
Nanopore accuracy
> 99%
Which type of sequencing can detect base modification
nanopore
PacBio uses what type of sequencing
SMRT-single molecule real-time
PacBio uses what to do its sequencing
Nano-wells called zero-mode waveguides (ZMWs)
polymerase bound to bottom of ZMW
Phospholinked nucleotides
light from nucleotide cleavage detected as polymerase processes DNA
PacBio mean read length
> 20 kb with a moderate error rate
Process of PacBio
start with high quality double stranded DNA
prepare SMRTbell libraries
anneal primers and bind DNA polymerase
circularized DNA is sequenced in repeated passes
the polymerase reads are trimmed of adapters to yield subreads
consensus and methylation status are called from subreads
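The last step, calling a consensus from repeated passes, can be illustrated with a toy column-wise majority vote. This is not PacBio's actual consensus algorithm: it assumes the subreads are already aligned and of equal length, and `majority_consensus` is a hypothetical helper.

```python
from collections import Counter

def majority_consensus(subreads):
    """Column-wise majority vote over equal-length, pre-aligned subreads."""
    if len({len(s) for s in subreads}) != 1:
        raise ValueError("subreads must be aligned to the same length")
    consensus = []
    for column in zip(*subreads):
        # Pick the most frequent base observed at this position.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

passes = ["ACGTAC", "ACGTTC", "ACGTAC", "ACCTAC"]
print(majority_consensus(passes))  # ACGTAC
```

Random errors in single passes are outvoted when most passes agree, which is why repeated passes over the same circular molecule raise accuracy.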
PacBio sequencing rate
10bp/second
PacBio errors
homopolymers and indels
PacBio accuracy
> 99%
Error types by platform
Illumina: SNPs
Oxford Nanopore: homopolymers
PacBio: homopolymers and indels
HMW DNA is fluorescently labeled at
known sequence motifs
Bionano genomics process
HMW DNA is fluorescently labeled at known sequence motifs
DNA is stretched through nanochannels then imaged
creates a map of those sequence motifs
NOT SEQUENCING
HiC
orders and orients contigs (a contig is a set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region)
PacBio characteristics
increasing read lengths+increasing throughput means decreasing cost
20-30 kb
per base error rate is 10-15%
Most popular long read
Oxford Nanopore characteristics
extremely long reads, but relatively few of them; expensive
no upper limit on size- huge potential
most promising long read in a few years
bionano optical maps characteristics
inexpensive
significant improvement of genome assembly
both short and long reads
FASTA has how many parts and what are they
Two
1) > sequence name
2) sequence
FASTQ has how many parts and what are they
Four
1) @sequence name
2) sequence
3) + some other info
4) quality value (phred scale using ascii)
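The four-line layout can be read with a short parser. This is a minimal sketch that assumes each record occupies exactly four lines (no wrapped sequences) and silently ignores a trailing incomplete record; `FastqRecord` and `parse_fastq` are illustrative names, not a library API.

```python
from typing import Iterator, NamedTuple

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-empty lines."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading '@' so only the name is kept.
        yield FastqRecord(header[1:], seq, qual)

example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
for record in parse_fastq(example):
    print(record.name, record.sequence, record.quality)  # read1 ACGT IIII
```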
FASTA is used when
quality is not needed
presents only the sequence itself
chromosomes
gene structures
FASTQ is used
when quality is needed
sequence reads
Differences between FASTA and FASTQ
quality included in FASTQ using ASCII coded quality value
At a given position in a sequence, the base present is either A/C/G/T but we
cannot directly observe that base.
The base that is produced from a DNA sequencer is an observation based on some biochemical/physical property that has
error
QPhred =
-10 × log10(P(error))
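The Phred formula and its ASCII encoding can be worked through in a few lines. The offset of 33 (the common Phred+33 encoding) is assumed here, since the cards do not state it, and the function names are illustrative.

```python
import math

def phred_from_error(p_error: float) -> float:
    """Q = -10 * log10(P(error))."""
    return -10 * math.log10(p_error)

def error_from_phred(q: float) -> float:
    """Invert the formula: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def decode_quality(quality_string: str, offset: int = 33):
    """Turn an ASCII quality string into Phred scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

print(round(phred_from_error(0.001)))  # 30
print(decode_quality("I5!"))  # [40, 20, 0]
```

A score of 30 therefore corresponds to a 1 in 1,000 chance that the base call is wrong, and a score of 20 to 1 in 100.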
FastQC is
one of many software tools to evaluate quality
it does not actually do any filtering, provides summary metrics and visuals
important metrics
base quality
adaptor content
K-mers
substrings of length k taken from a sequence
k can be any positive integer
a k-mer is a short polymer of k nucleotides
what are k-mers used for
to make distributions and estimate errors
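A k-mer distribution starts with a simple count of every overlapping substring of length k. The sketch below shows that counting step only; `count_kmers` is an illustrative name, and error estimation from the resulting distribution is not shown.

```python
from collections import Counter

def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A sequence of length L contains L - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers("ATATAT", 2)
print(counts["AT"], counts["TA"])  # 3 2
```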
sequence reads are typically how long
150 bp or less