Bioinformatics - investigating gene regulation Flashcards
What is the old method for sequencing DNA?
Sanger sequencing
Also known as the chain terminator or dideoxy method
Invented by double Nobel prize winner, Fred Sanger
Was used to produce the first human genome sequence (2001)
How does Sanger sequencing work?
Uses an E. coli enzyme to make complementary copies of the single-stranded DNA being sequenced
DNA polymerase I uses a single DNA strand as a template and assembles dNTPs into a complementary polynucleotide chain in the 5’ to 3’ direction
A small amount of ddNTP (which lacks the 3’-OH group) is included; when this analogue is incorporated, chain growth is terminated
How can we visualise the sequenced DNA in the Sanger method?
The chain terminators (ddNTPs) are labelled by coloured fluorescent dyes
So the generated set of DNA fragments, each differing in length by one nucleotide, is separated by gel electrophoresis and passed through a detector to visualise the fluorescence
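A toy sketch of the idea in the cards above, assuming one terminated fragment is produced for every position (real reactions are noisier). The function name and the example template are made up for illustration.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sanger_read(template_3to5, seed=0):
    """Simulate chain termination and read the new strand 5'->3'."""
    # The polymerase builds the complementary strand along the template.
    new_strand = "".join(COMPLEMENT[base] for base in template_3to5)
    # Each ddNTP incorporation stops growth, giving one fragment per length.
    fragments = [new_strand[:n] for n in range(1, len(new_strand) + 1)]
    # In the tube the fragments are mixed together.
    random.Random(seed).shuffle(fragments)
    # Electrophoresis orders the fragments by length (shortest first).
    fragments.sort(key=len)
    # The dye on the terminal ddNTP reveals the last base of each fragment.
    return "".join(fragment[-1] for fragment in fragments)

print(sanger_read("TACGGT"))  # ATGCCA
```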
What is done at the end of Sanger sequencing?
The reads that have been sequenced must be correctly assembled into the original strand of DNA
The reads are compared to find overlapping sequences
The fragments to be sequenced can be generated by sonication - a solution of stiff double-stranded DNA is exposed to high-frequency sound waves
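Assembly by overlap can be sketched as a greedy merge: repeatedly join the two reads that share the longest suffix/prefix overlap. This is a simplified illustration that assumes error-free reads from one strand; the function names and the minimum overlap of 3 are assumptions, not part of any real assembler.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by best overlap until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining reads do not overlap; leave them as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGCCGTA", "CGTAGGCT", "GGCTTAAC"]))
```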
What is the capacity and speed of modern Sanger sequencing?
Sequences up to 600-800 bases in one read
Typical machine does 96 samples in 2.5 hours with 15 mins of technician time = highly automated
Yearly output of a machine – 30 Mb (30 x 10^6 bases)
(cf. human genome – 3 Gb (3 x 10^9 bases))
But you need to create much more sequence than that to reduce error and assemble the sequence
Typically each base needs to be sequenced 10s-100s of times
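A quick check of the arithmetic, using only the figures on this card (30 Mb per machine per year, a 3 Gb genome). The function name is made up for illustration.

```python
GENOME_BASES = 3_000_000_000          # human genome, ~3 Gb
MACHINE_BASES_PER_YEAR = 30_000_000   # one Sanger machine, ~30 Mb per year

def machine_years(coverage):
    """Machine-years needed to sequence every base `coverage` times."""
    return GENOME_BASES * coverage / MACHINE_BASES_PER_YEAR

print(machine_years(1))   # 100.0
print(machine_years(10))  # 1000.0
```

So even 10x coverage would take one machine about 1000 years, which is why the first genome needed many machines running in parallel.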
Describe the developments since the first human genome?
The first human genome ‘reference’ sequence took years (~1990 – 2001/2004) to build
Cost approximately £1 billion
‘Next generation’ sequencing methods can sequence a genome in less than a week at a cost of a few hundred pounds
These have been progressively improved since the early years of this century
Example - Illumina/Solexa method
Describe Illumina sequencing?
Works by running many miniaturised sequencing reactions for different DNA fragments in parallel
Generates short sequence ‘reads’
As the method has developed these have been 36, 51, 75 and now >300 base pairs in length
What is the output of Illumina?
‘Bench top’ sequencing – MiSeq, NextSeq
Run time 1-2 days
MiSeq – 25 million paired end reads (300 bp each)
NextSeq – 400 million
Production sequencing – HiSeq, NovaSeq
Runtime 1-3 days
5 billion – 20 billion paired end reads
Compare – human genome 3.1 G bases
1 billion paired-end reads of 150 bp (2 x 150 bp per pair) is 300 G bases – about 100x the genome!
Most runs now combine more than one experiment
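The coverage figure can be checked directly, assuming each pair contributes two reads of the stated length. The function name is made up for illustration.

```python
def fold_coverage(read_pairs, read_length, genome_bases):
    """Average number of times each base is sequenced."""
    return read_pairs * 2 * read_length / genome_bases

coverage = fold_coverage(1_000_000_000, 150, 3_100_000_000)
print(f"{coverage:.0f}x")  # 97x
```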
What is alignment?
Short sequence reads are aligned to the reference human genome
Similarity analysis of sequences begins with alignment
We can see the evolutionary relationship between sequences by identifying the aligned bases
Computationally intensive - there are millions of reads to align
You can identify substitution, insertion and deletion mutations of bases
Insertions and deletions are called indels or gaps
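A minimal sketch of pairwise global alignment (Needleman-Wunsch dynamic programming), showing how substitutions and gaps appear. The scoring values are arbitrary assumptions, and real short-read aligners use much faster indexed methods rather than this full table.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (aligned_a, aligned_b, score) for a global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # match or substitution
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

for read in ("GATCACA", "GATACA"):   # one substitution, then one deletion
    ref_row, read_row, s = global_align("GATTACA", read)
    print(ref_row, read_row, s, sep="\n")
```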
What are the uses of high-throughput DNA sequencing?
Genetics - sequencing genomic DNA and finding variants and their associations with phenotype and disease
Sequencing has also been revolutionary in epigenetics
DNase-seq - how can we find hypersensitive sites?
DNA is cut (digested) with DNase
DNA is more accessible to cutting in nucleosome free regions - often where proteins bind - promoters and enhancers
Sequence in from the cut ends
Align the resulting sequences to the genome
‘Peaks’ – regions where many reads align correspond to DNase hypersensitive sites
What is the initial data analysis for DNase-seq?
Need computational methods for identifying peaks
Statistics to distinguish real peaks from noise
Ultimately reveals hypersensitive sites across the whole genome
These are cell type specific
Usually carried out on uniform populations of cells of the same type
Needs millions of cells
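One simple way to separate peaks from noise, sketched under strong assumptions: reads are counted in fixed-size bins and each bin is tested against a Poisson background equal to the genome-wide mean. The bin size, threshold and function names are illustrative only; real peak callers model local background and correct for multiple testing.

```python
import math
from collections import Counter

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)   # P(X = 0)
    cdf = 0.0
    for i in range(k):      # add P(X = 0) ... P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def call_peaks(read_positions, genome_length, bin_size=100, alpha=1e-3):
    """Return (start, end, count) for bins with significantly many reads."""
    n_bins = math.ceil(genome_length / bin_size)
    counts = Counter(pos // bin_size for pos in read_positions)
    lam = len(read_positions) / n_bins   # expected reads per bin under noise
    peaks = []
    for b in range(n_bins):
        k = counts.get(b, 0)
        if k > 0 and poisson_sf(k, lam) < alpha:
            peaks.append((b * bin_size, min((b + 1) * bin_size, genome_length), k))
    return peaks

# Made-up data: one background read per bin plus a pile-up at 300-319.
reads = [50 + 100 * i for i in range(10)] + list(range(300, 320))
print(call_peaks(reads, genome_length=1000))
```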
What is the further data analysis for DNase-seq?
Detect transcription factor binding by locating binding sequences in hypersensitive sites - as many transcription factors bind to specific sequence patterns in DNA
Comprehensive collections of binding site patterns are in databases like TRANSFAC and JASPAR
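Locating binding sequences is commonly done by scoring each window of a sequence against a position weight matrix. The sketch below assumes a made-up 4-base count matrix (not a real database entry), a uniform background, a pseudocount of 1 and an arbitrary threshold; it scans one strand only and expects uppercase A/C/G/T.

```python
import math

# Hypothetical position frequency matrix: counts of each base at 4 positions.
PFM = {
    "A": [8, 0, 0, 9],
    "C": [1, 0, 10, 0],
    "G": [1, 10, 0, 0],
    "T": [0, 0, 0, 1],
}

def log_odds(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to per-position log2(frequency / background) scores."""
    width = len(pfm["A"])
    pwm = []
    for i in range(width):
        total = sum(pfm[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({b: math.log2((pfm[b][i] + pseudocount) / total / background)
                    for b in "ACGT"})
    return pwm

def scan(seq, pwm, threshold):
    """Return (start, window, score) for windows scoring at least threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits

print(scan("TTAGCATT", log_odds(PFM), threshold=5.0))
```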
What is ATAC-seq?
Assay for Transposase-Accessible Chromatin using sequencing
Similar analysis to DNase-seq
It assesses genome-wide chromatin accessibility
Needs fewer cells
What is ChIP?
Chromatin immunoprecipitation (ChIP)
We can detect where transcription factors are binding, where modified histones are or where any DNA binding protein is
First crosslink DNA and proteins - using formaldehyde (a short-range cross-linker) - and isolate chromatin
Sonicate or digest (with micrococcal nuclease) chromatin into fragments of around 500 bp
Immunoprecipitate - use antibody against protein of interest and use protein A or G to pull it down and then wash
Digest with proteinase K and reverse crosslinks before purifying DNA
PCR amplify target sequences (or detect by hybridisation)
Use primers against a specific gene in the PCR reaction to determine if that region is bound by the protein of interest