Genomics Flashcards
Dye terminator sequencing
- each ddNTP is labelled with a different fluorophore
- one reaction with all four ddNTPs
Single sequence to genome
- genome too long to sequence in one go
- can fragment the genome into large pieces (50-200kb) which can be amplified in bacterial culture as bacterial artificial chromosomes (BACs)
- similar strategies use yeast (yeast artificial chromosomes, YACs) and cosmids/fosmids, which use other replication origins in bacteria
- fragments of DNA are cloned into the vector of choice and then tens of thousands of individual colonies are picked to create a library
- each clone contains a (hopefully) unique fragment of the genome sequence
BAC library
- rare cutting enzyme digest and clone (every 50-200kb)
- clone into a BAC vector to give bacterial artificial chromosomes
- pick each clone into a separate well - each well now contains a different genome fragment
- each fragment can be purified and analysed
- align by digest ‘fingerprint’
- shotgun sequence each BAC individually
Individual BAC clones
- have a restriction digest pattern matching that of the original genome
- by digesting with many enzymes the digestion patterns can be determined and matched to ‘tile’ BAC clones to give a physical map
- may also contain known DNA sequences or markers which can then be used to improve the physical map and link it to the genetic map
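The fingerprint matching described above can be sketched in a few lines. This is a minimal illustration, not a real mapping pipeline: the function names `digest` and `shared_fragments` are hypothetical, the enzyme is assumed to cut at the start of its recognition site, and fragment sizes are compared exactly, whereas sizes measured on a gel would need a tolerance.

```python
from collections import Counter


def digest(seq, site):
    """Return sorted fragment lengths after cutting seq at every occurrence of site.

    Assumption: the cut falls at the first base of the recognition site.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]) if b > a)


def shared_fragments(frags_a, frags_b):
    """Fragment lengths present in both fingerprints (multiset intersection)."""
    return sorted((Counter(frags_a) & Counter(frags_b)).elements())


clone = "AAGAATTCCCCCGAATTCTT"
fingerprint = digest(clone, "GAATTC")
common = shared_fragments([2, 8, 10], [8, 10, 15])
```

Two clones that share several fragment sizes probably overlap, which is what lets them be tiled into a physical map.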
‘Shotgun’ sequencing with Sanger
- analyses all the bases at once for a single sequence
- breaks the genome into manageable chunks at random, then sequences them
- fragment genomic DNA -> clone into sequencing vector -> pick colonies and sequence
- large libraries with tens of thousands of these clones can be constructed and mapped by restriction mapping
- requires the source DNA to be broken into approx. 1000bp chunks
- these are incorporated into a sequencing vector and sequenced using standard primers from both sides
Overlap Layout Consensus Method
- DNA is sequenced to produce a set of partial sequences (reads)
- a computer is used to assemble the sequence reads into a series of overlapping fragments
- the overlaps are removed by the computer to produce a single assembled sequence
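The idea of assembling reads by their overlaps can be shown with a toy greedy merger. This is only a sketch under strong assumptions: reads are error-free, all on the same strand, and free of repeats. A real overlap-layout-consensus assembler builds an overlap graph and derives a consensus rather than merging greedily. The names `overlap` and `greedy_assemble` and the `min_len` parameter are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        # Keep the overlapping bases once, then append the rest of the second read
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
```

Here the three reads overlap by four bases each and collapse into a single contig.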
‘Next generation’ sequencing
Sanger
- shotgun cloning is slow and expensive
- sequences one molecule at a time
- accurate
Illumina (Solexa)
- sequences all molecules at the same time
- quite accurate
- other competing technologies as well
- relatively short reads
- expensive for single sequences, cheap for many
High-throughput sequencing
Involves:
- the chemical amplification of DNA fragments
- the synthesis of complementary strands using fluorescently labelled nucleotides
- now outdated and rarely used
- single DNA molecules are attached to a solid surface
- each molecule is amplified in place by PCR (each spot is a PCR colony or ‘polony’)
- the four nucleotides (as nucleotide triphosphates), each labelled with a different fluorescent dye, are added, along with DNA polymerase and a universal primer
- only one nucleotide is attached to the primer by DNA polymerase. Unincorporated nucleotides are removed
- the newly added nucleotide is detected by a camera
- the cycle is repeated about 100 times
High-throughput sequencing
- The sequence of interest is first fragmented and fragments of a specific size isolated.
- Specific PCR primers are ligated onto the ends.
- These fragments are then hybridised to oligos on a flow cell (very dilute)
- The oligos attached to the flow cell act as primers to amplify the fragment attached to the slide. This forms a PCR colony of identical fragments
- then it is on to the sequencing process.
The HTS cycle
- add to growing chain 5’-3’
- detect label with camera
- chemically cleave the label revealing the 3’OH
- limit = about 130-150 cycles
- further extension blocked by the dye label
- immobilised template is hybridised with a 3’-labelled dNTP and the sequence extended by one base
- as 3’ is blocked (the fluorescent label acts as a protecting group), the chain cannot be extended
- excess reagents are removed and the presence of the fluorescent label is detected
- the 3’ position is deprotected, ready for the next cycle
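The cycle above (add one blocked base, image it, deprotect, repeat) can be mimicked with a short simulation. This is an idealised sketch: one error-free template, no signal decay, and the template is assumed to be written 3'->5' so that the new strand grows 5'->3'. The function name `sequence_by_synthesis` and the parameter `n_cycles` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sequence_by_synthesis(template, n_cycles=150):
    """Simulate reversible-terminator cycles, one base call per cycle.

    template: template strand written 3'->5' (assumption for this sketch).
    n_cycles: maximum number of cycles, i.e. the maximum read length.
    """
    read = []
    for base in template[:n_cycles]:
        incorporated = COMPLEMENT[base]  # one labelled, 3'-blocked nucleotide is added
        read.append(incorporated)        # the camera records the dye as a base call
        # the label is then cleaved, exposing the 3'-OH for the next cycle
    return "".join(read)


read = sequence_by_synthesis("TACGGA", n_cycles=4)
```

The read length is capped by the number of cycles, which is why the cycle limit sets the maximum read length.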
Illumina HiSeq
- a small portion of the sequencing slide can read >150 million sequences at the same time for each sample
- scaling this up, an Illumina HiSeq has 8 lanes
- each lane can be used for a different sample
- each lane can give 20-40 million sequences up to 150 bases
- 1 run takes 3-6 days (~1 hour per base)
- 30 x 10^6 reads x 8 lanes x 150 bases = 36 Gb (approx. ten human genomes)
- now up to 150Gb per run (50x coverage)
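The yield arithmetic for one run can be checked directly. The figure of 30 million reads per lane is the mid-range of the 20-40 million quoted above; the human genome size of 3.2 Gb is an assumption added here for the comparison.

```python
reads_per_lane = 30_000_000   # mid-range of 20-40 million reads per lane
lanes = 8
read_length = 150             # bases per read

yield_bases = reads_per_lane * lanes * read_length
yield_gb = yield_bases / 1e9

human_genome_bases = 3_200_000_000  # assumed haploid human genome size
genome_equivalents = yield_bases / human_genome_bases

print(f"{yield_gb:.0f} Gb per run, about {genome_equivalents:.0f} human genomes")
```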
Chromosomes
have a single DNA molecule with specialised DNA sequences for the initiation of DNA replication, for spindle interactions in mitosis (centromeres), and for maintaining the integrity of the ends (telomeres)
Protein gene expression
occurs at open reading frames, from which RNA polymerase transcribes mRNAs that are translated to form polypeptides, which become functioning proteins. Genes contain DNA sequences for control of their expression
Protein coding genes
generally not repetitive but there are some exceptions, e.g. filaggrin and high copy number genes
Repetitive regions
microsatellites, telomeres, intron sequences
tRNA
very similar sequences (but very short)
rRNA
many copies of some ribosomal genes
Transposons
mobile genetic elements - sequence of a few kb that can move about the genome. Thousands of copies in eukaryotes
Size matters
- the longest repeats in microbial genomes are about 7kb
- with the latest technologies we can read right through them
- without extra long reads we need to improvise with paired-end reads
Contig
a contiguous (continuous) consensus sequence from an assembly
Scaffold
a series of contigs where we have additional information to place them together in the right order and orientation but the sequence between the contigs is not complete
Assembly
the set of scaffolds for one genome
N50
the contig/scaffold length such that 50% of the assembled data is in contigs/scaffolds of that size or larger
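A small function makes the N50 definition concrete: sort the contigs from longest to shortest and walk down the list until half of the total assembled length has been covered. The function name `n50` and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300
result = n50(contig_lengths)
```

For these lengths the two longest contigs (80 + 70 = 150) already hold half of the 300 assembled bases, so the N50 is 70.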
Read length
- A single read cannot span a repetitive region that is longer than the read length.
- This prevents long contigs from forming.
- The longer the read length the larger the repeat region that can be assembled.
Read depth/coverage
- The average number of reads covering each base of the assembly.
- A coverage of 10X means that each base is on average found in 10 reads.
- The deeper the coverage, the more clearly any sequence or structure changes can be discerned from sequence error
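Average coverage can be estimated as (number of reads x read length) / genome size. The sketch below applies that formula to illustrative numbers, not values from a real run; the function names are hypothetical.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Minimum number of reads to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Example: a 5 Mb bacterial genome sequenced with one million 150-base reads
depth = coverage(1_000_000, 150, 5_000_000)
needed = reads_needed(10, 150, 5_000_000)
```

This is an average only; real coverage varies along the genome, so some regions end up with far fewer reads than the mean.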
Ploidy
- The number of copies of the genome in the organism.
- Bacteria =1; Human=2; Potato=4; Strawberry=8
- The higher the ploidy, the harder it is to accurately assemble.
Genomic resequencing
- to look for a variant
- identify differences between strains/organisms/individuals
- assembly against a reference is much easier than de novo sequencing
- may impact how you are treated medically in the future/potential of personalised medicine
Resequencing steps
- different to reference sequence
- gap compared to reference sequence
- duplicated gene or region?
Challenges of short read re-sequencing
- deletion of a whole gene: hard to look for something that’s not there
- duplication is the same kind of problem
- inversion: if the reads are short it’s hard to tell
Single molecule real time sequencing (PacBio)
- long read (10kb+)
- high error rate (14%)
- cyclising the template means it can be read many times and an accurate consensus obtained
- Ion Torrent works in a similar way but detects the pH change on nucleotide addition
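The benefit of reading a cyclised template many times can be illustrated with a per-position majority vote. This is a simplified sketch: it assumes the passes are already aligned, equal in length, and contain substitution errors only, whereas real long-read errors include insertions and deletions that require alignment first. The function name `consensus` and the example passes are hypothetical.

```python
from collections import Counter


def consensus(passes):
    """Majority base at each position across aligned, equal-length passes."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Five passes over the same molecule, each carrying at most one random error
passes = [
    "ACGTTGCA",
    "ACGATGCA",
    "ACGTTGCA",
    "TCGTTGCA",
    "ACGTTGGA",
]
corrected = consensus(passes)
```

Because the errors fall at different positions in each pass, they are outvoted and the consensus recovers the underlying sequence.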