Genomics Flashcards
Dye-terminator sequencing
- each ddNTP is labelled with a different fluorophore
- one reaction with all four ddNTPs
Single sequence to genome
- genome too long to sequence in one go
- can fragment the genome into large pieces (50-200 kbp) which can be amplified in bacterial culture as bacterial artificial chromosomes (BACs)
- similar strategies use yeast artificial chromosomes (YACs), or cosmids/fosmids, which use other replication origins in bacteria
- fragments of DNA are cloned into the vector of choice and then tens of thousands of individual colonies are picked to create a library
- each clone contains a (hopefully) unique fragment of the genome sequence
BAC library
- rare cutting enzyme digest and clone (every 50-200kb)
- clone into a BAC vector to give bacterial artificial chromosomes
- pick each clone into a separate well - each well now contains a different genome fragment
- each fragment can be purified and analysed
- align by digest ‘fingerprint’
- shotgun sequence each BAC individually
Individual BAC clones
- have a restriction digest pattern matching that of the original genome
- by digesting with many enzymes the digestion patterns can be determined and matched to ‘tile’ BAC clones to give a physical map
- may also contain known DNA sequences or markers which can then be used to improve the physical map and link it to the genetic map
‘Shotgun’ sequencing with Sanger
- analyses all the bases at once for a single sequence
- breaks the genome down at random into manageable chunks, then sequences them
- fragment genomic DNA -> clone into sequencing vector -> pick colonies and sequence
- large libraries with tens of thousands of these clones can be constructed and mapped by restriction mapping
- requires the source DNA to be broken into approx. 1000bp chunks
- these are incorporated into a sequencing vector and sequenced using standard primers from both sides
Overlap Layout Consensus Method
- DNA is sequenced to produce a set of partial sequences (reads)
- a computer is used to assemble the sequence reads into a series of overlapping fragments
- the overlaps are removed by the computer to produce a single assembled sequence
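A toy sketch of the overlap and merge idea may help here. It is not a real assembler: greedy pairwise merging stands in for the layout and consensus steps, reads are assumed to be error-free and all on the same strand, and the function names and the `min_len` threshold are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the longest overlap."""
    # Drop duplicates and reads wholly contained in another read.
    contigs = sorted(
        r for r in set(reads)
        if not any(r != other and r in other for other in reads)
    )
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            # Nothing left to merge: what remains are the contigs.
            return contigs
        # Merge, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


reads = ["GCGTACC", "TACCTGA", "ATGGCGT"]
print(greedy_assemble(reads))
```

Real assemblers also have to cope with sequencing errors, reverse-complement reads and repeats, which is where greedy merging breaks down.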
‘Next generation’ sequencing
Sanger
- shotgun cloning is slow and expensive
- sequences one molecule at a time
- accurate
Illumina (Solexa)
- sequences all molecules at the same time
- quite accurate
- other competing technologies as well
- relatively short reads
- expensive for single sequences, cheap for many
High-throughput sequencing
Involves:
- the chemical amplification of DNA fragments
- the synthesis of complementary strands using fluorescently labelled nucleotides
- now outdated and rarely used
- single DNA molecules are attached to a solid surface
- each molecule is amplified in place by PCR (each spot is a PCR colony or ‘polony’)
- the four nucleotides (as nucleotide triphosphates), each labelled with a different fluorescent dye, are added, along with DNA polymerase and a universal primer
- only one nucleotide is attached to the primer by DNA polymerase. Unincorporated nucleotides are removed
- the newly added nucleotide is detected by a camera
- the cycle is repeated about 100 times
High-throughput sequencing
- The sequence of interest is first fragmented and fragments of a specific size isolated.
- Specific PCR primers are ligated onto the ends.
- These fragments are then hybridised to oligos on a flow cell (very dilute)
- The oligos attached to the flow cell act as primers to amplify the fragment attached to the slide. This forms a PCR colony of identical fragments
- then it is on to the sequencing process.
The HTS cycle
- add to growing chain 5’-3’
- detect label with camera
- chemically cleave the label revealing the 3’OH
- limit = about 130-150 cycles
- further extension blocked by the dye label
- immobilised template is hybridised with a 3’-labelled dNTP and the sequence extended by one base
- as 3’ is blocked (the fluorescent label acts as a protecting group), the chain cannot be extended
- excess reagents are removed and the presence of the fluorescent label is detected
- the 3’ position is deprotected, ready for the next cycle
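The cycle can be pictured as a loop that adds exactly one blocked, labelled base per round. The sketch below is a deliberately idealised model (no errors, no loss of signal); the function name and the `max_cycles` default are assumptions chosen to mirror the cycle limit quoted above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template_3_to_5, max_cycles=150):
    """Idealised reversible-terminator sequencing of one template.

    The template is given 3'->5', so the returned read is the newly
    synthesised strand written 5'->3'.
    """
    read = []
    for cycle, template_base in enumerate(template_3_to_5):
        if cycle >= max_cycles:
            break  # the read stops once the cycle limit is reached
        # 1. Polymerase adds ONE labelled nucleotide; the label blocks
        #    the 3' position so no second base can be added.
        incorporated = COMPLEMENT[template_base]
        # 2. Excess reagents are washed away and the camera records
        #    which label is present.
        read.append(incorporated)
        # 3. The label is cleaved, exposing the 3'-OH for the next cycle.
    return "".join(read)


print(sequence_by_synthesis("TACG"))
```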
Illumina HiSeq
- a small portion of the sequencing slide can read >150 million sequences at the same time for each sample
- scaling this up, an Illumina HiSeq has 8 lanes
- each lane can be used for a different sample
- each lane can give 20-40 million sequences up to 150 bases
- 1 run takes 3-6 days (~1 hour per base)
- 30 × 10^6 reads × 8 lanes × 150 bases = 36 Gb (approx. ten human genomes)
- now up to 150Gb per run (50x coverage)
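The yield arithmetic can be checked with a few lines of code. This sketch takes 30 million reads per lane (the midpoint of the 20-40 million range above) and assumes a human genome of roughly 3.2 Gb; both the function name and that genome size are assumptions for illustration.

```python
def run_yield_bases(reads_per_lane, lanes, read_length):
    """Total bases produced by one run: reads x lanes x read length."""
    return reads_per_lane * lanes * read_length


HUMAN_GENOME_BASES = 3_200_000_000  # assumption: ~3.2 Gb haploid genome

total = run_yield_bases(reads_per_lane=30_000_000, lanes=8, read_length=150)
print(f"{total / 1e9:.0f} Gb per run")
print(f"about {total / HUMAN_GENOME_BASES:.1f} human genomes at 1x")
```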
Chromosomes
have a single DNA molecule with specialised DNA sequences for the initiation of DNA replication, for spindle interactions in mitosis (centromeres), and for maintaining the integrity of the ends (telomeres)
Protein gene expression
occurs at open reading frames, from which RNA polymerase transcribes mRNAs that are translated to form polypeptides, which become functioning proteins. Genes contain DNA sequences for control of their expression
Protein coding genes
generally not repetitive but there are some exceptions, e.g. filaggrin and high copy number genes
Repetitive regions
microsatellites, telomeres, intron sequences
tRNA
very similar sequences (but very short)
rRNA
many copies of some ribosomal genes
Transposons
mobile genetic elements - sequence of a few kb that can move about the genome. Thousands of copies in eukaryotes
Size matters
- the longest repeats in microbial genomes are about 7kb
- with the latest technologies we can read right through them
- without extra long reads we need to improvise with paired-end reads
Contig
a contiguous (continuous) consensus sequence from an assembly
Scaffold
a series of contigs where we have additional information to place them together in the right order and orientation but the sequence between the contigs is not complete
Assembly
the set of scaffolds for one genome
N50
the contig/scaffold length such that 50% of the assembled data is in contigs/scaffolds of that size or larger
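A minimal sketch of how N50 can be computed from a list of contig (or scaffold) lengths; the function name and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Total is 290, so half is 145; 80 + 70 = 150 reaches it at the 70 contig.
print(n50([80, 70, 50, 40, 30, 20]))
```

Because the lengths are walked from largest to smallest, a few long contigs raise N50 far more than many short ones.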
Read length
- A single read cannot span a repetitive region that is longer than the read length.
- This prevents long contigs from forming.
- The longer the read length the larger the repeat region that can be assembled.
Read depth/coverage
- The average number of reads covering each base in the final assembly.
- A coverage of 10X means that each base is on average found in 10 reads.
- The deeper the coverage, the more clearly any sequence or structure changes can be discerned from sequence error
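Average coverage is commonly estimated as (number of reads × read length) / genome size. The sketch below applies that estimate; the 5 Mb genome is a hypothetical bacterial genome, and the function names are illustrative.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Expected average depth: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 5_000_000  # hypothetical 5 Mb bacterial genome

print(mean_coverage(1_000_000, 150, GENOME_SIZE))  # depth from 1M reads
print(reads_needed(10, 150, GENOME_SIZE))          # reads for 10X
```

This is an average only: real coverage varies along the genome, so some regions will sit well below the mean.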
Ploidy
- The number of copies of the genome in the organism.
- Bacteria =1; Human=2; Potato=4; Strawberry=8
- The higher the ploidy, the harder it is to accurately assemble.
Genomic resequencing
- to look for a variant
- identify differences between strains/organisms/individuals
- assembly against a reference is much easier than de novo sequencing
- may impact how you are treated medically in the future/potential of personalised medicine
Resequencing steps
- different to reference sequence
- gap compared to reference sequence
- duplicated gene or region?
Challenges of short read re-sequencing
- deletion of a whole gene: hard to look for something that’s not there
- duplication is the same kind of problem
- inversion: if the sequence is short it’s hard to tell
Single molecule real time sequencing (PacBio)
- long read (10kb+)
- high error rate (14%)
- cyclising the template means it can be read many times and an accurate consensus obtained
- Ion Torrent works in a similar way but detects the pH change on nucleotide addition
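Why repeated passes over a circularised template help can be shown with a simple majority vote. This sketch assumes the passes are already aligned, have equal length and contain substitution errors only, which is a strong simplification of real long-read errors; the function name and example passes are invented.

```python
from collections import Counter


def consensus(passes):
    """Majority-vote consensus over pre-aligned, equal-length passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("passes must be aligned to the same length")
    # For each position, keep the base seen in the most passes.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five noisy passes over the same molecule, each with at most one error.
passes = ["ACGTTGCA", "ACGATGCA", "ACGTTGGA", "TCGTTGCA", "ACGTTGCA"]
print(consensus(passes))
```

As long as errors fall at different positions in different passes, each one is outvoted, so the consensus is more accurate than any single pass.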
Nanopore sequencing
- as DNA is passed through the nanopore by the molecular motor under the influence of a potential
- the current changes in a detectable way depending on the bases occluding the pore
- the current can be interpreted to read the DNA sequence
- to improve accuracy, a hairpin adapter is ligated to the end of the DNA fragment
- this causes both strands of DNA to be read sequentially
Ultra long read issues
Accuracy
- at present around 95-98%
Throughput
- much slower than Illumina (5Gb/48hr vs 150Gb/96hr)
Toolset
- these are new technologies and the analysis tools are still being developed
Sequencing summary
- most sequencing is by synthesis
- current sequencing technologies can produce terabases per day
- assembly is a challenge, especially for large genomes
- repetitive regions are challenging
- small changes compared to a reference are challenging
- new technologies are helping to solve the challenge
- careful experimental design can help solve the challenge
- long reads - lower throughput but better for genome structure
- short reads - higher throughput and better for sequence accuracy
How do we differ from one another?
we differ from each other in small polymorphisms and structural variation
What is a reference genome?
A standard sequence against which we can compare other sequences
What are the problems with the reference?
- The reference is from a very small subset of donors and is a mosaic
- People vary, in some regions far more than others. (GRCh38 has 261 alternate scaffolds)
- The reference is incomplete (603 gaps)
dbSNP
a collection of genomic variation for human and other species
Single nucleotide variants/polymorphisms
- substitution
- deletion
- insertion
Structural variants
changes in the overall structure of the genome
- duplication
- loss
- translocation
- inversion
- repeat
- deletion
Haemophilus influenzae
- a human pathogen that lives in the upper respiratory tract
- can cause conjunctivitis and meningitis
- sequenced in 1995
- first whole genome sequenced using a shotgun method
- since then 28 different strains have been sequenced and the pathogens have been identified
What does the presence of genes responsible for different functions determine?
the virulence of the organism
Virulence
the degree of pathogenicity within a group or species of parasites as indicated by case fatality rates and/or the ability of the organism to invade the tissues of the host
Virulence Factors
genes which produce products essential for virulence
Where do gene differences between strains tend to occur?
- in clustered regions on the chromosome
- provided support for the concept of lateral gene transfer
Lateral gene transfer
In some cases a specific cluster of genes can arise from a different source. Instead of progressive stepwise evolution, a cluster of genes from ‘foreign’ DNA is incorporated as a plasmid or integrated into the genome
Mycoplasma genitalium
- smallest known free-living organism
- commensal in the genitourinary tract
Chlamydia trachomatis
- human pathogen
- causes trachoma (blindness), pharyngitis, bronchitis
- obligate intracellular parasite transmitted by sexual contact
How does Chlamydia trachomatis affect HeLa cells on infection?
- prevents fusion of phagosome and lysosome
- takes in ATP from host cells as it cannot produce it itself
Metabolism of Chlamydia trachomatis
- metabolic pathways are patchy
- part of TCA missing
- cannot synthesise ATP
- doesn’t appear to synthesise amino acids
- contains a well defined recombinase pathway
- reported to recombine and reshuffle the genome quite readily
- contains many fatty acid and phospholipid synthesis genes
How do parasitic or commensal bacteria benefit?
- resources of the host
- taking things is more efficient than making them
- their genomes have adapted and lost key metabolic processes
DNA replication is expensive
there is no competitive advantage for a bacterium to keep DNA that is of no use or redundant
What is the minimal genome?
- take a small genome and make it smaller by knocking out genes
Transposon
- ‘mobile’ DNA element that codes for enzymes that allow it to relocate in the genome
- can be many tens of kb long and include many genes
Method applied to Mycoplasma genitalium
- reduced essential gene count from 482 to 389
- synthesised the entire genome and transplanted into an empty cell
- the new synthetic organism grew
Means of horizontal (lateral) gene transfer
- virus
- ingest DNA: foreign DNA can be taken up by a variety of mechanisms, e.g. phage infection or direct introduction of DNA into cells (competent cells)
- ingest organism: ingesting an organism and then using its DNA
- conjugation (mating). Exchanging DNA with a related organism. Inside the cell the DNA could stay as a plasmid or integrate into the host genome via viral integrases
- OR transposon jumps from ingested DNA to genome (mobile DNA element)
Identifying horizontal gene transfer
- using phylogeny (horizontally acquired gene cluster)
- using sequence properties e.g. GC content
- genes incorporated from different sources may have different baseline GC content, or different kmer usage
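One simple way to look for regions with unusual base composition is to compute GC content in windows along the sequence. This is a sketch only: the function names, window size and toy sequence are illustrative, and a real analysis would compare each window against the genome-wide distribution.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """GC content of each full window, as (start position, GC fraction)."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: an AT-rich stretch followed by a GC-rich stretch.
toy = "ATATATATGCGCGCGC"
for start, gc in gc_windows(toy, window=8, step=8):
    print(start, gc)
```

A window whose GC fraction is far from the rest of the genome is a candidate for horizontally acquired DNA, though not proof of it.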
Horizontal gene transfer - organismal tree
genes are transferred laterally between species e.g. up and down between C and D
Horizontal gene transfer - Gene x tree
- apparent close relationship of lineages inferred from sequences of x reflects the lateral transfer of this gene rather than the phylogeny of the organisms
Horizontal gene transfer - consensus tree
Based on multiple genes more accurately reflects the organismal phylogeny
Lateral gene transfer complicates phylogenetic relationships
- the phylogeny of four hypothetical prokaryote species, two of which have been involved in a lateral transfer of gene x
- a tree based only on gene x shows the phylogeny of the laterally transferred gene, rather than the organismal phylogeny
- a consensus tree based on multiple genes is more likely to reflect the true organismal phylogeny, especially if those genes come from a stable core of genes involved in fundamental processes
Factors that help an organism invade the host
- Cell attachment - adhesins, fimbriae etc.
- Capsules - prevent attack by macrophages and digestion
- Degrading enzymes - hyaluronidase, proteases, lipases
Factors that help an organism evade the hosts defences
Toxins
- endotoxins and exotoxins
Immunosuppressants
- e.g. anti-immunoglobulin proteases
Endotoxins
part of the bacterial structure e.g. lipopolysaccharide
Exotoxins
secreted by bacteria e.g. shiga toxin, pertussis toxin, cholera toxin, botulinum toxin (botox)
Virulence factors
toxins etc and where they are coded
B.cereus strains
- they have incorporated the virulence factors into their genome
- do not have a plasmid
B.anthracis
- two plasmids
- pXO1 contains the toxins
- pXO2 encodes the capsule, which prevents phagocytosis
Sterne strain (34F2)
has lost the pXO2 plasmid and is used for immunization of domesticated animals worldwide
Anthrax in chimpanzees
In 2010 a group of researchers identified a bacterium responsible for a fatal anthrax-like disease in chimpanzees which had closer sequence similarity to B.cereus and B.thuringiensis than to B.anthracis but contained the pXO1 and pXO2 plasmids (and a third plasmid).
RNAseq
sequencing all the RNA molecules in a cell
Metagenomics
Sequence every organism in the environment
Microbial genomes are minimal
if a gene isn’t required then it tends to be lost
Microbial genomes reflect the biology
the genes tell us about the life of the organism
Microbial genomes are plastic
they are reshaped with additional plasmids, transposons etc. to add new functions
Genome sequencing is opening up…
… new areas for study
Large scale sequencing can…
identify disease loci through genome wide association studies