Microbial Genomics Flashcards
How does Sanger sequencing work? (hint - What does it exploit?)
What is this method an example of?
Exploits dideoxynucleotides (ddNTPs), structural analogues of the deoxynucleotides (dNTPs) that are the standard building blocks of DNA, but which lack the 3'-OH group on the sugar
This means they can be incorporated into a DNA strand during DNA synthesis, but do not permit further extension of the chain (there is no 3'-OH for the next nucleotide to attach to)
Sequencing by synthesis
What was the process of the original Sanger sequencing?
What was the original drawback and how was this improved?
As each nucleotide is added to the chain, there is a chance that a terminator nucleotide will be added instead
If that happens, no more bases can be added to that copy, and we end up with a truncated sequence
In Sanger’s original method, 4 separate reactions are carried out, each containing all 4 dNTPs plus one of the 4 ddNTPs
- Products are run side-by-side on a polyacrylamide gel to separate them according to size, and the sequence can be read off the gel from the bottom upwards
Use of fluorescently-labelled dideoxy bases meant that sequencing could be performed in a single capillary tube, with the bases distinguished by the colour of the fluorescent label
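Reading a sequence off the gel can be sketched in a few lines of Python. This is a toy model, not a simulation of the chemistry: it assumes every possible terminated fragment is produced, and the function names `sanger_lanes` and `read_gel` are made up for illustration.

```python
def sanger_lanes(new_strand):
    """Return the fragment lengths seen in each of the four ddNTP reactions.

    A fragment of length i+1 appears in the lane of the base at position i,
    because that is where the terminator for that base was incorporated.
    """
    lanes = {base: [] for base in "ACGT"}
    for i, base in enumerate(new_strand):
        lanes[base].append(i + 1)
    return lanes


def read_gel(lanes):
    """Read the gel from the bottom (shortest fragment) upwards."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTACA")
print(lanes["A"])       # fragment lengths ending in A
print(read_gel(lanes))  # sequence of the newly synthesised strand
```

The same logic applies to the capillary method: the lanes collapse into one tube and the base identity comes from the colour of the label instead of the lane.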
What was the first microbial genome sequenced?
What were 2 key features?
Bacteriophage ΦX174 ; Phage for E. coli
Genome is extremely compact
Overlapping open reading frames; Same section of DNA encodes two different proteins
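A minimal sketch of how one stretch of DNA can be read in more than one frame. The sequence is invented and is not taken from ΦX174; `codons` is a hypothetical helper.

```python
def codons(seq, frame):
    """Split seq into complete codons, starting at offset `frame` (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


dna = "ATGACCTGA"
print(codons(dna, 0))  # codons read in frame 0
print(codons(dna, 1))  # the same bases read in frame 1 give different codons
```

Because the codons differ between frames, two overlapping ORFs can specify two different amino acid sequences from the same bases.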
What is a commonly used model organism?
How was it sequenced?
E. coli K-12
Clone-by-clone sequencing
- 250 kb sections of the E. coli genome were cloned into bacteriophage λ, and ordered based on information from a genetic map
What was the first species whose genome was sequenced via whole-genome shotgun sequencing?
Haemophilus influenzae Rd in 1995
Explain shotgun sequencing
Why might this be preferred over other sequencing methods?
The genome is randomly fragmented (e.g. using restriction enzymes), and the fragments are sorted by size and cloned into a vector
The vector is inserted into E. coli to clone (amplify) each fragment
Followed by computational assembly of the complete chromosome
Suitable for small genomes
G+C base composition of H. influenzae (38%) is similar to that of humans
Does not rely on a physical map to order the sequence (none was available for H. influenzae)
Give more detail about the assembly step of shotgun sequencing (hint - paired and long)
Drawback that prevents complete chromosome sequencing? (hint - not cloned)
How is this addressed?
Reads are obtained from either end of each DNA fragment
Reads can be computationally assembled to produce long sequences
Likely to be gaps due to regions of the genome which could not be cloned, or due to repetitive sequences which could not be resolved during the assembly process
A “finishing” step is needed to manually close gaps in the assembly through the use of molecular biology methods such as PCR, Southern blots and sequencing
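The assembly step can be illustrated with a toy greedy overlap assembler. This is a sketch under strong assumptions (error-free reads, all on the same strand, no repeats) and is not how production assemblers work; `overlap` and `greedy_assemble` are hypothetical names.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge; remaining breaks are gaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))  # one contig
print(greedy_assemble(["ATGGCGT", "CCCAAAT"]))             # two contigs, one gap
```

The second call shows why finishing is needed: when no read spans a region, the assembler returns separate contigs and cannot say how they join.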
How is the chance of errors reduced in shotgun sequencing?
Enough sequence data will be obtained to cover the genome several times over with each base pair in the genome being sequenced multiple times
- This helps to reduce errors in the assembly, as any errors in one read will not be present in the other reads covering the same region
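A small sketch of why multiple coverage suppresses read errors. It assumes the reads have already been placed on a common coordinate system, and the reads themselves are invented; `mean_coverage` uses the usual estimate (number of reads × read length ÷ genome length).

```python
from collections import Counter


def mean_coverage(n_reads, read_len, genome_len):
    """Average number of times each base is expected to be sequenced."""
    return n_reads * read_len / genome_len


def consensus(aligned_reads):
    """Majority vote per position over (start, sequence) pairs."""
    columns = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


reads = [
    (0, "ATGGCG"),
    (2, "GGCGTA"),
    (1, "TGACGT"),  # contains one error: A instead of G at genome position 3
]
print(consensus(reads))
print(mean_coverage(10, 100, 500))
```

The erroneous A is outvoted by the two other reads covering the same position, so it does not appear in the consensus.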
How did we discover an essential minimal gene set for life? (hint - H. influenzae and M. genitalium)
Comparison of H. influenzae and M. genitalium showed 240 genes conserved
Further study showed it was closer to 256 genes
What is non-orthologous replacement?
When the same step in an essential pathway is performed by unrelated (non-orthologous) proteins in two different species
What are FUN genes and how did E. coli sequencing play a role in their discovery?
Function UnknowN (FUN) genes
Whole-genome sequencing of E. coli K-12 identified a large number of ORFs (~20% of genome) which had no known function and no similarity to previously characterised sequences
When has sequencing helped identify how a pathogen evades immune interactions? (hint - Campylobacter jejuni)
With Campylobacter jejuni
Sequencing revealed a large number of hypervariable sequences
These regions can change in length during DNA replication due to strand slippage, altering the sequence or expression of some genes, often ones associated with the synthesis of surface structures
What did sequencing of M. tuberculosis provide insights into? (hint - antigenic)
What has availability of the reference genome sequence enabled? (hint - diversity)
What else has this sequencing information allowed us to characterise? (hint - vaccine)
Genes potentially associated with antigenic variation and pathogenicity, and the large portion of the genome devoted to lipid metabolism
Enabled characterisation of the global genetic diversity of M. tuberculosis strains and provided insights into the development of antimicrobial resistance
It has also allowed characterisation of the BCG vaccine, an attenuated derivative of a closely related organism which has a genome 99.95% identical to M. tuberculosis, but with a reduced genome size due to a series of deletions
What was discovered to be the cause of some gaps when shotgun sequencing E. coli? (hint - toxic)
What has this helped develop? (hint - novel)
Genes/fragments could not be cloned because they encoded gene products which are toxic to E. coli
Such toxic genes are of considerable potential in the development of biotechnological applications and novel antimicrobial therapies
What is a likely source for 20% of genes in E. coli? (hint - not randomly distributed)
Horizontal gene transfer
How do we identify likely regions of horizontal gene transfer?
It is believed that each bacterial genome evolves towards a particular GC content
Any genes acquired through phage or recombination may not have a GC content that is typical of the genome as a whole
- Atypical GC content is a marker of horizontal gene transfer
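A minimal sliding-window sketch of this idea. The window size, step and threshold are arbitrary illustrative values, and the "genome" is synthetic; real analyses use much larger windows and often additional signals such as codon usage.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def atypical_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, GC) for windows whose GC differs from the genome-wide value."""
    overall = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - overall) > threshold:
            flagged.append((start, round(gc, 3)))
    return flagged


# Synthetic genome: GC-rich backbone with an AT-rich insert at position 100
genome = "GC" * 50 + "AT" * 10 + "GC" * 50
print(atypical_windows(genome, window=20, step=20, threshold=0.2))
```

Only the window covering the insert is flagged, which is the pattern you would look for when scanning for candidate horizontally transferred regions.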
What is the strain E. coli O157:H7 often associated with?
What type of E. coli is this and what does it produce? (hint - EHEC)
Pathogenic E. coli outbreaks
Enterohaemorrhagic E. coli (EHEC)
- Produces Shiga toxin and is associated with haemorrhagic colitis and haemolytic uraemic syndrome (HUS), which can lead to kidney failure and is sometimes fatal
What is often a driving factor leading to the sequencing of certain microbes?
Importance as a pathogen e.g. E. coli strain O157:H7
When sequencing O157:H7 EDL933 and K-12 we discovered regions called O-islands and K-islands. What are these?
Why was their discovery a surprise? (hint - expected similarity)
O-islands - Clustered regions of extra DNA in the O157:H7 strain
K-islands - Regions unique to E. coli K-12
It was expected that the O157 genome would be similar to K-12 plus genes associated with pathogenicity
Several of the O- and K-islands were located at the same position in the genome; Why?
Occurs as there are recombination hotspots in the genome – Easy to acquire horizontally transferred DNA at these positions
What type of E. coli is strain CFT073? (hint - uropath)
They are not harmful in intestines, but where do they become pathogenic and cause infections?
What does ExPEC mean?
Uropathogenic E. coli (UPEC)
Associated with urinary tract infections (UTIs)
Extraintestinal pathogenic E. coli (ExPEC)
Explain the genome ‘patchwork’ structure
Shared co-linear backbone interrupted by strain-specific islands
What did the first large-scale comparison of bacterial genomes uncover (8 strains of Streptococcus agalactiae)? (hint - 3 types of genome)
Core genome – Total set of genes conserved across all strains of a species
Dispensable (accessory) genome – Non-core genes; Present in some strains but missing from at least one other member of the species
Pan-genome – Total non-redundant set of genes associated with any strain of a species
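The three definitions map directly onto set operations. A minimal sketch with invented strain and gene names:

```python
def pangenome_partitions(strains):
    """strains: dict mapping strain name -> set of gene names."""
    gene_sets = list(strains.values())
    core = set.intersection(*gene_sets)  # genes present in every strain
    pan = set.union(*gene_sets)          # non-redundant set of all genes
    accessory = pan - core               # genes missing from at least one strain
    return core, accessory, pan


# Hypothetical gene content for three strains
strains = {
    "strain_1": {"gyrA", "recA", "capA"},
    "strain_2": {"gyrA", "recA", "toxB"},
    "strain_3": {"gyrA", "recA", "capA", "phgC"},
}
core, accessory, pan = pangenome_partitions(strains)
print(sorted(core), sorted(accessory), len(pan))
```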
How do we estimate size of core genome?
Randomising the sequencing order of the genomes, and looking at how the size of the core genome reduces as additional genomes are added to the analysis
Done lots of times, and the median size of the core genome is calculated as each additional genome is added
By adding a trendline we can estimate when the size of the core genome would plateau (i.e. the point at which the trend line would be horizontal)
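The randomise-and-take-the-median procedure can be sketched as follows. The gene sets are invented, the number of permutations is arbitrary, and the trend-line fit itself is left out; `core_genome_curve` is a hypothetical name.

```python
import random
from statistics import median


def core_genome_curve(gene_sets, n_permutations=200, seed=0):
    """Median core-genome size after 1, 2, ..., N genomes, over random orders."""
    rng = random.Random(seed)
    sizes = [[] for _ in gene_sets]
    for _ in range(n_permutations):
        order = gene_sets[:]
        rng.shuffle(order)          # randomise the order genomes are added
        core = set(order[0])
        for i, genes in enumerate(order):
            core &= genes           # core can only shrink as genomes are added
            sizes[i].append(len(core))
    return [median(s) for s in sizes]


gene_sets = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "e"}]
print(core_genome_curve(gene_sets))
```

The value at which this curve flattens, once enough genomes have been added, is the estimate of the core genome size.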
Similar methods for uncovering core genome were used to uncover E. coli pangenome. What did we find?
Trend line does not plateau, instead it approaches a straight line sloped upwards
This is because E. coli (and S. agalactiae) have “open pangenomes”, which are effectively infinite in size
Difference between closed and open pangenome?
What is it possible to do with closed but not open pangenome organisms?
Give closed pangenome organism example
Closed pangenomes (e.g. Yersinia pestis) are finite in size as the organism doesn’t pick up additional DNA as easily
Open pangenomes are effectively infinite in size as the organism picks up DNA easily
With closed-pangenome organisms it is possible to comprehensively characterise all of the genes in the pangenome by sequencing just a few strains
What is an alternative method of determining the open or closed nature of an organism's pangenome?
What does this plateau to for E. coli?
Estimate how many new genes are discovered with each genome sequenced
For E. coli this plateaus to a non-zero value of around 300 genes, meaning that you can continue to sequence even large numbers of E. coli genomes, and you will keep on identifying new genes indefinitely
- No matter how many independent isolates you sequence, on average you will find ~300 new genes in each
- E. coli has the ability to pick up genes from anywhere and everywhere; Will always find new genes
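A sketch of the "new genes per genome" estimate, using the same randomised-order idea. The gene sets are invented and far too small to show a real plateau; `new_genes_per_genome` is a hypothetical name.

```python
import random
from statistics import mean


def new_genes_per_genome(gene_sets, n_permutations=200, seed=0):
    """Mean number of previously unseen genes contributed by the 1st, 2nd, ... genome."""
    rng = random.Random(seed)
    counts = [[] for _ in gene_sets]
    for _ in range(n_permutations):
        order = gene_sets[:]
        rng.shuffle(order)
        seen = set()
        for i, genes in enumerate(order):
            counts[i].append(len(genes - seen))  # genes not seen in earlier genomes
            seen |= genes
    return [mean(c) for c in counts]


gene_sets = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "e"}]
print(new_genes_per_genome(gene_sets))
```

For a closed pangenome this curve would be expected to fall towards zero; for an open one it levels off at a non-zero value, as described above for E. coli.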
What is 1 similarity and difference between Illumina and Sanger sequencing?
What is a drawback of Illumina sequencing?
Illumina is also an example of sequencing by synthesis
Illumina can sequence millions of molecules simultaneously - Massively parallel sequencing
Reads aren’t very long; ~100bp for each sequence vs ~800bp in Sanger sequencing
- Makes it harder to piece together genome
Bridge amplification is used to amplify fragment of genome on flow cell for sequencing. Explain the process
The adaptor at one end of a single template molecule hybridises to a primer sequence attached to the surface, and the opposite end can fold over to hybridise to an adjacent primer
Addition of DNA polymerase allows the production of a second copy of the template
Both of these copies can repeat the process, and this continues through multiple cycles until there is a cluster of identical molecules
Across the surface there will be millions of clusters, each representing a different fragment of the original genome
How do we sequence clusters produced via bridge amplification?
Synthesising complementary strand using fluorescently-labelled nucleotides
Reversible terminator nucleotides – The fluorescent label and terminating group are cleaved off after imaging to allow chain extension in the next cycle
Illumina sequencing can rapidly generate short sequence reads. What are these assembled into?
How can we order these?
Chunks (contigs), but still require finishing
Contigs can be placed into order by comparison with a closely-related complete genome
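Ordering contigs against a reference can be sketched very simply. This toy version assumes each contig matches the reference exactly and on the same strand, which real data will not; in practice an alignment tool would be used instead of `str.find`. The sequences and the name `order_contigs` are invented.

```python
def order_contigs(contigs, reference):
    """Sort contigs by where they match the reference; report those that do not match."""
    placed, unplaced = [], []
    for contig in contigs:
        pos = reference.find(contig)  # -1 if the contig is absent from the reference
        if pos >= 0:
            placed.append((pos, contig))
        else:
            unplaced.append(contig)
    return [contig for _, contig in sorted(placed)], unplaced


reference = "ATGGCGTACGTTAGC"
ordered, unplaced = order_contigs(["TTAGC", "ATGG", "GTAC", "CCCC"], reference)
print(ordered)
print(unplaced)  # could be strain-specific DNA missing from the reference
```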
Why are most genomes left at the draft stage?
Finishing is more expensive than generating a draft
What has helped more draft genomes to be finished?
The advent of third generation sequencing (Oxford Nanopore/PacBio)
Campylobacter is a food-poisoning bacterium in undercooked chicken. Host-switching was found to be common, but specific lineages of the phylogenetic tree seemed to be associated with particular hosts.
What did studies allow us to identify and how? (i.e. what is this an example of? - GWAS)
Bacterial genome-wide association study to identify genes associated with particular phenotypes
By looking for genes which were over-represented in strains from a particular host, researchers were able to identify vitamin B5 biosynthesis as a host-specificity factor
Strains from chicken will only grow in the presence of vitamin B5, since they lack the genes necessary to synthesise it; It was suggested that this was an adaptation to the diet of the host
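One simple way to score "over-representation" is a one-sided hypergeometric (Fisher-style) test on gene presence counts. This is an illustrative choice, not necessarily the statistic used in the study, and the counts below are invented.

```python
from math import comb


def enrichment_p_value(with_gene_in_host, host_total, with_gene_total, total):
    """P(at least this many host isolates carry the gene) if gene and host were unrelated."""
    upper = min(host_total, with_gene_total)
    favourable = sum(
        comb(with_gene_total, k) * comb(total - with_gene_total, host_total - k)
        for k in range(with_gene_in_host, upper + 1)
    )
    return favourable / comb(total, host_total)


# Hypothetical: 10 isolates, 5 from one host, 5 carry the gene, all 5 in that host
print(enrichment_p_value(5, 5, 5, 10))
```

A small p-value for a gene flags it as a candidate host-association factor; across a whole genome the p-values would also need correcting for multiple testing and for shared ancestry between strains.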