Microbial Genomics Flashcards
How does Sanger sequencing work? (hint - What does it exploit?)
What is this method an example of?
Exploits dideoxy-nucleotides, structural analogues of deoxy-nucleotides (the standard building blocks of DNA) which lack the 3′ OH group on the sugar
This means they can be incorporated into a DNA strand during DNA synthesis, but do not permit further extension of the chain (no 3′ OH for the next nucleotide to attach to)
Sequencing by synthesis
What was the process of the original Sanger sequencing method?
What was the original drawback and how was this improved?
As each nucleotide is added to the chain, there is a chance that a terminator nucleotide will be added instead
If that happens, no more bases can be added to that copy, and we end up with a truncated sequence
In Sanger’s original method, 4 separate reactions are carried out, each containing all 4 dNTPs plus one of the 4 ddNTPs
- Products are run side-by-side on a polyacrylamide gel to separate them according to size, and the sequence can be read off the gel from the bottom upwards
Use of fluorescently-labelled dideoxy bases meant that sequencing could be performed in a single capillary tube, with the bases distinguished by the colour of the fluorescent label
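A minimal sketch of how the sequence is read off a four-lane gel, assuming idealised chemistry in which every possible terminated fragment is produced and resolved. The function names `terminated_lengths` and `read_gel` are illustrative only.

```python
# Toy model of Sanger's four-lane method (assumption: every possible
# truncated product is present, with no noise or missing bands).

def terminated_lengths(new_strand, dd_base):
    """Lengths of the fragments that end where dd_base was incorporated."""
    return [i + 1 for i, base in enumerate(new_strand) if base == dd_base]

def read_gel(new_strand):
    """Run four lanes (one per ddNTP) and read bands from shortest to longest."""
    lanes = {b: terminated_lengths(new_strand, b) for b in "ACGT"}
    # Each band is (fragment length, lane). The shortest fragment runs furthest,
    # so reading the gel from the bottom upwards means sorting by length.
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)

print(read_gel("GATTACA"))
```

Each band's lane gives the identity of the base and its position on the gel gives the base's position in the newly synthesised strand.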
What was the first microbial genome sequenced?
What were 2 key features?
Bacteriophage ΦX174 ; Phage for E. coli
Genome is extremely compact
Overlapping open reading frames; Same section of DNA encodes two different proteins
What is a commonly used model organism?
How was it sequenced?
E. coli K-12
Clone-by-clone sequencing
- 250 kb sections of the E. coli genome were cloned into bacteriophage λ, and ordered based on information from a genetic map
What was the first species sequenced via shotgun sequencing?
Haemophilus influenzae Rd in 1995
Explain shotgun sequencing
Why might this be preferred over other sequencing methods?
The genome is randomly fragmented (e.g. using restriction enzymes), and the fragments are sorted by size and cloned into a vector
The vector is inserted into E. coli to clone each fragment
Followed by computational assembly of the complete chromosome
Small genomes
G+C base composition similar to that of humans (38%)
No physical map available for ordering of genetic information
Give more detail about the assembly step of shotgun sequencing (hint - paired and long)
Drawback that prevents complete chromosome sequencing? (hint - not cloned)
How is this addressed?
Reads are obtained from either end of each DNA fragment
Reads can be computationally assembled to produce long sequences
Likely to be gaps due to regions of the genome which could not be cloned, or due to repetitive sequences which could not be resolved during the assembly process
A “finishing” step is needed to manually close gaps in the alignment through the use of molecular biology methods such as PCR, Southern blots and sequencing
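A rough sketch of the assembly idea, assuming error-free reads and a simple greedy strategy (repeatedly merge the two sequences with the longest overlap). Real assemblers use more sophisticated overlap or de Bruijn graph methods; the function names and the `min_len` cut-off here are illustrative assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: remaining contigs are separated by gaps
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs

print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
print(greedy_assemble(["AAAA", "CCCC"]))  # no overlap, so two contigs remain
```

When no overlapping reads span a region, the assembly stays in separate contigs, which is the situation the finishing step has to resolve. Repeats longer than a read can also make a greedy merge join the wrong pieces.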
How is the chance of errors reduced in shotgun sequencing?
Enough sequence data will be obtained to cover the genome several times over with each base pair in the genome being sequenced multiple times
- This helps to reduce errors in the assembly, as any errors in one read will not be present in the other reads covering the same region
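A small sketch of why depth of coverage helps, assuming reads land at random positions (a Poisson assumption, as in the Lander-Waterman model) and that reads have already been aligned to one another. The function names and toy reads are illustrative.

```python
import math
from collections import Counter

def mean_coverage(n_reads, read_len, genome_len):
    """Average number of times each base is sequenced: N * L / G."""
    return n_reads * read_len / genome_len

def fraction_uncovered(coverage):
    """Expected fraction of bases hit by no read, if reads land at random."""
    return math.exp(-coverage)

def consensus(aligned_reads):
    """Majority vote per column; '-' marks positions a read does not cover."""
    result = []
    for column in zip(*aligned_reads):
        calls = Counter(base for base in column if base != "-")
        result.append(calls.most_common(1)[0][0] if calls else "N")
    return "".join(result)

print(mean_coverage(10_000, 500, 1_000_000))
# The middle read carries an error at position 3; the other reads outvote it.
print(consensus(["ACGT-", "ACCTA", "-CGTA"]))
```

A single erroneous base call is outvoted as long as most reads covering that position are correct, which is why higher coverage gives a more reliable consensus.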
How did we discover an essential minimal gene set for life? (hint - H. influenzae and M. genitalium)
Comparison of H. influenzae and M. genitalium showed 240 genes conserved
Further study showed it was closer to 256 genes
What is non-orthologous replacement?
When the same intermediate step in an essential, highly conserved pathway is performed by non-homologous proteins in 2 different species
What are FUN genes and how did E. coli sequencing play a role in their discovery?
Function UnknowN (FUN) genes
Whole-genome sequencing of E. coli K-12 identified a large number of ORFs (~20% of genome) which had no known function and no similarity to previously characterised sequences
When has sequencing helped identify how a pathogen evades immune interactions? (hint - Campylobacter jejuni)
With Campylobacter jejuni
Sequencing revealed a large number of hypervariable sequences
Regions can vary in length during DNA replication due to strand slippage, resulting in changes in the sequence or expression of some genes, often ones associated with the synthesis of surface structures
What did sequencing of M. tuberculosis provide insights into? (hint - antigenic)
What has availability of the reference genome sequence enabled? (hint - diversity)
What vaccine have we characterised, and how has this sequencing information helped? (hint - vaccine)
Genes potentially associated with antigenic variation and pathogenicity, and the large portion of the genome devoted to lipid metabolism
Enabled characterisation of the global genetic diversity of M. tuberculosis strains and provided insights into the development of antimicrobial resistance
It has also allowed characterisation of the BCG vaccine, an attenuated derivative of a closely related organism which has a genome 99.95% identical to M. tuberculosis, but with a reduced genome size due to a series of deletions
What was discovered to be the cause of some gaps when shotgun sequencing E. coli? (hint - toxic)
What has this helped develop? (hint - novel)
Genes/fragments could not be cloned because they encoded gene products which are toxic to E. coli
Such toxic genes are of considerable potential in the development of biotechnological applications and novel antimicrobial therapies
What is a likely source for 20% of genes in E. coli? (hint - not randomly distributed)
Horizontal gene transfer
How do we identify likely regions of horizontal gene transfer?
Believed that bacterial genomes evolve towards a particular GC content
Any genes acquired through phage or recombination may not have GC content that is typical of genome as a whole
- Atypical GC content is a marker of horizontal gene transfer
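A minimal sketch of a sliding-window GC scan that flags windows whose GC content differs from the genome-wide value. The window size, step and `threshold` are arbitrary assumptions, and the toy genome is made up for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def atypical_windows(genome, window=1000, step=500, threshold=0.10):
    """Start positions of windows whose GC content differs from the
    genome-wide value by more than `threshold` (an arbitrary cut-off)."""
    genome_gc = gc_content(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        window_gc = gc_content(genome[start:start + window])
        if abs(window_gc - genome_gc) > threshold:
            flagged.append(start)
    return flagged

# Toy genome: 50% GC backbone with a 100 bp AT-only insert in the middle.
toy = "ATGC" * 100 + "AT" * 50 + "ATGC" * 100
print(atypical_windows(toy, window=100, step=100, threshold=0.10))
```

Atypical GC content is only a marker: a flagged window is a candidate for horizontal transfer, not proof of it.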
What is the strain E. coli O157:H7 often associated with?
What type of E. coli is this and what does it produce? (hint - EHEC)
Pathogenic E. coli outbreaks
Enterohaemorrhagic E. coli (EHEC)
- Produces Shiga toxin and is associated with haemorrhagic colitis and haemolytic uraemic syndrome (HUS), which can lead to kidney failure and is sometimes fatal
What is often a driving factor leading to the sequencing of certain microbes?
Importance as a pathogen e.g. E. coli strain O157:H7
When sequencing O157:H7 EDL933 and K-12 we discovered regions called O-islands and K-islands. What are these?
Why was their discovery a surprise? (hint - expected similarity)
O-islands - Clustered regions of extra DNA in the O157:H7 strain
K-islands - Regions unique to E. coli K-12
It was expected that the O157 genome would be similar to K-12 plus genes associated with pathogenicity
Several of the O- and K-islands were located at the same position in the genome; Why?
Occurs as there are recombination hotspots in the genome – Easy to acquire horizontally transferred DNA
What type of E. coli is strain CFT073? (hint - uropath)
They are not harmful in intestines, but where do they become pathogenic and cause infections?
What does ExPEC mean?
Uropathogenic E. coli (UPEC)
Associated with urinary tract infections (UTIs)
Extraintestinal pathogenic E. coli (ExPEC)
Explain the genome ‘patchwork’ structure
Shared co-linear backbone interrupted by strain-specific islands
What did the first large-scale comparison of bacterial genomes uncover (8 strains of Streptococcus agalactiae)? (hint - 3 types of genome)
Core genome – Total set of genes conserved across all strains of a species
Dispensable (accessory) genome – Non-core genes present in some strains but absent from at least one other member of the species
Pan-genome – Total non-redundant set of genes associated with any strain of a species
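The three definitions map directly onto set operations. A minimal sketch, assuming each strain is represented as a set of gene names (the strain and gene names below are hypothetical):

```python
def pangenome_summary(strains):
    """strains maps strain name -> set of gene names (hypothetical data)."""
    gene_sets = list(strains.values())
    core = set.intersection(*gene_sets)  # present in every strain
    pan = set.union(*gene_sets)          # present in at least one strain
    accessory = pan - core               # missing from at least one strain
    return core, accessory, pan

strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneB", "geneC", "geneE"},
}
core, accessory, pan = pangenome_summary(strains)
print(sorted(core), sorted(accessory), sorted(pan))
```

In practice the hard part is deciding which genes in different strains count as "the same gene", which this sketch assumes has already been done.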
How do we estimate size of core genome?
Randomising the sequencing order of the genomes, and looking at how the size of the core genome reduces as additional genomes are added to the analysis
Done lots of times, and the median size of the core genome is calculated as each additional genome is added
By adding a trendline we can estimate when the size of the core genome would plateau (i.e. the point at which the trend line would be horizontal)
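A minimal sketch of the randomised-order procedure, assuming each genome is a set of gene names. The number of random orders (`n_orders`) and the fixed `seed` are arbitrary choices, and fitting the trendline itself is left out.

```python
import random
from statistics import median

def core_curve(strains, n_orders=100, seed=0):
    """Median core-genome size after 1, 2, ... genomes over random orders."""
    rng = random.Random(seed)
    gene_sets = list(strains.values())
    sizes = [[] for _ in gene_sets]
    for _ in range(n_orders):
        order = gene_sets[:]
        rng.shuffle(order)  # randomise the order in which genomes are added
        core = set(order[0])
        for k, genes in enumerate(order):
            core &= genes   # keep only genes shared by all genomes so far
            sizes[k].append(len(core))
    return [median(s) for s in sizes]

strains = {
    "s1": {"a", "b", "c", "d"},
    "s2": {"a", "b", "c"},
    "s3": {"a", "b", "e"},
}
print(core_curve(strains))
```

The curve can only stay flat or fall as genomes are added; the value it levels off at is the estimate of the core genome size.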
Similar methods for uncovering core genome were used to uncover E. coli pangenome. What did we find?
Trend line does not plateau, instead it approaches a straight line sloped upwards
This is because E. coli (and S. agalactiae) have “open pangenomes”, which are effectively infinite in size
Difference between closed and open pangenome?
What is it possible to do with closed but not open pangenome organisms?
Give closed pangenome organism example
Closed pangenomes (e.g. Yersinia pestis) are finite in size, as these organisms don’t pick up additional DNA as easily
Open pangenomes are effectively infinite in size, as these organisms pick up DNA easily
With such organisms it is possible to comprehensively characterise all of the genes in the pangenome by sequencing just a few strains
What is an alternative method of determining the open or closed nature of an organism’s pangenome?
What does this plateau to for E. coli?
Estimate how many new genes are discovered with each genome sequenced
For E. coli this plateaus to a non-zero value of around 300 genes, meaning that you can continue to sequence even large numbers of E. coli genomes, and you will keep on identifying new genes indefinitely
- No matter how many independent isolates you sequence, on average you will find ~300 new genes in each
- E. coli has ability to pick up genes from anywhere and everywhere; Will always find new genes
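A short sketch of the "new genes per genome" count, assuming genomes are given as sets of gene names in the order they were sequenced (toy data). For a closed pangenome the counts fall to zero; for an open one they level off at a non-zero value.

```python
def new_genes_per_genome(gene_sets):
    """Number of previously unseen genes contributed by each genome, in order."""
    seen = set()
    counts = []
    for genes in gene_sets:
        counts.append(len(genes - seen))  # genes not found in earlier genomes
        seen |= genes
    return counts

print(new_genes_per_genome([{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]))
```

As with the core genome estimate, a real analysis would average these counts over many random orderings of the genomes.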
What is 1 similarity and difference between Illumina and Sanger sequencing?
What is a drawback of Illumina sequencing?
Illumina also is an example of sequencing by synthesis
Illumina can sequence millions of molecules simultaneously - Massively parallel sequencing
Reads aren’t very long; ~100bp for each sequence vs ~800bp in Sanger sequencing
- Makes it harder to piece together genome
Bridge amplification is used to amplify fragment of genome on flow cell for sequencing. Explain the process
The adaptor at one end of a single template molecule hybridises to a primer sequence attached to the surface, and the opposite end can fold over to hybridise to an adjacent primer
Addition of DNA polymerase allows the production of a second copy of the template
Both of these copies can repeat the process, and this continues through multiple cycles until there is a cluster of identical molecules
Across the surface there will be millions of clusters, each representing a different fragment of the original genome
How do we sequence clusters produced via bridge amplification?
Synthesising complementary strand using fluorescently-labelled nucleotides
Reversible terminator nucleotides – Cleave off fluorescent label after imaging to allow for chain extension
Illumina sequencing can rapidly generate short sequence reads. What are these assembled into?
How can we order these?
Chunks (contigs), but still require finishing
Contigs can be placed into order by comparison with a closely-related complete genome
Why are most genomes left at the draft stage?
Finishing is more expensive than generating a draft
What has helped more draft genomes to be finished?
The advent of third generation sequencing (Oxford Nanopore/PacBio)
Campylobacter is a food-poisoning bacterium in undercooked chicken. Host-switching was found to be common, but specific lineages of the phylogenetic tree seemed to be associated with particular hosts.
What did studies allow us to identify and how? (i.e. what is this an example of? - GWAS)
Bacterial genome-wide association study to identify genes associated with particular phenotypes
Looking for genes which were over-represented in strains from a particular host, they were able to identify vitamin B5 biosynthesis as a host-specificity factor
Strains from chicken will only grow in the presence of vitamin B5, since they lack the genes necessary to synthesise it; It was suggested that this was an adaptation to the diet of the host
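A minimal sketch of the kind of test behind "over-represented in strains from a particular host": a one-sided Fisher's exact test on a 2x2 table of gene presence against host. The counts are made up, and this ignores the corrections for population structure and multiple testing that a real bacterial GWAS needs.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(at least `a` gene-positive strains in host 1) under no association.

    2x2 table:      gene present   gene absent
        host 1           a              b
        host 2           c              d
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    upper = min(row1, col1)
    # Hypergeometric tail: sum over tables at least as extreme as the observed one.
    return sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1)) / total

# Hypothetical counts: gene present in 10/10 host 1 strains, 0/10 host 2 strains.
print(fisher_one_sided(10, 0, 0, 10))
# No association: gene present in half of the strains from each host.
print(fisher_one_sided(5, 5, 5, 5))
```

A very small p-value for a gene suggests its presence is associated with that host, which is then followed up experimentally (as with the vitamin B5 growth tests).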
How were E. coli and Shigella distinguished in the pre-molecular era?
Basis of motility, metabolic profile and clinical manifestation
What is serotyping?
What is meant when 2 bacteria have the same serotype?
What 3 features were demonstrated to be useful for distinguishing different strains?
Serotyping involves raising antibodies against particular features on the cell surface, and seeing whether they cross react between different strains
If they cross react (recognise both strains) then the strains must be closely related – Same serotype
O (lipopolysaccharide) antigen, H (flagellar) antigen and the K (capsular) antigen are useful for distinguishing between strains
What is a pathovar?
What pathovar was associated with 1940s outbreaks of E. coli?
Particular serotypes associated with outbreaks
Enteropathogenic E. coli (EPEC)
EPEC was originally defined by serology, but is now based on what?
Interactions with host cells
- Characteristic attaching and effacing (A/E) lesions in the ileum
There are 4 pathovars other than EPEC, what are they?
Enterohaemorrhagic E. coli (EHEC)
Enterotoxigenic E. coli (ETEC)
2 other pathovars defined as distinct from EPEC based on their conformation when adhering to Hep-2 cells:
- Enteroaggregative E. coli (EAEC)
- Diffuse Adherent E. coli (DAEC)
What is the conformation when infecting Hep-2 cells for EPEC, EAEC and DAEC?
EPEC - Forms tight clusters
EAEC - Forms a “stacked-brick” pattern with cells adhering to each other
DAEC - Defined based on diffuse adherence pattern; Not in association with each other or host cells
What was one early way of measuring the evolutionary relatedness between different bacteria? (hint - DNA and temperature)
DNA-DNA hybridisation
DNA strands from 2 strains are hybridised together, and by measuring the temperature required to disassociate (melt) the hybrid DNA into separate strands, it is possible to estimate the degree of relatedness
- More similar means more base pairing, so higher melting point
How was measuring the electrophoretic mobility of enzymes used to quantifiably study E. coli population genetics?
If an enzyme is related between organisms, then it will show similar mobility; Variations in sequence may affect enzyme mobility – Concept of MLEE
What is Multi-Locus Enzyme Electrophoresis (MLEE)?
Why is this better than serotyping? (2 reasons)
Involves assessing the electrophoretic mobility of a series of purified enzymes; Compare mobility of different bacterial enzymes to distinguish strains
Produces quantitative molecular data which can be used to understand evolutionary relationships between strains
Early studies showed that serotyping doesn’t correlate well with genetic diversity as measured using MLEE; Genetically similar strains can have different serotypes, and distantly related strains can share the same serotype
What is the ECOR collection and how was it established?
Why were these strains selected?
What 3 factors were maximised with these strains?
Standard reference collection of 72 E. coli strains developed via MLEE
Represent the full diversity of the species
Electrophoretic diversity
Geographical distribution
Host range; Many of the selected strains originating from animals
What were the 5 phylogroups defined by phylogenetic analysis of the ECOR collection using MLEE?
A, B1, B2, D and E
What did DNA sequencing and phylogenetic comparison of strains allow us to uncover and what did this suggest? (hint - different evolution)
Some strains showed different evolutionary relationships when different genes were analysed
Suggested possibility of recombination (horizontal gene transfer) between different lineages of E. coli
Nucleotide data from genes thrB and thrC indicated there were multiple separate Shigella lineages within diversity of E. coli. What did they all show? (hint - con…)
Convergent evolution of their defining characteristics
What is Multi-Locus sequence typing (MLST)?
What are housekeeping genes?
Amplification and sequencing of ~400bp sections of 7-8 housekeeping genes distributed around different chromosomal regions
Genes involved in ‘day-to-day’ functions like metabolism and energy production; Less likely to undergo recombination
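A minimal sketch of how MLST turns sequences into a typing scheme: each distinct sequence at a locus gets an allele number, and each distinct combination of allele numbers gets a sequence type (ST). Here numbers are assigned in order of first appearance, whereas real schemes look them up in a curated database; the isolate names and sequences are made up and only 2 loci are used.

```python
def assign_alleles(isolates, loci):
    """isolates maps name -> {locus: sequence}. Returns name -> allele profile."""
    allele_numbers = {locus: {} for locus in loci}
    profiles = {}
    for name, seqs in isolates.items():
        profile = []
        for locus in loci:
            known = allele_numbers[locus]
            # A new sequence at a locus gets the next unused allele number.
            profile.append(known.setdefault(seqs[locus], len(known) + 1))
        profiles[name] = tuple(profile)
    return profiles

def sequence_types(profiles):
    """Give each distinct allele profile its own sequence type (ST) number."""
    st_numbers = {}
    return {name: st_numbers.setdefault(p, len(st_numbers) + 1)
            for name, p in profiles.items()}

# Hypothetical isolates with shortened, made-up sequences at 2 loci.
isolates = {
    "isolate_A": {"adk": "AAA", "gyrB": "CCC"},
    "isolate_B": {"adk": "AAA", "gyrB": "CCT"},
    "isolate_C": {"adk": "AAA", "gyrB": "CCC"},
}
profiles = assign_alleles(isolates, ["adk", "gyrB"])
print(profiles)
print(sequence_types(profiles))
```

Isolates with identical profiles share an ST, so strains can be compared between laboratories using a handful of numbers rather than full sequences.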
Using MLST, they characterised evolutionary origins of E. coli pathovars EHEC and EPEC. What did they uncover?
What does this suggest? (hint - genetic requirements)
Both showed evidence of multiple origins
2 separate clades of EHECs and EPECs
- Had parallel acquisition of virulence determinants e.g. virulence plasmids and toxin genes
Suggests that genetic requirements for each type of pathogenesis (e.g. genes for type II secretion system) can be acquired independently on multiple occasions
Complete genomes of 4 Shigella serovars were shown to have evolved in parallel. What led to this conclusion? (hint - deletion and mechanisms)
What did this suggest? (hint - evolution)
Genomes all showed large numbers of gene deletions
Some of the gene deletions characteristic of Shigella had occurred by different mechanisms in the different species
Suggested convergent evolution towards a Shigella phenotype; Obligate pathogens of humans
What is now the gold standard for evolutionary analysis of bacterial genomes?
Core genome phylogenetics
What was identified by Walk et al. through an MLST-based study characterising environmental isolates of E. coli? (hint - clades)
5 “cryptic clades” of E. coli; C-I to C-V
What was identified about clade C-I?
What was identified about C-II, C-V and the sister clades C-III/C-IV?
- What did later studies do? (hint - species)
Clade C-I was closely related to, but outside divergence of, existing E. coli strains
C-II, C-V and the sister clades C-III/C-IV were more divergent; Showing similar evolutionary distance from E. coli as other species
- Later studies determined that clades C-II, C-III/IV and C-V were sufficiently diverse to be defined as new Escherichia species
How does 16S rRNA sequencing of bacteria work? (hint - high and low conservation)
Most of 16S is very highly conserved, meaning it is possible to reliably amplify by PCR; Primers binding conserved regions
Variable regions (“V-loops”) can change sequence without disrupting function of RNA; Lower levels of conservation
- Primers in conserved regions used to amplify and sequence the variable regions, which are phylogenetically informative and useful for species identification
Where was 16S sequencing applied?
Woese applied 16S rRNA sequencing to define a new kingdom; “Archaebacteria” (archaea)
What is “microbial dark matter”?
How did we uncover it?
Why are these organisms important and good for investigation?
99% of bacteria, which cannot be cultured in the laboratory
Only uncovered because of 16S sequencing
Likely to include strains which produce novel antimicrobial compounds or enzymes of potential biotechnological interest
Studying these organisms can also give us insight into the biodiversity and ecology of different environments
What is one approach to sequencing microbial dark matter? (hint - SCG)
Explain this process (4 steps)
- One issue (e.g. patchy)
Single cell genomics
Individual cells are isolated, e.g. by laser capture microdissection (cutting out an individual cell)
Alternatively, cells are separated via microfluidics or cell sorting (FACS) into different containers, where genomic DNA is extracted
Genomic DNA from each cell is amplified to provide enough material for sequencing
Amplified DNA is sequenced and assembled
- However, amplification is usually uneven, and the assembled genomes will often have patchy coverage; Some regions underrepresented
How does iChip (isolation chip) allow the characterisation of microbial dark matter?
How does it work? (3 points)
It is a way of culturing organisms within their natural environment
Separates individual cells into separate wells
Wells are filled with molten agar and covered with a semi-permeable membrane (prevents contamination), and the device is placed back into the environment the sample was obtained from
This provides essential nutrients and allows a colony to grow from a single cell, to provide enough material for DNA sequencing
What are some negatives of 16S rRNA profiling?
Requires deep sequencing; Need to sequence lots and lots of PCR products
PCR primers may not be truly universal
PCR bias may result in inaccurate quantification; Depending on sequence, some PCR reactions may amplify better or worse
Contamination can be a problem as PCR is sensitive
Sequencing errors may result in over-estimation of the diversity of organisms present; Mistakenly think we’ve discovered new species
Some organisms have multiple distinct copies of the 16S rRNA gene, again leading to over-estimation of the number of species present
Only looking at the 16S gene, which only tells us about species diversity; Don’t know if it’s pathogenic or commensal etc.
What does metagenomics involve?
Sequence genomic DNA obtained from an environment, rather than just targeting the rRNA genes like in 16S rRNA sequencing
DNA is extracted from an environmental sample, fragmented and sequenced on e.g. an Illumina
What are the 2 main approaches of metagenomic analysis?
Attempt to identify the species from individual reads using software such as Kraken; Allows us to quickly compare individual sequence reads to a database and identify the species
De novo metagenome assembly; Try to assemble our reads into larger contigs
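A toy sketch of the read-classification approach, assuming a tiny made-up reference database. Kraken matches exact k-mers and resolves k-mers shared between taxa using a taxonomy; this simplified version just ignores shared k-mers and takes a majority vote. All names and sequences are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references, k):
    """Map each k-mer to the set of species whose reference contains it."""
    index = {}
    for species, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(species)
    return index

def classify(read, index, k):
    """Vote with k-mers unique to one species; None if nothing matches."""
    votes = Counter()
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        if len(hits) == 1:
            votes[next(iter(hits))] += 1
    return votes.most_common(1)[0][0] if votes else None

references = {"species_1": "ATGGCGTACGTTAGC", "species_2": "TTTTCCCCGGGGAAAA"}
index = build_index(references, k=5)
print(classify("GCGTACGT", index, k=5))
print(classify("ACACACAC", index, k=5))  # no matching k-mers
```

Reads from organisms missing from the database stay unclassified, which is one reason de novo assembly is used alongside read classification.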
How do long reads benefit assembly of metagenome?
There are fewer fragments that need assembling
What is a Metagenome Assembled Genome (MAG)?
Genome sequences assembled from microbiome samples
How does PacBio sequencing work? (remember - ZMW)
Polymerase is immobilised at the bottom of a small aluminium well
Fluorescently-labelled nucleotides are incorporated into chain
- However, the well is so small that light can only penetrate a small zone at the bottom (known as a zero-mode waveguide, ZMW)
Fluorescently-labelled nucleotide is incorporated and held within illuminated zone for a prolonged period to produce a stronger fluorescent signal than free nucleotides in solution which diffuse into and out of the ZMW
This allows the incorporated base to be identified by the colour of the label; The action of the polymerase cleaves off the fluorescent label, and allows the chain to be extended
How does Oxford Nanopore work?
Benefit of this method? (hint - size)
Motor protein unwinds two strands of a DNA molecule and feeds one through a pore protein embedded in an artificial electrically insulating membrane
The ionic current through the pore changes as the DNA strand passes through, and the signal is characteristic of the bases that are going through the pore; The resultant “squiggle” can be converted into a DNA sequence
Can sequence as long as the DNA molecule is; If we keep DNA intact then we can generate very long sequence reads
What was the human microbiome project?
What do different micro-environments show and what do they have an influence on?
Large collaborative effort to characterise the composition of the human microbiome
Different micro-environments show considerable variation in the composition of the bacterial populations present
Shown to have influence on non-infectious conditions e.g. obesity, asthma
Link between obesity and gut microbiome?
Strong link between obesity and the gut microbiome in both mice and humans
Trait is transmissible
- Germ-free mice transplanted with “obese microbiota” show a significantly increased level of total body fat
What do caesarean babies show higher rates of?
What does this show?
Higher rates of allergies and other immune conditions
Early colonisation of infants during the first few months of life is important for future health
What issues did studies show with 16S and metagenomic studies? (hint - control and false positive)
Because these studies involve PCR amplification, they are highly sensitive and prone to contamination
What is the “kit-ome” and why is it a problem?
Sequencing blank control of water discovered that bacterial DNA is commonly found as a contaminant in the DNA extraction reagents
If there is not much DNA in the actual sample, then the contaminant DNA can be amplified and sequenced, and could be misinterpreted as evidence for the presence of microbial life in a sterile sample
What does the observation of the “kit-ome” mean we must do?
Perform appropriate controls to ensure that the conclusions of a study are not influenced by contamination
It is sensible to put blank controls through the same DNA extraction and sequencing processes as the experimental samples
It would also be sensible to use kits from different suppliers to confirm any conclusions about the microbial content of the samples
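A minimal sketch of using a blank control when interpreting results, assuming read counts per taxon are available for both the sample and the blank. The `min_ratio` cut-off, taxon names and counts are all arbitrary assumptions; dedicated statistical tools exist for this and would be preferable in real analyses.

```python
def remove_contaminants(sample_counts, blank_counts, min_ratio=10):
    """Keep taxa whose read count in the sample is at least `min_ratio` times
    the count in the blank control (the threshold is an arbitrary assumption)."""
    return {taxon: n for taxon, n in sample_counts.items()
            if n > 0 and n >= min_ratio * blank_counts.get(taxon, 0)}

# Hypothetical read counts per taxon.
sample = {"taxon_A": 500, "taxon_B": 40, "taxon_C": 120}
blank = {"taxon_B": 35, "taxon_C": 2}
print(remove_contaminants(sample, blank))
```

Here taxon_B is about as abundant in the blank as in the sample, so it is treated as likely reagent contamination rather than a genuine member of the community.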
What is the difference in accessory genome size between open and closed pangenomes?
Closed - Small accessory genome
Open - Large accessory genome
What are accessory genome genes for?
Provide functions that are context specific – Beneficial in specific environments
- Resistance
- Pathogenicity
What 3 factors shape the accessory genome?
Gene gain e.g. through Mobile Genetic Elements and horizontal gene transfer
Gene maintenance
Gene loss
What is transformation (gene gain)?
What is meant by it being conservative?
Major contributor to novelty?
Uptake of DNA from the environment into bacteria
DNA is incorporated via homologous recombination and used to ‘overwrite’ the cell’s copies
Not a major contributor to novelty; Mainly picks up genes similar to ones the bacterium already has
What is transduction (gene gain)?
Bacterial DNA packaged into phage particles which then infect a new host, transferring bacterial DNA into recipient
What is conjugation (gene gain)?
Driven by which mobile genetic elements?
DNA encoded on conjugative elements is copied and transferred between bacteria via pilus
Plasmids or ‘integrative and conjugative elements’ (ICEs)
DNA is not functional in all genomes. In what case is it likely to be functional?
What is meant by a biosynthetic burden of genes? (give 2 examples)
More likely to be functional if the new host bacterium is related to the previous host bacterium
Maintaining novel genes can be costly
- Regulatory disruption
- Negative epistatic interactions
What is compensatory evolution (gene maintenance)?
Over time gene adapts to new environment/host via compensatory mutations – Gene becomes regulated and integrated into new host, reducing cost
There are methods of silencing genes, so they cost less; Proteins that bind and repress certain genes
What is meant by genes being context dependent and how can this influence their presence (gene loss)?
Very beneficial/essential in specific environments e.g. resistance
In other environments they are costly e.g. absence of antibiotic/toxin
If the fitness of a bacterium carrying a resistance gene depends on the level of toxin (e.g. HgCl2), how will fitness change as we increase the toxin level?
Fitness increases as the resistance kicks in
Environments tend to be patchy and vary in time. What does this mean for accessory genes?
Genes may be very beneficial in one place, while in other places they have no benefit, or are even costly
Therefore may not be lost immediately upon becoming costly
What are the 2 theories for how variation occurs in pangenomes?
Neutral theory – Genes are gained and lost at random; Genetic drift
Adaptive theory – Genome is shaped by selection for environmentally specific genes
Explain neutral theory
Drift is a process by which variation is lost due to random chance events
It is more likely to lose a gene to drift where populations are bottlenecked to small sizes
The more diverse the organism (correlates to pop. size), the more accessory genes you have (genome fluidity)
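A toy simulation of drift, assuming a simple Wright-Fisher style model in which a neutral gene starts at 50% frequency and is never regained once lost. All parameter values (population sizes, generations, replicates, seed) are arbitrary assumptions chosen for illustration.

```python
import random

def fraction_lost(pop_size, start_freq=0.5, generations=50, replicates=100, seed=1):
    """Fraction of replicate populations in which a neutral gene is lost.

    Each generation, every individual in the new population inherits the
    gene with probability equal to its current frequency.
    """
    rng = random.Random(seed)
    lost = 0
    for _ in range(replicates):
        carriers = round(pop_size * start_freq)
        for _ in range(generations):
            freq = carriers / pop_size
            carriers = sum(rng.random() < freq for _ in range(pop_size))
            if carriers in (0, pop_size):
                break  # gene lost or fixed; no further change without gene gain
        lost += carriers == 0
    return lost / replicates

print(fraction_lost(10))   # small, bottlenecked population
print(fraction_lost(500))  # larger population
```

With no selection at all, the gene is lost far more often in the small population, which is the intuition behind drift removing accessory genes after bottlenecks.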
Explain adaptive theory
Variability correlates with metabolic capability; Open pangenome organism can easily pick up genes; More adaptable
As you acquire more accessory genes, you have additional biosynthetic potential and are more likely to survive in a wide range of environments
How do different environments affect the pangenome?
Change what cells can gain but also what is dispensable in that environment
Causes great variation
Explain with examples how the composition of the human microbiome can be manipulated to influence health outcomes
Diet can have both short and long term influences on microbiome
Prebiotic (promote bacterial growth) and probiotic (include bacteria) drinks can promote changes
Antibiotics can have a major impact, with broad-spectrum antibiotics killing many bacteria in the body
C. difficile can then cause opportunistic infection
Faecal transplant can help reduce these infections by replacing microbiome