Lecture 2 Flashcards
How can regions of low GC content have been acquired
By horizontal gene transfer (HGT)
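One way to spot candidate regions is to scan a genome for windows whose GC content is well below the genome-wide average. The following is a minimal sketch, assuming the genome is a plain DNA string. The function names, window size and threshold are illustrative choices, not from the lecture.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def low_gc_windows(genome, window=10, step=10, drop=0.15):
    """Return (start, gc) for windows whose GC content is at least
    `drop` below the genome-wide GC content (threshold is an assumption)."""
    overall = gc_content(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if gc <= overall - drop:
            hits.append((start, gc))
    return hits


# Toy "genome": GC-rich backbone with one AT-rich insert in the middle.
genome = "GCGCGCATGC" * 3 + "ATATTAATAT" + "GCGCGCATGC" * 3
print(low_gc_windows(genome))  # [(30, 0.0)]
```

Real analyses use much larger windows (kilobases) and usually combine GC content with other signals, so a low-GC window is only a hint of HGT, not proof.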
What did analysis of the K-12 genome conclude about HGT
That 755 of the 4288 genes (about 18%) were likely derived from HGT. These were acquired in at least 234 separate events
What is the O157:H7 strain of E.coli
It is an emergent human pathogen, first identified in 1982. It’s an enterohaemorrhagic E.coli which produces Shiga toxin and is associated with haemorrhagic colitis and haemolytic uraemic syndrome (which can lead to kidney failure)
What’s the size of the E.coli O157:H7 genome
The genome is 5.5Mb, about 1Mb bigger than K-12
It’s co-linear with K-12
It was the second E.coli genome to be sequenced
What was the extra DNA in the O157 strain
It was clustered into genomic islands (O-islands).
There were also some K-islands with regions unique to E.coli K-12.
The O and K islands were located at the same position in the genome. The genome has a patchwork structure with a shared co-linear backbone interrupted by strain-specific islands
What are genomic islands
An extension of the previously used term “pathogenicity islands”
What’s the CFT073 strain of E.coli
It’s a strain of uropathogenic E.coli (UPEC) and was the third E.coli genome to be sequenced (in 2002). It’s an example of extraintestinal pathogenic E.coli (ExPEC) and is associated with UTIs
What are ExPEC and UPEC
Strains that can be harmless in the intestines but become pathogens when they invade the urinary tract, blood or CSF.
UPEC strains are responsible for 70-90% of the 7 million cases of acute cystitis and 250,000 cases of pyelonephritis reported annually in the United States
What’s the CFT073 genome like
It is 5.2Mb, so similar in size to the O157:H7 genome, but the extra sequences relative to K-12 are not the same as those in O157:H7.
What did the 3-way analysis of the 3 E.coli strains find
Of the total non-redundant set of proteins encoded by any of the 3 genomes, only 2996 are encoded by all 3 genomes.
The total gene set across all 3 strains is 7638; only 2996 are found in all 3, so less than 40% is conserved
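As a quick check of the "less than 40%" figure, the conserved fraction can be worked out directly from the two numbers on the card. The variable names are only illustrative.

```python
shared = 2996  # proteins encoded by all 3 genomes
total = 7638   # non-redundant proteins encoded by any of the 3 genomes

fraction_conserved = shared / total
print(f"{fraction_conserved:.1%}")  # 39.2%
```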
What does core genome mean
Genes conserved across all strains of a species
What does dispensable/accessory genome mean
Genes that are missing from at least one member of the species, so are not part of the core genome
What does pan genome mean
The total set of (non-redundant) genes present in any strain of the species
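The three definitions (core, accessory and pan genome) map directly onto set operations. A minimal sketch, assuming each strain is represented as a set of gene names. The strain and gene names are made up for illustration.

```python
# Hypothetical gene content for three toy strains (names are made up).
strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g2", "g3", "g6"},
}

gene_sets = list(strains.values())
core = set.intersection(*gene_sets)  # genes present in every strain
pan = set.union(*gene_sets)          # genes present in at least one strain
accessory = pan - core               # genes missing from at least one strain

print(sorted(core))       # ['g1', 'g2']
print(sorted(pan))        # ['g1', 'g2', 'g3', 'g4', 'g5', 'g6']
print(sorted(accessory))  # ['g3', 'g4', 'g5', 'g6']
```

With real genomes the hard part is deciding which genes in different strains count as "the same" gene (orthologue clustering), which this sketch assumes has already been done.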
How big was the S. agalactiae core genome
Estimated at 1800 genes representing 80% of each individual genome
How big is the E.coli core genome and how do we estimate it
2200 genes
Estimate the size of the core genome by randomising the order of the genomes and looking at how the size of the core genome shrinks as additional genomes are added. This is repeated many times and the median size of the core genome at each step is calculated
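The randomisation procedure can be sketched as follows, assuming each genome is a set of gene names. The function names, number of permutations and toy data are assumptions for illustration, not the lecture's actual method or data.

```python
import random
from statistics import median


def core_sizes_for_order(genomes):
    """Core genome size after adding each genome in the given order."""
    core = set(genomes[0])
    sizes = [len(core)]
    for genes in genomes[1:]:
        core &= genes  # keep only genes also present in the new genome
        sizes.append(len(core))
    return sizes


def median_core_curve(genomes, n_permutations=100, seed=0):
    """Median core size at each step over many random genome orders."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_permutations):
        order = list(genomes)
        rng.shuffle(order)
        curves.append(core_sizes_for_order(order))
    # zip(*curves) groups the sizes by step: 1 genome, 2 genomes, ...
    return [median(step) for step in zip(*curves)]


# Toy data: 5 genes shared by all strains, one gene shared by two strains,
# and a few strain-specific genes each.
shared = {f"core{i}" for i in range(5)}
genomes = [shared | {f"s{k}_{i}" for i in range(k + 1)} for k in range(4)]
genomes[0] = genomes[0] | {"shell"}
genomes[1] = genomes[1] | {"shell"}

curve = median_core_curve(genomes)
print(curve)
```

The last value of the curve is the core genome of the whole set (5 genes in this toy example). With real data the curve is extrapolated to estimate the core size for the species.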
How big is the E.coli pangenome and how do we estimate it
Infinite
Estimate in a similar way to the core genome. However, the trend line does not plateau; it approaches a straight line sloping upward because E.coli has an open pangenome
What is an open pangenome
One that is effectively infinite in size: new genes keep being found as more genomes are sequenced
What species have a closed pangenome
Yersinia pestis
How do you estimate the open or closed nature of a pangenome
Estimate how many new genes are discovered with each additional genome sequenced. For E.coli this plateaus at a non-zero value of around 300 genes, so you can continue to sequence even large numbers of E.coli genomes and will keep on identifying new genes indefinitely
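Counting the new genes contributed by each genome can be sketched like this, again assuming each genome is a set of gene names. The function name and both toy datasets are made up. In practice the counts would be averaged over many random genome orders.

```python
def new_genes_per_genome(genomes):
    """Number of previously unseen genes contributed by each genome,
    in the order given."""
    seen = set()
    counts = []
    for genes in genomes:
        fresh = genes - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


core = {f"core{i}" for i in range(5)}

# Open-like toy set: every strain brings 2 genes seen nowhere else.
open_like = [core | {f"u{k}a", f"u{k}b"} for k in range(4)]

# Closed-like toy set: strains only reshuffle a small fixed pool of genes.
closed_like = [core | {"x"}, core | {"y"}, core | {"x", "y"}, core | {"x"}]

print(new_genes_per_genome(open_like))    # [7, 2, 2, 2]
print(new_genes_per_genome(closed_like))  # [6, 1, 0, 0]
```

A count that levels off above zero suggests an open pangenome, while a count that falls to zero suggests a closed one.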
What does Illumina sequencing involve
Similar to Sanger sequencing but rather than sequencing a single molecule at a time, Illumina can sequence millions of molecules simultaneously (massively parallel sequencing)
Uses fluorescently labelled nucleotides. We need to amplify each individual fragment of the genome into a cluster by a PCR-like reaction known as bridge amplification
How do we determine the sequence of each cluster in Illumina
By synthesising a complementary strand using fluorescently labelled nucleotides (sequencing by synthesis)
What are the outcomes of Illumina sequencing
It generates short sequence reads, which can be assembled into contigs but still require finishing.
What does Illumina not involve
A cloning step. Most assembly gaps are instead due to repetitive regions where the assembly is ambiguous
What is a disadvantage of Illumina and how is it overcome
Finishing is much more expensive than generating a draft so most genomes are left at the draft stage.
Contigs can be placed into order by comparison with a closely related complete genome
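Ordering contigs against a reference can be sketched as a simple sort, assuming each contig has already been aligned to the related complete genome and we know where its best match starts. The contig names, positions and function name are hypothetical.

```python
# Hypothetical alignment results: contig name -> start position (bp) of its
# best match on the reference genome. None means no match was found.
reference_hits = {
    "contig_3": 2_150_000,
    "contig_1": 40_000,
    "contig_4": None,
    "contig_2": 910_000,
}


def order_contigs(hits):
    """Return (placed, unplaced): placed contigs sorted by reference
    position, and contigs with no match listed separately."""
    placed = sorted(
        (name for name, pos in hits.items() if pos is not None),
        key=lambda name: hits[name],
    )
    unplaced = sorted(name for name, pos in hits.items() if pos is None)
    return placed, unplaced


placed, unplaced = order_contigs(reference_hits)
print(placed)    # ['contig_1', 'contig_2', 'contig_3']
print(unplaced)  # ['contig_4']
```

Contigs with no match cannot be placed this way. They may carry strain-specific sequence that is absent from the reference.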
What’s the average protein coding content for a bacterial genome
88% for the 2671 finished genomes in GenBank.
What are the largest and smallest genomes
The largest is Sorangium cellulosum at 14,782,125 bp and the smallest is a Candidatus Nasuia deltocephalinicola strain, which encodes 137 proteins and is 112,091 bp in length
What do genomes of bacteria from complex environmental habitats tend to have
A larger size and a greater GC content than host-associated bacteria. Most of these bacteria are mesophiles but there are growing numbers of extremophiles, such as thermotolerant species
Disadvantage of Sanger sequencing
Finishing draft genomes was more labour intensive and required a separate production line to be efficient
What did the increase in high-throughput “next generation sequencing” allow
The cost of producing raw sequence data declined to the point that it currently costs less than $1 to generate a draft bacterial genome. This made sequencing bacterial genomes cost-effective and obligatory for any research team
What did NGS produce
Shorter reads than Sanger sequencing. This changed the cost ratio between a draft and a complete sequence, so finishing a genome was no longer as cost-effective
What does single molecule sequencing produce
Longer reads than NGS (examples: PacBio and MinION). These generate more sequence for less money and may eventually eliminate the concept of draft microbial genomes
What could the size range of E.coli be due to
The large number of available sequenced strains. However, less frequently sequenced species can also vary by more than a megabase, such as Haemophilus influenzae strains HK1212 and F3047
What do all bacterial genomes have at least one copy of
23S, 16S and 5S rRNA genes. These usually exist as an operon with a conserved structure: the 16S gene followed by one or more transfer RNA genes, then the 23S and 5S genes
What size range do transposable elements cover
1-52kb. They include several families of insertion sequences and integrative and conjugative elements
What is the CRISPR-Cas system
A defence system that provides bacteria, including those that are pathogenic to a host, with a type of immunity. About 40% of bacteria have a CRISPR-Cas system that allows them to fend off viral attacks
What does HGT play a role in (regarding defence islands)
Plays a role in maintenance and evolution of these defence islands (on average 5.7 genes)
What’s the first step in metagenomics
The collection and processing of environmental samples such as water, soil etc.