SAQ Flashcards
when thinking about sequencing platforms what is normally the trade off between the different generations
trade off between producing lots of reads (short) or long reads but not many
read depth vs length of reads
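The trade-off can be put in numbers with the standard mean-coverage estimate (number of reads x read length / genome size). A minimal sketch; the read counts, read lengths and the 4.6 Mb genome size are illustrative assumptions, not figures for any particular platform:

```python
def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced (read depth)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


GENOME_SIZE = 4_600_000  # assumed: roughly an E. coli-sized genome

# Hypothetical runs: many short reads versus few long reads.
short_run = mean_coverage(num_reads=1_000_000, read_length=150, genome_size=GENOME_SIZE)
long_run = mean_coverage(num_reads=10_000, read_length=10_000, genome_size=GENOME_SIZE)

print(f"short reads: {short_run:.1f}x depth, long reads: {long_run:.1f}x depth")
```

With these made-up numbers the short-read run gives more depth, while each long read spans far more of the genome.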
what is a flow cell cluster in Illumina
each cluster corresponds to a separate read.
has been amplified by bridge amplification
what can be used to increase cluster density and how
patterned flow cells
flowcells with nanowells in a distinct pattern. Each nanowell contains DNA probes to capture DNA strands for amplification, but the regions between wells do not contain probes and thus are free of reads.
+ it reduces the problem of adjacent clusters overlapping
+ allows you to control the sizing of clusters
+ position of well is known so cluster can be easily identified
+ packed very densely so can get out more sequence data
- generates duplicated sequences
Why does the quality of a read decrease over its length in Illumina?
PHASING
- illumina relies on sequence by synthesis approach in which errors can occur
- usually all 4 dNTPs are washed over; one is incorporated along with its terminator. The terminator is then removed so another dNTP can be incorporated
- phasing occurs when this terminator is not successfully removed. The next nucleotide cannot bind, so from now on this DNA strand will be one base behind the rest of the cluster.
- Over time these errors accumulate and pollute the fluorescence signal
- also prephasing - terminator cap is defective so one strand can go ahead and incorporate 2 nucleotides in one cycle
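A rough way to see why quality falls with read length is to model the fraction of strands in a cluster that are still in step after each cycle. This is only a sketch: the per-cycle phasing and prephasing rates are assumed values, and the model ignores strands that fall behind and later catch up.

```python
def in_phase_fraction(cycles: int, phasing: float = 0.002, prephasing: float = 0.001) -> float:
    """Fraction of strands in a cluster still on the correct base after `cycles`.

    phasing    = assumed per-cycle chance a strand fails to extend (falls behind)
    prephasing = assumed per-cycle chance a strand extends twice (runs ahead)
    """
    if not 0 <= phasing + prephasing <= 1:
        raise ValueError("rates must sum to a value between 0 and 1")
    return (1.0 - phasing - prephasing) ** cycles


for cycle in (1, 50, 100, 150, 300):
    print(cycle, round(in_phase_fraction(cycle), 3))
```

The in-step fraction shrinks every cycle, so the signal from the correct base is progressively diluted by out-of-step strands.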
2 colour illumina sequencing
also known as 2 channel sequence by synthesis
generates data faster than 4 colour whilst maintaining quality and accuracy
- only 2 images per cycle are required
- CONS:
incorrect base calls because of phasing lead to a rising pollution of the light signals over time, making it more difficult to differentiate the bases and to interpret the base quality
No colour could also mean that no base has been incorporated
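The "no colour" problem is easier to see with the base-calling logic written out. A minimal sketch, assuming the commonly described two-channel encoding (both channels = A, red only = C, green only = T, no signal = G); check the platform documentation before relying on this mapping:

```python
def call_base(red: bool, green: bool) -> str:
    """Call one base from the two images taken in a cycle (assumed encoding)."""
    if red and green:
        return "A"
    if red:
        return "C"
    if green:
        return "T"
    # Dark in both images: called as G, even if nothing was actually incorporated.
    return "G"


def call_read(signals):
    """signals: list of (red, green) booleans, one pair per cycle."""
    return "".join(call_base(red, green) for red, green in signals)


print(call_read([(True, True), (True, False), (False, True), (False, False)]))
```

Because a dark cluster and a genuine G look identical, a failed incorporation cannot be told apart from a G from the signal alone.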
list the major properties of the E.coli K12 genome
most frequent strain in labs
- commensal organism that survives in the lower intestine
- able to survive in culture only under very specific conditions
- unable to survive at all in gut
- 4288 protein coding genes
- regions of low GC content
describe whole genome shotgun sequencing and how gaps in sequence can be informative of regions with potential biotechnological applications
- shotgun sequencing requires shearing of DNA, selection for specific size fragments and placing in a vector insert into bacteria. inserted fragments can then be sequenced using sanger with primers that overlap the vector backbone. then mapped onto a reference genome and assembled into contigs
- Gaps in genome following sequencing must mean that insertion of these sequences into plasmid in bacteria leads to their death (toxic)
- can identify these gaps - sequences/genes that are toxic to e.coli by interacting with the replication initiator DnaA
what was found when the EHEC O157:H7 genome was compared to K12
- genome was 1Mb bigger than K12
- O islands - only present in O157 eg type III secretion system/ shiga toxin
- K - islands - regions only present in K12
what kind of e.coli is CFT073
UPEC - uropathogenic e.coli
associated with UTIs
harmless in intestines but become pathogens when they invade urinary tract, blood or CSF
genome similar size to O157 but the extra sequences relative to K12 are not the same as O157
Define the terms “pangenome” and “core genome” and how they can be estimated
Core genome represents all the genes present in all the strains of a species. Typically estimated by comparing WGS of multiple genomes. As more genomes are compared this number decreases. Rasko paper estimated the E.coli core genome to be around 2200 genes which are mainly involved in metabolic processes
pangenome is the entire gene set of all the strains of a species including the core genome and the variable/accessory genome
estimated by sequencing a broad sample of the diverse pathogens that comprise this species. As more genomes are added the number of genes doesn’t come to a plateau, showing that the e.coli pangenome is effectively infinite (open) - suggests it must still be evolving
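The way both estimates change as genomes are added can be sketched with set operations. The gene names and strains below are invented toy data, and a real analysis would first have to cluster genes into orthologous groups:

```python
def core_and_pan_sizes(genomes):
    """Return (core size, pangenome size) after adding each genome in turn.

    genomes: list of sets of gene names, one set per strain.
    """
    if not genomes:
        return []
    core = set(genomes[0])
    pan = set()
    sizes = []
    for genes in genomes:
        core &= genes  # core keeps only genes seen in every strain so far
        pan |= genes   # pangenome gains every gene seen in any strain
        sizes.append((len(core), len(pan)))
    return sizes


# Hypothetical strains with invented gene names.
strains = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f"},
]
print(core_and_pan_sizes(strains))
```

The core count can only fall or stay level as strains are added, while the pangenome count can only rise; an open pangenome is one where the second number keeps climbing.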
what was the estimate of the number of unique genes per e.coli genome sequenced
300 genes
since advent of illumina more draft genome sequences have become available why arent they finished
bc finishing and annotation remains a laborious process
what can you do when you have lots of genomes for a bacterial species
bacterial genome wide association study
eg study looking at campylobacter. Sequenced genomes of campylobacter from a whole range of different hosts (chickens, cows, birds)
looked at phylogeny between strains
not one evolutionary lineage is associated with one host
GWAS found genes from vitamin B5 synthesis pathway (PanBCD) are present in bacterial strains that infect cows but not chickens
- (isolates from cattle grew better, on average, in a low vitamin B5 environment than isolates from chickens)
** gene Cj0299 which encodes an enzyme giving resistance to beta lactam antibiotics found at highest frequency in cattle and was rarest in bird isolates **
Explain how multiple genome sequences can provide an insight into genome evolution and horizontal gene transfer
By comparing multiple genomes of the same species a core genome can be derived. Different strains' accessory genomes can then be identified that give rise to their overall phenotype. Some of this accessory genome can be derived by recent horizontal gene transfer, in which case those regions of the genome would have abnormal GC content because they haven't gone through amelioration yet.
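Abnormal GC content can be looked for with a sliding window. A minimal sketch; the window size, step and tolerance are arbitrary assumptions, and the toy sequence is invented purely to show the idea:

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(seq: str, window: int = 1000, step: int = 500, tolerance: float = 0.1):
    """Return (start, window GC) for windows whose GC differs from the whole sequence."""
    genome_gc = gc_content(seq)
    flagged = []
    for start in range(0, len(seq) - window + 1, step):
        window_gc = gc_content(seq[start:start + window])
        if abs(window_gc - genome_gc) > tolerance:
            flagged.append((start, round(window_gc, 2)))
    return flagged


# Toy sequence: GC-rich background with one AT-rich stretch in the middle.
toy_genome = "GC" * 50 + "AT" * 10 + "GC" * 50
print(atypical_windows(toy_genome, window=20, step=20, tolerance=0.3))
```

A flagged window is only a candidate for horizontal transfer; other features can also shift local GC content.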
Describe how the development of new technologies has influenced our understanding of E.coli/Shigella diversity
before the advent of molecular biology: serotyping: based on the immune recognition of cell surface antigens - bacteria of the same serotype cross react to the same antibodies (doesnt correlate v well with similarity on genetic level)
hybridisation of different strains to see how similar they are at a molecular level - found that E.coli and shigella are comparable - shigella tended to be more diverse
MLEE - multilocus enzyme electrophoresis: characterises organisms depending on the electrophoretic mobility of their proteins. - allowed construction of ECOR collection
comparison of gene sequences found that shigella is within the evolutionary diversity of e.coli and has arisen on multiple occasions from e.coli - some properties shared were examples of convergent evolution
MLST - multi locus sequence typing- sequence multiple genes and compare (usually housekeeping 400bp chunks)
- found lineages of e.coli have acquired the same virulence factors in parallel including a pathogenicity island involved in intestinal adhesion and phage-encoded Shiga toxins.
- Sequence 8 HK genes in 46 shigella strains representing each of the 4 serotypes
Shigella strains are well distributed within the diversity of E. coli
presence of three major clusters and five forms not closely related to any other suggests that the Shigella phenotype has arisen eight times
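The MLST idea of turning housekeeping gene sequences into comparable profiles can be sketched as below. The isolate names and the very short sequences are invented; allele numbers here are assigned in order of first appearance, whereas real schemes use a curated allele database:

```python
def allele_profiles(isolates):
    """Map each isolate to a tuple of allele numbers, one per locus.

    isolates: {isolate name: {locus name: sequence}}
    Identical sequences at a locus share an allele number.
    """
    alleles_seen = {}  # locus -> {sequence: allele number}
    profiles = {}
    for name, loci in isolates.items():
        profile = []
        for locus in sorted(loci):
            seen = alleles_seen.setdefault(locus, {})
            number = seen.setdefault(loci[locus], len(seen) + 1)
            profile.append(number)
        profiles[name] = tuple(profile)
    return profiles


# Hypothetical isolates typed at two loci (sequences shortened for illustration).
toy_isolates = {
    "s1": {"adk": "ACGT", "gyrB": "TTGA"},
    "s2": {"adk": "ACGT", "gyrB": "TTGC"},
    "s3": {"adk": "ACGA", "gyrB": "TTGA"},
}
print(allele_profiles(toy_isolates))
```

Isolates with the same profile share a sequence type; profiles that differ at only a few loci suggest closely related lineages.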
what can be used to compare the diversity across species
16S rRNA sequencing
how can 16S rRNA profiling be used to investigate microbial diversity
16s rRNA can be used to investigate microbial phylogeny due to the fact that all microbes have it bc of its importance in translation. evolves slowly due to its fundamental function in cell. Has variable regions that vary more and can be compared to build a picture of the phylogeny
- design primers to conserved regions that span these variable regions, amplify up and sequence
- can be used to determine the evolutionary relationships between strains/species of bacteria
what are the caveats of 16s rRNA profiling
- primers may not be truly universal
- contamination may be an issue (link to paper about contamination in DNA extraction kits)
- sequencing errors can result in overestimation of diversity of organisms present
- some organisms have multiple copies of the 16s rRNA gene which vary in sequence (overestimation of diversity)
- PCR bias may result in incorrect quantification of species
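One of these caveats, multiple 16S copies per genome, also distorts abundance estimates, and a common mitigation is to divide read counts by copy number. A sketch only: the taxa, read counts and copy numbers are made up, and it does nothing about primer or PCR bias:

```python
def copy_corrected_abundance(read_counts, copy_numbers):
    """Relative abundance after dividing 16S read counts by gene copy number.

    read_counts:  {taxon: number of 16S reads}
    copy_numbers: {taxon: assumed 16S copies per genome}
    """
    corrected = {taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts}
    total = sum(corrected.values())
    return {taxon: value / total for taxon, value in corrected.items()}


# Hypothetical community: taxon_A carries 7 copies of the gene, taxon_B only 1.
reads = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}
print(copy_corrected_abundance(reads, copies))
```

On raw reads taxon_A looks seven times more common; after correction the two taxa come out equally abundant.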
what was the first major application of the use of 16s rRNA to study the diversity of organisms
Carl woese 3 domain tree of life
what can the microbial dark matter problem also be called
great plate count anomaly - observation that most of the microbes seen in the microscope cannot currently be grown
what is metagenomics
study of genetic material recovered directly from environmental samples
Tells you which genes are encoded in your sample and which bacteria are present
what did Craig Venter do to double the size of GenBank
sample from sea, genome extracted, fragmented and sequenced using sanger. Revealed some “dark matter”
how can metagenomic techniques be used to study the human microbiome
swabs can be taken from different individuals and sequenced for different areas of the body
Revealed link between health and bacteria in body
Obesity: reduced ratio of Bacteroidetes to firmicutes
what is an example of when metagenomic sequencing results can be misinterpreted
swabs were collected around New York and sequenced using shotgun sequencing. The first paper claimed plague was present, but that was a mistake: no reads mapped to the ymt gene (toxin), and they were actually looking at related plasmids
when sequencing one strain, the most closely related genome in the database was anthrax, so they concluded anthrax was present - it wasn't (no evidence of the plcR SNP, a defining feature of anthrax)
what is single cell genomics (exploring unculturable microorganisms)
the amplification and sequencing of DNA from single cells obtained directly from environmental samples
single cells isolated: FACS, laser microdissection, optical tweezers, micropipetting
PCR amplification and sequencing
• Amplification is challenging and the assembled genomes will often have patchy coverage
what is iChip
a method of culturing previously unculturable bacteria
environmental sample eg soil is diluted in molten agar and nutrients until there is one cell per well
the chip plate is then placed back in the soil to access nutrients unavailable in the lab
50-60% of species are able to survive
Teixobactin antibiotic discovered using iChip in 2015
Anticancer agents, anti-inflammatories and immunosuppressives also discovered
the main points of the 2011 E.coli outbreak
mainly young women affected
Haemolytic uraemic syndrome
found to be caused by the unusual serotype O104:H4
Spanish cucumbers were wrongly identified as the cause - problem with public case reporting/sharing
BGI sequencing showed that the strain had EAEC properties and closely resembled the 55989 strain found in an HIV patient in Africa
- all sequence data published online (crowdsourcing) to construct phylogeny - outbreak closer to EAEC
- Recognition of outbreak strains was hampered by the inappropriate use of diagnostic tests focused on O157:H7
After outbreak
PacBio sequencing showed that the strain had evolved from EAEC and acquired EHEC like properties
- PCR confirmed shiga toxin
- plasmid bearing beta lactamase gene
confirmed to be caused by beansprouts grown in Lower Saxony from Egyptian fenugreek seeds
- separate smaller outbreak in France where these seeds were also used
what did O104:H4 not have, showing it isn't an EHEC
Type III secretion system
main points from salmonella outbreak
isolates from 16 patients sequenced on MiSeq
- found they were all part of same outbreak strain
- availability of definitive typing data so early on enabled identification of transmission between hospital wards and action to be taken
- can be serotyped in 40min and determined to be part of outbreak in 2h
outbreak strain found on door seal of a food trolley
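Deciding whether isolates belong to the same outbreak usually comes down to counting SNP differences between their genomes. A minimal sketch using invented, already-aligned toy sequences; real pipelines call SNPs against a reference and use outbreak-specific thresholds:

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions between two aligned sequences, ignoring N."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for base_a, base_b in zip(seq_a, seq_b)
        if base_a != base_b and "N" not in (base_a, base_b)
    )


def pairwise_distances(isolates):
    """isolates: {name: aligned sequence} -> {(name1, name2): SNP distance}."""
    return {
        (name_a, name_b): snp_distance(seq_a, seq_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(isolates.items(), 2)
    }


# Hypothetical isolates: two patients and one unrelated sample.
toy = {"patient1": "ACGTACGT", "patient2": "ACGTACGA", "unrelated": "TCGAACGG"}
print(pairwise_distances(toy))
```

Isolates separated by very few SNPs are consistent with a shared source or direct transmission; larger distances argue against it.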
what was used to sequence ebola in africa
MinION
main points from ebola case study
142 ebola virus samples sequenced in real time in Guinea using MinION
combined with 603 sequences from other studies to create phylogenetic tree
- allowed people to be quarantined
- found transmission across border of guinea by integrating data with another team in sierra leone
- Within 48 hours the new sequence could be added to the phylogenetic tree. Could isolate individuals and prevent further spread of the outbreak, infer where an individual had got the virus - narrow down the transmission chain.
what are the challenges of sequencing in the field
Power supply, head torches, communication was a large problem.
Had to use uninterruptible power supplies (UPS)
Internet was a challenge- responding to emails impossible and the upload of reads for bioinformatic analysis in the UK a daily challenge.
coronavirus main points
one month after the first case the full genome was published: 82 per cent similar to SARS but also 90 per cent similar to a bat coronavirus
- helps with the design of diagnostic kits
- very little genetic variation between the first 10 patient samples sequenced, even though RNA viruses have a very high mutation rate: a sign the virus recently jumped from animals to humans. Suggests one transmission event - questionnaires point towards a meat market
- clusters near bat virus suggesting it originated in bats and was transmitted to humans
listeria outbreak key points
19 people infected in 9 states whole genome sequenced
Showed they were genetically related ie 1 common source
Epidemiologic and laboratory evidence indicated that packaged salads were the cause
what signatures do active enhancers usually have
nucleosome free and the regions flanking them have characteristic post translational modifications
how can enhancers in the genome be identified
DNase hypersensitivity assay
cant chop DNA where there are histones – lots of cutting in nucleosome depleted regions
main points of encode paper
used biochemical definition of function
analysed different cell lines (majority ES or cancer)
looked for 5 signatures:
RNA expression (RNA-seq), DNA/protein interactions (TF ChIP-Seq), chromatin accessibility (DNase hypersensitivity), 3D structure, methylation (RRBS)
- 3 different tiers depending on which assays they did
- found 80% to perform some reproducible biochemical function and defined this as functional
3 definitions of function
causal role - a sequence has a function if that sequence causes F to happen (heart adding weight to body)
Selected effect - a sequence has a function if the sequence exists because of the function (heart pumping blood)
genetic function: a sequence has a function if the sequence is required for the function and deleting the sequence affects the function