Features of bacterial genomes and analyses Flashcards
Features of bacterial genomes
- Haploid single chromosome
- Diverse range of sizes – lifestyle dependent
- Genomes highly compact – few pseudogenes / non-coding regions
- Highly structured
- Large pan genome – extensive mobilome
- Variability in GC content
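As a minimal sketch of what "GC content" in the last bullet means, the fraction of G and C bases in a sequence can be computed as follows. The function name and the two example sequences are made up for illustration and are not real genomes.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical toy sequences, chosen only to show that GC content can differ widely.
at_rich = "ATTATAATGCAT"
gc_rich = "GCGGCCGCATGC"

print(f"AT-rich example: {gc_content(at_rich):.2f}")
print(f"GC-rich example: {gc_content(gc_rich):.2f}")
```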
History of genome sequencing
First generation sequencing: Shotgun Sanger sequencing
-> produces complete, assembled genomes with annotations
Parallel next generation sequencing: Illumina (short reads), PacBio (long reads)
-> Illumina produces many short reads that are pieced together; PacBio produces long reads
Following the invention of second generation sequencing, the price of sequencing dropped dramatically, allowing genome sequencing projects to grow rapidly in number and scale.
Short read sequencing
Analysis of short sequence data:
1) Mapping
- Comparison of short reads to a reference
2) Assembly
- De novo assembly and comparison
- Reference free assembly using k-mers
Genome annotation
Once the genome is assembled using short and long read sequencing, it is annotated.
Genome annotation:
a) Location eg. which strand
b) Feature type eg. protein coding, codes tRNA, stop codon etc
c) Attributes eg. product produced, enzyme?, location of product in membrane?
Software to do this
- Prokka
- PubMLST
- eggNOG -> look at the evolutionary history of a gene
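The three parts of an annotation listed above (location, feature type, attributes) can be pictured as one record per feature. This is only a sketch: the `Feature` class, its field names and the example gene are assumptions for illustration, not the output of Prokka or any other tool, although the printed line follows the nine tab-separated columns of the GFF3 format.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One annotated feature: where it is, what it is, and what is known about it."""
    contig: str
    start: int          # location: 1-based, inclusive
    end: int            # location: 1-based, inclusive
    strand: str         # location: "+" or "-"
    feature_type: str   # feature type, e.g. "gene" or "tRNA"
    attributes: dict = field(default_factory=dict)  # e.g. product name

    def length(self):
        """Length of the feature in bases."""
        return self.end - self.start + 1

    def to_gff3(self):
        """Format the feature as a GFF3-style line (source, score and phase are placeholders)."""
        attrs = ";".join(f"{key}={value}" for key, value in self.attributes.items())
        columns = [self.contig, "example", self.feature_type,
                   str(self.start), str(self.end), ".", self.strand, ".", attrs or "."]
        return "\t".join(columns)


# Hypothetical feature, invented for this example.
gene = Feature("contig_1", 101, 400, "+", "gene",
               {"ID": "gene_0001", "product": "hypothetical protein"})
print(gene.length())
print(gene.to_gff3())
```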
Bacterial genome Features
1) Bacteria have an open pangenome made up of the core genome and the accessory genome (the accessory genome is spread by HGT)
Bacterial genome content:
Core genome eg. DNA replication
Accessory genome eg. alternative metabolic routes – bring fitness advantages to strain but not in all strains
Mobile elements
Parasitic elements eg. toxins
(Mobile elements and parasitic elements are part of the accessory genome)
2) High GC variability across bacteria
- GC base pairs are more stable than AT base pairs (three hydrogen bonds vs two)
3) Large scale rearrangements in bacteria (inversion, translocation, genomic islands)
- The same gene can be found in different regions of the genome in different individuals
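The core / accessory split from point 1 above can be sketched with set operations: genes present in every strain form the core genome, and everything else in the pangenome is accessory. The strain names and gene names below are hypothetical, and real pangenome tools first have to cluster genes by sequence similarity, which this sketch skips.

```python
def split_pangenome(strains):
    """Split per-strain gene sets into (pangenome, core genome, accessory genome)."""
    gene_sets = [set(genes) for genes in strains.values()]
    pan = set().union(*gene_sets)            # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes shared by all strains
    accessory = pan - core                   # genes missing from at least one strain
    return pan, core, accessory


# Hypothetical gene content for three strains of one species.
strains = {
    "strain_A": {"dnaA", "gyrB", "toxin1"},
    "strain_B": {"dnaA", "gyrB", "tnpA"},
    "strain_C": {"dnaA", "gyrB", "toxin1", "altMet"},
}

pan, core, accessory = split_pangenome(strains)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

Adding a fourth strain with a previously unseen gene would enlarge `pan` without changing `core`, which is what "open pangenome" refers to.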
Comparing the genome of different types of bacteria
The size and features of bacterial genomes depend on their biology.
Free living:
- large genome/ pangenome
- stable structure (few pseudogenes / TEs, frequent HGT)
- eg. soil bacteria
Facultative / recent pathogen:
- smaller genome/ pangenome
- Many pseudogenes / TEs / repeats, many selfish genetic elements, unstable structure
- eg. Neisseria, Streptococcus
Obligate symbionts:
- very small, few genes
- no pseudogenes / transposons, stable structure
- rare HGT
- eg. Buchnera / Chlamydia
Short read assembly: mapping
Reads are aligned to a reference genome using mapping software leading to a ‘pile up’.
Variants are called
Advantages:
- Rapid
- Accurate
- comparable and reproducible
Disadvantages:
- requires high quality reference genome
- mapping cannot identify genes not in the reference
- repeating regions are problematic
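The mapping, pile-up and variant-calling steps described on this card can be sketched as below. This is a deliberately naive illustration, not how real mapping software works: it tries every position in the reference and counts mismatches, and the function names, thresholds and sequences are all assumptions made for the example.

```python
from collections import Counter


def map_read(read, reference, max_mismatches=1):
    """Return the best 0-based start position for read, or None if nothing fits well enough."""
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mm:
            best_pos, best_mm = pos, mismatches
    return best_pos


def pile_up(reads, reference, max_mismatches=1):
    """Count, for every reference position, the bases that mapped reads place there."""
    pile = [Counter() for _ in reference]
    for read in reads:
        pos = map_read(read, reference, max_mismatches)
        if pos is None:
            continue  # unmapped read: sequence absent from the reference is lost
        for offset, base in enumerate(read):
            pile[pos + offset][base] += 1
    return pile


def call_variants(pile, reference, min_depth=2):
    """Report (position, reference base, observed base) where the majority base differs."""
    variants = []
    for pos, counts in enumerate(pile):
        if sum(counts.values()) < min_depth:
            continue
        base, _ = counts.most_common(1)[0]
        if base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


# Hypothetical reference and reads; the reads carry an A -> T change at position 6.
reference = "GATTACAGGCTTAACC"
reads = ["GATTACTG", "TTACTGGC", "ACTGGCTT", "GCTTAACC", "CCCCCCCC"]

variants = call_variants(pile_up(reads, reference), reference)
print(variants)
```

The last read does not resemble any part of the reference and is simply dropped, which illustrates why mapping cannot identify genes that are not in the reference.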
Short read assembly: De novo assembly
‘K-mer’ approach: reference-free assembly and comparison, independent of biological information
1) Overlap-layout-consensus method
- All of the overlaps between reads are determined, then the reads and overlaps are laid out in a graph and a consensus sequence is identified
2) De Bruijn method
- Reads are broken into shorter fragments called k-mers, followed by construction of a de Bruijn graph where overlapping k-mers are connected by edges.
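A toy version of the de Bruijn method might look like the sketch below. It assumes error-free reads and uses a common formulation in which nodes are (k-1)-mers and each k-mer is an edge from its prefix to its suffix; the function names and reads are invented for illustration, and real assemblers add error correction and much more graph simplification.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers; each k-mer links its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start for as long as there is exactly one way forward."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:
            break  # stop instead of looping around a cycle
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


# Hypothetical overlapping reads from one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]

print(walk_unambiguous(de_bruijn_graph(reads, 4), "ATG"))  # k = 4
print(walk_unambiguous(de_bruijn_graph(reads, 3), "AT"))   # k = 3
```

With k = 4 every node has a single successor and the walk recovers the whole sequence. With k = 3 the node "TG" is reached from two different contexts and has two successors, so the walk stops early; this is a small-scale picture of why repeats longer than k are hard to resolve.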
Advantages:
- Reference free
- New genes can be identified
- used to identify large genomic sequence variants
Disadvantages:
- struggles to resolve repetitive regions
- expensive and time consuming
Limitations to short read sequencing
Struggles to map:
o Low complexity / repeat regions that are longer than the read, so no single read spans them
o Intermittent identical repeats
Solution
- Use a combination of long and short read sequencing (e.g. PacBio long reads with Illumina short reads)
- Hybrid assembly combines the base calling accuracy of short-read sequencing with the scaffolding power of long reads to solve genomic features that are unresolvable by short reads alone
What does sequencing reveal about bacteria?
- A single genome is clearly inadequate to describe a species: bacteria have an extensive pangenome, and individuals of the same species carry different accessory genomes.
- For many bacterial species, multiple strains must be sequenced.
- Degree of HGT varies -> some pathogens are monomorphic (very clonal with little genetic variation between strains) while most are not.
Overview
The size and features of bacterial genomes depend on their lifestyle (free living, facultative, obligate) but certain key features remain (compact, haploid, highly structured, large rearrangements).
They have a large pangenome, which makes describing a species hard (multiple strains of each species must be sequenced).
Types of sequencing:
Short read:
- Lots of short reads are pieced together to produce the final genome
- Mapping
- Assembly
  -> Overlap layout consensus
  -> De Bruijn graph method
Long read:
- Long reads can span the entire length of low complexity regions
- e.g. PacBio
Both methods have advantages and disadvantages
Hybrid method:
- Hybrid assembly combines the base calling accuracy of short-read sequencing with the scaffolding power of long reads to solve genomic features that are unresolvable by short reads alone
Annotation:
- Once a genome is assembled it must be annotated to be understood
- Location, feature type, attributes