Comparative genomics & Metagenomics Flashcards
What is comparative genomics and what is the general motivation behind it?
The study of the relationship of genome structure and function across different biological species or strains.
It is done by studying evolution.
Motivation:
Transfer knowledge from and to simpler model organisms
What is Sanger sequencing?
Chain termination method: labeled dideoxynucleotides (ddNTPs) stop the strand synthesis
- 1977: first sequenced DNA genome, that of a phage (small viral genomes that only encode 4-10 genes)
- capillary-based, semi-automated
- bottleneck: DNA fragments need to be cloned and amplified in bacteria
- simultaneous electrophoresis in 96 or 384 independent capillaries
→ sets limits to parallelization
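The chain-termination idea can be sketched as a toy model. It assumes that every possible terminated product is present and that electrophoresis resolves fragments to single-nucleotide length; the function names are illustrative and do not come from any real tool.

```python
# Toy model of Sanger chain termination (illustrative only).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def synthesized_strand(template):
    """Strand built by the polymerase (5'->3') from a template given 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(template))

def terminated_fragments(template):
    """One product per position: (fragment length, labeled terminal ddNTP)."""
    new_strand = synthesized_strand(template)
    return [(i + 1, new_strand[i]) for i in range(len(new_strand))]

def read_sequence(fragments):
    """Sort fragments by length (as electrophoresis does) and read the labels."""
    return "".join(base for _, base in sorted(fragments))

fragments = terminated_fragments("GATTACA")
print(read_sequence(fragments))  # TGTAATC
```

Reading the terminal label of each fragment from shortest to longest gives the newly synthesized strand, which is the reverse complement of the template.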
What is next generation sequencing (NGS)?
aka deep-sequencing, high-throughput sequencing
- 500 Mb – 600 Gb / sequencing run possible
- major genome centers: 1’000 sequences per second
- → trick: massively parallel cyclic-array sequencing
What are global advantages and disadvantages of NGS relative to Sanger?
Global advantages:
- in vitro construction of sequencing library
- in vitro clonal amplification
- array-based sequencing → much higher degree of parallelism (hundreds of millions of sequencing reads)
- array features are immobilized → can be enzymatically manipulated by a single reagent volume
- lower costs for DNA sequencing (10 - 250 times cheaper)
Disadvantages:
- short read-length (30 – 350 bp)
- accuracy at least 10-fold lower than by Sanger sequencing
What is 3rd generation sequencing?
Single-molecule-sequencing without need to pause between read steps
Goals:
- higher throughput
- faster turnaround time (sequencing metazoan genomes in minutes)
- longer read lengths
- higher accuracy
- small amount of starting material (theoretically one molecule needed)
- low cost (< 100 $ for one human genome !!)
PacBio sequencing (SMRT sequencing):
Fluorescence-based detection of dNTP incorporation in real time
Nanopore sequencing:
The change in current depends on the physical and chemical properties of the molecule that passes through the nanopore
What did the complete sequence of a human genome do and how was it achieved?
- removes a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis
- this 8% of the genome has not been overlooked because of a lack of importance but because of technological limitations
- used PacBio HiFi and Oxford Nanopore ultralong-read sequencing
Why comparative genomics?
- To understand the genomic basis of the present
- Differences in lifestyle
- pathogen vs. non-pathogen
- obligate parasites vs. free-living
- Host specificity
- animals vs. plants, plant A vs. plant B, etc
- In the case of pathogens: this understanding should help us in fighting disease
- To understand the past
- How organisms evolved to be what they are
–> Molecular phylogenetics
What is molecular phylogenetics?
The use of molecular data to establish the relationship between species, organisms or gene families.
What are Homologues?
What two categories are there?
Sequences/genes that derive from a common ancestor-gene
Homology is an all or nothing relation: related genes are not (e.g.) 80% homologous, but 80% similar/identical
Categories:
Orthologous genes: homologs in different species derived by a speciation event
Paralogous genes: homologs in the same species derived by a duplication event
One paralogue of a pair often retains the ancestral gene function → the other is free to evolve and adopt new functions
(thus homologous sequences have same evolutionary ancestor)
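The distinction between homology (yes or no) and identity (a percentage) can be made concrete with a small sketch. It assumes the two sequences are already aligned to equal length with `-` as the gap character; the function name and example sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical positions between two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))  # 90.0
```

The number describes how similar two sequences are. Whether they are homologous is a separate conclusion about shared ancestry, inferred from such similarity.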
What is Convergence?
Convergent evolution creates analogous structures that have similar sequence/form/function, but that were not present in the last common ancestor of those groups
Example:
Lysozyme c of different unrelated organisms evolved convergently. Because they all have to be functional in the acidic stomach milieu, they ended up with a similar amino acid composition in the active site.
What is comparative genomics good for in Evolution?
- neutral evolution is "fast" → e.g. pseudogenes can no longer be identified as such after a relatively short period of time
- thus whenever a sequence (DNA, RNA, protein) is conserved, one can conclude that an evolutionary pressure exists (functional constraints)
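A minimal sketch of how conservation can be read off a multiple alignment: for each column, compute the fraction of sequences that share the most common residue. The alignment below is made up, gaps are treated like any other character, and real tools use more refined scores.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column fraction of sequences carrying the most common residue."""
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / n_sequences)
    return scores

# Hypothetical aligned protein fragments from four species.
alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
scores = column_conservation(alignment)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(scores)           # [1.0, 0.75, 0.75, 1.0, 0.75]
print(fully_conserved)  # [0, 3]
```

Columns with a score of 1.0 are candidates for positions under functional constraint.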
What is comparative genomics good for in Function prediction?
- conserved sequences indicate that these regions of the molecule are functionally important!
- conserved nt or aa most often have similar functions in homologous protein, DNA or RNA molecules
- with the help of comparative genomics one can predict the functions of molecules based on comparison with the already characterized homolog
- Comparison of protein domains:
- identification of a conserved protein domain and its comparison with homologous proteins can help in unraveling the protein function
- statements about gene functions can be made on the genome-wide level
- since it is very unlikely that one will be able to study all genes/gene products of a particular organism on the function/structural level
- even for well-studied organisms (such as E. coli, S. cerevisiae) we do not yet know the role of every gene
What did the Homology analysis of the yeast genome show?
- 30% of all genes were previously known
- function of 30% of all genes could be assigned based on homology search
- 10% of the genes had homologs in database; function unknown
- 30% of all genes (23% + 7%) were ORFs that lack homologs in the database

What does synteny mean originally and what is today’s meaning?
Synteny (original meaning):
gene loci are on the same chromosome within an individual or species
Conserved (shared) synteny (today’s meaning):
- describes preserved co-localization of genes on chromosomes of different species
- two or more genomic regions are derived from a single ancestral genomic region
How can genome alignment be visualized and how is it interpreted?
Pairwise alignment (dot plot)
- Match chromosome sequence from species A to species B
- If the sequences (gene order) were identical, we would see a straight line (identity)
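A minimal dot-plot sketch, assuming each letter stands for one gene (or one base) and that a dot is placed wherever a word of length `word` matches exactly; the function names and gene orders are illustrative.

```python
def dot_plot(seq_a, seq_b, word=1):
    """Set of (i, j) where the word at position i in A equals the word at j in B."""
    hits = set()
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                hits.add((i, j))
    return hits

def render(hits, n_rows, n_cols):
    """Text rendering: rows follow species A, columns follow species B."""
    return "\n".join(
        "".join("*" if (i, j) in hits else "." for j in range(n_cols))
        for i in range(n_rows)
    )

species_a = "ABCDEFGH"   # hypothetical gene order
species_b = "ABFEDCGH"   # same genes, with the segment C-F inverted
print(render(dot_plot(species_a, species_b), len(species_a), len(species_b)))
```

Identical gene order gives one unbroken diagonal. In this example the inverted segment appears as a short stretch running along the opposite diagonal.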

Where do inversions happen most often?
They seem to happen mostly around the origin or terminus of replication
What is comparative genomics good for in diseases?
- if the function of a disease-causing gene is unknown but a homolog has been identified in another (micro)organism, the function of the disease gene product can be deduced
- Example: Bloom's syndrome
- a mutation in the gene causes a growth defect in humans. The yeast homolog codes for a DNA helicase (which is involved in rRNA transcription and DNA replication)
- NGS has revolutionized infectious-disease research
- Bryant et al. (Science, 2016) sequenced 1’080 Mycobacterium abscessus isolates from 517 patients around the world
- M. abscessus causes human disease in several tissues (e.g. lung; especially in cystic fibrosis patients)
- Infections were thought to be acquired exclusively from the environment
What did the phylogenomic comparisons of sequencing data of three subspecies of M. abscessus reveal?
- whole-genome analysis revealed unexpected similarity at the genomic level between isolates from geographically diverse locations
→ Does not argue for environmental acquisition
- So far unknown human-to-human transmission (via asymptomatic carriers, long-lived cough aerosols, or infected surfaces)?
- Dominant clones had more mutations associated with drug resistance & correlated with poorer clinical outcomes
- Such deep-sequencing-based genome comparisons can potentially capture snapshots of evolution (they also sequenced genomes of M. abscessus over time from the same patient)
What is metagenomics and what questions can it help answer?
- Sequencing of DNA from environmental samples
- allows the study of complex (e.g. microbial) communities
- no cultivation of species needed
–> Allows you to sequence new organisms that can’t be cultivated in the lab
- Information about: biodiversity but also physiology, metabolic pathways…
- Environmental sample → sequencing of all DNA
- Who is there? Biodiversity characterization → new organisms
- Who does what? Physiological characterization → new genes
- 90-95% of microorganisms remain uncultivable in the laboratory
- → Tremendous knowledge gap about biodiversity
- at the last count 1.8 million species were known to science
- metagenomics promises to change our view of life on Earth
- billions of life forms that we never knew existed are expected to be out there
What is the main principle of metagenomics (workflow)?
- Sample collection
- Whole DNA extraction
- Whole DNA amplification
- Whole DNA sequencing
- Data analysis
What problems does metagenomics have?
- often fragmentary
- often highly divergent
- lack of reference genomes
- no organism of origin
- ab initio ORF predictions
- huge data
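To illustrate what ab initio ORF prediction means at its simplest, here is a naive sketch that scans the three forward reading frames for ATG...stop stretches. It deliberately ignores the reverse strand and alternative start codons, and the function name and `min_codons` threshold are illustrative; real gene finders use statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    codons from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs

print(find_orfs("CCATGAAACCCGGGTAGTT"))  # [(2, 17, 2)]
print(find_orfs("ATGAAACCC"))            # [] (read ends before a stop codon)
```

The second call shows why fragmentary data is a problem: an ORF that runs off the end of a short read or contig has no stop codon and is missed by this simple rule.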
What is the Marine Genome Sequencing Project?
Measuring the genetic diversity of ocean microbes
- almost 1,000 genomes for uncultivated microbes
- 6.12 million new proteins uncovered
- 1,700 totally unique large protein families (mainly viral)
What is MetaHIT?
Metagenomics of the Human Intestinal Tract
- Funded by the European Commission; started January 1, 2008 and lasted 4 years; scientists from 8 countries
- determine whether a “core” (standard) human microbiome exists
- establish associations between the genes of the human intestinal microbiota and human health and disease
- faecal samples from 124 human adults
- healthy individuals
- sick individuals
- Inflammatory Bowel Disease (Crohn’s disease, Ulcerative colitis)
- obesity
- determined a total of 576.7 Gb of DNA sequence prepared from stool samples (an average of 4.5 Gb of sequence was generated for each sample)
- 3.3 million different microbial genes in the gut of the individuals (150-fold more than in our own genome)
- each individual carries 536,000 microbial genes
→ ~160 microbial species
- in total: 1’000 - 1’150 bacterial species found in the 124 individuals
- Even for the most common 57 species present in > 90% of individuals, the inter-individual variability was between 12- and 2’187-fold
The expected final achievements of the project should be the discovery of associations between bacterial genes and human disease → preventive and personalized medicine
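The gene and species counts above can be related with some back-of-the-envelope arithmetic. The average number of genes per bacterial genome used here is an assumption for illustration and is not stated in these notes.

```python
# Figures taken from the notes above.
total_microbial_genes = 3_300_000
genes_per_individual = 536_000
fold_more_than_human = 150

# ASSUMPTION (not from the notes): an average bacterial genome has ~3,400 genes.
avg_genes_per_bacterial_genome = 3_400

# Human gene count implied by the "150-fold more" statement.
implied_human_genes = total_microbial_genes / fold_more_than_human

# Species-equivalents implied by the gene counts under the assumption above.
species_per_individual = genes_per_individual / avg_genes_per_bacterial_genome
species_in_cohort = total_microbial_genes / avg_genes_per_bacterial_genome

print(round(implied_human_genes))     # 22000
print(round(species_per_individual))  # 158
print(round(species_in_cohort))       # 971
```

Under this assumption, the per-individual gene count corresponds to roughly 160 species and the full gene catalogue to roughly 1,000 species, in line with the figures on this card.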
What does the gut microbiome affect?
Affects childhood growth:
- childhood undernutrition accounts for ~50% of all deaths in children under the age of five worldwide
- childhood malnutrition has been associated with an altered microbiota
- modifying the gut microbial communities in mice can alleviate diet-associated growth deficits
- It is becoming increasingly apparent that our diet, gut microbiota and health are inextricably linked.
Gut microbiota influences obesity:
- The gut microbiota co-develops with the host (rats on a high fat diet) and modulates whole-body metabolism by affecting energy balance
- acetate produced from dietary nutrients by the gut microbiota signals to the brain
- triggers secretion of the ‘hunger hormone’ ghrelin from stomach → increased food intake
- also potentiates glucose-stimulated insulin secretion from β-cells in the pancreas, promoting calorie storage and fat gain
- mechanistic link between onset of obesity and the gut microbiome
Gut microbiota regulate neuronal function and fear extinction learning:
- Single-nucleus RNA-Seq revealed changes in microglia and neurons related to synapse organization & assembly
- Metabolomics revealed 4 metabolites to be reduced in germ-free mice:
- phenyl sulfate
- pyrocatechol sulfate
- 3-(3-sulfooxyphenyl)propanoic acid
- indoxyl sulfate
- → new insight into the co-evolved relationship between the microbiota, the nervous system and mammalian behavior