Metagenomics Flashcards

1
Q

What environmental processes are microbes responsible for?

A

o Most of the biogeochemical cycles on Earth [the pathways by which a substance moves through the biotic and abiotic compartments of the environment]
o Waste processing
o Growth & reproduction of plants & animals
o Production of antibiotics, food fermentation & maintenance of human health

2
Q

What is Metagenomics?

A
  • The study of genetic material recovered directly from environmental samples
  • It involves pooling and studying the genomes of all the organisms in a community -> all the functions encoded in the community’s DNA (metagenome) can be studied
3
Q

What does Metagenomics let us find?

A

o Genetic info on potentially novel biocatalysts / enzymes
o Genomic linkages between function & phylogeny for uncultured organisms
o Evolutionary profiles of community function & structure

4
Q

What are the steps in a Typical Sequence-Based Metagenome Project?

A
  1. Experimental Design
  2. Sampling
  3. Sample fractionation
  4. DNA extraction
  5. DNA sequencing
  6. Assembly
  7. Annotation
  8. Statistical analysis
  9. Data storage
  10. Data sharing
5
Q

What is the foundation of a good Metagenomics study?

A

Experimental Design

6
Q

What criteria should extracted DNA satisfy?

A

o High quality
o Representative of all cells present in sample
o In sufficient amounts for library production & sequencing

7
Q

What are 4 processing methods used in Metagenomics studies?

A
  1. Physical fractionation:
    o Applicable only when certain parts of community are the target of analysis (like viruses in seawater)
  2. Physical separation & isolation of cells from samples:
    o Might be necessary to maximise DNA yield or avoid co-extraction of enzymatic inhibitors (like humic acids in soil, which stick to exposed DNA)
  3. Lysis of cells:
    o Direct lysis in soil has quantifiable bias vs indirect lysis in terms of: Microbial diversity, DNA yield, Resulting sequence fragment length
  4. Multiple Displacement Amplification (MDA):
8
Q

What is the process of Multiple Displacement Amplification?

A
  • Non-PCR based DNA amp technique.
  • Anneals random hexamer primers to template
    o No denaturation required, increase in [Hexamers] is sufficient to allow slow initial priming step
  • Once reaction starts strand-displacing mechanism of MDA releases ssTemplate for ongoing priming & amp
  • phi29 polymerase extends primers till they reach the next primer (start of a dsDNA section)
  • phi29 displaces the dsDNA strand it just hit and continues polymerization (‘under’ the displaced strand)
  • New primers bind to displaced strand -> polymerization again -> hyperbranched structure
  • MDA generates larger sized product with lower error frequency than conventional PCR amplification
9
Q

What are 3 sequencing methods used in Metagenomics?

A
  1. Classical Sanger Sequencing
  2. 454/Roche System / Pyrosequencing
  3. Illumina
10
Q

Why is Sanger still considered the gold standard sequencing technology?

A

o Low error rate
o Large insert sizes
o Long read length (>700bp)

11
Q

When is Sanger sequencing applicable?

A
  • Applicable if objective is generating close-to-complete genomes in low-diversity environs
12
Q

What are the disadvantages of Sanger sequencing?

A

o Labor-intensive
o Bias against genes toxic to host
- [because of large insert size, full length genes could be included which would be expressed and kill host]
o Overall cost per Gb (±400 000 USD)

13
Q

Roche system summary info

A
  • Based on the ‘sequencing by synthesis’ principle
  • Relies on detection of pyrophosphate release on nucleotide incorporation
    o Sanger, by contrast, relies on chain termination with dideoxynucleotides (ddNTPs)
  • Uses emulsion polymerase chain reaction (ePCR) to clonally amplify random DNA fragments attached to microscopic beads
  • Much cheaper than Sanger (± 20 000 USD per Gbp)
  • Avg read length = 600-800bp
  • Offers multiplexing (up to 12 samples in a single run of ±500 Mbp)
14
Q

What are the steps of Pyrosequencing via the 454/Roche System?

A
  1. DNA Library constructed -> DNA Fragments ligated with adaptors
  2. Strand amplification by ePCR on surfaces of 100 000’s of agarose beads
  3. Surfaces of beads have millions of oligomers -> each is complementary to the adaptors on the fragments
  4. ePCR uses vigorously mixed oil & aqueous mixture -> isolate individual agarose beads (each bead with an individual unique DNA fragment hybridized to its surface)
    a. Isolated in aqueous micelles that also contain the PCR reactants
  5. Micelles pipetted into wells of microtiter plate -> temp cycling produces > 1mil sequence-ready beads
  6. Each bead has up to 1mil copies of original annealed fragment
  7. Beads added to surface of 454 pico titer plate (PTP)
    a. PTP: Single wells in tips of fused fiber optic strands (1 bead in each well)
  8. Smaller magnetic & latex beads (attached to active enzymes needed for pyrosequencing) added to surround DNA-containing agarose beads in PTP
  9. PTP placed in sequencer, nucleotide & reagent solutions delivered into it in sequential fashion
  10. Incorporation of a nucleotide releases PPi -> ATP sulfurylase + APS convert PPi to ATP -> ATP + luciferase -> oxidation of luciferin -> light
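
The way the light signal maps back to bases can be sketched in code. This is a toy model, not the 454 base-calling software: the fixed flow order (T, A, C, G), the function names, and the assumption that light is exactly proportional to the number of bases incorporated per flow are all illustrative.

```python
def flowgram(synthesized: str, flow_order: str = "TACG"):
    """Return (nucleotide, signal) pairs for a strand being synthesized.

    Each flow delivers one nucleotide type. The signal is the number of
    bases incorporated in that flow (one pyrophosphate per base), so a
    homopolymer run gives a proportionally stronger signal.
    """
    if set(synthesized) - set(flow_order):
        raise ValueError("strand contains a base that is never flowed")
    flows = []
    pos = 0
    while pos < len(synthesized):
        for nt in flow_order:
            run = 0
            while pos < len(synthesized) and synthesized[pos] == nt:
                run += 1
                pos += 1
            flows.append((nt, run))  # signal 0 means no light in this flow
            if pos == len(synthesized):
                break
    return flows


def call_bases(flows):
    """Rebuild the sequence from an idealized, noise-free flowgram."""
    return "".join(nt * signal for nt, signal in flows)


flows = flowgram("TTAG")
read = call_bases(flows)
```

In real data the signal is noisy, which is why long homopolymer runs are a typical source of pyrosequencing errors.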
15
Q

What is the Illumina sequencing average read length?

A

±150bp

16
Q

What is the cost of Illumina?

A

±50 USD per Gbp

17
Q

What are the drawbacks of Illumina?

A
  • Limited read length -> increased proportion of unassembled reads which may be too short for functional annotation
  • Limited systematic errors, but some datasets have high error rates at the tail ends of reads
    o Can clip reads to remove the low-quality tail ends
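
Clipping a low-quality tail can be sketched as follows. This is a minimal illustration, assuming per-base Phred quality scores are already available as integers; the function name and the threshold of 20 are assumptions, not values from any particular trimming tool.

```python
def clip_low_quality_tail(seq, quals, min_q=20):
    """Trim bases from the 3' end until a base with quality >= min_q is reached.

    seq   : read sequence as a string
    quals : list of integer Phred scores, one per base
    min_q : hypothetical quality cut-off
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


clipped_seq, clipped_quals = clip_low_quality_tail(
    "ACGTACGT", [35, 34, 33, 30, 28, 15, 10, 2]
)
```

Only the tail is removed here; low-quality bases in the middle of a read are left alone.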
18
Q

Why is Assembly necessary?

A
  • Assembly of short read fragments is necessary to obtain longer genomic contigs to:
    o Determine genome sequence of uncultured organisms
    o Obtain full-length CDS (coding DNA sequence) for subsequent characterization
19
Q

What is a Pangenome?

A

o Entire gene set of all strains of a species. Includes:
- Core genome (genes present in all strains)
- Variable genome (genes present in only some strains)
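
The core/variable split is a pair of set operations. The sketch below uses made-up strain names and gene labels purely for illustration.

```python
def split_pangenome(strain_genes):
    """Return (pangenome, core, variable) gene sets.

    strain_genes: dict mapping strain name -> set of gene names.
    """
    gene_sets = list(strain_genes.values())
    if not gene_sets:
        return set(), set(), set()
    pangenome = set().union(*gene_sets)      # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes present in all strains
    variable = pangenome - core              # genes present in only some strains
    return pangenome, core, variable


# Hypothetical example data
strains = {
    "strain_A": {"gyrB", "recA", "rpoB", "toxin1"},
    "strain_B": {"gyrB", "recA", "rpoB", "phage_int"},
    "strain_C": {"gyrB", "recA", "rpoB", "toxin1", "abc_transporter"},
}
pangenome, core, variable = split_pangenome(strains)
```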

20
Q

Why are assembly algorithms that assume clonal genomes less suitable for Metagenomics?

A
  • Microbial communities have significant variation at strain & species level
    o The ‘clonal’ assumptions built into many assemblers might therefore suppress contig formation for some heterogeneous taxa at specific parameter settings
    o De Bruijn-type assemblers deal explicitly with non-clonality of natural populations
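
Why strain variation complicates assembly can be seen in a toy de Bruijn graph. This is only a sketch with hypothetical reads and function names. Real assemblers add error correction, coverage handling and graph simplification, and it makes no claim about how any particular assembler resolves branches.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_points(graph):
    """Nodes with more than one outgoing edge, i.e. where the path is ambiguous."""
    return {node: sorted(nexts) for node, nexts in graph.items() if len(nexts) > 1}


# Two hypothetical strain variants differing by a single base (T vs A)
reads = ["ACGTTGCA", "ACGATGCA"]
branches = branch_points(de_bruijn_graph(reads, k=4))
```

A single-base difference between strains is enough to create a branch, so an assembler that expects one clonal path has to either pick a side or break the contig.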
21
Q

What are the 2 Assembly strategies for Metagenomics samples?

A

o Reference-based assembly (co-assembly):

- Works well if closely related reference genomes are available
- BUT: differences between sample genome & reference (large insertions, deletions etc.) can lead to a fragmented assembly or to divergent regions not being covered

o De novo assembly:
- Typically requires larger computation resources

22
Q

What is Binning?

A

The process of sorting DNA sequences into groups that might represent an individual genome or genomes from closely related organisms

23
Q

What are the 2 types of info within a DNA sequence that binning algorithms use?

A
  1. Genomes have a conserved nucleotide comp which will also be reflected in genomic DNA fragments
  2. An unknown DNA fragment might encode for a gene which is similar to known genes in a reference database
24
Q

When using any binning algorithm what important considerations should be thought about?

A

o The type of input data available

o The existence of suitable training datasets or reference genomes

25
Q

What are the 3 methods of binning?

A
  1. Compositional Binning
  2. Similarity Binning
  3. Fragment Recruitment
26
Q

What is Compositional Binning?

A
  • Uses sequence composition to classify/cluster metagenomic reads into taxonomic groups
  • Genomes have a conserved nucleotide comp which will also be reflected in genomic DNA fragments (%GC or a particular abundance distribution of k-mers)
  • Not reliable for short reads as they don’t contain enough info
    • 100bp read can at best possess <50% of all 256 possible 4-mers (subsequence of 4 bases)
    • Not sufficient to determine a 4-mer distribution that will reliably relate this read to any other read
  • Compositional assignment can be improved if training datasets (a long DNA fragment of known origin) exist that can be used to define a compositional classifier.
    • ‘Training fragments’ should ideally contain a phylogenetic marker (like an rRNA gene) that can be used for high-resolution taxonomic assignment of binned fragments.
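
The short-read limit above follows from simple counting: a read of length L contains L - k + 1 overlapping k-mers, so a 100bp read holds at most 97 distinct 4-mers out of 4^4 = 256. The sketch below checks that arithmetic and computes the two compositional features mentioned (%GC and a k-mer distribution). Function names are illustrative.

```python
from collections import Counter


def kmer_profile(seq, k=4):
    """Relative frequency of each k-mer in seq, ignoring k-mers with ambiguous bases."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def gc_content(seq):
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


read_length, k = 100, 4
max_distinct = read_length - k + 1      # 97 overlapping 4-mers at most
max_fraction = max_distinct / 4 ** k    # 97 / 256, about 0.38

profile = kmer_profile("ACGTACGT")
```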
27
Q

What does compositional binning produce?

A
  • Produces:

• Taxon Abundance Profile

28
Q

What is similarity binning?

A

Classifies a read into a taxonomic/phylogenetic group based on similarity to previously identified genes or proteins

29
Q

What does Similarity Binning produce?

A

Taxon or Phylogenetic Abundance Profile

30
Q

What is Fragment Recruitment?

A
  • Method of Binning

- Reads aligned to nearly identical genome sequences -> metagenomic coverage estimates of the genome
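
A coverage estimate from recruited reads can be sketched as below. It assumes the reads have already been aligned and filtered for near-identity by some aligner, and that alignments are given as 0-based, half-open (start, end) intervals. Both the input format and the function names are assumptions.

```python
def coverage_profile(genome_length, alignments):
    """Per-position read depth along a reference genome or contig."""
    depth = [0] * genome_length
    for start, end in alignments:
        # Clamp to the reference so overhanging alignments do not raise errors
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def mean_coverage(depth):
    """Average depth over the whole reference."""
    return sum(depth) / len(depth) if depth else 0.0


depth = coverage_profile(10, [(0, 5), (3, 8)])
avg = mean_coverage(depth)
```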

31
Q

What does Fragment Recruitment Produce?

A
  • Produces:

• Genome or Contig Coverage Profile

32
Q

What 2 broad processes are involved in Analysis?

A
  1. Assessing Taxonomic Diversity
  2. Inferring Biological Function

33
Q

How is taxonomic diversity typically quantified during analysis in a Metagenomics study?

A
  1. Marker Gene Analysis (like 16S rRNA)
  2. Binning (grouping sequences into defined taxonomic groups)
  3. Assembling sequences into distinct genomes
  • Not mutually exclusive approaches; may be synergistic
    o Could bin sequences into taxonomic groups -> assemble each group’s sequences
    o Or could conduct initial assembly then bin
34
Q

How can a taxonomic diversity assessment determine similarity between 2 or more communities?

A
  • Determines which microbes are present in community & their relative abundance
  • Profiles a community & can be used to determine similarity of 2 or more communities
    o More shared taxa = Greater similarity between communities
  • May give insight into biological function of community when it contains members of functionally described taxa
    o e.g. presence of cyanobacteria suggests community is photosynthetic
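
One simple way to turn "more shared taxa = greater similarity" into a number is the Jaccard index (shared taxa divided by all taxa seen in either community). The card does not name a metric, so this choice is an assumption. It also ignores relative abundance, which abundance-weighted measures would take into account.

```python
def jaccard_similarity(taxa_a, taxa_b):
    """Shared taxa / total taxa across two communities (0.0 to 1.0)."""
    a, b = set(taxa_a), set(taxa_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty communities
    return len(a & b) / len(a | b)


# Hypothetical communities
community_1 = {"Bacteroides", "Prevotella", "Synechococcus"}
community_2 = {"Bacteroides", "Prevotella", "Escherichia"}
similarity = jaccard_similarity(community_1, community_2)
```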
35
Q

What is Marker Gene Analysis? And what does it involve?

A

A straightforward & computationally efficient way of quantifying a metagenome’s taxonomic diversity

Involves:

1. Comparing reads to a database of taxonomically informative gene families (marker genes)
2. Identifying reads that are marker gene homologs
3. Using similarity to marker gene database sequences to taxonomically annotate each metagenomic homolog
36
Q

What are the most frequently used marker genes?

A

o rRNA genes or protein coding genes

o all tend to be single copy & common in microbial genomes

37
Q

What are 3 important caveats in marker gene analysis?

A
  1. Assumes the fraction of the metagenome that is homologous with marker genes is an accurate representation of community’s taxonomic diversity
  2. Genome sequences available during marker gene identification may not adequately reflect diversity of genomes present in community
  3. MG analysis not appropriate for taxa that don’t contain the markers being explored
38
Q

How is a functional profile produced during Data analysis and what can it be used for?

A
  • A functional profile describes the number of distinct types of functions (and relative abundance) in a metagenome.
  • Produced by identifying reads with protein coding sequences & comparing them to a database of genes, protein families, metabolic pathways etc. for which we have functional info
  • Profile can be used to:
    1. Compare metagenomes to identify metabolically similar communities
    2. Ascertain how various treatments influence the functional comp of community
    3. Identify functions associated with specific environs or host-physiological variables
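
A functional profile can be sketched as a table of relative abundances built from per-read annotations. The family labels below are placeholders, and the input format (one family label or None per read) is an assumption about what an upstream annotation step would produce.

```python
from collections import Counter


def functional_profile(read_annotations):
    """Relative abundance of each function among the annotated reads.

    read_annotations: iterable with one family label per read, or None
    for reads that could not be annotated.
    """
    counts = Counter(fam for fam in read_annotations if fam is not None)
    total = sum(counts.values())
    return {fam: n / total for fam, n in counts.items()} if total else {}


# Hypothetical annotations for four reads
annotations = ["famA", "famA", "famB", None]
profile = functional_profile(annotations)
n_functions = len(profile)  # number of distinct function types observed
```

Profiles built this way from two metagenomes can then be compared family by family.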
39
Q

What is an ORFan?

A
  • ORFans are a subset of taxonomically restricted genes (TRGs) which are unique to a specific taxonomic level
  • Any sequence that can’t be mapped to known sequences is an ORFan
  • ORFans are responsible for the genetic novelty in microbial metagenomics
40
Q

What are the 3 hypotheses regarding ORFans?

A
  1. ORFans simply reflect erroneous CDS calls caused by imperfect detection algorithms
  2. ORFans are real genes, but encode unknown biochem functions
  3. ORFans have no sequence homology with known genes, but might have structural homology with known proteins, thus representing known protein families
41
Q

What is involved in a typical functional annotation workflow?

A
  1. Each read put through gene prediction to identify subsequences that may encode proteins
    a. You can get partial gene predictions when a coding sequence extends upstream or downstream beyond the read (i.e. the gene starts/stops in the middle of the read)
  2. Each predicted protein is compared to database of protein families
  3. Predicted peptides classified as homologs of the family are annotated with the family’s function
42
Q

What steps are involved in functional annotation?

A
  1. Gene prediction
  2. Gene / functional annotation
  • Non-mutually exclusive
43
Q

What is gene prediction?

A
  • Determines which metagenomic reads contain coding sequences
    o Once identified they can be functionally annotated
  • Can be done with assembled or unassembled sequences
  • Not all predicted genes will exhibit homology to known sequences
    o Because of large diversity of genomes in nature compared to number in databases
    + Some predictions will be spurious (not real genes)
    + Some will represent novel or highly diverged proteins
  • Therefore, GP is critical in identifying novel genes
  • Consensus approach (using multiple methods) improves gene prediction
44
Q

Gene Prediction for assembled metagenomes with full-length coding sequences

A
  • Gene prediction is similar to process of analysis of whole genome sequences
  • Caveat:
    o Some prediction algorithms require species-specific parameters that are not always appropriate when contigs have been sampled from diverse/novel lineages
    o Algorithms are typically trained using sequence features of already sequenced organisms - a problem if you are looking outside of these model organisms
45
Q

Gene Prediction for unassembled/poorly assembled metagenomes

A

o Involves predicting partial coding sequences when a gene extends upstream or downstream beyond the read (i.e. starts/stops in middle of read)
o Very challenging

46
Q

What are 3 ways Genes are predicted in metagenomes?

A
  1. Gene Fragment Recruitment (binning)
  2. Protein Family Classification
  3. de novo gene prediction
47
Q

What are the restrictions of gene prediction via gene fragment recruitment?

A
  • Can’t identify divergent homologs of known genes
  • Not appropriate for metagenomes from communities with genomes underrepresented in sequence databases [especially if ID of novel/highly divergent genes is desired]
48
Q

How are genes predicted using Gene Fragment Recruitment?

A
  • Map Metagenomic reads to database of gene sequences
  • Reads that are identical/near-identical to full-length genes are considered representative sub-sequences of the gene
  • If gene has functional annotation it can simultaneously provide functional annotation for matched Metagenomic read
49
Q

De novo Gene Prediction

A
  • Can potentially ID novel genes
  • GP models trained by:
    • Evaluating properties of microbial genes (length, codon usage, GC bias)
  • Used to assess whether a metagenomic read contains a gene
  • Doesn’t rely on sequence similarity to database
  • Allows you to ID genes in metagenome that share common properties with other microbial genes but may be highly diverged from any gene that has been discovered to date
  • Can be difficult to determine whether a predicted gene is real or spurious when the predicted gene is novel
50
Q

Gene Prediction via Protein Family Classification

A
  • Translates each read into 6 reading frames -> sequence alignment of resulting peptides vs database of protein sequences
  • Alignments analyzed to identify metagenomic sequences that encode translated peptides which exhibit homology to proteins in database
  • Not useful for identifying novel types of proteins
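
The six-frame translation step can be sketched as below, using the standard genetic code (bases ordered T, C, A, G). Function names and the frame labels (+1 to +3, -1 to -3) are illustrative. The subsequent alignment against a protein database is not shown.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (non-ACGT characters are kept)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(read):
    """Translate a read in 3 forward and 3 reverse-complement frames."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```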
51
Q

How is Functional Annotation performed via Protein Family Classification?

A
  • Classifies predicted metagenomic proteins (from gene prediction) into protein families
  • Because proteins in a family share a common ancestor, they are assumed to encode similar biological functions
  • If a metagenomic sequence is determined to be a homolog of a family, it is inferred to encode the family’s function
  • Once a metagenomic sequence has been compared to all proteins/models, it can be classified into:
    o A single family [the family with the best fit]
    o A series of families [all families that exhibit a significant classification score]
    o No family: suggests novel, highly diverged or spurious protein
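
The three outcomes can be sketched as a small decision rule over per-family classification scores. The score values, the threshold of 25.0 and the function name are hypothetical. Real tools report their own scores and significance cut-offs.

```python
def classify_sequence(scores, threshold=25.0, mode="best"):
    """Assign a sequence to protein families based on classification scores.

    scores    : dict mapping family name -> score (higher = better fit)
    threshold : hypothetical significance cut-off
    mode      : "best" -> single best-fitting family
                "all"  -> every family with a significant score
    Returns a list of family names; an empty list means 'no family'.
    """
    significant = {fam: s for fam, s in scores.items() if s >= threshold}
    if not significant:
        return []  # novel, highly diverged or spurious protein
    if mode == "best":
        return [max(significant, key=significant.get)]
    return sorted(significant, key=significant.get, reverse=True)


# Hypothetical scores for one predicted protein
scores = {"famA": 80.0, "famB": 30.0, "famC": 5.0}
best = classify_sequence(scores)
all_hits = classify_sequence(scores, mode="all")
no_hit = classify_sequence({"famC": 5.0})
```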
52
Q

What is a protein family?

A

Protein family = group of evolutionary related protein sequences/sub-sequences

53
Q

Why is functional annotation by Protein Family Classification imperfect?

A
  1. Functional diversity encoded in the metagenome may only approximate the community’s functional activity
    a. Presence of a gene doesn’t mean it’s expressed at time of sampling
  2. Most databases contain families that have no known functional annotation
    a. Metagenomic reads homologous with these families will not be ascribed a function
    b. Can still be informative -> provide support for metagenomic coding sequence prediction & may be useful diagnostics
  3. Protein family database used might be subject to phylogenetic biases-> some comms disproportionately more accurately or thoroughly annotated than others
    a. Each database uses different approaches to IDing families & functionally annotating them
    b. Result is that different databases may annotate different proportions of the metagenome & produce different functional profiles
  4. Presumes that function is relatively evolutionarily static
  5. May be more proteins & functions in nature than have been described in current sequence databases
54
Q

What is mariner transposon mutagenesis?

A

A forward genetic strategy for connecting phenotype to gene because stable random insertions can be generated in a recipient genome without specific host factors.

55
Q

What was the Method used in the Metagenomic Goodman study to identify genetic determinants of gut microbiota?

A
  1. Generated transposon mutants
  2. Colonized into intestinal tract of mice
  3. Genomic DNA was purified from cecal contents (Cecum = pouch connected to junction of the small and large intestines)
  4. Digested with MmeI - cleaves 16bp outside of transposon, capturing a genomic fragment that identifies the insertion site
  5. Separated by PAGE (polyacrylamide gel electrophoresis)
  6. Transposon-sized fragments were extracted and ligated to a dsDNA adaptor bearing a 3′-NN overhang
  7. PAGE-purified adaptor-ligated library molecules were PCR amplified for 18 cycles using a transposon-specific and an adaptor-specific primer
  8. 125bp product was purified by PAGE and sequenced
56
Q

What were the findings from the Metagenomic Goodman study to identify genetic determinants of gut microbiota?

A
  1. Relative representation of mutants was consistent between the ceca of individual mice and reflected the abundance of most genes in the input population
  2. Mutants in 370 genes showed significantly altered representation in all 3 cohorts of mice
    o Largest category of genes identified in this screen encode hypothetical or conserved hypothetical proteins
57
Q

What is Emulsion Polymerase Chain Reaction (ePCR)?

A
  • Oil & aqueous mixture vigorously mixed to isolate individual agarose beads (each bead with an individual unique DNA fragment hybridized to its surface)
    o Isolated in aqueous micelles that also contain the PCR reactants
  • Micelles pipetted into wells of microtiter plate -> temp cycling produces > 1mil sequence-ready 454 beads
  • Each bead contains up to 1mil copies of original annealed fragment