M3 L19: Genomics and sequencing Flashcards
what is a genome
full haploid seq of DNA in a species
what can genomes tell us
inform understanding of gene function, inform understanding of evolution, inform understanding of microbial ecology for unculturable microbes
what are the 2 original methods for seq a genome
clone by clone
shotgun sequencing
what’s the clone by clone approach? pros and cons?
break a genome into large frags via partial restriction digest –> insert in a large vector (BAC) and clone –> break large frags into small frags –> subclone small frags in plasmids and sequence –> assemble chromosome
pro: reliable
cons: slow, cost ineffective
what’s the whole genome shotgun approach? pros and cons?
break genomic DNA into small fragments –> seq everything at high coverage –> assemble overlapping sequences
pros: cheap
con: less accurate / assembly is harder bc genomes are redundant
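A minimal sketch of "assemble overlapping sequences", assuming error-free reads from one strand. The function names (`overlap`, `greedy_assemble`) and the toy reads are made up for illustration; real assemblers use far more robust methods than this greedy merge.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Toy reads taken from the made-up sequence ATGCGTACGTTAGC
reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

If the same stretch of sequence occurs in several places (repeats), more than one merge looks equally good, which is why redundant genomes are harder to assemble from short fragments.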
what are the modern genome sequencing techniques
1) illumina
2) pac-bio
3) oxford nanopore
when to use illumina? pros and cons?
pros: cheap per bp, low error rate 1%
cons: short reads 150-250 bp (can’t sequence genome de novo)
when to use pac-bio sequencing? pros and cons?
can seq genome de novo
pros: longer reads 15 kb+
cons: more expensive and higher error rate 10% but errors are random –> take the consensus of all seqs to figure out the correct base
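A small sketch of why random errors can be fixed by taking the consensus. It assumes the reads are already aligned and the same length; the `consensus` function and the example reads are made up for illustration.

```python
from collections import Counter

def consensus(reads):
    """Majority-vote base at each position of pre-aligned, equal-length reads."""
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

reads = [
    "ACGTACGT",
    "ACGAACGT",  # error at position 3
    "ACGTACCT",  # error at position 6
    "TCGTACGT",  # error at position 0
    "ACGTACGT",
]
result = consensus(reads)
print(result)
```

Each error lands at a different position, so the correct base still wins the vote at every column. If every read had the same error at the same position, the vote would return the wrong base.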
when to use oxford nanopore? pros and cons?
can seq genome de novo
pros: very long reads 30 kb+
cons: also very expensive and high error rate 10%; errors are systematic, not random, so can’t correct by sequencing more reads
best method for sequencing genome de novo?
mix of short and long read techniques
long –> assembly
short –> accuracy
what is GWAS? do you have to assemble the whole genome?
genome wide association study
don’t have to assemble genomes, just sequence all genomes and compare diseased and non diseased to reference
what is an example of the evolve and resequence technique? what’s a chemostat?
evolve two species of yeast in sulfate limited conditions –> sequence their genomes before and after –> mutations that recur frequently are likely adaptive
a chemostat is an apparatus that continuously adds new and removes old growing media
what other species were sequenced as part of the human genome project
drosophila melanogaster
mus musculus
c. elegans
saccharomyces cerevisiae
E. coli
what is metagenomics? what’s it used for?
sequencing all genomes in an environmental sample to determine what species are present (especially for unculturable bacteria)
what is reverse ecology
sequence environmental sample and identify most optimized genes –> can infer those are the ones that are most important for survival
what is the great plate count anomaly
when you culture an environmental sample, there are way less colonies than actual microbes in the sample
what is the baas becking hypothesis
NS is the strongest force in determining microbial ecology
everything could live anywhere but the environment selects
support for and against the baas becking hypothesis for fish in a lake
for: if different fish in the same lake that eat different things have different microbiomes (dif microbiome bc dif food = dif enviro)
against: if fish in the same lake that eat different things have the same microbiome (would mean microbes can’t physically disperse)
what is gene annotation? how do you do it? con to this approach?
determining if a sequence is a gene
look for reading frames that code for more than 50AAs bc that is uncommon for random seqs
con: some functional seqs that code for proteins are less than 50AAs
2 types of annotation?
structural: locate genes
functional: locate genes and determine their function
how many reading frames does each sequence have
6
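A rough sketch of scanning all six reading frames (3 offsets on each strand) for open reading frames of at least 50 codons. The function names, the ATG-to-stop definition of an ORF, and the test sequence are assumptions for illustration, not a real annotation pipeline.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(seq):
    """Return six lists of codons: offsets 0-2 forward, then offsets 0-2 reverse."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frames.append([strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)])
    return frames

def find_orfs(seq, min_codons=50):
    """Return (frame_index, start_codon_index, codons_before_stop) for each long ORF."""
    orfs = []
    for frame_index, codons in enumerate(six_frames(seq)):
        start = None
        for i, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                length = i - start  # counts the ATG, excludes the stop codon
                if length >= min_codons:
                    orfs.append((frame_index, start, length))
                start = None
    return orfs

# Made-up sequences: one long ORF, one ORF that is too short to pass the cutoff
gene_like = "ATG" + "GCT" * 60 + "TAA"
too_short = "ATG" + "GCT" * 10 + "TAA"
print(find_orfs(gene_like))
print(find_orfs(too_short))
```

The second sequence shows the drawback of a length cutoff: a real gene shorter than 50 codons would be skipped.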
how to determine gene function from sequence?
genes with similar sequences usually have similar functions and belong to the same “gene family”
how do new gene families arise? what are the 2 possible consequences?
duplication
homologs from duplication –> paralogs
homologs from speciation –> orthologs
what is exon shuffling? what is it trying to explain? why is it maybe inaccurate?
exons are inserted into dif protein seqs –> give protein that function
possible way to get new genes
would mean that proteins are highly modular but this is probably unlikely bc inserting a different exon would change the protein folding and function, likely in a LOF way
what are pan and core genomes
pangenome: any gene in any member of that species
core genome: set of genes in all members of that species
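The two definitions map onto set union and set intersection. A tiny sketch with made-up strain and gene names:

```python
# Hypothetical gene content of three strains of one species
strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneB", "geneC", "geneE"},
}

pangenome = set().union(*strains.values())          # found in any strain
core_genome = set.intersection(*strains.values())   # found in every strain

print(sorted(pangenome))
print(sorted(core_genome))
```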
what is the c-value paradox? is it really a paradox?
complexity does not correlate with genome size
not actually a paradox bc genome size does not indicate number of genes, it indicates number of transposable elements
onions have 5x genome size of humans
3 points to remember
1) genome evo can be rapid bc of transposons but phenotypic evo can still be slow
2) fwd genetic screens only identify 1/2 of the genes in model organisms
3) gene numbers are lower than originally thought
2 drawbacks to relying on phys chars for phylogeny
1) can’t observe microbes
2) may lead to inaccurate conclusions bc of convergent evolution (for same fxn not due to comm ancestor)
why can we infer phylogeny from sequencing
seq divergence is roughly linear with time (molecular clock)
2 ways to find functional noncoding sequences (regulatory or RNA gene regions)
1) phylogenetic footprinting: sequence distantly related species and look for highly conserved ones
2) phylogenetic shadowing: sequence closely related species and look for ones that are conserved in all
who articulated how gene duplication could drive evo innovation (especially big transitions like invertebrates to vertebrates)
Susumu Ohno
why does duplication allow for evolutionary innovation
NS doesn’t tolerate mutations in functional genes but duplication means one copy can be mutated
3 main consequences of gene duplications
1) pseudogenization: one copy gets mut that inactivates it
2) neofunctionalization: one copy gets a mutation that gives it an additional function
3) subfunctionalization: one gene has 2 functions and each paralog specializes for one function; loses other
what is functional genomics? what categories are contained
perform experiments on a genome wide scale
transcriptomics (sequence all mRNAs in cell)
proteomics (identify and quantify all proteins in cell)
phenotypic screens on knockout/deletion collections
examples of transcriptomics
1) hybridize fluorescent cDNA to microarrays of known DNA seqs –> lots of fluorescence = lots of mRNAs
2) directly sequence cDNA –> most abundant cDNA = most abundant mRNA
how can deletion collections help to infer gene function?
examine growth in many different conditions –> 80% won’t have any change in phenotype in rich media but 97% will have some defect in specific conditions
Design an evolve and resequence study for antibiotic resistance genes in a species of pathogenic bacteria. What would you look for in the sequencing data to determine if a gene was causal?
Grow several independent replicate populations of the bacteria in chemostats with growth media and the antibiotic. Sequence the genomes before and after the experiment and compare them to find new mutations. If mutations in the same gene arise repeatedly in independent populations that survived the antibiotic, that gene is likely causal.
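A sketch of what "look for the same change recurring" could mean in the sequencing data. It assumes variants have already been called and are stored as (gene, change) pairs; the function name, gene names, and mutations are all made up.

```python
from collections import Counter

def recurrent_genes(ancestor_variants, evolved_replicates, min_replicates=2):
    """Count how many replicate populations gained a new mutation in each gene."""
    counts = Counter()
    for variants in evolved_replicates:
        new_variants = variants - ancestor_variants  # ignore what the ancestor already had
        counts.update({gene for gene, _change in new_variants})
    return {gene: n for gene, n in counts.items() if n >= min_replicates}

# Hypothetical variant calls
ancestor = {("geneQ", "C10T")}
replicates = [
    {("geneQ", "C10T"), ("geneX", "G248A")},
    {("geneQ", "C10T"), ("geneX", "A301C"), ("geneY", "T5G")},
    {("geneQ", "C10T"), ("geneX", "G248A")},
    {("geneQ", "C10T"), ("geneZ", "C77T")},
]
hits = recurrent_genes(ancestor, replicates)
print(hits)
```

Here geneX is hit in three of four populations, so it is the best candidate. geneQ is ignored because the ancestor already carried that variant.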
What are the two ways we can perform transcriptomics? What does transcriptomics tell us?
One method is microarrays: obtain oligonucleotides for every gene in a cell. Reverse transcribe fluorescent cDNA from mRNAs in the cell → cDNA libraries for before and after changing the conditions. Hybridize the cDNAs to the complementary DNA/oligonucleotides and observe where there is a lot of fluorescence → indicates which mRNAs are present in the most copies
Another method is RNA/cDNA sequencing: reverse transcribe mRNA to make cDNA. See which sequences have the most reads
Transcriptomics tells us the relative abundance of each mRNA in a cell/if a cell responds to changing conditions by up-regulating transcription.
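A sketch of the sequencing version: count reads per gene, convert to relative abundance, then compare conditions. It assumes each read has already been assigned to a gene; the function names, gene labels, and read counts are made up, and the small pseudocount is only there to avoid dividing by zero.

```python
import math
from collections import Counter

def relative_abundance(read_gene_assignments):
    """Fraction of all reads that came from each gene."""
    counts = Counter(read_gene_assignments)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

def log2_fold_change(before, after, pseudo=1e-6):
    """log2(after / before) per gene; positive means up-regulated."""
    genes = set(before) | set(after)
    return {
        g: math.log2((after.get(g, 0) + pseudo) / (before.get(g, 0) + pseudo))
        for g in genes
    }

# Hypothetical read assignments before and after changing conditions
before = relative_abundance(["gene1"] * 50 + ["gene2"] * 50)
after = relative_abundance(["gene1"] * 20 + ["gene2"] * 80)
change = log2_fold_change(before, after)
print(change)
```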
Suppose I perform a transcriptomics and a proteomics study on a certain genotype of yeast cells growing in rich media. I note that on average protein levels correlate with mRNA levels, but the correlation is not perfect. What do I infer from the outliers? Does this have implications for other transcriptomics studies that do not perform proteomics? Explain.
The outliers might have lower translation/protein levels than expected given the amount of transcription/mRNAs present in the cell. This might be because of RNA interference (RNAi). For example, miRNAs or siRNAs can be loaded into RISC, which then binds to the mRNA after transcription and either destroys the mRNA or prevents translation.
This has implications for transcriptomics studies that don’t include proteomics because they might lead to inaccurate inferences. You may assume a gene codes for an important protein because there is a lot of the mRNA present, when in reality, the mRNA doesn’t get translated in high quantities.
Explain how the yeast deletion collection can be used to identify groups of genes involved in similar pathways.
Yeast deletion collection: collection of thousands of yeast strains that each have a deletion of one gene. Expose them to some experimental condition and observe which strains have a change in abundance. The ones that change have deletions in genes that are in similar pathways and are important for survival in the experimental condition.