Lecture 15 Flashcards
genome
the complete set of genetic material present in a cell or organism
genomics
the cloning and molecular characterization of entire genomes
a haplotype
The specific set of SNPs and other genetic variants observed on a single chromosome or part of a chromosome
linkage disequilibrium
The nonrandom association between genetic variants within a haplotype
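A minimal sketch of how this nonrandom association is often quantified, assuming two biallelic SNPs (alleles A/a and B/b) and a made-up list of observed haplotypes. The measures D = p(AB) - p(A)p(B) and r^2 are standard population-genetics quantities, not something stated on the card; the function name is hypothetical.

```python
from collections import Counter

def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic SNPs.

    haplotypes: list of 2-character strings such as "AB", "Ab", "aB", "ab"
    (hypothetical data; upper case marks one allele at each site).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_ab = counts["AB"] / n                          # frequency of the AB haplotype
    p_a = sum(h[0] == "A" for h in haplotypes) / n   # frequency of allele A
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # frequency of allele B
    d = p_ab - p_a * p_b                             # zero when alleles associate at random
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2

complete_ld = ["AB"] * 5 + ["ab"] * 5        # A always travels with B
no_ld = ["AB", "Ab", "aB", "ab"] * 3         # all combinations equally common

print(linkage_disequilibrium(complete_ld))
print(linkage_disequilibrium(no_ld))
```

In the first sample, knowing the allele at one SNP fully predicts the other, which is the property that lets a few tag-SNPs stand in for a whole haplotype.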
tag-SNPs
The few SNPs used to identify a haplotype
Genome-wide association studies use
numerous SNPs scattered across the genome to find genes of interest
Annotating a gene means
linking its sequence information to other information about its function and expression, the protein it encodes, and similar genes in other species.
Metagenomics is an emerging field in which
the genome sequences of an entire group of organisms that inhabit a common environment are sampled and determined (eDNA).
Synthetic biology seeks to
design organisms that might provide useful functions
Functional genomics
characterizes what sequences do—their function
Genome content consists of
much more than just protein-coding genes
Intergenic sequences. → “non-coding” DNA
Repetitive sequences → short and long sequences that repeat in tandem or are interspersed throughout the genome
prokaryotic and eukaryotic genomes differ drastically in
size & organization
prokaryote - genome in the cytosol (no organelles, DNA not in a nucleus)
eukaryote - genome in distinct chromosomes - tightly bound to proteins
Anatomy of a prokaryotic genome
1) single, circular chromosome
2) Single origin of replication (required by the DNA replication machinery)
3) Genomes are compact
→ ~1-10 million bases (Mb)
4) Most content is genic
→ Minimal intergenic DNA (non-coding)
→ few repetitive sequences
→ No introns
5) Genome size is directly related to gene content
→ larger genomes encode more proteins
regulatory consequences of organization of prokaryotic genome
Genes in biochemical or signaling pathways often clustered and controlled as operons
Chromosome not sequestered in nucleus
Chromosome not bound by histone proteins
→ No chromatin
Eukaryotic genomes
1) Genomes divided into multiple linear chromosomes, with telomeres & centromeres
2) DNA complexed with histone proteins (=chromatin) in a nucleus
3) Genome size tends to be much larger, and varies widely, even within a taxonomic group
→ Genes interrupted by introns
→ Copious intergenic DNA
→ Copious repetitive DNA
4) Genomes don’t tend to be compact
5) With rare exceptions, genes not clustered into operons
6) Many genes (most human genes) are interrupted by introns; genes are far apart
C-value is the
DNA content per haploid cell
→ think of this as genome size (how many bp)
G-value is the
protein-coding gene number
(how many DNA sequences code for proteins)
G-value paradox
Gene number does not fully correlate with organismal complexity
G-value paradox explained by
(1) alternative splicing
(2) expansion/contraction
alternative splicing explanation for G-value paradox
Multiple exons from one gene can be spliced in different ways (=alternative splicing) to form distinct mRNAs and proteins
No. of proteins >> no. of protein-coding genes
Explains smaller-than-expected gene count in multicellular spp.
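A small sketch of why one gene can yield many mRNAs, assuming a simplified model in which the first and last exons are always kept and every internal "cassette" exon is independently included or skipped. The exon names and the function are hypothetical; real splicing is more constrained than this.

```python
from itertools import product

def cassette_isoforms(first, cassettes, last):
    """Enumerate mRNAs when every cassette exon is independently kept or skipped."""
    isoforms = []
    for keep in product([True, False], repeat=len(cassettes)):
        middle = [exon for exon, kept in zip(cassettes, keep) if kept]
        isoforms.append([first] + middle + [last])
    return isoforms

isoforms = cassette_isoforms("E1", ["E2", "E3", "E4"], "E5")
for isoform in isoforms:
    print("-".join(isoform))
print(len(isoforms), "possible mRNAs from 1 gene")
```

With n cassette exons the model gives 2^n isoforms, so protein diversity can grow much faster than gene count.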
expansion/contraction explanation for G-value paradox
Gene expansion & contraction is frequent, even among closely related spp.
gene duplication
family duplication
entire genome duplicated
C-value paradox
Genome size doesn’t fully correlate with organismal complexity
C-value paradox explanation
expansion of non-genic DNA, largely repetitive DNA
> 85% of human genome is repetitive DNA → caused by interspersed transposable elements → non-autonomous, non-coding transposable elements
Assembly of eukaryotic genomes is
very challenging
Human genome is 3,200 Mb (million bases; =3.2 Gb) with large amounts of repetitive DNA.
Technology not up to the task: Sanger sequencing
Draft human genome reference assembly
Sequencing is only the beginning, resulting in multiple millions of “reads”
Assembly → sequencing reads must be put in order on chromosomes (we are skipping this aspect)
Draft assembly → unfinished, with lots of “gaps”
Reference assembly → the assembly (usually a working model) is used as a framework to guide interpretation of individual genome variation and functional genome analysis
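A back-of-the-envelope sketch of why sequencing produces millions of reads, using the 3,200 Mb genome size given earlier. The read length and target coverage are assumptions chosen only for illustration, not values from the lecture.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Reads required so that reads * read_length = coverage * genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

HUMAN_GENOME_BP = 3_200_000_000   # 3,200 Mb
READ_LENGTH_BP = 800              # assumed read length, for illustration only
COVERAGE = 8                      # assumed average depth, for illustration only

n_reads = reads_needed(HUMAN_GENOME_BP, READ_LENGTH_BP, COVERAGE)
print(f"{n_reads:,} reads")
```

Every one of those reads then has to be placed in order, and repetitive DNA makes many placements ambiguous, which is where gaps in a draft assembly come from.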
HGP brought about
radical technological changes in genetics
and
radical conceptual changes in genetics
technological changes in genetics brought about by HGP
High throughput, massively parallel, genome-wide data collection and functional assays
Sequencing efficiencies
→ 1 human genome: from over 10 years and $2.7 billion (1990-2003) to 1 day and <$999 in 2019
Concomitant strides in computing power and analysis software
conceptual changes in genetics brought about by HGP
Humans are more variable than we thought
Humans have far fewer protein-coding genes than we thought…
…yet, most of the genome is transcribed → a lot of RNA not turned into proteins
Cells are full of noncoding RNAs
Initial predictions before the Human Genome Project were ____
Current estimate is ___
~200,000
~20,000 or fewer
caveat: we need to reconsider how a “gene” is defined, as we will see later in the course
Functional Genomics
how to go from DNA sequence to what it does
Genome controls phenotype through
transcription
We expect that functional elements in the genome should be
1) transcribed, or
2) bind proteins that regulate transcription
Bioinformatics involves ___
which can ____
using computer technology to collect, store, analyze and disseminate biological data and information
can increase our understanding of health and disease and, in certain cases, as part of medical care.
Homologous genes
Genes that share a common evolutionary origin. Likely to have conserved sequence and function.
Paralogs
Homologous genes in the same species.
e.g. alpha and beta hemoglobin in humans.
Orthologs
Homologous genes in different species.
e.g. mouse and human alpha hemoglobin
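The paralog/ortholog cards can be sketched as a tiny classifier. It follows the simplified definitions above (same species vs. different species) and assumes family membership is already known; the tuple layout and function name are hypothetical. Stricter definitions distinguish homologs that arose by gene duplication from those that arose by speciation, which this sketch does not attempt.

```python
def classify_homologs(gene_a, gene_b):
    """Classify two genes given as (species, gene_name, family) tuples."""
    species_a, _, family_a = gene_a
    species_b, _, family_b = gene_b
    if family_a != family_b:
        return "not homologous"      # no shared evolutionary origin assumed
    if species_a == species_b:
        return "paralogs"            # homologous genes in the same species
    return "orthologs"               # homologous genes in different species

human_alpha = ("human", "HBA1", "globin")
human_beta = ("human", "HBB", "globin")
mouse_alpha = ("mouse", "Hba-a1", "globin")

print(classify_homologs(human_alpha, human_beta))    # paralogs
print(classify_homologs(human_alpha, mouse_alpha))   # orthologs
```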
Predict function from sequence
how closely related a sequence is to sequences in other genomes (ex. SARSr-CoV)
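One simple way to express "how closely related" is percent identity between two sequences that have already been aligned. This is a minimal sketch with made-up sequences; it assumes gaps are written as "-" and skips gapped columns, which is only one of several conventions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped alignment columns where the two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue                 # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

print(percent_identity("ATGCCGTA", "ATGACGTA"))   # 87.5
```

A query that is highly identical to a gene of known function is a candidate for sharing that function, which is the idea behind predicting function from sequence.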
Comparative genomics
field of genomics that studies similarities and differences in gene content, function, and organization among genomes of different organisms
Transcriptome
All RNA molecules transcribed from a genome
Transcriptomics
Techniques used to identify and quantify the transcriptome.
protein domains
Complex proteins often contain regions, called ___,
that have specific shapes or functions
(ex. zinc finger)
RNA-seq
Transcriptomics
identifies all transcribed elements
→ extract all cellular RNA
→ transcribe → cDNA
→ chop up and add adapters → sequence
Relies on next generation sequencing and bioinformatics
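The bioinformatics step after sequencing usually turns read counts into comparable expression values. The sketch below uses transcripts per million (TPM), a common normalization that is an assumption here rather than something named in the lecture; the gene names, counts, and lengths are made up.

```python
def tpm(read_counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM)."""
    # Reads per kilobase: corrects for longer transcripts collecting more reads.
    rates = {gene: read_counts[gene] / (lengths_bp[gene] / 1000)
             for gene in read_counts}
    total = sum(rates.values())
    # Rescale so that all values in the sample sum to one million.
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}

counts = {"geneA": 100, "geneB": 200, "geneC": 50}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 500}

for gene, value in tpm(counts, lengths).items():
    print(gene, round(value))
```

Here geneB has the most reads but the lowest TPM, because its transcript is four times longer than geneA's.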
Microarrays
Transcriptomics
Can be used to determine relative levels of mRNA (i.e. expression levels) for 1000’s of genes.
Employ an array of probes that are complementary to mRNA sequences.
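Relative expression from a microarray is commonly reported as a log2 ratio of probe intensities between two samples. This is a minimal sketch with made-up intensities; background correction and normalization, which real analyses need, are left out.

```python
import math

def log2_ratios(sample, control):
    """log2(sample / control) for every probe present in the sample."""
    return {gene: math.log2(sample[gene] / control[gene]) for gene in sample}

# Hypothetical probe intensities for three genes.
treated = {"geneA": 800.0, "geneB": 100.0, "geneC": 250.0}
untreated = {"geneA": 200.0, "geneB": 400.0, "geneC": 250.0}

for gene, ratio in log2_ratios(treated, untreated).items():
    print(gene, ratio)
```

Positive values mean higher mRNA levels in the treated sample, negative values mean lower, and 0 means no change.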
Proteome
All proteins encoded in a genome.
Proteomics
Techniques used to identify and quantify the proteome.
Mass spec
Proteomics
is a high throughput method to identify proteins in a cell
→ digest proteins into peptides
→ separate fragments by mass-to-charge ratio
→ match peak profiles to a database of known proteins
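The three steps above can be sketched as a toy peptide-mass lookup. Several things here are assumptions for illustration: digestion uses the usual trypsin rule (cut after K or R unless the next residue is P), neutral peptide masses stand in for measured mass-to-charge peaks, and the protein names and sequences are made up.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(protein):
    """Cut after K or R, except when the next residue is P."""
    peptides, current = [], ""
    for i, aa in enumerate(protein):
        current += aa
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def best_match(observed_masses, database, tolerance=0.01):
    """Score each database protein by how many observed masses it explains."""
    scores = {}
    for name, sequence in database.items():
        predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(mass - p) <= tolerance for p in predicted)
            for mass in observed_masses
        )
    return max(scores, key=scores.get), scores

database = {"protein_X": "GAKRPLLKMM", "protein_Y": "MSTKAVVR"}  # made-up entries
observed = [peptide_mass(p) for p in tryptic_digest("GAKRPLLKMM")]
print(best_match(observed, database))
```

Real search engines work from fragmentation spectra and statistical scoring, so this only conveys the matching idea.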
ChIP-seq
(Chromatin ImmunoPrecipitation)
Proteomics
(affinity capture)
identifies DNA bound by known DNA-binding proteins
→ e.g., transcription factors (TFs), RNA pol
antibodies bind a specific protein → take genomic DNA (with its bound proteins) → mix with the antibody → antibody binds the protein (which is still bound to DNA) → pull the complex out of solution and sequence the DNA bound by that protein
Requires specific antibodies
→ need to know which protein you are looking for (and have an antibody for it)
high throughput sequencing
two-dimensional polyacrylamide gel electrophoresis
(2D-PAGE), proteomics
in which the proteins are separated in one dimension by charge, separated in a second dimension by mass, and then stained
Protein Microarrays
Employ ___
Can be used to ____
Proteomics
Employ an array of proteins immobilized on a solid support.
to identify protein-protein interactions or measure expression of proteins within cells (using immobilized antibodies).
Modifications of affinity capture and other techniques can be used to ____ termed the_____
determine the complete set of protein interactions in a cell,
interactome.
Genome-wide mutagenesis screens
can be used to search for all genes affecting a particular function or trait.
two methods—random inducement of mutations on a genome-wide basis and mapping with molecular markers—are coupled and automated
segmental duplications
duplicated regions greater than 1000 bp that are almost identical in sequence.
Many eukaryotic genomes, especially those of multicellular organisms, are filled with
multigene family is a
group of evolutionarily related genes that arose through repeated duplication and evolution of an ancestral gene.
gene deserts
large chromosomal regions with no protein-encoding genes (studied using genetically engineered mice that were missing these regions)
collinearity
many genes are present in the same order in related genomes
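A minimal sketch of checking collinearity between two genomes, assuming each genome is given as an ordered list of gene names and each gene appears once per genome. The gene names and function names are hypothetical.

```python
def shared_gene_order(order_a, order_b):
    """Return the shared genes as they are ordered in each genome."""
    shared = set(order_a) & set(order_b)
    in_a = [gene for gene in order_a if gene in shared]
    in_b = [gene for gene in order_b if gene in shared]
    return in_a, in_b

def is_collinear(order_a, order_b):
    """True when the shared genes appear in the same order in both genomes."""
    in_a, in_b = shared_gene_order(order_a, order_b)
    return in_a == in_b

genome_1 = ["g1", "g2", "g3", "g4", "g5"]
genome_2 = ["g1", "g3", "g4", "g6", "g5"]   # lost g2, gained g6, order kept
genome_3 = ["g3", "g1", "g4", "g5"]         # g1 and g3 swapped

print(is_collinear(genome_1, genome_2))   # True
print(is_collinear(genome_1, genome_3))   # False
```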
pangenome
the entire set of genes possessed by all members of a particular species.
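In set terms, a pangenome is the union of the gene sets of all sampled members of a species. The sketch below uses made-up strains and genes; the "core" (genes found in every member) and "accessory" (all other genes) labels are common usage but go beyond the definition on the card.

```python
def pangenome_summary(genomes):
    """Return (pangenome, core genes, accessory genes) as sets."""
    gene_sets = [set(genes) for genes in genomes.values()]
    pan = set().union(*gene_sets)          # every gene seen in any member
    core = set.intersection(*gene_sets)    # genes present in all members
    accessory = pan - core                 # genes missing from at least one member
    return pan, core, accessory

strains = {
    "strain_1": ["geneA", "geneB", "geneC"],
    "strain_2": ["geneA", "geneB", "geneD"],
    "strain_3": ["geneA", "geneC", "geneE"],
}

pan, core, accessory = pangenome_summary(strains)
print(sorted(pan))
print(sorted(core))
print(sorted(accessory))
```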
single-nucleotide polymorphism (SNP)
A site in the genome where individual members of a species differ in a single base pair
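A minimal sketch of locating such sites, assuming the individuals' sequences are already aligned and of equal length. The sequences are made up, and any frequency threshold used to call a variant a polymorphism is ignored.

```python
def find_snp_sites(aligned_sequences):
    """Map each variable position (0-based) to the sorted alleles seen there."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(length):
        alleles = {seq[pos] for seq in aligned_sequences}
        if len(alleles) > 1:               # individuals differ at this base
            sites[pos] = sorted(alleles)
    return sites

individuals = ["ATGCCGTA", "ATGACGTA", "ATGCCGTT"]
print(find_snp_sites(individuals))
```

For these three sequences the variable positions are 3 (A or C) and 7 (A or T).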