Week 3 Flashcards
Genome
all DNA and identification of all DNA elements
Transcriptome
All transcripts expressed (list plus analysis of expression)
Proteome
all proteins expressed (list plus analysis of modification)
Bacterial Genome to bacterial proteome
Large-scale ORF finding: look for open reading frames in the bacterial genome.
Take the protein files and build a database, which we would call the proteome of the bacteria.
Simple for bacteria because the coding region in the DNA is not interrupted by introns.
So you can go from DNA to the protein coding capacity of that DNA very simply.
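A minimal sketch of the ORF-finding idea on this card, assuming a sequence made only of A, C, G and T and scanning just the three forward reading frames (a real ORF finder would also scan the reverse strand). The function name `find_orfs`, the `min_codons` cutoff and the example sequence are illustrative assumptions, not part of the course material.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def find_orfs(dna, min_codons=2):
    """Return the proteins encoded by ATG...stop ORFs in the three forward frames."""
    dna = dna.upper()
    proteins = []
    for frame in range(3):
        codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
        current = None  # codons of the ORF being read, or None when outside an ORF
        for codon in codons:
            if current is None:
                if codon == "ATG":
                    current = [codon]
            elif CODON_TABLE[codon] == "*":
                # Stop codon closes the ORF; keep it if it is long enough.
                if len(current) >= min_codons:
                    proteins.append("".join(CODON_TABLE[c] for c in current))
                current = None
            else:
                current.append(codon)
    return proteins


# Hypothetical toy sequence: one ORF (ATG AAA TTT TAG) in the third frame.
toy_proteome = find_orfs("CCATGAAATTTTAGGG")
print(toy_proteome)
```

Collecting the returned protein strings for every ORF in a genome is the "build a database" step described above; it works this directly only because bacterial coding regions are uninterrupted.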
Eukaryotic genome to eukaryotic proteome
With a large-scale ORF finder we can’t go directly from the genome to the proteome,
because of splicing.
We need the transcriptome to get to the proteome.
What is the Transcriptome?
All expressed RNA
Major problem with annotating the genome, particularly determining the proteome
The modification that RNA undergoes in eukaryotes makes it hard to annotate the genome and, in particular, to determine the proteome (the total coding capacity of the genome).
Eukaryotic mRNA
Extensively processed
5’ cap (7-methylguanosine joined to the mRNA by a 5’-to-5’ triphosphate linkage)
AUG is the first codon of the ORF.
Presence of the poly-A-tail helps us annotate the proteome.
Reverse transcriptase
The DNA copy (cDNA) is made by reverse transcriptase, which requires a DNA primer.
Use an oligo-dT primer that hybridizes with the poly-A tail.
Therefore the total transcriptome is not represented: only polyadenylated RNA is copied.
Before nanopore sequencing this was the only way to sequence RNA.
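A small sketch of why oligo-dT priming misses part of the transcriptome: only RNAs that end in a poly-A tail are copied into first-strand cDNA. The `min_tail` threshold and the example RNA names and sequences are made-up assumptions for illustration.

```python
# Base pairing used when copying RNA into DNA.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return first-strand cDNA (written 5' to 3') for an RNA written 5' to 3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna.upper()))


def oligo_dt_cdna(rnas, min_tail=10):
    """Copy only the RNAs whose 3' end carries at least `min_tail` A residues."""
    return {
        name: reverse_transcribe(seq)
        for name, seq in rnas.items()
        if seq.upper().endswith("A" * min_tail)
    }


# Hypothetical sample: one polyadenylated mRNA and one RNA without a tail.
sample = {
    "mRNA_1": "AUGGCC" + "A" * 12,
    "untailed_RNA": "GGCUAGC",
}
cdna_library = oligo_dt_cdna(sample)
print(cdna_library)
```

The untailed RNA never appears in `cdna_library`, which is the sense in which the total transcriptome is not represented.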
Proteome comes from
Translation of mRNA.
Another processing event that causes a lot of problems with annotation
Splicing. RNA polymerase transcribes DNA into the primary RNA transcript, which is not the translated transcript.
Intronic sequences are removed through splicing,
producing the mature messenger RNA.
cDNA is simply a
complementary copy of the mature mRNA after the intronic sequences are removed.
A large amount of the genome is not expressed
Intergenic regions, which are not transcribed, and intronic regions, which are transcribed and then spliced out.
Spliced cDNA sequence
A DNA copy of the expressed sequence; it aligns to the genomic sequence in a broken pattern (exon blocks separated by intron gaps).
Determine what the cDNA sequence encodes and send this information to the proteome database.
We can’t easily get this information from the genomic sequence because the coding sequence is split across exons (three in this example) that are separated from one another.
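The "broken pattern" can be illustrated with a toy spliced aligner. This is only a sketch under strong assumptions: exact matches only, exons in genomic order, and a greedy longest-match rule that a real aligner would not rely on. The function name `spliced_blocks` and the sequences are hypothetical.

```python
def spliced_blocks(cdna, genome):
    """Greedily place a cDNA on a genome as a list of (start, end) exon blocks."""
    blocks = []
    c_pos = 0  # how much of the cDNA has been placed so far
    g_pos = 0  # the next block must start at or after this genomic position
    while c_pos < len(cdna):
        block = None
        # Try the longest remaining cDNA prefix first, then shorter ones.
        for length in range(len(cdna) - c_pos, 0, -1):
            hit = genome.find(cdna[c_pos:c_pos + length], g_pos)
            if hit != -1:
                block = (hit, hit + length)
                break
        if block is None:
            raise ValueError("cDNA does not match the genome exactly")
        blocks.append(block)
        c_pos += block[1] - block[0]
        g_pos = block[1]
    return blocks


# Hypothetical gene: three exons separated by two introns.
exon_1, exon_2, exon_3 = "ATGGCC", "CATTTT", "AAATGA"
intron_1, intron_2 = "GTAAGTTTTCAG", "GTACGTCCCCAG"
genome = exon_1 + intron_1 + exon_2 + intron_2 + exon_3
cdna = exon_1 + exon_2 + exon_3

blocks = spliced_blocks(cdna, genome)
print(blocks)
```

Joining the genomic sequence inside the returned blocks gives back the cDNA, while the gaps between blocks correspond to the introns.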
Alternative Splicing
Genes undergo alternative splicing.
When you align different cDNA sequences to the genome, you find that for some genes the alignments differ considerably from one cDNA to another,
indicating that they came from transcripts that have undergone alternative splicing.
Example: this gene produces six distinct messenger RNA transcripts
that encode three distinct polypeptides.
When you align these cDNA sequences to Drosophila DNA you find six different alignment patterns, due to six different splicing patterns of the mRNA transcripts.
Alternative splicing benefit
increases the number of proteins that can be encoded by a single gene.
One gene produces one pre-messenger RNA that can be spliced in multiple ways: the first three exons are included in every mRNA, but any one of 12 alternative versions of exon 4 can be added.
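A short sketch of how the number of possible mRNAs grows, assuming the card describes three exons that are always included plus one choice from 12 mutually exclusive versions of exon 4. The exon labels are placeholders, not real annotation.

```python
from itertools import product

# Exons assumed to be present in every mRNA.
constitutive_exons = ["exon1", "exon2", "exon3"]

# Each cluster is a set of mutually exclusive alternatives: exactly one is chosen.
alternative_clusters = [
    [f"exon4.{i}" for i in range(1, 13)],  # 12 versions of exon 4
]

# Every isoform is the constitutive exons plus one pick from each cluster.
isoforms = [
    constitutive_exons + list(choice)
    for choice in product(*alternative_clusters)
]
print(len(isoforms))
print(isoforms[0])
```

Adding further clusters to `alternative_clusters` multiplies the isoform count, which is how a single gene can encode many proteins.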
Alternative poly A sites
Different poly-A sites result in exon 1 being spliced to either exon 2 or exon 3.
3’ end is cleaved at alternate positions.
Alternative promoters
Results in certain exons being included or excluded from transcription
Exon included/excluded/mutually exclusive
exon can be included or excluded
you can have one exon or the other just not both
Alternative 5’/3’ splice site
Earlier or later splicing at the 5’ or 3’ end.
Retained Intron
In some messages splicing occurs such that the intron remains in the mature mRNA.
Major goals of RNA seq Analysis
Count the relative number of transcripts in the sample.
Determine the structure of the transcripts in the sample.
Differential Cell Expression
Distinctive set of mRNA expressed by a cell.
Goals of sc RNA seq
- Determine the poly A+ transcriptome of individual cells.
- Useful in the study of development and human disease
- Determine cell type by determining the genes and the phenotype/function they perform
sc RNA sequencing
- Suspension of cells + microparticles + lysis buffer are mixed in a microfluidics apparatus with oil.
- Oil encapsulates one cell and one microparticle into a droplet.
- Cell is lysed by the lysis buffer.
- Polyadenylated RNA hybridizes to the oligo-dT primer on the microparticle.
- Barcoded primer bead: PCR handle/cell barcode/UMI
- Reverse transcribe using mRNA as the template.
- Amplify the STAMPs (single-cell transcriptomes attached to microparticles) through PCR and then sequence them (paired-end reads).
- Align cDNA to genome to determine the genes present
- Group results by cell
- Count unique UMIs for each gene in each cell
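The last three steps (align, group by cell, count unique UMIs) can be sketched as follows, assuming each read has already been reduced to a (cell barcode, UMI, gene) triple. The barcodes, UMIs and gene names are invented for illustration.

```python
from collections import defaultdict


def count_umis(reads):
    """Return {cell_barcode: {gene: number of distinct UMIs}}."""
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        # PCR duplicates share the same cell barcode, UMI and gene,
        # so adding to a set collapses them into one molecule.
        umis_seen[(cell_barcode, gene)].add(umi)

    counts = defaultdict(dict)
    for (cell_barcode, gene), umis in umis_seen.items():
        counts[cell_barcode][gene] = len(umis)
    return dict(counts)


# Hypothetical aligned reads: (cell barcode, UMI, gene).
reads = [
    ("CELL_A", "UMI_1", "geneX"),
    ("CELL_A", "UMI_1", "geneX"),  # PCR duplicate of the read above
    ("CELL_A", "UMI_2", "geneX"),
    ("CELL_A", "UMI_3", "geneY"),
    ("CELL_B", "UMI_1", "geneX"),
]
expression = count_umis(reads)
print(expression)
```

Counting UMIs rather than reads is what makes the numbers reflect original mRNA molecules instead of PCR amplification.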
Changing pattern of gene expression development
When you start off as a single cell you have one transcriptome, but as cells specialize during development, each cell begins to express a different pattern of genes.
Proteome
Catalogue of proteins expressed by an organism
What proteins are unique to the organism
What is the function of the protein
What Proteins Are Unique to the Organism or shared?
All life is related therefore many genes are shared.
Shared genes
Homologous genes
Orthologs
Homologous genes between different species.
Have the same common ancestor.
Paralogs
Homologous genes within the same species.
Result of a duplication of a gene.
What is the function of the proteome?
if an orthologous protein is well characterized in one organism then it may be reasonable to propose that all the orthologous proteins share this function.
Conserved domain
Can suggest the biochemical function of the protein in the proteome.
Interactome
Determine all the interactions between proteins.
Interactome
An interactome is the result of a systematic analysis of the proteome.
Two Major methods
Yeast two hybrid screen
Affinity purification and mass spectrometry
Yeast Two Hybrid Screen
- Completely genetic method of detecting interactions, based on the transcription factor GAL4
- GAL4 binds to its UAS (upstream activating sequence) and can drive expression of a reporter gene (lacZ, which encodes beta-galactosidase)
- Results in blue yeast cells
- GAL4 is made up of two domains (DNA-binding domain and activation domain)
- To get expression of the lacZ reporter gene, the DNA-binding domain and the activation domain have to be in the same molecule (both are necessary).
Yeast two hybrid screen check
Separate the DNA-binding domain from the activation domain.
Fuse the DNA-binding domain to the bait protein.
Fuse the activation domain to the prey protein.
If the two proteins do not interact, the reporter gene is not expressed and the colonies are white.
If they do interact, the reporter gene is expressed, resulting in a blue colony.
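The readout logic of the screen can be written as a tiny model, assuming a known set of interacting pairs stands in for what the yeast cells would reveal. The protein names and the `known_interactions` set are hypothetical.

```python
def reporter_expressed(bait, prey, known_interactions):
    """The reporter turns on only if bait and prey bind, bringing the
    DNA-binding domain and the activation domain together."""
    return frozenset((bait, prey)) in known_interactions


def colony_color(bait, prey, known_interactions):
    """Blue when the lacZ reporter is expressed, white otherwise."""
    return "blue" if reporter_expressed(bait, prey, known_interactions) else "white"


# Hypothetical ground truth: proteinA and proteinB interact, nothing else does.
known_interactions = {frozenset(("proteinA", "proteinB"))}

screen = {
    prey: colony_color("proteinA", prey, known_interactions)
    for prey in ["proteinB", "proteinC", "proteinD"]
}
print(screen)
```

In a real screen the direction is reversed: the colony colors are observed and the interactions are inferred from them.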
Affinity Purification and Mass Spectrometry
Set up strains that each express one protein with an AP tag.
Purify the protein attached to the tag (the tag is genetically fused to the protein).
Thousands of proteins are fused to the AP tag.
Purify the protein that is attached to the AP tag; if it is attached to another protein in a protein complex, that protein will be purified along with it.
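A sketch of how pull-down results could be turned into an interaction map, assuming each tagged bait comes with the list of proteins identified alongside it by mass spectrometry. The bait and prey names are placeholders. Note that co-purification shows membership of the same complex, not necessarily direct contact.

```python
def build_interactome(pulldowns):
    """Turn {bait: [co-purified proteins]} into a set of undirected bait-prey pairs."""
    edges = set()
    for bait, preys in pulldowns.items():
        for prey in preys:
            if prey != bait:  # the bait itself is always recovered; skip it
                edges.add(frozenset((bait, prey)))
    return edges


# Hypothetical results from three tagged strains.
pulldowns = {
    "baitA": ["baitA", "protB", "protC"],
    "protB": ["baitA"],
    "baitD": [],
}
interactome = build_interactome(pulldowns)
print(sorted(sorted(pair) for pair in interactome))
```

Using `frozenset` pairs means an interaction seen from both sides (baitA pulls down protB, and protB pulls down baitA) is stored once.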
How can genomes vary?
Genome Size
Genome content/genome number
Genome structure
Genome structure variation
Type
Shape (can have a complex mixture of linear and circular genomes)
Number of pieces
Chromosome number
Varies a lot, even within a species (fusion) or between individuals.
Why does genome structure matter?
Advantage of circular genomes: replication starts from one point and goes all the way around.
Linear genomes are harder to replicate because they have ends (telomeres): the 3’ ends are difficult to copy completely, so chromosomes get shorter.
Extended Telomeres
Telomerase carries an RNA template that provides the sequence for elongating the ends, creating a set of repeats.
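A toy model of the end-replication problem and of telomerase counteracting it. All lengths and rates below are made-up numbers chosen for easy arithmetic, not measurements; the repeat unit shown is the human telomere repeat TTAGGG.

```python
TELOMERE_REPEAT = "TTAGGG"  # human telomeric repeat unit


def telomere_length_after(divisions, start_bp, loss_per_division_bp,
                          repeats_added_per_division=0):
    """Track telomere length (in bp) over a number of rounds of replication."""
    length = start_bp
    for _ in range(divisions):
        length -= loss_per_division_bp  # incomplete replication of the end
        length += repeats_added_per_division * len(TELOMERE_REPEAT)  # telomerase
        length = max(length, 0)  # a telomere cannot have negative length
    return length


# Hypothetical numbers: 10,000 bp telomere, 60 bp lost per division.
without_telomerase = telomere_length_after(50, 10_000, 60)
with_telomerase = telomere_length_after(50, 10_000, 60, repeats_added_per_division=10)
print(without_telomerase, with_telomerase)
```

With no telomerase the end shrinks every division; adding ten 6 bp repeats per division exactly offsets the assumed 60 bp loss.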
Another solution to shortening of DNA during replication
Chromosome ends are enclosed single stranded loops. Helps overcome end replication problem and protects telomere.
Genome Size
Total length of genetic instructions within a biological compartment.
Units of Genome Size
Nucleotide (Single stranded) Base (single and double stranded) Base pair (double stranded)
What is genome size?
One full haploid set, never measure duplicated information.
Complexity Bacteria/Virus/Humans
There are many different genome sizes; consider whether the simplicity of viruses and the complexity of humans are reflected in genome parameters.
Genome Size and Complexity
Complexity is associated with either genome size or gene number.
Genome size and the number of genes (bacteria and archaea)
The number of genes in bacteria and archaea varies depending on the genome being examined.
There is a linear relationship between the number of genes and genome size.
Biggest Genomes (Eukaryotes)
Plants have immense variation in genome size.
So do single cell amoeba.
Genome size is not associated with complexity.
Immense amount of variation at the level of organelle and virus genomes.
Eukaryotic microbes are immensely variable.
Genome size is not associated with complexity
What do big genomes have that little genomes do not have (discrepancy between size and complexity)
Non coding DNA
Difference in gene (coding) to base ratio
As genome size increases the number of noncoding bases increase.
Only coding sequence ends up in mature mRNA.
Lungfish and parasite
Lungfish (130 Gbp): 99.9% non-coding
Parasite (2 Mbp): >90% coding
% Non coding
Varies within a genome
Chromosome 19 has 25 genes per Mbp.
Y has 2 genes per Mbp.
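Gene density is just gene count divided by length in megabase pairs. A minimal sketch, using invented gene counts and lengths picked to reproduce the two densities on this card rather than real chromosome figures:

```python
def genes_per_mbp(gene_count, length_bp):
    """Gene density in genes per megabase pair (1 Mbp = 1,000,000 bp)."""
    return gene_count / (length_bp / 1_000_000)


# Hypothetical regions, each 2 Mbp long.
dense_region = genes_per_mbp(gene_count=50, length_bp=2_000_000)
sparse_region = genes_per_mbp(gene_count=4, length_bp=2_000_000)
print(dense_region, sparse_region)
```

Comparing the two values shows how the same amount of DNA can carry very different numbers of genes within one genome.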
Why lots or little non-coding DNA
Non coding DNA occasionally has a function.
Some organisms that are in a race to replicate lose non-coding sequence in order to replicate more quickly.
Mobile DNA that has spread throughout the genome increases its length and the amount of non-coding DNA.
Complexity is
Not reflected in the number of genes (some genomes may contain many duplications of the same gene)
No vast increase in the number of protein domains with the number of genes in the genome.