Gene Finding Flashcards
Week 2 Lecture 1
Describe the eukaryotic genome structure.
- Genes are split into exons, which together make up the ORF, and introns, which are non-coding sequences that interrupt the exons
- The promoter is the nucleotide sequence to which RNA polymerase binds to initiate transcription
- Initiation codon (usually ATG)
- Termination codons (TAA, TAG, TGA)
- Interspersed repeats: LINEs, SINEs, LTR elements
What is TRY4?
- A gene that is responsible for the synthesis of trypsinogen (trypsin precursor)
- Discontinuous gene (split into 5 exons and 4 introns)
What are V28 and V29-1?
- Gene segments that each specify part of the T-cell receptor beta chain
- Not complete genes; they must be joined to other gene segments by DNA rearrangement (this is not the same as RNA splicing)
- The rearrangement is a permanent change to the genome made during cell differentiation
What does gene finding aim to find?
- Which regions code for proteins
- Gene start and end regions
- Exon/intron boundaries
- Regulatory sequences
Describe prokaryotic gene finding.
Prokaryotes have small genomes (0.5-5 Mbp) with a high coding density (>90%) and no introns. This makes gene identification relatively easy (~99% success rate). Problems include overlapping ORFs (a consequence of the compact genome), short genes that fall below the length cutoff (roughly 50 aa), and locating promoters.
Describe eukaryotic gene finding.
Eukaryotes have large genomes (10-120,000 Mbp) with a low coding density (<50%, 2-3% in humans), and their genes contain introns. This makes gene identification relatively difficult (~50% success rate). Problems include introns that interrupt ORFs, long non-coding stretches between genes, and alternative splicing.
Methods of gene finding
- Ab initio methods
- Similarity-based methods
- Integrated approaches
What are ab initio methods?
Methods that make predictions from the sequence alone, based on typical gene features such as splice signals and sequence composition. Regions to look for include initial 5’ exons, internal exons, and final 3’ exons.
What are similarity-based methods?
- Predict genes using sequence information from related species, such as their whole genome sequences
- The similarity between the query sequence and known coding sequences is used to infer gene structures
- Allows predicted exon sequences located by ORF scanning to be tested for functionality
- Results are influenced by the availability of close homologues
Problems with similarity-based methods
- Genes are expressed differently in different tissues, so you can’t just sequence mRNA from one random tissue; you need to sequence many tissues under many different conditions
- Transcripts from the same gene might differ due to alternative splicing
ORF scanning in prokaryotes
- Search DNA for start and stop codons (may end up with false positives)
- Search for promoters
- Since most genes are longer than 50 codons, you apply a length cutoff (~100 codons) to remove false positives, but you also lose genuine short genes
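The scanning steps above can be sketched in Python. This is a minimal illustration, not a full gene finder: it assumes ATG as the only start codon, scans only the three forward reading frames (the reverse strand would need the reverse complement), and ignores promoters. The function name find_orfs and the parameter min_codons are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only the usual start codon is considered


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for each forward-strand ORF.

    start and end are 0-based and end-exclusive; the ORF includes its stop
    codon. min_codons counts codons from ATG up to, but not including,
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence one codon at a time in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG after this stop
    return orfs


# Toy sequence: ATG + 5 codons + TAA, offset by 2 bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
short_hits = find_orfs(toy, min_codons=5)
strict_hits = find_orfs(toy, min_codons=100)
```

With the relaxed cutoff the toy ORF is reported; with the ~100 codon cutoff it is discarded, which is exactly how short genes get lost.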
ORF scanning in eukaryotes
- Search DNA sequence for start and stop codons
- Search for promoters
- Search for intron/exon boundaries (spliced mRNA does not contain introns)
- Since most genes are longer than 50 codons, you apply a length cutoff (~100 codons) to remove false positives
Codon usage in genomes
- Codons are used unequally; this is a universal feature of genomes
- You can use this to differentiate coding and non-coding regions (e.g. humans use GTG 4x more than GTA for valine)
- Real exons are expected to show codon bias but introns should not
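A small Python sketch of how codon bias could be measured for one amino acid. The sequence below is a toy example, not real data, and the helper names codon_counts and relative_usage are hypothetical.

```python
from collections import Counter

VALINE_CODONS = ("GTT", "GTC", "GTA", "GTG")


def codon_counts(cds):
    """Count codons in frame 0, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_usage(counts, synonymous):
    """Fraction of each synonymous codon among all codons for that amino acid."""
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {c: 0.0 for c in synonymous}
    return {c: counts[c] / total for c in synonymous}


# Toy coding sequence (made up): four GTG codons and one GTA codon.
toy_cds = "ATG" + "GTG" * 4 + "GTA" + "TAA"
counts = codon_counts(toy_cds)
valine_usage = relative_usage(counts, VALINE_CODONS)
```

A strongly skewed usage table like this one is what you would expect in a real exon; an intron or intergenic region read in the same way should look closer to even.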
ORF scanning and moving windows
Only sequence information is used: coding exons are identified by integrating coding statistics. A window is moved along the sequence, the likelihood that each triplet lies in a coding region is calculated, and the score is plotted as a graph (above zero: likely coding; below zero: unlikely).
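One common way to get a score that is positive for coding-like sequence and negative otherwise is a log-odds ratio of codon frequencies, averaged over a moving window. The Python sketch below assumes that approach; the lecture may use a different statistic. The frequency tables are toy values, and log_odds_table, window_scores, window, step and pseudo are names invented for this example.

```python
import math


def log_odds_table(coding_freq, background_freq, pseudo=1e-6):
    """log(P(codon | coding) / P(codon | non-coding)) for each codon.

    pseudo is a small pseudocount that avoids division by zero.
    """
    codons = set(coding_freq) | set(background_freq)
    return {
        c: math.log((coding_freq.get(c, 0.0) + pseudo)
                    / (background_freq.get(c, 0.0) + pseudo))
        for c in codons
    }


def window_scores(seq, table, window=30, step=3):
    """Mean log-odds per codon for each window of `window` codons.

    Only reading frame 0 is scored when step is a multiple of 3.
    Codons missing from the table contribute a neutral score of 0.
    """
    seq = seq.upper()
    span = window * 3
    scores = []
    for start in range(0, len(seq) - span + 1, step):
        codons = [seq[i:i + 3] for i in range(start, start + span, 3)]
        total = sum(table.get(c, 0.0) for c in codons)
        scores.append((start, total / window))
    return scores


# Toy frequencies (made up): GTG is favoured in coding regions.
table = log_odds_table({"GTG": 0.8, "GTA": 0.2}, {"GTG": 0.5, "GTA": 0.5})
toy = "GTG" * 4 + "GTA" * 4
scores = window_scores(toy, table, window=2, step=3)
```

Plotting the second element of each pair against the first gives the graph described above: the GTG-rich start of the toy sequence scores above zero and the GTA-rich end scores below zero.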
ORF scanning: exon/intron boundaries
Exon-intron boundaries have distinctive sequence features.
- Upstream boundary (5’ end of the intron): invariant GT at the start of the intron, within a wider consensus sequence
- Downstream boundary (3’ end of the intron): T or C, any nucleotide, then CAG
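These boundary features can be searched for with simple patterns. The Python sketch below only looks for the minimal motifs listed above (GT upstream; T or C, any nucleotide, CAG downstream), so it will report many false positives on real DNA. The function names and the min_len parameter are assumptions made for this illustration.

```python
import re

# Zero-width lookaheads so that overlapping motifs are all reported.
DONOR = re.compile(r"(?=GT)")                # upstream boundary: intron starts with GT
ACCEPTOR = re.compile(r"(?=[CT][ACGT]CAG)")  # downstream: T or C, any base, CAG


def splice_site_candidates(seq):
    """Return (donors, acceptors) as 0-based positions.

    A donor position is the first base of a candidate intron (the G of GT).
    An acceptor position is the first base after the candidate intron
    (the base following the AG of CAG).
    """
    seq = seq.upper()
    donors = [m.start() for m in DONOR.finditer(seq)]
    acceptors = [m.start() + 5 for m in ACCEPTOR.finditer(seq)]
    return donors, acceptors


def candidate_introns(seq, min_len=20):
    """Pair every donor with every later acceptor at least min_len bases away."""
    donors, acceptors = splice_site_candidates(seq)
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# Toy sequence (made up) containing one intron-like stretch.
toy = "AAGGTAAGTCCCCTTTTTTACAGGA"
donors, acceptors = splice_site_candidates(toy)
introns = candidate_introns(toy, min_len=20)
```

Each (start, end) pair marks a stretch that begins with GT and ends with AG; in practice these candidates would be combined with coding statistics rather than trusted on their own.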