Genome annotation Flashcards
What are the objectives of genome annotation?
- Identify all of the possible genes and features within the genome
- Assign functions to as many genes as possible - This should include variable splice sites that may mean the gene has multiple products
- Identify and describe regulatory features and their functions
- Identify all other features, including repeats
- Enable comparative studies between genomes
What are some issues you can face during genome annotation?
Annotating genomes is a computationally intensive process, both in terms of processing and data storage
What are some characteristics of genomes that might lead to errors during the annotation process?
- Genomes are extremely large
- Virtually all features are context-specific
- Majority of the sequence inside a genome may not correspond to any known
- introns
- completely putative coding regions
- pseudogenes
What is annotation by simple metrics?
this is the simplest level of annotation and provides general information about the genome
What are some things you can annotate with simple metrics annotation?
- base confidence scores
- rolling window G+C%
- Di- and trinucleotide composition bias: how often does one nucleotide follow another, and does that frequency match what a random distribution would give? Any pattern that departs from chance suggests selective pressure on the sequence, so it likely has a function
- codon use
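A minimal sketch of the dinucleotide bias idea, assuming a plain uppercase DNA string as input. The function name `dinucleotide_bias` is hypothetical. It compares how often each pair of bases is observed with how often it would be expected if the two bases were independent; ratios far from 1.0 point to a non-random pattern.

```python
from collections import Counter

def dinucleotide_bias(seq):
    """Return {dinucleotide: observed / expected frequency} for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    # Single-base frequencies give the expectation under independence.
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    # Count every overlapping pair of adjacent bases.
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    bias = {}
    for pair, count in pairs.items():
        observed = count / total_pairs
        expected = base_freq[pair[0]] * base_freq[pair[1]]
        bias[pair] = observed / expected
    return bias

if __name__ == "__main__":
    for pair, ratio in sorted(dinucleotide_bias("CGCGCGCGATAT").items()):
        print(pair, round(ratio, 2))
```

Only pairs that actually occur are reported, so a missing key means the dinucleotide was never observed.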
What is the average GC content in the genome and in genes (in humans)?
- approx 38% - but that's an average and can vary significantly throughout the genome
- In genes the GC is approx 45-50% and more uniformly distributed than in the genome.
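Because GC content varies along the genome, it is usually reported per window rather than as one number. A rough sketch of a rolling-window G+C% calculation follows; the function name `rolling_gc` and the default window and step sizes are illustrative assumptions, not values from the notes.

```python
def rolling_gc(seq, window=100, step=50):
    """Return a list of (window_start, GC percent) along a DNA string."""
    seq = seq.upper()
    results = []
    # Slide a fixed-size window; trailing bases that do not fill a window are skipped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc_fraction = (chunk.count("G") + chunk.count("C")) / window
        results.append((start, 100.0 * gc_fraction))
    return results

if __name__ == "__main__":
    for start, gc in rolling_gc("GGGGCCCCAAAATTTT", window=8, step=4):
        print(start, gc)
```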
What is the relationship between the GC content and gene density?
Regions of high GC content (62-68%) have higher relative gene density than regions of lower GC content.
What is the relationship between exon and intron length and the GC content?
- Exon length is relatively uniform with respect to GC content
- Intron length decreases dramatically in regions of high GC content:
- If GC content is around 30%, average intron length is 2300 bases
- If the GC content is around 65%, average intron length is 300 bases
- So GC content is a marker for where the genes are likely to be
What are some other characteristics we can measure apart from the GC content?
- Simplest is dinucleotide composition/bias (AA, AC, AG, AT, CA, CC etc.) - %AT vs. %GC
- Dicodon counts can be more informative (frequency of occurrence of successive codon pairs)
- Measure of 3rd base periodicity (tendency of same nucleotide to be found at particular distances e.g. n, n+2, n+5, n+7)
- Length and occurrence of Open Reading Frames (ORFs) - stretches between stop codons
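As an illustration of the ORF metric above, here is a small sketch that lists stop-free stretches in the three forward reading frames. It assumes the standard stop codons TAA, TAG and TGA, ignores the reverse strand, and does not require a start codon; `find_orfs` and `min_len` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Return (frame, start, end) for stop-free stretches of at least min_len bases.

    Coordinates are 0-based and `end` is exclusive. Forward strand only.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - orf_start >= min_len:
                    orfs.append((frame, orf_start, i))
                orf_start = i + 3  # the next stretch begins after the stop
        # A stretch may also run to the end of the sequence without a stop.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - orf_start >= min_len:
            orfs.append((frame, orf_start, end))
    return orfs

if __name__ == "__main__":
    print(find_orfs("ATGAAACCCTAAGGG", min_len=9))
```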
What is an ab initio prediction?
prediction just from the sequence
Why is gene prediction easier in bacteria than in eukaryotes?
- No introns
- Smaller intergenic regions
How do bacterial gene finders work and is it a reasonable assumption?
- Bacterial gene finders often just look for the largest open reading frames (ORFs), those above a certain size (approx. 300bp), and consider them to be real genes
- This is a reasonable assumption in genomes with a low GC content
- It is a problem for those with a high GC content
Why is looking at the length of ORFs not enough in genomes with a high GC content?
- With a lack of A and T in these genomes there are far fewer stop codons
- Long ORFs occur simply by chance in high GC genomes, many of which are not genuine genes
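A back-of-the-envelope sketch of why this happens. It assumes a deliberately simple model that is not in the notes: bases are independent, with P(A) = P(T) and P(G) = P(C) set by the GC content. The three stop codons (TAA, TAG, TGA) are AT-rich, so their chance of occurring drops as GC rises. All function names are hypothetical.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA at a given GC fraction."""
    p_at = (1.0 - gc) / 2.0  # P(A) = P(T)
    p_gc = gc / 2.0          # P(G) = P(C)
    # TAA uses three A/T bases; TAG and TGA each use two A/T bases and one G.
    return p_at ** 3 + 2 * p_at ** 2 * p_gc

def expected_orf_codons(gc):
    """Mean number of codons until a stop appears by chance (geometric model)."""
    return 1.0 / stop_codon_probability(gc)

def chance_of_long_orf(gc, codons=100):
    """Chance that a random stretch of `codons` codons contains no stop."""
    return (1.0 - stop_codon_probability(gc)) ** codons

if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        print(gc, round(expected_orf_codons(gc), 1), chance_of_long_orf(gc))
```

Under this model the chance of a stop-free stretch of 100 codons (roughly the 300bp cut-off) rises steeply with GC content, which is why ORF length alone stops being a reliable filter.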
What is Prodigal?
- Prodigal uses additional log-likelihood information to predict genes and not just ORF length
- A relatively simple system that follows principles of KISS (Keep It Simple, Stupid)
- High accuracy (>90%)
- Good performance in high GC genomes: over a 90% perfect match (5’+3’) to the curated Pseudomonas aeruginosa annotations
What are the steps in Prodigal?
- Constructing a training set for protein coding
- Building log-likelihood coding statistics from the training data
- Sharpening coding scores
- Length factor to coding
- Iterative start training
- Final dynamic programming (trying every possible combination)
Describe 1. Constructing a training set for protein coding (Prodigal)
- Prodigal examines all ORFs and looks for a bias for G or C in the 1st, 2nd, and 3rd positions of each codon
- It then builds gene models across the whole genome using this frame plot bias.
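The per-position GC bias can be sketched as below. This only shows the measurement for a single ORF; how Prodigal turns it into gene models is not shown here, and `gc_by_codon_position` is a hypothetical name.

```python
def gc_by_codon_position(orf):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    fractions = []
    for position in range(3):
        bases = orf[position:usable:3]
        gc_count = sum(1 for base in bases if base in "GC")
        fractions.append(gc_count / len(bases) if bases else 0.0)
    return tuple(fractions)

if __name__ == "__main__":
    print(gc_by_codon_position("ATGGCCGCGTAA"))
```

A real coding frame tends to show a consistent difference between the three positions, while an out-of-frame or non-coding stretch tends to look flatter.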
Describe 2. Building log-likelihood coding statistics from the training data
- Coding sequence shows a preference for particular 6-nucleotide sequences (dicodons) compared to non-coding sequence. Prodigal stores dicodon statistics for all the genes in its initial model. Each dicodon is scored as a log-likelihood of signal to background (how often you find it in genes vs. how often you would find it if it were distributed by chance).
- we are looking for biases - if you find a bias then you should expect to see the same bias in all open reading frames
- Every potential gene in the genome (all possible starts and stops) is scored.
- These 2 steps are all Prodigal uses in the way of coding scores.
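A simplified sketch of dicodon log-likelihood scoring. It is not Prodigal's actual code: the background here is every hexamer in the genome, a pseudocount is added to avoid dividing by zero, and the names `train_dicodon_scores` and `score_orf` are hypothetical.

```python
import math
from collections import Counter

def train_dicodon_scores(coding_orfs, genome, pseudocount=1.0):
    """Build a scoring function: log(P(hexamer | coding) / P(hexamer | background))."""
    coding = Counter()
    for orf in coding_orfs:
        # In-frame hexamers (dicodons): step by one codon.
        coding.update(orf[i:i + 6] for i in range(0, len(orf) - 5, 3))
    # Background: every hexamer in the genome, in any frame.
    background = Counter(genome[i:i + 6] for i in range(len(genome) - 5))
    n_hexamers = 4 ** 6
    coding_total = sum(coding.values()) + pseudocount * n_hexamers
    background_total = sum(background.values()) + pseudocount * n_hexamers

    def score(hexamer):
        p_coding = (coding[hexamer] + pseudocount) / coding_total
        p_background = (background[hexamer] + pseudocount) / background_total
        return math.log(p_coding / p_background)

    return score

def score_orf(orf, score):
    """Sum the dicodon log-likelihoods along a candidate gene."""
    return sum(score(orf[i:i + 6]) for i in range(0, len(orf) - 5, 3))

if __name__ == "__main__":
    scorer = train_dicodon_scores(["ATGAAA" * 10], "ACGT" * 50)
    print(score_orf("ATGAAAATGAAA", scorer))
```

A positive total means the candidate uses dicodons that are more common in the training genes than in the background.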
Describe sharpening coding scores
- In the next step it penalizes all potential start candidates that lie downstream from a higher-scoring start. A penalty is assigned representing the bypassing of a good coding region.
- longer ORFs have more credibility
- Example: gene 3701-4000 has a score of 100 and gene 3763-4000 has a score of 75. The score of gene 3763-4000 is adjusted to be 75 minus the coding not selected (100 - 75 = 25), therefore 50.
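The arithmetic in the example can be sketched as follows. The penalty rule used here (best upstream score minus the candidate's own score) is an assumption generalised from that single example, and `sharpen_start_scores` is a hypothetical name.

```python
def sharpen_start_scores(candidates):
    """Penalise starts that lie downstream of a higher-scoring start.

    candidates: list of (start_position, coding_score) sharing one stop codon
    on the forward strand. Returns {start_position: adjusted_score}.
    """
    adjusted = {}
    best_upstream = None
    for start, score in sorted(candidates):  # upstream starts first
        if best_upstream is not None and best_upstream > score:
            # Penalty = the coding score bypassed by not using the better start.
            adjusted[start] = score - (best_upstream - score)
        else:
            adjusted[start] = score
        if best_upstream is None or score > best_upstream:
            best_upstream = score
    return adjusted

if __name__ == "__main__":
    print(sharpen_start_scores([(3701, 100), (3763, 75)]))
```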
Describe Length factor to coding
A length factor is added to the coding score. This factor is higher in high GC genomes, and lower in low GC genomes.
Describe iterative start training
- For every open reading frame containing a gene with a coding score above a certain threshold, the translation initiation site with the highest coding score is recorded
- These starts are examined for ATG/GTG/TTG frequency and ribosomal binding site (RBS / Shine-Dalgarno) motifs.
- The starts are then rescored based on these discoveries, and the new set of starts with the highest score in each ORF is selected
- Shine-Dalgarno: ~10 nucleotides upstream from the start codon (consensus TAAGGAG); corresponds to a pentamer in 16S rRNA near the 3’ end
- The start trainer iterates until the set of “best starts” no longer changes (usually only a few iterations).
- This final set of “best starts” is used as the training set for start scoring
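A heavily simplified sketch of the iterate-until-stable loop. It only learns start codon (ATG/GTG/TTG) weights and leaves out the RBS motif and the coding-score threshold; the weighting formula, the data layout and the name `train_starts` are all assumptions made for illustration.

```python
import math
from collections import Counter

START_CODONS = ("ATG", "GTG", "TTG")

def train_starts(orfs, max_iter=10):
    """Pick one start per ORF, re-estimating start codon weights until stable.

    orfs: list of ORFs, each a list of candidate starts given as
    {"codon": "ATG", "coding": float}. Returns (best_indices, weights).
    """
    # Begin with the start that has the highest coding score in each ORF.
    best = [max(range(len(c)), key=lambda i: c[i]["coding"]) for c in orfs]
    weights = {codon: 0.0 for codon in START_CODONS}
    for _ in range(max_iter):
        counts = Counter(c[b]["codon"] for c, b in zip(orfs, best))
        total = sum(counts.values())
        # Log-odds of each start codon vs. a uniform expectation (with pseudocounts).
        weights = {
            codon: math.log(((counts[codon] + 1) / (total + 3)) / (1 / 3))
            for codon in START_CODONS
        }
        new_best = [
            max(range(len(c)), key=lambda i: c[i]["coding"] + weights[c[i]["codon"]])
            for c in orfs
        ]
        if new_best == best:  # the set of best starts no longer changes
            break
        best = new_best
    return best, weights

if __name__ == "__main__":
    example = [
        [{"codon": "ATG", "coding": 10.0}],
        [{"codon": "ATG", "coding": 8.0}],
        [{"codon": "TTG", "coding": 5.1}, {"codon": "ATG", "coding": 5.0}],
    ]
    print(train_starts(example))
```

In the example, the last ORF initially prefers its TTG start on coding score alone, then switches to ATG once the learned codon weights are added.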
Describe final dynamic programming (trying every possible combination)
- Final dynamic programming is performed over all start-stop pairs in the genome
- Each potential gene’s score is the sum of its start score and its coding score
- Small overlap allowed between two genes on the same strand, and a greater amount of overlap is allowed for 3’ ends of two genes on opposite strands
- Final predictions determined
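A toy sketch of the dynamic programming idea: choose the chain of candidate genes with the highest total score while allowing only a small overlap between neighbours. It treats all genes as being on one strand, uses inclusive coordinates, and takes each score to be the start score plus the coding score. The name `best_gene_chain` and the `max_overlap` parameter are hypothetical.

```python
def best_gene_chain(genes, max_overlap=0):
    """Return (total_score, chain) for the best-scoring compatible set of genes.

    genes: list of (start, end, score) with inclusive coordinates.
    """
    genes = sorted(genes, key=lambda g: g[1])  # order by end position
    best = []  # best[i] = (best total for a chain ending at gene i, previous index)
    for i, (start, end, score) in enumerate(genes):
        best_total, previous = score, None
        for j in range(i):
            overlap = genes[j][1] - start + 1  # zero or negative means no overlap
            if overlap <= max_overlap and genes[j][0] < start:
                candidate = best[j][0] + score
                if candidate > best_total:
                    best_total, previous = candidate, j
        best.append((best_total, previous))
    if not best:
        return 0.0, []
    # Trace back from the highest-scoring chain end.
    i = max(range(len(best)), key=lambda k: best[k][0])
    total = best[i][0]
    chain = []
    while i is not None:
        chain.append(genes[i])
        i = best[i][1]
    return total, chain[::-1]

if __name__ == "__main__":
    candidates = [(1, 300, 50.0), (250, 600, 40.0), (298, 700, 30.0), (650, 900, 20.0)]
    print(best_gene_chain(candidates, max_overlap=3))
```

Reusing the best chain ending at each earlier gene is what keeps this tractable compared with literally enumerating every combination.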
What genome does Prodigal use for training?
It uses only the genome it is currently analysing for training, not external data!