ab initio gene prediction Flashcards
1
Q
bacteria vs eukaryotes
A
- much easier for bacteria
- no introns (single coding region)
- smaller intergenic regions
- genes easier to find
- only 2-3% of a eukaryotic genome is genes
- look for largest ORFs
- accurate for low-GC genomes
- high GC → fewer stop codons (TAA, TAG, TGA are AT-rich)
- many long ORFs will occur by chance
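A minimal sketch of the "look for largest ORFs" idea, assuming forward-strand scanning only, ATG as the only start codon, and a hypothetical helper name `find_orfs`. A real gene finder would also scan the reverse complement and consider alternative starts.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand, longest first."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # keep the ORF only if it has enough codons before the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs, key=lambda o: o[1] - o[0], reverse=True)

print(find_orfs("CCATGAAATTTGGGTAACC"))
```

In a high-GC genome this naive scan returns many long ORFs that are not genes, which is why extra scoring is needed.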
2
Q
gene finding programs
A
- Artemis - widely used
- Prodigal
- doesn’t just look for ORFs
- log-likelihood information
- accuracy >90%
- performs well with high GC
3
Q
prodigal
A
- create training set for protein-coding regions
- look for G/C bias at each position of ORFs
- build model of predicted ORFs with positional bias
- dicodon bias also used
- penalise ORFs downstream of another larger ORF
- difference between 2 scores removed from smaller ORF score
- add length factor to each
- higher in genomes with lower GC
- iteration
- dynamic programming
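A small sketch of the "G/C bias at each position" step, assuming the ORF is already in frame. The function name `gc_by_codon_position` is hypothetical and this is not Prodigal's own code.

```python
def gc_by_codon_position(orf):
    """Return the G+C fraction at codon positions 1, 2 and 3 of an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]  # every third base, starting at this codon position
        gc = sum(b in "GC" for b in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions

print(gc_by_codon_position("GCAGCAGCA"))
```

Real coding regions tend to show a consistent skew across the three positions, which random ORFs lack.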
4
Q
log-likelihood
A
- dicodon bias in coding vs non-coding sequence (e.g. exons vs introns)
- store statistics and look at random chance of them appearing
- score as log-likelihood of signal to background for each potential gene
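A sketch of a signal-to-background log-likelihood score, assuming in-frame dicodon (hexamer) counts from a coding training set and a background set, with add-one pseudocounts. The names `dicodon_counts` and `log_likelihood_score` are illustrative, not taken from any tool.

```python
import math
from collections import Counter

def dicodon_counts(seqs):
    """Count in-frame dicodons (6-mers stepped by one codon) across sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts

def log_likelihood_score(orf, coding, background, pseudo=1.0):
    """Sum of log(P(dicodon | coding) / P(dicodon | background)) over the ORF."""
    n_dicodons = 4 ** 6  # number of possible hexamers, used for smoothing
    n_c = sum(coding.values())
    n_b = sum(background.values())
    score = 0.0
    for i in range(0, len(orf) - 5, 3):
        d = orf[i:i + 6]
        p_c = (coding[d] + pseudo) / (n_c + pseudo * n_dicodons)
        p_b = (background[d] + pseudo) / (n_b + pseudo * n_dicodons)
        score += math.log(p_c / p_b)
    return score

coding = dicodon_counts(["ATGGCTGCTGCTGCT"])
background = dicodon_counts(["ATATATATATATATA"])
print(log_likelihood_score("GCTGCTGCT", coding, background))
```

A positive score means the ORF looks more like the coding set than the background; a negative score means the opposite.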
5
Q
prodigal iteration
A
- on sequences with coding score above threshold
- store initiation site with highest score for each ORF
- examine starts for ATG/GTG/TTG frequency and RBS (Shine-Dalgarno) motifs
- rescore
- new set of starts with highest score in each ORF selected
- continue iterating and refining
6
Q
prodigal dynamic programming
A
- performed over all start-stop pairs
- score each gene on start and dicodon scores
- allow some overlap
- opposite strands particularly
- smaller overlap on same strand
- determine final gene prediction
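A sketch of the dynamic programming step as a chain search over candidate genes, assuming each candidate already has a combined start plus dicodon score. The overlap limits `max_same` and `max_opp` are illustrative values and `best_gene_chain` is a hypothetical name.

```python
def best_gene_chain(genes, max_same=60, max_opp=200):
    """genes: list of (start, end, strand, score), end exclusive. Returns (total_score, chain)."""
    if not genes:
        return 0.0, []
    genes = sorted(genes, key=lambda g: (g[1], g[0]))
    n = len(genes)
    best = [g[3] for g in genes]  # best chain score ending at each gene
    prev = [None] * n             # back-pointer to the previous gene in that chain
    for j in range(n):
        sj, ej, strand_j, score_j = genes[j]
        for i in range(j):
            si, ei, strand_i, _ = genes[i]
            if si >= sj:
                continue  # previous gene must start first (no nesting)
            overlap = max(0, ei - sj)
            # a larger overlap is tolerated between genes on opposite strands
            limit = max_same if strand_i == strand_j else max_opp
            if overlap <= limit and best[i] + score_j > best[j]:
                best[j] = best[i] + score_j
                prev[j] = i
    j = max(range(n), key=lambda k: best[k])
    total = best[j]
    chain = []
    while j is not None:  # walk the back-pointers to recover the chain
        chain.append(genes[j])
        j = prev[j]
    return total, chain[::-1]

genes = [(0, 300, "+", 10.0), (250, 600, "+", 8.0), (290, 700, "-", 9.0), (100, 280, "+", 3.0)]
print(best_gene_chain(genes))
```

Only consecutive genes in the chain are checked for overlap, which keeps the search quadratic in the number of candidates.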
7
Q
eukaryotic gene prediction
A
- introns and exons
- variable number and length
- multiple transcripts
- mechanism not fully understood
- look for enhancers etc.
- large intergenic regions
- need to identify functional sites and patterns around them
- PWM (position weight matrix)
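A sketch of a log-odds PWM built from aligned example sites and slid along a sequence, assuming a uniform 0.25 background and add-one pseudocounts. The donor-like sites and the names `build_pwm` and `best_match` are made up for illustration.

```python
import math

def build_pwm(sites, pseudo=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        total = len(column) + 4 * pseudo
        pwm.append({
            b: math.log2(((column.count(b) + pseudo) / total) / background)
            for b in "ACGT"
        })
    return pwm

def best_match(seq, pwm):
    """Return (score, position) of the highest-scoring window in seq (A/C/G/T only)."""
    width = len(pwm)
    scores = [
        (sum(pwm[k][seq[i + k]] for k in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scores)

sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]  # illustrative aligned sites
pwm = build_pwm(sites)
print(best_match("CCAGGTAAGTCC", pwm))
```

Windows scoring well above zero resemble the training sites more than the background, marking candidate functional sites.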