eukaryotic gene prediction Flashcards
1
Q
position weight matrices
A
- score every possible nt (A, C, G, T) at each position of the site
- common nt at that position → high score
- search sequence with PWM to find highest scoring sequence
- sum scores to get score for sequence as potential site
- above threshold indicates functional site
- MSA to create PWM
- any functional site
- species specific
- low specificity (multiple transcripts and mechanisms unknown)
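
A minimal sketch of the PWM idea on this card, assuming log-odds scores against a uniform background and a pseudocount of 1. The example sites, the threshold, and the function names (build_pwm, score_window, scan) are made up for illustration and do not come from any particular tool.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites (one dict per position)."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        scores = {}
        for nt in "ACGT":
            freq = (column.count(nt) + pseudocount) / total
            # Common nt at this position -> positive score; rare nt -> negative.
            scores[nt] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def score_window(pwm, window):
    """Sum the per-position scores to get the score of one candidate site."""
    return sum(col[nt] for col, nt in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Slide the PWM along the sequence and keep windows scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score_window(pwm, sequence[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits

# Hypothetical aligned sites (as if taken from an MSA of known sites).
sites = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
pwm = build_pwm(sites)
hits = scan(pwm, "GGCGTATAAAGGC", threshold=4.0)
print(hits)
```

With these toy sites, only the window starting at index 4 (TATAAA) clears the threshold. In practice the threshold is a tuning choice that trades sensitivity against the low specificity noted above.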
2
Q
intron splice sites
A
- PWM:
- almost always GU...AG (GU at the 5’ end of the intron, AG at the 3’ end)
- combine with surrounding patterns
- generally similar consensus in vertebrates
- polypyrimidine tracts
- upstream of 3’ end in higher eukaryotes
- yeast introns:
- invariant upstream sequence
3
Q
hidden markov models
A
- like PWM but considers previous and next base
- takes into account gaps
- looks at overall pattern
- at each position, probability of:
- insertion, deletion, match (transition)
- each base (output)
- move through each position and multiply probabilities
- pseudocounts
- idea that sequences can have same function but still vary
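
The "multiply probabilities" and "pseudocounts" points can be sketched as below. This is a toy profile-style model, not a full HMM implementation: the state names (M1, M2, D2), the transition values, and the alignment columns are assumptions chosen for illustration. Logs are summed instead of multiplying raw probabilities, which gives the same result while avoiding underflow on long sequences.

```python
import math

def emission_probs(column, pseudocount=1.0):
    """Estimate output probabilities for one match state from an alignment column.
    The pseudocount keeps unseen bases from getting probability zero."""
    total = len(column) + 4 * pseudocount
    return {nt: (column.count(nt) + pseudocount) / total for nt in "ACGT"}

def path_log_prob(path, transitions, emissions):
    """path is a list of (state, emitted_base or None for silent delete states)."""
    logp = 0.0
    prev = "start"
    for state, base in path:
        logp += math.log(transitions[prev][state])   # transition probability
        if base is not None:
            logp += math.log(emissions[state][base])  # output probability
        prev = state
    return logp

# Hypothetical two-position model: match states M1 and M2, delete state D2.
transitions = {
    "start": {"M1": 1.0},
    "M1": {"M2": 0.9, "D2": 0.1},
}
emissions = {
    "M1": emission_probs("AAAT"),
    "M2": emission_probs("GGGG"),
}

match_path = [("M1", "A"), ("M2", "G")]
delete_path = [("M1", "A"), ("D2", None)]
p_match = math.exp(path_log_prob(match_path, transitions, emissions))
p_delete = math.exp(path_log_prob(delete_path, transitions, emissions))
print(p_match, p_delete)
```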
4
Q
features predicted by HMM
A
- gene structure
- exon/intron lengths
- nt composition
- motifs
- start/stop codons
- splice sites
- patterns of conservation
5
Q
genscan
A
- uses known genes to create training sets
- HMM based
- species/taxonomic specific gene models
- search for unknown query sequences
- focus on GC content
- gene density and exon/intron length
- alter parameters depending on GC content
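
A minimal sketch of choosing a parameter set from GC content. The bin boundaries and set names below are illustrative assumptions, not values read from the GENSCAN software. The point is only that the GC fraction of the query decides which trained parameters are used.

```python
def gc_content(seq):
    """Fraction of G+C among unambiguous bases (N and other symbols are ignored)."""
    seq = seq.upper()
    counted = sum(seq.count(nt) for nt in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted

# Hypothetical (upper bound, parameter set name) pairs, checked in order.
PARAMETER_SETS = [
    (0.43, "low_gc"),
    (0.51, "mid_gc"),
    (0.57, "high_gc"),
    (1.01, "very_high_gc"),
]

def choose_parameter_set(seq):
    """Return the name of the first bin whose upper bound exceeds the GC fraction."""
    gc = gc_content(seq)
    for upper, name in PARAMETER_SETS:
        if gc < upper:
            return name

print(gc_content("ATATGC"), choose_parameter_set("ATATGC"))
print(gc_content("GGCCAT"), choose_parameter_set("GGCCAT"))
```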
6
Q
genscan HMM
A
- start in intergenic region (N state)
- then promoter (P)
- 5’ UTR (F)
- single exon gene (Esngl) or first exon of multi-exon gene (Einit)
- 3’ UTR (T)
- polyA tail (A)
- return to N
- forward and reverse strand
7
Q
genscan intron/exon states
A
- 3 intron states follow einit depending on frame
- 3 exon states follow intron states
- probability of moving to each state based on training data
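
The state order on this card and the previous one can be written down as a small graph. This is a sketch under assumptions: the names I0-I2, E0-E2 and Eterm, and the exact wiring between intron and exon states, are one reading of the cards, and no transition probabilities are given because those come from training data. It covers one strand only.

```python
# Allowed next states for each state (forward strand only).
ALLOWED = {
    "N": {"P"},                    # intergenic -> promoter
    "P": {"F"},                    # promoter -> 5' UTR
    "F": {"Esngl", "Einit"},       # single-exon gene or first exon
    "Esngl": {"T"},                # single exon -> 3' UTR
    "Einit": {"I0", "I1", "I2"},   # intron state depends on frame
    "Eterm": {"T"},                # terminal exon -> 3' UTR
    "T": {"A"},                    # 3' UTR -> polyA
    "A": {"N"},                    # polyA -> back to intergenic
}
for k in range(3):
    # Assumed wiring: intron Ik is followed by internal exon Ek or the terminal exon,
    # and an internal exon can be followed by any of the three intron states.
    ALLOWED[f"I{k}"] = {f"E{k}", "Eterm"}
    ALLOWED[f"E{k}"] = {"I0", "I1", "I2"}

def is_valid_path(path):
    """True if every consecutive pair of states is an allowed transition."""
    return all(nxt in ALLOWED.get(cur, set()) for cur, nxt in zip(path, path[1:]))

multi_exon = ["N", "P", "F", "Einit", "I1", "E1", "I0", "Eterm", "T", "A", "N"]
single_exon = ["N", "P", "F", "Esngl", "T", "A", "N"]
print(is_valid_path(multi_exon), is_valid_path(single_exon), is_valid_path(["N", "F"]))
```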
8
Q
exon size distribution
A
- normal distribution
- differs between initial, internal, terminal
- internal - steep size drop off after 300bp
- length distribution functions can be used
- introns have a minimum size and geometric distribution
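
A short sketch of the intron length model in the last bullet: a minimum size followed by a geometric tail. The minimum length (70) and the parameter p (0.1) are hypothetical values picked only to make the numbers easy to check.

```python
def intron_length_prob(length, min_length, p):
    """P(intron length) under a geometric distribution shifted by a minimum size."""
    if length < min_length:
        return 0.0
    return p * (1 - p) ** (length - min_length)

def mean_intron_length(min_length, p):
    """Expected length: the minimum plus the mean of the geometric part."""
    return min_length + (1 - p) / p

MIN_LEN, P = 70, 0.1  # hypothetical parameters
print(intron_length_prob(69, MIN_LEN, P))   # below the minimum
print(intron_length_prob(70, MIN_LEN, P))   # most likely length
print(mean_intron_length(MIN_LEN, P))
```

Each extra base multiplies the probability by the same factor (1 - p), which is why a geometric length falls out naturally from an HMM state with a self-transition.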
9
Q
MDD
A
- maximal dependence decomposition
- captures interdependencies of non-adjacent nucleotides
- splice sites
- use dependencies to search sequence and match to MDD tree
- move from position to next dependent position
- probabilities of each position
10
Q
weight array model
A
- weight matrix that captures interdependencies between adjacent positions
- only used for splice sites
- all other features have WMM
- all matrices combined for identification
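
To show what "captures interdependencies" means, here is a minimal first-order sketch in which each position is conditioned on the base before it. The toy sites and the pseudocount of 1 are assumptions for illustration. A plain weight matrix would treat the two positions independently and could not tell AG from AT in this example.

```python
import math

def build_wam(sites, pseudocount=1.0):
    """Return (first-position frequencies, list of conditional tables P(nt | previous nt))."""
    n = len(sites)
    first = {
        nt: (sum(s[0] == nt for s in sites) + pseudocount) / (n + 4 * pseudocount)
        for nt in "ACGT"
    }
    cond = []
    for pos in range(1, len(sites[0])):
        table = {}
        for prev in "ACGT":
            # Bases seen at this position in sites that have `prev` just before it.
            seen = [s[pos] for s in sites if s[pos - 1] == prev]
            total = len(seen) + 4 * pseudocount
            table[prev] = {nt: (seen.count(nt) + pseudocount) / total for nt in "ACGT"}
        cond.append(table)
    return first, cond

def wam_log_prob(model, seq):
    """Log probability of seq: first base, then each base given its predecessor."""
    first, cond = model
    logp = math.log(first[seq[0]])
    for pos in range(1, len(seq)):
        logp += math.log(cond[pos - 1][seq[pos - 1]][seq[pos]])
    return logp

# Toy sites: G always follows A, T always follows C.
model = build_wam(["AG", "AG", "CT", "CT"])
print(math.exp(wam_log_prob(model, "AG")), math.exp(wam_log_prob(model, "AT")))
```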
11
Q
genscan promoter identification
A
- 30% of promoters have no TATA box
- split the prediction model accordingly and weight each branch
- TATA:
- 0.7 probability
- 15bp TATA WMM and 8bp cap site WMM
- TATA-less:
- 0.3
- intergenic-null regions of 40bp
- genscan doesn’t require promoter identification for gene prediction
12
Q
homology gene prediction
A
- complements ab initio
- BLAST search against swissprot and EST data from similar species
- match to EST confirms exon
- >90% homology needed because of high error rates
- swissprot - additional confirmation
- pool to get composite picture
- genscan may find additional exons even if gene identified otherwise
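
The filtering and pooling steps can be sketched as follows. The hit records and their field names (start, end, identity) are hypothetical, standing in for whatever a BLAST search against EST data would return. No real BLAST output format or API is assumed.

```python
def confident_hits(hits, min_identity=90.0):
    """Keep only hits above the identity cutoff (the card's >90% rule)."""
    return [h for h in hits if h["identity"] > min_identity]

def merge_intervals(intervals):
    """Pool overlapping (start, end) intervals into a composite set of regions."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical EST hits on a query sequence.
est_hits = [
    {"start": 100, "end": 250, "identity": 98.0},
    {"start": 200, "end": 300, "identity": 95.5},
    {"start": 500, "end": 600, "identity": 82.0},
    {"start": 700, "end": 800, "identity": 99.1},
]
kept = confident_hits(est_hits)
supported_regions = merge_intervals([(h["start"], h["end"]) for h in kept])
print(supported_regions)
```

The low-identity hit at 500-600 is dropped, and the two overlapping hits are pooled into one supported region.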
13
Q
phylogenetic footprinting
A
- can be used for promoter prediction
- look at same gene in multiple species with common ancestry
- expect conserved upstream region (promoter)
- coexpression of genes in the same species (RNAseq)
- better if more distantly related
- greater mutation outside regulatory region
- conservation more prominent
- balance
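
A minimal sketch of looking for a conserved upstream block, assuming the upstream regions from each species have already been aligned to equal length. The three sequences, the window width, and the cutoff are invented for illustration.

```python
def column_conservation(alignment):
    """Per-column fraction of sequences sharing the most common base (gaps never count)."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = [column.count(nt) for nt in "ACGT"]
        scores.append(max(counts) / n)
    return scores

def conserved_windows(alignment, width, min_mean):
    """Start positions of windows whose mean conservation is at least min_mean."""
    scores = column_conservation(alignment)
    hits = []
    for start in range(len(scores) - width + 1):
        mean = sum(scores[start:start + width]) / width
        if mean >= min_mean:
            hits.append(start)
    return hits

# Hypothetical aligned upstream regions of the same gene in three species.
upstream = [
    "ACGTATAAAGC",
    "TTGTATAAACA",
    "GAATATAAATT",
]
footprints = conserved_windows(upstream, width=6, min_mean=0.99)
print(footprints)
```

Here the flanks have diverged while the central block is identical in all three sequences, so only the window starting at index 3 is reported. With very closely related species every window would pass, and with very distant ones the alignment itself becomes unreliable, which is the balance the card refers to.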
14
Q
repeats
A
- simple:
- microsatellites
- polypurine/pyrimidine tracts
- complex:
- LINEs/SINEs, LTRs, Alus
- exact sequence is polymorphic - difficult to identify
- ENCODE - unfinished
- assign function to all human genome elements