eukaryotic gene prediction Flashcards

Question 1

Q

position weight matrices

Answer

A

line up all possible nts at each position and score
- common nt at that position → high score
search sequence with PWM to find highest scoring sequence
- sum scores to get score for sequence as potential site
- above threshold indicates functional site
MSA to create PWM
any functional site
species specific
low specificity (multiple transcripts and mechanisms unknown)
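
A minimal sketch of the PWM steps above, assuming a toy set of aligned sites, a uniform background of 0.25 per nt, a pseudocount of 1 and an arbitrary threshold (none of these values come from the card). The names build_pwm, score and scan are hypothetical.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length example sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        # Common nt at this position -> high (positive) score.
        pwm.append({
            nt: math.log2(((column.count(nt) + pseudocount) / total) / background)
            for nt in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum the per-position scores for one candidate site."""
    return sum(column[nt] for column, nt in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Slide the PWM along the sequence and keep windows at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score(pwm, sequence[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits

# Toy aligned sites (illustrative only, not real training data).
sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
hits = scan(pwm, "GGCTATAAAGGC", threshold=5.0)
print(hits)
```

In practice the threshold and background frequencies would be chosen per species, which is one reason a PWM is species specific.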

Question 2

Q

intron splice sites

Answer

A

PWM:
- introns almost always begin with GU (5’ donor) and end with AG (3’ acceptor)
- combine with surrounding patterns
- generally similar consensus in vertebrates
polypyrimidine tracts
- upstream of the 3’ splice site in higher eukaryotes
yeast introns:
- invariant branch point sequence (UACUAAC) upstream of the 3’ splice site

Question 3

Q

hidden markov models

Answer

A

like PWM but considers previous and next base
takes into account gaps
looks at overall pattern
at each position, probability of:
- insertion, deletion, match (transition)
- each base (output)
move through each position and multiply probabilities
pseudocounts
idea that sequences can have same function but still vary
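
A rough sketch of "multiply probabilities along a path" with pseudocounts, assuming a two-position toy model with match (M) and delete (D) states only; insert states are left out for brevity. The state names, transition values and training columns are made up for illustration.

```python
import math

def emission_probs(column, pseudocount=1.0, alphabet="ACGT"):
    """Estimate output probabilities for one match state.

    The pseudocount keeps unseen bases from getting probability zero.
    """
    total = len(column) + pseudocount * len(alphabet)
    return {nt: (column.count(nt) + pseudocount) / total for nt in alphabet}

def path_log_prob(path, transitions, emissions, start="begin"):
    """Log probability of one path: sum of log(transition) + log(output).

    path is a list of (state, base); base is None for a delete state.
    """
    logp = 0.0
    prev = start
    for state, base in path:
        logp += math.log(transitions[prev][state])
        if base is not None:
            logp += math.log(emissions[state][base])
        prev = state
    return logp

# Toy alignment columns (hypothetical): position 1 is mostly A, position 2 is all G.
emissions = {"M1": emission_probs("AAAT"), "M2": emission_probs("GGGG")}
transitions = {
    "begin": {"M1": 0.9, "D1": 0.1},
    "M1": {"M2": 0.9, "D2": 0.1},
    "D1": {"M2": 0.9, "D2": 0.1},
}

match_path = [("M1", "A"), ("M2", "G")]
deletion_path = [("M1", "A"), ("D2", None)]
print(math.exp(path_log_prob(match_path, transitions, emissions)))
print(math.exp(path_log_prob(deletion_path, transitions, emissions)))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.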

Question 4

Q

features predicted by HMM

Answer

A

gene structure
exon/intron lengths
nt composition
motifs
start/stop codons
splice sites
patterns of conservation

Question 5

Q

genscan

Answer

A

uses known genes to create training sets
- HMM based
species/taxonomic specific gene models
- search for unknown query sequences
focus on GC content
- gene density and exon/intron length
- alter parameters depending on GC content

Question 6

Q

genscan HMM

Answer

A

start in intergenic region (N state)
then promoter (P)
5’ UTR (F)
single exon gene (Esngl) or first exon of multi-exon gene (Einit)
3’ UTR (T)
polyA signal (A)
return to N
forward and reverse strand
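
The state order above can be sketched as a table of allowed transitions. This is only an illustration for one strand: the dictionary layout and is_valid_path are hypothetical, and the states that follow Einit (introns and further exons) are deliberately left out.

```python
# Allowed moves between states, using the state letters from this card.
# Successors of Einit are omitted here on purpose.
TRANSITIONS = {
    "N": ["P"],               # intergenic -> promoter
    "P": ["F"],               # promoter -> 5' UTR
    "F": ["Esngl", "Einit"],  # 5' UTR -> single exon gene or first exon
    "Esngl": ["T"],           # single exon gene -> 3' UTR
    "T": ["A"],               # 3' UTR -> A state
    "A": ["N"],               # A state -> back to intergenic
}

def is_valid_path(path, transitions=TRANSITIONS):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(b in transitions.get(a, []) for a, b in zip(path, path[1:]))

single_exon_gene = ["N", "P", "F", "Esngl", "T", "A", "N"]
print(is_valid_path(single_exon_gene))
print(is_valid_path(["N", "F"]))
```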

Question 7

Q

genscan intron/exon states

Answer

A

3 intron states follow Einit depending on frame
3 exon states follow intron states
probability of moving to each state based on training data

Question 8

Q

exon size distribution

Answer

A

normal distribution
- differs between initial, internal, terminal
internal - steep size drop off after 300bp
length distribution functions can be used
introns have a minimum size and geometric distribution
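
A small sketch of the intron length model described above: a geometric distribution shifted by a minimum size. The function name and the example values (minimum 60 bp, mean 100 bp beyond the minimum) are assumptions chosen for illustration, not trained parameters.

```python
def intron_length_prob(length, min_length, mean_extra):
    """P(intron has this length) under a geometric model with a minimum size.

    mean_extra is the assumed mean number of bases beyond min_length.
    """
    if length < min_length:
        return 0.0
    p = 1.0 / (mean_extra + 1.0)  # per-base probability of ending the intron
    return p * (1.0 - p) ** (length - min_length)

# Hypothetical parameters: 60 bp minimum, 100 bp mean beyond the minimum.
for n in (50, 60, 160, 1000):
    print(n, intron_length_prob(n, min_length=60, mean_extra=100))
```

A geometric length falls out of a plain HMM state with a self-transition, which is why exons, whose sizes do not decay this way, need explicit length distribution functions.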

Question 9

Q

MDD

Answer

A

maximal dependence decomposition
captures interdependencies of non-adjacent nucleotides
- splice sites
use dependencies to search sequence and match to MDD tree
- move from position to next dependent position
- probabilities of each position

Question 10

Q

weight array model

Answer

A

weight matrix that captures interdependencies between adjacent positions
only used for splice sites
- all other features have WMM
- all matrices combined for identification
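
A minimal sketch of a weight array model, assuming it is implemented as a first-order model in which each base is scored given the base before it. The training sites, pseudocount and function names are hypothetical.

```python
import math
from collections import Counter

def train_wam(sites, pseudocount=1.0, alphabet="ACGT"):
    """Estimate P(first base) and P(base at i | base at i-1) for each position."""
    first = Counter(site[0] for site in sites)
    total = len(sites) + pseudocount * len(alphabet)
    start = {nt: (first[nt] + pseudocount) / total for nt in alphabet}

    conditionals = []
    for pos in range(1, len(sites[0])):
        pairs = Counter((site[pos - 1], site[pos]) for site in sites)
        table = {}
        for prev in alphabet:
            row_total = sum(pairs[(prev, nt)] for nt in alphabet)
            row_total += pseudocount * len(alphabet)
            table[prev] = {
                nt: (pairs[(prev, nt)] + pseudocount) / row_total
                for nt in alphabet
            }
        conditionals.append(table)
    return start, conditionals

def wam_log_prob(site, start, conditionals):
    """Log probability of a site, conditioning each base on its predecessor."""
    logp = math.log(start[site[0]])
    for pos in range(1, len(site)):
        logp += math.log(conditionals[pos - 1][site[pos - 1]][site[pos]])
    return logp

# Toy aligned sites (illustrative only).
training_sites = ["AGGT", "AGGT", "CGGT", "AGAT"]
start, conditionals = train_wam(training_sites)
print(wam_log_prob("AGGT", start, conditionals))
print(wam_log_prob("TCCA", start, conditionals))
```

A plain WMM would instead score each position independently, ignoring the preceding base.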

Question 11

Q

genscan promoter identification

Answer

A

30% of promoters have no TATA box
- split prediction model according to this and use weight matrix models
TATA:
- 0.7 probability
- 15bp TATA WMM and 8bp cap site WMM
TATA-less:
- 0.3
- intergenic-null regions of 40bp
genscan doesn’t require promoter identification for gene prediction

Question 12

Q

homology gene prediction

Answer

A

complements ab initio
BLAST search against swissprot and EST data from similar species
- match to EST confirms exon
  - >90% identity needed, as ESTs have high error rates
- swissprot - additional confirmation
pool to get composite picture
- genscan may find additional exons even if gene identified otherwise
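
A sketch of the >90% cut-off, assuming the EST hit and the predicted exon are already aligned (for example from a BLAST alignment) with "-" marking gaps. Both function names are hypothetical, and gapped columns are simply skipped, which is one of several possible conventions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped aligned columns where the two sequences agree."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def confirms_exon(aligned_exon, aligned_est, threshold=90.0):
    """Treat the EST as confirming the exon only above the identity threshold."""
    return percent_identity(aligned_exon, aligned_est) > threshold

print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
print(confirms_exon("ACGTACGTAC", "ACGTACGTAC"))
```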

Question 13

Q

phylogenetic footprinting

Answer

A

can be used for promoter prediction
look at same gene in multiple species with common ancestry
- expect conserved upstream region (promoter)
- coexpression of genes in the same species (RNAseq)
better if more distantly related
- greater mutation outside regulatory region
- conservation more prominent
- balance: species must not be so distant that the regulatory region itself has diverged
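
A toy sketch of looking for a conserved upstream block, assuming the upstream regions from each species are already aligned to equal length. The sequences, window width and cut-off are made up; column_identity and conserved_windows are hypothetical names.

```python
def column_identity(alignment):
    """Per column, the fraction of sequences sharing the most common base."""
    scores = []
    for column in zip(*alignment):
        bases = [b for b in column if b != "-"]
        if not bases:
            scores.append(0.0)
            continue
        most_common = max(bases.count(b) for b in set(bases))
        scores.append(most_common / len(column))
    return scores

def conserved_windows(alignment, width, min_mean):
    """Start positions of windows whose mean column identity meets the cut-off."""
    scores = column_identity(alignment)
    hits = []
    for start in range(len(scores) - width + 1):
        mean = sum(scores[start:start + width]) / width
        if mean >= min_mean:
            hits.append(start)
    return hits

# Hypothetical aligned upstream regions from three species.
upstream = [
    "ACGTTATAAAGCT",
    "TGCATATAAACAG",
    "CATGTATAAATGC",
]
print(conserved_windows(upstream, width=6, min_mean=1.0))
```

With closely related species almost every window would pass, which is the reason more distant comparisons make the conserved block stand out.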

Question 14

Q

repeats

Answer

A

simple:
- microsatellites
- polypurine/pyrimidine tracts
complex:
- LINEs/SINEs, LTRs, Alus
exact sequence is polymorphic - difficult to identify
ENCODE - unfinished
- assign function to all human genome elements

(14 cards)