Chapter 6 Flashcards
- Explain gene prediction
- Motivated by the rapid accumulation of genomic seq information
- Computational approach to accurately predict gene structure
- Detecting the location of open reading frames (ORFs)
- Delineating the structures of introns and exons
What are the two major categories of gene prediction programs
2 major categories of gene prediction programs
- ab initio-based approaches
- predict genes based on the given seq alone
- Homology-based approaches
- predict genes based on significant matches of the query seq with seq of known genes
What is the goal of gene prediction?
- Goal: to describe all the genes computationally with near 100% accuracy
- More success in gene prediction algorithms for prokaryotic genomes than eukaryotes
- What is consensus-based?
• A program is consensus-based if it combines prediction results from multiple programs to derive a consensus prediction
- Explain the 2 major features of the ab initio-based approach
Relying on 2 major features associated with genes:
- Existence of gene signals:
- including start & stop codons,
- intron splice signals,
- transcription factor binding sites,
- ribosomal binding sites,
- Poly-A sites.
- Gene content,
- the statistical description of coding regions.
- What models can be used to detect these features in the ab initio-based approach?
- The features can be detected by either:
- Markov models
- hidden Markov models (HMMs)
- Compare the two models of ab-initio-based approach
• Markov model
- (aka Markov chain)
- describes seq of events that occur one after another in a chain.
- A Markov chain can be considered as a process that moves in one direction from one state to the next with transition probability.
- e.g. traffic lights
• HMM
- HMM combines 2 or more Markov chains with only one chain consisting of observed states and the other chains made up of unobserved (or “hidden”) states that influence the outcome of the observed states.
- e.g. gaps in a seq alignment do not correspond to any residues, so they are unobservable states, but they indirectly influence the transition probability of the observed states.
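The traffic-light example can be sketched as a small Markov chain in Python. The states and transition probabilities below are illustrative assumptions (a light that always cycles green, yellow, red), not values from the chapter.

```python
import random

# Hypothetical transition probabilities for a traffic light (each row sums to 1).
TRANSITIONS = {
    "green":  {"green": 0.0, "yellow": 1.0, "red": 0.0},
    "yellow": {"green": 0.0, "yellow": 0.0, "red": 1.0},
    "red":    {"green": 1.0, "yellow": 0.0, "red": 0.0},
}

def next_state(state, rng):
    """Draw the next state using only the current state's transition probabilities."""
    states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

def simulate(start, steps, seed=0):
    """Walk the chain for a number of steps, moving from one state to the next."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(steps):
        chain.append(next_state(chain[-1], rng))
    return chain

def chain_probability(path):
    """Probability of a path of states, conditional on its first state."""
    p = 1.0
    for current, following in zip(path, path[1:]):
        p *= TRANSITIONS[current][following]
    return p
```

With less extreme probabilities in `TRANSITIONS`, the same code produces random walks; every state here is observed, which is what separates a plain Markov chain from an HMM.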
- Explain the Parameters for the HMM for DNA sequence.
- Transition probability
- Emission probability
- Explain the Transition probability and emission probability of Partial HMM for DNA sequence.
- Transition probability:
- The probability going from one state to another
- Emission probability:
- The probability value associated with each symbol in each state
Example: to generate the sequence AG, multiply three values: emission probability of A in state one × transition probability from state one to state two × emission probability of G in state two
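The AG example can be worked through numerically. This is a minimal sketch: the two states, the emission values, and the transition value are made-up numbers chosen only to show the multiplication.

```python
# Hypothetical partial HMM for DNA: two states with illustrative probabilities.
EMISSION = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.1, "C": 0.2, "G": 0.5, "T": 0.2},
}
TRANSITION = {(1, 2): 0.8}  # probability of going from state 1 to state 2

def hmm_path_probability(seq, states):
    """Probability of emitting seq along a given state path.

    Multiplies the emission probability of each symbol in its state by the
    transition probability between consecutive states.
    """
    p = EMISSION[states[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= TRANSITION[(states[i - 1], states[i])]
        p *= EMISSION[states[i]][seq[i]]
    return p

# Emission of A in state 1 x transition 1->2 x emission of G in state 2
p_ag = hmm_path_probability("AG", [1, 2])  # 0.4 * 0.8 * 0.5, about 0.16
```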
- Explain the homology-based approach
- e.g. If a translated DNA seq is found to be similar to a known protein or protein family from a DB search, this can be strong evidence that the region codes for a protein.
- When possible exons of a genomic DNA region match a sequenced cDNA, this also provides experimental evidence for the existence of a coding region.
- Explain gene prediction in prokaryotes
• Small genome size: 0.5 – 10 Mbp, > 90% coding seq
- Start codon: ATG (AUG in mRNA)/GTG/TTG
- Problem: a frame may contain several candidate start codons, so the true translation start is ambiguous
- Solution: Shine-Dalgarno sequence
* Purine-rich seq complementary to 16S rRNA
* Immediately downstream of transcription initiation site
* Slightly upstream of the translation start codon
* Consensus motif: AGGAGGT
• Coding region end: one of 3 possible stop codons. Operon (transcription) end: ρ-independent terminator, a distinct stem-loop 2° structure followed by a string of Ts
- Explain the Shine-Dalgarno sequence.
- Shine-Dalgarno sequence
* Purine-rich seq complementary to 16S rRNA
* Immediately downstream of transcription initiation site
* Slightly upstream of the translation start codon
* Consensus motif: AGGAGGT
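One way to use the consensus motif is to look for a close match a short distance upstream of each candidate start codon. The sketch below does that by counting matching positions; the 20-nt window and the 6-of-7 match threshold are assumptions for illustration, not values from the chapter.

```python
SD_CONSENSUS = "AGGAGGT"
START_CODONS = {"ATG", "GTG", "TTG"}

def best_sd_match(dna, start_pos, window=20):
    """Best ungapped match to the consensus in the window upstream of start_pos.

    Returns (number of matching positions, position of the site), or (0, None).
    """
    upstream_start = max(0, start_pos - window)
    upstream = dna[upstream_start:start_pos]
    best = (0, None)
    for i in range(len(upstream) - len(SD_CONSENSUS) + 1):
        site = upstream[i:i + len(SD_CONSENSUS)]
        matches = sum(a == b for a, b in zip(site, SD_CONSENSUS))
        if matches > best[0]:
            best = (matches, upstream_start + i)
    return best

def starts_with_sd(dna, min_matches=6, window=20):
    """List (start codon position, SD site position, matches) for supported starts."""
    hits = []
    for pos in range(len(dna) - 2):
        if dna[pos:pos + 3] in START_CODONS:
            matches, site = best_sd_match(dna, pos, window)
            if matches >= min_matches:
                hits.append((pos, site, matches))
    return hits
```

Only the forward strand is scanned here; a fuller version would also check the reverse complement.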
- Explain conventional determination of open reading frames (ORFs)
- Conceptual translation in all 6 possible frames: 3 forward and 3 reverse
- A frame longer than 30 codons without interruption by stop codons suggests a gene coding region
- Putative frame is manually confirmed by the presence of start codon and Shine-Dalgarno seq.
- blastp for homologs
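The six-frame scan can be sketched as follows. The function only reports stop-free stretches longer than 30 codons; it does not check for a start codon or a Shine-Dalgarno seq, which the notes treat as a separate confirmation step. Names such as `open_frames` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def open_frames(dna, min_codons=30):
    """Yield (strand, frame, start, end, n_codons) for stop-free stretches.

    Scans 3 forward and 3 reverse frames. Coordinates are relative to the
    strand being scanned; end is the position just past the last codon.
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            run_start = frame
            for i in range(frame, len(seq) - 2, 3):
                if seq[i:i + 3] in STOP_CODONS:
                    n = (i - run_start) // 3
                    if n > min_codons:
                        yield (strand, frame, run_start, i, n)
                    run_start = i + 3  # next stretch begins after the stop
            # Stretch that runs to the end of the sequence without a stop
            last = frame + ((len(seq) - frame) // 3) * 3
            n = (last - run_start) // 3
            if n > min_codons:
                yield (strand, frame, run_start, last, n)
```

On a toy seq several frames can pass the length filter at once, which is why the candidate frame still needs to be confirmed by other evidence.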
- What does the Markov model describe?
• Markov model describes the probability of the distribution of nucleotides in a DNA seq, in which the conditional probability of a particular seq position depends on κ previous positions, where κ is the order of a Markov model.
Explain the different orders (degrees) of the Markov model
- 0° Markov model: assumes each base occurs independently with a given probability.
- 1° Markov model: the occurrence of a base depends on the base preceding it
- 2° Markov model: looks at the preceding 2 bases to determine which base follows, which is more characteristic of codons in a coding seq.
- … … … etc.
- Explain how gene prediction is possible with the Markov models and HMMs
- 0° Markov model: assumes each base occurs independently with a given probability.
- 1° Markov model: the occurrence of a base depends on the base preceding it
- 2° Markov model: looks at the preceding 2 bases to determine which base follows, which is more characteristic of codons in a coding seq.
- … … … etc.
- A fixed-order Markov chain describes the probability of a particular nucleotide given the previous κ nucleotides
- The higher the order of a Markov model, the more accurately it can predict a gene (provided there is enough training data)
- Statistics show that pairs of codons (or a.a. at the protein level) tend to correlate. So a 5th-order Markov model, which calculates the probability of hexamer bases, can detect nucleotide correlations found in coding regions more accurately and is in fact the one most often used.
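A κ-order Markov model can be sketched as a table of counts from training seqs. In the sketch below, comparing a coding model against a noncoding model by log-likelihood difference is an assumption about how the two are combined, the pseudocount is a hypothetical smoothing choice, and the training seqs in the usage note are toy data.

```python
import math
from collections import defaultdict

def train_markov(seqs, k, pseudocount=1.0):
    """Estimate P(base | previous k bases) from training sequences.

    Returns a function prob(context, base); k is the order of the model.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount  # 4 possible bases
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, k):
    """Sum of log conditional probabilities over positions k..end."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

def coding_score(seq, coding_prob, noncoding_prob, k):
    """Positive when the coding model explains seq better than the noncoding one."""
    return log_likelihood(seq, coding_prob, k) - log_likelihood(seq, noncoding_prob, k)
```

For example, training with `k=2` on the toy coding seq `"ATGGCTGCTGCTGCTGCTGCT"` and the toy noncoding seq `"ATATATATATATATATATAT"` gives a positive score for `"GCTGCTGCT"` and a negative one for `"ATATATATA"`. Setting `k=5` gives the hexamer-based model, which needs far more training data than these toy seqs.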
What are some examples of HMM-based gene finding programs for prokaryotes?
- GeneMark
- Glimmer
- What are the four basic measures of gene prediction accuracy at the nucleotide level?
Basic measures of gene prediction accuracy at the nucleotide level:
- True positive (TP) – a correctly predicted feature
- False positive (FP) – an incorrectly predicted feature
- False negative (FN) – a missed feature
- True negative (TN) – correctly predicted absence of a feature
- Explain sensitivity and specificity for performance evaluation
- Sensitivity = TP / (TP + FN)
- Specificity = TP / (TP + FP) (the convention in gene prediction; elsewhere this ratio is called precision)
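The four counts and the two ratios can be computed from a per-nucleotide comparison. In this sketch the annotation format is an assumption: each seq position is labelled `C` (coding) or `N` (noncoding) in both the actual and the predicted annotation.

```python
def nucleotide_counts(actual, predicted):
    """Count TP, FP, FN, TN over two equal-length label strings ('C' or 'N')."""
    if len(actual) != len(predicted):
        raise ValueError("annotations must have the same length")
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1  # coding nucleotide correctly predicted
        elif a == "N" and p == "C":
            fp += 1  # noncoding nucleotide predicted as coding
        elif a == "C" and p == "N":
            fn += 1  # coding nucleotide that was missed
        else:
            tn += 1  # noncoding nucleotide correctly left out
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    """Fraction of actual coding nucleotides that were predicted."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predicted coding nucleotides that are actually coding."""
    return tp / (tp + fp)

tp, fp, fn, tn = nucleotide_counts("NNCCCCCCNN", "NCCCCCNNNN")
sn = sensitivity(tp, fn)  # 4 / 6
sp = specificity(tp, fp)  # 4 / 5
```

The ratios are undefined when their denominators are zero (no coding nucleotides, or nothing predicted), which this sketch does not handle.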
- Explain Gene prediction in eukaryotes
• Much larger size: 10 Mbp – 670 Gbp (1 Gbp = 10^9 bp)
• Low gene density
e.g. human: ~3% of the genome codes for genes (~1 gene/100 Kbp)
• Intergenic space: rich in repetitive sequences and transposable elements
- 3 modifications of the primary transcript yield mature mRNA:
- 5’ cap: involving methylation at the initial residue of the RNA
- Splicing: removing introns and joining exons (involving a large RNA-protein complex, spliceosome)
- Polyadenylation: the addition of a stretch of As (~250) at the 3’ end of RNA [poly-A signal: CAATAAA(T/C), downstream of the coding region]
- What is the major issue for gene prediction in eukaryotes
- Main issue:
- Identification of exons, introns and splicing sites
- What is the solution to the problem when using gene prediction in eukaryotes
- Solutions:
- GT-AG rule (5’ GTAAGT…….NCAG 3’)
- Most vertebrates: Kozak sequence (CCGCCATGG)
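The GT-AG rule can be turned into a simple candidate search: pair every GT (donor) with a downstream AG (acceptor). This is only a sketch of the rule itself; the length bounds are hypothetical, and it does not score the longer splice-site consensus or the Kozak seq.

```python
def candidate_introns(dna, min_len=20, max_len=200):
    """Return (start, end) pairs where dna[start:end] begins with GT and ends with AG.

    min_len and max_len are illustrative bounds on intron length.
    """
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            end = a + 2  # position just past the AG
            if min_len <= end - d <= max_len:
                introns.append((d, end))
    return introns

# Toy gene: exon, then an intron starting GTAAGT and ending ACAG, then exon.
toy = "ATGCCC" + "GTAAGT" + "C" * 20 + "ACAG" + "CCCTAA"
found = candidate_introns(toy)
```

Even this toy seq gives more than one candidate, because GT and AG dinucleotides are common, so the rule on its own cannot settle the exon and intron boundaries.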
What are some gene prediction programs for eukaryotic genomes?
Ab initio-based programs
- GRAIL
- GENSCAN
- HMMgene
Homology-based programs
- GenomeScan
- TwinScan
Consensus-based programs
- GeneComber
- DIGIT