chapter 6 Flashcards

Question 1

Q

Explain gene prediction

Answer

A

Genomic seq information is accumulating rapidly, so computational methods are needed to
accurately predict gene structure
This involves detecting the location of open reading frames (ORFs)
and delineating the structures of introns and exons

Question 2

Q

What are the two major categories of gene prediction programs?

Answer

A

2 major categories of gene prediction programs

ab initio-based approaches
- predict genes based on the given seq alone
Homology-based approaches
- predict genes based on significant matches of the query seq with seq of known genes

Question 3

Q

What is the goal of gene prediction?

Answer

A

Goal: to describe all the genes computationally with near 100% accuracy
So far, gene prediction algorithms have been more successful for prokaryotic genomes than for eukaryotic genomes

Question 4

Q

What is consensus-based?

Answer

A

• A program is consensus-based if it combines prediction results from multiple programs to derive a consensus prediction
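
One simple way to picture a consensus prediction is a per-position majority vote. This is only a sketch: the card does not say how any real consensus-based program combines its inputs, and the three "program outputs" below are hypothetical label strings (C = coding, N = non-coding).

```python
def consensus_prediction(predictions):
    """Per-position majority vote over equal-length 'C'/'N' label strings."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same positions")
    voted = []
    for column in zip(*predictions):
        # Call a position coding only if more than half of the programs agree.
        voted.append("C" if column.count("C") * 2 > len(column) else "N")
    return "".join(voted)


# Hypothetical outputs of three programs for the same 8-nt region.
program_outputs = ["NNCCCCNN", "NCCCCNNN", "NNCCCCCN"]
consensus = consensus_prediction(program_outputs)
print(consensus)  # NNCCCCNN
```

A majority vote was chosen here because it is the easiest combination rule to follow; weighted or exon-level schemes are equally plausible.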

Question 5

Q

Explain the 2 major features of the ab initio-based approach

Answer

A

Relying on 2 major features associated with genes:

Existence of gene signals:
- including start & stop codons,
- intron splice signals,
- transcription factor binding sites,
- ribosomal binding sites,
- Poly-A sites.
Gene content,
- the statistical description of coding regions.

Question 6

Q

What models can be used to detect the features in the ab initio-based approach?

Answer

A

The features can be detected by either:
Markov models
hidden Markov models (HMMs)

Question 7

Q

Compare the two models used in the ab initio-based approach

Answer

A

• Markov model

(aka Markov chain)
describes a seq of events that occur one after another in a chain.
A Markov chain can be considered as a process that moves in one direction from one state to the next with transition probability.
e.g. traffic lights

• HMM

HMM combines 2 or more Markov chains with only one chain consisting of observed states and the other chains made up of unobserved (or “hidden”) states that influence the outcome of the observed states.
e.g. gaps don’t correspond to any residues, so they are unobservable states, but they indirectly influence the transition probability of the observed states.
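
The traffic-light example of a Markov chain can be sketched as a transition table plus a walk through it. The probabilities here are assumptions chosen for illustration (each light always moves to the next one), and `simulate` is a hypothetical helper name.

```python
import random

# Transition probabilities between states (assumed values: a fixed cycle).
transition = {
    "green": {"yellow": 1.0},
    "yellow": {"red": 1.0},
    "red": {"green": 1.0},
}


def simulate(start, steps, rng=None):
    """Walk the chain for `steps` transitions, returning the visited states."""
    rng = rng or random.Random()
    state, chain = start, [start]
    for _ in range(steps):
        nexts = transition[state]
        # Draw the next state according to its transition probability.
        state = rng.choices(list(nexts), weights=list(nexts.values()))[0]
        chain.append(state)
    return chain


lights = simulate("green", 4)
print(lights)  # ['green', 'yellow', 'red', 'green', 'yellow']
```

Every state here is directly observed. In an HMM, an additional chain of states would sit behind the observed one and never be seen directly.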

Question 8

Q

Explain the Parameters for the HMM for DNA sequence.

Answer

A

Transition probability
Emission probability

Question 9

Q

Explain the transition probability and emission probability of a partial HMM for a DNA sequence.

Answer

A

Transition probability:
The probability of going from one state to another
Emission probability:
The probability value associated with each symbol in each state

Example: the probability of generating the sequence AG = (emission probability of A in state one) × (transition probability from state one to state two) × (emission probability of G in state two)
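
The AG example can be worked through numerically. All probabilities below are made-up values for a toy two-state fragment, not parameters from any real model, and `path_probability` is a hypothetical helper.

```python
# Toy two-state HMM fragment; every number here is an illustrative assumption.
emission = {
    "state1": {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    "state2": {"A": 0.1, "C": 0.2, "G": 0.5, "T": 0.2},
}
transition = {("state1", "state2"): 0.8}


def path_probability(seq, path):
    """Multiply emission and transition probabilities along a fixed state path."""
    prob = emission[path[0]][seq[0]]
    for i in range(1, len(seq)):
        prob *= transition[(path[i - 1], path[i])]
        prob *= emission[path[i]][seq[i]]
    return prob


# P(AG) = emission of A in state 1 * transition 1->2 * emission of G in state 2
p_ag = path_probability("AG", ["state1", "state2"])
print(p_ag)  # 0.4 * 0.8 * 0.5
```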

Question 10

Q

Explain the homology-based approach

Answer

A

e.g. If a translated DNA seq is found to be similar to a known protein or protein family from a DB search, this can be strong evidence that the region codes for a protein.
When possible exons of a genomic DNA region match a sequenced cDNA, this also provides experimental evidence for the existence of a coding region.

Question 11

Q

Explain gene prediction in prokaryotes

Answer

A

• Small genome size: 0.5 – 10 Mbp, > 90% coding seq

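Start codon: ATG (AUG in mRNA)/GTG/TTG
Problem: these codons can also occur inside a frame, so their presence alone does not pinpoint the true translation start
Solution: Shine-Dalgarno sequence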
* Purine-rich seq complementary to 16S rRNA
* Immediately downstream of transcription initiation site
* Slightly upstream of the translation start codon
* Consensus motif: AGGAGGT

• Coding region end: one of 3 possible stop codons. Transcription may then end at a ρ-independent terminator: a distinct stem-loop secondary structure followed by a string of Ts

Question 12

Q

Explain the Shine-Dalgarno sequence.

Answer

A

Shine-Dalgarno sequence
* Purine-rich seq complementary to 16S rRNA
* Immediately downstream of transcription initiation site
* Slightly upstream of the translation start codon
* Consensus motif: AGGAGGT
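
A minimal sketch of looking for a Shine-Dalgarno-like site slightly upstream of a candidate start codon. The 20-nt window and the 5-of-7 match threshold are assumptions for illustration, and `best_sd_match` is a hypothetical helper, not part of any gene finder.

```python
SD_CONSENSUS = "AGGAGGT"  # consensus motif from the card above


def best_sd_match(seq, start_codon_pos, window=20, min_matches=5):
    """Return (position, matches) of the best consensus match upstream, or None."""
    upstream_start = max(0, start_codon_pos - window)
    best = None
    # Slide the 7-nt consensus across the window that ends at the start codon.
    for i in range(upstream_start, start_codon_pos - len(SD_CONSENSUS) + 1):
        site = seq[i:i + len(SD_CONSENSUS)]
        matches = sum(a == b for a, b in zip(site, SD_CONSENSUS))
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (i, matches)
    return best


# Made-up sequence: the motif sits at index 4 and ATG starts at index 17.
toy_seq = "TTTTAGGAGGTTTTTTTATGAAA"
hit = best_sd_match(toy_seq, toy_seq.index("ATG"))
print(hit)  # (4, 7)
```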

Question 13

Q

Explain conventional determination of open reading frames (ORFs)

Answer

A

Conceptual translation in all 6 possible frames: 3 forward and 3 reverse
A frame longer than 30 codons without interruption by stop codons suggests a gene coding region
The putative frame is manually confirmed by the presence of a start codon and a Shine-Dalgarno seq.
The translated frame can then be searched with blastp for homologs
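
The six-frame scan can be sketched as follows. It only reports stop-free stretches longer than 30 codons. The confirmation steps (start codon, Shine-Dalgarno seq, blastp) are left out, and `open_frames` is a hypothetical helper name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq, min_codons=30):
    """Yield (strand, frame, start, end, n_codons) for stop-free stretches.

    Coordinates are 0-based on the strand that was scanned; end is exclusive.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    n = (pos - run_start) // 3
                    if n > min_codons:
                        yield strand, frame, run_start, pos, n
                    run_start = pos + 3
            # A stretch may also run to the end of the seq without a stop.
            last = frame + ((len(s) - frame) // 3) * 3
            n = (last - run_start) // 3
            if n > min_codons:
                yield strand, frame, run_start, last, n


# Made-up seq: ATG, 35 alanine codons, then a stop codon.
toy_gene = "ATG" + "GCT" * 35 + "TAA"
frames = list(open_frames(toy_gene))
```

On such a repetitive toy seq several frames stay open, which illustrates why the extra confirmation steps are needed.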

Question 14

Q

What does the Markov model describe?

Answer

A

• Markov model describes the probability of the distribution of nucleotides in a DNA seq, in which the conditional probability of a particular seq position depends on κ previous positions, where κ is the order of a Markov model.

Question 15

Q

Explain the different orders (degrees) of the Markov model

Answer

A

0° Markov model: assumes each base occurs independently with a given probability.
1° Markov model: the occurrence of a base depends on the base preceding it
2° Markov model: looks at the preceding 2 bases to determine which base follows, which is more characteristic of codons in a coding seq.
… … … etc.
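
A minimal sketch of what "order" means in practice: the probability of the next base is estimated from counts conditioned on the preceding `order` bases. The training seq is made up and `train_markov` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def train_markov(seq, order):
    """Estimate P(next base | previous `order` bases) from raw counts."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        context = seq[i - order:i]  # empty string when order == 0
        counts[context][seq[i]] += 1
    return {
        ctx: {base: n / sum(nexts.values()) for base, n in nexts.items()}
        for ctx, nexts in counts.items()
    }


toy = "ATGGCGATGGCA"  # made-up sequence
model0 = train_markov(toy, 0)  # each base independent
model1 = train_markov(toy, 1)  # depends on the preceding base
model2 = train_markov(toy, 2)  # depends on the preceding 2 bases

print(model0[""]["G"])  # fraction of G in the whole seq
print(model1["T"])      # what follows T
print(model2["AT"])     # what follows AT
```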

Question 16

Q

Explain how gene prediction is possible with the Markov models and HMMs

Answer

A

0° Markov model: assumes each base occurs independently with a given probability.
1° Markov model: the occurrence of a base depends on the base preceding it
2° Markov model: looks at the preceding 2 bases to determine which base follows, which is more characteristic of codons in a coding seq.
… … … etc.
A fixed-order Markov chain describes the probability of a particular nucleotide given the previous κ nucleotides.
In general, the higher the order of a Markov model, the more accurately it can predict a gene.
Statistics show that pairs of codons (or a.a. at the protein level) tend to correlate. So a 5th-order Markov model, which calculates the probability of hexamer bases, can detect nucleotide correlations found in coding regions more accurately and is in fact most often used.

Question 17

Q

What are some examples of HMM-based gene finding programs for prokaryotes?

Answer

A

GeneMark
Glimmer

Question 18

Q

What are the four basic measures of gene prediction accuracy at the nucleotide level?

Answer

A

Basic measures of gene prediction accuracy at the nucleotide level:

True positive (TP) – a correctly predicted feature
False positive (FP) – an incorrectly predicted feature
False negative (FN) – a missed feature
True negative (TN) – correctly predicted absence of a feature

Question 19

Q

Explain sensitivity and specificity for performance evaluation

Answer

A

Sensitivity = TP / (TP + FN)
Specificity = TP / (TP + FP)
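
A small worked example of both measures at the nucleotide level. The two label strings are made up (C = coding, N = non-coding) and the helper names are hypothetical. Note that specificity as defined on this card, TP / (TP + FP), is the quantity other fields usually call precision.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN from two equal-length strings of 'C'/'N' labels."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have the same length")
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1
        elif a == "N" and p == "C":
            fp += 1
        elif a == "C" and p == "N":
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tp, fp):
    # Definition used on this card: TP / (TP + FP).
    return tp / (tp + fp)


actual = "NNCCCCCCNN"     # made-up true annotation
predicted = "NCCCCCNNNN"  # made-up prediction
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)       # 4 1 2 3
print(sensitivity(tp, fn))  # 4 / 6
print(specificity(tp, fp))  # 4 / 5
```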

Question 20

Q

Explain Gene prediction in eukaryotes

Answer

A

• Much larger size: 10 Mbp – 670 Gbp (1 Gbp = 10^9 bp)
• Low gene density
e.g. human: only ~3% of the genome codes for genes (~1 gene/100 kbp)

• Intergenic space: rich in repetitive sequences and transposable elements

3 modifications produce the mature mRNA:
5’ cap: involving methylation at the initial residue of the RNA
Splicing: removing introns and joining exons (involving a large RNA-protein complex, the spliceosome)
Polyadenylation: the addition of a stretch of As (~250) at the 3’ end of the RNA [poly-A signal: downstream of a coding region, with consensus CAATAAA(T/C)]

Question 21

Q

What is the major issue for gene prediction in eukaryotes?

Answer

A

Main issue:
Identification of exons, introns and splice sites

Question 22

Q

What is the solution to this problem for gene prediction in eukaryotes?

Answer

A

Solutions:
GT-AG rule (5’ GTAAGT…….NCAG 3’)
Most vertebrate genes: Kozak sequence (CCGCCATGG) flanking the ATG start codon
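
The GT-AG rule can be sketched as a scan for spans that begin with GT and end with AG. The minimum intron length of 20 nt is an assumption for illustration and `candidate_introns` is a hypothetical helper.

```python
def candidate_introns(seq, min_len=20):
    """List (start, end) spans that begin with GT and end with AG (end exclusive)."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j + 2 for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    # Pair every donor with every acceptor far enough downstream.
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# Made-up seq: exon, GTAAGT ... TCAG intron, exon.
toy = "ATGGCC" + "GTAAGT" + "C" * 20 + "TCAG" + "GCCTAA"
spans = candidate_introns(toy)
print(spans)  # [(6, 36), (10, 36)]
```

Even this tiny seq yields two candidates for one real intron, so the dinucleotide rule alone over-predicts and has to be combined with other signals and gene content statistics.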

Question 23

Q

What are some gene prediction programs for eukaryotic genomes?

Answer

A

Ab initio-based programs

GRAIL
GENSCAN
HMMgene

Homology-based programs

GenomeScan
TwinScan

Consensus-based programs

GeneComber
DIGIT

(23 cards)