Chapter 6 Flashcards

1
Q
  1. Explain gene prediction
A
  • With genomic sequence information accumulating rapidly, the task is to predict gene structure accurately
  • Detecting the location of open reading frames (ORFs)
  • Delineating the structures of introns and exons
2
Q

What are the two major categories of gene prediction programs

A

2 major categories of gene prediction programs

  1. ab initio-based approaches
    • predict genes based on the given seq alone
  2. Homology-based approaches
    • predict genes based on significant matches of the query seq with seq of known genes
3
Q

What is the goal of gene prediction?

A
  • Goal: to describe all the genes computationally with near 100% accuracy
  • More success in gene prediction algorithms for prokaryotic genomes than eukaryotes
4
Q
  1. What is consensus-based?
A

• A program is consensus-based if it combines prediction results from multiple programs to derive a consensus prediction.
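One simple way to combine predictions is a per-nucleotide majority vote. The sketch below is only an illustration under that assumption; the function name `consensus_labels` and the 'C'/'N' labelling (coding / non-coding) are hypothetical, and real consensus-based programs may weigh their inputs differently.

```python
from collections import Counter

def consensus_labels(predictions):
    """Majority vote per position over equal-length label strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must cover the same positions")
    consensus = []
    for column in zip(*predictions):
        # most_common(1) returns the label seen most often at this position
        label, _ = Counter(column).most_common(1)[0]
        consensus.append(label)
    return "".join(consensus)

# Hypothetical output of three programs: 'C' = coding, 'N' = non-coding
program_a = "NNCCCCNN"
program_b = "NCCCCNNN"
program_c = "NNCCCCCN"
print(consensus_labels([program_a, program_b, program_c]))  # NNCCCCNN
```

Using an odd number of programs avoids ties when there are only two labels.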

5
Q
  1. Explain the 2 major features of the ab initio-based approach
A

Relying on 2 major features associated with genes:

  • Existence of gene signals:
    • including start & stop codons,
    • intron splice signals,
    • transcription factor binding sites,
    • ribosomal binding sites,
    • Poly-A sites.
  • Gene content,
    • the statistical description of coding regions.
6
Q
  1. What models can be used to detect the features in the ab initio-based approach?
A
  • The features can be detected by either:
    • Markov models
    • hidden Markov models (HMMs)
7
Q
  1. Compare the two models of ab-initio-based approach
A

• Markov model

  • (aka Markov chain)
  • describes seq of events that occur one after another in a chain.
  • A Markov chain can be considered as a process that moves in one direction from one state to the next with transition probability.
  • e.g. traffic lights

• HMM

  • HMM combines 2 or more Markov chains with only one chain consisting of observed states and the other chains made up of unobserved (or “hidden”) states that influence the outcome of the observed states.
  • e.g. gaps don’t correspond to any residues, so they are unobservable states, but they indirectly influence the transition probability of the observed states.
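The traffic-light example can be written as a tiny Markov chain. This is a minimal sketch: the state names and the transition table are assumptions chosen for illustration (a deterministic green, yellow, red cycle), and `chain_probability` is a hypothetical helper.

```python
# Transition probabilities for a hypothetical traffic light.
# Each state moves to exactly one next state with probability 1.
TRANSITIONS = {
    "green": {"yellow": 1.0},
    "yellow": {"red": 1.0},
    "red": {"green": 1.0},
}

def chain_probability(states, transitions):
    """Probability of a sequence of states, given its first state."""
    p = 1.0
    for current, following in zip(states, states[1:]):
        # A transition that is not listed has probability 0
        p *= transitions[current].get(following, 0.0)
    return p

print(chain_probability(["green", "yellow", "red", "green"], TRANSITIONS))  # 1.0
print(chain_probability(["green", "red"], TRANSITIONS))                     # 0.0
```

Every state here is directly observed, which is what distinguishes a plain Markov chain from an HMM.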
8
Q
  1. Explain the Parameters for the HMM for DNA sequence.
A
  • Transition probability
  • Emission probability
9
Q
  1. Explain the Transition probability and emission probability of Partial HMM for DNA sequence.
A
  • Transition probability:
    • The probability of going from one state to another
  • Emission probability:
    • The probability value associated with each symbol in each state

Example: To generate the sequence AG: (state one emission probability of A) × (transition probability from state one to state two) × (state two emission probability of G)
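The AG example can be checked numerically. The sketch below multiplies emission and transition probabilities along one fixed path of states; all numbers and the state names `s1` and `s2` are made up for illustration, and the probability of starting in the first state is left out (taken as 1), as in the example.

```python
def hmm_path_probability(symbols, states, emission, transition):
    """Probability of emitting `symbols` along the given path of `states`."""
    p = emission[states[0]][symbols[0]]
    for i in range(1, len(symbols)):
        p *= transition[states[i - 1]][states[i]]  # move to the next state
        p *= emission[states[i]][symbols[i]]       # emit the next symbol
    return p

# Hypothetical parameters for a partial two-state HMM
emission = {
    "s1": {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    "s2": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}
transition = {"s1": {"s2": 0.8}}

# P(AG) = 0.4 (A in s1) x 0.8 (s1 -> s2) x 0.3 (G in s2), about 0.096
print(hmm_path_probability("AG", ["s1", "s2"], emission, transition))
```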

10
Q
  1. Explain the homology-based approach
A
  • e.g. If a translated DNA seq is found to be similar to a known protein or protein family in a DB search, this can be strong evidence that the region codes for a protein.
  • When possible exons of a genomic DNA region match a sequenced cDNA, this also provides experimental evidence for the existence of a coding region.
11
Q
  1. Explain gene prediction in prokaryotes
A

• Small genome size: 0.5 – 10 Mbp, > 90% coding seq

  • Start codon: ATG (AUG in mRNA)/GTG/TTG
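  • Problem: a frame may contain several candidate start codons, so a start codon alone does not identify the translation initiation site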
  • Solution: Shine-Dalgarno sequence
    * Purine-rich seq complementary to 16S rRNA
    * Immediately downstream of transcription initiation site
    * Slightly upstream of the translation start codon
    * Consensus motif: AGGAGGT

• Coding region end: one of 3 possible stop codons, or a ρ-independent terminator: a distinct stem-loop secondary structure followed by a string of Ts

12
Q
  1. Explain the Shine-Dalgarno sequence.
A
  • Shine-Dalgarno sequence
    * Purine-rich seq complementary to 16S rRNA
    * Immediately downstream of transcription initiation site
    * Slightly upstream of the translation start codon
    * Consensus motif: AGGAGGT
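As a rough illustration, a start codon can be screened by looking for the consensus motif slightly upstream of it. The sketch below is an assumption-laden example, not how any particular gene finder works: the 20-nt window, the one-mismatch tolerance, and the function names are all hypothetical choices.

```python
START_CODONS = ("ATG", "GTG", "TTG")

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def starts_with_sd(seq, motif="AGGAGGT", window=20, max_mismatch=1):
    """Return positions of start codons preceded by an SD-like motif."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in START_CODONS:
            continue
        upstream = seq[max(0, i - window):i]
        # Slide the motif across the upstream window
        for j in range(len(upstream) - len(motif) + 1):
            if mismatches(upstream[j:j + len(motif)], motif) <= max_mismatch:
                hits.append(i)
                break
    return hits

print(starts_with_sd("CCCAGGAGGTCCCCCATGAAA"))  # [15]
print(starts_with_sd("CCCCCCCCCCATGAAA"))       # []
```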
13
Q
  1. Explain conventional determination of open reading frames (ORFs)
A
  • Conceptual translation in all 6 possible frames: 3 forward and 3 reverse
  • A frame longer than 30 codons without interruption by stop codons is taken as a putative gene coding region
  • Putative frame is manually confirmed by the presence of start codon and Shine-Dalgarno seq.
  • blastp for homologs
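The first two steps above can be sketched as a six-frame scan for stretches free of stop codons. This is a minimal sketch under stated assumptions: the function names are hypothetical, coordinates are reported on the strand being scanned, and the later confirmation steps (start codon, Shine-Dalgarno seq, blastp) are not implemented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def open_frames(seq, min_codons=30):
    """Return (strand, frame, start, end) for stop-free runs longer than min_codons."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            # Walk the frame codon by codon; a stop codon closes the current run
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - run_start) // 3 > min_codons:
                        results.append((strand, frame, run_start, pos))
                    run_start = pos + 3
            # Handle a run that reaches the end of the sequence
            end = frame + ((len(s) - frame) // 3) * 3
            if (end - run_start) // 3 > min_codons:
                results.append((strand, frame, run_start, end))
    return results

toy = "ATG" + "GCA" * 40 + "TAA"  # hypothetical 42-codon sequence
print(("+", 0, 0, 123) in open_frames(toy))  # True
```

On such a short toy sequence the other frames are also stop-free, which illustrates why putative frames still need confirmation.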
14
Q
  1. What does the Markov model describe?
A

• Markov model describes the probability of the distribution of nucleotides in a DNA seq, in which the conditional probability of a particular seq position depends on κ previous positions, where κ is the order of a Markov model.

15
Q

Explain the different orders (degrees) of the Markov model

A
  • 0° Markov model: assumes each base occurs independently with a given probability.
  • 1° Markov model: the occurrence of a base depends on the base preceding it
  • 2° Markov model: looks at the preceding 2 bases to determine which base follows, which is more characteristic of codons in a coding seq.
  • … … … etc.
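The effect of the order can be seen by estimating conditional probabilities from counts. The sketch below is illustrative only: `train_markov`, the add-one pseudocount, and the toy training sequence are assumptions, not part of any named program.

```python
import math
from collections import defaultdict

def train_markov(seqs, k, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous k bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, k, prob):
    """Log-probability of seq under an order-k model (first k bases skipped)."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

toy = ["ACACACAC"]                 # hypothetical training data
order0 = train_markov(toy, 0)      # each base independent
order1 = train_markov(toy, 1)      # depends on the preceding base
print(order0("", "C"))             # P(C), ignoring context
print(order1("A", "C"))            # P(C | previous base is A)
```

With k = 0 the context is the empty string, so the same code covers the zero-order case.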
16
Q
  1. Explain how gene prediction is possible with the Markov models and HMMs
A
  • 0° Markov model: assumes each base occurs independently with a given probability.
  • 1° Markov model: the occurrence of a base depends on the base preceding it
  • 2° Markov model: looks at the preceding 2 bases to determine which base follows, which is more characteristic of codons in a coding seq.
  • … … … etc.
  • A fixed-order Markov chain describes the probability of a particular nucleotide given the previous κ nucleotides.
  • The higher the order of a Markov model, the more accurately it can predict a gene.
  • Statistics show that pairs of codons (or amino acids at the protein level) tend to correlate. A fifth-order Markov model, which calculates the probability of hexamer bases, can therefore detect nucleotide correlations found in coding regions more accurately and is in fact the most often used.
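One way to use a fifth-order model for prediction is to score a region by the log-odds of a coding model against a non-coding model. The sketch below assumes that setup; the function names, the pseudocount, and the toy training sequences are hypothetical and far too small to be realistic.

```python
import math
from collections import Counter

def hexamer_model(training_seqs, pseudocount=1.0):
    """Fifth-order model: P(base | previous 5 bases) from hexamer counts."""
    context_counts = Counter()
    hexamer_counts = Counter()
    for seq in training_seqs:
        for i in range(5, len(seq)):
            context_counts[seq[i - 5:i]] += 1
            hexamer_counts[seq[i - 5:i + 1]] += 1

    def cond_prob(context, base):
        # The pseudocount keeps unseen hexamers from having probability 0
        return (hexamer_counts[context + base] + pseudocount) / (
            context_counts[context] + 4 * pseudocount)

    return cond_prob

def log_odds(seq, coding, noncoding):
    """Positive scores favour the coding model, negative the non-coding one."""
    score = 0.0
    for i in range(5, len(seq)):
        context, base = seq[i - 5:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score

coding = hexamer_model(["ATGGCC" * 20])   # toy "coding" training set
noncoding = hexamer_model(["TA" * 60])    # toy "non-coding" training set
print(log_odds("ATGGCCATGGCCATGGCC", coding, noncoding) > 0)  # True
print(log_odds("TATATATATATATA", coding, noncoding) < 0)      # True
```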
17
Q

What are some examples of Markov model- and HMM-based gene-finding programs for prokaryotes?

A
  • GeneMark
  • Glimmer
18
Q
  1. What are the four basic measures of gene prediction accuracy at the nucleotide level?
A

Basic measures of gene prediction accuracy at the nucleotide level:

  • True positive (TP) – a correctly predicted feature
  • False positive (FP) – an incorrectly predicted feature
  • False negative (FN) – a missed feature
  • True negative (TN) – correctly predicted absence of a feature
19
Q
  1. Explain sensitivity and specificity for performance evaluation
A
  • Sensitivity = TP / (TP + FN)
  • Specificity = TP / (TP + FP)
20
Q
  1. Explain Gene prediction in eukaryotes
A

• Much larger size: 10 Mbp – 670 Gbp (1 Gbp = 10^9 bp)
• Low gene density
e.g. human: only 3% of the genome codes for genes (~1 gene/100 Kbp)

• Intergenic space: rich in repetitive sequences and transposable elements

  • 3 modifications to produce mature mRNA:
  • 5’ cap: involving methylation at the initial residue of the RNA
  • Splicing: removing introns and joining exons (involving a large RNA-protein complex, spliceosome)
  • Polyadenylation: the addition of a stretch of As (~250) at the 3’ end of RNA [poly-A signal: downstream of a coding region, with consensus CAATAAA(T/C)]
21
Q
  1. What is the major issue for gene prediction in eukaryotes
A
  • Main issue:
  • Identification of exons, introns and splicing sites
22
Q
  1. What is the solution to the problem when using gene prediction in eukaryotes
A
  • Solutions:
  • GT-AG rule (5’ GTAAGT…….NCAG 3’)
  • Most vertebrates: Kozak sequence (CCGCCATGG)
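The GT-AG rule can be used to enumerate candidate introns, as in the rough sketch below. It only pairs every GT with every downstream AG, so it produces many false candidates; the function name and the `min_len` cut-off are hypothetical, and real programs add statistical models of the splice sites.

```python
def candidate_introns(seq, min_len=6):
    """Return (start, end) spans that begin with GT and end with AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    # end is exclusive, so seq[start:end] is the candidate intron
    return [(d, a + 2) for d in donors for a in acceptors
            if a + 2 - d >= min_len]

toy = "ATGGTAAGTCCCCCAGGCC"  # hypothetical sequence
for start, end in candidate_introns(toy):
    print(start, end, toy[start:end])
# 3 16 GTAAGTCCCCCAG
# 7 16 GTCCCCCAG
```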
23
Q

What are some gene prediction programs for eukaryotic genomes?

A

Ab initio-based programs

  • GRAIL
  • GENSCAN
  • HMMgene

Homology-based programs

  • GenomeScan
  • TwinScan

Consensus-based programs

  • GeneComber
  • DIGIT