Lecture 7 gene prediction Flashcards by Simbiat Ayoola

Once a genome has been successfully sequenced and assembled, what type of approach is used to predict its gene structure?

computational approaches

What is computational gene prediction and what does it include?

Computational gene prediction is necessary to obtain comprehensive functional information on genes and genomes. The process includes detection of the location of open reading frames (ORFs).

What else does computational gene prediction require in eukaryotes?

Description of the structures of introns/exons.

What is the main goal of gene prediction?

The main goal is to describe all the genes computationally with 100% accuracy.

What isn't conserved in coding regions, and how does that affect gene prediction?

Motifs aren't conserved in coding regions, which makes gene prediction one of the most difficult problems in the field of pattern recognition.

What are the 3 ways of finding genes in genomes?

1) Similarity-based or Comparative

2) Ab initio = “from the beginning”

3) Combined “evidence-based” (BEST)

1) Similarity-based or Comparative

BLAST: do other organisms have a similar sequence?
(Is the sequence similar to a known gene or protein?)

2) Ab initio

Ab initio, meaning "from the beginning", predicts genes without explicit comparison with cDNA or proteins, using "rule-based" gene models. The rules themselves are derived from statistical analysis of datasets.

3) Combined “evidence-based”

Combine gene models with alignment to known ESTs & protein sequences

What is the gene density in prokaryotes?

High: more than 90% of the genome is coding sequence, with very few repetitive sequences.

What does gene prediction look like in prokaryotes?

Each prokaryotic gene is a single contiguous ORF coding for a single protein or RNA, with no interruptions within the gene.
- Bacterial genes usually start with ATG; GTG and TTG are sometimes used as alternative start codons.
- The protein-coding region ends with a stop codon: TAA, TAG, or TGA.

As there may be multiple ATG, GTG, or TTG codons in a frame in prokaryotes, how can the start codon be located?

- Identification of the ribosome binding site (Shine-Dalgarno sequence) can help locate the start codon. The ribosome binding site is located slightly upstream of the translation start codon and has a consensus motif of AGGAGGT.
- Identification of the stop codon is straightforward.
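
As an illustration of using the ribosome binding site to support a candidate start codon, here is a minimal Python sketch. It counts ungapped matches to the AGGAGGT consensus in a window upstream of the candidate start. The function name `best_sd_match` and the 20-base window are assumptions made for this example, not values from the lecture.

```python
SD_CONSENSUS = "AGGAGGT"  # consensus motif quoted on the card


def best_sd_match(seq, start, window=20):
    """Find the best ungapped match to the consensus upstream of `start`.

    Returns (number_of_matching_bases, position), where position is the
    0-based index of the best window, or None if nothing matched.
    The window size is an illustrative assumption.
    """
    seq = seq.upper()
    lo = max(0, start - window)
    upstream = seq[lo:start]
    k = len(SD_CONSENSUS)
    best = (0, None)
    for i in range(len(upstream) - k + 1):
        # Count positions where the upstream base equals the consensus base.
        matches = sum(a == b for a, b in zip(upstream[i:i + k], SD_CONSENSUS))
        if matches > best[0]:
            best = (matches, lo + i)
    return best


# Candidate ATG at index 17 with a perfect motif at index 4.
print(best_sd_match("TTTTAGGAGGTCCCCCCATGAAA", 17))
```

A candidate start with a high match count a short distance upstream is more plausible than one with no recognisable motif; what counts as "high" is a judgment call that this sketch leaves to the user.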

What is the ribosome binding site/Shine-Dalgarno sequence?

A stretch of purine-rich sequence complementary to the 16S rRNA in the ribosome.

How can potential coding regions be detected?

by looking for ORFs

What kind of ORF should be used, and how can a proposed frame be confirmed for the presence of a gene?

A long open reading frame may be a gene.
- A basic approach is to scan for ORFs whose length exceeds a certain threshold (60 amino acids/180 nucleotides).
- A proposed frame can be confirmed by the presence of other signals such as the Shine-Dalgarno sequence.
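
The threshold scan described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it scans only the three forward-strand frames, treats ATG, GTG and TTG as starts, and takes the first start codon after the previous stop. The function name `find_orfs` and its return format are invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}


def find_orfs(seq, min_codons=60):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the start codon, end is the index just
    past the stop codon. Only ORFs with at least `min_codons` codons
    (stop codon not counted) are reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


# Toy sequence: ATG followed by 60 GCT codons and a TAA stop.
toy = "CC" + "ATG" + "GCT" * 60 + "TAA" + "GG"
print(find_orfs(toy))
```

A fuller scan would also search the reverse complement, giving six reading frames in total.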

How often should a stop codon be seen at random?

One stop codon every 64/3 ≈ 21 codons, because 3 of the 64 possible codons are stop codons.
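
The arithmetic behind this figure, and why long ORFs are unlikely to arise by chance, can be checked with a short Python sketch. It assumes all 64 codons are equally likely in random sequence, which real genomes only approximate; the function names are illustrative.

```python
def expected_stop_spacing(n_stops=3, n_codons_total=64):
    """Average number of codons between stops in uniformly random sequence."""
    return n_codons_total / n_stops


def prob_stop_free(n_codons, n_stops=3, n_codons_total=64):
    """Probability that `n_codons` consecutive random codons contain no stop."""
    return ((n_codons_total - n_stops) / n_codons_total) ** n_codons


print(expected_stop_spacing())  # about 21.3 codons
print(prob_stop_free(60))       # chance of a stop-free run of 60 codons
```

Under this uniform assumption a stop-free run of 60 codons occurs by chance a little under 6% of the time, which is why a length threshold filters out most random frames but not all of them.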

What is a disadvantage of using stop codons in an ORF to detect a gene?

Genes are usually longer than 21 codons, so if stop codons alone are used, a whole gene may not be identified.

What is the disadvantage of using a length threshold to scan for ORFs?

Some genes (e.g. some neural and immune system genes) are relatively short, so a long ORF threshold would miss them.

What is the strongest indicator of a protein-coding frame?

Detection of homologs by searching long ORFs against a database of confirmed genes using BLAST.

What does testing the GC bias of an ORF examine?

The non-randomness of the nucleotide distribution in ORFs.

In a coding sequence, which nucleotides usually occupy the third codon position?

G or C

How can GC bias be used to identify an ORF?

By plotting the GC composition at the third codon position, regions with values significantly above the random level can be identified. We are looking for enrichment: ORFs with more G/C at the third codon position than we would expect by chance alone.
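
A minimal Python sketch of the quantity being plotted: the fraction of G or C at the third position of each codon in a candidate ORF. The function name `gc3_fraction` is an assumption for this example, and the "random level" to compare against depends on the overall base composition of the genome, which is not computed here.

```python
def gc3_fraction(orf):
    """Fraction of complete codons whose third base is G or C."""
    third_bases = orf.upper()[2::3]  # indices 2, 5, 8, ... are third positions
    if not third_bases:
        return 0.0
    return sum(base in "GC" for base in third_bases) / len(third_bases)


print(gc3_fraction("ATGGCCAAG"))  # third bases are G, C, G
print(gc3_fraction("ATGAAA"))     # third bases are G, A
```

To reproduce the plot, the same function could be applied to a sliding window along each reading frame.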

How can codon usage be used to test an ORF?

By creating a 64-element hash table and counting the frequency of each codon in the ORF.
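
The 64-element table can be sketched in Python with a dictionary keyed by codon. The function name `codon_counts` is illustrative, and codons containing ambiguous bases such as N are simply skipped in this sketch.

```python
from itertools import product


def codon_counts(orf):
    """Count each of the 64 codons in an ORF, read in frame from position 0."""
    # Pre-fill all 64 codons so unused codons appear with a count of zero.
    counts = {"".join(c): 0 for c in product("ACGT", repeat=3)}
    orf = orf.upper()
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases
            counts[codon] += 1
    return counts


table = codon_counts("ATGAAAAAATAA")
print(len(table), table["AAA"], table["ATG"])
```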

How does the uneven usage of codons in nature compensate for pitfalls of the ORF length test?

Amino acids typically have more than one codon, but in nature certain codons are preferred. We therefore test whether an ORF has the "likely" codon usage. This compensates for pitfalls of the ORF length test because an ORF with more "likely" codons is more "believable" than one with fewer.
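
One way to make "more believable" concrete is to score an ORF against codon frequencies learned from known genes. The Python sketch below is an assumption-laden illustration, not the lecture's method: it uses a pseudocount so unseen codons keep a non-zero frequency, and it reports the mean log ratio of each codon's frequency to the uniform value 1/64. The names `codon_frequencies` and `usage_score` are invented for this example.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(known_genes, pseudocount=1.0):
    """Estimate codon frequencies from in-frame sequences of known genes."""
    counts = Counter()
    for gene in known_genes:
        gene = gene.upper()
        counts.update(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def usage_score(orf, freqs):
    """Mean log ratio of codon frequency to uniform (1/64); higher is more typical."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    codons = [c for c in codons if c in freqs]  # ignore ambiguous codons
    if not codons:
        return 0.0
    return sum(math.log(freqs[c] * 64) for c in codons) / len(codons)


freqs = codon_frequencies(["GCTGCTGCTGCT"])  # toy training set
print(usage_score("GCTGCT", freqs), usage_score("CGACGA", freqs))
```

Because the score is averaged per codon, a short ORF built from preferred codons can still score well, which is the sense in which codon usage complements the length test.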

Markov Models

A Markov model describes the likelihood of a particular arrangement of nucleotides in a DNA sequence: the probability of the base at a given position depends on the preceding k positions.

Markov models are a well-known tool for analysing sequence data. Which gene prediction programs use them?

GeneMark and Glimmer.

By looking at preceding bases, what do fixed-order Markov models predict?

Each base of a sequence.

Zero-order Markov model

Assumes each base occurs independently with a given probability.

1st-order Markov model

looks at the preceding base to determine what base will follow.

2nd-order Markov model

looks at the preceding two bases to determine what base will follow.

What fact do Markov models exploit?

Markov models exploit the fact that oligonucleotide distributions in coding regions are different from those for the noncoding regions.
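
To show how this difference can be exploited, here is a minimal Python sketch of a fixed-order Markov model: one model is trained on coding sequences, another on noncoding sequences, and a candidate is scored by the log-odds between them. This is a deliberate simplification. It ignores codon position, uses a pseudocount, and falls back to 0.25 for unseen contexts, all of which are assumptions of this example rather than how GeneMark or Glimmer work internally.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(seqs, k, pseudocount=1.0):
    """Estimate P(next base | preceding k bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            context, base = s[i - k:i], s[i]
            if base in BASES and set(context) <= set(BASES):
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values()) + 4 * pseudocount
        model[context] = {b: (c[b] + pseudocount) / total for b in BASES}
    return model


def log_prob(seq, model, k):
    """Log-probability of seq under the model (first k bases are not scored)."""
    seq = seq.upper()
    lp = 0.0
    for i in range(k, len(seq)):
        probs = model.get(seq[i - k:i])
        # Unseen contexts or ambiguous bases fall back to a uniform 0.25.
        p = probs.get(seq[i], 0.25) if probs else 0.25
        lp += math.log(p)
    return lp


def coding_log_odds(seq, coding_model, noncoding_model, k):
    """Positive values mean seq looks more like the coding training set."""
    return log_prob(seq, coding_model, k) - log_prob(seq, noncoding_model, k)


coding = train_markov(["GCGCGCGCGCGC"], k=1)      # toy "coding" data
noncoding = train_markov(["ATATATATATAT"], k=1)   # toy "noncoding" data
print(coding_log_odds("GCGCGC", coding, noncoding, k=1))
```

Setting `k=0` gives the zero-order model, where each base is scored independently; larger `k` captures longer oligomers but needs more training data per context.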

What does a longer oligomer unit indicate?

The longer the oligomer unit, the more non-randomness can be described for the coding region.

The higher the order of a Markov model....

the more accurately it can predict a gene, provided there is enough training data to estimate its parameters.

How are Markov models built?

Markov models are built in sets of three nucleotides, describing non-random distributions of trimers, hexamers, etc.

What are the parameters of Markov models trained on?

A set of sequences with known gene locations.

What can the accuracy of a prediction programme be evaluated with?

parameters such as sensitivity and specificity.

Sensitivity

the ability to include correct predictions.

Specificity

ability to exclude incorrect predictions.

When is a program considered accurate?

When both sensitivity and specificity are simultaneously high and approach a value of 1.

What happens when sensitivity is high but specificity is low?

the program is said to have a tendency to overpredict.

What happens when sensitivity is low but specificity is high?

the program is too conservative and lacks predictive power.

What four features are used to describe the concept of sensitivity and specificity accurately?

- True Positive (TP), which is a correctly predicted feature/gene.
- False Positive (FP), which is an incorrectly predicted feature/gene.
- False Negative (FN), which is a missed feature/gene.
- True Negative (TN), which is the correctly predicted absence of a feature/gene.

Sensitivity is given by?

Sn = TP/(TP + FN).

Specificity is given by?

Sp = TP/(TP + FP).
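
The two formulas can be checked with a short Python sketch. It follows the definitions on these cards, where specificity is TP/(TP + FP); outside gene prediction that quantity is more often called precision. The function names and the example counts are made up for illustration.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): the share of real genes that were predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Sp = TP / (TP + FP): the share of predictions that are real genes."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical run: 90 genes found, 10 missed, 30 false predictions.
print(sensitivity(tp=90, fn=10))  # high sensitivity
print(specificity(tp=90, fp=30))  # lower specificity: tends to overpredict
```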

Lecture 7 gene prediction Flashcards

(44 cards)