Lecture 7 gene prediction Flashcards

1
Q

Once a genome has been successfully sequenced and assembled, what type of approach is used to predict its gene structure?

A

computational approaches

2
Q

What are computational gene predictions and what do they include?

A

Computational gene prediction is necessary to obtain comprehensive functional information on genes and genomes. The process includes detection of the location of open reading frames (ORFs).

3
Q

What else does computational gene prediction require in eukaryotes?

A

Description of the structures of introns/exons.

4
Q

What is the main goal of gene prediction?

A

The main goal is to describe all the genes computationally with near 100% accuracy.

5
Q

What is not conserved in coding regions, and how does that affect gene prediction?

A

Coding regions normally lack conserved motifs, which makes gene prediction one of the most difficult problems in the field of pattern recognition.

6
Q

What are the 3 ways of finding genes in genomes?

A

1) Similarity-based or Comparative

2) Ab initio = “from the beginning”

3) Combined “evidence-based” (BEST)

7
Q

1) Similarity-based or Comparative

A
  • BLAST - Do other organisms have similar sequence?
    (Is sequence similar to known gene or protein)
8
Q

2) Ab initio

A
  • Ab initio, meaning “from the beginning”, predicts genes without explicit comparison with cDNA or proteins, using
    “rule-based” gene models. The rules themselves are derived from statistical analysis of datasets.
9
Q

3) Combined “evidence-based”

A
  • Combine gene models with alignment to known ESTs & protein sequences
10
Q

What is the gene density in prokaryotes?

A

High: more than 90% of the genome is coding sequence, with very few repetitive sequences.

11
Q

What gene structure does gene prediction in prokaryotes rely on?

A

Each prokaryotic gene is composed of a single contiguous ORF coding for a single protein or RNA, with no interruptions within the gene.
- Bacterial genes usually have the start codon ATG. GTG and TTG are sometimes used as alternative start codons.
- The protein-coding region ends with a stop codon: TAA, TAG, or TGA.

12
Q

As there may be multiple ATG, GTG, or TTG codons in a frame in prokaryotes, how can the start codon be located?

A

- Identification of the ribosome binding site (Shine-Dalgarno sequence) can help locate the start codon. The ribosome binding site is located slightly upstream of the translation start codon and has a consensus motif of AGGAGGT.
- Identification of the stop codon is straightforward.
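
To illustrate how the ribosome binding site could help choose between candidate start codons, here is a minimal sketch that scans a short window upstream of a candidate start for the best match to the AGGAGGT consensus. The function name `best_sd_match`, the 20-base window, and the 5-of-7 match threshold are assumptions made for this example, not values from the lecture.

```python
def best_sd_match(seq, start, motif="AGGAGGT", window=20, min_matches=5):
    """Return (position, matches) of the best motif match upstream of `start`.

    `window` and `min_matches` are illustrative assumptions.
    Returns None when no site reaches `min_matches`.
    """
    upstream_begin = max(0, start - window)
    best = None
    # Slide the motif over every position that ends before the start codon.
    for i in range(upstream_begin, start - len(motif) + 1):
        site = seq[i:i + len(motif)]
        matches = sum(a == b for a, b in zip(site, motif))
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (i, matches)
    return best


# Toy sequence invented for the example: a perfect motif a few bases before ATG.
seq = "TTTAGGAGGTAACCTATGAAACGCTAA"
start = seq.find("ATG")
hit = best_sd_match(seq, start)
```

A candidate start codon with a good upstream match would be preferred over one without; real gene finders weigh this signal statistically rather than with a fixed cutoff.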

13
Q

What is the ribosome binding site (Shine-Dalgarno sequence)?

A

A purine-rich stretch of sequence complementary to the 3' end of the 16S rRNA in the ribosome.

14
Q

How can potential coding regions be detected?

A

by looking for ORFs

15
Q

What kind of ORF should be looked for, and how can a proposed frame be confirmed as containing a gene?

A
  • A long open reading frame may be a gene.
  • A basic approach is to scan for ORFs whose length exceeds a certain threshold (e.g. 60 amino acids/180 nucleotides).
    – A proposed frame can be confirmed by the presence of other signals, such as the Shine-Dalgarno sequence.
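
The length-threshold scan described above can be sketched as follows. This is a simplified illustration, not how any particular gene finder works: it scans only the three forward-strand frames, takes the first start codon after the previous stop, and counts length in codons before the stop. The name `find_orfs` and the parameter `min_codons` are assumptions for the example.

```python
STARTS = {"ATG", "GTG", "TTG"}   # start codons listed in the cards
STOPS = {"TAA", "TAG", "TGA"}    # stop codons listed in the cards


def find_orfs(seq, min_codons=60):
    """Return (start, end, frame) for forward-strand ORFs of at least min_codons codons."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


# Toy sequence invented for the example: ATG, 60 alanine codons, then a stop.
toy = "ATG" + "GCA" * 60 + "TAA"
hits = find_orfs(toy)
```

A fuller scan would repeat this on the reverse complement to cover all six reading frames.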
16
Q

How often is a stop codon expected to occur at random?

A

About one stop codon every 64/3 ≈ 21 codons, because 3 of the 64 possible codons are stop codons.
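
The arithmetic behind this figure can be checked with a short sketch. It assumes all four bases are equally likely and independent, which is an idealisation that real genomes only approximate; `p_open` is a name introduced here for illustration.

```python
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}

# Enumerate all 64 codons and count how many are stops.
codons = ["".join(c) for c in product("ACGT", repeat=3)]
p_stop = sum(c in STOPS for c in codons) / len(codons)   # 3/64
expected_spacing = 1 / p_stop                             # 64/3, about 21 codons


def p_open(n_codons):
    """Chance that n random codons in a row contain no stop codon."""
    return (1 - p_stop) ** n_codons
```

Under these assumptions, a run of 60 codons with no stop occurs by chance only a few percent of the time, which is why long ORFs are treated as candidate genes.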

17
Q

What is a disadvantage of using stop codons in an ORF to detect a gene?

A

Stop codons occur by chance about once every 21 codons, whereas genes are usually much longer than that. Relying on stop codons alone is therefore unreliable, and a whole gene may not be identified.

18
Q

What is the disadvantage of using a length threshold to scan for ORFs?

A

Some genes (e.g. some neural and immune system genes) are relatively short, so a long ORF length threshold would miss them.

19
Q

What is the strongest indicator of a protein-coding frame?

A

Detection of homologs from searching long ORFs against database of confirmed genes using BLAST

20
Q

What can testing the GC bias in an ORF examine?

A

the non-randomness of nucleotide distribution in ORFs.

21
Q

In a coding sequence, which nucleotides are usually found in the third codon position?

A

G or C

22
Q

How can GC bias be used to identify an ORF?

A

By plotting the GC composition at the third codon position along the sequence, regions with values significantly above the random level can be identified. We are searching for ORFs that are enriched in G/C at the third codon position relative to what we would see by chance alone.
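
As a rough sketch of the quantity being plotted, the function below computes the fraction of codons in a sequence whose third base is G or C, and a helper compares the three forward reading frames. The names `gc3` and `gc3_by_frame` are illustrative, and the toy sequence is invented.

```python
def gc3(orf):
    """Fraction of complete codons whose third base is G or C."""
    thirds = orf.upper()[2::3]          # bases at codon positions 3, 6, 9, ...
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


def gc3_by_frame(seq):
    """GC3 for each of the three forward reading frames."""
    return [gc3(seq[frame:]) for frame in range(3)]


toy = "ATGGCCAAGTAA"                    # codons: ATG GCC AAG TAA
frames = gc3_by_frame(toy)
```

In practice the value would be computed in sliding windows along the genome and compared with the level expected by chance for that organism's base composition.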

23
Q

How can codon usage be used to test an ORF?

A

By creating a 64-element hash table and counting the frequencies of codons in an ORF
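
A minimal sketch of such a table, using a Python dictionary as the hash table. The function name `codon_frequencies` is an assumption for the example; codons containing characters other than A, C, G, T are left out of the table but still count toward the total.

```python
from collections import Counter
from itertools import product


def codon_frequencies(orf):
    """Map each of the 64 codons to its relative frequency in `orf`."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    # Start with all 64 codons at zero so unused codons are still present.
    table = {"".join(c): 0.0 for c in product("ACGT", repeat=3)}
    for codon, n in counts.items():
        if codon in table:
            table[codon] = n / total
    return table


table = codon_frequencies("ATGGCCGCCTAA")   # toy ORF: ATG GCC GCC TAA
```

The resulting frequencies can then be compared with the codon preferences of known genes from the same organism.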

24
Q

How does the uneven usage of codons in nature compensate for the pitfalls of the ORF length test?

A

Most amino acids are encoded by more than one codon, but in nature certain codons are preferred. We therefore test whether an ORF has the “likely” codon usage. This compensates for the pitfalls of the ORF length test, because one ORF is more “believable” than another if it contains more of the “likely” codons.

25
Q

Markov Models

A

A Markov model describes the probability of the arrangement of nucleotides in a DNA sequence, where the probability of the base at a given position depends on the bases at the preceding k positions.

26
Q

Markov models are a well-known tool for analysing sequence data. Which gene prediction programs use them?

A

GeneMark and Glimmer.

27
Q

By looking at preceding bases, what do fixed-order Markov models predict?

A

each base of a sequence

28
Q

Zero-order Markov model

A

Assumes each base occurs independently with a given probability.

29
Q

1st-order Markov model

A

looks at the preceding base to determine what base will follow.

30
Q

2nd-order Markov model

A

looks at the preceding two bases to determine what base will follow.
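
The zero-, 1st- and 2nd-order models in the last three cards differ only in how many preceding bases they condition on. A minimal sketch of estimating such a model from training sequences is shown below; `train_markov` is a name chosen for this example, and no pseudocounts are applied, so unseen contexts are simply absent from the result.

```python
from collections import defaultdict


def train_markov(sequences, k):
    """Estimate P(next base | preceding k bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]      # empty string when k == 0
            counts[context][seq[i]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {base: n / total for base, n in following.items()}
    return model


# Toy training data invented for the example.
m0 = train_markov(["AACG"], 0)      # zero-order: plain base frequencies
m1 = train_markov(["ACACAG"], 1)    # 1st-order: conditioned on one base
m2 = train_markov(["ACGACG"], 2)    # 2nd-order: conditioned on two bases
```

Here `m1["A"]` holds the probabilities of each base following an A in the training data, and `m2["AC"]` the probabilities of each base following the pair AC.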

31
Q

What fact do Markov models exploit?

A

Markov models exploit the fact that oligonucleotide distributions in coding regions are different from those for the noncoding regions.
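
One way to picture this difference is a log-odds score that compares how well a sequence fits oligonucleotide frequencies learned from coding versus noncoding examples. The sketch below is deliberately simplified to dinucleotide frequencies with add-one smoothing; it is not the scoring scheme of GeneMark or Glimmer, and the function names and toy training sequences are invented for illustration. It assumes sequences contain only A, C, G and T.

```python
import math
from collections import Counter


def dimer_freqs(sequences, pseudocount=1.0):
    """Relative frequency of each of the 16 dinucleotides, with smoothing."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    dimers = [a + b for a in "ACGT" for b in "ACGT"]
    total = sum(counts[d] for d in dimers) + pseudocount * len(dimers)
    return {d: (counts[d] + pseudocount) / total for d in dimers}


def log_odds(seq, coding, noncoding):
    """Sum of log(coding/noncoding) over dimers in seq; above 0 favours coding."""
    return sum(
        math.log(coding[seq[i:i + 2]] / noncoding[seq[i:i + 2]])
        for i in range(len(seq) - 1)
    )


# Toy "training sets" invented for the example, not real genomic data.
coding = dimer_freqs(["GCGCGCGCGC"])
noncoding = dimer_freqs(["ATATATATAT"])
```

With these toy models, `log_odds("GCGC", coding, noncoding)` is positive and `log_odds("ATAT", coding, noncoding)` is negative.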

32
Q

what does a longer oligomer unit indicate?

A

The longer the oligomer unit, the more non-randomness can be described for the coding region.

33
Q

The higher the order of a Markov model….

A

the more accurately it can predict a gene, provided there is enough training data to estimate its parameters.

34
Q

How are Markov models built?

A

Because codons are nucleotide triplets, Markov models for coding regions are built in sets of three nucleotides, describing non-random distributions of trimers, hexamers, etc.

35
Q

What are the parameters of Markov models trained with?

A

set of sequences with known gene locations.

36
Q

What can the accuracy of a prediction programme be evaluated with?

A

parameters such as sensitivity and specificity.

37
Q

Sensitivity

A

the ability to include correct predictions.

38
Q

Specificity

A

ability to exclude incorrect predictions.

39
Q

When is a program considered accurate?

A

both sensitivity and specificity are simultaneously high and approach a value of 1.

40
Q

When sensitivity is high but specificity is low,

A

the program is said to have a tendency to overpredict.

41
Q

When sensitivity is low but specificity high,

A

the program is too conservative and lacks predictive power.

42
Q

What four features are used to describe the concept of sensitivity and specificity accurately?

A
  • True Positive (TP), which is a correctly predicted feature/gene.
  • False Positive (FP), which is an incorrectly predicted feature/gene.
  • False Negative (FN), which is a missed feature/gene.
  • True Negative (TN), which is the correctly predicted absence of a feature/gene.
43
Q

Sensitivity is given by?

A

Sn = TP/(TP + FN).

44
Q

Specificity is given by

A

Sp = TP/(TP + FP).
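
The two formulas can be applied directly to counts of predicted features. Note that Sp = TP/(TP + FP) is the definition conventionally used in gene prediction, and it corresponds to what is called precision in other fields; TN does not appear in either formula. The counts below are hypothetical numbers chosen only to show the calculation.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): share of real genes that were predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Sp = TP / (TP + FP): share of predictions that are real genes."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical evaluation: 80 genes found, 20 missed, 40 false predictions.
tp, fn, fp = 80, 20, 40
sn = sensitivity(tp, fn)
sp = specificity(tp, fp)
```

Here sensitivity is high relative to specificity, the pattern the earlier card describes as a tendency to overpredict.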