Gene Finding Flashcards

Week 2 Lecture 1

1
Q

Describe the eukaryotic genome structure.

A
  • Exons and introns; the spliced exons contain the ORF
  • The promoter is the nucleotide sequence to which RNA polymerase binds to initiate transcription
  • Initiation codon (usually ATG)
  • Termination codons (TAA, TAG, TGA)
  • LINEs, SINEs, LTR elements
2
Q

What is TRY4?

A
  • A gene that is responsible for the synthesis of trypsinogen (trypsin precursor)
  • Discontinuous gene (split into 5 exons and 4 introns)
3
Q

What are V28 and V29-1?

A
  • Discontinuous gene segments that specify part of the beta T-cell receptor protein
  • Not complete genes; they must be linked to other gene segments by DNA rearrangement (V(D)J recombination), not by RNA splicing
  • Results in a permanent genome change during cell differentiation
4
Q

What does gene finding aim to find?

A
  • Which regions code for proteins
  • Gene start and end regions
  • Exon/intron boundaries
  • Regulatory sequences
5
Q

Describe prokaryotic gene finding.

A

Prokaryotes have small genomes (0.5-5 Mbp) with a high coding density (>90%) and no introns, which makes gene identification relatively easy (~99% success rate). Problems include overlapping ORFs (a consequence of the compact genome), short genes that fall below the length cutoff (roughly 50 aa), and finding promoters.

6
Q

Describe eukaryotic gene finding.

A

Eukaryotes have large genomes (10-120,000 Mbp) with a low coding density (<50%, 2-3% in humans), and their genes contain introns. This makes gene identification relatively difficult (~50% success rate), and eukaryotic gene finding runs into many more problems than prokaryotic gene finding.

7
Q

Methods of gene finding

A
  • Ab initio methods
  • Similarity-based methods
  • Integrated approaches
8
Q

What are Ab initio methods?

A

Making predictions based on typical gene features such as splice signals and sequence composition. Regions to look for include initial 5’ exons, internal exons, and final 3’ exons.

9
Q

What are similarity-based methods?

A
  • Predict genes by adding information from the whole genome sequence from related species
  • The similarity between the query sequence and known coding sequences is used to infer gene structures
  • Allows predicted exon sequences located by ORF scanning to be tested for functionality
  • Results are influenced by the availability of close homologues
10
Q

Problems with similarity-based methods

A
  • Different genes are expressed in different tissues, so you can’t just sequence mRNA from a random tissue; you need to sequence mRNA from many tissues under many different conditions
  • Transcripts of the same gene might differ due to alternative splicing
11
Q

ORF scanning in prokaryotes

A
  • Search DNA for start and stop codons (may end up with false positives)
  • Search for promoters
  • Since most genes are >50 codons, you apply a minimum-length cutoff of ~100 codons to reduce false positives, but you also lose the short genes
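The scanning steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real gene finder: it scans only the forward strand, ignores promoters, and the function name `find_orfs` and its `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. ORFs with fewer than
    `min_codons` codons before the stop are discarded (the length cutoff).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3     # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # look for the next ORF
    return orfs
```

With the default cutoff a short toy ORF is rejected, which mirrors the "you also lose the short genes" problem; lowering `min_codons` recovers it at the cost of more false positives.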
12
Q

ORF scanning in eukaryotes

A
  • Search DNA sequence for start and stop codons
  • Search for promoters
  • Search for intron/exon boundaries (spliced mRNA does not contain introns)
  • Since most genes are >50 codons, you apply a minimum-length cutoff of ~100 codons
13
Q

Codon usage in genomes

A
  • Codons are used unequally; this is a universal feature of genomes
  • You can use this to differentiate coding and non-coding regions (e.g. humans use GTG 4x more than GTA for valine)
  • Real exons are expected to show codon bias but introns should not
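Codon bias can be measured by counting codons and comparing synonymous ones. The sketch below is illustrative only: the helper names are made up, the sequence in the usage note is toy data rather than real human codon frequencies, and the sequence is assumed to be in frame from position 0.

```python
from collections import Counter

VALINE_CODONS = ["GTT", "GTC", "GTA", "GTG"]

def codon_counts(cds):
    """Count codons in a sequence that is assumed to start in frame."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

def relative_usage(counts, synonymous_codons):
    """Fraction of each synonymous codon among all codons for that amino acid."""
    total = sum(counts[c] for c in synonymous_codons)
    if total == 0:
        return {c: 0.0 for c in synonymous_codons}
    return {c: counts[c] / total for c in synonymous_codons}
```

For example, `relative_usage(codon_counts("GTGGTGGTGGTGGTA"), VALINE_CODONS)` gives GTG a share of 0.8 and GTA a share of 0.2, a 4:1 ratio like the valine example on this card.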
14
Q

ORF scanning and moving windows

A

Only sequence information is used to identify coding exons, by integrating coding statistics. We want to calculate the likelihood that each triplet lies in a coding region and plot it along the sequence (above zero is likely coding, below zero is unlikely).
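One possible way to compute such a score is a log-odds ratio per codon, averaged over a moving window. This is a sketch under stated assumptions: the frequency tables are supplied by the caller and must cover every codon in the sequence, only reading frame 0 is scored, and the function name and `window` parameter are invented for this example.

```python
import math

def window_scores(dna, coding_freq, background_freq, window=3):
    """Mean log-odds (coding vs background) over a sliding window of codons.

    Scores above zero suggest coding sequence, scores below zero suggest
    non-coding sequence.
    """
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    log_odds = [math.log(coding_freq[c] / background_freq[c]) for c in codons]
    return [
        sum(log_odds[i:i + window]) / window
        for i in range(len(log_odds) - window + 1)
    ]
```

With toy tables such as `{"GTG": 0.8, "GTA": 0.2}` for coding and `{"GTG": 0.5, "GTA": 0.5}` for background, a run of GTG codons scores above zero and a run of GTA codons scores below zero.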

15
Q

ORF scanning: exon/intron boundaries

A

Exon-intron boundaries have distinctive sequence features.
- Upstream boundary: invariant GT and consensus sequence
- Downstream boundary: T or C, any nucleotide, then CAG
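These signals can be located with simple pattern matching. The sketch below reads the downstream pattern as one pyrimidine (T or C), any nucleotide, then CAG, and checks only the invariant GT at the upstream boundary; real splice-site models use longer, weighted consensus sequences. The function names are made up for this example.

```python
import re

def candidate_donors(dna):
    """Positions where an intron could start (the invariant GT)."""
    return [m.start() for m in re.finditer(r"(?=GT)", dna.upper())]

def candidate_acceptors(dna):
    """Positions where the next exon could start, i.e. just after [TC]-N-CAG."""
    return [m.start() + 5 for m in re.finditer(r"(?=[TC][ACGT]CAG)", dna.upper())]
```

Most hits will be false positives, because GT and CAG occur by chance throughout a genome; this is why boundary signals are combined with coding statistics rather than used alone.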

16
Q

ORF scanning: upstream regulatory sequences

A

Locate where genes begin using distinct sequence features (e.g. recognition signals for DNA-binding proteins). Regulatory sequences are variable and difficult to incorporate into gene prediction algorithms.

17
Q

Best Ab initio methods

A

Based on HMMs, a machine learning approach that takes sequences and encodes them in a statistical framework
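To make the idea concrete, here is a toy two-state HMM (coding vs non-coding) decoded with the Viterbi algorithm. Every probability below is invented for illustration, and the model is far simpler than anything a real gene finder uses; it only shows how a statistical framework turns a sequence into a most likely labelling.

```python
import math

STATES = ("coding", "noncoding")
# All numbers are made-up toy parameters, not trained values.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
}

def viterbi(seq):
    """Return the most probable state path for `seq` under the toy HMM."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best previous state for state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi("GCGCGCGCATATATAT")` labels the GC-rich half as coding and the AT-rich half as non-coding under these toy parameters.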

18
Q

Examples of Ab initio methods

A

GENSCAN, HMMgene, GeneMark

19
Q

What is GenScan?

A

GenScan identifies complete intron/exon structures of genes in genomic DNA and can predict multiple genes, both partial and complete. It uses an HMM to model gene structure, with separate states for exons, introns, and intergenic regions. Different parameter sets are used for regions with different GC content.

20
Q

P values in GenScan

A

P is the probability that the predicted exon is correct.
- P > 0.99: the exon is almost always exactly correct
- 0.50 <= P <= 0.99: the exon is correct most of the time
- P < 0.50: not reliable
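The thresholds on this card can be written as a small lookup. The function below is a hypothetical helper for filtering exon predictions, not part of GenScan itself.

```python
def exon_confidence(p):
    """Map an exon probability P to the reliability bands on this card."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must be between 0 and 1")
    if p > 0.99:
        return "almost always exactly correct"
    if p >= 0.50:
        return "correct most of the time"
    return "not reliable"
```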

21
Q

Sensitivity (nucleotide level accuracy)

A

no. of correctly predicted coding nucleotides/no. of actual coding nucleotides, i.e. TP/(TP + FN)

22
Q

Specificity (nucleotide level accuracy)

A

no. of correctly predicted coding nucleotides/no. of predicted coding nucleotides, i.e. TP/(TP + FP)
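At the nucleotide level, these two measures are usually computed per base: a true positive is a base that is coding in both the annotation and the prediction. A minimal sketch, assuming exons are given as half-open `(start, end)` intervals; the function names are made up for this example.

```python
def coding_positions(exons):
    """Set of base positions covered by half-open (start, end) exon intervals."""
    return {pos for start, end in exons for pos in range(start, end)}

def nucleotide_accuracy(actual_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    actual = coding_positions(actual_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(actual & predicted)                        # coding in both
    sensitivity = tp / len(actual) if actual else 0.0   # TP / (TP + FN)
    specificity = tp / len(predicted) if predicted else 0.0  # TP / (TP + FP)
    return sensitivity, specificity
```

Note that "specificity" as used in gene finding is the fraction of predictions that are correct, which other fields call precision.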

23
Q

Sensitivity (exon level accuracy)

A

no. of correctly predicted exons/no. of actual exons (correct + missed exons)

24
Q

Specificity (exon level accuracy)

A

no. of correctly predicted exons/no. of predicted exons (correct + wrong predictions)
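At the exon level, a prediction usually counts as correct only when both boundaries match an actual exon exactly. A minimal sketch under that assumption, with exons as `(start, end)` tuples and a made-up function name:

```python
def exon_accuracy(actual_exons, predicted_exons):
    """Return (sensitivity, specificity) at the exon level.

    An exon is correct only if its start and end both match exactly.
    """
    actual, predicted = set(actual_exons), set(predicted_exons)
    correct = len(actual & predicted)
    sensitivity = correct / len(actual) if actual else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity
```

A prediction that is off by a single base at one boundary counts as wrong here, which is one reason exon-level accuracy is lower than nucleotide-level accuracy.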

25
Q

Advantages of Ab initio gene prediction

A
  • Good at predicting nucleotides (>90%)
  • Improve accuracy by combining methods
  • Easier for prokaryotes because there are no introns
26
Q

Disadvantages of Ab initio gene prediction

A
  • Moderate at finding exon boundaries (70-75% correct per exon)
  • Poor at predicting complete gene structure (<50%)
  • Need to identify upstream regulatory sequences
27
Q

What did Rogic et al. (2002) publish?

A

A way to improve gene prediction accuracy by combining GenScan and HMMgene.
- OR returns regions predicted by either program (more sensitive, less specific), leading to overprediction
- AND returns regions predicted by both programs (less sensitive, more specific), leading to underprediction
- EUI (exon union-intersection), OR if above a given significance threshold, AND otherwise
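The OR and AND rules amount to a union and an intersection of two sets of predictions. The sketch below applies them to base positions, which is an assumption made for simplicity; the published method may combine whole exons instead, and EUI is not implemented here. Function names are made up for this example.

```python
def positions(exons):
    """Set of base positions covered by half-open (start, end) intervals."""
    return {pos for start, end in exons for pos in range(start, end)}

def to_intervals(pos):
    """Merge a set of positions back into sorted half-open intervals."""
    intervals = []
    for p in sorted(pos):
        if intervals and p == intervals[-1][1]:
            intervals[-1][1] = p + 1      # extend the current interval
        else:
            intervals.append([p, p + 1])  # start a new interval
    return [tuple(iv) for iv in intervals]

def combine_or(pred_a, pred_b):
    """Regions predicted by either program (more sensitive, less specific)."""
    return to_intervals(positions(pred_a) | positions(pred_b))

def combine_and(pred_a, pred_b):
    """Regions predicted by both programs (less sensitive, more specific)."""
    return to_intervals(positions(pred_a) & positions(pred_b))
```

For two overlapping predictions `(0, 10)` and `(5, 15)`, OR returns the whole span `(0, 15)` while AND keeps only the shared part `(5, 10)`, which illustrates the overprediction and underprediction noted on this card.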

28
Q

Examples of similarity-based gene prediction methods

A

AAT, GeneWise, SGP2, Rosetta

29
Q

What is SGP2?

A

Syntenic Gene Prediction. The query sequence is compared against sequences from an informant genome, and the results of the comparison are used to modify the exons predicted ab initio. It is therefore an integrated approach.

30
Q

How do we confirm putative genes?

A
  1. The gene matches one or more ESTs from the same organism.
  2. The nucleotide or translated protein sequence is similar to sequences in databases.
  3. The protein sequence has a match in a secondary databank.
  4. The gene is associated with predicted promoter sequences.
31
Q

Problems with gene finding in general

A
  • Imprecise or incomplete
  • Splicing incorrect
  • False positives
  • Failure to identify true genes
  • Doesn’t account for post-translational modifications (PTMs; e.g. ligands, glycosylation, methylation, peptide excision)