Gene Finding Flashcards by Asha Shinde

Describe the eukaryotic genome structure.

Genes split into exons and introns; the exons together make up the ORF
The promoter is the nucleotide sequence to which RNA polymerase binds to initiate transcription
Initiation codon (usually ATG)
Termination codons (TAA, TAG, TGA)
Interspersed repeats: LINEs, SINEs, LTR elements

What is TRY4?

A gene that is responsible for the synthesis of trypsinogen (trypsin precursor)
Discontinuous gene (split into 5 exons and 4 introns)

What are V28 and V29-1?

Discontinuous gene segments that specify part of the beta T-cell receptor protein
Not complete genes: each must be joined to other gene segments by DNA rearrangement (V(D)J recombination), not by RNA splicing
Results in a permanent genome change during cell differentiation

What does gene finding aim to find?

Which regions code for proteins
Gene start and end regions
Exon/intron boundaries
Regulatory sequences

Describe prokaryotic gene finding.

Prokaryotes have small genomes (0.5-5 Mbp) with a high coding density (>90%) and no introns. This makes gene identification relatively easy (~99% success rate). Problems include overlapping ORFs (due to the prokaryotic genome being so small), short genes missing the cutoff point (roughly 50 aa), and finding promoters.

Describe eukaryotic gene finding.

Eukaryotes have large genomes (10-120,000 Mbp) with a low coding density (<50%, 2-3% in humans), and they contain introns. This makes gene identification relatively difficult (~50% success rate). The main problems are that introns interrupt the ORFs and that most of the sequence is non-coding.

Methods of gene finding

Ab initio methods
Similarity-based methods
Integrated approaches

What are Ab initio methods?

Making predictions based on typical gene features such as splice signals and sequence composition. Regions to look for include initial 5’ exons, internal exons, and final 3’ exons.

What are similarity-based methods?

Predict genes by adding information from the whole genome sequence from related species
The similarity between the query sequence and a known coding sequence is used to infer gene structure
Allows predicted exon sequences located by ORF scanning to be tested for functionality
Results are influenced by the availability of close homologues

Problems with similarity-based methods

Genes are expressed differently in different tissues, so sequencing mRNA from one arbitrary tissue is not enough; transcripts must be sequenced from many tissues under many conditions
Transcripts of the same gene may differ because of alternative splicing

ORF scanning in prokaryotes

Search DNA for start and stop codons (may end up with false positives)
Search for promoters
Since most genes are >50 codons, a minimum length cutoff of ~100 codons is applied; this removes many false positives but also loses genuine short genes
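
The scan above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it assumes ATG as the only start codon, scans the forward strand only, and skips the promoter search. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for each ATG..stop ORF on the forward strand.

    start/end are 0-based, end-exclusive, and include the stop codon.
    ORFs with fewer than min_codons codons before the stop are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

Lowering `min_codons` recovers short genes at the cost of more false positives, which is the trade-off the cutoff is meant to control.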

ORF scanning in eukaryotes

Search DNA sequence for start and stop codons
Search for promoters
Search for intron/exon boundaries (spliced mRNA does not contain introns)
Since most genes are >50 codons, a minimum length cutoff of ~100 codons is applied

Codon usage in genomes

Codons are used unequally; this is a universal feature of genomes
You can use this to differentiate coding and non-coding regions (e.g. humans use GTG 4x more than GTA for valine)
Real exons are expected to show codon bias but introns should not
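
Codon bias can be measured by counting codons and comparing the synonymous ones. The sketch below is illustrative only; the helper names `codon_counts` and `relative_usage` are hypothetical, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter

VALINE = ("GTT", "GTC", "GTA", "GTG")  # the four synonymous valine codons

def codon_counts(cds):
    """Count in-frame codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def relative_usage(counts, synonymous):
    """Fraction of each codon among a set of synonymous codons."""
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {c: 0.0 for c in synonymous}
    return {c: counts[c] / total for c in synonymous}
```

For example, `relative_usage(codon_counts(cds), VALINE)` on a real exon would be expected to show a skew towards some valine codons, whereas intron sequence read in the same way should look closer to uniform.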

ORF scanning and moving windows

Only sequence information is used to identify coding exons, by integrating coding statistics. A window is moved along the sequence, the likelihood that each triplet lies in a coding region is calculated, and the score is plotted (above zero: likely coding; below zero: unlikely).
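
One common way to obtain a score that is positive for likely coding regions and negative otherwise is a log-likelihood ratio of codon frequencies, averaged over a sliding window. The sketch below assumes that scoring scheme; the function names, the window size, and any frequency tables passed in are hypothetical, not values from the lecture.

```python
import math

def codon_log_odds(coding_freq, noncoding_freq):
    """log(P(codon | coding) / P(codon | non-coding)) for shared codons."""
    return {
        codon: math.log(coding_freq[codon] / noncoding_freq[codon])
        for codon in coding_freq
        if codon in noncoding_freq
    }

def window_scores(seq, log_odds, window_codons=3, frame=0):
    """Mean log-odds score for each window of consecutive codons."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    per_codon = [log_odds.get(c, 0.0) for c in codons]  # unknown codons score 0
    return [
        sum(per_codon[i:i + window_codons]) / window_codons
        for i in range(len(per_codon) - window_codons + 1)
    ]
```

Plotting the returned list against window position gives the kind of graph described above: stretches above zero are candidate coding regions.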

ORF scanning: exon/intron boundaries

Exon-intron boundaries have distinctive sequence features.
- Upstream boundary: invariant GT at the start of the intron, within a consensus sequence
- Downstream boundary: a run of T or C, any nucleotide, then CAG (ending in the invariant AG)
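
These boundary signals can be searched for with regular expressions. The sketch below is a simplification: it treats every GT as a candidate upstream boundary, and it assumes a tract of six T/C bases before the downstream CAG (the `tract_len` default is an assumption, as are the function names). Real splice-site models weight each position rather than requiring an exact match.

```python
import re

def donor_sites(seq):
    """0-based positions of every GT dinucleotide (candidate intron starts)."""
    return [m.start() for m in re.finditer(r"(?=GT)", seq.upper())]

def acceptor_sites(seq, tract_len=6):
    """0-based positions of the first base after a (T/C)-tract + N + CAG.

    Each returned position is a candidate start of the downstream exon.
    """
    pattern = re.compile(r"(?=[CT]{%d}[ACGT]CAG)" % tract_len)
    return [m.start() + tract_len + 4 for m in pattern.finditer(seq.upper())]
```

Pairing a donor position `d` with a later acceptor position `a` gives a candidate intron `seq[d:a]`, which starts with GT and ends with AG.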

ORF scanning: upstream regulatory sequences

Locate where genes begin using distinctive sequence features (e.g. recognition signals for DNA-binding proteins). Regulatory sequences are variable and therefore difficult to incorporate into gene prediction algorithms.

Best Ab initio methods

Based on hidden Markov models (HMMs), a machine learning approach that encodes sequences in a statistical framework
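
To illustrate the idea of an HMM labelling a sequence, here is a toy two-state model (coding vs non-coding) decoded with the Viterbi algorithm. All probabilities are invented for illustration, and the model is far simpler than the ones used by real gene finders, which have separate states for exons, introns, and intergenic regions.

```python
import math

# Toy parameters (hypothetical): coding regions are assumed slightly GC-rich.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(seq):
    """Most probable state path for seq (A/C/G/T only) under the toy HMM."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Working in log space avoids numerical underflow on long sequences, which matters once the input is more than a few hundred bases.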

Examples of Ab initio methods

GENSCAN, HMMgene, GeneMark

What is GenScan?

GenScan identifies complete intron/exon structures of genes in genomic DNA and can predict multiple genes, both partial and complete. It uses an HMM to model gene structure, has separate HMMs for exons, introns, and intergenic regions, and uses different parameters for regions with different GC content.

P values in GenScan

P is the probability that the exon is correct.
When P>0.99, the exon is almost always exactly correct.
0.50<=P<=0.99, the exon is correct most of the time.
P<0.50, not reliable
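
The three bands can be written as a small helper for filtering predicted exons. The function name and the band labels are hypothetical; only the thresholds come from the card above.

```python
def exon_reliability(p):
    """Map an exon probability P to one of the three bands listed above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must be between 0 and 1")
    if p > 0.99:
        return "high"        # P > 0.99
    if p >= 0.50:
        return "moderate"    # 0.50 <= P <= 0.99
    return "unreliable"      # P < 0.50
```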

Sensitivity (nucleotide level accuracy)

no. of correctly predicted coding nucleotides/no. of actual coding nucleotides, i.e. TP/(TP + FN)

Specificity (nucleotide level accuracy)

no. of correctly predicted coding nucleotides/no. of predicted coding nucleotides, i.e. TP/(TP + FP)

Sensitivity (exon level accuracy)

true predictions/(true predictions + missed exons), i.e. correct exons/actual exons

Specificity (exon level accuracy)

true prediction/(true prediction + false prediction)
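
The accuracy measures can be computed directly from exon coordinates. The sketch below assumes the conventional definitions: at the exon level an exon counts as correct only if both boundaries match exactly, and at the nucleotide level sensitivity is TP/(TP + FN) and specificity is TP/(TP + FP). Exons are given as hypothetical (start, end) tuples, 0-based and end-exclusive.

```python
def exon_level_accuracy(predicted, actual):
    """(sensitivity, specificity) counting only exactly matching exons."""
    predicted, actual = set(predicted), set(actual)
    correct = predicted & actual
    sensitivity = len(correct) / len(actual) if actual else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity

def nucleotide_level_accuracy(predicted, actual):
    """(sensitivity, specificity) counting individual coding positions."""
    pred_nt = {i for start, end in predicted for i in range(start, end)}
    act_nt = {i for start, end in actual for i in range(start, end)}
    tp = len(pred_nt & act_nt)  # coding positions predicted as coding
    sensitivity = tp / len(act_nt) if act_nt else 0.0
    specificity = tp / len(pred_nt) if pred_nt else 0.0
    return sensitivity, specificity
```

An exon with one boundary slightly off scores zero at the exon level but still contributes most of its length at the nucleotide level, which is why nucleotide-level figures are usually the higher of the two.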

Advantages of Ab initio gene prediction

- Good at predicting nucleotides (>90%)
- Improve accuracy by combining methods
- Easier for prokaryotes because there are no introns

Disadvantages of Ab initio gene prediction

- Moderate at finding exon boundaries (70-75% correct per exon)
- Poor at predicting complete gene structure (<50%)
- Need to identify upstream regulatory sequences

What did Rogic et al. (2002) publish?

A way to improve gene prediction accuracy by combining GenScan and HMMgene.
- OR returns regions predicted by either program (more sensitive, less specific), leading to overprediction
- AND returns regions predicted by both programs (less sensitive, more specific), leading to underprediction
- EUI (exon union-intersection) applies OR if above a given significance threshold, AND otherwise
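
The three combination rules can be sketched with set operations. This is a simplified reading of the card, not the published method: exons are matched only when their coordinates are identical, each program's output is a hypothetical dict mapping (start, end) to an exon probability, and the 0.75 threshold is an arbitrary placeholder.

```python
def combine_or(preds_a, preds_b):
    """Exons predicted by either program (union)."""
    return set(preds_a) | set(preds_b)

def combine_and(preds_a, preds_b):
    """Exons predicted by both programs (intersection)."""
    return set(preds_a) & set(preds_b)

def combine_eui(preds_a, preds_b, threshold=0.75):
    """Union for exons at or above the threshold, intersection for the rest."""
    confident = {
        exon
        for preds in (preds_a, preds_b)
        for exon, p in preds.items()
        if p >= threshold
    }
    return confident | combine_and(preds_a, preds_b)
```

The EUI result always lies between the AND and OR results, which is how it trades off the underprediction of one against the overprediction of the other.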

Examples of similarity-based gene prediction methods

AAT, GeneWise, SGP2, Rosetta

What is SGP2?

Syntenic Gene Prediction. The query sequence is compared against sequences from the informant genome, and the results of the comparison are used to modify exons predicted by ab initio prediction. It uses an integrated approach.

How do we confirm putative genes?

1. The gene matches one or more ESTs from the same organism.
2. The nucleotide or translated protein sequence is similar to sequences in databases.
3. The protein sequence has a match in a secondary databank.
4. The gene is associated with predicted promoter sequences.

Problems with gene finding in general

- Predictions are imprecise or incomplete
- Incorrect splicing
- False positives
- Failure to identify true genes
- Doesn't account for post-translational modifications (PTMs: ligands, glycosylation, methylation, peptide excision)

Week 2 Lecture 1 (31 cards)