5. Genome annotation Flashcards

1
Q

Repeats & mobile elements in genome annotation - how to deal with them?

A

identify repeats to mask them (or to study them)

Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations.

approaches:
* library-based approaches: compare genome sequence to a library of known repeats
* signature-based approaches: search for signatures of transposable elements (LTRs, key structural proteins or enzymes, etc.)
* (de novo approaches: compare a genome sequence with itself; search for multiple occurrences of k-mers)
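
A minimal sketch of the de novo idea, assuming that a k-mer counts as "repeated" once it occurs at least `min_count` times and that masking means lower-casing (soft masking). The function names, `k`, and `min_count` are illustrative choices, not part of any real repeat-masking tool:

```python
from collections import Counter

def repeated_kmers(seq, k=4, min_count=3):
    """Return the set of k-mers that occur at least min_count times in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}

def soft_mask(seq, k=4, min_count=3):
    """Lower-case every position covered by a repeated k-mer (soft masking)."""
    seq = seq.upper()
    repeats = repeated_kmers(seq, k, min_count)
    masked = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in repeats:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))
```

Real genomes need much longer k-mers and tolerance for diverged copies; the toy values only keep the example readable.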

2
Q

Two types of genome annotation?

A

structural, functional

3
Q

What is structural genome annotation?

A

structural: where are the protein-coding genes?
‘Structural’ - process of identifying genes and their intron–exon structures. (paper)

what is the predicted phenotypic effect of the variant?

4
Q

What is functional genome annotation?

A

(less important for our course)

functional: what is the function of the predicted genes and genetic elements?
‘Functional’ - the process of attaching metadata, such as Gene Ontology terms, to structural annotations (paper)

5
Q

What are the different approaches to structural genome annotation?

A

approaches
* intrinsic, ab initio, de novo (only use query)
* extrinsic, homology/evidence-based (use other sequences)
* hybrid / combined / pipelines

6
Q

What are the three steps for an intrinsic approach to structural genome annotation?

A
  1. collect appropriate training data
  2. build statistical model based on the training data
  3. apply model to the newly assembled genome to predict locations of protein-coding genes
7
Q

Intrinsic approach to genome annotation:

What is appropriate training data?

A
  • genes from the species to be annotated
  • easy for “first generation” (eukaryotic) genomes
  • much more difficult for “second generation” (eukaryotic) genomes
8
Q

What statistical models can be used for intrinsic approaches?

A
  • Hidden Markov Models - important for our course
  • (Bayesian approaches)
  • (Machine Learning)
9
Q

Hidden Markov Models

How are they used in genome annotation?
Under what grouping of approaches does this belong?

A

given a DNA sequence, we want to know:
* where does it most likely contain genes?
* what probability is associated with this result?

Training data (genes from the species to be annotated) are used to build an HMM, which can then be applied to a newly assembled genome to predict the locations of protein-coding genes

an intrinsic approach
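
A toy sketch of how an HMM answers both questions, using the Viterbi algorithm on a standard two-state HMM (each state emits one nt). All probabilities below are made-up illustration values, not trained parameters, and real gene finders use far richer state sets:

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (hypothetical toy model)
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # assumed AT-rich
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # assumed GC-rich
}

def viterbi(seq):
    """Return (most likely state path, log-probability of that path)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for nt in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][nt]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]
```

Calling `viterbi("ATATGCGCGCATAT")` labels the GC-rich stretch as coding under these toy parameters; the second return value is the log-probability associated with that labelling.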

10
Q

Prokaryotic gene prediction (intrinsic)

properties of prokaryotic genome?

A

properties
* mostly intron-less genes
* average of 1000 nt per gene (ORF)
* translation start and stop codons for each gene
* some nt biases in coding vs. non-coding regions
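
These properties suggest a simple ORF scan as a starting point. The sketch below is an illustration only: it looks at the forward strand, assumes ATG as the only start codon, and uses a hypothetical `min_len` cut-off, so it will miss short or partial genes:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```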

11
Q

Prokaryotic gene prediction (intrinsic)

using these as a basis for prediction will not work for …?

A
  • small genes
  • partial sequences, incomplete genes
  • sequencing errors
12
Q

Types of HMMs for gene prediction

A

standard HMMs
* each hidden state emits one nt

generalized HMMs (important for course)
(HMMs with duration)
* each hidden state emits a string of nucleotides
can include
* one strand, both strands
* typical & atypical genes

13
Q

challenges of gene finding for eukaryotic genomes?

A
  • definition of a gene (gene signals)
  • overlapping genes
  • very long or very short genes / exons / introns
  • alternative biological processing
    • alternative splicing
    • alternative polyadenylation
    • alternative initiation of transcription
    • alternative initiation of translation
  • propagation of (annotation) errors in databases
  • sequencing errors
  • incomplete genes on short contigs/scaffolds
  • contamination
14
Q

How is splicing relevant to genome annotation?

A

Nearly all multi-exon human genes are alternatively spliced
* basic alternative splicing patterns vs
* complex alternative splicing patterns

15
Q

Challenges of using eukaryotic gene signals in intrinsic approaches to genome annotation

A

Eukaryotic gene signals show a lot of variance:
* not all genes contain the described signals
* the signals can occur outside of gene context

16
Q

What are some gene signals for Eukaryotes?

A
  • intron-exon structure
    • not all exons contain start/stop codons
    • exons are usually smaller, introns can be quite large
  • splice sites
    • donor site: GT
    • acceptor site: AG
  • transcription signals (CAP, TATA box, termination)
  • translational signals (Kozak signal, termination)
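
A small sketch of scanning for the GT/AG splice-site signals. The function names and the `min_len` parameter are hypothetical; the point is that plain dinucleotide matching yields many candidates, most of them not real splice sites:

```python
def candidate_splice_sites(seq):
    """Return 0-based positions of all GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors

def candidate_introns(seq, min_len=4):
    """Pair each GT with every downstream AG; return (start, end) with end exclusive."""
    donors, acceptors = candidate_splice_sites(seq)
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]
```

Even a 15 nt toy sequence such as `"ATGGTAAGTCCAGGT"` gives three candidate introns, which is why gene finders combine such signals with content statistics.
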
17
Q

challenges of gene finding for eukaryotic genomes

Prediction requirements for exons?

A
  • exons cannot overlap
  • adjacent exons must maintain an open reading frame (ORF)
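
One way to sketch these two requirements as a check on a predicted exon chain. This assumes forward-strand exons given as 0-based, end-exclusive coordinates and interprets "maintain an ORF" as: the joined exons have a length divisible by 3 and no in-frame stop before the last codon. The function name is hypothetical:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def valid_exon_chain(seq, exons):
    """Check that exons do not overlap and that the spliced sequence keeps an ORF."""
    exons = sorted(exons)
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        if next_start < end:
            return False  # overlapping exons
    cds = "".join(seq[s:e] for s, e in exons).upper()
    if len(cds) % 3 != 0:
        return False  # reading frame is broken
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # A stop codon is only allowed as the final codon
    return not any(c in STOP_CODONS for c in codons[:-1])
```
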
18
Q

What are some Eukaryotic content/compositional features?

A

nucleotide composition
* biases: GC, nts, dinucleotides, hexamers, …
* different in different lineages / species
* different in introns vs exons vs intergenic regions
* different in highly expressed genes
* …

example: codon usage
* codon bias: codons are not used randomly; varies by lineage & species
- e.g., Arginine: CGT, CGC, CGA, CGG, AGA, AGG
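
A minimal sketch of measuring codon bias for arginine, assuming the input is an in-frame coding sequence (the function name is illustrative):

```python
from collections import Counter

ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")

def arginine_codon_usage(cds):
    """Return the relative frequency of each arginine codon in an in-frame CDS."""
    cds = cds.upper()
    # Split into codons, ignoring any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in ARG_CODONS)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in ARG_CODONS}
```

Comparing such frequencies between species, or between coding and non-coding windows, is the kind of content statistic an intrinsic gene finder can use.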

19
Q

Some challenges for intrinsic gene finding approaches in eukaryotic genes?

A

Intrinsic gene finding approaches
* structure of eukaryotic genes (e.g., intron-exon structure)
* signals in the sequences (e.g., splice sites, transcriptional and translational signals)
* content statistics and sensors (e.g., nucleotide composition, hexamers, codon usage)

20
Q

What is not available for non-model organisms that makes intrinsic approaches to genome annotation difficult?

A

➡requires species- or lineage-specific training data
- volume & variety
- not available for many non-model organisms

21
Q

What is the name of one HMM approach to genome annotation?

What are the states? What do they emit?

A

Genscan (1997)

States: components (lengths, composition, signals) of a gene

states emit a sequence of variable length according to the state’s sequence composition

22
Q

What is sensitivity in the context of gene prediction?

A

Sensitivity: ability to include correct predictions

Sn

how many nucleotides/exons/genes does the method predict correctly?

23
Q

What is specificity in the context of gene prediction?

A

Specificity: ability to exclude incorrect predictions

Sp

how much of the prediction of nucleotides/exons/genes is true?
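
A sketch of both measures at the nucleotide level, assuming the usual gene-prediction convention Sn = TP / (TP + FN) and Sp = TP / (TP + FP) (the latter is what other fields call precision, and it matches "how much of the prediction is true?"). The function name and the use of position sets are illustrative:

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Compute nucleotide-level (Sn, Sp) from two collections of coding positions."""
    true_coding, predicted_coding = set(true_coding), set(predicted_coding)
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fn = len(true_coding - predicted_coding)   # coding but missed
    fp = len(predicted_coding - true_coding)   # predicted but not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp
```

For a true gene at positions 100-199 and a prediction at 150-229, this gives Sn = 0.5 and Sp = 0.625.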

24
Q

Explain extrinsic gene prediction in eukaryotes

What is extrinsic?

What can these approaches not identify?

A

Genes are found based on

similarity with transcripts (RNA-seq, long-read RNA-seq)
* can be used to identify exons and splicing patterns
* problems: paralogs, placing short reads, snapshot!
* preferred & very common approach
similarity with proteins
* close relative’s proteome (e.g., from UniProtKB)
* advantage: may provide information about function
* problems: domains, UTRs, lineage-specific genes?

cannot identify new genes

25
Q

Can you use general purpose sequence similarity tools for extrinsic prediction of eukaryotic genes?

A

Can’t use general purpose sequence similarity tools!
no awareness of splice sites, start/stop, reading frames, etc!

specific software required!

26
Q

What are some dynamic approaches for gene prediction?

A

Hybrid methods
choosers
combiners
pipelines

27
Q

What do hybrid methods for gene prediction do?

A

use both intrinsic and extrinsic methods to predict genes

28
Q

What do combiners do for gene prediction? give an example

A

combine independent predictions (EvidenceModeler)

29
Q

Example of a pipeline method for gene prediction?

A

e.g., MAKER

30
Q

MAKER pipeline method for gene prediction: what data is used?

A

intrinsic: newly sequenced genome (repeats masked)

extrinsic: evidence:
- RNA-seq and/or
- protein sequences from selected lineages

31
Q

MAKER pipeline method for gene prediction: What are the initial gene predictions based on?

A

homology

32
Q

MAKER pipeline method for gene prediction: steps

EXAM QUESTION

steps of a procedure combining intrinsic and extrinsic annotation for a non-model organism’s newly assembled genome without available RNA-seq data (2022)

A
  1. initial gene predictions (extrinsic)
  2. extraction of species-specific content statistics (intrinsic)
  3. generation of species-specific HMMs (intrinsic)
  4. (refined) gene predictions (intrinsic); repeat steps 2-4 about 2 times
  5. final gene predictions
33
Q

How are predicted genes evaluated?

A
  • quantify expected gene content for a given lineage
  • lineage-specific near-universal single-copy orthologs (BUSCO, https://busco.ezlab.org/)
  • how many are complete, partial, absent, duplicated?
34
Q

What is BUSCO

A

BUSCO (Benchmarking Universal Single-Copy Orthologs) scores
- look for presence/absence of highly conserved genes in assembly.
- aim: the highest possible percentage of these genes identified in the assembly
- BUSCO complete score above 95% considered good
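
A sketch of turning per-ortholog results into summary percentages. It does not call the BUSCO software; the status labels and the function name are hypothetical, and counting "duplicated" orthologs towards the complete score is an assumption about how that score is defined:

```python
from collections import Counter

def busco_summary(statuses):
    """Return the percentage of expected orthologs in each status category."""
    counts = Counter(statuses)
    total = len(statuses)
    if total == 0:
        raise ValueError("no orthologs to summarise")
    pct = {s: 100 * counts[s] / total
           for s in ("single", "duplicated", "partial", "absent")}
    # Assumed here: complete = found once (single) + found more than once (duplicated)
    pct["complete"] = pct["single"] + pct["duplicated"]
    return pct
```

With 94 single, 2 duplicated, 3 partial and 1 absent ortholog out of 100, the complete score is 96%, above the 95% rule of thumb.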

35
Q

EXAM QUESTION

Describe the main points of intrinsic and extrinsic genome annotation, and a disadvantage for each one of them. (2019)

What does intrinsic and extrinsic mean in gene prediction, explain/describe methods and name disadvantages (2020)

A

Intrinsic: just use query data (genome/s), build a statistical model, e.g., an HMM
Extrinsic: comparative - use other data, e.g., RNA sequences or proteins from other species/lineages

Disadvantage
intrinsic: need a lot of training data, not possible for non-model organisms
extrinsic: difficult to predict new genes