5. Genome annotation Flashcards

Question 1

Q

Repeats & mobile elements in genome annotation - how to deal with them?

Answer

A

identify repeats to mask them (or to study them)

Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations.

approaches:
* library-based approaches: compare genome sequence to a library of known repeats
* signature-based approaches: search for signatures of transposable elements (LTRs, key structural proteins or enzymes, etc.)
* (de novo approaches: compare a genome sequence with itself; search for multiple occurrences of k-mers)
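The de novo idea (multiple occurrences of k-mers within the genome itself) can be sketched in a few lines. This is a toy illustration only, not how any real repeat finder works internally; the function names, the value of k, and the count threshold are assumptions chosen for the example.

```python
from collections import Counter


def repeated_kmers(seq: str, k: int = 8, min_count: int = 3) -> dict[str, int]:
    """Return every k-mer that occurs at least min_count times in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def soft_mask(seq: str, kmers: set[str], k: int) -> str:
    """Lower-case every position covered by one of the given k-mers."""
    seq = seq.upper()
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in kmers:
            for j in range(i, i + k):
                masked[j] = masked[j].lower()
    return "".join(masked)


# Hypothetical toy genome: the unit GATTACA occurs three times.
genome = "CCC" + "GATTACA" + "TT" + "GATTACA" + "GG" + "GATTACA" + "AAC"
repeats = repeated_kmers(genome, k=7, min_count=3)
masked_genome = soft_mask(genome, set(repeats), k=7)
```

Soft-masking (lower case) keeps the sequence available for later study of the repeats, while alignment tools can be told to ignore lower-case regions.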

Question 2

Q

Two types of genome annotation?

Answer

A

structural, functional

Question 3

Q

What is structural genome annotation?

Answer

A

structural: where are the protein-coding genes?
‘Structural’ - process of identifying genes and their intron–exon structures. (paper)

what is the predicted phenotypic effect of the variant?

Question 4

Q

What is functional genome annotation?

Answer

A

(less important for our course)

functional: what is the function of the predicted genes and genetic elements?
‘Functional’ - the process of attaching metadata such as Gene Ontology terms to structural annotations (paper)

Question 5

Q

What are the different approaches to structural genome annotation?

Answer

A

approaches
* intrinsic, ab initio, de novo (only use query)
* extrinsic, homology/evidence-based (use other sequences)
* hybrid / combined / pipelines

Question 6

Q

What are the three steps for an intrinsic approach to structural genome annotation?

Answer

A

collect appropriate training data
build statistical model based on the training data
apply model to the newly assembled genome to predict locations of protein-coding genes

Question 7

Q

Intrinsic approach to genome annotation:

What is appropriate training data?

Answer

A

genes from the species to be annotated
easy for “first generation” (eukaryotic) genomes
much more difficult for “second generation” (eukaryotic) genomes

Question 8

Q

What statistical models can be used for intrinsic approaches?

Answer

A

Hidden Markov Models - important for our course
(Bayesian approaches)
(Machine Learning)

Question 9

Q

Hidden Markov Models

How are they used in genome annotation?
Under what grouping of approaches does this belong?

Answer

A

given a DNA sequence, we want to know:
* where does it most likely contain genes?
* what probability is associated with this result?

Training data (genes from the species to be annotated) are used to build an HMM, which can then be applied to a newly assembled genome to predict the locations of protein-coding genes

an intrinsic approach
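Both questions ("where are the genes most likely?" and "with what probability?") can be illustrated with the Viterbi algorithm on a two-state HMM. This is a minimal sketch: the states, and all start, transition, and emission probabilities, are invented for illustration and are not trained on real genes.

```python
import math

STATES = ("coding", "noncoding")
# Toy parameters, invented for illustration (a real model is trained on known genes).
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq: str) -> tuple[list[str], float]:
    """Return the most likely hidden-state path and its log probability."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for nt in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][nt]))
            pointers[s] = best_prev
        back.append(pointers)
        score = new_score
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):  # trace back from the final state
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


path, log_prob = viterbi("ATATAT" + "GCGCGCGC" + "ATATAT")
```

With these toy parameters the GC-rich middle stretch is labelled "coding" and the AT-rich flanks "noncoding"; log_prob is the log probability of that single best path.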

Question 10

Q

Prokaryotic gene prediction (intrinsic)

properties of prokaryotic genome?

Answer

A

properties
* mostly intron-less genes
* average of 1000 nt per gene (ORF)
* translation start and stop codons for each gene
* some nt biases in coding vs. non-coding regions
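The start/stop-codon property suggests a naive ORF scan. A minimal sketch, assuming the forward strand only, ATG as the only start codon, and a hypothetical min_len cutoff; real prokaryotic gene finders also use composition biases and both strands.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence with one short ORF of 18 nt.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
short_orfs = find_orfs(toy, min_len=18)
default_orfs = find_orfs(toy)
```

With the default cutoff the 18 nt ORF is discarded, which shows how a length threshold makes small genes invisible to this kind of scan.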

Question 11

Q

Prokaryotic gene prediction (intrinsic)

using these as a basis for prediction will not work for …?

Answer

A

small genes
partial sequences, incomplete genes
sequencing errors

Question 12

Q

Types of HMMs for gene prediction

Answer

A

standard HMMs
* each hidden state emits one nt

generalized HMMs (important for course)
(HMMs with duration)
* each hidden state emits a string of nucleotides
can include
* one strand, both strands
* typical & atypical genes

Question 13

Q

challenges of gene finding for eukaryotic genomes?

Answer

A

definition of a gene (gene signals)
overlapping genes
very long or very short genes / exons / introns
alternative biological processing
- alternative splicing
- alternative polyadenylation
- alternative initiation of transcription
- alternative initiation of translation
propagation of (annotation) errors in databases
sequencing errors
incomplete genes on short contigs/scaffolds
contamination

Question 14

Q

How is splicing relevant to genome annotation?

Answer

A

Nearly all multi-exon human genes are alternatively spliced
* basic alternative splicing patterns vs
* complex alternative splicing patterns

Question 15

Q

Challenges of using eukaryotic gene signals in intrinsic approaches to genome annotation

Answer

A

Eukaryotic gene signals show a lot of variance:
* not all genes contain the described signals
* the signals can also occur outside of a gene context

Question 16

Q

What are some gene signals for Eukaryotes?

Answer

A

intron-exon structure
- not all exons contain start/stop codons
- exons are usually smaller, introns can be quite large
splice sites
- donor site: GT
- acceptor site: AG
transcription signals (CAP, TATA box, termination)
translational signals (Kozak signal, termination)
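The GT/AG splice-site signals are only two nucleotides long, so a plain scan finds many candidates. A minimal sketch with a hypothetical helper name and a made-up toy sequence:

```python
def candidate_splice_sites(seq: str) -> tuple[list[int], list[int]]:
    """Return 0-based start positions of every GT (donor) and AG (acceptor) dinucleotide."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Toy sequence: exon "ATG", intron "GTAAGTCCCTTTCAG", exon "GCT".
donors, acceptors = candidate_splice_sites("ATGGTAAGTCCCTTTCAGGCT")
```

Even in this 21 nt example there are two GT and two AG candidates for a single intron, so the dinucleotides alone cannot decide where the real splice sites are.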

Question 17

Q

challenges of gene finding for eukaryotic genomes

Prediction requirements for exons?

Answer

A

exons cannot overlap
adjacent exons must maintain an open reading frame (ORF)
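Both requirements can be checked mechanically on a proposed exon set. A minimal sketch, assuming forward-strand coding exons given as half-open (start, end) coordinates; the function name and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def exons_are_consistent(seq: str, exons: list[tuple[int, int]]) -> bool:
    """Check that exons do not overlap and that the spliced exons form one ORF."""
    exons = sorted(exons)
    # Requirement 1: no two exons overlap.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start < prev_end:
            return False
    # Requirement 2: the joined exons keep the reading frame open.
    cds = "".join(seq[s:e] for s, e in exons).upper()
    if not cds or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # A stop codon is allowed only as the final codon.
    return not any(c in STOP_CODONS for c in codons[:-1])


# Toy gene: exon 1 (0-6), intron (6-18), exon 2 (18-24).
gene = "ATGGCG" + "GTAAGTTTTCAG" + "GCTTAA"
ok = exons_are_consistent(gene, [(0, 6), (18, 24)])
frame_shifted = exons_are_consistent(gene, [(0, 7), (18, 24)])
overlapping = exons_are_consistent(gene, [(0, 6), (4, 24)])
```

Shifting an exon boundary by a single nucleotide breaks the frame, which is why exon predictions cannot be made independently of each other.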

Question 18

Q

What are some Eukaryotic content/compositional features?

Answer

A

nucleotide composition
* biases: GC, nts, dinucleotides, hexamers, …
* different in different lineages / species
* different in introns vs exons vs intergenic regions
* different in highly expressed genes
* …

example: codon usage
* codon bias: codons are not used randomly; varies by lineage & species
- e.g., Arginine: CGT, CGC, CGA, CGG, AGA, AGG
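Codon bias for arginine can be measured by counting how often each of its six codons is used in a coding sequence. A minimal sketch; the function names and the toy coding sequence are made up for illustration.

```python
from collections import Counter

ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons; an incomplete trailing codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def arginine_usage(cds: str) -> dict[str, float]:
    """Return the fraction of arginine codons contributed by each synonymous codon."""
    counts = codon_counts(cds)
    total = sum(counts[c] for c in ARG_CODONS)
    if total == 0:
        return {}
    return {c: counts[c] / total for c in ARG_CODONS}


# Toy CDS: ATG CGT CGT AGA CGT TAA
usage = arginine_usage("ATGCGTCGTAGACGTTAA")
```

Run over all known genes of a species, a table like this is one of the content statistics an intrinsic gene finder can use to separate coding from non-coding sequence.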

Question 19

Q

Some challenges for intrinsic gene finding approaches in eukaryotic genes?

Answer

A

Intrinsic gene finding approaches
* structure of eukaryotic genes (e.g., intron-exon structure)
* signals in the sequences (e.g., splice sites, transcriptional and translational signals)
* content statistics and sensors (e.g., nucleotide composition, hexamers, codon usage)

Question 20

Q

What is not available for non-model organisms that makes intrinsic approaches to genome annotation difficult?

Answer

A

➡requires species- or lineage-specific training data
- volume & variety
- not available for many non-model organisms

Question 21

Q

What is the name of one HMM approach to genome annotation?

What are the states? What do they emit?

Answer

A

Genscan (1997)

States: components (lengths, composition, signals) of a gene

states emit a sequence of variable length according to the state’s sequence composition
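The idea of a state emitting a whole variable-length string can be sketched as follows. This is not Genscan's actual model: the two states, their length ranges, and their nucleotide compositions are invented purely to illustrate the principle.

```python
import random

# Hypothetical toy states: each has a length (duration) range and a nucleotide composition.
TOY_STATES = {
    "exon": {"length": (6, 12),
             "composition": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}},
    "intron": {"length": (10, 30),
               "composition": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}},
}


def emit_segment(state: str, rng: random.Random) -> str:
    """Emit one string: draw a length, then draw nucleotides from the state's composition."""
    low, high = TOY_STATES[state]["length"]
    length = rng.randint(low, high)
    nucleotides, weights = zip(*TOY_STATES[state]["composition"].items())
    return "".join(rng.choices(nucleotides, weights=weights, k=length))


rng = random.Random(0)
exon_segment = emit_segment("exon", rng)
intron_segment = emit_segment("intron", rng)
```

In a standard HMM each call would produce a single nucleotide; here one visit to a state produces an entire exon or intron, so realistic length distributions can be modelled directly.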

Question 22

Q

What is sensitivity in the context of gene prediction?

Answer

A

Sensitivity: ability to include correct predictions

S_n

how many nucleotides/exons/genes does the method predict correctly?

Question 23

Q

What is specificity in the context of gene prediction?

Answer

A

Specificity: ability to exclude incorrect predictions

S_p

how much of the prediction of nucleotides/exons/genes is true?
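At the nucleotide level, S_n and S_p can be computed from sets of coding positions. This sketch assumes the convention common in gene-prediction benchmarks, where S_n = TP / (TP + FN) and S_p = TP / (TP + FP) (the quantity other fields call precision), which matches the wording "how much of the prediction is true". The function name and the coordinates are hypothetical.

```python
def nucleotide_accuracy(true_coding: set[int], predicted_coding: set[int]) -> tuple[float, float]:
    """Return (S_n, S_p) for sets of coding positions."""
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fn = len(true_coding - predicted_coding)   # coding but missed
    fp = len(predicted_coding - true_coding)   # predicted coding but wrong
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Hypothetical example: true gene at 100-199, prediction at 150-209.
sn, sp = nucleotide_accuracy(set(range(100, 200)), set(range(150, 210)))
```

Here half of the true gene is found (S_n = 0.5), while 50 of the 60 predicted nucleotides are correct (S_p is about 0.83); the same counting can be done per exon or per gene.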

Question 24

Q

Explain extrinsic gene prediction in eukaryotes

What is extrinsic?

What can these approaches not identify?

Answer

A

Genes are found based on

similarity with transcripts (RNA-seq, long-read RNA-seq)
* can be used to identify exons and splicing patterns
* problems: paralogs, placing short reads, snapshot! (only genes expressed in the sampled tissue/condition are captured)
* preferred & very common approach
similarity with proteins
* close relative’s proteome (e.g., from UniProtKB)
* advantage: may provide information about function
* problems: domains, UTRs, lineage-specific genes?

cannot identify new genes

Question 25

Q

Can you use general purpose sequence similarity tools for extrinsic prediction of eukaryotic genes?

Answer

A

Can’t use general purpose sequence similarity tools!
no awareness of splice sites, start/stop, reading frames, etc!

specific software required!

Question 26

Q

What are some dynamic approaches for gene prediction?

Answer

A

Hybrid methods
choosers
combiners
pipelines

Question 27

Q

What do hybrid methods for gene prediction do?

Answer

A

use both intrinsic and extrinsic methods to predict genes

Question 28

Q

What do combiners do for gene prediction? give an example

Answer

A

combine independent predictions (EvidenceModeler)

Question 29

Q

Example of a pipeline method for gene prediction?

Answer

A

Maker

Question 30

Q

Maker pipeline method for gene prediction: what data is used?

Answer

A

intrinsic: newly sequenced genome (repeats masked)

extrinsic: evidence:
- RNAseq and/or
- protein sequences from selected lineages

Question 31

Q

Maker pipeline method for gene prediction: What are the initial gene predictions based on?

Answer

A

extrinsic evidence (RNA-seq and/or protein sequences)

Question 32

Q

Maker pipeline method for gene prediction: steps

EXAM QUESTION

steps of a procedure combining intrinsic and extrinsic annotation for a non-model organism’s newly assembled genome without available RNA-seq data (2022)

Answer

A

1. initial gene predictions (extrinsic)
2. extraction of species-specific content statistics (intrinsic)
3. generation of species-specific HMMs (intrinsic)
4. (refined) gene predictions (intrinsic); repeat steps 2-4 about 2 times
5. final gene predictions

Question 33

Q

How are predicted genes evaluated?

Answer

A

quantify expected gene content for a given lineage
lineage-specific near-universal single-copy orthologs (BUSCO, https://busco.ezlab.org/)
how many are complete, partial, absent, duplicated?

Question 34

Q

What is BUSCO

Answer

A

BUSCO (Benchmarking Universal Single-Copy Orthologs) scores
- look for presence/absence of highly conserved genes in the assembly
- aim: the highest possible percentage of these genes identified in the assembly
- a BUSCO complete score above 95% is considered good
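A BUSCO-style summary is just a set of percentages over the expected ortholog set. A minimal sketch, assuming the usual categories complete (C, split into single-copy S and duplicated D), fragmented (F), and missing (M); the function name and the counts are made up, and the 95% threshold is the rule of thumb stated above.

```python
def busco_summary(single: int, duplicated: int, fragmented: int, missing: int) -> dict[str, float]:
    """Return percentages per category, rounded to one decimal place."""
    total = single + duplicated + fragmented + missing

    def pct(n: int) -> float:
        return round(100 * n / total, 1)

    return {
        "C": pct(single + duplicated),  # complete = single-copy + duplicated
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
    }


# Hypothetical counts for a lineage set of 1000 orthologs.
summary = busco_summary(single=950, duplicated=20, fragmented=10, missing=20)
is_good = summary["C"] > 95.0
```

A high duplicated percentage alongside a high complete score is worth a second look, since the orthologs are expected to be single-copy.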

Question 35

Q

EXAM QUESTION

Describe the main points of intrinsic and extrinsic genome annotation, and a disadvantage for each one of them. (2019)

What do intrinsic and extrinsic mean in gene prediction? Explain/describe the methods and name disadvantages. (2020)

Answer

A

Intrinsic: use only the query data (the genome itself); build a statistical model, e.g., an HMM
Extrinsic: comparative; use other data, e.g., RNA sequences or proteins from other species/lineages

Disadvantages
intrinsic: needs a lot of species- or lineage-specific training data, which is not available for many non-model organisms
extrinsic: difficult to predict new genes (no similar sequence to compare against)
