5. Genome annotation Flashcards
Repeats & mobile elements in genome annotation - how to deal with them?
identify repeats to mask them (or to study them)
Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations.
approaches:
* library-based approaches: compare genome sequence to a library of known repeats
* signature-based approaches: search for signatures of transposable elements (LTR, key structural proteins or enzymes, etc.)
* (de novo approaches: compare a genome sequence with itself; search for multiple occurrences of k-mers)
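A minimal sketch of the de novo idea (counting k-mers that occur more than once in the sequence itself). The function name `repeated_kmers`, the toy sequence, and the values of `k` and `min_count` are illustrative assumptions, not taken from any repeat-finding tool:

```python
from collections import Counter

def repeated_kmers(seq, k, min_count=2):
    """Return the k-mers that occur at least min_count times in seq."""
    seq = seq.upper()
    # Slide a window of width k over the sequence and count every k-mer.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# Toy example: the motif ACGTACG appears twice in this made-up sequence.
genome = "ACGTACGTTTGACGTACGA"
print(repeated_kmers(genome, k=5))
```

Real de novo repeat finders work on whole genomes with much larger k and additional clustering steps; this only shows the counting principle.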
Two types of genome annotation?
structural, functional
What is structural genome annotation?
structural: where are the protein-coding genes?
‘Structural’ - process of identifying genes and their intron–exon structures. (paper)
what is the predicted phenotypic effect of the variant?
What is functional genome annotation?
(less important for our course)
functional: what is the function of the predicted genes and genetic elements?
‘Functional’ - process of attaching metadata such as Gene Ontology terms to structural annotations (paper)
What are the different approaches to structural genome annotation?
approaches
* intrinsic, ab initio, de novo (only use query)
* extrinsic, homology/evidence-based (use other sequences)
* hybrid / combined / pipelines
What are the three steps for an intrinsic approach to structural genome annotation?
- collect appropriate training data
- build statistical model based on the training data
- apply model to the newly assembled genome to predict locations of protein-coding genes
Intrinsic approach to genome annotation:
What is appropriate training data?
- genes from the species to be annotated
- easy for “first generation” (eukaryotic) genomes
- much more difficult for “second generation” (eukaryotic) genomes
What statistical models can be used for intrinsic approaches?
- Hidden Markov Models - important for our course
- (Bayesian approaches)
- (Machine Learning)
Hidden Markov Models
How are they used in genome annotation?
Under what grouping of approaches does this belong?
given a DNA sequence, we want to know:
* where does it most likely contain genes?
* what probability is associated with this result?
Training data (genes from species to be annotated) used to build HMM, which can then be applied to a newly assembled genome to predict location of protein-coding genes
an intrinsic approach
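The two questions above (most likely location of genes, and the associated probability) correspond to Viterbi decoding of an HMM. A minimal sketch with a standard two-state HMM (N = non-coding, C = coding), assuming invented start, transition and emission probabilities; in practice these would be estimated from training data, and real gene finders use many more states:

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its log probability) for a non-empty seq."""
    # First column: start probability times emission of the first nucleotide.
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = []
    for nt in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best previous state for ending in state s at this position.
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            col[s] = V[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][nt])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1], V[-1][last]

# Invented parameters: coding regions are assumed to be slightly GC-rich.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

path, logp = viterbi("ATATGCGCGCGCATAT", STATES, START, TRANS, EMIT)
print("".join(path), logp)
```

Log probabilities are used so that long sequences do not underflow to zero.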
Prokaryotic gene prediction (intrinsic)
properties of prokaryotic genome?
properties
* mostly intron-less genes
* average of 1000 nt per gene (ORF)
* translation start and stop codons for each gene
* some nt biases in coding vs. non-coding regions
Prokaryotic gene prediction (intrinsic)
using these as a basis for prediction will not work for …?
- small genes
- partial sequences, incomplete genes
- sequencing errors
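The prokaryotic properties (no introns, one start and one stop codon per gene) suggest a simple ORF scan. A minimal sketch, assuming the forward strand only, ATG as the only start codon, and an arbitrary `min_len` cutoff; the function name `find_orfs` is illustrative:

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=300):
    """Return (start, end) of ATG...stop ORFs on the forward strand (end exclusive)."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:   # keep only sufficiently long ORFs
                    orfs.append((start, i + 3))
                start = None                   # stop codon closes the ORF
    return orfs

toy = "CCATGAAATTTTAGCC"
print(find_orfs(toy, min_len=9))    # the short toy ORF is reported
print(find_orfs(toy, min_len=300))  # with a realistic cutoff it is discarded
```

The length cutoff shows why small genes are missed, and an ORF that lacks its start or stop codon (partial sequence, or a sequencing error that creates or destroys a stop) is never reported.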
Types of HMMs for gene prediction
standard HMMs
* each hidden state emits one nt
generalized HMMs (important for course)
(HMMs with duration)
* each hidden state emits a string of nucleotides
can include
* one strand, both strands
* typical & atypical genes
challenges of gene finding for eukaryotic genomes?
- definition of a gene (gene signals)
- overlapping genes
- very long or very short genes / exons / introns
- alternative biological processing
- alternative splicing
- alternative polyadenylation
- alternative initiation of transcription
- alternative initiation of translation
- propagation of (annotation) errors in databases
- sequencing errors
- incomplete genes on short contigs/scaffolds
- contamination
How is splicing relevant to genome annotation?
Nearly all multi-exon human genes are alternatively spliced
* basic alternative splicing patterns vs
* complex alternative splicing patterns
Challenges of using eukaryotic gene signals in intrinsic approaches to genome annotation
Eukaryotic gene signals vary a lot:
- not all genes contain the described signals
- the signals can occur outside of gene context
What are some gene signals for Eukaryotes?
- intron-exon structure
- not all exons contain start/stop codons
- exons are usually smaller, introns can be quite large
- splice sites
- donor site: GT
- acceptor site: AG
- transcription signals (CAP, TATA box, termination)
- translational signals (Kozak signal, termination)
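A minimal sketch of scanning for the canonical splice-site dinucleotides (GT donor, AG acceptor). The function name `candidate_introns` and the length limits are illustrative assumptions; the point is that these two-letter signals are so short that they also occur by chance outside of genes, so they cannot be used alone:

```python
def candidate_introns(seq, min_len=4, max_len=50):
    """Return (start, end) spans that begin with GT and end with AG (end exclusive)."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    # Pair every donor with every downstream acceptor within the length limits.
    return [(d, a + 2) for d in donors for a in acceptors
            if min_len <= a + 2 - d <= max_len]

toy = "AAGTCCCCAGTT"
for start, end in candidate_introns(toy):
    print(start, end, toy[start:end])
```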
challenges of gene finding for eukaryotic genomes
Prediction requirements for exons?
- exons cannot overlap
- adjacent exons must maintain an open reading frame (ORF)
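A minimal sketch of checking these two requirements for a predicted exon chain, assuming forward-strand coordinates, half-open `(start, end)` intervals that cover only the coding part of each exon, and the standard stop codons. The function name `valid_exon_chain` is illustrative:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def valid_exon_chain(seq, exons):
    """Check that exons do not overlap and that the spliced CDS keeps an ORF."""
    exons = sorted(exons)
    # Requirement 1: consecutive exons must not overlap.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        if start2 < end1:
            return False
    # Requirement 2: the spliced coding sequence must stay in frame,
    # with no stop codon before the last codon.
    cds = "".join(seq[s:e] for s, e in exons).upper()
    if len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return not any(c in STOP_CODONS for c in codons[:-1])

# Toy gene: exon "ATGGC", intron "GTAAAG", exon "ATAA" -> CDS ATG GCA TAA
toy = "ATGGCGTAAAGATAA"
print(valid_exon_chain(toy, [(0, 5), (11, 15)]))
```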
What are some Eukaryotic content/compositional features?
nucleotide composition
* biases: GC, nts, dinucleotides, hexamers, …
* different in different lineages / species
* different in introns vs exons vs intergenic regions
* different in highly expressed genes
* …
example: codon usage
* codon bias: codons are not used randomly; varies by lineage & species
- e.g., Arginine: CGT, CGC, CGA, CGG, AGA, AGG
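A minimal sketch of measuring codon bias for the six arginine codons listed above, assuming an in-frame coding sequence as input. The function name `codon_fractions` and the toy sequence are illustrative:

```python
from collections import Counter

ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")

def codon_fractions(cds, codons=ARG_CODONS):
    """Return the relative usage of each synonymous codon in an in-frame CDS."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    # Without bias each arginine codon would be used about 1/6 of the time.
    return {c: (counts[c] / total if total else 0.0) for c in codons}

toy_cds = "ATGCGTCGTAGATAA"  # ATG CGT CGT AGA TAA
print(codon_fractions(toy_cds))
```

Computed over many known genes of a species, such fractions are one of the content statistics an intrinsic gene finder can be trained on.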
Some challenges for intrinsic gene finding approaches in eukaryotic genes?
Intrinsic gene finding approaches
* structure of eukaryotic genes (e.g., intron-exon structure)
* signals in the sequences (e.g., splice sites, transcriptional and translational signals)
* content statistics and sensors (e.g., nucleotide composition, hexamers, codon usage)
What is not available for non-model organisms that makes intrinsic approaches to genome annotation difficult?
➡requires species- or lineage-specific training data
- volume & variety
- not available for many non-model organisms
What is the name of one HMM approach to genome annotation?
What are the states? What do they emit?
Genscan (1997)
States: components (lengths, composition, signals) of a gene
states emit a sequence of variable length according to the state’s sequence composition
What is sensitivity in the context of gene prediction?
Sensitivity: ability to include correct predictions
Sn
how many nucleotides/exons/genes does the method predict correctly?
What is specificity in the context of gene prediction?
Specificity: ability to exclude incorrect predictions
Sp
how much of the prediction of nucleotides/exons/genes is true?
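A minimal sketch of both measures at the nucleotide level, assuming the convention commonly used in gene prediction: Sn = TP / (TP + FN) and Sp = TP / (TP + FP). The function name `sn_sp` and the toy coordinates are illustrative:

```python
def sn_sp(true_positions, predicted_positions):
    """Return (sensitivity, specificity) for sets of coding nucleotide positions."""
    true_set, pred_set = set(true_positions), set(predicted_positions)
    tp = len(true_set & pred_set)                    # correctly predicted coding nt
    sn = tp / len(true_set) if true_set else 0.0     # share of true coding nt found
    sp = tp / len(pred_set) if pred_set else 0.0     # share of predicted nt that are true
    return sn, sp

# Toy example: the true gene covers positions 10-19, the prediction covers 15-24.
print(sn_sp(range(10, 20), range(15, 25)))
```

The same calculation can be done at the exon or gene level by comparing sets of exons or genes instead of nucleotide positions.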
Explain extrinsic gene prediction in eukaryotes
What is extrinsic?
What can these approaches not identify?
Genes are found based on
similarity with transcripts (RNA-seq, long-read RNA-seq)
* can be used to identify exons and splicing patterns
* problems: paralogs, placing short reads, and RNA-seq is only a snapshot (only genes expressed in the sampled tissue and condition are seen)
* preferred & very common approach
similarity with proteins
* close relative’s proteome (e.g., from UniProtKB)
* advantage: may provide information about function
* problems: domains, UTRs, lineage-specific genes?
cannot identify new genes
Can you use general purpose sequence similarity tools for extrinsic prediction of eukaryotic genes?
Can’t use general purpose sequence similarity tools!
no awareness of splice sites, start/stop, reading frames, etc!
specific software required!
What are some dynamic approaches for gene prediction?
Hybrid methods
choosers
combiners
pipelines
What do hybrid methods for gene prediction do?
use both intrinsic and extrinsic methods to predict genes
What do combiners do for gene prediction? give an example
combine independent predictions (EvidenceModeler)
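A toy sketch of the combiner idea as a weighted vote per nucleotide position. This is an assumption-laden illustration and not EvidenceModeler's actual algorithm; the function name `combine_predictions`, the predictor names, the weights and the threshold are all invented:

```python
def combine_predictions(predictions, weights, threshold=0.5):
    """Keep positions whose weighted share of supporting predictors >= threshold.

    predictions: dict mapping predictor name -> set of predicted coding positions
    weights:     dict mapping predictor name -> trust in that predictor
    """
    total = sum(weights[name] for name in predictions)
    positions = set().union(*predictions.values())
    return {
        p for p in positions
        if sum(weights[n] for n, s in predictions.items() if p in s) / total >= threshold
    }

preds = {
    "ab_initio": {1, 2, 3},
    "protein_alignment": {2, 3, 4},
    "transcript_alignment": {3, 4, 5},
}
equal = {"ab_initio": 1, "protein_alignment": 1, "transcript_alignment": 1}
print(sorted(combine_predictions(preds, equal)))
```

Giving one source a higher weight shifts the consensus towards that source, which is how a combiner can trust transcript evidence more than an ab initio prediction.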
Example of a pipeline method for gene prediction?
e.g., MAKER
MAKER pipeline method for gene prediction: what data is used?
intrinsic: newly sequenced genome (repeats masked)
extrinsic: evidence:
- RNA-seq and/or
- protein sequences from selected lineages
MAKER pipeline method for gene prediction: What are the initial gene predictions based on?
homology
MAKER pipeline method for gene prediction: steps
EXAM QUESTION
steps of a procedure combining intrinsic and extrinsic annotation for a non-model organism’s newly assembled genome without available RNA-seq data (2022)
- initial gene predictions (extrinsic)
- extraction of species-specific content statistics (intrinsic)
- generation of species-specific HMMs (intrinsic)
- (refined) gene predictions (intrinsic); repeat steps 2-4 about 2 times
- final gene predictions
How are predicted genes evaluated?
- quantify expected gene content for a given lineage
- lineage-specific near-universal single-copy orthologs (BUSCO, https://busco.ezlab.org/)
- how many are complete, partial, absent, duplicated?
What is BUSCO?
(Benchmarking Universal Single-Copy Orthologs) scores
- look for presence/absence of highly conserved genes in assembly.
- aim: highest possible percentage of these genes identified in assembly
- BUSCO complete score above 95% considered good
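A minimal sketch of turning BUSCO category counts into percentages and applying the 95% rule of thumb above. The counts are invented for illustration, and the function name `busco_summary` is hypothetical (it is not part of the BUSCO software); "fragmented" and "missing" correspond to partial and absent:

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return percentages per category; complete = single-copy + duplicated."""
    total = single + duplicated + fragmented + missing

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }

# Invented counts for a lineage set of 1000 expected orthologs.
summary = busco_summary(single=940, duplicated=20, fragmented=15, missing=25)
print(summary)
print("good" if summary["complete"] > 95 else "check assembly/annotation")
```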
EXAM QUESTION
Describe the main points of intrinsic and extrinsic genome annotation, and a disadvantage for each one of them. (2019)
What do intrinsic and extrinsic mean in gene prediction? Explain/describe the methods and name disadvantages (2020)
Intrinsic: just use query data (genome/s), build statistical model, e.g., an HMM
Extrinsic: comparative - use other data, e.g., RNA sequences or proteins from other species/lineages
Disadvantage
intrinsic: need a lot of species- or lineage-specific training data, which is often not available for non-model organisms
extrinsic: difficult to predict new genes