5. Genome annotation Flashcards
Repeats & mobile elements in genome annotation - how to deal with them?
identify repeats to mask them (or to study them)
Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations.
approaches:
* library-based approaches: compare genome sequence to a library of known repeats
* signature-based approaches: search for signatures of transposable elements (LTRs, key structural proteins or enzymes, etc.)
* (de novo approaches: compare a genome sequence with itself; search for multiple occurrences of k-mers)
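A minimal sketch of the de novo idea (count k-mers within the genome itself, then mask the frequent ones). The toy sequence, the function names `repeated_kmers` and `soft_mask`, and the thresholds are illustrative assumptions; real repeat finders are far more elaborate.

```python
from collections import Counter

def repeated_kmers(genome: str, k: int = 4, min_count: int = 3) -> dict:
    """Return k-mers that occur at least min_count times (overlapping windows)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

def soft_mask(genome: str, kmers, k: int) -> str:
    """Lower-case every position covered by a repeated k-mer (soft masking)."""
    masked = list(genome.upper())
    for i in range(len(genome) - k + 1):
        if genome[i:i + k] in kmers:
            for j in range(i, i + k):
                masked[j] = masked[j].lower()
    return "".join(masked)

# Toy sequence (assumption): the 4-mer ACGT was placed in it three times.
toy = "ACGTTTGACCACGTAGGCTACGTCA"
repeats = repeated_kmers(toy, k=4, min_count=3)
print(repeats)
print(soft_mask(toy, repeats, k=4))
```

Soft masking (lower case) keeps the sequence readable for later study of the repeats, while still letting alignment tools skip those regions.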
Two types of genome annotation?
structural, functional
What is structural genome annotation?
structural: where are the protein-coding genes?
‘Structural’ - process of identifying genes and their intron–exon structures. (paper)
what is the predicted phenotypic effect of the variant?
What is functional genome annotation?
(less important for our course)
functional: what is the function of the predicted genes and genetic elements?
‘Functional’ - the process of attaching metadata such as gene ontology terms to structural annotations (paper)
What are the different approaches to structural genome annotation?
approaches
* intrinsic, ab initio, de novo (only use query)
* extrinsic, homology/evidence-based (use other sequences)
* hybrid / combined / pipelines
What are the three steps for an intrinsic approach to structural genome annotation?
- collect appropriate training data
- build statistical model based on the training data
- apply model to the newly assembled genome to predict locations of protein-coding genes
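The three steps above can be sketched with a deliberately simple model: nucleotide frequencies in coding vs. non-coding training sequences, applied as a log-odds score. This is not an HMM; the training sequences, the pseudocount, and the function names are made-up assumptions for illustration.

```python
import math
from collections import Counter

def train_frequencies(seqs, pseudocount=1.0):
    """Step 2: estimate nucleotide frequencies from training sequences."""
    counts = Counter("".join(seqs))
    total = sum(counts[n] for n in "ACGT") + 4 * pseudocount
    return {n: (counts[n] + pseudocount) / total for n in "ACGT"}

def log_odds(window, coding, noncoding):
    """Step 3: score a window; > 0 means 'looks more like coding'."""
    return sum(math.log(coding[n] / noncoding[n]) for n in window)

# Step 1: collect training data (toy sequences, GC-rich genes are an assumption)
coding_train = ["ATGGCGGCCGCGGGC", "ATGGGCGCCGCGCTG"]
noncoding_train = ["ATATTTAATATAAAT", "TTTAAATATTTATAA"]

# Step 2: build the statistical model
coding_model = train_frequencies(coding_train)
noncoding_model = train_frequencies(noncoding_train)

# Step 3: apply the model to new sequence
for window in ("GCGGCCGCG", "TATTTAAAT"):
    print(window, round(log_odds(window, coding_model, noncoding_model), 2))
```

The quality of step 3 depends entirely on step 1: a model trained on genes from a distant species would carry the wrong frequencies.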
Intrinsic approach to genome annotation:
What is appropriate training data?
- genes from the species to be annotated
- easy for “first generation” (eukaryotic) genomes
- much more difficult for “second generation” (eukaryotic) genomes
What statistical models can be used for intrinsic approaches?
- Hidden Markov Models - important for our course
- (Bayesian approaches)
- (Machine Learning)
Hidden Markov Models
How are they used in genome annotation?
Under what grouping of approaches does this belong?
given a DNA sequence, we want to know:
* where does it most likely contain genes?
* what probability is associated with this result?
Training data (genes from the species to be annotated) are used to build an HMM, which is then applied to a newly assembled genome to predict the locations of protein-coding genes.
an intrinsic approach
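A minimal sketch of how an HMM answers both questions (most likely gene locations, and an associated probability): a two-state model decoded with the Viterbi algorithm. All probabilities below are invented for illustration, not trained values, and the two-state layout is an assumption; real gene finders use many more states.

```python
import math

STATES = ("noncoding", "coding")
# Hypothetical parameters (assumptions, not estimated from real data)
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding": {"noncoding": 0.1, "coding": 0.9}}
EMIT = {"noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
        "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}}

def viterbi(seq):
    """Return the most likely state path and its joint log-probability."""
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for nt in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][nt]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(STATES, key=lambda s: scores[s])
    best_log_prob = scores[last]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_log_prob

seq = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
path, log_prob = viterbi(seq)
print("".join("C" if s == "coding" else "n" for s in path))
print(log_prob)
```

Here each hidden state emits one nucleotide, and the returned log-probability is that of the single best path together with the sequence.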
Prokaryotic gene prediction (intrinsic)
properties of prokaryotic genome?
properties
* mostly intron-less genes
* average of 1000 nt per gene (ORF)
* translation start and stop codons for each gene
* some nt biases in coding vs. non-coding regions
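These properties suggest a simple ORF scan: look for a start codon followed in frame by a stop codon, and keep only long ORFs. A minimal sketch, assuming the forward strand only, ATG as the sole start codon, and a hypothetical `min_len` threshold:

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_len: int = 30):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    end is exclusive and includes the stop codon; min_len is in nucleotides.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # ORF closed; look for the next start
    return orfs

# Toy sequence: 2 nt, ATG, ten GCT codons, TAA, 2 nt
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
print(find_orfs(toy, min_len=30))
print(find_orfs(toy, min_len=100))
```

The length threshold is what removes chance ORFs, and it is also why genes shorter than the threshold are missed; an ORF cut off by a contig end never reaches its stop codon and is not reported either.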
Prokaryotic gene prediction (intrinsic)
using these as a basis for prediction will not work for …?
- small genes
- partial sequences, incomplete genes
- sequencing errors
Types of HMMs for gene prediction
standard HMMs
* each hidden state emits one nt
generalized HMMs (important for course)
(HMMs with duration)
* each hidden state emits a string of nucleotides
can include
* one strand, both strands
* typical & atypical genes
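One way to see the difference between the two HMM types is state duration. In a standard HMM, the length of a stay in one state follows a geometric distribution set by the self-transition probability; a generalized HMM can give each state its own length distribution. A small sketch, where the exon length table is made up for illustration:

```python
def geometric_duration(p_stay: float, d: int) -> float:
    """P(state lasts exactly d nt) in a standard HMM with self-transition p_stay."""
    return p_stay ** (d - 1) * (1 - p_stay)

# Hypothetical explicit length distribution for a GHMM exon state (assumption)
EXPLICIT_EXON_LENGTHS = {90: 0.2, 120: 0.5, 150: 0.3}

def explicit_duration(d: int) -> float:
    """P(state emits a string of exactly d nt) under the explicit table."""
    return EXPLICIT_EXON_LENGTHS.get(d, 0.0)

for d in (1, 120):
    print(d, geometric_duration(0.99, d), explicit_duration(d))
```

The geometric form always makes the shortest duration the most probable one, which fits exon lengths poorly; the explicit table can put its peak wherever the training data suggest.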
challenges of gene finding for eukaryotic genomes?
- definition of a gene (gene signals)
- overlapping genes
- very long or very short genes / exons / introns
- alternative biological processing:
  - alternative splicing
  - alternative polyadenylation
  - alternative initiation of transcription
  - alternative initiation of translation
- propagation of (annotation) errors in databases
- sequencing errors
- incomplete genes on short contigs/scaffolds
- contamination
How is splicing relevant to genome annotation?
Nearly all multi-exon human genes are alternatively spliced
* basic alternative splicing patterns vs
* complex alternative splicing patterns
Challenges of using eukaryotic gene signals in intrinsic approaches to genome annotation
Eukaryotic gene signals show a lot of variance ->
* not all genes contain the described signals
* the signals can occur outside of gene context
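A quick illustration of signals occurring outside gene context, using the canonical splice-site dinucleotides GT (donor) and AG (acceptor) as example signals (an assumption about which signals are meant). The sequence below is random and contains no genes at all:

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
random_seq = "".join(rng.choices("ACGT", k=n))

gt_count = random_seq.count("GT")  # candidate donor sites
ag_count = random_seq.count("AG")  # candidate acceptor sites
expected = (n - 1) / 16            # expected count of any given dinucleotide

print(gt_count, ag_count, expected)
```

Hundreds of apparent donor and acceptor sites appear by chance alone, so a signal match by itself is weak evidence and has to be combined with other information such as coding potential.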