Genome annotation Flashcards

1
Q

What are the objectives of genome annotation?

A
  • Identify all of the possible genes and features within the genome
  • Assign functions to as many genes as possible - This should include variable splice sites that may mean the gene has multiple products
  • Identify and describe regulatory features and their functions
  • Identify all other features, including repeats
  • Enable comparative studies between genomes
2
Q

What are some issues you can face during genome annotation?

A

Annotating genomes is a computationally intensive process, both in terms of processing and data storage

3
Q

What are some characteristics of genomes that might lead to errors during the annotation process?

A
  • Genomes are extremely large
  • Virtually all features are context-specific
  • Majority of the sequence inside a genome may not correspond to any known
    • introns
    • completely putative coding regions
    • pseudogenes
4
Q

What is annotation by simple metrics?

A

This is the simplest level of annotation and provides general information about the genome.

5
Q

What are some things you can annotate with simple metrics annotation?

A
  • base confidence scores
  • rolling window G+C%
  • Di- and trinucleotide composition bias - how often two (or three) particular nucleotides occur one after the other, compared with how often that would happen by chance. Any pattern that departs from the random expectation suggests there is selective pressure on that sequence, so it likely does something
  • codon use
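
As a rough illustration of the rolling window G+C% metric, here is a minimal Python sketch. The function name `rolling_gc` and the `window` and `step` defaults are assumptions chosen for the example; they do not come from any particular annotation tool.

```python
def rolling_gc(seq, window=100, step=50):
    """Return (window start, GC%) for each window along the sequence."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / window))
    return results

# Toy sequence: an AT-only window, a GC-only window, then a mixed one.
print(rolling_gc("ATATGCGCGCAT", window=4, step=4))
```

On a real genome the window would be hundreds or thousands of bases wide, and the resulting profile is what gets plotted along the chromosome.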
6
Q

What is the average GC content in the genome and in genes (in humans)?

A
  • Approx. 38% across the genome - but that’s an average and it can vary significantly throughout the genome
  • In genes the GC is approx 45-50% and more uniformly distributed than in the genome.
7
Q

What is the relationship between the GC content and gene density?

A

Regions of high GC content (62-68%) have higher relative gene density than regions of lower GC content.

8
Q

What is the relationship between exon and intron length and the GC content?

A
  • Exon length is relatively uniform with respect to GC content
  • Intron length decreases dramatically in regions of high GC content:
    • If GC content is around 30%, average intron length is 2300 bases
    • If the GC content is around 65%, average intron length is 300 bases
  • GC content and intron length can therefore act as a marker for where the genes are likely to be
9
Q

What are some other characteristics we can measure apart from the GC content?

A
  • Most simple is dinucleotide composition/bias - %AT vs. %GC
  • Dicodon counts can be more informative - (frequency of occurrence of successive
    codon pairs)
    AA, AC, AG, AT, CA, CC etc.
  • Measure of 3rd base periodicity (tendency of the same nucleotide to be found at particular distances e.g. n, n+2, n+5, n+7)
  • Length and occurrence of Open Reading Frames (ORFs) - stretches between stop codons
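
One common way to express dinucleotide bias is the observed/expected ratio: how often a pair occurs divided by how often it would occur if the two bases were independent. The sketch below assumes that definition; the function name `dinucleotide_obs_exp` is made up for the example.

```python
from collections import Counter

def dinucleotide_obs_exp(seq):
    """Observed/expected ratio for every dinucleotide present in seq.

    Expected frequency assumes the two bases occur independently.
    """
    seq = seq.upper()
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    return {
        pair: (count / total_pairs) / (base_freq[pair[0]] * base_freq[pair[1]])
        for pair, count in pairs.items()
    }

ratios = dinucleotide_obs_exp("ACACACAC")
print(ratios)  # ratios well above 1 mean the pair is over-represented
```

A ratio near 1 is what chance alone would give; values far from 1 are the kind of bias the card describes.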
10
Q

What is an ab initio prediction?

A

A prediction made from the sequence alone

11
Q

Why is gene prediction easier in bacteria than in eukaryotes?

A
  • No introns
  • Smaller intergenic regions
12
Q

How do bacterial gene finders work and is it a reasonable assumption?

A
  • Bacterial gene finders often just look for the largest open reading frames (ORFs), those above a certain size (approx. 300bp), and consider them to be real genes
  • This is a reasonable assumption in genomes with a low GC content
  • It is a problem for those with a high GC content
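
The "longest ORF" idea can be sketched in a few lines of Python. This is a simplified illustration, not how any real gene finder is implemented: it only scans the three forward frames, treats an ORF as any stop-free stretch, and the name `long_orfs` and the `min_len` parameter are assumptions for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}

def long_orfs(seq, min_len=300):
    """Return (start, end, frame) for stop-free stretches of at least min_len bases.

    Forward frames only; coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - orf_start >= min_len:
                    orfs.append((orf_start, i, frame))
                orf_start = i + 3
        # Stretch that runs to the end of the sequence without a closing stop.
        last = frame + 3 * ((len(seq) - frame) // 3)
        if last - orf_start >= min_len:
            orfs.append((orf_start, last, frame))
    return orfs

toy = "ATG" + "GCC" * 4 + "TAA" + "GC"
print(long_orfs(toy, min_len=9))
```

A fuller version would also scan the reverse complement and require a start codon at the beginning of each candidate.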
13
Q

Why is looking at the length of ORFs not enough in genomes with a high GC content?

A
  • With a lack of A and T in these genomes there are far fewer stop codons
  • Long ORFs occur simply by chance in high GC genomes, many of which are not genuine genes
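
A back-of-the-envelope calculation shows why. The sketch below assumes bases are independent, with A = T and G = C at frequencies set by the GC fraction; that is a deliberate simplification, and the function name `stop_codon_probability` is made up for the example.

```python
def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA for a given GC fraction.

    Assumes independent bases with P(A) = P(T) and P(G) = P(C).
    """
    a = t = (1.0 - gc) / 2.0
    g = gc / 2.0
    return t * a * a + t * a * g + t * g * a  # TAA + TAG + TGA

for gc in (0.30, 0.50, 0.70):
    p = stop_codon_probability(gc)
    # With a per-codon stop probability p, the mean wait for a stop is 1/p codons.
    print(f"GC={gc:.0%}: P(stop)={p:.4f}, mean stop-free run ~{3 / p:.0f} bp")
```

Under these assumptions, stop codons become several times rarer as GC content rises from 30% to 70%, so long stop-free stretches appear by chance far more often.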
14
Q

What is Prodigal?

A
  • Prodigal uses additional log-likelihood information to predict genes and not just ORF length
  • A relatively simple system that follows principles of KISS (Keep It Simple, Stupid)
  • High accuracy (>90%)
  • Good performance in high GC genomes: over 90% perfect match (5’+3’) to the curated Pseudomonas aeruginosa annotations
15
Q

What are the steps in Prodigal?

A
  1. Constructing a training set for protein coding
  2. Building log-likelihood coding statistics from the training data
  3. Sharpening coding scores
  4. Length factor to coding
  5. Iterative start training
  6. Final dynamic programming (trying every possible combination)
16
Q

Describe 1. Constructing a training set for protein coding (Prodigal)

A
  1. Prodigal examines all ORFs and looks for a bias for G or C in the 1st, 2nd, and 3rd positions of each codon
  2. It then builds gene models across the whole genome using this frame plot bias.
17
Q

Describe 2. Building log-likelihood coding statistics from the training data

A
  1. There is a preference in exons compared to introns for particular 6-nucleotide sequences (dicodons). Prodigal stores dicodon statistics for all the genes in its initial model. Each dicodon is scored as a log-likelihood of signal to background (how often you find it versus how often you would find it if it were distributed by chance).
  2. We are looking for biases - if you find a bias, then you should expect to see the same bias in all open reading frames
  3. Every potential gene in the genome (all possible starts and stops) is scored.
  4. These 2 steps are all Prodigal uses in the way of coding scores.
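
The log-likelihood idea can be sketched as a log ratio of hexamer frequencies. This is only an illustration of the principle, not Prodigal's actual code: the function names, the pseudocount, and the choice of "all hexamers in the genome" as the background are assumptions made for the example.

```python
import math
from collections import Counter

def hexamer_counts(seq, in_frame):
    """Count 6-mers, stepping by codon (in frame) or by single base."""
    step = 3 if in_frame else 1
    return Counter(seq[i:i + 6] for i in range(0, len(seq) - 5, step))

def dicodon_log_odds(training_genes, genome, pseudocount=1.0):
    """Return a function scoring a hexamer as log(P in genes / P in background)."""
    gene_counts = Counter()
    for gene in training_genes:
        gene_counts.update(hexamer_counts(gene, in_frame=True))
    bg_counts = hexamer_counts(genome, in_frame=False)
    gene_total = sum(gene_counts.values())
    bg_total = sum(bg_counts.values())
    n_hexamers = 4 ** 6

    def score(hexamer):
        p_gene = (gene_counts[hexamer] + pseudocount) / (gene_total + pseudocount * n_hexamers)
        p_bg = (bg_counts[hexamer] + pseudocount) / (bg_total + pseudocount * n_hexamers)
        return math.log(p_gene / p_bg)

    return score

def coding_score(candidate, score):
    """Sum the in-frame dicodon scores along a candidate gene."""
    return sum(score(candidate[i:i + 6]) for i in range(0, len(candidate) - 5, 3))

genes = ["ATGGCC" * 10]
genome = "ATGGCC" * 10 + "TTTTTT" * 10
score = dicodon_log_odds(genes, genome)
print(coding_score("ATGGCCATGGCC", score), coding_score("TTTTTTTTTTTT", score))
```

Candidates built from hexamers that are enriched in the training genes get positive scores; those built from background-like hexamers get negative ones.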
18
Q

Describe sharpening coding scores

A
  1. In the next step it penalizes all potential start candidates that lie downstream from a higher-scoring start. A penalty is assigned representing the bypassing of a good coding region.
  2. longer ORFs have more credibility
  3. Example: gene 3701-4000 has a score of 100 and gene 3763-4000 has a score of 75. The score of gene 3763-4000 is adjusted to be 75 minus the coding not selected (25), therefore 50.
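
The arithmetic in the example can be written out as a tiny function. It assumes, as the example suggests, that the penalty equals the difference between the higher-scoring upstream start and the downstream start; the name `sharpen` is hypothetical.

```python
def sharpen(best_upstream_score, downstream_score):
    """Penalise a downstream start by the coding score it bypasses."""
    bypassed = best_upstream_score - downstream_score
    return downstream_score - max(bypassed, 0)

# Gene 3701-4000 scores 100; gene 3763-4000 scores 75 and bypasses 25.
print(sharpen(100, 75))
```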
19
Q

Describe Length factor to coding

A

A length factor is added to the coding score. This factor is higher in high GC genomes, and lower in low GC genomes.

20
Q

Describe iterative start training

A
  1. For every open reading frame containing a gene with a coding score above a certain threshold, the translation initiation site with the highest coding score is recorded
  2. These starts are examined for ATG/GTG/TTG frequency and ribosomal binding site (RBS / Shine-Dalgarno) motifs.
  3. The starts are then rescored based on these discoveries, and the new set of starts with the highest score in each ORF is selected
  4. Shine-Dalgarno - ~10 nucleotides upstream from the start codon (consensus TAAGGAG) – corresponds to a pentamer in 16S rRNA near 3’ end
  5. The start trainer iterates until the set of “best starts” no longer changes (usually only a few iterations).
  6. This final set of “best starts” is used as the training set for start scoring
21
Q

Describe final dynamic programming (trying every possible combination)

A
  1. Final dynamic programming is performed over all start-stop pairs in the genome
  2. Each potential gene’s score is the sum of its start score and its coding score
  3. Small overlap allowed between two genes on the same strand, and a greater amount of overlap is allowed for 3’ ends of two genes on opposite strands
  4. Final predictions determined
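
The flavour of this step can be shown with a much simpler dynamic program: choosing the highest-scoring set of candidate genes on one strand. The sketch below is a simplification and not Prodigal's algorithm: it forbids all overlap and ignores the opposite strand, and the name `best_gene_set` and the (start, end, score) tuples are assumptions for the example.

```python
from bisect import bisect_right

def best_gene_set(candidates):
    """Pick non-overlapping candidates with the highest total score.

    candidates: list of (start, end, score) with end exclusive.
    Returns (total score, list of chosen candidates).
    """
    genes = sorted(candidates, key=lambda g: g[1])
    ends = [g[1] for g in genes]
    best = [(0.0, [])]  # best[i] is the optimum using only the first i genes
    for i, (start, end, score) in enumerate(genes):
        # Number of earlier genes that finish at or before this gene starts.
        k = bisect_right(ends, start, 0, i)
        take_total = best[k][0] + score
        if take_total > best[i][0]:
            best.append((take_total, best[k][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]

candidates = [(0, 300, 50), (200, 600, 80), (300, 900, 40), (600, 900, 30)]
print(best_gene_set(candidates))
```

Here each score stands in for the sum of a start score and a coding score; allowing limited overlap would mean relaxing the "finish at or before" test.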
22
Q

What genome does Prodigal use for training?

A

It trains only on the genome it is currently analysing, not on external data.

23
Q

Why are eukaryotic genes more difficult to annotate?

A
  • multi exon genes - introns
  • multiple transcripts
  • large intergenic regions
24
Q

What can you use to describe the functional sites in a eukaryotic genome?

A
  • A position weight matrix (PWM) can be used - a score is given to each possible nucleotide at each possible position
  • Then for any sequence the scores are summed to give a score for that sequence as a potential site
  • To build one you need multiple examples of the site: produce an MSA and create the PWM from it
  • A PWM can be used for any functional site, including transcription factor binding sites
  • The position specific scoring matrix (PSSM) used by PSI-BLAST is a PWM
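
Building and using a PWM can be sketched as follows. This is a minimal illustration assuming log-odds scores against a uniform background, with a pseudocount to avoid zero counts; the function names and the toy TATA-like sites are invented for the example.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites.

    Returns one dict per position, mapping each base to its score.
    """
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score_site(pwm, seq):
    """Sum the per-position scores for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, seq))

sites = ["TATAAA", "TATAAA", "TATATA", "TATAAT"]
pwm = build_pwm(sites)
print(score_site(pwm, "TATAAA"), score_site(pwm, "GCGCGC"))
```

Scanning a genome then means sliding a window of the PWM's width along the sequence and reporting windows that score above a chosen threshold.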
25
Q

Give an example of splice sites being recognised using position weight matrix

A
  • With the vast bulk of pre-mRNA introns, the first 2 nucleotides of the intron sequence are 5’-GU-3’ and the last 2 are 5’-AG-3’ – hence they are called GU-AG introns
  • All members of this class are spliced the same way
  • These conserved motifs were recognised early on as important in the splicing process but are actually part of longer conserved consensus sequences that span the 5’ and 3’ splice sites
26
Q

What about PWM and alternative splicing?

A
  • Other conserved features add context and can be looked for - e.g. polypyrimidine tracts just upstream of the 3’ end of the intron in most higher eukaryotes
  • Yeast introns mostly have an invariant 5’-UACUAAC-3’ sequence 18-140bp upstream of the 3’ splice site instead
  • Position Weight Matrices (PWM) have been compiled to describe splice sites in different taxonomic groups - they differ
  • However, analysis of the splice junction itself gives low specificity – probably due to multiple splicing mechanisms and regulated alternative splicing
  • As alternative splicing is still relatively poorly documented, we don’t know just how bad the predictions are
27
Q

What do ab initio programs look for in eukaryote gene predictions?

A
  • gene prediction can be accomplished by using HMM-like structure with the following features
    • Regular gene structure
    • Exon/intron lengths
    • Nucleotide composition
    • Motifs at the boundaries of exons, introns, etc.
    • Start codon, stop codon, splice sites
    • Patterns of conservation
28
Q

What do programs such as Genscan train on?

A
  • Gene prediction programs such as Genscan use known genes as a training set to build a species or taxonomic specific gene model based on the features
  • Can then search unknown sequence to look for similar patterns and predict gene structures
29
Q

How is the HMM structured in GENSCAN?

A
  • The HMM for GENSCAN is structured as follows.
  • Start in the intergenic region, N based on observations in data.
  • The next possible state is P, the promoter state.
  • Next has to be F, the 5’ untranslated region.
  • Next is either Esngl (single exon gene) or Einit (the first exon of a multiple-exon gene). The probability of moving to these states is based on the training data.
  • Final states are T (3’ untranslated region) and then A (polyadenylated tail) then back to N.
  • From Einit there are three different intron states corresponding to how the reading frame is shifted. There are also three different exon states that can follow the intron states.
30
Q

How can you use length distribution functions for initial, internal and terminal exons?

A
  • Introns show geometric distribution but have a minimum size
  • Exons show a normal distribution but internal exons show steep dropoff after 300bp, unlike initial and terminal
  • Therefore, can use length distribution functions for initial, internal, and terminal exons and also for single-exon genes
31
Q

Conserved patterns in the donor splice sites

A

There are conserved patterns at the donor and acceptor sites, and also significant dependencies among non-adjacent positions at the donor site.
The donor site is particularly interesting because you don’t just get the probability of each nucleotide being there: there is an interdependence between positions, and you can model that → maximal dependence decomposition

32
Q

What is maximal dependence decomposition?

A
  • The MDD models the dependencies of nucleotides at different positions
  • Used by Genscan to predict the donor site
  • A PWM or HMM does not capture these dependencies; an HMM will tell you something if the nucleotides are next to each other, but not if they are not adjacent
  • Requires a large number of sequences to construct
33
Q

What is the conserved pattern in splice sites?

A

The two nucleotides immediately following the donor splice site on the intron are almost always GT. The two nucleotides immediately preceding the acceptor splice site on the intron are almost always AG
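
A simple check of this rule on a candidate intron might look like the sketch below. The function name `is_gt_ag_intron`, the 0-based end-exclusive coordinates, and the toy sequence are all assumptions made for the example.

```python
def is_gt_ag_intron(genome, intron_start, intron_end):
    """True if the intron starts with GT and ends with AG (DNA, forward strand).

    Coordinates are 0-based and end-exclusive.
    """
    intron = genome[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

# Toy gene: exon, then a 12 bp "intron", then a second exon.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(is_gt_ag_intron(genome, 6, 18))
```

Because GT and AG each occur by chance very often, a check like this is only a filter; real predictors score the longer consensus around each site.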

34
Q

What are the weight matrices used for?

A

  • The acceptor splice site (PWA – similar to a PWM but assumes some dependency between adjacent positions)
  • Polyadenylation signal - the consensus is AATAAA (PWM)
  • Translation start (12 base pairs) (PWM)
  • Translation stop (3 base pairs, 1 of 3 stop codons according to observed frequency, and then 3 nucleotides) (PWM)

35
Q

How does GENSCAN deal with promoter prediction?

A
  • 30% of promoters in eukaryotes lack a TATA signal so GENSCAN splits the model for prediction:
  • TATA containing promoter
    • Generated with probability 0.7
    • 15 bp TATA-box WMM and 8 bp cap site PWM
  • TATA-less promoter
    • Generated with probability 0.3
    • Modelled as intergenic-null regions of 40bp
36
Q

What happens if GENSCAN doesn’t find a promoter?

A

If the program doesn’t find a promoter then it usually doesn’t rule out the gene, because promoters are so difficult to find

Promoter prediction not required by Genscan to produce gene model

37
Q

What is the main difference between Prodigal and Genscan?

A

Prodigal vs Genscan - Prodigal trains itself on the genome it is annotating, using its own predicted genes, whereas Genscan is trained on known genes, which can come from other genomes.

38
Q

How do you do gene hunting by sequence homology?

A
  • Many useful tools for gene identification are based on sequence identity
  • In practice, you take the genomic sequence and BLAST it against databases of known sequences
39
Q

What is the assumption behind gene hunting by sequence homology?

A
  • Assumption is if 2 genes are (very) similar in sequence they will encode proteins with similar structure/function
  • Whilst not infallible – it can still give very useful results
  • Compare unknown sequence to sequences of known (or guessed) function by sequence alignment methods
  • Even if similar to protein of unknown function, the existence of similarity itself is strong evidence that sequence is protein-encoding
40
Q

What does BLAST search against?

A
  • EST data from the same species or a close relative. ESTs come from high-throughput sequencing of messenger RNAs; they are predominantly 5’, so they can tell you the first exon - very useful
  • SwissProt database
41
Q

What’s the issue with ab initio prediction and how can we solve it?

A

Ab initio methods can produce a lot of false positives. That is not much of an issue in itself, and combining the predictions with homology-based alignments gives better results.

False positives are preferable to false negatives - you don’t want to miss a gene, but it’s okay to over-predict.

Ab initio prediction can also give you a gene structure that you might not get from the homology search.

42
Q

What can be some difficulties identifying the first exon?

A
  • Initial exon occasionally predominantly UTR
  • Makes identification by BLAST homology difficult, particularly protein alignments
  • The coding nucleotides would not produce a significant BLAST alignment - only 2 aas = too small for BLAST - not accomon for this to happen
  • EST or RNA-seq data, or ab initio prediction, can help to resolve these issues
43
Q

What are some other applications of RNA seq?

A
  • RNA-Seq has applications beyond measuring gene expression levels: because you are mapping the mRNAs back to the genome, it tells you where the genes are
  • Assembly and mapping of the reads can also be used to identify genes within a genome
  • Transcriptome Profiling by RNA-Seq
    • RNA-Seq can also annotate variable transcripts
    • Blue - Reads that map to previously annotated UTRs, exons, and splice junctions
    • Green - Reads that map to novel expressed sequences, including alternative exons and corresponding splice junction sequences (indicated in red)
  • RNA-Seq allows detection of other novel features, such as fusion transcripts that map to an exon from one gene followed by an exon from another gene; fusion genes particularly common in cancer
  • It might occur as result of a translocation, deletion or chromosomal inversion
  • Example - PML-RAR protein associated with Acute Promyelocytic Leukemia
44
Q

How is a function assigned to a gene?

A
  • The main source of information is significant BLAST results – particularly against SwissProt or RefSeqP
  • Other annotation sources include:
    • GO (Gene Ontology) annotations
    • Mass spec predictions
    • Signal peptide prediction
    • Transcription factors
    • Domain prediction (InterProScan) - even if you can’t get a hit for a gene, you can at least try to find some domains, which will hopefully tell you something about the function
45
Q

Can we easily predict the promoter sequence?

A
  • The promoter is an information-rich signal BUT promoter prediction is still difficult
  • There are a number of programs that do it – based on libraries describing known transcription factor binding specificities together with some measures of promoter structure - but they don’t perform particularly well alone
  • Most ab initio gene prediction programs don’t just base their predictions on promoter structures, although they often predict a promoter if possible
  • One method to predict promoters, and other regulatory sites, is phylogenetic footprinting
46
Q

What are the two approaches to phylogenetic footprinting?

A

Evolutionarily Conserved Genes – Multi Species
Co-expressed Genes – Single Species

47
Q

Explain phylogenetic footprinting: Co-expressed Genes – Single Species

A
  • Genes that are expressed at the same time are likely to be controlled by the same regulatory elements – transcription factors and/or promoters
  • Predict regulatory regions by aligning the upstream regions of the co-expressed genes
  • Co-expressed data obtained by microarray or RNA-Seq
48
Q

Explain phylogenetic footprinting: Evolutionarily Conserved Genes – Multi Species

A
  • Predict regulatory regions, including promoters by aligning the upstream regions of evolutionarily conserved genes
  • More distantly related species may give better results due to greater mutations outside regulatory region
  • However, distantly related means greater overall mutation
  • Theory - if the genes are conserved, then the promoters should be maintained throughout evolution too; if you compare evolutionarily diverse species, the conserved promoter should start to stick out
49
Q

Should you use diverse populations in phylogenetic footprinting?

A

The more diverse the better - more chances for the promoter to stick out

In theory pretty simple, in practice VERY complicated

50
Q

ChIP-seq and MeDIP-seq:

A
  • ChIP-seq is used to identify chromosome sequences where proteins have bound, e.g. transcription factor binding sites
  • ChIP-seq directly sequences the DNA, which can then be mapped back onto the genome for precise localisation
  • The same approach is used to identify DNA methylation sites (MeDIP-seq)
51
Q

What is the significance of repeats?

A
  • Evidence that the genome environment, including repeats, can be important for the regulation of gene expression
  • LINE, SINE and LTR elements comprise 37% of the rodent and 42% of the human genome
  • Exons of genes comprise only approximately 2% of sequence
  • LTR retrotransposons influence developmentally regulated expression of genes in mouse oocytes and preimplantation embryos
  • The X chromosome has a proportionately high level of LINE repeats, which are implicated in X-inactivation
  • A gibbon specific retrotransposon (3’-L1-AluS-VNTR-Alu-like-5’) thought to be responsible for ‘the genome plasticity of the gibbon lineage’.
52
Q

What is Ensembl?

A
  • Annotated genomes available for multiple eukaryotes e.g. Human, Mus, Rattus, Drosophila, Fugu, Anopheles, C. elegans, Dog, Armadillo, Chimp, Bushbaby, Cat etc.
  • Ensembl pipeline used for primary annotation
  • Portable (if you have the compute power and a big enough problem to solve)
  • Stores data in MySQL database
  • Annotations accessible via the web
  • Multiple data mining interfaces
  • Now standardised repository for project annotations
  • Emphasis on inter-genome comparisons (compara)
53
Q

What is the Ensembl annotation procedure?

A
  1. Place known same organism (e.g. human) genes onto the genome
  2. Place highly similar genes e.g. mus on genome (BLAST)
  3. Predict novel genes from ab initio methods backed up with supporting evidence from sequence similarity – only use ones confirmed by similarity to protein, cDNA or ESTs (uses Genscan)

The first 2 stages are based around aligning PROTEINS to the genome; DNA-DNA alignments don’t give translatable genes
It is essential to align at the protein level, allowing for frameshifts and splice sites, to get an accurate gene model

54
Q

Walk through the whole annotation procedure in Ensembl

A