Genome annotation Flashcards

Question

Give an example of splice sites being recognised using a position weight matrix

Answer 1

- In the vast majority of pre-mRNA introns, the first two nucleotides of the intron are 5’-**GU**-3’ and the last two are 5’-**AG**-3’, hence the name **GU-AG** introns
- All members of this class are spliced in the same way
- These conserved dinucleotides were recognised early on as important for splicing, but they are actually part of longer consensus sequences that span the 5’ and 3’ splice sites

Answer 2

- Other conserved features add context and can be searched for, e.g. polypyrimidine tracts just upstream of the 3’ end of the intron in most higher eukaryotes
- Yeast introns mostly have an invariant 5’-UACUAAC-3’ sequence 18-140 bp upstream of the 3’ splice site instead
- Position weight matrices (PWMs) have been compiled to describe splice sites in different taxonomic groups, and they differ between groups
- However, analysis of the splice junction alone gives low specificity, probably because of multiple splicing mechanisms and regulated alternative splicing
- As alternative splicing is still relatively poorly documented, it is not yet known how inaccurate these predictions are
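As a rough illustration of how a PWM can be used to recognise a splice site, the sketch below builds a log-odds matrix from a handful of made-up donor-site windows and slides it along a sequence. The training windows, the uniform background of 0.25 and the pseudocount are all assumptions chosen for the example, and the sequences are written in the DNA alphabet (GT rather than GU).

```python
import math

# Hypothetical toy training set: 9-nt windows around donor sites
# (3 exonic bases followed by 6 intronic bases). Not real data.
DONOR_SITES = [
    "CAGGTAAGT",
    "AAGGTGAGT",
    "CAGGTAAGA",
    "GAGGTAAGC",
    "AAGGTATGT",
]

BASES = "ACGT"


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Return one {base: log2-odds} dict per position of the aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for a window as wide as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_site(pwm, sequence):
    """Slide the PWM along the sequence; return (start, score) of the best window."""
    width = len(pwm)
    scored = [
        (start, score(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored, key=lambda pair: pair[1])


pwm = build_pwm(DONOR_SITES)
start, best = best_site(pwm, "TTTCAGGTAAGTCCC")
print(start, round(best, 2))
```

Because each column is scored independently, a matrix like this cannot express the dependencies between positions that are discussed for the donor site in later cards.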

Answer 3

- Gene prediction can be accomplished using an HMM-like structure that models the following features:
- Regular gene structure
- Exon/intron lengths
- Nucleotide composition
- Motifs at the boundaries of exons, introns, etc. (start codon, stop codon, splice sites)
- Patterns of conservation

Answer 4

- Gene prediction programs such as Genscan use known genes as a training set to build a species-specific or taxon-specific gene model based on these features
- The model can then be used to search unknown sequence for similar patterns and predict gene structures

Answer 5

- The HMM for GENSCAN is structured as follows.
- Start in the intergenic region state, N, based on observations in the data.
- The next possible state is P, the promoter state.
- Next has to be F, the 5’ untranslated region.
- Next is either Esngl (single-exon gene) or Einit (the first exon of a multiple-exon gene). The probability of moving to each of these states is based on the training data.
- The final states are T (3’ untranslated region) and then A (polyadenylation signal), then back to N.
- From Einit there are three different intron states, corresponding to how the reading frame is shifted. There are also three different exon states that can follow the intron states.
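The state structure described above can be sketched as a simple transition table. This is only an illustration of the forward-strand layout: the terminal exon state `Eterm`, the names `I0`-`I2` and `E0`-`E2`, and the exact wiring between intron and exon states are assumptions added to make the graph complete, and no transition probabilities are modelled.

```python
# Allowed forward-strand transitions between GENSCAN-style states.
# Eterm and the intron/exon wiring are assumptions for illustration only.
TRANSITIONS = {
    "N": {"P"},                    # intergenic -> promoter
    "P": {"F"},                    # promoter -> 5' UTR
    "F": {"Esngl", "Einit"},       # 5' UTR -> single exon or initial exon
    "Esngl": {"T"},                # single-exon gene -> 3' UTR
    "Einit": {"I0", "I1", "I2"},   # initial exon -> intron in one of three phases
    "I0": {"E0", "Eterm"},
    "I1": {"E1", "Eterm"},
    "I2": {"E2", "Eterm"},
    "E0": {"I0", "I1", "I2"},      # internal exon -> intron in any phase
    "E1": {"I0", "I1", "I2"},
    "E2": {"I0", "I1", "I2"},
    "Eterm": {"T"},                # terminal exon -> 3' UTR
    "T": {"A"},                    # 3' UTR -> poly(A) state
    "A": {"N"},                    # back to intergenic
}


def is_valid_path(path):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS.get(cur, set()) for cur, nxt in zip(path, path[1:]))


single_exon = ["N", "P", "F", "Esngl", "T", "A", "N"]
multi_exon = ["N", "P", "F", "Einit", "I1", "E1", "I0", "Eterm", "T", "A", "N"]
print(is_valid_path(single_exon), is_valid_path(multi_exon))
```

A full model would attach a probability to each transition and a sequence-generating model (length distribution plus content or signal model) to each state.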

Answer 6

- Intron lengths show a geometric distribution but have a minimum size
- Exon lengths show a normal distribution, but internal exons show a steep drop-off after 300 bp, unlike initial and terminal exons
- Therefore, separate length distribution functions can be used for initial, internal and terminal exons, and for single-exon genes
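One way to picture these length models is a shifted geometric distribution for introns and a separate empirical distribution per exon class. The minimum and mean intron lengths and the exon lengths below are invented placeholders, not values from any real training set.

```python
import math
from collections import Counter

MIN_INTRON = 60      # hypothetical minimum intron length (bp)
MEAN_INTRON = 1000   # hypothetical mean intron length (bp)


def intron_length_logprob(length, min_length=MIN_INTRON, mean_length=MEAN_INTRON):
    """Log-probability under a geometric distribution shifted to start at min_length."""
    if length < min_length:
        return float("-inf")  # introns shorter than the minimum are not allowed
    p = 1.0 / (mean_length - min_length + 1)  # gives the requested mean length
    return math.log(p) + (length - min_length) * math.log1p(-p)


def empirical_length_probs(observed_lengths):
    """Relative frequency of each observed length (no smoothing)."""
    counts = Counter(observed_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


# Hypothetical observed exon lengths, kept separate for each exon class.
EXON_LENGTHS = {
    "initial": [90, 120, 120, 300],
    "internal": [120, 120, 150, 150, 150, 180],
    "terminal": [200, 250, 250, 600],
    "single": [900, 1200, 1500],
}
EXON_MODELS = {kind: empirical_length_probs(v) for kind, v in EXON_LENGTHS.items()}

print(intron_length_logprob(59), intron_length_logprob(60) > intron_length_logprob(500))
print(EXON_MODELS["internal"][150])
```

Keeping one model per exon class is what lets the steep drop-off for internal exons be represented without distorting the initial and terminal exon distributions.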

Answer 7

There are conserved patterns at the donor and acceptor sites, and also significant dependencies among non-adjacent positions at the donor site. The donor site is particularly interesting because what matters is not just the probability of each nucleotide at each position: the positions are interdependent, and this can be modelled → maximal dependence decomposition (MDD)

Answer 8

- MDD (maximal dependence decomposition) models the dependencies between nucleotides at different positions
- Used by Genscan to predict the donor site
- A PWM or HMM does not capture these dependencies; an HMM captures dependencies between adjacent nucleotides but not between non-adjacent ones
- Requires a large number of sequences to construct
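The core step of MDD, as it is usually described, is to find the position whose match to the consensus is most strongly associated with the nucleotides at all other positions, and then split the training set on that position. The sketch below implements only that one selection step with a chi-square statistic; the toy sites are invented so that positions 0 and 3 (which are not adjacent) depend on each other, and this is not GENSCAN's own code.

```python
from collections import Counter

BASES = "ACGT"

# Hypothetical aligned sites: position 0 and position 3 always agree.
SITES = ["AGTA"] * 4 + ["CGTC"] * 4


def consensus(sites):
    """Most frequent base in each column (ties go to the first base seen)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*sites)]


def chi_square(sites, i, j, consensus_base):
    """Chi-square for the 2x4 table: (site[i] matches consensus) vs base at j."""
    n = len(sites)
    table = Counter((s[i] == consensus_base, s[j]) for s in sites)
    row = Counter(s[i] == consensus_base for s in sites)
    col = Counter(s[j] for s in sites)
    stat = 0.0
    for matches in (True, False):
        for base in BASES:
            expected = row[matches] * col[base] / n
            if expected > 0:
                stat += (table[(matches, base)] - expected) ** 2 / expected
    return stat


def mdd_split_position(sites):
    """Return the position with the largest summed dependence, plus all sums."""
    cons = consensus(sites)
    width = len(cons)
    totals = [
        sum(chi_square(sites, i, j, cons[i]) for j in range(width) if j != i)
        for i in range(width)
    ]
    best = max(range(width), key=totals.__getitem__)
    return best, totals


best, totals = mdd_split_position(SITES)
print(best, totals)
```

In a full MDD model the sites would then be partitioned by whether they match the consensus at the chosen position, and the procedure repeated on each subset, which is why a large training set is needed.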

Answer 9

- MDD (maximal dependence decomposition) models the dependencies between nucleotides at different positions
- Used by Genscan to predict the donor site
- A PWM or HMM does not capture these dependencies; an HMM captures dependencies between adjacent nucleotides but not between non-adjacent ones
- Requires a large number of sequences to construct

Answer 10

The two nucleotides immediately following the donor splice site, at the start of the intron, are almost always GT. The two nucleotides immediately preceding the acceptor splice site, at the end of the intron, are almost always AG.
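The GT...AG rule on its own can be turned into a very naive scan for candidate introns, sketched below. The function name and the `min_length` threshold are assumptions for the example; on real sequence this would return far too many candidates, which is why the dinucleotides are combined with the richer signal and length models.

```python
def candidate_introns(dna, min_length=4):
    """Return (start, end) half-open spans that begin with GT and end with AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_length  # AG must end downstream of the GT
    ]


sequence = "AAGTCCCCAGTT"
spans = candidate_introns(sequence)
print(spans, [sequence[s:e] for s, e in spans])
```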

Answer 11

- The acceptor splice site (PWA, similar to a PWM but assuming some dependency between adjacent positions)
- Poly(A) signal: the consensus is AATAAA (PWM)
- Translation start: 12 base pairs (PWM)
- Translation stop: 3 base pairs, one of the three stop codons chosen according to observed frequency, followed by 3 further nucleotides (PWM)

Answer 12

- 30% of promoters in eukaryotes lack a TATA signal, so GENSCAN splits the promoter model in two for prediction:
- TATA-containing promoter
  - Generated with probability 0.7
  - 15 bp TATA-box WMM and 8 bp cap site PWM
- TATA-less promoter
  - Generated with probability 0.3
  - Modelled as intergenic-null regions of 40 bp

Answer 13

If the program doesn’t find a promoter, it usually doesn’t rule out the gene, because promoters are so difficult to find. Genscan does not require a promoter prediction to produce a gene model.

Answer 14

Prodigal vs Genscan: Prodigal trains itself on the genome it is annotating (using its own predicted genes), whereas Genscan is trained on known genes, which can come from other genomes.

Answer 15

- Many useful tools for gene identification are based on sequence similarity: you take a genome and BLAST it against known sequences

Answer 16

- The assumption is that if two genes are (very) similar in sequence, they will encode proteins with similar structure/function
- Although not infallible, this can still give very useful results
- Compare the unknown sequence with sequences of known (or guessed) function using sequence alignment methods
- Even if the match is to a protein of unknown function, the similarity itself is strong evidence that the sequence is protein-coding

Answer 17

- EST data from the same species or a close relative; ESTs come from high-throughput sequencing of messenger RNAs and are predominantly 5’, so they can reveal the first exon, which is very useful
- SwissProt database

Answer 18

Ab initio methods can produce a lot of false positives, but this is not much of an issue: combining them with the alignment evidence gives better results. A false positive is preferable to a false negative: you don’t want to miss a gene, but it’s okay to over-predict. Ab initio prediction can also give you a gene structure that you might not get from the homology search.

Answer 19

- The initial exon is occasionally predominantly UTR
- This makes identification by BLAST homology difficult, particularly with protein alignments
- The few coding nucleotides would not produce a significant BLAST alignment: only 2 amino acids is too small for BLAST, and it is not uncommon for this to happen
- EST or RNA-seq data, or ab initio prediction, can help to resolve these issues

Answer 20

- RNA-Seq has applications beyond measuring gene expression levels: because the mRNA reads are mapped back to the genome, they show where the genes are
- Assembly and mapping of the reads can also be used to identify genes within a genome
- **Transcriptome Profiling by RNA-Seq** (figure not shown): RNA-Seq can also annotate variable transcripts
  - **Blue**: reads that map to previously annotated UTRs, exons and splice junctions
  - **Green**: reads that map to novel expressed sequences, including alternative exons and the corresponding splice junction sequences (indicated in red)
- RNA-Seq allows detection of other novel features, such as fusion transcripts that map to an exon from one gene followed by an exon from another gene; fusion genes are particularly common in cancer
  - A fusion might occur as a result of a translocation, deletion or chromosomal inversion
  - Example: the PML-RAR protein associated with acute promyelocytic leukemia

Answer 21

- The main source of information is significant BLAST results, particularly against SwissProt or RefSeqP
- Other annotation sources include:
  - GO (Gene Ontology) annotations
  - Mass spec predictions
  - Signal peptide prediction
  - Transcription factors
  - Domain prediction (InterProScan): even if you can’t get a hit for a gene, you can at least try to find some domains, which will hopefully tell you something about its function

Answer 22

- The promoter is an information-rich signal **BUT** promoter prediction is still difficult
- A number of programs do it, based on libraries describing known transcription factor binding specificities together with some measures of promoter structure, but they don’t perform particularly well alone
- Most *ab initio* gene prediction programs don’t base their predictions solely on promoter structures, although they often predict a promoter if possible
- One method for predicting promoters and other regulatory sites is phylogenetic footprinting

Answer 23

- Evolutionarily conserved genes: multiple species
- Co-expressed genes: single species

Answer 24

- Genes that are expressed at the same time are likely to be controlled by the same regulatory elements (transcription factors and/or promoters)
- Predict regulatory regions by aligning the upstream regions of the co-expressed genes
- Co-expression data are obtained by microarray or RNA-Seq

Answer 25

- Predict regulatory regions, including promoters, by aligning the upstream regions of evolutionarily conserved genes
- More distantly related species may give better results because more mutations have accumulated outside the regulatory region
- However, distantly related also means greater overall mutation
- In theory, if the genes are conserved then their promoters should also be maintained throughout evolution; with evolutionarily diverse species, the conserved promoter should eventually stick out from the diverged sequence around it
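The idea of a conserved element "sticking out" of diverged upstream sequence can be sketched as a scan for fully conserved, gap-free runs in a multiple alignment. The three aligned sequences below are invented, the alignment is assumed to have been computed already, and the `min_len` threshold is an arbitrary choice; real footprinting methods tolerate mismatches and account for the phylogeny.

```python
def conserved_windows(aligned, min_len=6):
    """Return half-open (start, end) column ranges identical in every sequence."""
    conserved = [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*aligned)
    ]
    windows, start = [], None
    # A trailing False closes any run that reaches the end of the alignment.
    for pos, is_conserved in enumerate(conserved + [False]):
        if is_conserved and start is None:
            start = pos
        elif not is_conserved and start is not None:
            if pos - start >= min_len:
                windows.append((start, pos))
            start = None
    return windows


# Hypothetical pre-aligned upstream regions from three species.
UPSTREAM = [
    "ACGTTATAAAGCCA",
    "TGCATATAAATGGC",
    "CATGTATAAACATT",
]

windows = conserved_windows(UPSTREAM)
print(windows, [UPSTREAM[0][s:e] for s, e in windows])
```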

Answer 26

The more diverse the species, the better: there are more chances for the promoter to stick out. In theory this is pretty simple; in practice it is VERY complicated.

Answer 27

- ChIP-seq is used to identify chromosomal sequences where proteins have bound, e.g. transcription factor binding sites
- ChIP-seq directly sequences the bound DNA, which can then be mapped back onto the genome for precise localisation
- The same approach is used to identify DNA methylation sites (MeDIP-seq)

Answer 28

- There is evidence that the genome environment, including repeats, can be important for the regulation of gene expression
- LINE, SINE and LTR elements comprise 37% of the rodent genome and 42% of the human genome
- Exons of genes comprise only approximately 2% of the sequence
- LTR retrotransposons influence developmentally regulated expression of genes in mouse oocytes and preimplantation embryos
- The X chromosome has a proportionately high level of LINE repeats, which are implicated in X-inactivation
- A gibbon-specific retrotransposon (3’-L1-AluS-VNTR-Alu-like-5’) is thought to be responsible for ‘the genome plasticity of the gibbon lineage’.

Answer 29

- Annotated genomes are available for multiple eukaryotes, e.g. Human, *Mus*, *Rattus*, *Drosophila*, *Fugu*, *Anopheles*, *C. elegans*, Dog, Armadillo, Chimp, Bushbaby, Cat etc.
- The Ensembl pipeline is used for primary annotation
- Portable (if you have the compute power and a big enough problem to solve)
- Stores data in a MySQL database
- Annotations accessible via the web
- Multiple data mining interfaces
- Now a standardised repository for project annotations
- Emphasis on inter-genome comparisons (Compara)

Answer 30

1. Place known genes from the same organism (e.g. human) onto the genome
2. Place highly similar genes from other species (e.g. *Mus*) onto the genome (BLAST)
3. Predict novel genes with ab initio methods (Genscan), backed up by supporting evidence from sequence similarity: only use predictions confirmed by similarity to protein, cDNA or ESTs

The first two stages are based on aligning PROTEINS to the genome. DNA-DNA alignments don’t give translatable genes, so it is essential to align at the protein level, allowing for frameshifts and splice sites, to get an accurate gene model.


(55 cards)