Annotation Flashcards
what does annotation mean in genomics?
to make sense of the assembly by identifying and characterizing its functional elements (e.g. genes)
2 ideas about what a gene is
what is their main difference?
Johannsen 1909 - the word gene is completely free from any hypothesis, many characteristics of the organism are conditioned by special, separable, and therefore self-existent fundamentals that occur in the gametes. (a gene is defined by its effect on the characteristics of an organism).
gene model - region of the genome that is transcribed into RNA and then translated into protein. (no link to the characteristics of an organism)
what is an ORF (open reading frame)?
region of DNA which starts with a start codon (ATG) and continues in the same reading frame to a stop codon (TAA, TAG or TGA)
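The ORF definition can be sketched as a simple scan. This is a minimal illustration, not a real gene finder: it assumes the forward strand only, takes the first ATG in each reading frame as the start, and the function name `find_orfs` and the `min_codons` cutoff are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs of ORFs on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon.
    min_codons is a hypothetical length cutoff (counting start and stop).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it at the in-frame stop
    return sorted(orfs)

print(find_orfs("CCATGAAATGGTAACC"))  # [(2, 14)] i.e. ATGAAATGGTAA
```

Note that the TGA at position 3 is ignored because no ATG has opened an ORF in that frame; only an in-frame stop ends an ORF.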
difference between prokaryotic (PK) and eukaryotic (EK) genes
PK - 1 gene is a contiguous region of DNA (no introns), intergenic spaces are small. genome is smaller, with fewer genes.
EK - exons are separated by introns (removed from the pre-mRNA by splicing before translation).
what is key about alternative splicing?
it can increase complexity (more than one transcript and protein from a single gene) without increasing genome size.
in humans, about 75% of genes have an alternative isoform
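A toy sketch of why splicing adds complexity without adding DNA: the same exons can be joined in different combinations. This models exon skipping only, and the exon sequences and the function name `isoforms` are invented for illustration.

```python
from itertools import combinations

def isoforms(exons, optional):
    """Enumerate mature mRNAs produced by exon skipping.

    exons: exon sequences in genomic order.
    optional: indices of exons that may be skipped (hypothetical cassette exons).
    """
    results = set()
    for k in range(len(optional) + 1):
        for skipped in combinations(optional, k):
            # join every exon that is not skipped, keeping genomic order
            results.add("".join(e for i, e in enumerate(exons) if i not in skipped))
    return sorted(results)

exons = ["ATGGCC", "GATTAC", "AAGTAA"]
print(isoforms(exons, optional=[1]))
# ['ATGGCCAAGTAA', 'ATGGCCGATTACAAGTAA']
```

With n skippable exons this gives up to 2^n isoforms from one gene, all from the same stretch of genome.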
2 approaches to identify gene models
- A priori/ab initio: looks for sequence patterns. Protein-coding regions have distinctive patterns of codon statistics.
- Evidence based: recognises regions corresponding to previously identified gene models. Uses similarity of the translated AA seq to known proteins in other species.
Which is the better approach to identify gene models?
A priori is less biased, but as more genomes are annotated, evidence-based annotation becomes more reliable. However, error propagation is a big issue (a wrong annotation in one genome gets copied into others).
what features of a gene does A priori look for?
- ATG - start codon
- TATA box - promoter region (about 30bp upstream of the transcription start site; where the RNA pol machinery binds)
- Stop codon
- Splice sites at exon-intron boundaries
- Polyadenylation sequence
these features vary between genes and organisms, which presents a difficulty for A priori.
example of 2 protein domains
kinase domains - signal transduction
LRR (leucine-rich repeats) - involved in protein-protein interactions.
what is domain shuffling
gene segments coding for functional domains are shuffled between different genes in evolution.
what causes complexity when using evidence based method of gene model prediction?
Codon usage preference
Different orgs may prefer different codons for the same AA.
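Codon usage preference can be made concrete by counting which synonymous codons a coding sequence uses for one amino acid. A minimal sketch, assuming the standard genetic code; the two sequences are invented examples, not real genes, and `codon_usage` is a name chosen for this illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases in TCAG order)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(cds, amino_acid):
    """Fraction of each synonymous codon used for one amino acid in a CDS."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c) == amino_acid)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

# Hypothetical coding sequences from two organisms, both encoding M-L-L-L-L-stop
org_a = "ATGCTGCTGCTGCTATAA"
org_b = "ATGTTATTACTGTTATAA"
print(codon_usage(org_a, "L"))  # {'CTG': 0.75, 'CTA': 0.25}
print(codon_usage(org_b, "L"))  # {'TTA': 0.75, 'CTG': 0.25}
```

Both sequences encode the same protein, but the DNA differs, which is why comparison is done on the translated AA seq rather than on the nucleotides.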
what is Pfam
a database of protein families and domains, used to predict the domains of a protein from its AA seq.
How much of human genome codes for proteins?
1.3%
(about 23,000 protein-coding genes)
what is different about sub telomeric regions?
low in protein encoding genes.
what causes huge variation in gene length amongst protein encoding genes?
variation in intron size
exon size stays fairly constant, around 200bp.
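A quick arithmetic sketch of this point. The exon count and intron sizes are made-up round numbers for illustration, not measurements; only the roughly 200bp exon size comes from the card.

```python
def gene_length(exon_sizes, intron_sizes):
    """Genomic span of a gene: all exons plus the introns between them."""
    if len(intron_sizes) != len(exon_sizes) - 1:
        raise ValueError("expected one intron between each pair of exons")
    return sum(exon_sizes) + sum(intron_sizes)

exons = [200] * 5                 # five exons of about 200bp each
short_introns = [100] * 4         # hypothetical compact gene
long_introns = [20_000] * 4       # hypothetical gene with large introns

print(gene_length(exons, short_introns))  # 1400
print(gene_length(exons, long_introns))   # 81000
```

Both genes have the same 1000bp of exon sequence, so the difference in length comes entirely from the introns.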