genome annotation Flashcards
1
Q
annotation
A
- describing features of raw sequence:
- genes
- regulatory features
- repeats
- areas of conservation
- syntenic regions
- evolutionary relationships between genomes
2
Q
process of annotation
A
- assign gene functions
- includes variable splice site identification
- identify regulatory features and functions
- can be subtle
- identify repeats
- comparative studies with other genomes
- mainly automated, with manual curation in places (e.g. VEGA)
- context is important
3
Q
simple metrics
A
- first step
- counting and analysing structure
- includes base confidence scores
- rolling GC content
- di/trinucleotide bias
- patterns present more often than expected by chance → role
- codon use
- increased use → common tRNAs → gene more easily expressed
- gene with rare codons expressed less
- housekeeping genes → more common codons
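A minimal sketch of two of the metrics above: dinucleotide bias as an observed/expected ratio, and codon usage as relative codon frequencies. The function names and the toy sequence are illustrative assumptions, and the input is assumed to be a plain A/C/G/T string already in frame.

```python
from collections import Counter


def dinucleotide_bias(seq):
    """Observed/expected ratio for each dinucleotide present in seq.

    Expected frequency assumes the two bases occur independently.
    A ratio well above 1 means the pair is over-represented.
    """
    seq = seq.upper()
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    return {
        pair: (count / total_pairs) / (base_freq[pair[0]] * base_freq[pair[1]])
        for pair, count in pairs.items()
    }


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: count / len(codons) for codon, count in counts.items()}


toy = "ATGGCGGCGGCGAAATAA"  # hypothetical toy sequence, 6 codons
usage = codon_usage(toy)
bias = dinucleotide_bias(toy)
print(usage["GCG"])  # fraction of codons that are GCG
```

Real codon-bias measures usually compare synonymous codons for the same amino acid. This sketch only counts raw frequencies.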
4
Q
GC content
A
- important for gene prediction
- influences survival in the environment
- varies widely between genomes and within a genome
- correlation between intron length and GC content
- much higher GC with shorter introns
- 65% (300 base) vs 30% (2300 base)
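Variation in GC content within a sequence can be seen with a rolling window. This is a rough sketch; the window and step sizes are arbitrary assumptions, and the toy sequence is invented.

```python
def rolling_gc(seq, window=100, step=50):
    """Return (start, gc_fraction) for each full window along seq."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        profile.append((start, gc))
    return profile


# Hypothetical toy sequence: a GC-rich half followed by an AT-rich half.
toy = "GC" * 50 + "AT" * 50
profile = rolling_gc(toy, window=100, step=50)
for start, gc in profile:
    print(start, gc)
```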
5
Q
human GC content
A
- ~38%, varies widely
- between 35% and 60% in a 100 kb fragment
- in genes:
- more uniform
- 45-50%
- regions of high GC generally higher gene density
- software-based prediction
6
Q
bacterial replication
A
- oriC on both strands → replication in both directions
- reverse strand delayed due to opening and Okazaki fragments
- ssDNA exposed → deamination of C to T 100x more frequently
- TG mismatch → mutate to TA in next round of replication
- loss of C on reverse strand
- relative decrease in G on reverse strand
- decrease in C on forward strand
- plot cumulative GC skew (G - C) along the genome
- minimum - origin
- maximum - termination
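A sketch of the skew plot idea, assuming skew is defined as the running count of G minus C along one strand. With that sign convention the minimum suggests the origin and the maximum the terminus. The sequence is treated as linear, so a circular chromosome would need extra handling, and the toy sequence is invented.

```python
def cumulative_gc_skew(seq):
    """Running (G - C) count; element i covers the first i bases."""
    skew = [0]
    for base in seq.upper():
        if base == "G":
            skew.append(skew[-1] + 1)
        elif base == "C":
            skew.append(skew[-1] - 1)
        else:
            skew.append(skew[-1])
    return skew


def skew_extremes(seq):
    """Return (position of first minimum, position of first maximum)."""
    skew = cumulative_gc_skew(seq)
    return skew.index(min(skew)), skew.index(max(skew))


toy = "CCCCAGGGGGGTCC"  # hypothetical toy sequence
origin_guess, terminus_guess = skew_extremes(toy)
print(origin_guess, terminus_guess)
```

On real genomes the skew is noisy, so the extremes are only candidates to check against other evidence.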
7
Q
GC skew in higher organisms
A
- multiple origins of replication
- 3 minima indicate 3 origins of replication, e.g. in archaea
- more difficult in eukaryotes
- yeast - 400 ori
- need to look for other patterns (ARS)
8
Q
eukaryotic GC skew
A
- ARS consensus sequence in yeast
9
Q
other measures
A
- dicodon counts
- frequency of occurrence of successive codon pairs
- 3rd base periodicity
- e.g. same nucleotide at n, n+3, n+6 etc
- length and occurrence of ORFs
- stretches between stop codons
- promoters and TF binding sites (subtle)
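Two of these measures are easy to sketch: dicodon counts and stop-free stretches in one reading frame. The function names and the `min_codons` parameter are assumptions made for illustration. Only one forward frame is scanned per call, and no start codon is required.

```python
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}


def dicodon_counts(cds):
    """Count successive in-frame codon pairs (hexamers stepping by 3)."""
    cds = cds.upper()
    return Counter(cds[i:i + 6] for i in range(0, len(cds) - 5, 3))


def open_stretches(seq, frame=0, min_codons=1):
    """Return (start, end) of stop-free stretches in one forward frame.

    end is exclusive and excludes the stop codon itself.
    """
    seq = seq.upper()
    stretches = []
    start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if (i - start) // 3 >= min_codons:
                stretches.append((start, i))
            start = i + 3
    # Trailing stretch up to the last complete codon.
    end = frame + ((len(seq) - frame) // 3) * 3
    if (end - start) // 3 >= min_codons:
        stretches.append((start, end))
    return stretches


toy = "ATGAAATAGCCCGGGTGAAT"  # hypothetical toy sequence
stretches = open_stretches(toy, frame=0)
pairs = dicodon_counts("ATGAAAATGAAA")
print(stretches, pairs.most_common(1))
```

A fuller scan would repeat this for frames 1 and 2 and for the reverse complement.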
10
Q
subtlety in human genome
A
- G and C more prevalent in first 50 nt of intron
- GGG 4x more frequent
- VWG consensus in exons
- IUPAC codes: V = not T (A/C/G), W = A/T, then G
- minimal periodicity of 10 nt
- weaker in introns
- phase bending potential towards major groove
- increased accessibility
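The VWG consensus can be located with a regular expression built from the IUPAC codes (V = A, C or G; W = A or T). This is a sketch only: the function name and toy sequence are assumptions, and overlapping matches are counted through a lookahead.

```python
import re

# V = [ACG], W = [AT], followed by a literal G.
# The lookahead lets overlapping occurrences be reported.
VWG = re.compile(r"(?=([ACG][AT]G))")


def vwg_positions(seq):
    """Return 0-based start positions of every VWG match in seq."""
    return [m.start() for m in VWG.finditer(seq.upper())]


toy = "CAGGTGTTG"  # hypothetical toy sequence
positions = vwg_positions(toy)
print(positions)
```

Spacing between successive positions could then be tabulated to look for the roughly 10 nt periodicity.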