Genome analysis - annotation Flashcards

1
Q

What is annotation?

A

Finding which parts of the genome are functional and what they do.

2
Q

What is the difference between structural and functional annotation?

A

Structural annotation is gene prediction: finding out which parts of the genome are functional. At this stage we do not yet know what the genes do.

Functional annotation is finding out what the predicted genes are doing.

3
Q

How do we find sequences with function?

A

Sequences with a function are normally under selection. The null hypothesis is therefore that the sequence evolves by drift alone (no function). If we can reject that null, the sequence is likely functional.

4
Q

How many times would you expect to find GACGGC in 1000 bases with a base composition of 70% G+C?

How does this relate to structural annotation?

A

With 70% G+C: P(G) = P(C) = 0.35 (the 70% split in two) and P(A) = P(T) = 0.15 (the remaining 30% split in two).
GACGGC contains five G/C bases and one A, so P(GACGGC) = 0.35^5 x 0.15 = 0.000788.

Number of 6-mers in 1000 bases: 1000 - 6 + 1 = 995 per strand, so 995 x 2 = 1990 on both strands.
Expected count = probability x number of 6-mers:
1990 x 0.000788 = 1.57 occurrences by chance alone. If we see the sequence more often than this, it occurs more often than randomness predicts, which suggests it may have a function and be important.
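
The same arithmetic can be sketched in Python. This is only an illustration of the calculation on this card; the function name and its parameters are made up for the example, and it assumes independent bases with G and C (and A and T) equally frequent.

```python
def expected_kmer_count(kmer, seq_len, gc_content, both_strands=True):
    """Expected number of chance occurrences of kmer in a random sequence."""
    p_gc = gc_content / 2        # P(G) = P(C)
    p_at = (1 - gc_content) / 2  # P(A) = P(T)
    prob = 1.0
    for base in kmer.upper():
        prob *= p_gc if base in "GC" else p_at
    windows = seq_len - len(kmer) + 1  # k-mer start positions on one strand
    if both_strands:
        # The reverse complement has the same probability under this model.
        windows *= 2
    return prob * windows

expected = expected_kmer_count("GACGGC", 1000, 0.70)
print(round(expected, 2))  # 1.57
```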

5
Q

When we do structural annotation, what are we trying to predict?

What do we start with and how does it work?

A

  • Protein coding genes
  • Stable RNA (tRNA, rRNA)
  • Small RNA
  • Regulatory elements

We usually start with the stable RNAs (the non-coding tRNA and rRNA genes) because they are easier to predict. tRNAs are easy to predict because of their specific structure. The algorithm works like a decision tree: it moves along the sequence in small windows and accumulates a score based on how many tRNA characteristics each window fulfils. It starts with the most conserved region, and if that does not match a tRNA it immediately moves on to the next window.

Once you have predicted the stable RNAs, you know that you do not have to try to predict anything else from those sequences.

6
Q

When we are trying to predict protein coding genes in structural annotation, what are we looking for? What is the challenge?

A
  • ORFs
  • exons and introns

The minimum requirement for a protein coding gene is an ORF, and the challenge is to determine whether an ORF is an actual gene or just random. Short ORFs are hard to distinguish from random noise. Long ORFs do not normally arise by chance, so a long ORF is likely to be a gene.

Telling real from random ORFs is especially difficult in eukaryotes, since their exons are short (~150 bp).

Another challenge is that the start codons are difficult to predict because the statistics do not change much with a change of start codon. So if there are two start codons close to each other it is very hard to know which one is used and where the gene actually starts.
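
A minimal sketch of ORF scanning, for illustration only. It assumes ATG as the only start codon, scans just the forward strand, and always takes the first ATG in a frame as the start, which is exactly the arbitrary choice the start-codon problem above is about. The function name and the `min_codons` threshold are made up for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs of ATG...stop ORFs in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in this frame opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop
                start = None
    return orfs

# The ORF below has a second in-frame ATG at index 8; sequence statistics
# alone cannot tell which of the two is the real start.
demo_orfs = find_orfs("CCATGAAAATGGGCTAACC")
print(demo_orfs)  # [(2, 17)]
```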

7
Q

How can we determine if an ORF is indicating a gene or if it is random?

A

We can look at patterns of codon usage within the ORF to see if the patterns match what we know real genes look like.

Real genes should show consistent patterns of codon usage because non-synonymous changes are purged by selection, so as not to alter the amino acid sequence and the function of the gene. The changes we do see should mainly be in the third position of the codon, because those changes are most often synonymous.

We can also calculate how many ORFs of the observed length we expect to find by chance alone in the sequence. If that number is very low, the ORF is more likely to be a real gene.
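
As a rough illustration of the first idea, codon usage in a candidate ORF can be tabulated as below. This is a sketch, not a gene predictor: the function name is made up, and the comparison against a reference table built from known genes is left out.

```python
from collections import Counter

def codon_usage(orf):
    """Relative frequency of each codon in an in-frame ORF sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore trailing bases that do not fill a codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

usage = codon_usage("ATGGCTGCTGCCAAATAA")
# GCT and GCC both encode alanine; a real gene would be expected to show
# the same preference between them as other genes in the same genome.
print(usage["GCT"], usage["GCC"])
```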

8
Q

What is the expected number of ORFs longer than 450 bp in a circular DNA molecule with uniform base composition and of 6000 bp length?

A

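First calculate the probability of a start codon and how many we expect to find in the 6000 bp sequence. There are 64 possible codons and one of them (ATG) is the start codon.
probability = 1/64

The molecule is circular, so every position can begin a codon, and there are two strands:
6000 bases x 2 x (1/64) = 187.5 start codons in our sequence.

After the start we need 150 codons (450 bp / 3) in a row that are not stop codons. Each codon is a non-stop with probability 61/64, because 3 of the 64 codons are stops:
(61/64)^150 = 0.000746

187.5 x 0.000746 = 0.14 ORFs longer than 450 bp expected in this sequence by chance.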
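
The calculation can be checked with a few lines of Python. This sketch assumes what the card assumes: uniform base composition, ATG as the only start codon, three stop codons, and a circular molecule so that every position on both strands counts. The function name and parameters are made up for the example.

```python
def expected_random_orfs(genome_len, min_codons, strands=2):
    """Expected number of chance ORFs with at least min_codons non-stop codons."""
    p_start = 1 / 64      # one start codon (ATG) out of 64 codons
    p_not_stop = 61 / 64  # 3 of the 64 codons are stops
    expected_starts = genome_len * strands * p_start
    p_stays_open = p_not_stop ** min_codons
    return expected_starts * p_stays_open

orfs_by_chance = expected_random_orfs(6000, 450 // 3)
print(round(orfs_by_chance, 2))  # 0.14
```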

9
Q

Why is it easier to predict functioning genes in prokaryotes than eukaryotes?

A

Prokaryotic genes do not have introns, so gene prediction is more straightforward: one ORF represents one functional gene.

In eukaryotes the ORFs are interrupted by introns, and the exons we want to predict are usually short (~150 bp) and therefore hard to separate from random ORFs. Even if we can predict all the exons, we still cannot predict all proteins because of alternative splicing: we do not know which exons are kept and which are cut out during splicing. This makes it very difficult to use ab initio methods for eukaryotic gene prediction.

We can look at the transcriptome of eukaryotes to see the same thing as for prokaryotes, but alternative splicing is still a problem.

10
Q

What are ab initio methods? Can we use them alone for eukaryotic gene prediction?

A

Gene prediction methods that predict genes based solely on the sequence itself.

They are normally enough for gene prediction in prokaryotes, but they are not used alone for eukaryotes because of their complex genomes. For eukaryotes we must combine these methods with actual evidence to get the structural annotation.

11
Q

What can we do to get higher-level structural annotations for eukaryotes? What evidence can we use to be more confident in the annotations?

A

The first level is just running an ab initio predictor once, but we can make the annotation a bit better by running the ab initio prediction many times.

The highest level of annotation is if we manually work on the genome.

One kind of evidence comes from searching the predicted genes against protein databases such as SwissProt, whose sequences are curated and manually checked.

By aligning RNA-seq reads to the genome assembly, we can identify regions of transcription and infer the locations of gene exons and introns.

Curated nucleotide and protein sequences serve as references for known genes and proteins.
Alignment of these sequences to the genome assembly allows for the identification of homologous regions and the validation of predicted gene models.

12
Q

Give examples of ways to get functional annotations of your predicted genes.

A
  • Similarity searches - BLAST
  • Domain and motif searches - can say something general about the function.
  • Protein fold predictions - can tell us where in the cell the protein is functioning. However, they are very costly.
  • Clustering of sequences into families.
13
Q

What are the main sources of error in functional annotation?

A
  • experimental errors
  • divergent functions - sequence similarity does not guarantee functional conservation.
  • the protein may have more than one function
  • similarity errors - the previous annotation that you based yours on is wrong. The source of information is important, so use curated databases.
14
Q

How do ab initio methods predict genes?

A

Based only on intrinsic sequence information.

Sequence information that helps these methods predict where the genes are located:

  • Sequence motifs such as splice sites, start codons and stop codons, which indicate that there is an ORF there. The ab initio methods may, for example, look for ORFs above a length threshold as a basis for gene prediction.
  • To find out whether the ORFs are real genes and not just random, these methods also look for specific nucleotide patterns that match what genes look like in training data. GeneMark, for example, looks at dicodons. Patterns of codon usage can signal that a region is conserved (e.g. changes only in the third codon position, because those changes are usually synonymous).
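
To make "dicodons" concrete: a dicodon is a pair of adjacent in-frame codons (six bases). The sketch below only counts them in one sequence. It is not GeneMark's actual model, and the function name is made up for the example. A real predictor would compare such counts between training genes and non-coding sequence.

```python
from collections import Counter

def dicodon_counts(orf):
    """Count adjacent in-frame codon pairs (dicodons) in an ORF sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore trailing bases that do not fill a codon
    # Step one codon at a time so that consecutive dicodons overlap by one codon.
    return Counter(orf[i:i + 6] for i in range(0, usable - 3, 3))

pairs = dicodon_counts("ATGGCTGCTGCTTAA")
print(pairs.most_common(1))  # [('GCTGCT', 2)]
```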
15
Q

How can we use comparative genetics to get functional annotations of a gene in a new species?

A

We can get functional annotations from orthologues.

Orthologues are genes in other species that descend from the same ancestral gene as your focal gene and usually have the same function. Comparing them can help infer functional annotations, as long as the time since divergence is not too long.
