Gene annotation Flashcards
What is structural vs functional gene annotation?
In structural annotation we want to find the regions of interest in the genome. This is called gene prediction - meaning that we try to find where in the genome the genes are and their structure.
In functional annotation we assign biological meaning to those genes: which gene provides which biological function.
What are the main approaches to gene prediction?
Ab Initio methods find genes using statistical methods and machine learning, without using experimental data.
Similarity based methods predict genes based on comparing the unknown genome to a known genome.
Hybrid methods are Ab Initio methods that integrate external evidence.
How does the Ab Initio method predict genes?
The algorithm takes an observed sequence and predicts genes based on ORFs, known patterns of nucleotides, patterns of codon usage etc.
It uses a HMM to find the most probable structure of the gene given the observed sequence, instead of relying on external evidence.
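The ORF signal mentioned above can be sketched in a few lines. This is only a minimal illustration, assuming a forward-strand scan for ATG followed by an in-frame stop codon; the function name find_orfs and the min_codons cutoff are made up for this example, and real gene finders combine many more signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) positions of forward-strand ORFs, end exclusive.

    An ORF here is an ATG followed in the same frame by a stop codon.
    min_codons counts the start and stop codons too (illustrative cutoff).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG
    return sorted(orfs)

print(find_orfs("CCATGAAATGACC"))
```

For the toy sequence the only ORF found is ATGAAATGA at positions 2 to 11.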
What are the advantages and disadvantages of Ab Initio methods?
Advantages: They can be used to annotate genes that are lowly expressed. They are fast and easy.
Disadvantages: They need to be trained, meaning that they are not a good alternative for annotating genomes that we know very little about. They can give false positives.
What are the advantages and disadvantages of the similarity based methods?
Advantages: They make biologically relevant predictions and produce evidence that can be used in the Ab Initio methods.
Disadvantages: Lowly expressed genes will most likely not be caught.
How can we do functional annotation?
BLAST the gene against sequences with known function and assign the function from the resulting alignment.
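As a rough sketch of the "assign the function from the alignment" step: assuming the BLAST results are in the 12-column tabular format (-outfmt 6 with default fields), one could keep the best-scoring hit per gene and transfer its annotation. The function name best_hits, the e-value cutoff and the sequence IDs are made up for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(blast_tabular, max_evalue=1e-5):
    """Map each query gene to its highest-bitscore hit below an e-value cutoff."""
    best = {}
    reader = csv.DictReader(io.StringIO(blast_tabular),
                            fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        evalue, bitscore = float(row["evalue"]), float(row["bitscore"])
        if evalue > max_evalue:
            continue  # too weak to transfer a function from
        query = row["qseqid"]
        if query not in best or bitscore > best[query][1]:
            best[query] = (row["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}

# Hypothetical hits: gene1 has two hits, gene2 only a weak one.
rows = [
    ["gene1", "protA", "98.0", "200", "4", "0", "1", "200", "1", "200", "1e-100", "380"],
    ["gene1", "protB", "60.0", "150", "60", "2", "1", "150", "5", "155", "1e-20", "120"],
    ["gene2", "protC", "30.0", "50", "35", "1", "1", "50", "1", "50", "0.5", "25"],
]
print(best_hits("\n".join("\t".join(r) for r in rows)))
```

Here gene1 would inherit the function of protA, while gene2 stays unannotated because its only hit fails the cutoff.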
What does a HMM do in an Ab Initio algorithm?
It takes an observed sequence and gives you the most probable state path (pattern of exons and introns).
It predicts the structure of the gene.
In a HMM in an Ab Initio model, what is:
- A state
- Transitions
- Observations
States = If you have a DNA sequence and you want to find the genes, you label each part of the sequence with a “state”. Each state represents a different kind of region in the sequence, like exons and introns.
Transitions = The HMM gives you the probability of moving from one state to another along the DNA sequence. How likely is it to go from an exon to an intron?
Observations = As you walk through the sequence you cannot see the genes since they are “hidden” but you can see the nucleotides. The HMM gives you the probability of observing specific nucleotides in each state. The nucleotides are the observations.
In a HMM in an Ab Initio model, what is:
- transition probability matrix
- Emission probability matrix
What are the uses of these?
Transition probability matrix = The probability of moving from one state (exon or intron) to another state (exon or intron) given what state you are currently in.
Emission probability matrix = The probability of observing a particular nucleotide given that you are in an exon or intron. The emission probabilities are used to calculate the likelihood of observing a given sequence when the HMM is in a specific state. In gene prediction the HMM uses this to determine which regions of a DNA sequence are likely to be exons and introns based on the observed nucleotides.
How do you calculate the probability of a given state path given an observed sequence?
The probability of a state path q1..qt together with an observed sequence is the product of all the transition probabilities and emission probabilities along the path.
That is: the starting probability of the first state, multiplied at each position by the transition probability from the previous state to the current one and the emission probability of the current nucleotide given the state that you’re in. This product is proportional to the probability of the path given the sequence, so it ranks the paths in the same order.
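Written out as a formula, the product looks as follows. The symbols are assumptions made for this card: x1..xT is the observed sequence, \pi the starting probabilities, a the transition probabilities and e the emission probabilities.

```latex
P(x_1 \dots x_T,\ q_1 \dots q_T)
  = \pi_{q_1}\, e_{q_1}(x_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, e_{q_t}(x_t)
```

For example, for the two-base sequence GC with path exon, exon this is \pi_{exon} · e_{exon}(G) · a_{exon,exon} · e_{exon}(C).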
What is the Viterbi algorithm useful for?
The Viterbi algorithm finds the most probable state path given a sequence and a hidden markov model.
The HMM does however need to be trained on known gene structures so that the transition probabilities and emission probabilities for each state are correct.
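A minimal sketch of the Viterbi algorithm on a two-state toy HMM. All probabilities below are invented for illustration (GC-rich exons, AT-rich introns) and are not trained values; a real gene finder has many more states. Log-probabilities are used so that long sequences do not underflow, and the sequence is assumed to be non-empty.

```python
import math

STATES = ("exon", "intron")
# Invented toy parameters, not trained on real gene structures.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(seq):
    """Return the most probable state path for a non-empty seq."""
    # scores[s] = log-probability of the best path so far that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for nucleotide in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # best previous state to arrive in s from
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][nucleotide]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCGCATATATAT"))
```

With these toy numbers the GC-rich first half is labelled exon and the AT-rich second half intron.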
How do you assess the quality of an annotation?
You compare the prediction to a reference gene model.
Sensitivity = true positives / (true positives + false negatives)
Specificity = true positives / (true positives + false positives)
Accuracy = (sensitivity + specificity) / 2
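A small sketch of these three measures, assuming the comparison is done per base: each position is marked True if it is annotated as part of a gene. The function name annotation_metrics and the example labels are made up, and the sketch assumes both the prediction and the reference contain at least one True position.

```python
def annotation_metrics(predicted, reference):
    """Compare two equal-length per-base labelings (True = part of a gene)."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)       # predicted and in the reference
    fp = sum(p and not r for p, r in pairs)   # predicted but not in the reference
    fn = sum(r and not p for p, r in pairs)   # in the reference but missed
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy

predicted = [True, True, True, False, False, False, True, False]
reference = [True, True, False, False, False, True, True, True]
print(annotation_metrics(predicted, reference))
```

In this example there are 3 true positives, 1 false positive and 2 false negatives, so the sensitivity is 3/5 = 0.6 and the specificity is 3/4 = 0.75.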
What is the strongest evidence for gene annotation?
RNA sequencing data, since it directly shows transcription. RNA-seq is empirical data.
How do repeat-rich sections affect genome assembly and gene annotation?
They affect both of them badly.
It is hard to assemble a genome that is repeat rich, since the assembler will have a hard time figuring out the actual sequence if there are too many repeats.
If you leave the repeats unmasked they will also affect the annotation:
- They generate millions of spurious BLAST hits, and it is hard to do the functional annotation with too many hits.
- ORFs in transposons can be wrongly annotated as additional exons.
What types of repeat masking methods are there?
De novo identification
Similarity search against repeat libraries (e.g. RepeatMasker)