Sequence Statistics Flashcards
Explain the multinomial sequence model
The multinomial sequence model assumes that nucleotides appear independently from each other and with a fixed probability, according to a given distribution.
The probability of a sequence s is obtained by multiplying the probabilities of the nucleotides observed at each position: P(s) = p(s_1) * p(s_2) * ... * p(s_n).
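A minimal Python sketch of this product, assuming the nucleotide distribution is supplied as a dictionary. The names sequence_probability, sequence_log_probability and uniform are illustrative and not part of the original card. The log version is included because the plain product underflows for long sequences.

```python
import math

def sequence_probability(seq, probs):
    """Probability of seq under the multinomial (independent positions) model."""
    return math.prod(probs[base] for base in seq)

def sequence_log_probability(seq, probs):
    """Same quantity in log space, which avoids underflow on long sequences."""
    return sum(math.log(probs[base]) for base in seq)

# Hypothetical distribution: all four nucleotides equally likely.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = sequence_probability("ACGT", uniform)  # 0.25 ** 4
```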
Why is GC content GC(s) = (n(G,s) + n(C,s))/n enough to characterize all nucleotide frequencies?
- The content of G and C is often very similar (just like the content of A and T).
- The sum of all four frequencies has to be 1.
So GC content implicitly gives all four nucleotide frequencies: fr(G) ≈ fr(C) ≈ GC(s)/2 and fr(A) ≈ fr(T) ≈ (1 - GC(s))/2.
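The reasoning above can be sketched in Python. The function names are illustrative, and frequencies_from_gc rests on the stated assumption that G and C (and likewise A and T) occur equally often, which holds only approximately in real sequences.

```python
def gc_content(seq):
    """GC(s) = (n(G,s) + n(C,s)) / n."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def frequencies_from_gc(gc):
    """Approximate all four frequencies from GC content alone.

    Assumes fr(G) == fr(C) and fr(A) == fr(T); the four values sum to 1.
    """
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}

gc = gc_content("GGCCAATT")          # 4 of 8 bases are G or C
approx = frequencies_from_gc(gc)
```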
List two potential uses for GC content GC(s) = (n(G,s) + n(C,s))/n
- Tell the difference between genomes of different organisms
- Tell the difference between coding and non-coding regions (human genes are comparatively GC-rich)
Give the formula for GC skew
(#G - #C) / (#G + #C)
GC skew is calculated at successive positions in intervals (windows) of specific width
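A short Python sketch of the windowed calculation. The window width and step size are free parameters chosen here for illustration, and returning 0.0 for a window without any G or C is an assumption made to avoid division by zero.

```python
def gc_skew(seq):
    """(#G - #C) / (#G + #C) for one stretch of sequence."""
    g = seq.count("G")
    c = seq.count("C")
    if g + c == 0:
        return 0.0  # assumption: treat a window with no G or C as zero skew
    return (g - c) / (g + c)

def windowed_gc_skew(seq, window, step):
    """GC skew of each full-width window, starting every `step` positions."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

skews = windowed_gc_skew("GGGGCCCC", window=4, step=4)  # [1.0, -1.0]
```

In this toy sequence the skew flips from +1 to -1 between the two windows, which is the kind of sign change looked for along a real genome.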
What happens to GC skew at origins and termini of replication?
GC skew often changes sign at origins and termini of replication
Give the formula of odds ratio for a dinucleotide AG
OR = fr(AG,s) / (fr(A,s) * fr(G,s))
where fr(X,s) = n(X,s)/n is the (relative) frequency of X in s
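A Python sketch of this odds ratio. Occurrences are counted with overlaps, and each frequency is normalised by the number of positions at which a word of that length can start (n - k + 1). That normalisation is an assumption of this sketch; for long sequences it is nearly identical to dividing by n. The function names are illustrative.

```python
def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def frequency(seq, word):
    """Relative frequency of word among all start positions of that length."""
    positions = len(seq) - len(word) + 1
    return count_overlapping(seq, word) / positions

def dinucleotide_odds_ratio(seq, dinuc):
    """Observed dinucleotide frequency over the product of its base frequencies."""
    expected = frequency(seq, dinuc[0]) * frequency(seq, dinuc[1])
    if expected == 0:
        return float("nan")  # undefined when either base is absent
    return frequency(seq, dinuc) / expected

odds = dinucleotide_odds_ratio("AGAGAGAG", "AG")  # (4/7) / (0.5 * 0.5)
```

Here AG is clearly over-represented: the ratio is well above 1.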
How would you use odds ratio (OR) to find out if a k-mer is over/under-represented in a sequence?
Compare the observed frequency of the k-mer with the frequency expected if it were a random combination of shorter l-mers (l < k).
Any significant deviation of OR from 1 signals that the k-mer is over-represented (OR > 1) or under-represented (OR < 1).
Explain the idea behind a first order Markov chain
Every nucleotide in sequence X depends (only) on the previous nucleotide.
Probability of observing nucleotide b at position t given nucleotide a at position t-1 is given by a conditional probability p_{ab} = P(X_t = b | X_{t-1} = a)
Give the formula for estimating transition probabilities in a first order Markov chain.
Probability of transition from a to b equals dinucleotide frequency / base frequency of nucleotide a
p_{ab} = P(X_t = b | X_{t-1} = a) = P(X_t = b, X_{t-1} = a) / P(X_{t-1} = a)
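A minimal Python sketch of this estimate from a single sequence. Dinucleotide counts are divided by how often the first nucleotide appears in a non-final position, so the estimates for each nucleotide a sum to 1. The function name and the toy sequence are illustrative, and no pseudocounts are added.

```python
from collections import Counter

def transition_probabilities(seq):
    """Estimate p_ab = n(ab) / n(a), counting a only where it has a successor."""
    pair_counts = Counter(zip(seq, seq[1:]))   # overlapping dinucleotides
    first_counts = Counter(seq[:-1])           # bases that start a dinucleotide
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

p = transition_probabilities("AACA")
# Dinucleotides: AA, AC, CA. A starts two of them, C starts one.
```

Transitions that never occur in the sequence are simply absent from the returned dictionary, which corresponds to an estimated probability of 0.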
What is the starting codon?
ATG, which codes for methionine (M)
Give a verbal explanation for the codon adaptation index (CAI)
CAI compares the distribution of codons actually used in a particular protein with the preferred codons for highly expressed genes.
Give the formula for the codon adaptation index (CAI)
CAI = (\prod_{k=1}^n p_k / q_k)^{1/n}
where n = the number of codons in the gene,
p_k = the probability of codon k being used in highly expressed genes,
and q_k = the highest such probability among the codons coding for the same amino acid as codon k
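A Python sketch of this geometric mean, computed in log space. The two-amino-acid codon table and the usage probabilities below are made-up toy values for illustration only, not real codon usage data. The sketch also assumes that every codon in the gene has a nonzero usage probability.

```python
import math

def codon_adaptation_index(gene, usage, amino_acid):
    """Geometric mean of p_k / q_k over the codons of gene.

    usage:      codon -> probability of use in highly expressed genes
    amino_acid: codon -> amino acid it codes for
    """
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [gene[i:i + 3] for i in range(0, len(gene) - len(gene) % 3, 3)]
    log_sum = 0.0
    for codon in codons:
        p = usage[codon]
        # q: best usage among codons for the same amino acid.
        q = max(u for c, u in usage.items()
                if amino_acid[c] == amino_acid[codon])
        log_sum += math.log(p / q)
    return math.exp(log_sum / len(codons))

# Hypothetical toy table: two amino acids with two synonymous codons each.
amino_acid = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E"}
usage = {"AAA": 0.25, "AAG": 0.75, "GAA": 0.5, "GAG": 0.5}

best = codon_adaptation_index("AAGGAA", usage, amino_acid)   # only preferred codons
worse = codon_adaptation_index("AAAGAA", usage, amino_acid)  # one rare codon
```

A gene built only from preferred codons scores 1, and each rarer codon pulls the index below 1.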