Sequence Statistics Flashcards

1
Q

Explain the multinomial sequence model

A

The multinomial sequence model assumes that nucleotides occur independently of each other, each with a fixed probability given by a single distribution over A, C, G and T.

The probability of a sequence s = s_1 ... s_n is obtained by multiplying the probabilities of its nucleotides: P(s) = \prod_{i=1}^n p(s_i)
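
The product rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the nucleotide distribution are assumptions made up for the example, and the sum of logarithms is used only to avoid numerical underflow on long sequences.

```python
import math

def sequence_log_probability(seq, probs):
    """Log-probability of seq under the multinomial (independent sites) model."""
    return sum(math.log(probs[base]) for base in seq)

# Hypothetical nucleotide distribution, chosen only for illustration.
probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

log_p = sequence_log_probability("ACGT", probs)
p = math.exp(log_p)  # same value as 0.3 * 0.2 * 0.2 * 0.3
print(p)
```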

2
Q

Why is GC content GC(s) = (n(G,s) + n(C,s))/n enough to characterize all nucleotide frequencies?

A
  • The frequencies of G and C are usually very similar to each other (just like the frequencies of A and T).
  • The sum of all four frequencies has to be 1.

So GC content implicitly gives all four nucleotide frequencies: fr(G) ≈ fr(C) ≈ GC(s)/2 and fr(A) ≈ fr(T) ≈ (1 - GC(s))/2
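
As a rough sketch of this argument, the snippet below computes GC(s) and then derives approximate single-nucleotide frequencies from it. The function names are hypothetical, and the second function relies entirely on the assumption that fr(G) ≈ fr(C) and fr(A) ≈ fr(T), which need not hold exactly for a given sequence.

```python
def gc_content(seq):
    """GC(s) = (n(G,s) + n(C,s)) / n."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def approx_frequencies(gc):
    """Approximate frequencies, ASSUMING fr(G) = fr(C) and fr(A) = fr(T)."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}

gc = gc_content("ATGCGCAT")      # 4 of the 8 bases are G or C
freqs = approx_frequencies(gc)   # the four values sum to 1 by construction
print(gc, freqs)
```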

3
Q

List two potential uses for GC content GC(s) = (n(G,s) + n(C,s))/n

A
  1. Distinguish between the genomes of different organisms
  2. Distinguish between coding and non-coding regions (human genes are comparatively GC-rich)
4
Q

Give the formula for GC skew

A

(#G - #C) / (#G + #C)

GC skew is calculated in successive intervals (windows) of a fixed width along the sequence
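
A windowed GC skew calculation might look like the sketch below. The function names, the non-overlapping default step, and the choice to return 0.0 for a window containing no G or C are all assumptions made for the example, not part of the definition.

```python
def gc_skew(window):
    """(#G - #C) / (#G + #C) for one window."""
    g = window.count("G")
    c = window.count("C")
    if g + c == 0:
        return 0.0  # assumed convention for windows without any G or C
    return (g - c) / (g + c)

def gc_skew_windows(seq, width, step=None):
    """GC skew in successive windows; windows do not overlap unless step < width."""
    step = step or width
    seq = seq.upper()
    return [gc_skew(seq[i:i + width])
            for i in range(0, len(seq) - width + 1, step)]

# Toy sequence: G-rich first half, C-rich second half.
skews = gc_skew_windows("GGGCAACCCG", width=5)
print(skews)
```

On this toy input the skew is positive in the first window and negative in the second, which is the kind of sign change one looks for along a real genome.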

5
Q

What happens to GC skew at origins and termini of replication?

A

GC skew often changes sign at origins and termini of replication

6
Q

Give the formula of odds ratio for a dinucleotide AG

A

OR = fr(AG,s) / (fr(A,s) * fr(G,s))

where fr(X,s) = n(X,s)/n is the (relative) frequency of X in s
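
The formula can be tried out with a short sketch like the following. The helper names are hypothetical. One detail is an assumption of this sketch rather than something stated on the card: occurrences are counted at overlapping positions and normalised by the number of positions where the pattern could start (n for single nucleotides, n - 1 for dinucleotides).

```python
def frequency(pattern, seq):
    """Relative frequency of pattern among all start positions in seq."""
    k = len(pattern)
    positions = len(seq) - k + 1  # assumption: normalise by possible start positions
    hits = sum(1 for i in range(positions) if seq[i:i + k] == pattern)
    return hits / positions

def dinucleotide_odds_ratio(dinuc, seq):
    """OR = fr(XY, s) / (fr(X, s) * fr(Y, s)); both bases must occur in seq."""
    observed = frequency(dinuc, seq)
    expected = frequency(dinuc[0], seq) * frequency(dinuc[1], seq)
    return observed / expected

odds = dinucleotide_odds_ratio("AG", "AGAGAGTC")
print(odds)  # greater than 1: AG occurs more often than independence predicts
```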

7
Q

How would you use odds ratio (OR) to find out if a k-mer is over/under-represented in a sequence?

A

Compare the observed frequency of the k-mer with the frequency expected if the k-mer were a random combination of l-mers, where l is smaller than k. The OR is the ratio of the observed to the expected frequency.

Any significant deviation of the OR from 1 signals that the k-mer is either over-represented (OR > 1) or under-represented (OR < 1)
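
One simple instance of this idea takes l = 1, so the expected frequency of a k-mer is the product of the frequencies of its individual nucleotides. The sketch below computes the OR for every k-mer that occurs in a sequence under that assumption. The function name is hypothetical, and the sketch does not decide what counts as a significant deviation; that would need a separate statistical test.

```python
from collections import Counter
from math import prod

def kmer_odds_ratios(seq, k):
    """OR of each observed k-mer, with the expectation built from single nucleotides (l = 1)."""
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    positions = n - k + 1  # number of overlapping k-mer start positions
    kmer_counts = Counter(seq[i:i + k] for i in range(positions))
    return {
        kmer: (count / positions) / prod(base_freq[b] for b in kmer)
        for kmer, count in kmer_counts.items()
    }

ratios = kmer_odds_ratios("ATATATATGC", 2)
for kmer, ratio in sorted(ratios.items()):
    print(kmer, round(ratio, 2))
```

k-mers that never occur in the sequence are absent from the result; their observed frequency, and hence their OR, is 0.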

8
Q

Explain the idea behind a first order Markov chain

A

Every nucleotide in sequence X depends (only) on the previous nucleotide.
The probability of observing nucleotide b at position t, given nucleotide a at position t-1, is the conditional probability p_{ab} = P(X_t = b | X_{t-1} = a)

9
Q

Give the formula for estimating transition probabilities in a first order Markov chain.

A

The probability of a transition from a to b is estimated as the frequency of the dinucleotide ab divided by the frequency of nucleotide a

p_{ab} = P(X_t = b | X_{t-1} = a) = P(X_t = b, X_{t-1} = a) / P(X_{t-1} = a)
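
In practice the estimate reduces to a ratio of counts, n(ab) / n(a). The sketch below is one way to compute it; the function name is hypothetical. As an assumption of this sketch, nucleotide a is counted only at positions that have a successor, so that the estimated probabilities out of each nucleotide sum to 1.

```python
from collections import Counter, defaultdict

def transition_probabilities(seq):
    """Estimate p_ab = n(ab) / n(a) from one sequence."""
    pair_counts = Counter(zip(seq, seq[1:]))
    # Count each base only where it has a successor, so every row sums to 1.
    from_counts = Counter(seq[:-1])
    probs = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        probs[a][b] = count / from_counts[a]
    return dict(probs)

p = transition_probabilities("AACGA")
print(p)
```

Transitions that never occur in the sequence are simply missing from the result, which corresponds to an estimated probability of 0.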

10
Q

What is the start codon?

A

ATG, which codes for methionine (M)

11
Q

Give a verbal explanation for the codon adaptation index (CAI)

A

CAI compares the distribution of codons actually used in a particular protein with the preferred codons for highly expressed genes.

12
Q

Give the formula for the codon adaptation index (CAI)

A

CAI = (\prod_{k=1}^n p_k / q_k)^{1/n}

where n = the number of codons in the gene,
p_k = the probability of the k-th codon being used in highly expressed genes,
and q_k = the highest probability among all codons coding for the same amino acid as the k-th codon
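
The geometric mean in this formula could be computed along the following lines. Everything concrete in the sketch is an assumption for illustration: the function name, the two lookup tables, and above all the usage probabilities, which are invented toy values rather than measured data.

```python
import math

def codon_adaptation_index(codons, p, synonymous):
    """Geometric mean of p_k / q_k over the codons of one gene."""
    log_sum = 0.0
    for codon in codons:
        q = max(p[c] for c in synonymous[codon])  # best synonymous codon
        log_sum += math.log(p[codon] / q)
    return math.exp(log_sum / len(codons))

# Hypothetical usage probabilities in highly expressed genes (toy values, not real data).
p = {"GCT": 0.5, "GCC": 0.25, "AAA": 0.8, "AAG": 0.2}
synonymous = {
    "GCT": ["GCT", "GCC"], "GCC": ["GCT", "GCC"],  # alanine (only two of its codons listed)
    "AAA": ["AAA", "AAG"], "AAG": ["AAA", "AAG"],  # lysine
}

cai = codon_adaptation_index(["GCT", "GCC", "AAA", "AAG"], p, synonymous)
print(cai)  # ratios are 1, 0.5, 1 and 0.25, so this is 0.125 ** 0.25
```

A gene that uses only the preferred codon for every amino acid gets a CAI of 1; every non-preferred codon pulls the value down.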
