Algos Problem 2: Finding Regulatory Motifs in DNA Flashcards

1
Q

How many k-mers exist for DNA?

A

A, T, C, G ….. 4 Nukleotiden

–> 4^k k-mers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the consensus of the following motifs?

T C G G G G g T T T t t
c C G G t G A c T T a C
a C G G G G A T T T t C
T t G G G G A c T T t t
a a G G G G A c T T C C
T t G G G G A c T T C C
T C G G G G A T T c a t
T C G G G G A T T c C t
T a G G G G A a c T a C
T C G G G t A T a a C C

A

Consensus(Motifs) T C G G G G A T T T C C

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the score of the following motifs?

T C G G G G g T T T t t
c C G G t G A c T T a C
a C G G G G A T T T t C
T t G G G G A c T T t t
a a G G G G A c T T C C
T t G G G G A c T T C C
T C G G G G A T T c a t
T C G G G G A T T c C t
T a G G G G A a c T a C
T C G G G t A T a a C C

A

Score(Motifs) 3 + 4 + 0 + 0 + 1 + 1 + 1+ 5 + 2 + 3 + 6 + 4 = 30

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Define the score of a set of motifs.

A

Score(Motifs): # of lowercase letters in motif matrix

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Brute Force Motif Search: Compute score of every possible choice of k-mers in DNA.

STOP and Think: What is this approach’s runtime?

A

Say that DNA consists of t strings of length n:

  • There are (n – k + 1)^t different choices of Motifs
  • Scoring the matrix requires k*t steps.

Overall runtime ~ O(nt * k * t) – too slow!

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Hamming distance between a sequence and Consensus(Motifs) ?

A

Number of lowercase symbols in a row

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Motif Finding Problem Runtime: n^t ∙k∙t

versus

Median String Problem Runtime: 4^k∙n∙t∙k

which is faster?

A

4^k∙n∙t∙k is faster

Although MedianString is much faster than BruteForceMotifSearch, it is
still slow for large k.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

From Motifs to Profile?

eg

T C G G G G g T T T t t
c C G G t G A c T T a C
a C G G G G A T T T t C
T t G G G G A c T T t t
a a G G G G A c T T C C
T t G G G G A c T T C C
T C G G G G A T T c a t
T C G G G G A T T c C t
T a G G G G A a c T a C
T C G G G T A T a a C C

A

frequency of nucleotidei in column j

eg

A: .2 .2 0 0 0 0 .9 .1 .1 .1 .3 0
C: .1 .6 0 0 0 0 0 .4 .1 .2 .4 .6
G: 0 0 1 1 .9 .9 .1 0 0 0 0 0
T: .7 .2 0 0 .1 .1 0 .5 .8 .7 .3 .4

Each column of a profile represents a biased 4-sided die with A, C,
G, and T on its sides. Thus, a profile corresponds to k dice.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Given the following Profile:

A 1/2 7/8 3/8 0 1/8 0
C 1/8 0 1/2 5/8 3/8 0
T 1/8 1/8 0 0 1/4 7/8
G 1/4 0 1/8 3/8 1/4 1/8

The probability of the consensus string:
Pr(AAACCT|Profile) = ???

A

The probability of the consensus string:

Pr(AAACCT|Profile) = 1/2 x 7/8 x 3/8 x 5/8 x 3/8 x 7/8 = 0.033646

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

The closer k-mer is to the consensus string, the _____ the Pr(k-mer| Profile) is.

A

larger

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Profile-most probable k-mer in a
sequence

A

the k-mer with the highest Pr(k-mer | Profile) among all k-mers in this sequence.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

shorthand for probability of a sequence according to a matrix

A

P(sequence|matrix) =

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

bei Motifsuche:
Welche Position hat die höhere (Shannon) Entropie? (Positionen fangen mit der Zahl 1 an.) Wann ist die Entropie am höchsten?

A

Position ____ hat die hoehere Entropie, da die Eintraege naeher an einer Gleichverteilung sind (alternativ: mehrere Eintraege fast gleich wahrscheinlich sind).

Die Entropie ist für eine Gleichverteilung am hoechsten.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

bei Motifsuche:
von ein Profilmatrix bekommt man fuer ein Sequenz die Wahrscheinlichkeit 0, weil an einer Stelle 0 ist, obwohl die andere Stellen hoeher Wahrscheinlichkeiten haben. Wie laesst sich diese Problem umgehen?

A

Obwohl die Wahrscheinlichkeit an den mesiten Stellen hoch ist, wird die gesamt wahrscheinlichkeit 0, weil an eine Stelle Null ist.
Pseudocounts werden hinzugefuegt nach Laplace, um eine Wahrscheinlichkeit von Null zu vermeiden

How well did you know this?
1
Not at all
2
3
4
5
Perfectly