Lecture 3 - Probability and Statistics for Sequence Analysis Flashcards

1
Q

What are some motivating examples for statistics for DNA?

A

-estimating the GC composition of a genome, i.e. the fraction of its bases that are cytosine or guanine
-GC composition varies nonrandomly between genomes, so some genomes have a low GC composition

2
Q

Why is GC composition analysis important?

A

-it affects methylation, since GCs tend to be methylated, and it affects the melting point of DNA, which rises as GC composition increases

3
Q

Given the GC compositions of a list of possible source genomes and a sequence read, can you identify the genome the read was most likely sequenced from?

A
  1. encode the genome and the read numerically so they can be analyzed computationally
  2. perform a statistical test (this requires a null model)
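
The two steps above can be sketched in Python. This is only a minimal illustration, assuming an i.i.d. null model in which G and C each have probability gc/2 and A and T each have probability (1 - gc)/2. The function names, genome names, and GC values are hypothetical and chosen for the example.

```python
import math

def log_likelihood(read, gc):
    """Log-probability of `read` under an i.i.d. model with GC fraction `gc`."""
    # Assumption: G and C are equally likely, and so are A and T.
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    return sum(math.log(p[base]) for base in read.upper())

def most_likely_genome(read, genomes):
    """Return the name of the genome whose GC composition best explains the read."""
    return max(genomes, key=lambda name: log_likelihood(read, genomes[name]))

# Hypothetical GC compositions, for illustration only.
genomes = {"genome_low_gc": 0.30, "genome_mid_gc": 0.50, "genome_high_gc": 0.65}
best = most_likely_genome("GCGCGGCCATGC", genomes)
print(best)
```

Comparing log-likelihoods rather than raw probabilities avoids numerical underflow on long reads.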
4
Q

What is a DNA binding protein?

A

-proteins that bind noncovalently to short (8-16 bp) stretches of DNA and cause repression or expression of a gene
-a given DNA binding protein has multiple binding sites in a genome

5
Q

What are the specific loci where DNA binding proteins (DBPs) bind called, and where are they often found?

A

-they are called binding sites and are often found near genes; the sequence pattern that is bound is called a motif

6
Q

Are the locations of binding sites known when a genome is sequenced?

A

-no they are unknown

7
Q

What are binding motifs and what is an example of one?

A

-the sequence patterns that DNA binding proteins typically bind to
-in the example motif, nearly every binding site has an A at the fourth and fifth positions and a T at the 12th position, and the third position is typically a G
-here we will consider a model where the binding motif has a fixed number of bases (k) and each position of the motif always holds the same nucleotide (100% frequency)

8
Q

What is a random nucleotide generator?

A

-a null model for DNA in which each output is independent of the outputs before it: the previous base does not dictate the next base
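
A random nucleotide generator of this kind might look like the following Python sketch. The function name, the default uniform probabilities, and the `seed` parameter are assumptions made for the example, not part of the lecture.

```python
import random

def random_nucleotides(n, probs=None, seed=None):
    """Generate n bases, each drawn independently from the same distribution."""
    if probs is None:
        # Assumed default: all four bases equally likely.
        probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    rng = random.Random(seed)
    bases = list(probs)
    weights = [probs[b] for b in bases]
    # choices() samples with replacement, so each draw ignores all earlier draws.
    return "".join(rng.choices(bases, weights=weights, k=n))

seq = random_nucleotides(20, seed=0)
print(seq)
```

Passing a different `probs` dictionary keeps the draws independent while changing the base composition.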

9
Q

What random variable is used to denote each base at a position i?

A

Li

10
Q

What is a random variable?

A

-a variable whose value is produced by some random generating process

11
Q

What is the domain of a random variable?

A

Li ∈ {A, C, G, T}
-the values a random variable can take on

12
Q

What is a discrete random variable?

A

-a random variable that takes on values from a finite (or countably infinite) set

13
Q

If a discrete random variable like Li can take on J separate values with probabilities p1, ..., pJ, what is the sum of its probabilities, and roughly what values can the individual probabilities take?

A

each value has a nonzero probability between 0 and 1, and the probabilities sum to 1

14
Q

What forms the probability distribution of a random variable?

A

-the values p1, ..., pJ
-that is, the probability P(X = j) that the random variable X takes on the value j, for every possible value j
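
As a small illustration, a probability distribution over the four bases can be estimated from the base frequencies of a sequence. The sketch below is an assumption-laden example (hypothetical function name, toy sequence, input assumed to contain only A, C, G, and T).

```python
from collections import Counter

def base_distribution(sequence):
    """Empirical distribution of bases; assumes the input holds only A, C, G, T."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    # One probability per value in the domain {A, C, G, T}.
    return {base: counts[base] / total for base in "ACGT"}

dist = base_distribution("AACGTTGCAA")
print(dist)
```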

15
Q

What does it mean for the random nucleotide generator if the bases are independent and identically distributed (i.i.d.)?

A

-each position Li is drawn independently of the others and from the same distribution (pA, pC, pG, pT)
-the simplest case is pA=pT=pC=pG=0.25, but i.i.d. does not by itself constrain the probabilities to be equal; the distribution must be stated in the problem

16
Q

What determines if two variables are independent?

A

their joint probability is the same as the product of their individual probabilities: P(X = x, Y = y) = P(X = x) P(Y = y) for all x and y

17
Q

What is an example of a pair of independent variables and a pair of dependent variables?

A

-Independent - X is the outcome of a 6-sided die and Y is a coin toss
-Dependent - X is the height of a person and Y is the person’s weight
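
The product rule for independence can be checked directly on a small joint distribution. The sketch below uses the die-and-coin example; the dependent case (Y records whether the die roll is even) is a toy stand-in chosen here because height and weight have no simple finite table. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """joint maps (x, y) -> P(X = x, Y = y); True if P(x, y) = P(x) P(y) everywhere."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals are obtained by summing the joint over the other variable.
    px = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    py = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == px[x] * py[y] for x, y in product(xs, ys))

# Fair 6-sided die and fair coin: every pair has probability 1/12.
die_and_coin = {(d, c): Fraction(1, 12) for d in range(1, 7) for c in "HT"}
# Toy dependent pair: Y is fully determined by X.
die_and_parity = {(d, "even" if d % 2 == 0 else "odd"): Fraction(1, 6)
                  for d in range(1, 7)}

print(is_independent(die_and_coin), is_independent(die_and_parity))
```

Exact fractions are used so the equality test is not affected by floating-point rounding.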

18
Q

Given the random variable A what is the probability of A given the complement of the random variable A?

A
19
Q

Given two random variables A and B, what is the probability of A or B happening?

A
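
One standard way to write this, treating A and B as events, is the addition rule:

```latex
P(A \cup B) = P(A) + P(B) - P(A \cap B)
```

The last term removes the overlap that would otherwise be counted twice; when A and B cannot both happen it is zero.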
20
Q

On a 3-way Venn diagram, show the area corresponding to P(A ∩ ¬(B ∪ C))

A
21
Q

What is the expectation of a random variable?

A

-the average, or mean, value of the random variable
-each value multiplied by its probability, summed over all values: E[X] = x1 p1 + ... + xJ pJ
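
The probability-weighted sum can be written in a few lines of Python. This sketch uses a fair six-sided die as an assumed example; the function name is hypothetical.

```python
from fractions import Fraction

def expectation(dist):
    """dist maps each value to its probability; returns the weighted sum."""
    return sum(value * p for value, p in dist.items())

# Fair die: each face 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mean_die = expectation(die)
print(mean_die)
```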

22
Q

How do you scale the expectation of a random variable by c?

A

just multiply the calculated expectation of a random variable by c

23
Q

Can you sum expectations? How do you sum them if the variables are independent and identically distributed?

A

Yes, you can sum them; if the variables are independent and identically distributed, you can just multiply one expectation by n, the number of variables

24
Q

What is the variance of a random variable?

A

the mean of the squares minus the square of the mean: Var(X) = E[X^2] - (E[X])^2

25
Q

What do you do if you want to scale the variance of a random variable?

A

multiply it by the square of the scale factor: Var(cX) = c^2 Var(X)
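
Both the variance formula and the scaling rules can be checked numerically. The following is a minimal sketch on a fair die with a scale factor of 3, both chosen arbitrarily; the helper names are hypothetical.

```python
from fractions import Fraction

def expectation(dist):
    return sum(v * p for v, p in dist.items())

def variance(dist):
    """Mean of the squares minus the square of the mean."""
    mean = expectation(dist)
    mean_of_squares = sum(v * v * p for v, p in dist.items())
    return mean_of_squares - mean ** 2

def scale(dist, c):
    """Distribution of cX; assumes c != 0 so scaled values stay distinct."""
    return {c * v: p for v, p in dist.items()}

die = {face: Fraction(1, 6) for face in range(1, 7)}
c = 3
print(variance(die), variance(scale(die, c)))
```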

26
Q

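When computing the expected number of a particular base, you need to use an auxiliary variable. What is it?

A

-Xi, an indicator variable that is 1 if position i holds the base (for example A) and 0 otherwise; then you let N = X1 + ... + Xn be the random variable representing the number of As in a sequence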
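
The indicator-variable idea can be sketched as follows. The example assumes i.i.d. positions with P(A) = 0.25 and uses a toy sequence; the function names are hypothetical.

```python
def indicators(sequence, base="A"):
    """X_i = 1 if position i holds `base`, otherwise 0."""
    return [1 if b == base else 0 for b in sequence]

def expected_count(n, p_base):
    """E[N] = E[X_1] + ... + E[X_n] = n * p_base for i.i.d. positions."""
    return n * p_base

x = indicators("AACGTTGCAA")
n_observed = sum(x)                         # N for this particular sequence
n_expected = expected_count(len(x), 0.25)   # expectation under the assumed model
print(n_observed, n_expected)
```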

27
Q

Calculating deviation from expectation - Use case: a genome is thought to have a 50% GC composition. A sequence of 100 bases is observed with 65 nucleotides that are either G or C. What is the probability of such a deviation?

A

-there are two outcome categories (G/C versus A/T), so this is a case with one degree of freedom: df = number of categories - 1 = 2 - 1 = 1
-use the X^2 (chi-square) statistic for one degree of freedom
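
For this use case the calculation can be sketched in plain Python. The sketch assumes the usual Pearson form of the statistic, the sum of (observed - expected)^2 / expected over the categories, and uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal, so its tail probability can be written with the complementary error function. Function names are hypothetical.

```python
import math

def chi_square_statistic(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf_1df(x):
    """P(X^2 >= x) for one degree of freedom, via erfc."""
    return math.erfc(math.sqrt(x / 2))

observed = [65, 35]   # G/C count, A/T count in the 100-base read
expected = [50, 50]   # counts expected under 50% GC composition
stat = chi_square_statistic(observed, expected)
p_value = chi_square_sf_1df(stat)
print(stat, p_value)
```

Under these assumptions the statistic is 9.0, and the tail probability is a little under 0.003.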

28
Q

What is the expected % of bases in a 100-base nucleotide read that are either G or C if the read came from a genome that was 50% G/C?

A
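
Under the i.i.d. null model, where each base is G or C with probability 0.5, the expected count follows from summing the per-position expectations:

```latex
E[N_{GC}] = \sum_{i=1}^{100} E[X_i] = 100 \times 0.5 = 50
```

That is, 50 of the 100 bases, or 50% of the read.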
29
Q

What if all four nucleotides were counted instead of GC composition? Given a genome of length n with Sa,Sc,Sg,St counts of each nucleotide and a sample sequence of 100 nucleotides what is the probability that the nucleotide counts deviate from what would be expected if the sample sequence was derived from the genome?

A

pA = Sa/n
pC = Sc/n
pG = Sg/n
pT = St/n

E(#As) = pA * 100
E(#Cs) = pC * 100
E(#Gs) = pG * 100
E(#Ts) = pT * 100

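the X^2 statistic for a variable that has J possible outcomes is calculated as:
X^2 = \sum_{j=1}^{J} (O_j - E_j)^2 / E_j, where O_j is the observed count and E_j the expected count for outcome j
-here J = 4 and the degrees of freedom are J - 1 = 3: only the counts of A, G, and C are free, because the number of Ts is fixed once we know the other three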
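
The four-nucleotide case can be sketched the same way. The genome counts and observed read counts below are made up for illustration, the statistic is assumed to be the usual Pearson sum over all four bases, and the tail probability uses the closed form for three degrees of freedom. Function names are hypothetical.

```python
import math

def expected_counts(genome_counts, read_length):
    """E(#base) = p_base * read_length, with p_base = S_base / n."""
    n = sum(genome_counts.values())
    return {b: read_length * s / n for b, s in genome_counts.items()}

def chi_square_statistic(observed, expected):
    """Pearson statistic summed over all J outcomes."""
    return sum((observed[b] - expected[b]) ** 2 / expected[b] for b in expected)

def chi_square_sf_3df(x):
    """P(X^2 >= x) for three degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Hypothetical counts, for illustration only.
genome_counts = {"A": 3000, "C": 2000, "G": 2000, "T": 3000}
observed = {"A": 20, "C": 30, "G": 35, "T": 15}   # a 100-base sample

expected = expected_counts(genome_counts, 100)
stat = chi_square_statistic(observed, expected)
p_value = chi_square_sf_3df(stat)
print(expected, stat, p_value)
```

A very small tail probability suggests the sample is unlikely to have come from a genome with this composition.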
