Lecture 3 - Probability and Stats for Sequence Analysis Flashcards
Why would you want to identify the GC composition of a human genome?
it is the fraction of G and C bases in the genome; GC-rich regions are often hypermethylated, and more GC raises the melting point of DNA because G-C pairs form more hydrogen bonds (three, versus two for A-T)
Given the GC composition of a list of possible source genomes and a sequence read, can you identify which is the most likely genome the read was sequenced from?
- need to turn each genome into a numerical encoding (for example, its GC fraction)
- perform a statistical test
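One way to sketch the two steps above, assuming the bases of the read are independent draws and each candidate genome is summarized only by its GC fraction: encode the read as a GC count, then score each genome with a binomial log-likelihood. The genome names and GC fractions are made-up illustration values, and `most_likely_genome` is a hypothetical helper, not something taken from the lecture.

```python
import math

def gc_count(read):
    """Number of G or C bases in the read."""
    return sum(base in "GC" for base in read.upper())

def log_likelihood(read, gc_fraction):
    """Log-probability of the read's GC/AT pattern if each base is G or C
    with probability gc_fraction, independently of the other bases."""
    k = gc_count(read)
    n = len(read)
    return k * math.log(gc_fraction) + (n - k) * math.log(1 - gc_fraction)

def most_likely_genome(read, genomes):
    """Name of the genome whose GC fraction gives the read the highest likelihood."""
    return max(genomes, key=lambda name: log_likelihood(read, genomes[name]))

# Hypothetical GC fractions, for illustration only.
genomes = {"genome_low_gc": 0.30, "genome_mid_gc": 0.50, "genome_high_gc": 0.65}
read = "GCGCGGCCATGCGCCGTAGC"  # 16 of 20 bases are G or C
best = most_likely_genome(read, genomes)
print(best)
```

A short read carries little information, so in practice the likelihoods of genomes with similar GC fractions will be close; the sketch only shows the mechanics.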
What are DNA-binding proteins?
-they are proteins which form noncovalent bonds with short stretches of DNA, and there are multiple binding sites within a genome
-the specific loci where DNA-binding proteins (DBPs) bind are called binding sites, and they are near genes - the pattern bound is called a motif
-when a genome is initially sequenced, the locations of binding sites are unknown
-they form ionic bonds
What does the LOGO binding motif show?
the relative frequency of each nucleotide at every position of the DNA binding sites - in the example, every binding site has an A at the fourth, fifth, and 12th positions and a G at the third position
-consider a model where the binding motif is a fixed number of bases k and each nucleotide of the motif occurs with 100% frequency
What is the probability that a certain chromosomal sequence is overrepresented in the genome?
need to consider a single chromosomal sequence on the forward strand, read from 5’ to 3’
What is the null model for DNA?
a random nucleotide generator whose output at each position does not depend on the outputs before or after it, meaning the positions are independent
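A minimal sketch of such a generator, assuming each position is drawn independently from one fixed set of nucleotide probabilities. The function name `random_sequence` and the `seed` parameter are illustrative choices, not part of the lecture.

```python
import random

def random_sequence(n, probs, seed=None):
    """Draw n nucleotides independently; probs maps each base to its probability."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    bases = list(probs)
    weights = [probs[b] for b in bases]
    # Each of the n draws ignores every other draw, which is the null model.
    return "".join(rng.choices(bases, weights=weights, k=n))

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
seq = random_sequence(20, uniform, seed=0)
print(seq)
```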
When generating random nucleotides what is each base at position i considered to be?
the random variable Li
What is the domain of the random variable Li, that is, the values it can take on?
Li ∈ {A, C, G, T}
What are discrete random variables?
they are random variables which can take on a finite (or countably infinite) set of values
If a discrete random variable Li can take on J separate values with probabilities p1, …, pJ, then the sum of all the probabilities is?
1
What forms the probability distribution of a random variable?
p1, …, pJ
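For the random nucleotide generator the only constraint we place on nucleotide probabilities is what?
pA + pC + pG + pT = 1
the probabilities are not constrained to be equal unless specified
What is i.i.d.?
independent and identically distributed: every position Li is drawn independently from the same distribution; in the uniform case, pA = pC = pG = pT = 0.25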
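Under this null model, the fixed-length motif of k bases mentioned earlier has a simple probability at any one position: the product of its nucleotide probabilities. The sketch below assumes independent positions and uses linearity of expectation for the expected number of occurrences in a sequence of length n; the motif "GATTACA" and the sequence length are arbitrary illustration values.

```python
def kmer_probability(kmer, probs):
    """Probability that one fixed position starts an exact copy of kmer,
    assuming independent nucleotides."""
    p = 1.0
    for base in kmer:
        p *= probs[base]
    return p

def expected_occurrences(kmer, n, probs):
    """Expected number of start positions matching kmer in a length-n sequence."""
    positions = max(n - len(kmer) + 1, 0)   # number of possible start positions
    return positions * kmer_probability(kmer, probs)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = kmer_probability("GATTACA", uniform)              # 0.25 ** 7
expected = expected_occurrences("GATTACA", 1_000_000, uniform)
print(p, expected)
```

A motif seen far more often than this expected count is a candidate for being overrepresented; deciding how much more is "far more" needs a statistical test.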
When are two random variables independent?
when the probability of both happening is the product of their individual probabilities, P(X and Y) = P(X)P(Y); for example, if event X is rolling a one on a six-sided die and Y is getting heads in a coin toss, the probability of both occurring is 1/6 × 1/2 = 1/12
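The die-and-coin example can be checked by enumerating all equally likely outcomes. This is a small sketch using exact fractions; the helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

die = range(1, 7)
coin = ("H", "T")
outcomes = list(product(die, coin))   # 12 equally likely (roll, toss) pairs

def prob(event):
    """Fraction of outcomes for which the predicate `event` is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_x = prob(lambda o: o[0] == 1)                     # rolling a one
p_y = prob(lambda o: o[1] == "H")                   # getting heads
p_xy = prob(lambda o: o[0] == 1 and o[1] == "H")    # both at once

print(p_x, p_y, p_xy, p_xy == p_x * p_y)
```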
What is an example of dependent variables?
the height X and weight Y of a person
How can you use the complement of A to find P(A)?
subtract the probability of the complement from one: P(A) = 1 − P(not A)
How do you find the probability of A or B?
sum the two probabilities and subtract the probability of their intersection: P(A or B) = P(A) + P(B) − P(A and B)
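Both rules can be verified on a single roll of a fair six-sided die. The events A (an even roll) and B (a roll greater than three) are chosen here only for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))
A = {2, 4, 6}   # even roll
B = {4, 5, 6}   # roll greater than three

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(A) + prob(B) - prob(A & B)

# P(A) = 1 - P(not A)
p_a_from_complement = 1 - prob(sample_space - A)

print(p_union, p_a_from_complement)
```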
What is the expectation of a random variable?
the probability-weighted average of its values: E[X] = x1·p1 + … + xJ·pJ
How do you scale an expectation by a constant?
multiply the calculated expectation by the constant: E[cX] = c·E[X]
Can you sum expectations?
yes: E[X + Y] = E[X] + E[Y], whether or not X and Y are independent
What is the variance of a random variable?
the mean of the square minus the square of the mean: Var(X) = E[X²] − (E[X])²
How do you scale a variance?
multiply by the square of the constant: Var(cX) = c²·Var(X)
Can you sum variances?
yes, if the random variables are independent: Var(X + Y) = Var(X) + Var(Y)
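The expectation and variance rules can be checked numerically on a fair six-sided die. This sketch uses exact fractions; the scaling constant 3 and the two-dice sum are illustration choices, and the two dice are independent, which is what allows their variances to be added.

```python
from fractions import Fraction

def expectation(dist):
    """E[X] for a distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) as mean of squares minus square of the mean."""
    mean_of_squares = sum(x * x * p for x, p in dist.items())
    return mean_of_squares - expectation(dist) ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}

# Scaling by a constant c: E[cX] = c E[X], Var(cX) = c^2 Var(X)
c = 3
scaled = {c * x: p for x, p in die.items()}

# Sum of two independent dice: joint probability is the product px * py
two_dice = {}
for x, px in die.items():
    for y, py in die.items():
        two_dice[x + y] = two_dice.get(x + y, 0) + px * py

print(expectation(die), variance(die))
print(expectation(scaled), variance(scaled))
print(expectation(two_dice), variance(two_dice))
```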
Finding the expected number of a given nucleotide in a sequence uses an auxiliary variable Xi, which is defined how?
If N is the random variable representing the number of As in a sequence then…
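The answers to the two cards above are not given here. A common construction, offered only as an assumption about what the lecture intends, is to make Xi an indicator that is 1 when position i holds the nucleotide of interest and 0 otherwise. Then N is the sum of the Xi, and by linearity of expectation the expected count is n times the probability of that nucleotide. The example sequence and pA = 0.25 are illustration values.

```python
def indicators(sequence, base="A"):
    """Assumed auxiliary variables: X_i = 1 if position i is `base`, else 0."""
    return [1 if b == base else 0 for b in sequence]

def expected_count(n, p_base):
    """E[N] = sum of E[X_i] = n * p_base, since each E[X_i] = p_base."""
    return n * p_base

seq = "ATGACCAGTA"
x = indicators(seq)              # one indicator per position
n_as = sum(x)                    # observed N: the number of As
expected = expected_count(len(seq), 0.25)

print(x, n_as, expected)
```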
Consider the example below when calculating deviation from expectation…
A genome is thought to have a 50% GC composition and a sequence of 100 bases is observed with 65 nucleotides that are either G or C. What is the probability of such a deviation?
- Need to first find the expected number of GC bases in the sequence: 0.5 × 100 = 50 GC nucleotides
- Use the χ² (chi-square) test
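A sketch of this calculation using only the standard library. With two categories (GC and AT) there is one degree of freedom, and for one degree of freedom the chi-square tail probability can be written with the complementary error function. The function names are illustrative.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

observed = [65, 35]   # GC bases, AT bases in the 100-base sequence
expected = [50, 50]   # counts expected under 50% GC

stat = chi_square_statistic(observed, expected)   # 225/50 + 225/50 = 9.0
p_value = chi_square_p_value_df1(stat)            # about 0.0027

print(stat, p_value)
```

A p-value of roughly 0.0027 means a deviation at least this large would be rare if the sequence really came from a genome with 50% GC.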
Consider the example below when comparing nucleotide composition…
Given a genome of length n with counts Sa, Sc, Sg, St of each nucleotide and a sample sequence of 100 nucleotides, what is the probability that the sample's nucleotide counts deviate from what would be expected if the sample sequence was derived from the genome?
- Find the probability of each nucleotide from the genome counts (for example, Sa/n for A) and multiply by 100 to get the expected count of each nucleotide in the read
- use the chi-square test with multiple degrees of freedom (df)
What is the formula for the df of a chi-square test on a table with multiple rows and columns?
(#rows − 1)(#cols − 1)
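A sketch of the four-nucleotide comparison, again with the standard library only. It treats the genome proportions as known and runs a goodness-of-fit test, so df = 4 − 1 = 3, which is the same number the formula above gives for a 2 × 4 table. The genome and sample counts are invented for illustration, and the closed-form tail probability used here holds only for 3 degrees of freedom.

```python
import math

def expected_counts(genome_counts, sample_size):
    """Expected count of each base in the sample: sample_size * (S_base / n)."""
    total = sum(genome_counts.values())
    return {b: sample_size * c / total for b, c in genome_counts.items()}

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over the four bases."""
    return sum((observed[b] - expected[b]) ** 2 / expected[b] for b in expected)

def chi_square_p_value_df3(statistic):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(statistic / 2))
            + math.sqrt(2 * statistic / math.pi) * math.exp(-statistic / 2))

# Hypothetical counts, for illustration only.
genome_counts = {"A": 300_000, "C": 200_000, "G": 200_000, "T": 300_000}
sample_counts = {"A": 20, "C": 30, "G": 32, "T": 18}   # 100 bases in total

expected = expected_counts(genome_counts, sum(sample_counts.values()))
stat = chi_square_statistic(sample_counts, expected)
df = len(sample_counts) - 1
p_value = chi_square_p_value_df3(stat)

print(expected, stat, df, p_value)
```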