Lecture 3 - Probability and Stats for Sequence Analysis Flashcards
Why would you want to identify the GC composition of a human genome?
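GC composition is the fraction of G and C bases in the genome. GC-rich regions are often hypermethylated, and because G-C pairs form more hydrogen bonds than A-T pairs (three versus two), higher GC content raises the melting point of DNA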
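A minimal sketch of how the GC fraction could be computed from a sequence string. The function name `gc_fraction` and the example sequence are illustrative assumptions, not part of the lecture.

```python
def gc_fraction(sequence):
    """Return the fraction of bases in `sequence` that are G or C."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    # Count every position holding a G or a C, then divide by the length.
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq)


# Hypothetical toy sequence: 4 of its 6 bases are G or C.
print(gc_fraction("ATGCGC"))
```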
Given the GC composition of a list of possible source genomes and a sequence read, can you identify which is the most likely genome the read was sequenced from?
- need to encode each genome numerically, for example by its GC composition
- perform a statistical test
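One way the two steps above might look in practice is a likelihood comparison. This is only a sketch under a simplifying assumption: each base of the read is treated as an independent draw that is G or C with probability equal to the genome's GC fraction. The genome names and GC values below are hypothetical, and a likelihood comparison is just one possible choice of statistical test.

```python
import math


def log_likelihood(read, gc):
    """Log-likelihood of the read's GC/AT counts given a GC fraction in (0, 1)."""
    read = read.upper()
    n_gc = sum(base in "GC" for base in read)
    n_at = sum(base in "AT" for base in read)
    return n_gc * math.log(gc) + n_at * math.log(1.0 - gc)


def most_likely_genome(read, genome_gc):
    """Return the genome name whose GC fraction gives the read the highest likelihood."""
    return max(genome_gc, key=lambda name: log_likelihood(read, genome_gc[name]))


# Hypothetical candidate genomes and their GC fractions.
genome_gc = {"genome_low_gc": 0.30, "genome_mid_gc": 0.50, "genome_high_gc": 0.65}
print(most_likely_genome("GCGCGGCCAT", genome_gc))  # genome_high_gc
```

A short read carries little information, so in practice the likelihoods of the candidates should be compared, not just the winner reported.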
What are DNA-binding proteins?
-they are proteins that form noncovalent bonds with short stretches of DNA, and there are multiple binding sites within a genome
-the specific loci where DNA-binding proteins (DBPs) bind are called binding sites, and they lie near genes - the pattern that is bound is called a motif
-when a genome is initially sequenced, the locations of the binding sites are unknown
-they also form ionic bonds with the DNA
What does the LOGO binding motif show?
it shows the relative frequency of each nucleotide at each position of the DNA binding sites - every binding site has an A at the fourth and fifth positions, at the 12th position, and a G at the third position
-consider a model where the binding motif is a fixed number of bases k and each nucleotide of the motif occurs with 100% frequency
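Under the fixed-length, 100%-frequency model, finding candidate binding sites reduces to exact string matching. A minimal sketch, where the motif "TTGACA" and the sequence are hypothetical examples, not taken from the lecture:

```python
def find_motif_sites(sequence, motif):
    """Return the 0-based start positions where `motif` matches `sequence` exactly."""
    seq, mot = sequence.upper(), motif.upper()
    k = len(mot)
    # Slide a window of width k along the sequence; overlapping matches are kept.
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == mot]


print(find_motif_sites("TTGACAGGTTGACA", "TTGACA"))  # [0, 8]
```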
What is the probability a certain chromosomal sequence is over represented in the genome?
need to consider a single chromosomal sequence on the forward strand, read from 5’ to 3’
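To judge over-representation, the observed count of a sequence can be compared with the count expected by chance. The sketch below assumes that bases are generated independently with fixed probabilities, so a k-mer's probability is the product of its base probabilities and the forward strand of a length-n sequence offers n - k + 1 start positions. The function names, the k-mer, and the sequence length are illustrative assumptions.

```python
import math


def kmer_probability(kmer, base_probs):
    """Probability of seeing `kmer` at one fixed position, assuming independent bases."""
    return math.prod(base_probs[base] for base in kmer.upper())


def expected_count(kmer, seq_length, base_probs):
    """Expected number of occurrences on one strand of a sequence of `seq_length` bases."""
    positions = max(seq_length - len(kmer) + 1, 0)
    return positions * kmer_probability(kmer, base_probs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical 6-mer in a hypothetical 1,000,000-base sequence.
print(expected_count("TTGACA", 1_000_000, uniform))
```

An observed count far above this expectation would suggest the sequence is over-represented relative to chance.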
What is the null model for DNA?
a random nucleotide generator whose output at each position does not depend on the bases generated before or after it, meaning the positions are independent
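A minimal sketch of such a generator, using Python's standard `random.choices`. The function name `random_sequence`, the `seed` parameter, and the probabilities shown are assumptions made for illustration.

```python
import random


def random_sequence(length, probs, seed=None):
    """Draw `length` bases independently from the distribution in `probs`."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    bases = list(probs)
    weights = [probs[base] for base in bases]
    # Each position is drawn without reference to any other position.
    return "".join(rng.choices(bases, weights=weights, k=length))


probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(random_sequence(20, probs, seed=0))
```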
When generating random nucleotides what is each base at position i considered to be?
the random variable Li
What is the domain of the random variable Li, or the values that it can take on?
Li ∈ {A, C, G, T}
What are discrete random variables?
they are random variables that can take on a finite (or countably infinite) set of values
If a discrete random variable Li can take on J separate values with probabilities p1, …, pJ, then the sum of all the probabilities is?
1
What forms the probability distribution of a random variable?
the probabilities p1, …, pJ
For the random nucleotide generator, the only constraint we place on the nucleotide probabilities is what?
pA + pC + pG + pT = 1
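the four probabilities are not constrained to be equal unless specified, as in the uniform i.i.d. model
What is i.i.d.?
independent and identically distributed: every position Li is drawn independently from the same distribution. In the uniform case, pA = pC = pG = pT = 0.25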
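The sum-to-one constraint is easy to check in code. A small sketch, where `is_valid_distribution` and the non-uniform example values are hypothetical; the uniform case is just one distribution that satisfies the constraint.

```python
import math


def is_valid_distribution(probs):
    """True if all probabilities are non-negative and they sum to 1 (within rounding)."""
    values = list(probs.values())
    return all(p >= 0 for p in values) and math.isclose(sum(values), 1.0)


uniform = dict.fromkeys("ACGT", 0.25)                # pA = pC = pG = pT = 0.25
gc_rich = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}   # unequal, but still sums to 1

print(is_valid_distribution(uniform), is_valid_distribution(gc_rich))  # True True
```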
When are two random variables independent?
when the probability of both occurring together is the product of their individual probabilities, P(X and Y) = P(X) * P(Y). For example, if X is rolling a one on a six-sided die and Y is getting heads in a coin toss, the probability of both occurring is 1/6 * 1/2 = 1/12
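The die-and-coin example can be checked by enumerating all outcomes. This sketch assumes a fair die and a fair coin, so that the 12 (die, coin) pairs are equally likely; exact fractions avoid floating-point rounding.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (die face, coin side) outcomes.
outcomes = list(product(range(1, 7), "HT"))
p_each = Fraction(1, len(outcomes))

p_x = sum(p_each for die, coin in outcomes if die == 1)       # P(X): rolling a one
p_y = sum(p_each for die, coin in outcomes if coin == "H")    # P(Y): getting heads
p_xy = sum(p_each for die, coin in outcomes if die == 1 and coin == "H")

print(p_xy)                # 1/12
print(p_xy == p_x * p_y)   # True
```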
What is an example of dependent random variables?
the height X and weight Y of a person