Lecture 3 - Probability and Statistics for Sequence Analysis Flashcards
What are some motivating examples for statistics for DNA?
-if you want to figure out the GC composition of a genome, i.e. the fraction of its bases that are cytosine or guanine
-due to nonrandom variation some genomes have a low GC composition
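GC composition as defined above can be computed with a few lines of code. The following is a minimal sketch; the function name `gc_fraction` and the example sequence are made up for illustration.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    # Count every position holding a guanine or a cytosine.
    gc_count = sum(1 for base in seq if base in "GC")
    return gc_count / len(seq)


# Toy sequence (not real data): 4 of its 8 bases are G or C.
print(gc_fraction("ATGCGCTA"))
```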
Why is GC composition analysis important?
-it affects methylation, since GCs tend to be methylated, and it affects the melting point of DNA, since a higher GC composition raises it
Given the GC compositions of a list of possible source genomes and a sequence read can you identify which is the most likely genome the read was sequenced from?
- turn the genome into a numerical encoding that can be processed computationally
- perform a statistical test (need to create a null model)
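One possible way to rank candidate genomes is to compare the likelihood of the read under each genome's GC composition. The sketch below is not necessarily the statistical test used in the lecture. It assumes each base is drawn independently, with G and C each having probability gc/2 and A and T each having probability (1 - gc)/2. The genome names and GC values are hypothetical toy values.

```python
import math


def read_log_likelihood(read: str, gc: float) -> float:
    """Log-likelihood of a read under an independent-base model with GC fraction gc."""
    if not 0.0 < gc < 1.0:
        raise ValueError("gc must be strictly between 0 and 1")
    read = read.upper()
    n_gc = sum(1 for base in read if base in "GC")
    n_at = sum(1 for base in read if base in "AT")
    # Assumed model: P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2.
    return n_gc * math.log(gc / 2) + n_at * math.log((1 - gc) / 2)


def most_likely_genome(read: str, gc_by_genome: dict) -> str:
    """Return the name of the genome whose GC fraction gives the read the highest likelihood."""
    return max(gc_by_genome, key=lambda name: read_log_likelihood(read, gc_by_genome[name]))


# Hypothetical genomes and GC fractions, for illustration only.
toy_genomes = {"genome_low_gc": 0.30, "genome_mid_gc": 0.50, "genome_high_gc": 0.70}
print(most_likely_genome("GCGCGGCCAT", toy_genomes))
```

Working in log space avoids numerical underflow when reads are long, since the raw likelihood is a product of many small numbers.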
What is a DNA binding protein?
-they are proteins that form noncovalent bonds and grab onto short 8-16bp stretches of DNA; cause repression or expression of a gene
-has multiple binding sites in a genome
What are the specific loci where DNA binding proteins (DBPs) bind called, and where are they often found?
-they are called binding sites and are often found near genes; the pattern that is bound is called a motif
Are the locations of binding sites known when a genome is sequenced?
-no they are unknown
What are binding motifs and what is an example of one?
-the patterns that the DNA binding proteins typically bind to
-here nearly every binding site has an A at the fourth and fifth positions and a T at the 12th position and the third position is typically a G
-here we will consider a model where the binding motif is a fixed number of bases (k), and each position of the motif always holds the same nucleotide (100% frequency)
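Under this simplified model (a fixed-length motif where every position is fully determined), candidate binding sites are just the exact occurrences of the motif. A minimal sketch follows; the function name `find_motif_sites` and the toy genome and motif are made up for illustration.

```python
from typing import List


def find_motif_sites(genome: str, motif: str) -> List[int]:
    """Return the 0-based start positions of every exact occurrence of motif in genome."""
    genome = genome.upper()
    motif = motif.upper()
    k = len(motif)
    if k == 0:
        raise ValueError("motif must be non-empty")
    # Slide a window of length k over the genome; overlapping matches are kept.
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] == motif]


# Toy example (not real data): the 3-mer ACG occurs three times.
print(find_motif_sites("ACGTTACGTACG", "ACG"))
```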
What is a random nucleotide generator?
-a null model for DNA that emits one base at a time; each output is an independent event, so the previous output does not dictate the next output
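A random nucleotide generator of this kind can be sketched with Python's standard `random` module. The function name, the `probs` and `seed` parameters, and the default uniform probabilities are assumptions made for this example.

```python
import random


def random_nucleotides(n: int, probs=None, seed=None) -> str:
    """Generate n bases, each drawn independently from the same distribution."""
    if probs is None:
        # Assumed default: all four bases equally likely.
        probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    rng = random.Random(seed)
    bases = list(probs)
    weights = [probs[base] for base in bases]
    # rng.choices samples with replacement, so each draw ignores the previous ones.
    return "".join(rng.choices(bases, weights=weights, k=n))


print(random_nucleotides(20, seed=0))
```

Passing a seed makes the output reproducible, which is convenient when comparing real sequences against many simulated null sequences.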
What random variable is used to denote each base at a position i?
Li
What is a random variable?
-a variable where the value is filled by some random generating function
What is the domain of a random variable?
Li ∈ {A,C,G,T}
-the values a random variable can take on
What is a discrete random variable?
-a variable that takes on a finite (or, more generally, countable) set of values
If a discrete random variable like Li can take on J separate values with probabilities p1,…,pJ, then what is the value of the sum of its probabilities and what are the rough values of the individual probabilities?
-each value has a nonzero probability (between 0 and 1) and the sum of the probabilities is 1
What forms the probability distribution of a random variable?
-the values p1,…,pJ
-the probability that the random variable X takes on the value j, listed for every possible value j
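A probability distribution over the four bases can be estimated from an observed sequence by counting. The sketch below assumes the sequence contains only A, C, G and T (any other character, such as N, would make the values sum to less than 1); the function name is made up for illustration.

```python
from collections import Counter


def empirical_distribution(seq: str) -> dict:
    """Estimate P(base) for each of A, C, G, T from the base frequencies in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    # One probability per value in the domain {A, C, G, T}.
    return {base: counts.get(base, 0) / len(seq) for base in "ACGT"}


dist = empirical_distribution("AACG")
print(dist)
print(sum(dist.values()))
```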
What does it mean for the random nucleotide generator if the probabilities are independent and identically distributed?
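-that every position is drawn independently of the others and from the same probability distribution; in the simplest case pA=pT=pC=pG=0.25
-however the probabilities are not always constrained to be equal, and the problem must state that the positions are i.i.d. (independent and identically distributed)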
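Because the positions are independent and share one distribution, the probability of a whole sequence under this null model is the product of its per-base probabilities. A minimal sketch, with a hypothetical function name and toy probability tables:

```python
import math


def sequence_probability(seq: str, probs: dict) -> float:
    """Probability of seq when each base is drawn i.i.d. from probs."""
    # Independence lets us multiply the per-position probabilities.
    return math.prod(probs[base] for base in seq.upper())


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical non-uniform distribution; it is still i.i.d. because every position uses it.
at_rich = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

print(sequence_probability("ACGT", uniform))
print(sequence_probability("AT", at_rich))
```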