Lecture 4 - k-mers and probability Flashcards
What is a k-mer?
a sequence of k bases, where k is a fixed parameter; biologically relevant examples include transcription factor (TF) binding sites, which are sequences of length k
-one of the most common units in computational sequence analysis
-when stored in binary, a single 32-bit integer holds up to a 16-base k-mer (2 bits per base)
-k-mers are used in many types of analysis, including genome assembly, alignment-free comparison, and genotyping
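The 32-bit storage point can be illustrated with a minimal sketch. The 2-bit code chosen here (A=00, C=01, G=10, T=11) and the function names `encode_kmer` and `decode_kmer` are assumptions for illustration; any fixed 2-bit assignment would work the same way.

```python
# Hypothetical 2-bit code for each base (an assumption, not from the lecture).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_kmer(kmer):
    """Pack a k-mer (k <= 16) into an integer that fits in 32 bits."""
    if len(kmer) > 16:
        raise ValueError("a 32-bit integer holds at most 16 bases")
    value = 0
    for base in kmer:
        # Shift the bases seen so far left by 2 bits and append the new base.
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_kmer(value, k):
    """Unpack k bases from an integer produced by encode_kmer."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 0b11])  # read the last base
        value >>= 2
    return "".join(reversed(bases))
```

Sixteen bases use exactly 16 x 2 = 32 bits, which is why 16 is the upper limit for a single 32-bit integer.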
The probability of observing a specific k-mer generated by the random nucleotide generator is…
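If the random nucleotide generator draws each base independently with fixed probabilities (an i.i.d. model, which is an assumption about the generator meant here), the probability of a specific k-mer is the product of the probabilities of its bases. A minimal sketch, with `kmer_probability` and the probability values as hypothetical names and numbers:

```python
import math


def kmer_probability(kmer, base_probs):
    """Probability of a specific k-mer when bases are drawn independently."""
    return math.prod(base_probs[base] for base in kmer)


# Illustrative base probabilities (assumed values, not from the lecture).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

p_uniform = kmer_probability("GACT", uniform)  # 0.25 ** 4
p_skewed = kmer_probability("GACT", skewed)    # 0.2 * 0.3 * 0.2 * 0.3
```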
The probability of observing two k-mers in a sequence is affected by whether or not the k-mers overlap, and is an instance of conditional probability?
-a k-mer at one position and the k-mer at the next position overlap in k-1 bases
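The k-1 overlap is easy to see by listing every k-mer of a short sequence. This is a minimal sketch; the helper name `kmers` is an assumption.

```python
def kmers(seq, k):
    """Return all k-mers of seq in order of starting position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


windows = kmers("GACTGG", 4)  # ["GACT", "ACTG", "CTGG"]

# Adjacent k-mers share k-1 bases: the suffix of one is the prefix of the next.
adjacent_overlap = all(a[1:] == b[:-1] for a, b in zip(windows, windows[1:]))
```

A sequence of length n contains n - k + 1 k-mers, one per starting position.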
How is conditional probability defined (i.e., the probability of A given B)?
Given a sequence L1,…,Ln, where each Li is a nucleotide drawn with probabilities Pa, Pc, Pg, Pt, and the first k-mer (k=4) is GACT, what is the probability that the k-mer starting at the third base is CTGG?
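One way to work this out uses the standard definition P(A | B) = P(A and B) / P(B), assuming the bases are independent. The first k-mer GACT fixes positions 1 to 4, and the k-mer starting at position 3 covers positions 3 to 6. Positions 3 and 4 are already C and T, which matches the start of CTGG, so only positions 5 and 6 remain free and the conditional probability is Pg * Pg. The sketch below generalizes this; the function name and the numeric probabilities are hypothetical.

```python
import math


def prob_second_given_first(first, second, offset, base_probs):
    """P(second k-mer starts `offset` bases after the first | first k-mer seen).

    Assumes independent bases with the given probabilities.
    """
    k = len(first)
    overlap = max(0, k - offset)
    # Bases shared by both k-mers are already fixed by the first k-mer.
    if first[offset:] != second[:overlap]:
        return 0.0
    # Only the bases beyond the overlap are still random.
    return math.prod(base_probs[base] for base in second[overlap:])


# Illustrative probabilities (assumed values, not from the lecture).
probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
answer = prob_second_given_first("GACT", "CTGG", 2, probs)  # Pg * Pg
```

If the overlapping bases did not match (for example, asking for AAGG at the third base), the conditional probability would be 0.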
There are biological examples where the probability of observing (generating) a particular nucleotide depends on what the nucleotide before it was
-in the matrix of transition probabilities, each row must sum to one, but the columns do not need to sum to one
What is a Markov chain?
-the next state depends on the state you are currently in
-the sum of every row is 1
-when generating a sequence the probabilities used for the next base correspond to the row indexed by the current base
-this type of sequence is called a Markov chain
-each letter is a dependent random variable
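The generation procedure described above can be sketched as follows. The transition probabilities are made-up values chosen only so that every row sums to 1; `TRANSITIONS` and `generate_markov_sequence` are hypothetical names.

```python
import random

BASES = "ACGT"

# Hypothetical transition matrix: TRANSITIONS[current][next].
# Each row sums to 1; the columns do not have to (the A column sums to 1.05).
TRANSITIONS = {
    "A": {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
    "C": {"A": 0.10, "C": 0.40, "G": 0.10, "T": 0.40},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.30, "C": 0.30, "G": 0.30, "T": 0.10},
}


def generate_markov_sequence(length, start, rng):
    """Generate `length` bases (length >= 1), beginning with `start`."""
    seq = [start]
    while len(seq) < length:
        # The row for the current base gives the probabilities of the next base.
        row = TRANSITIONS[seq[-1]]
        weights = [row[base] for base in BASES]
        seq.append(rng.choices(BASES, weights=weights)[0])
    return "".join(seq)


rng = random.Random(0)  # fixed seed so the run is repeatable
sequence = generate_markov_sequence(50, "A", rng)
```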
Given a genome of size n how can we test if the bases are generated randomly (i.i.d.) or through a Markov process?
-calculate the number of times each dinucleotide is expected to appear in the genome if the bases are i.i.d.
-calculate the frequency of each nucleotide and use this to estimate the probability of each base
-calculate the probability of each dinucleotide
-calculate the expected number of times each dinucleotide should appear in the genome
-count the number of times each dinucleotide appears in the genome
-compare the observed count of each dinucleotide to the expected count using the chi-squared test (how many df?)
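The steps above can be sketched as follows. This is a minimal illustration that assumes the genome contains only A, C, G, and T; the function name is hypothetical. It computes only the chi-squared statistic, the sum of (observed - expected)^2 / expected over the 16 dinucleotides, and leaves the degrees of freedom and the p-value lookup to the reader.

```python
from collections import Counter
from itertools import product


def dinucleotide_chi_squared(genome):
    """Return (observed, expected, chi2) for dinucleotides under an i.i.d. model."""
    n = len(genome)
    # Estimate each base probability from its frequency in the genome.
    base_counts = Counter(genome)
    base_probs = {base: base_counts[base] / n for base in "ACGT"}

    # Count every dinucleotide; there are n - 1 overlapping positions.
    observed = Counter(genome[i:i + 2] for i in range(n - 1))

    expected = {}
    chi2 = 0.0
    for a, b in product("ACGT", repeat=2):
        # Under independence, P(ab) = P(a) * P(b).
        exp = (n - 1) * base_probs[a] * base_probs[b]
        expected[a + b] = exp
        if exp > 0:
            chi2 += (observed[a + b] - exp) ** 2 / exp
    return observed, expected, chi2


# A strongly ordered toy genome: only 4 of the 16 dinucleotides ever occur.
observed, expected, chi2 = dinucleotide_chi_squared("ACGT" * 25)
```

A large statistic, as for this toy genome, suggests the bases are not independent; a value near what the chi-squared distribution predicts is consistent with i.i.d. generation.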
How do you find the distribution of counts of an outcome?
-we want to know how many of the random events have outcome 0 and how many have outcome 1
What is the binomial distribution?
-the distribution of the total number of 1s in n Bernoulli trials, each with probability p of a 1, defined by $P(x) = \binom{n}{x} p^x (1-p)^{n-x}$
-the expected value of a binomial random variable with parameters n, p is np
-its variance is np(1-p)
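The mean and variance can be checked empirically by simulating Bernoulli trials. This is a minimal sketch; the values n = 20 and p = 0.3, the number of repetitions, and the name `binomial_sample` are assumptions chosen for illustration.

```python
import random
from statistics import mean, pvariance


def binomial_sample(n, p, rng):
    """Count the 1s in n Bernoulli trials, each a 1 with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


rng = random.Random(0)  # fixed seed so the run is repeatable
n, p = 20, 0.3
samples = [binomial_sample(n, p, rng) for _ in range(20000)]

sample_mean = mean(samples)           # should be close to n * p = 6
sample_variance = pvariance(samples)  # should be close to n * p * (1 - p) = 4.2
```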
What two parameters is the binomial distribution defined by?
the number of trials n and the probability of success of each trial p
A DNA sequence R of 100 nucleotides is generated with pA = 0.3 and p(not A) = 0.7. What is the probability of v As in the sequence?
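Treating each position as a Bernoulli trial (1 if the base is A, 0 otherwise), the number of As follows a binomial distribution with n = 100 and p = 0.3, so the probability of exactly v As is C(100, v) * 0.3^v * 0.7^(100 - v). A minimal sketch, with `binomial_pmf` as a hypothetical helper name:

```python
from math import comb


def binomial_pmf(v, n, p):
    """Probability of exactly v successes in n independent trials."""
    return comb(n, v) * p ** v * (1 - p) ** (n - v)


n, p = 100, 0.3
pmf = [binomial_pmf(v, n, p) for v in range(n + 1)]

total = sum(pmf)                                  # probabilities sum to 1
expected_as = sum(v * q for v, q in enumerate(pmf))  # equals n * p = 30
most_likely = max(range(n + 1), key=lambda v: pmf[v])
```

The most likely number of As sits at the expected value np = 30, and values far from 30 become rapidly less probable.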
How to test if a sequence has a different composition than what was expected?