Lecture 4 - k-mers and probability Flashcards
What is a k-mer?
a sequence of k bases, where k is a fixed parameter; biologically relevant examples include TF (transcription factor) binding sites, which are sequences of length k
-one of the most common units in computational sequence analysis
-when stored in binary (2 bits per base), a single 32-bit integer holds up to a 16-base k-mer
-k-mers are used in many types of analysis including genome assembly, alignment free comparison, and genotyping
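A minimal sketch of the binary storage idea, assuming the mapping A=0, C=1, G=2, T=3 (the mapping and the function names are illustrative choices, not from the lecture):

```python
# Assumed 2-bit code per base; the specific mapping is an illustrative choice.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))

code = encode_kmer("GACT")
```

With 2 bits per base, 16 bases use exactly 32 bits, which is why a 16-mer is the longest k-mer that fits in one 32-bit integer.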
The probability of observing a specific k-mer generated by the random nucleotide generator is…
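A small sketch, assuming each base is drawn independently (i.i.d.) so that the probability of a k-mer is the product of its base probabilities. The function name and the uniform probabilities are illustrative assumptions:

```python
from math import prod

def kmer_probability(kmer, base_probs):
    """Probability of one specific k-mer when bases are independent draws."""
    return prod(base_probs[base] for base in kmer)

# Illustrative assumption: all four bases equally likely.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p = kmer_probability("GACT", uniform)  # equals 0.25 ** 4
```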
The probability of observing two k-mers in a sequence is affected by whether or not the k-mers overlap, and is an instance of conditional probability?
-if you have a k-mer at one position and an adjacent k-mer at the next position, they overlap by k-1 bases
How is the conditional probability defined (i.e. like the probability of A given B)?
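For reference, the standard textbook definition, written with generic events A and B (the notation is an assumption, since the card's own answer is not shown):

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0
```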
Given a sequence L1,…,Ln, each a nucleotide with probabilities pA, pC, pG, pT, and the first k-mer (k=4) is GACT, what is the probability that the k-mer starting at the third base is CTGG?
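One way to reason about this, sketched under the assumption of independent bases: the k-mer starting at the third base shares its first two positions with the last two bases of GACT (CT), so those are already fixed, and only the two new bases (G, G) contribute probability. The function name and the numeric base probabilities are illustrative assumptions:

```python
def prob_second_kmer_given_first(first, second, offset, base_probs):
    """P(`second` starts `offset` bases after `first` | `first` was observed), i.i.d. bases."""
    overlap = max(len(first) - offset, 0)
    # The overlapping bases are already determined by the first k-mer.
    if first[offset:] != second[:overlap]:
        return 0.0
    p = 1.0
    for base in second[overlap:]:  # only the non-overlapping bases are still random
        p *= base_probs[base]
    return p

# Illustrative probabilities (an assumption for this example).
probs = {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2}
# Third base is 2 positions after the first, so offset = 2.
answer = prob_second_kmer_given_first("GACT", "CTGG", 2, probs)  # pG * pG
```

If the overlapping bases did not match (for example, asking for AAGG instead of CTGG), the conditional probability would be 0.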
There are biological examples where the probability of observing (generating) a particular nucleotide depends on what the nucleotide before it was
-in the matrix of these probabilities (current base in rows, next base in columns), the rows must sum to one but the columns do not need to sum to one
What is a Markov chain?
-the next state depends only on the state you are currently in
-the sum of every row is 1
-when generating a sequence the probabilities used for the next base correspond to the row indexed by the current base
-this type of sequence is called a Markov Chain
-each letter is a dependent random variable
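A minimal sketch of generating a sequence this way. The transition probabilities below are made up for illustration (each row sums to 1, the columns do not), and the function name is hypothetical:

```python
import random

BASES = "ACGT"

# Hypothetical transition matrix: TRANSITIONS[current][next]. Each row sums to 1.
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.1, "T": 0.4},
    "G": {"A": 0.3, "C": 0.3, "G": 0.2, "T": 0.2},
    "T": {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3},
}

def generate_markov_sequence(length, first_base, transitions, rng):
    """Generate `length` (>= 1) bases; each base is drawn from the row of the previous base."""
    seq = [first_base]
    while len(seq) < length:
        row = transitions[seq[-1]]  # row indexed by the current base
        next_base = rng.choices(BASES, weights=[row[b] for b in BASES])[0]
        seq.append(next_base)
    return "".join(seq)

rng = random.Random(0)  # fixed seed so the example is repeatable
sequence = generate_markov_sequence(50, "A", TRANSITIONS, rng)
```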
Given a genome of size n how can we test if the bases are generated randomly (i.i.d.) or through a Markov process?
-compare the observed count of each dinucleotide in the genome with the count expected under the i.i.d. model:
-calculate the frequency of each nucleotide and use this to estimate the probability of each base
-calculate the probability of each dinucleotide
-calculate the expected number of times each dinucleotide should appear in the genome
-count the number of times each dinucleotide appears in the genome
-compare the observed count of each dinucleotide to the expected count using the chi-squared test (how many df?)
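The steps above can be sketched as follows. This computes only the chi-squared statistic; turning it into a p-value needs the chi-squared distribution with the appropriate degrees of freedom, which the card leaves as a question. The function name and the toy genome are illustrative assumptions:

```python
from collections import Counter
from itertools import product

def dinucleotide_chi_squared(genome):
    """Chi-squared statistic of observed vs. expected (i.i.d.) dinucleotide counts."""
    n = len(genome)
    # Estimate each base probability from its frequency in the genome.
    base_freq = {base: count / n for base, count in Counter(genome).items()}
    # Count overlapping dinucleotides: positions 0..n-2.
    observed = Counter(genome[i:i + 2] for i in range(n - 1))
    n_pairs = n - 1
    chi2 = 0.0
    for x, y in product(base_freq, repeat=2):
        expected = n_pairs * base_freq[x] * base_freq[y]  # i.i.d. expectation
        chi2 += (observed[x + y] - expected) ** 2 / expected
    return chi2

# Toy input: a strongly non-random sequence, so the statistic is large.
chi2 = dinucleotide_chi_squared("ACGT" * 50)
```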
How do you find the distribution of counts of an outcome?
-treat each trial as an event with outcome 0 or 1; we want to know how many events are 0 and how many are 1
What is the binomial distribution?
-the distribution of the total number of 1s in n Bernoulli trials, each with probability p, defined by P(x) = C(n,x) p^x (1-p)^(n-x), where C(n,x) is the binomial coefficient
-the expected value of a binomial random variable with parameters n,p is np
-its variance is np(1-p)
What two parameters is the binomial distribution defined by?
the number of trials n and the probability of success of each trial p
A DNA sequence R of 100 nucleotides is generated with pA = 0.3 and p(not A) = 0.7. What is the probability of v As in the sequence?
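A short sketch of this calculation, assuming each position is an independent Bernoulli trial (A or not A), so the number of As follows a binomial distribution with n = 100 and p = 0.3. The function name and the choice v = 30 are illustrative:

```python
from math import comb

def binomial_pmf(v, n, p):
    """Probability of exactly v successes in n independent trials with success probability p."""
    return comb(n, v) * p**v * (1 - p)**(n - v)

n, p_a = 100, 0.3
prob_30 = binomial_pmf(30, n, p_a)  # probability of exactly 30 As
total = sum(binomial_pmf(v, n, p_a) for v in range(n + 1))     # should be 1
mean = sum(v * binomial_pmf(v, n, p_a) for v in range(n + 1))  # should be np = 30
```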
How do you test whether a sequence has a different composition than expected?
Probability distribution for counts with low probability and a high number of trials: given a genome of length n and a k-mer of probability p, we can estimate the number of times the k-mer appears as np and the distribution as … (look at image)… How can we calculate this distribution when the probability is low and the number of trials is high?
The Poisson distribution
What is the Poisson distribution defined by?
A parameter lambda
where lambda = np
-this is the mean or expected value
A genome G is composed of N = 3,000,000 bases that may be modeled as independent random variables (e.g. Li has values A, C, G, T with pA, pC, pG, pT = 0.1, 0.3, 0.4, 0.2). If the k-mer ACCGACGGT is observed 30 times, what is the probability this k-mer would appear at least this many times in the genome?
1. Calculate the probability of the k-mer by multiplying the base probabilities: 0.1 × 0.3 × 0.3 × 0.4 × 0.1 × 0.3 × 0.4 × 0.4 × 0.2 = 3.456×10^-6
2. Calculate lambda = np, the expected number of times the k-mer appears in the genome G: np = 3,000,000 × 3.456×10^-6 = 10.368
3. Now calculate the Poisson probability that there are 30 or more occurrences of the k-mer in G
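A sketch of this calculation using the values stated in the question (N = 3,000,000 bases, the given base probabilities, and an observed count of 30). It approximates the number of possible k-mer start positions by N, and sums the Poisson probabilities from 30 upward rather than subtracting from 1, to avoid losing precision on a very small tail. The function names are illustrative:

```python
from math import exp, lgamma, log, prod

BASE_PROBS = {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2}

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam, computed in log space."""
    return exp(-lam + k * log(lam) - lgamma(k + 1))

def poisson_upper_tail(k_min, lam, extra_terms=500):
    """Approximate P(X >= k_min) by summing the pmf over k_min .. k_min + extra_terms - 1."""
    return sum(poisson_pmf(k, lam) for k in range(k_min, k_min + extra_terms))

p = prod(BASE_PROBS[base] for base in "ACCGACGGT")  # probability of the k-mer
lam = 3_000_000 * p                                 # expected count, lambda = np
tail = poisson_upper_tail(30, lam)                  # P(at least 30 occurrences)
```

Under these assumptions the tail probability is very small, so 30 occurrences would be far more than the i.i.d. model predicts.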
The binomial and Poisson distributions are examples of distributions of what type of variable?
-discrete random variables
What is a continuous random variable?
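a variable that can take any value in a continuous range; its probabilities are described by a density, so each single value carries only an infinitesimal probability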
What describes the probability of a continuous random variable and what is it denoted by and what does it mean?
-probability density function or pdf
-denoted by f(x)
-the probability that random variable X lies in the interval [a,b] is given by the integral of the pdf over that interval
For any continuous random variable X with a pdf f(x), what should the total area under the curve be equal to?
One
What is the mean or expected value of a continuous random variable X with a pdf f(x)?
What is the variance of a continuous random variable X defined as?
the expected value of the squared deviation from the mean
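The statements on these cards can be written out as the standard definitions below, for a continuous random variable X with pdf f(x) (the symbol \mu for the mean is an assumed notation):

```latex
\begin{align*}
P(a \le X \le b) &= \int_a^b f(x)\,dx, &
\int_{-\infty}^{\infty} f(x)\,dx &= 1, \\
E[X] = \mu &= \int_{-\infty}^{\infty} x\,f(x)\,dx, &
\operatorname{Var}(X) = E\big[(X-\mu)^2\big] &= \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx.
\end{align*}
```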
What is the probability of a single point in a uniform distribution?
dx
What is the probability density function or pdf of the standard normal distribution?
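The standard normal pdf is f(x) = exp(-x^2/2) / sqrt(2*pi). The sketch below checks numerically that it behaves like a pdf (total area 1, mean 0, variance 1) using a simple midpoint-rule integration over [-10, 10]. The integration routine, its step count, and the interval are illustrative choices:

```python
from math import exp, pi, sqrt

def standard_normal_pdf(x):
    """pdf of the standard normal distribution (mean 0, variance 1)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return width * sum(func(a + (i + 0.5) * width) for i in range(steps))

area = integrate(standard_normal_pdf, -10, 10)                   # close to 1
mean = integrate(lambda x: x * standard_normal_pdf(x), -10, 10)  # close to 0
variance = integrate(lambda x: (x - mean) ** 2 * standard_normal_pdf(x), -10, 10)  # close to 1
within_one_sd = integrate(standard_normal_pdf, -1, 1)            # P(-1 <= X <= 1)
```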
Central Limit Theorem