ch2 Flashcards
state space
the set of possible values that a random variable (or process) can take
probability of a union of two events
p(A) + p(B) - p(A and B)
product rule
probability of the joint event A and B is
p(A, B) = p(A and B) = p(A|B)p(B)
sum rule, law of total probability
p(A) = sumOverB( p(A,B) ) = sumOverB( p(A|B = b)p(B = b) )
where we are summing over all possible states of B
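A minimal Python sketch (a toy joint table with made-up numbers) checking both the product and sum rules numerically:

# toy joint distribution p(A, B) over A in {0, 1}, B in {0, 1} (hypothetical values)
p_joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# sum rule: p(A = 1) = sum over b of p(A = 1, B = b)
p_A1 = sum(p for (a, b), p in p_joint.items() if a == 1)  # 0.6

# product rule: p(A = 1, B = 1) = p(A = 1 | B = 1) * p(B = 1)
p_B1 = sum(p for (a, b), p in p_joint.items() if b == 1)  # 0.7
p_A1_given_B1 = p_joint[(1, 1)] / p_B1
assert abs(p_A1_given_B1 * p_B1 - p_joint[(1, 1)]) < 1e-12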
marginal distribution
of a subset of a collection of random variables: the probability distribution of the variables contained in the subset
gives the probabilities of the various values of the variables in the subset without reference to the values of the other variables
chain rule of probability
permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities with successive applications of the law of total probability and product rule
with four variables, chain rule produces this:
p(a, b, c, d) = p(a | b, c, d) p(b | c, d) p(c | d) p(d)
conditional probability
p(A|B) = p(A,B)/p(B), defined when p(B) > 0
Bayes rule
p(X = x | Y = y) = p(X = x, Y = y)/p(Y = y) = [ p(X = x)p(Y = y | X = x) ] / sumOverX'[ p(X = x')p(Y = y | X = x') ]
Sensitivity
aka the true positive rate, the recall, or the probability of detection
Measures the proportion of actual positives that are correctly identified as such
The probability that a test will be positive when the condition being tested for is present
Base rate fallacy
If presented with related base rate information (i.e. generic, general information) and specific information (information pertaining only to a certain case), the mind tends to ignore the former and focus on the latter.
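A worked sketch of how the base rate dominates, using Bayes rule with sensitivity and illustrative numbers (1% prevalence, 80% sensitivity, 10% false positive rate; not from any real test):

# posterior probability of the condition given a positive test, via Bayes rule
prior = 0.01        # base rate: p(condition)
sensitivity = 0.80  # p(positive | condition)
fpr = 0.10          # p(positive | no condition)

evidence = sensitivity * prior + fpr * (1 - prior)  # p(positive), by the sum rule
posterior = sensitivity * prior / evidence
print(posterior)  # ~0.075; intuition that ignores the 1% base rate guesses far higher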
Generative classifier
Classifier that specifies how to generate the data using the class-conditional density p(x | y = c) and the class prior p(y = c)
Discriminative classifier
Classifier that directly fits the class posterior p(y = c | x). In contrast to generative models, which model all values of a phenomenon (both those that can be observed in the world and target variables that can only be computed from those observed), discriminative classifiers provide a model ONLY for the target variables.
In simple terms, discriminative models infer outputs from inputs, while generative models model both inputs and outputs.
Unconditional or marginal independence
Two events X and Y are unconditionally (marginally) independent if p(X, Y) = p(X)p(Y)
Conditional independence (CI)
X and Y are conditionally independent given Z iff the conditional joint can be written as a product of conditional marginals
p(X,Y | Z) = p(X|Z)p(Y|Z)
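A small numeric sketch (a hypothetical joint table constructed so that CI holds) verifying the definition:

# hypothetical joint p(x, y, z) chosen so that X and Y are CI given Z
p = {(0, 0, 0): 0.1,   (0, 1, 0): 0.1,   (1, 0, 0): 0.1,   (1, 1, 0): 0.1,
     (0, 0, 1): 0.048, (0, 1, 1): 0.012, (1, 0, 1): 0.432, (1, 1, 1): 0.108}

for z in (0, 1):
    p_z = sum(v for (x, y, zz), v in p.items() if zz == z)
    for x in (0, 1):
        for y in (0, 1):
            p_xy_z = p[(x, y, z)] / p_z                        # p(x, y | z)
            p_x_z = sum(p[(x, yy, z)] for yy in (0, 1)) / p_z  # p(x | z)
            p_y_z = sum(p[(xx, y, z)] for xx in (0, 1)) / p_z  # p(y | z)
            assert abs(p_xy_z - p_x_z * p_y_z) < 1e-9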
Cumulative distribution function (cdf)
the probability that X will take a value less than or equal to x
Probability density function (pdf)
a function, whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample
Tail area probabilities
The probability that a random variable deviates by a given amount from its expectation
Variance
Measure of the spread of a distribution, denoted sigma^2. Defined as
Var[X] = E[(X - mu)^2], where mu = E[X] is the population mean
Standard deviation
Derived from the variance as std[X] = sqrt(Var[X]); useful because it has the same units as X itself
Binomial distribution
With parameters n and p, the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes-no question with success probability p and failure probability q = 1 - p
Pmf: Bin(k | n, p) = C(n, k) p^k (1 - p)^(n - k), where the binomial coefficient C(n, k) = n!/(k!(n - k)!) counts the ways to choose which k of the n trials succeed
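A quick sketch of that pmf in plain Python (math.comb is in the standard library from Python 3.8):

import math

def binom_pmf(k, n, p):
    # ways to choose which k of the n trials succeed, times the
    # probability of any one particular sequence with k successes
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# sanity check: the pmf sums to 1 over k = 0..n
assert abs(sum(binom_pmf(k, 10, 0.3) for k in range(11)) - 1) < 1e-12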
Bernoulli distribution
Probability distribution of a random experiment with exactly two possible outcomes, in which the probability of success is the same every time the experiment is conducted; the special case of the binomial distribution with n = 1
Multinomial distribution
Variation of binomial distribution involving more than two outcomes.
PMF is [n!/(x1! x2! ... xk!)] p1^(x1) ... pk^(xk) for k possible outcomes and n trials, where xj is the number of times outcome j occurs (so x1 + ... + xk = n); a sketch follows
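A direct translation of that PMF into Python (illustrative die-roll numbers):

import math

def multinomial_pmf(counts, probs):
    # counts[j] = number of times outcome j occurred; sum(counts) = n
    coef = math.factorial(sum(counts))
    for x in counts:
        coef //= math.factorial(x)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p**x
    return coef * prob

# 6 rolls of a fair die, each face appearing exactly once: 6!/6^6 ~ 0.0154
print(multinomial_pmf([1] * 6, [1 / 6] * 6))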
Poisson distribution
Poi(x | lambda) = e^(-lambda) (lambda^x)/(x!)
The first term is the normalization constant, ensuring the distribution sums to 1.
Expresses the probability of a given number of events occurring in a fixed interval of time/space if (1) these events occur with a known constant rate lambda and (2) independently of the time since the last event.
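A minimal pmf sketch in Python (lambda = 3.0 is an arbitrary example rate):

import math

def poisson_pmf(x, lam):
    # e^(-lam) is the normalization constant; lam is the rate
    return math.exp(-lam) * lam**x / math.factorial(x)

# the pmf sums to ~1 once enough terms are included
assert abs(sum(poisson_pmf(x, 3.0) for x in range(100)) - 1) < 1e-9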
Empirical distribution
Fn(t) = (number of elements in the sample <= t) / n
Dirac measure
Assigning a size to a set based solely on whether it contains a fixed element x or not.
Gaussian distribution
Used because of the central limit theorem, which states that averages of samples of observations of random variables independently drawn from independent distributions converge in distribution to the normal. Physical quantities that are expected to be the sum of many independent processes (e.g. measurement errors) thus often have distributions that are nearly normal.
Probability density is [1/sqrt(2 pi sigma^2)] e^[-(x - mu)^2/(2 sigma^2)], where mu is the population mean and sigma^2 the variance
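A sketch of that density in Python, with a crude Riemann-sum check that it integrates to ~1 (the grid limits and step are arbitrary):

import math

def gauss_pdf(x, mu, sigma2):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

dx = 0.001
total = sum(gauss_pdf(-10 + i * dx, 0.0, 1.0) * dx for i in range(20_000))
assert abs(total - 1) < 1e-3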
Precision of a gaussian
The inverse variance of a Gaussian, 1/sigma^2. A high precision means a narrow distribution centered on the population mean.
Error function
Special function of sigmoid shape that describes diffusion. erf(x) = (2/sqrt(pi)) * integral from 0 to x of e^(-t^2) dt
For nonnegative values of x, the error function has the following interpretation: for a random variable Y that is normally distributed with mean 0 and variance 1/2, erf(x) describes the probability of Y falling in the range [-x, x].
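math.erf is in the Python standard library; a common use (a sketch, not the only construction) is building the normal cdf from it:

import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Phi(x) = (1 + erf((x - mu)/(sigma * sqrt(2)))) / 2
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(norm_cdf(1) - norm_cdf(-1))  # ~0.6827, the 68 of the 68-95-99.7 rule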
dirac delta function
formed in the limit variance -> 0, where the Gaussian becomes an infinitely tall and infinitely thin "spike" centered at the mean
has the sifting property, which selects out a single term from a sum or integration, since the integrand is only non-zero where x - mu = 0
student’s t distribution
used since Gaussians are more sensitive to outliers, as their log probability decays only quadratically with distance from the center; the t distribution has heavier tails
T(x | mu, sigma^2, v) is proportional to [1 + (1/v)((x - mu)/sigma)^2]^(-(v+1)/2)
mu is the mean, sigma^2 is a scale parameter, v is the degrees of freedom; the variance is actually (v sigma^2)/(v - 2), defined for v > 2
cauchy/lorentz distribution
the t-distribution with 1 degree of freedom; has such heavy tails that the integral defining the mean doesn't converge
laplace distribution
another distribution with heavy tails (low sensitivity to outliers), aka the double-sided exponential distribution
Lap(x | mu, b) = (1/(2b)) exp(-|x - mu|/b)
mu is a location parameter and b > 0 is a scale parameter
mean and mode are both mu; variance is 2b^2
puts more density at zero than the Gaussian, which is useful for encouraging sparsity in a model
exponential distribution
special case of the gamma distribution, Expon(x | lambda) = Ga(x | 1, lambda), where lambda is the rate parameter. Describes the times between events in a Poisson process (i.e. a process in which events occur continuously and independently at a constant average rate lambda)
chi-squared distribution
special case of the gamma distribution, Ga(x | v/2, 1/2); the distribution of the sum of squares of v standard Gaussian random variables
erlang distribution
special case of the gamma distribution where the shape a is an integer, often fixed at 2, yielding Ga(x | 2, lambda), where lambda is the rate parameter.
Events that occur independently with some average rate are modeled with a Poisson process. The waiting times between k occurrences of the event are Erlang distributed. (The related question of the number of events in a given amount of time is described by the Poisson distribution.)
beta distribution
a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted alpha and beta, that appear as exponents of the random variable and control the shape of the distribution
has been applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines
Beta(x | alpha, beta) = [Gamma(alpha + beta)/(Gamma(alpha)Gamma(beta))] x^(alpha - 1) (1 - x)^(beta - 1)
where Gamma is the gamma function
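A sketch of the pdf in plain Python using math.gamma (Beta(2, 2) chosen as an easy check, since its pdf is 6x(1 - x)):

import math

def beta_pdf(x, a, b):
    # normalizer is Gamma(a + b) / (Gamma(a) Gamma(b))
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x**(a - 1) * (1 - x)**(b - 1)

print(beta_pdf(0.5, 2, 2))  # 6 * 0.5 * 0.5 = 1.5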
gamma distribution
a flexible distribution for positive real-valued rvs, x > 0
Defined in terms of two parameters, shape a > 0 and rate b > 0: Ga(T | shape = a, rate = b) = [(b^a)/Gamma(a)] T^(a-1) e^(-Tb)
where Gamma is the gamma function
gamma function
Gamma(x) = integral from 0 to infinity of u^(x-1) e^(-u) du
an extension of the factorial function, with its argument shifted down by 1, to real and complex numbers
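Two quick checks in Python (math.gamma is in the standard library):

import math

# Gamma extends the factorial: Gamma(n) = (n - 1)! for positive integers n
assert math.gamma(5) == math.factorial(4)  # both 24
print(math.gamma(0.5))  # sqrt(pi) ~ 1.7725: defined for non-integers too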
pareto distribution
used to model the distribution of quantities that exhibit long/heavy tails. For example, word frequencies in English follow Zipf's law. Wealth is similarly skewed, especially in plutocracies like the US.
pdf is k m^k x^(-(k+1)) I(x >= m), where m is the minimum value and k controls the tail
Zipf’s law
Zipf’s law states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table
covariance
measurement of the degree to which X and Y are (linearly) related
cov[X,Y] = E[XY] - E[X]E[Y] = E[(X - E[X])(Y - E[Y])]
can be anywhere between -infinity and +infinity, which motivates the normalized measure below
(pearson) correlation coefficient
cov[X,Y]/sqrt(var[X]var[Y]). A normalized measure with a finite bound: -1 <= corr[X,Y] <= 1. corr[X,Y] = 1 iff Y = aX + b for some a > 0, i.e. there is a perfect (positive) linear relationship between X and Y.
not the same as the slope of the regression line, which is actually cov[X,Y]/var[X]
correlation implies dependence, but noncorrelation does not imply independence (another relationship might hold)
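A small numpy sketch (synthetic data; the seed and coefficients are arbitrary) contrasting the unbounded covariance with the normalized correlation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)  # linearly related plus noise

print(np.cov(x, y)[0, 1])       # sample covariance, ~2: units depend on X and Y
print(np.corrcoef(x, y)[0, 1])  # normalized to [-1, 1], here ~0.89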
multivariate gaussian/normal (MVN)
most widely used joint probability density function for continuous variables; covered more in ch4
linearity of expectation
the property that the expected value of the sum of random variables is equal to the sum of their individual expected values, regardless of whether they are independent
linear transformation of random variable
y = f(x) = Ax + b
E[y] = E[Ax + b] = A mu + b, where mu = E[x]
cov[y] = cov[Ax + b] = A cov[x] A^T, where A^T is the transpose of A
Mean and covariance only define the distribution of y if x is Gaussian.
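A sampling sketch in numpy (A, b, and the sample size are arbitrary) confirming both identities:

import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([3.0, -1.0])
x = rng.normal(size=(100_000, 2))  # x ~ N(0, I), so mu = 0 and cov[x] = I

y = x @ A.T + b                    # y = Ax + b for each sample
print(y.mean(axis=0))              # ~ A @ mu + b = b
print(np.cov(y.T))                 # ~ A @ cov[x] @ A.T = A @ A.T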