Week 5: Statistical Modelling Flashcards
Probability Distributions
This approach models uncertainty and quantifies our degree of belief that something will happen.
Probability Distribution Function (PDF)
The area under the curve between two points of a PDF is the probability of the outcome being within the two points.
Cumulative Distribution Function (CDF)
The height of the curve at a point is the chance that the outcome is less than or equal to the point.
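A minimal sketch of the PDF/CDF relationship, assuming a standard normal as the example distribution: the area under the PDF between two points is checked against the difference of CDF values.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Example distribution (assumed): a standard normal.
dist = norm(loc=0.0, scale=1.0)

a, b = -1.0, 1.0

# Area under the PDF between a and b (numerical integration).
area, _ = quad(dist.pdf, a, b)

# Same probability from the CDF: P(a <= X <= b) = F(b) - F(a).
prob = dist.cdf(b) - dist.cdf(a)

print(area, prob)  # both ~0.6827
```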
Joint Distribution
It’s the probability distribution of all the random variables in the set.
Independence
A and B are independent if
P(A|B) = P(A),
P(B|A) = P(B),
P(A,B) = P(A)*P(B)
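A small illustrative sketch, using a hypothetical joint distribution for two binary variables constructed to be independent, that checks the conditions above numerically.

```python
import numpy as np

# Hypothetical joint distribution P(A, B) for two binary variables,
# built so that A and B are independent: each cell is P(A=a) * P(B=b).
p_a = np.array([0.3, 0.7])          # P(A)
p_b = np.array([0.6, 0.4])          # P(B)
joint = np.outer(p_a, p_b)          # P(A, B) = P(A) * P(B)

# Marginals recovered from the joint.
marg_a = joint.sum(axis=1)
marg_b = joint.sum(axis=0)

# Independence check: P(A, B) == P(A) * P(B) for every cell.
print(np.allclose(joint, np.outer(marg_a, marg_b)))      # True

# Conditional check: P(A | B=b) == P(A) for every b.
cond_a_given_b = joint / marg_b     # columns are P(A | B=b)
print(np.allclose(cond_a_given_b, marg_a[:, None]))      # True
```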
Conditional Independence
A and B are conditionally independent given C
iff P(A,B|C) = P(A|C) * P(B|C), or
iff P(A|B,C) = P(A|C)
Conditional independence doesn’t imply unconditional independence or the other way around.
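A hypothetical numerical example of the last point: two flips A and B of a coin chosen by C are conditionally independent given C, yet not marginally independent.

```python
import numpy as np

# Hypothetical example: C picks one of two coins (equal prior); A and B are
# two flips of the chosen coin. Given C the flips are independent, but
# marginally they are not (both flips carry information about the coin).
p_c = np.array([0.5, 0.5])                 # P(C)
p_heads_given_c = np.array([0.9, 0.1])     # P(A=1|C) = P(B=1|C)

# Joint P(A, B) by marginalising over C, using P(A,B|C) = P(A|C) P(B|C).
p_ab = np.zeros((2, 2))
for c in range(2):
    p_a_c = np.array([1 - p_heads_given_c[c], p_heads_given_c[c]])
    p_ab += p_c[c] * np.outer(p_a_c, p_a_c)

p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)

# Conditionally independent given C by construction, but NOT independent:
print(p_ab[1, 1], p_a[1] * p_b[1])   # 0.41 vs 0.25
```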
Representative Sample
A sample from a population that accurately reflects the characteristics of the population.
Prior
The initial probability that hypothesis h holds without having observed the data.
P(h)
Likelihood
The probability of observing data D, given some world where the hypothesis h is true.
P(D|h)
Posterior
The probability that hypothesis h is true, given that we have observed dataset D.
P(h|D)
Likelihoods
When modelling a random process, we don’t know the hypothesis h. We estimate the parameters of a model h by maximising the probability P(D|h) (or L(h|D)) of observing D. Hypotheses aren’t always mutually exclusive and there can be an infinite number of them.
Maximum Likelihood Estimate (MLE)
Calculate the parameters of h so as to maximise the likelihood L(h \mid D).
Goal is \arg \max_h \left\{ L(h \mid D) \right\}
L(h \mid D) = P(D \mid h) = \prod_{i=1}^m P(\boldsymbol{x}_i \mid h)
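A minimal sketch, assuming the hypotheses are 1-D Gaussians parameterised by (\mu, \sigma) and using synthetic data: the log-likelihood is maximised numerically and compared with the known closed-form Gaussian MLE (sample mean and biased standard deviation).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)   # synthetic dataset D

# Negative log-likelihood: -log L(h|D) = -sum_i log P(x_i | mu, sigma).
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# Numerical argmax_h L(h|D).
res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# For a Gaussian the MLE has a closed form: sample mean and biased std.
print(mu_hat, data.mean())
print(sigma_hat, data.std())
```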
Bayesian Estimation
Compute a model h of maximum posterior probability Pr(h \mid D)
Goal is \arg \max_h \left\{ P(h \mid D) \right\}
Using Bayes Rule,
P(h \mid D) = \frac{P(D \mid h) \cdot P(h)}{P(D)}
Computing the likelihood P(D \mid h) for data with multiple attributes typically relies on the conditional independence assumption.
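A minimal sketch with a hypothetical discrete hypothesis space (three candidate coin biases): the posterior is computed via Bayes rule and the maximum-posterior hypothesis selected.

```python
import numpy as np

# Hypothetical setup: three candidate hypotheses for a coin's heads
# probability, a prior P(h), and observed data D = 8 heads in 10 flips.
hypotheses = np.array([0.3, 0.5, 0.8])      # candidate values of theta
prior = np.array([0.2, 0.6, 0.2])           # P(h)
heads, tails = 8, 2

# Likelihood P(D|h) for each hypothesis (the constant binomial
# coefficient cancels when the posterior is normalised).
likelihood = hypotheses**heads * (1 - hypotheses)**tails

# Bayes rule: P(h|D) = P(D|h) P(h) / P(D), with P(D) as the normaliser.
posterior = likelihood * prior
posterior /= posterior.sum()

print(posterior)                           # posterior over the hypotheses
print(hypotheses[np.argmax(posterior)])    # maximum-posterior hypothesis
```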
Conditional Independence Assumption
Assuming the attributes are conditionally independent given the hypothesis, the probabilities of the individual attributes given the hypothesis can be multiplied together with the prior of the hypothesis in the numerator, divided by the probability of the data: P(h \mid D) = \frac{P(h) \prod_j P(x_j \mid h)}{P(D)}.
Probability Density Function for Normal Distribution
f(x) = \frac{1}{\sigma \sqrt{2 \pi}}e^{-\frac{1}{2} \left( \frac{(x - \mu)^2}{\sigma^2} \right)}
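A short sketch implementing this density formula directly and comparing it with scipy.stats.norm.pdf (the library is used here only as a reference).

```python
import numpy as np
from scipy.stats import norm

# Direct implementation of the normal density formula above.
def normal_pdf(x, mu, sigma):
    return (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x - mu) ** 2 / sigma**2)

x = np.linspace(-3, 7, 5)
print(normal_pdf(x, mu=2.0, sigma=1.5))
print(norm.pdf(x, loc=2.0, scale=1.5))   # matches the formula above
```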
Laplace Estimation
When computing likelihoods for each possible attribute value, add 1 to the count in the numerator and \ell (the number of possible values of the attribute) to the denominator. This prevents zero probability estimates for attribute values that do not appear in the training data.
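A small sketch of the smoothed estimate, assuming a hypothetical categorical attribute with \ell = 3 possible values and an estimate of the form (count + 1)/(n + \ell).

```python
# Hypothetical categorical attribute observed within one class; it can take
# l = 3 possible values, so l is added to the denominator.
observations = ['red', 'red', 'green', 'red', 'green']
possible_values = ['red', 'green', 'blue']
num_values = len(possible_values)        # this is l
n = len(observations)

counts = {v: observations.count(v) for v in possible_values}

# Laplace-smoothed estimate: (count + 1) / (n + l).
smoothed = {v: (counts[v] + 1) / (n + num_values) for v in possible_values}

print(smoothed)   # 'blue' gets a small non-zero estimate instead of 0
```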
Density Estimation
Given a dataset, compute an estimate of an underlying probability density function.
Parametric Models
The number of parameters is fixed and independent of the training set size. These are approximations of reality and incorporate stronger assumptions than non-parametric models. They’re generally more explainable and enable deeper investigations.
Examples include:
- Multivariate Linear Regression
- Neural Networks
- k-Means
- Gaussian
Non-parametric Models
The number of parameters grows as the sample size increases. Their modelling power grows with the data, so they can capture more complex structure while making weaker assumptions than parametric models.
Examples include:
- Decision Tree
- DBSCAN
Multivariate Gaussian/Normal Distribution
f(\boldsymbol{x}) = \frac{1}{(2\pi)^{\frac{n}{2}} \left\lvert \boldsymbol{\Sigma} \right\rvert ^{\frac{1}{2}}} e^{-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})}
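A short sketch implementing this density directly and comparing it with scipy.stats.multivariate_normal (used here only as a reference implementation).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Direct implementation of the multivariate normal density formula above.
def mvn_pdf(x, mu, sigma):
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 0.0])

print(mvn_pdf(x, mu, sigma))
print(multivariate_normal(mean=mu, cov=sigma).pdf(x))   # same value
```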
Iso-density Contours
In these contours, all the points x have equal density. f(x) = c.
This is similar to the elevation maps used in topography.
Poisson Distribution
f(x) = \frac{\delta^x e^{-\delta}}{x !}
\delta = rate at which the events occur
x = random variable corresponding to the number of events
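A short sketch implementing this mass function directly and comparing it with scipy.stats.poisson.pmf.

```python
import numpy as np
from math import factorial
from scipy.stats import poisson

# Direct implementation of the Poisson mass function above (rate delta).
def poisson_pmf(x, delta):
    return (delta**x) * np.exp(-delta) / factorial(x)

delta = 3.0
for x in range(5):
    print(x, poisson_pmf(x, delta), poisson.pmf(x, delta))  # same values
```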
Mixture Model
Consists of multiple component models, each specified by its own parameters.
f(\boldsymbol{x}) = \sum_{k=1}^K \pi_k f_k (\boldsymbol{x}; \boldsymbol{w}_k)
Log-likelihood of a Mixture Model
L(\boldsymbol{\pi}, \boldsymbol{w}_1, \dots, \boldsymbol{w}_K) = \sum_{i=1}^m \log \left[ \sum_{k=1}^K \pi_k f_k (\boldsymbol{x}_i; \boldsymbol{w}_k) \right]
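A minimal sketch computing this log-likelihood for a hypothetical two-component 1-D Gaussian mixture on synthetic data.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D Gaussian mixture: component weights pi_k and component
# densities f_k parameterised by (mean, std).
weights = np.array([0.4, 0.6])          # pi_k, summing to 1
means = np.array([0.0, 5.0])
stds = np.array([1.0, 2.0])

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=100)   # some dataset x_1..x_m

# Mixture density f(x) = sum_k pi_k f_k(x; w_k), evaluated per data point.
per_component = weights * norm.pdf(data[:, None], loc=means, scale=stds)
mixture_density = per_component.sum(axis=1)

# Log-likelihood: sum over data points of the log mixture density.
log_likelihood = np.sum(np.log(mixture_density))
print(log_likelihood)
```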
Gaussian Mixture Model (GMM)
P(\boldsymbol{x}_i) = \sum_{k=1}^K P(C_k) P(\boldsymbol{x}_i \mid C_k)
Expectation-Maximisation (EM) Algorithm
A well-known algorithm for fitting GMMs.
E-Step: compute the responsibilities \pi_{i,k} = P(C_k \mid \boldsymbol{x}_i) using Bayes rule, i.e. \pi_{i,k} \propto P(\boldsymbol{x}_i \mid C_k) P(C_k), and the effective component counts m_k = \sum_{i=1}^m \pi_{i,k}.
M-step: compute the new means, covariances, and component weights
\boldsymbol{\mu}_k \leftarrow \sum_{i=1}^m \left( \frac{\pi_{i,k}}{m_k} \right) \boldsymbol{x}_i
\boldsymbol{\Sigma}_k \leftarrow \sum_{i=1}^m \left( \frac{\pi_{i,k}}{m_k} \right) (\boldsymbol{x}_i - \boldsymbol{\mu}_k) (\boldsymbol{x}_i - \boldsymbol{\mu}_k)^T
\pi_k \leftarrow \frac{m_k}{m}
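A minimal EM sketch for a K-component GMM on synthetic 2-D data, following the E- and M-step updates above; the random initialisation and fixed number of iterations are assumptions, not part of the card.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Synthetic 2-D data from two well-separated clusters (assumed for the demo).
rng = np.random.default_rng(0)
data = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
                  rng.normal([5, 5], 1.0, size=(150, 2))])
m, n = data.shape
K = 2

# Initialisation (assumption): uniform weights, random means, identity covs.
pis = np.full(K, 1.0 / K)                         # component weights pi_k
mus = data[rng.choice(m, K, replace=False)]       # means mu_k
sigmas = np.array([np.eye(n) for _ in range(K)])  # covariances Sigma_k

for _ in range(50):
    # E-step: responsibilities pi_{i,k} = P(C_k | x_i) via Bayes rule.
    resp = np.column_stack([
        pis[k] * multivariate_normal(mus[k], sigmas[k]).pdf(data)
        for k in range(K)
    ])
    resp /= resp.sum(axis=1, keepdims=True)
    m_k = resp.sum(axis=0)                        # m_k = sum_i pi_{i,k}

    # M-step: new means, covariances, and component weights.
    mus = (resp.T @ data) / m_k[:, None]
    for k in range(K):
        diff = data - mus[k]
        sigmas[k] = (resp[:, k][:, None] * diff).T @ diff / m_k[k]
    pis = m_k / m

print(pis)
print(mus)     # approximately the two cluster centres
```

In practice a library implementation such as sklearn.mixture.GaussianMixture would typically be used instead of hand-rolled EM.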