2013 Past Paper Flashcards
What are the two main conditions a function f(x) must meet in order to be a probability density function?
f(x) ≥ 0 for all x
and
∫ f(x) dx = 1 (integrating over the whole range of x)
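Both conditions can be checked numerically for a concrete density. A minimal sketch, assuming NumPy and SciPy are available, using the standard normal as the example density:

```python
# Check the two PDF conditions for the standard normal density.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

xs = np.linspace(-10, 10, 1001)
assert np.all(norm.pdf(xs) >= 0)            # non-negativity (checked on a grid)

total, _ = quad(norm.pdf, -np.inf, np.inf)  # integral over the real line
print(total)                                # ≈ 1.0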
Explain the difference between a Likelihood function and a density function
- The likelihood is defined as the joint density of the observed data, viewed as a function of the parameter. The likelihood function is a function of the parameter only, with the data held fixed. Therefore, the likelihood function is not a pdf, because its integral with respect to the parameter does not necessarily equal 1.
- A pdf is a non-negative function that integrates to 1. In probability theory, a probability density function (pdf) describes the relative likelihood that a continuous random variable takes on a given value.
Michael Hochster, PhD in Statistics, Stanford; Director of Research, Pandora:
When we think of f(x,θ) as a likelihood, we instead hold x constant and let θ vary.
When we view f as a density, we have some constant values of θ in mind and think of the function as varying in x.
Use an example to differentiate between a likelihood function and a density function.
Consider the simple binomial case of a coin flip. Say we flip the coin 10 times and get 7 heads and 3 tails. If we know the coin is fair, then we know the parameter value of the binomial formula (p = 0.5), and we can view that formula as a function of the number of heads x given p.
If, on the other hand, we are not sure the coin is fair and do not know the value of p, then we can view the same binomial formula as a function of the probability of heads p, given that we flipped the coin 10 times and got 7 heads. Here 0.117, the binomial probability of 7 heads in 10 flips when p = 0.5, is the likelihood of p = 0.5 given our 10 flips and 7 heads.
So density is a function of possible values of the data given the model parameters.
Likelihood is a function of possible values of the model parameters given the data.
They are the same equation!
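The two readings of the same binomial formula can be made concrete in code. A minimal sketch, assuming SciPy is available:

```python
# One binomial pmf, read two ways: as a density in x (p fixed) and as a
# likelihood in p (x fixed at the observed 7 heads out of 10 flips).
from scipy.stats import binom

n = 10

# Density view: p is fixed at 0.5, x varies over the possible head counts.
pmf_over_x = [binom.pmf(x, n, 0.5) for x in range(n + 1)]
print(pmf_over_x[7])                     # P(X = 7 | p = 0.5) ≈ 0.117

# Likelihood view: x is fixed at 7, p varies.
likelihood_over_p = [(p, binom.pmf(7, n, p)) for p in (0.3, 0.5, 0.7)]
print(likelihood_over_p)                 # largest at p = 0.7, the MLE
```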
Name 3 approaches to parameter estimation discussed in this course:
- Maximum Likelihood Estimation
- Maximum A Posteriori (MAP estimate)
- Bayes Estimator (mean of the posterior distribution and minimum mean square error MMSE)
- Method of Moments (a simple point estimate)
- Kalman Filtering (Linear quadratic estimation (LQE))
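For a concrete feel of the first three estimators, consider again the coin-flip data of 7 heads in 10 flips; the Beta(2, 2) prior below is an assumption chosen purely for illustration:

```python
# MLE, MAP, and Bayes (posterior-mean) estimates of a coin's head probability
# from 7 heads in 10 flips, under an assumed Beta(2, 2) prior.
heads, n = 7, 10
a, b = 2.0, 2.0                              # assumed Beta prior parameters

p_mle = heads / n                            # maximum likelihood estimate
p_map = (heads + a - 1) / (n + a + b - 2)    # mode of the Beta(heads+a, n-heads+b) posterior
p_bayes = (heads + a) / (n + a + b)          # posterior mean (MMSE / Bayes estimate)

print(p_mle, p_map, p_bayes)                 # 0.7, 0.667, 0.643
```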
Name 3 approaches to optimization used in this course:
- Gradient descent
- Newton-Raphson Method
- Expectation Maximisation (EM)
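A minimal sketch of the first two optimizers on a toy quadratic objective f(θ) = (θ − 3)², an illustrative choice rather than anything from the paper:

```python
# Gradient descent and Newton-Raphson minimizing f(theta) = (theta - 3)^2.
f_grad = lambda t: 2.0 * (t - 3.0)   # f'(theta)
f_hess = lambda t: 2.0               # f''(theta), constant for a quadratic

# Gradient descent: repeatedly step against the gradient with a fixed rate.
theta = 0.0
for _ in range(100):
    theta -= 0.1 * f_grad(theta)
print(theta)                         # ≈ 3.0

# Newton-Raphson: scale the gradient by the inverse curvature; a single step
# is exact here because the objective is quadratic.
theta = 0.0
theta -= f_grad(theta) / f_hess(theta)
print(theta)                         # 3.0
```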
Contrast generative and discriminative models
Generative classifiers learn a model of the joint probability, p(x, y), of the inputs x and the label y, make their predictions by using Bayes' rule to calculate p(y|x), and then pick the most likely label y.
Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels.
Put from another perspective:
Discriminative models learn the (hard or soft) boundary between classes.
Generative models model the distribution of individual classes.
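A minimal sketch of the contrast, assuming scikit-learn is available: Gaussian naive Bayes is a generative classifier (it models class priors and per-class densities), while logistic regression is discriminative (it models p(y|x) directly).

```python
# Generative vs discriminative classifiers on synthetic data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

generative = GaussianNB().fit(X, y)               # learns p(y) and p(x|y)
discriminative = LogisticRegression().fit(X, y)   # learns p(y|x) / the boundary directly

# Both expose p(y|x); the generative model obtains it via Bayes' rule.
print(generative.predict_proba(X[:2]))
print(discriminative.predict_proba(X[:2]))
```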
Reasons for using generative models rather than discriminative.
It is typically more straightforward to detect distribution changes and update a generative model accordingly than to do this for a decision boundary in a discriminative model, especially if the updates need to be unsupervised. Discriminative models also generally cannot be used for outlier detection, whereas generative models generally can.
After fitting a generative model, you can also run it forward to generate synthetic data sets. You can fix some of the model parameters while running it forward to experiment with the effect this has on the synthetic data, as well.
Reasons for using discriminative models rather than generative.
If the relationships expressed by your generative model only approximate the true underlying generative process that created your data, a discriminative model will typically outperform it in terms of classification error rate (and the amount of training data required).
Generally speaking, if you want an explanatory model that makes explicit claims about how the data is generated, then you’ll use a generative model. If you want to optimize classification accuracy and don’t need to make claims about how model parameters interact, then you’ll use a discriminative model.
Differentiate between PCA and LDA
PCA:
- Unsupervised
- Finds lower dimensional subspaces that describe the essential properties of the data.
- Chooses these subspaces using the directions of maximal variation in the data (the principal components).
LDA:
- Supervised
- Finds the lower-dimensional subspace that maximises class separation.
- Uses the class labels to find directions that maximise the separation between classes (see the code sketch below).
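A minimal sketch contrasting the two on the Iris data, assuming scikit-learn is available (note that PCA never sees the labels y, while LDA requires them):

```python
# PCA (unsupervised) vs LDA (supervised) dimensionality reduction on Iris.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 150 samples, d = 4 features, k = 3 classes

X_pca = PCA(n_components=2).fit_transform(X)                             # directions of maximal variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # directions of maximal class separation

print(X_pca.shape, X_lda.shape)          # (150, 2) (150, 2); LDA allows at most k - 1 = 2 axes
```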
The maximum possible number of principal component axes.
Number of components ≤ min(n, d)
n = number of samples, d = number of features (dimension of the data)
The maximum possible number of linear discriminant axes.
Number of components ≤ min(k-1, d)
k = number of classes, d = number of features (dimension of the data)
Define the EM algorithm goal
The expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables. It alternates between two steps:
(E) step: constructs the expected complete-data log-likelihood as a function of the parameters, taking the expectation over the latent variables using the current parameter estimates.
(M) step: computes the parameters that maximize the expected log-likelihood found in the E step.
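As a concrete illustration, here is a minimal EM sketch for a two-component, one-dimensional Gaussian mixture; the synthetic data, initial values, and component count are assumptions chosen for illustration only:

```python
# EM for a 2-component 1-D Gaussian mixture on synthetic data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

w = np.array([0.5, 0.5])        # initial mixing weights
mu = np.array([-1.0, 1.0])      # initial means
sigma = np.array([1.0, 1.0])    # initial standard deviations

for _ in range(50):
    # E step: responsibilities = posterior probability of each component per point.
    dens = w * norm.pdf(x[:, None], mu, sigma)        # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M step: re-estimate the parameters using the responsibilities as soft counts.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sigma)             # weights ≈ [0.5, 0.5], means ≈ [-2, 3], sigmas ≈ [1, 1]
```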
In what way can the EM algorithm be seen as a (likelihood) maximization procedure?
The EM algorithm is a joint maximization procedure: it iteratively maximizes successively tighter lower bounds F on the true likelihood L(θ).
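One standard way to write this lower bound (the free-energy / ELBO form from general EM theory, stated here for the log-likelihood rather than taken from the paper): for any distribution q over the latent variables z,

```latex
F(q, \theta)
  = \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big]
  = \log p(x \mid \theta) - \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)
  \le \log p(x \mid \theta)
```

The E step sets q(z) = p(z | x, θ_old), which makes the bound tight at the current parameters; the M step then maximizes F over θ with q held fixed.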
How many parameters are needed to specify a Gaussian mixture model for d-dimensional data using K components in the general case (full covariance matrices)?
(1/2) Kd(d+3)
(Kd mean parameters plus (1/2)Kd(d+1) parameters for the K symmetric covariance matrices; the K-1 mixing weights are extra if they are counted.)
How many parameters are needed to specify a Gaussian mixture model for d-dimensional data using K components when the covariance matrices are diagonal?
2Kd
(Kd mean parameters plus Kd diagonal variances; again, the K-1 mixing weights are extra if counted.)
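A minimal sketch that reproduces both counts for arbitrary K and d (mixing weights excluded, matching the formulas above):

```python
# Parameter counts for a K-component GMM in d dimensions (weights excluded).
def gmm_params_full(K, d):
    means = K * d
    covariances = K * d * (d + 1) // 2     # K symmetric d x d matrices
    return means + covariances             # = (1/2) * K * d * (d + 3)

def gmm_params_diag(K, d):
    return K * d + K * d                   # means + per-dimension variances = 2Kd

print(gmm_params_full(3, 2))   # 3*2 + 3*3 = 15 = (1/2)*3*2*5
print(gmm_params_diag(3, 2))   # 12
```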