2013 Past Paper Flashcards
What are the two main conditions a function f(x) must meet in order to be a probability density function?
f(x) ≥ 0 for all x
and
∫ f(x) dx = 1 (integrating over the whole range of x)
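Both conditions can be checked numerically for a concrete density. A minimal sketch, assuming NumPy and SciPy are available, using the standard normal as the example density:

```python
# Check the two PDF conditions for the standard normal density.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

xs = np.linspace(-10, 10, 1001)
assert np.all(norm.pdf(xs) >= 0)            # non-negativity (checked on a grid)

total, _ = quad(norm.pdf, -np.inf, np.inf)  # integral over the real line
print(total)                                # ≈ 1.0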
Explain the difference between a Likelihood function and a density function
- The likelihood is defined as the joint density of the observed data, viewed as a function of the parameter. The likelihood function is a function of the parameter only, with the data held fixed. Therefore, the likelihood function is not a pdf, because its integral with respect to the parameter does not necessarily equal 1.
- A pdf is a non-negative function that integrates to 1. In probability theory, a probability density function (pdf) describes the relative likelihood that a continuous random variable takes on a given value.
Michael Hochster, PhD in Statistics, Stanford; Director of Research, Pandora:
When we think of f(x,θ) as a likelihood, we instead hold x constant and let θ vary.
When we view f as a density, we have some constant values of θ in mind and think of the function as varying in x.
Use an example to differentiate between a likelihood function and a density function.
Consider the simple binomial case of a coin flip. Say we flip the coin 10 times and get 7 heads and 3 tails. If we know the coin is fair, then we know the parameter value of the binomial formula (p = 0.5), and we can view that formula as a function of the number of heads x given p.
If, on the other hand, we are not sure the coin is fair and do not know the value of p, then we can view the same binomial formula as a function of the probability of heads p, given that we flipped the coin 10 times and got 7 heads. Here 0.117, the binomial probability of 7 heads in 10 flips when p = 0.5, is the likelihood of p = 0.5 given our 10 flips and 7 heads.
So density is a function of possible values of the data given the model parameters.
Likelihood is a function of possible values of the model parameters given the data.
They are the same equation!
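The two readings of the same binomial formula can be made concrete in code. A minimal sketch, assuming SciPy is available:

```python
# One binomial pmf, read two ways: as a density in x (p fixed) and as a
# likelihood in p (x fixed at the observed 7 heads out of 10 flips).
from scipy.stats import binom

n = 10

# Density view: p is fixed at 0.5, x varies over the possible head counts.
pmf_over_x = [binom.pmf(x, n, 0.5) for x in range(n + 1)]
print(pmf_over_x[7])                     # P(X = 7 | p = 0.5) ≈ 0.117

# Likelihood view: x is fixed at 7, p varies.
likelihood_over_p = [(p, binom.pmf(7, n, p)) for p in (0.3, 0.5, 0.7)]
print(likelihood_over_p)                 # largest at p = 0.7, the MLE
```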
Name 3 approaches to parameter estimation discussed in this course:
- Maximum Likelihood Estimation
- Maximum A Posteriori (MAP estimate)
- Bayes Estimator (mean of the posterior distribution and minimum mean square error MMSE)
- Method of Moments (a simple point estimate)
- Kalman Filtering (Linear quadratic estimation (LQE))
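For a concrete feel of the first three estimators, consider again the coin-flip data of 7 heads in 10 flips; the Beta(2, 2) prior below is an assumption chosen purely for illustration:

```python
# MLE, MAP, and Bayes (posterior-mean) estimates of a coin's head probability
# from 7 heads in 10 flips, under an assumed Beta(2, 2) prior.
heads, n = 7, 10
a, b = 2.0, 2.0                              # assumed Beta prior parameters

p_mle = heads / n                            # maximum likelihood estimate
p_map = (heads + a - 1) / (n + a + b - 2)    # mode of the Beta(heads+a, n-heads+b) posterior
p_bayes = (heads + a) / (n + a + b)          # posterior mean (MMSE / Bayes estimate)

print(p_mle, p_map, p_bayes)                 # 0.7, 0.667, 0.643
```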
Name 3 approaches to optimization used in this course:
- Gradient descent
- Newton-Raphson Method
- Expectation Maximisation (EM)
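A minimal sketch of the first two optimizers on a toy quadratic objective f(θ) = (θ − 3)², an illustrative choice rather than anything from the paper:

```python
# Gradient descent and Newton-Raphson minimizing f(theta) = (theta - 3)^2.
f_grad = lambda t: 2.0 * (t - 3.0)   # f'(theta)
f_hess = lambda t: 2.0               # f''(theta), constant for a quadratic

# Gradient descent: repeatedly step against the gradient with a fixed rate.
theta = 0.0
for _ in range(100):
    theta -= 0.1 * f_grad(theta)
print(theta)                         # ≈ 3.0

# Newton-Raphson: scale the gradient by the inverse curvature; a single step
# is exact here because the objective is quadratic.
theta = 0.0
theta -= f_grad(theta) / f_hess(theta)
print(theta)                         # 3.0
```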
Contrast generative and discriminative models
Generative classifiers learn a model of the joint probability, p(x, y), of the inputs x and the label y, make their predictions by using Bayes' rule to calculate p(y|x), and then pick the most likely label y.
Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels.
Put from another perspective:
Discriminative models learn the (hard or soft) boundary between classes.
Generative models model the distribution of individual classes.
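A minimal sketch of the contrast, assuming scikit-learn is available: Gaussian naive Bayes is a generative classifier (it models class priors and per-class densities), while logistic regression is discriminative (it models p(y|x) directly).

```python
# Generative vs discriminative classifiers on synthetic data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

generative = GaussianNB().fit(X, y)               # learns p(y) and p(x|y)
discriminative = LogisticRegression().fit(X, y)   # learns p(y|x) / the boundary directly

# Both expose p(y|x); the generative model obtains it via Bayes' rule.
print(generative.predict_proba(X[:2]))
print(discriminative.predict_proba(X[:2]))
```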
Reasons for using generative models rather than discriminative.
It is typically more straightforward to detect distribution changes and update a generative model accordingly than to do this for a decision boundary in a discriminative model, especially if the updates need to be unsupervised. Discriminative models also generally cannot be used for outlier detection, whereas generative models generally can.
After fitting a generative model, you can also run it forward to generate synthetic data sets. You can fix some of the model parameters while running it forward to experiment with the effect this has on the synthetic data, as well.
Reasons for using discriminative models rather than generative.
If the relationships expressed by your generative model only approximate the true underlying generative process that created your data, a discriminative model will typically outperform it in terms of classification error rate (and the amount of training data required).
Generally speaking, if you want an explanatory model that makes explicit claims about how the data is generated, then you’ll use a generative model. If you want to optimize classification accuracy and don’t need to make claims about how model parameters interact, then you’ll use a discriminative model.
Differentiate between PCA and LDA
PCA:
- Unsupervised
- Finds lower dimensional subspaces that describe the essential properties of the data.
- Chooses these subspaces using the directions of maximal variation in the data (the principal components).
LDA:
- Supervised
- Finds the lower-dimensional subspace that maximises class separation.
- Uses the class labels to find directions that maximise the separation between classes (see the code sketch below).
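A minimal sketch contrasting the two on the Iris data, assuming scikit-learn is available (note that PCA never sees the labels y, while LDA requires them):

```python
# PCA (unsupervised) vs LDA (supervised) dimensionality reduction on Iris.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 150 samples, d = 4 features, k = 3 classes

X_pca = PCA(n_components=2).fit_transform(X)                             # directions of maximal variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # directions of maximal class separation

print(X_pca.shape, X_lda.shape)          # (150, 2) (150, 2); LDA allows at most k - 1 = 2 axes
```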
The maximum possible number of principal component axes.
Number of components ≤ min(n, d)
n = number of samples, d = number of features (dimension of the data)
The maximum possible number of linear discriminant axes.
Number of components ≤ min(k-1, d)
k = number of classes, d = number of features (dimension of the data)
Define the EM algorithm goal
The expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables. It alternates between two steps:
(E) step: constructs the expected complete-data log-likelihood as a function of the parameters, taking the expectation over the latent variables using the current parameter estimates.
(M) step: computes the parameters that maximize the expected log-likelihood found in the E step.
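As a concrete illustration, here is a minimal EM sketch for a two-component, one-dimensional Gaussian mixture; the synthetic data, initial values, and component count are assumptions chosen for illustration only:

```python
# EM for a 2-component 1-D Gaussian mixture on synthetic data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

w = np.array([0.5, 0.5])        # initial mixing weights
mu = np.array([-1.0, 1.0])      # initial means
sigma = np.array([1.0, 1.0])    # initial standard deviations

for _ in range(50):
    # E step: responsibilities = posterior probability of each component per point.
    dens = w * norm.pdf(x[:, None], mu, sigma)        # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M step: re-estimate the parameters using the responsibilities as soft counts.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sigma)             # weights ≈ [0.5, 0.5], means ≈ [-2, 3], sigmas ≈ [1, 1]
```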
In what way can the EM algorithm be seen as a (likelihood) maximization procedure?
The EM algorithm is a joint maximization procedure: it iteratively maximizes successively tighter lower bounds F on the true likelihood L(θ).
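One standard way to write this lower bound (the free-energy / ELBO form from general EM theory, stated here for the log-likelihood rather than taken from the paper): for any distribution q over the latent variables z,

```latex
F(q, \theta)
  = \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta)\big] - \mathbb{E}_{q(z)}\big[\log q(z)\big]
  = \log p(x \mid \theta) - \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)
  \le \log p(x \mid \theta)
```

The E step sets q(z) = p(z | x, θ_old), which makes the bound tight at the current parameters; the M step then maximizes F over θ with q held fixed.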
How many parameters are needed to specify a Gaussian mixture model for d-dimensional data using K components in the general case (full covariance matrices)?
(1/2) Kd(d+3)
(Kd mean parameters plus (1/2)Kd(d+1) parameters for the K symmetric covariance matrices; the K-1 mixing weights are extra if they are counted.)
How many parameters are needed to specify a Gaussian mixture model for d-dimensional data using K components when the covariance matrices are diagonal?
2Kd
(Kd mean parameters plus Kd diagonal variances; again, the K-1 mixing weights are extra if counted.)
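A minimal sketch that reproduces both counts for arbitrary K and d (mixing weights excluded, matching the formulas above):

```python
# Parameter counts for a K-component GMM in d dimensions (weights excluded).
def gmm_params_full(K, d):
    means = K * d
    covariances = K * d * (d + 1) // 2     # K symmetric d x d matrices
    return means + covariances             # = (1/2) * K * d * (d + 3)

def gmm_params_diag(K, d):
    return K * d + K * d                   # means + per-dimension variances = 2Kd

print(gmm_params_full(3, 2))   # 3*2 + 3*3 = 15 = (1/2)*3*2*5
print(gmm_params_diag(3, 2))   # 12
```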