2013 Past Paper Flashcards

1
Q

What are the two main conditions a function f(x) must meet in order to be a probability density function?

A

f(x) ≥ 0 for all x
and
∫ f(x) dx = 1 (integrating over the whole range of x)

2
Q

Explain the difference between a Likelihood function and a density function

A
  • The likelihood is defined as the joint density of the observed data as a function of the parameter. The likelihood function is a function of the parameter only, with the data held as a fixed constant. Therefore, the likelihood function is not a pdf because its integral with respect to the parameter does not necessarily equal 1.
  • A pdf is a non-negative function that integrates to 1. It describes the relative likelihood of a (continuous) random variable taking on a given value.

Michael Hochster, PhD in Statistics, Stanford; Director of Research, Pandora:
When we think of f(x,θ) as a likelihood, we instead hold x constant and let θ vary.
When we view f as a density, we have some constant values of θ in mind and think of the function as varying in x.

3
Q

Use an example to differentiate between a likelihood function and a density function.

A

Consider the simple binomial case of a coin flip. Say we flip the coin 10 times and get 7 heads and 3 tails. The binomial probability of that outcome is

P(x | n, p) = C(n, x) · p^x · (1 − p)^(n − x)     (1)

which for n = 10, x = 7 and p = 0.5 evaluates to about 0.117. If we know the coin is fair, then we know the parameter value (p = 0.5), and we can view equation (1) as a function of the number of heads x given p.

If, on the other hand, we are not sure whether the coin is fair and do not know the value of p, then we can view equation (1) as a function of the probability of heads p, given that we flipped the coin 10 times and got 7 heads. Here 0.117 is the likelihood of p = 0.5 given our 10 flips and 7 heads.

So density is a function of possible values of the data given the model parameters.
Likelihood is a function of possible values of the model parameters given the data.
They are the same equation!
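
A minimal sketch of this duality in Python (an illustrative addition, assuming SciPy's binom is available):

    from scipy.stats import binom

    # Density view: hold the parameter fixed (p = 0.5) and let the data x vary.
    density = [binom.pmf(x, n=10, p=0.5) for x in range(11)]
    print(density[7])                     # ~0.117 = P(7 heads | p = 0.5)

    # Likelihood view: hold the data fixed (7 heads in 10 flips) and let p vary.
    likelihood = [binom.pmf(7, n=10, p=p) for p in (0.3, 0.5, 0.7)]
    print(likelihood)                     # L(p | data); these values need not sum to 1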

4
Q

Name 3 approaches to parameter estimation discussed in this course:

A
  • Maximum Likelihood Estimation
  • Maximum A Posteriori (MAP estimate)
  • Bayes Estimator (the posterior mean, i.e. the minimum mean square error (MMSE) estimator)
  • Point estimates (Method of Moments)
  • Kalman Filtering (Linear quadratic estimation (LQE))
5
Q

Name 3 Approaches to optimization used in this course

A
  • Gradient descent
  • Newton-Raphson Method
  • Expectation Maximisation (EM)
6
Q

Contrast generative and discriminative models

A

Generative classifiers learn a model of the joint probability, p(x,y), of the inputs x and the label y, and make their predictions by using Bayes' rule to calculate p(y|x) and then picking the most likely label y.

Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels.

Put another way:
Discriminative models learn the (hard or soft) boundary between classes.
Generative models model the distribution of individual classes.
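
A minimal sketch of the contrast (an illustrative addition; scikit-learn and the toy data are assumptions, not part of the course material):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB            # generative: models p(x|y) per class
    from sklearn.linear_model import LogisticRegression   # discriminative: models p(y|x) directly

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    gen = GaussianNB().fit(X, y)            # fits class-conditional densities, applies Bayes' rule
    disc = LogisticRegression().fit(X, y)   # fits the posterior / decision boundary directly

    print(gen.predict_proba(X[:1]), disc.predict_proba(X[:1]))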

7
Q

Reasons for using generative models rather than discriminative.

A

It is typically more straightforward to detect distribution changes and update a generative model accordingly than to do this for a decision boundary in a discriminative model, especially if the updates need to be unsupervised. Discriminative models also do not generally support outlier detection, whereas generative models generally do.

After fitting a generative model, you can also run it forward to generate synthetic data sets. You can fix some of the model parameters while doing so, to experiment with the effect this has on the synthetic data.

8
Q

Reasons for using discriminative models rather than generative.

A

If the relationships expressed by your generative model only approximate the true underlying generative process that created your data, discriminative models will typically outperform generative ones in classification error rate (and in the amount of training data required).

Generally speaking, if you want an explanatory model that makes explicit claims about how the data is generated, then you’ll use a generative model. If you want to optimize classification accuracy and don’t need to make claims about how model parameters interact, then you’ll use a discriminative model.

9
Q

Differentiate between PCA and LDA

A

PCA:

  • Unsupervised
  • Finds lower dimensional subspaces that describe the essential properties of the data.
  • Chooses these subspaces using the directions of maximal variation in the data (principal components).

LDA:

  • Supervised
  • Finds the lower-dimensional subspace that maximises class separation.
  • LDA uses the class labels to maximise the separation between classes in the projected data.
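
A minimal sketch of the two projections (an illustrative addition; scikit-learn and the toy two-class data are assumptions):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
    y = np.array([0] * 50 + [1] * 50)

    Z_pca = PCA(n_components=2).fit_transform(X)                 # unsupervised: directions of maximal variance
    Z_lda = LinearDiscriminantAnalysis().fit_transform(X, y)     # supervised: maximises class separation

    print(Z_pca.shape, Z_lda.shape)   # (100, 2) and (100, 1): LDA is limited to k - 1 = 1 axis here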
10
Q

The maximum possible number of principal component axes.

A

Number of components ≤ min(n, d)

n = number of samples
d = number of features (dimension of data)
11
Q

The maximum possible number of linear discriminant axes.

A

Number of components ≤ min(k − 1, d)

k = number of classes
d = number of features (dimension of data)
12
Q

Define the EM algorithm goal

A

The Expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models where the model depends on unobserved latent variables.

(E) step: creates a function for the expectation of the log-likelihood, evaluated using the current estimate of the parameters.
(M) step: computes the parameters that maximise the expected log-likelihood found in the E step.
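
A minimal sketch of EM for a two-component 1-D Gaussian mixture (an illustrative addition; the toy data and starting values are assumptions):

    import numpy as np
    from scipy.stats import norm

    x = np.array([0.1, 0.3, 0.2, 4.1, 3.8, 4.3])     # toy 1-D observations
    pi, mu, sigma = np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 1.0])

    for _ in range(20):
        # E step: responsibilities = posterior probability of each component given x
        r = pi * norm.pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M step: parameters that maximise the expected complete-data log-likelihood
        Nk = r.sum(axis=0)
        pi, mu = Nk / len(x), (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

    print(pi, mu, sigma)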

13
Q

In what way can the EM algorithm be seen as a (likelihood) maximization procedure?

A

The EM algorithm is a joint maximization procedure that iteratively maximizes an increasingly tight lower bound F on the true likelihood L(θ).
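
In standard notation (an illustrative restatement, with q a distribution over the latent variables z):

F(q, θ) = E_q[ log p(x, z | θ) ] − E_q[ log q(z) ] ≤ log p(x | θ) = L(θ)

The E step maximises F over q by setting q(z) = p(z | x, θ_old); the M step maximises F over θ with q fixed.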

14
Q

How many parameters are needed to specify a Gaussian mixture model for d-dimensional data using K components in the general case (full covariance matrices)

A

(1/2) Kd(d+3)
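
A quick way to see this (the count covers the means and full covariance matrices; the K − 1 mixing weights are not included here):

K mean vectors of dimension d → Kd parameters
K symmetric d × d covariance matrices → K · d(d+1)/2 parameters
Total: Kd + Kd(d+1)/2 = (1/2) Kd(d+3)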

15
Q

How many parameters are needed to specify a Gaussian mixture model for d-dimensional data using K components when the covariance matrices are diagonal

A

2 Kd
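
Counting as before: K mean vectors (Kd parameters) plus K diagonal covariance matrices (Kd variances) gives Kd + Kd = 2Kd, again excluding the K − 1 mixing weights.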

16
Q

Name 3 assumptions that give rise to the Hidden Markov Model

A
  • Observational Independence
  • First-Order Markov Assumption
  • Time-independent Transitions
17
Q

Observational Independence

A

The likelihood of the t’th feature vector depends only on the current (t’th) state, and is therefore unaffected by earlier states and feature vectors.
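
Written in standard HMM notation (states q_t, feature vectors o_t; the notation is assumed here, not quoted from the course):

P(o_t | q_1, …, q_t, o_1, …, o_(t−1)) = P(o_t | q_t)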

18
Q

First Order Markov Assumption

A

Apart from the immediately preceding state, no other previously observed states or features affect the probability of occurrence of the next state.
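
In the same standard notation:

P(q_t | q_1, …, q_(t−1)) = P(q_t | q_(t−1))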

19
Q

Time-independent Transitions

A

The transition probability between two states is assumed to be constant, irrespective of when the transition actually takes place.
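
In the same notation, with transition probabilities a_ij:

P(q_t = j | q_(t−1) = i) = a_ij for all t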

20
Q

What is the Forward algorithm used for?

A

Used for likelihood computation. It computes the observation probability by summing over all possible hidden state paths that could generate the observation sequence.
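
The core recursion, in standard notation (a_ij = transition probabilities, b_j(o_t) = probability of emitting observation o_t from state j):

α_t(j) = [ Σ_i α_(t−1)(i) · a_ij ] · b_j(o_t), and P(O | λ) = Σ_i α_T(i)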

21
Q

What is the Viterbi algorithm used for?

A

Used for decoding the observed sequence by finding the most probable state sequence associated with it. It finds the optimal sequence of hidden states. Given an observation sequence and an HMM, it returns the state path through the HMM that assigns maximum likelihood to the observations.
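
Its recursion replaces the forward algorithm's sum with a max (backpointers are kept to recover the best path):

δ_t(j) = [ max_i δ_(t−1)(i) · a_ij ] · b_j(o_t)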

22
Q

What is the Backwards algorithm used for?

A

Used to train the HMM parameters in Baum-Welch re-estimation.
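
Its recursion, in the same notation:

β_t(i) = Σ_j a_ij · b_j(o_(t+1)) · β_(t+1)(j)

Together with the forward probabilities, the products α_t(i) · β_t(i) give the state-occupancy quantities needed for Baum-Welch re-estimation.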

23
Q

Contrast discriminative and generative models (TEXTBOOK)

A

In the case of generative models, a model is trained for each class, totally ignoring the properties of the other classes.
Discriminative models use all the training data simultaneously to build the model; this can be used to good effect to maximise the differences between classes.

24
Q

Generative Approach (TEXTBOOK)

A

A model is developed for every class from the observations known to belong to that class.

Once this model is known, it is in principle possible to generate observations for each class.

25
Q

Discriminative Approach (TEXTBOOK)

A

Directly estimates the posterior from the data. This has the advantage that the data is used to best effect in order to discriminate between the classes.

Thus training data is used more effectively in distinguishing between classes.