2016 A2 Flashcards
Are X and Y independent?
No. P(X|Y), viewed as a function of X, still depends on the value of Y, so P(X|Y) ≠ P(X).
What assumption allows us to decompose the joint probability P(X_1^T = x_1^T) of a set of observations into a product of marginal probabilities ∏_{i=1}^T P(X_i = x_i)?
The data points are assumed to be independent and identically distributed.
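As a worked equation (a standard result, in the notation of the question), the i.i.d. assumption gives the factorization, and in the log-domain a sum:
\[
P(X_1^T = x_1^T) = \prod_{i=1}^{T} P(X_i = x_i),
\qquad
\log P(x_1^T) = \sum_{i=1}^{T} \log P(x_i).
\]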
Explain the difference between a likelihood function and a density function, using an example.
A density function is a function of the data: it tells you how probable different data points are, given fixed parameter values. (It integrates to 1 over the data.)
A likelihood function takes the data as given and measures how plausible different parameter values are for your distribution. (If it can be integrated over the parameters at all, it generally integrates to a value other than 1.)
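A minimal sketch of the distinction, assuming a 1-D Gaussian and scipy (the numbers are illustrative):

import numpy as np
from scipy.stats import norm

# Density: fix the parameters (mu=0, sigma=1), vary the data x.
# As a function of x this integrates to 1.
xs = np.linspace(-4, 4, 201)
density = norm.pdf(xs, loc=0.0, scale=1.0)

# Likelihood: fix the observed data (x=1.5), vary the parameter mu.
# As a function of mu this need not integrate to 1.
mus = np.linspace(-4, 4, 201)
likelihood = norm.pdf(1.5, loc=mus, scale=1.0)

print(mus[np.argmax(likelihood)])  # ~1.5: the likelihood peaks at mu = x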
Explain the difference between MAP and MLE of a parameter w from data D. Refer explicitly to relevant (conditional) probability densities and other formulae in your answer.
With data D and parameter w:
MLE (maximum likelihood estimation) estimates the parameters of a statistical model from observations by finding the parameter value that maximizes the (log-)likelihood of the observations given the parameters:
w_MLE = argmax_w p(D | w) = argmax_w log p(D | w)
MAP (maximum a posteriori) estimates an unknown quantity as the mode of the posterior distribution, which by Bayes' rule also weighs in a prior p(w):
w_MAP = argmax_w p(w | D) = argmax_w p(D | w) p(w)
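A minimal numeric sketch of the difference, assuming a Bernoulli likelihood with a Beta(a, b) prior (closed forms are standard; the counts here are made up):

# Observed coin flips: h heads out of n.
h, n = 7, 10

# MLE: maximize p(D | w)  ->  w_MLE = h / n
w_mle = h / n

# MAP with a Beta(a, b) prior on w: mode of the posterior Beta(h+a, n-h+b)
a, b = 2.0, 2.0
w_map = (h + a - 1) / (n + a + b - 2)

print(w_mle, w_map)  # 0.7 vs ~0.667: the prior pulls the estimate toward 0.5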
In both the MAP and MLE approaches, name three methods used in this course to perform the optimization.
Newton-Raphson
Gradient descent
Lagrange multipliers
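A minimal sketch of the first two, using a toy 1-D objective f(w) = (w - 3)^2 as a stand-in for a negative log-likelihood (the function and step size are made up):

# Derivatives of the toy objective f(w) = (w - 3)^2
def f_grad(w):
    return 2.0 * (w - 3.0)

def f_hess(w):
    return 2.0

# Gradient descent: repeatedly step against the gradient.
w = 0.0
for _ in range(100):
    w -= 0.1 * f_grad(w)
print(w)  # approaches the minimizer w = 3

# Newton-Raphson: scale the step by the inverse second derivative.
w = 0.0
w -= f_grad(w) / f_hess(w)
print(w)  # exactly 3 after one step, since f is quadratic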
Contrast Generative vs Discriminative models and explain when each might be appropriate.
A generative model describes how the data was generated: it models the class-conditional densities p(x|C) and priors p(C), often ignoring dependencies between features. It trains the model one class at a time, which is somewhat wasteful with the data. It is appropriate when you need a model of the data itself (e.g. to generate samples or handle missing features).
A discriminative model does not care how the data was generated: it models p(C|x) directly and can capture dependencies between features. It trains on all the data and all classes at the same time, which uses the data efficiently but makes the model more difficult to train. It is appropriate when classification performance is the main goal.
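A minimal sketch of the contrast with scikit-learn, assuming GaussianNB as the generative model and LogisticRegression as the discriminative one (the synthetic data is made up):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generative: fits p(x|C) and p(C) per class, then applies Bayes' rule.
gen = GaussianNB().fit(X_tr, y_tr)

# Discriminative: fits p(C|x) directly from all classes at once.
disc = LogisticRegression().fit(X_tr, y_tr)

print(gen.score(X_te, y_te), disc.score(X_te, y_te))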
Specify as exactly as possible the number of principal component axes and linear discriminant axes one obtains, in terms of the properties of the data matrix and the number of classes (as applicable).
d = dimensions, N = number of data points, k = classes
Principal component axes: min(d, N), i.e. bounded both by the dimensionality and by the number of data points (min(d, N-1) if the data are mean-centred first).
Linear discriminant axes: k - 1
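A minimal check with scikit-learn (the sizes d = 10, N = 200, k = 4 are made up):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# d = 10 features, N = 200 points, k = 4 classes
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

pca = PCA().fit(X)
print(pca.components_.shape[0])   # min(d, N) = 10 principal axes

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.transform(X).shape[1])  # k - 1 = 3 discriminant axes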
Data is generated from two classes (C1, C2), and you are interested in modeling the probability P(C1|x) of a new observation x being generated by the first class.
If the observations for the classes are generated by multi-dimensional Gaussian distributions, what condition(s) on the parameters of these distributions lead to the log-posterior-odds (appearing as the argument of the logistic (sigmoid) function) being a linear function of the observation x?
The Gaussian densities must share a common covariance matrix. The quadratic terms in x then cancel in the log-posterior-odds, leaving a linear function of x.
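As a short worked derivation (a standard result, writing μ₁, μ₂ for the class means and Σ for the shared covariance):
\[
a(x) = \ln \frac{p(x \mid C_1)\, P(C_1)}{p(x \mid C_2)\, P(C_2)}
     = w^\top x + w_0,
\qquad
P(C_1 \mid x) = \sigma(a(x)),
\]
\[
w = \Sigma^{-1}(\mu_1 - \mu_2),
\qquad
w_0 = -\tfrac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1
      + \tfrac{1}{2}\mu_2^\top \Sigma^{-1} \mu_2
      + \ln \frac{P(C_1)}{P(C_2)}.
\]
The quadratic terms x^T Σ^{-1} x cancel only because Σ is shared; with class-specific covariances the log-odds would be quadratic in x.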
Explain why regularization is important for logistic regression. Indicate what regularization function corresponds to a zero-mean Gaussian prior on the weight parameter.
Some kind of 'regularization' of discriminative schemes is essential in practice to avoid over-specialization (over-fitting) to the training data.
The regularization term (1/(2λ)) w^T w corresponds to a zero-mean Gaussian prior on the weight parameter.
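A minimal sketch of the penalized objective and its gradient, assuming the (1/(2λ)) w^T w penalty above (the data and settings are made up):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def penalized_nll(w, X, y, lam):
    """Negative log-likelihood plus (1/(2*lam)) * w^T w, i.e. the negative
    log-posterior under a zero-mean Gaussian prior on w."""
    p = sigmoid(X @ w)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + (1.0 / (2.0 * lam)) * (w @ w)

def gradient(w, X, y, lam):
    p = sigmoid(X @ w)
    return X.T @ (p - y) + w / lam

# Tiny made-up problem, optimized by plain gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = np.zeros(3)
for _ in range(500):
    w -= 0.01 * gradient(w, X, y, lam=10.0)
print(w, penalized_nll(w, X, y, lam=10.0))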
How many parameters are needed to specify a Gaussian mixture model for d-dimensional data using K components:
- In the general case (full covariance matrices)?
- when the covariance matrices are diagonal?
General case: ½kd(d+3), i.e. kd mean parameters plus ½kd(d+1) covariance parameters, plus the k - 1 free mixing weights.
Diagonal covariance: 2kd (kd means and kd variances), plus the k - 1 mixing weights.
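A small helper to sanity-check these counts (the function name is my own):

def gmm_param_count(d, k, diagonal=False):
    """Parameters of a K-component GMM on d-dimensional data,
    including the k - 1 free mixing weights."""
    means = k * d
    covs = k * d if diagonal else k * d * (d + 1) // 2
    weights = k - 1
    return means + covs + weights

print(gmm_param_count(3, 2))                 # full: 6 + 12 + 1 = 19
print(gmm_param_count(3, 2, diagonal=True))  # diagonal: 6 + 6 + 1 = 13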
Your employer provides you with a data set from a client for training a GMM. The data is split into three parts.
What are these three parts?
Training set
Validation set
Test set
Your employer provides you with a data set from a client for training a GMM. The data is split into three parts: a training set, a validation set, and a test set. Explain how you would use these different sets in developing a final fitted GMM, along with an estimated quality of the model.
Training set: used to fit the model parameters.
Validation set: used to provide an unbiased estimate of model fit while tuning hyperparameters (e.g. the number of components K) and the amount of regularization.
Test set: used to provide an unbiased estimate of the final model's quality. The model should not be tuned any further once it has been evaluated on the test set.
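A minimal sketch of that workflow with scikit-learn's GaussianMixture (the candidate K values and the synthetic stand-in data are made up):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = rng.normal(size=(600, 2))  # stand-in for the client's data
train, val, test = data[:400], data[400:500], data[500:]

# Fit on the training set; choose K by validation log-likelihood.
best_k, best_score, best_model = None, -np.inf, None
for k in (1, 2, 4, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(train)
    score = gmm.score(val)  # mean per-sample log-likelihood
    if score > best_score:
        best_k, best_score, best_model = k, score, gmm

# Report quality once, on the untouched test set.
print(best_k, best_model.score(test))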
Formulate the key assumptions that give rise to the hidden Markov model.
The first-order Markov assumption - the current state is conditionally independent of all earlier states and observations, given the previous state:
P(s_t | s_1..s_{t-1}, x_1..x_{t-1}) = P(s_t | s_{t-1})
The observation independence assumption - the current observation is conditionally independent of all previous observations and states, given the current state:
P(x_t | s_1..s_t, x_1..x_{t-1}) = P(x_t | s_t)
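Together, these two assumptions give the familiar factorization of the joint distribution over an observation sequence and a state sequence:
\[
P(x_1, \ldots, x_T,\, s_1, \ldots, s_T)
  = P(s_1)\, \prod_{t=2}^{T} P(s_t \mid s_{t-1}) \, \prod_{t=1}^{T} P(x_t \mid s_t).
\]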
Explain why it is often preferable for code working with likelihoods to represent quantities in the log-domain.
When multiplying long sequences of probabilities, each between 0 and 1, the product becomes smaller and smaller, eventually causing numerical underflow.
Working in the log-domain turns these products into sums of log-probabilities; since logs are not confined to (0, 1), the running value does not shrink toward zero. Where sums of probabilities are still needed (e.g. when marginalizing), the log-sum-exp trick performs the summation without leaving the log-domain.
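A minimal sketch of the log-sum-exp trick (the maximum is subtracted before exponentiating, so the largest term becomes exp(0) = 1 and cannot underflow):

import numpy as np

def logsumexp(log_ps):
    """Compute log(sum(exp(log_ps))) without underflow."""
    m = np.max(log_ps)
    return m + np.log(np.sum(np.exp(log_ps - m)))

# 1000 probabilities of 1e-5 each: the direct product underflows to 0,
# but the log-domain sum of logs is exact.
log_ps = np.full(1000, np.log(1e-5))
print(np.prod(np.exp(log_ps)))  # 0.0 (underflow)
print(np.sum(log_ps))           # ~ -11512.9, the true log of the product

# Summing probabilities in the log-domain, e.g. for marginalization:
print(logsumexp(np.array([-1000.0, -1000.0])))  # log(2) - 1000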