2017 A1 Flashcards
Can the following take a value of more than one, or integrate to more than one?
Probability Mass function
Probability Density function
Likelihood function
PMF - each value is a probability in [0, 1], so it can never exceed 1; the values sum to exactly 1 (a PMF is summed rather than integrated).
PDF - its value can exceed 1, but it cannot integrate to more than one (it integrates to exactly 1).
Likelihood - its value can be greater than 1 (it is always non-negative); as a product of density functions of the data it integrates to 1 over the data, though as a function of the parameter it need not integrate to 1.
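A quick numerical sketch of my own (not from the card), using a Uniform(0, 0.5) density as an assumed example, showing that a PDF value can exceed 1 while the density still integrates to 1:

import numpy as np

# Density of Uniform(0, 0.5): f(x) = 2 on [0, 0.5] and 0 elsewhere.
dx = 1e-4
x = np.arange(0.0, 0.5, dx)
f = np.full_like(x, 2.0)

print(f.max())            # 2.0 -> a PDF value may exceed 1
print(np.sum(f) * dx)     # ~1.0 -> yet the PDF integrates to 1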
Explain the difference between an MLE and a MAP estimate of a parameter w from data D.
MLE - a method of estimating the parameters of an assumed probability distribution given some observed data, by maximizing the likelihood function so that, under the assumed statistical model, the observed data are most probable:
w_MLE = argmax_w p(D | w)
MAP - also maximizes over the parameters, but incorporates a prior distribution over the parameters and maximizes the posterior:
w_MAP = argmax_w p(w | D) = argmax_w p(D | w) p(w)
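As a concrete sketch (my own illustration, not from the card): estimating the mean of a Gaussian with known unit variance. The MLE is the sample mean, while the MAP estimate under an assumed zero-mean Gaussian prior with variance tau2 shrinks that mean towards 0:

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=20)   # observed data, known variance 1

# MLE: argmax_w p(D | w) -> the sample mean
w_mle = D.mean()

# MAP with prior w ~ N(0, tau2): argmax_w p(D | w) p(w)
tau2 = 0.5                                    # assumed prior variance
n = len(D)
w_map = (n * D.mean()) / (n + 1.0 / tau2)     # posterior mode = MAP for Gaussians

print(w_mle, w_map)                           # MAP is shrunk towards the prior mean 0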
You are given an SVD of a d × n centered data matrix D = UΣV^T representing a sample of d-dimensional observations.
Give the dimensions of U, Σ, & V^T.
U: d × d
Σ: d × n
V^T: n × n
Note that Σ is always the same shape as the data matrix D.
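A small numpy check of these shapes (a sketch of my own, using the full SVD):

import numpy as np

d, n = 3, 5
D = np.random.default_rng(0).normal(size=(d, n))
D = D - D.mean(axis=1, keepdims=True)          # centre the rows

U, s, Vt = np.linalg.svd(D, full_matrices=True)
Sigma = np.zeros((d, n))
Sigma[:len(s), :len(s)] = np.diag(s)           # embed singular values in a d x n matrix

print(U.shape, Sigma.shape, Vt.shape)          # (3, 3) (3, 5) (5, 5)
print(np.allclose(D, U @ Sigma @ Vt))          # True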
Suppose the rank of D is r. How many zero values are in Σ?
In the full d × n Σ only the first r diagonal entries are nonzero, so the number of zero values is d * n - r. If Σ is instead truncated to an r × r matrix (the compact SVD), the number of zeros is r * r - r, since only its diagonal is nonzero.
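Counting the (near-)zero singular values numerically, for an illustrative rank-2 example of my own:

import numpy as np

d, n, r = 4, 6, 2
rng = np.random.default_rng(0)
D = rng.normal(size=(d, r)) @ rng.normal(size=(r, n))   # rank-r data matrix

s = np.linalg.svd(D, compute_uv=False)
print(np.sum(np.isclose(s, 0.0, atol=1e-10)))           # min(d, n) - r = 2 zero singular values
# In the full d x n Sigma only r diagonal entries are nonzero: d*n - r = 22 zeros.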
Derive the SVD of the sample covariance matrix in terms of the components of the data matrix SVD.
Let D be the data matrix with SVD:
D = UΣV^T
Then the SVD of S in terms of the components of D's SVD:
S = 1/N * D * D^T
  = 1/N * (UΣV^T)(UΣV^T)^T
  = 1/N * UΣV^T V Σ^T U^T (why does V disappear? Because V^T V = I, as V is orthogonal; see page 70)
  = 1/N * UΣΣ^T U^T
  = 1/N * UΣ^2 U^T (ΣΣ^T = Σ^2, since Σ is diagonal and hence Σ^T = Σ)
  = U * (Σ^2 / N) * U^T
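A numerical check of the last line (a sketch of my own, assuming N = n, the number of columns of D):

import numpy as np

d, n = 3, 50
rng = np.random.default_rng(0)
D = rng.normal(size=(d, n))
D = D - D.mean(axis=1, keepdims=True)            # centred data

S = (D @ D.T) / n                                # sample covariance matrix

U, s, Vt = np.linalg.svd(D, full_matrices=False)
S_from_svd = U @ np.diag(s**2 / n) @ U.T         # U (Sigma^2 / N) U^T

print(np.allclose(S, S_from_svd))                # True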
In the derivation of the projection-axis used for two-class LDA, one reaches the condition
(w^T Sb w) Sw w = (w^T Sw w) Sb w.
Explain how one proceeds to determine that the direction of w corresponds to that of Sw^(-1)(m2 - m1), where m1 and m2 are the means of the two classes.
Note that w^T Sb w and w^T Sw w are scalars. Since we are only interested in the direction of w, and not its magnitude, we can drop these scalar factors, leaving Sw w ∝ Sb w. Also, Sb w = (m2 - m1)(m2 - m1)^T w points in the direction of (m2 - m1), because (m2 - m1)^T w is itself a scalar. Multiplying by Sw^(-1) then gives w ∝ Sw^(-1)(m2 - m1).
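A minimal numerical sketch (my own example data) of computing the LDA direction w ∝ Sw^(-1)(m2 - m1):

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(30, 2))   # class 1 samples
X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(30, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter Sw = sum over classes of (x_n - m_k)(x_n - m_k)^T
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(Sw, m2 - m1)      # w ∝ Sw^(-1) (m2 - m1)
w = w / np.linalg.norm(w)             # normalise; only the direction matters
print(w)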
A common problem when attempting to fit an ML model is dealing with multiple local optima in the objective function (e.g. the likelihood or the posterior). Logistic regression does not suffer from this problem. Why?
Because the Hessian of the logistic regression negative log-likelihood is positive (semi-)definite, the objective is convex. A convex objective has a single global minimum and no other local optima, so no concerns about multiple local optima arise.
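A sketch (my own example) checking that the Hessian of the logistic-regression negative log-likelihood, X^T R X with R = diag(σ_n (1 - σ_n)), has no negative eigenvalues:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # design matrix
w = rng.normal(size=3)                           # arbitrary parameter vector

sigma = 1.0 / (1.0 + np.exp(-X @ w))             # sigmoid outputs
R = np.diag(sigma * (1.0 - sigma))               # diagonal weight matrix

H = X.T @ R @ X                                  # Hessian of the negative log-likelihood
print(np.linalg.eigvalsh(H).min() >= 0)          # True: positive (semi-)definite, so convex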