Probabilistic Generative Models Flashcards
What are generalized linear models?
These are models of the form y(x) = f(g(x, w)), where g is a linear function of the parameters w and f is a non-linear activation function.
N.B.: The decision surfaces are still linear functions of x.
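A minimal sketch in NumPy, assuming a sigmoid activation for f (any non-linear f would do):

```python
import numpy as np

def glm_predict(x, w, w0):
    a = w @ x + w0                       # g(x, w): linear in w (and in x)
    return 1.0 / (1.0 + np.exp(-a))      # f: non-linear activation (sigmoid here)

w, w0 = np.array([1.0, -2.0]), 0.1
print(glm_predict(np.array([0.5, 0.5]), w, w0))  # σ(-0.4) ≈ 0.40
```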
In a 2-class classification problem, what is the prediction of a probabilistic generative model?
P(C1|x) = 1 / (1+exp(-a)) = σ(a)
and P(C2|x) = 1 - σ(a),
where a = ln[ (P(x|C1)*P(C1)) / (P(x|C2)*P(C2)) ] (the log posterior odds)
(σ is the sigmoid function)
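A small sketch of this prediction, assuming the class-conditional densities and priors at x are already known (function name is illustrative):

```python
import numpy as np

def posterior_c1(p_x_given_c1, p_c1, p_x_given_c2, p_c2):
    # a = ln[ (P(x|C1)*P(C1)) / (P(x|C2)*P(C2)) ]
    a = np.log((p_x_given_c1 * p_c1) / (p_x_given_c2 * p_c2))
    return 1.0 / (1.0 + np.exp(-a))      # σ(a)

# Class-conditional densities 0.3 and 0.1 at x, equal priors:
print(posterior_c1(0.3, 0.5, 0.1, 0.5))  # 0.3 / (0.3 + 0.1) = 0.75
```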
In a k-class classification problem, what is the prediction of a probabilistic generative model?
P(Ck|x) = exp(ak) / Σj exp(aj) = softmax_k(a), with a = (a1, …, ak),
where aj = ln[ P(x|Cj)*P(Cj) ]
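A NumPy sketch of the k-class case (the max-shift is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def posterior_softmax(p_x_given_c, priors):
    # aj = ln[ P(x|Cj)*P(Cj) ]
    a = np.log(np.asarray(p_x_given_c) * np.asarray(priors))
    e = np.exp(a - a.max())              # shift for numerical stability
    return e / e.sum()                   # softmax over the aj

print(posterior_softmax([0.3, 0.1, 0.1], [1/3, 1/3, 1/3]))  # [0.6, 0.2, 0.2]
```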
What are the terms of P(Ck|x) = P(x|Ck)*P(Ck) / P(x) called?
P(Ck) is the prior probability, P(Ck|x) is the posterior probability, P(x) is the evidence (the marginal density of x) and P(x|Ck) is the class-conditional density.
What is the inductive bias of Gaussian Discriminant Analysis (GDA), a.k.a. “Linear Discriminant Analysis (LDA)”?
- All class conditional densities are Gaussian.
- All classes share the same covariance matrix.
What are the parameters of GDA?
The prior probabilities P(Ck) and Gaussian means μk of each class, and their common covariance matrix Σ.
In a 2-class classification problem, how is the prediction calculated by GDA?
P(C1|x) = σ(w.T * x + w0),
where w = Σ^-1 * (μ1 - μ2)
and w0 = -1/2 * μ1.T * Σ^-1 * μ1 + 1/2 * μ2.T * Σ^-1 * μ2 + ln(P(C1)/P(C2))
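A sketch of these formulas in NumPy, assuming μ1, μ2, Σ and the priors are given:

```python
import numpy as np

def gda_weights(mu1, mu2, Sigma, p_c1, p_c2):
    # w = Σ^-1 * (μ1 - μ2)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    # w0 = -1/2 μ1ᵀ Σ^-1 μ1 + 1/2 μ2ᵀ Σ^-1 μ2 + ln(P(C1)/P(C2))
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(p_c1 / p_c2))
    return w, w0

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
w, w0 = gda_weights(mu1, mu2, np.eye(2), 0.5, 0.5)
x = np.array([0.5, 0.2])
print(1.0 / (1.0 + np.exp(-(w @ x + w0))))  # P(C1|x) ≈ 0.73
```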
In a 2-class classification problem, what is the likelihood of the datapoint (x, t)?
[P(C1) * N(x|μ1, Σ)]^t * [P(C2) * N(x|μ2, Σ)]^(1-t)
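A direct translation of this likelihood, using SciPy's multivariate normal density (function name is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def point_likelihood(x, t, mu1, mu2, Sigma, p_c1):
    # t = 1 selects the C1 factor, t = 0 the C2 factor.
    l1 = p_c1 * multivariate_normal.pdf(x, mu1, Sigma)
    l2 = (1 - p_c1) * multivariate_normal.pdf(x, mu2, Sigma)
    return l1 ** t * l2 ** (1 - t)
```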
What is the i.i.d. assumption?
All points of a dataset are independent and identically distributed.
How are the parameters of GDA learned?
By maximizing the likelihood function on the training set (i.e. the product of the likelihood of each training point).
What are the optimal values of P(Ck), μk and Σ using GDA and max. log. likelihood?
The optimal value of P(Ck) is the frequency of Ck in the training set.
The optimal value of μk is the mean of the training inputs x belonging to class Ck.
The optimal value of Σ is the average of the per-class covariance matrices, weighted by the class frequencies.
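A sketch of these maximum-likelihood estimates for the 2-class case, assuming labels t in {0, 1} with t = 1 for C1:

```python
import numpy as np

def fit_gda(X, t):
    # X: (n, d) inputs; t: (n,) labels in {0, 1}, with t = 1 for C1.
    p_c1 = t.mean()                      # P(C1): frequency of C1
    mu1 = X[t == 1].mean(axis=0)         # mean of the C1 examples
    mu2 = X[t == 0].mean(axis=0)         # mean of the C2 examples
    # Shared Σ: per-class (biased/MLE) covariances, weighted by class frequency.
    S1 = np.cov(X[t == 1].T, bias=True)
    S2 = np.cov(X[t == 0].T, bias=True)
    Sigma = p_c1 * S1 + (1 - p_c1) * S2
    return p_c1, mu1, mu2, Sigma
```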
What is Quadratic Discriminant Analysis (QDA)?
It’s like GDA, but without the assumption that all classes share the same covariance matrix; the decision boundaries are then quadratic in x.
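A sketch of the resulting 2-class QDA posterior, with one covariance matrix per class (a is now quadratic in x):

```python
import numpy as np
from scipy.stats import multivariate_normal

def qda_posterior_c1(x, mu1, S1, mu2, S2, p_c1, p_c2):
    # Each class keeps its own covariance matrix S1, S2.
    a = (np.log(p_c1) + multivariate_normal.logpdf(x, mu1, S1)
         - np.log(p_c2) - multivariate_normal.logpdf(x, mu2, S2))
    return 1.0 / (1.0 + np.exp(-a))      # σ(a)
```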
What is the Naive Bayes (NB) assumption?
Features are conditionally independent given the class label, i.e. P(X|Ck) = P(X1, X2, … |Ck) ~= P(X1|Ck) * P(X2|Ck) * …
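A sketch of the factorized log class-conditional, assuming (hypothetically) one 1-D Gaussian per feature:

```python
import numpy as np
from scipy.stats import norm

def nb_log_class_conditional(x, means, stds):
    # log P(x|Ck) ~= Σ_i log P(x_i|Ck), one 1-D density per feature.
    return np.sum(norm.logpdf(x, loc=means, scale=stds))
```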
What are Probabilistic Graphical Models (PGMs)?
They are a trade-off between GDA (where all features are considered dependent) and NB (where all features are considered independent): the dependencies between features are specified by a dependency graph.
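For example, with a chain dependency graph X1 → X2 → X3, the class-conditional density factorizes as P(X1, X2, X3|Ck) = P(X1|Ck) * P(X2|X1, Ck) * P(X3|X2, Ck): X3 depends on X2 but is conditionally independent of X1 given X2.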
What is Diagonal GDA (DGDA)?
It’s GDA, making the NB assumption (the covariance matrix is then diagonal).
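A sketch of the DGDA class-conditional log-density, showing how a diagonal Σ makes the Gaussian factorize over features (var holds the diagonal of Σ):

```python
import numpy as np

def dgda_log_density(x, mu, var):
    # With diagonal Σ (diagonal stored in var), the Gaussian factorizes
    # over features, which is exactly the NB assumption applied to GDA.
    return np.sum(-0.5 * np.log(2 * np.pi * var)
                  - 0.5 * (x - mu) ** 2 / var)
```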