Lecture 6 (Linear discrimination) Flashcards
How is the logistic sigmoid function defined?
𝜎(x) = 1 / [1 + exp(-x)]
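A minimal NumPy sketch of this definition (the toy inputs are illustrative; a numerically stable variant would branch on the sign of x for large magnitudes):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)); squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.119, 0.5, 0.881]
```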
How is the softmax function defined?
The softmax function is a multi-class generalization of the logistic sigmoid,
P(C_i | x) = exp(a_i) / [ \sum_{j=1}^K exp(a_j)]
for i = 1, … ,K
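A minimal NumPy sketch (the activations a are illustrative; subtracting the max is a standard stability trick, since softmax is invariant to shifting all activations by a constant):

```python
import numpy as np

def softmax(a):
    # exp(a_i) / sum_j exp(a_j); shifting by max(a) avoids overflow
    e = np.exp(a - np.max(a))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities that sum to 1
```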
How is linear regression defined?
Function: h(x_n) = y(x_n, w) = w_1 * x_n + w_0
With loss function: squared error or mean squared error
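A minimal NumPy sketch of fitting h(x_n) = w_1 * x_n + w_0 by least squares (the toy data are illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

X = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
w1, w0 = np.linalg.lstsq(X, y, rcond=None)[0]  # closed-form least squares

mse = np.mean((w1 * x + w0 - y) ** 2)          # mean squared error loss
print(w1, w0, mse)
```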
How is logistic regression defined?
Function: h(x_n) = 𝜎(w_1 * x_n + w_0) = 1 / [1 + exp(-(w_1 * x_n + w_0))]
With loss function cross entropy error
− \sum_{c=1}^M y_{o,c} log(p_{o,c})
For two classes, it can simplify to
−[y * log(p) + (1 − y) * log(1 − p)]
where
M - number of classes (dog, cat, fish)
log - the natural log
y - binary indicator (0 or 1) if class label c is the correct classification for observation o
p - predicted probability observation o is of class c, 𝜎(w^T * x + w_0)
A perfect model would have a log loss of 0.
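A minimal NumPy sketch of the two-class cross entropy, averaged over observations (labels and probabilities are illustrative):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y: binary indicators (0 or 1); p: predicted probabilities for class 1
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, p))  # small loss for confident, correct predictions
```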
What are the applications of logistic regression?
Despite its name, logistic regression is a model for classification rather than regression: it outputs posterior class probabilities, which are thresholded to assign a class label.
How is the loss function of logistic regression minimized?
Minimization of L(w, w_0) with respect to w and w_0 is carried out by an iterative minimization scheme such as gradient descent. Unlike least squares, the cross-entropy loss with a sigmoid has no closed-form solution.
What is gradient descent?
The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. The search starts at a random initial point.
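A minimal sketch on a toy quadratic f(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate and iteration count are illustrative choices:

```python
import numpy as np

def grad(w):
    return 2.0 * (w - 3.0)      # gradient of f(w) = (w - 3)^2

w = np.random.randn()           # start at a random point
lr = 0.1                        # step size (learning rate)
for _ in range(100):
    w -= lr * grad(w)           # step opposite to the gradient
print(w)                        # approaches the minimizer w = 3
```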
What is a linear discriminant function?
The decision boundaries are linear functions of the input vector x and are therefore defined by (D-1)-dimensional hyperplanes within the D-dimensional input space. Linear classification means that the part that adapts is linear.
However, the adaptive part is followed by a fixed non-linearity and may also be preceded by a fixed non-linearity.
What does the linear discriminant function look like in the special case of two classes?
g(x) = g_1(x) - g_2(x)
= (w_1^T * x + w_{10}) - (w_2^T * x + w_{20})
= (w_1 - w_2)^T * x + (w_{10} - w_{20})
= w^T * x + w_0
Then choose C_1 if g(x) > 0, otherwise choose C_2.
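A minimal sketch of this decision rule (w, w_0, and x are illustrative values, standing in for w_1 - w_2 and w_{10} - w_{20}):

```python
import numpy as np

w = np.array([1.0, -2.0])    # w = w_1 - w_2
w0 = 0.5                     # w_0 = w_10 - w_20
x = np.array([2.0, 0.5])

g = w @ x + w0               # g(x) = w^T x + w_0
print("C_1" if g > 0 else "C_2")
```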
What are the three approaches to learning the parameters of linear discriminant functions?
Least squares
* Each class is described by its own linear model
* Pleasant analytical properties
* Lack of robustness to outliers.
Fisher's linear discriminant
* Views the linear classification model as dimensionality reduction
The perceptron algorithm (gradient descent)
* Introduced by Rosenblatt
Least squares for classification
- It reduces classification to least squares regression, whose optimal weights can be solved with some matrix algebra.
- It gives an exact closed-form solution for the discriminant function parameters.
- When there are more than two classes, we treat each class as a separate problem.
However, this is not the right thing to do, and it does not work as well as the alternatives, chiefly because it lacks robustness to outliers.
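A minimal sketch of least-squares classification with one linear model per class, fit to one-hot targets; the data are illustrative:

```python
import numpy as np

X = np.array([[0.0, 0.2], [0.5, 0.1], [2.0, 2.2], [2.5, 1.9]])
labels = np.array([0, 0, 1, 1])
T = np.eye(2)[labels]                        # one-hot target matrix

Xb = np.column_stack([X, np.ones(len(X))])   # append a bias column
W = np.linalg.lstsq(Xb, T, rcond=None)[0]    # exact closed-form weights

x_new = np.array([2.2, 2.0, 1.0])            # new point, bias term appended
print(np.argmax(x_new @ W))                  # pick the class with the largest output
```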
Why is least squares not robust to outliers?
The least-squares method (or equivalently, maximum likelihood under Gaussian assumptions) may not be appropriate if the actual conditional distribution deviates from being Gaussian. In particular, squared error strongly penalizes predictions that are "too correct" (far on the right side of the decision boundary), so a few distant outliers can drag the boundary away from an otherwise good fit.
Fisher’s linear discriminant for classification
One way to view a classification model is in terms of dimensionality reduction: project the data down to one dimension. The Fisher linear discriminant chooses the projection that gives the best separation of the classes, maximizing the between-class variance relative to the within-class variance. It is solved by linear discriminant analysis, i.e. as the eigenvector corresponding to the largest eigenvalue.
However, Fisher's linear discriminant is more commonly used for dimensionality reduction before classification than directly as a classifier.
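For two classes the eigenproblem reduces to a closed form, w ∝ S_W^{-1} (m_2 - m_1), where S_W is the within-class scatter matrix. A minimal sketch with illustrative toy data:

```python
import numpy as np

X1 = np.array([[0.0, 0.1], [0.3, 0.2], [0.2, 0.0]])   # class 1 samples
X2 = np.array([[2.0, 1.8], [2.2, 2.1], [1.9, 2.0]])   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter

w = np.linalg.solve(S_W, m2 - m1)   # Fisher direction, up to scale
print(w / np.linalg.norm(w))        # project with y = w^T x
```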
What is the perceptron algorithm?
A supervised linear binary classifier model. It has a nonlinear sign activation function,
f(a) = +1, if a ≥ 0
f(a) = -1, if a < 0
How are the parameters for the perceptron determined?
There is a probability-based approach and a discriminant-based approach.
- Probability-based: the parameters are the sufficient statistics of p(x | C_i) and p(C_i), estimated by maximum likelihood.
- Discriminant-based: the parameters are optimized to minimize the classification error on the training set. There is often no analytical solution, so we resort to iterative optimization methods, most commonly gradient descent.
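A minimal sketch of Rosenblatt-style perceptron updates under the discriminant-based view (toy linearly separable data; the pass count is an arbitrary choice):

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])           # targets in {-1, +1}

w, w0 = np.zeros(2), 0.0
for _ in range(20):                    # a few passes over the training set
    for x_n, t_n in zip(X, t):
        if t_n * (w @ x_n + w0) <= 0:  # misclassified (or on the boundary)
            w += t_n * x_n             # move the boundary toward the example
            w0 += t_n
print(w, w0)
```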