Course Session 6 (Linear Discrimination) Flashcards

1
Q

How is the logistic sigmoid function defined?

A

𝜎(x) = 1 / [1 + exp(-x)]
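A minimal NumPy sketch of this definition (the function name and test value are my own):

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))  # 0.5, the sigmoid's midpoint
```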

2
Q

How is the softmax function defined?

A

The softmax function is a multi-class generalization of the logistic sigmoid,
P(C_i | x) = exp(a_i) / [\sum_{j=1}^K exp(a_j)]
for i = 1, …, K
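A minimal sketch of this definition, with the usual max-subtraction trick for numerical stability (an addition of mine, not part of the card):

```python
import numpy as np

def softmax(a):
    # subtracting max(a) leaves the result unchanged, since softmax
    # is invariant to adding a constant to every activation a_j
    e = np.exp(a - np.max(a))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # K class probabilities that sum to 1
```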

3
Q

How is linear regression defined?

A

Function: h(x_n) = y(x_n, w) = w_1 * x_n + w_0
With loss function: squared error or mean squared error
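A minimal sketch of fitting h(x_n) = w_1 * x_n + w_0 by minimizing the squared error in closed form (the data values are placeholders):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# design matrix with a column of ones for the bias w_0
X = np.column_stack([x, np.ones_like(x)])

# least-squares solution minimizes sum((X @ w - y)**2)
(w1, w0), *_ = np.linalg.lstsq(X, y, rcond=None)
print(w1, w0)
```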

4
Q

How is logistic regression defined?

A

Function: h(x_n) = 𝜎(w_1 * x_n + w_0) = 1 / [1 + exp(-(w_1 * x_n + w_0))]
With loss function: cross-entropy error

− \sum_{c=1}^M y_{o,c} log(p_{o,c})

For two classes, it can simplify to

−[y * log(p) + (1 − y) * log(1 − p)]

where
M - number of classes (e.g. dog, cat, fish)
log - the natural log
y - binary indicator (0 or 1) if class label c is the correct classification for observation o
p - predicted probability observation o is of class c, 𝜎(w^T * x + w_0)

A perfect model would have a log loss of 0.
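A minimal sketch of the two-class cross-entropy loss (the clipping constant is my own guard against log(0)):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y: 0/1 labels, p: predicted probabilities sigma(w^T x + w_0)
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.1, 0.8])
print(binary_cross_entropy(y, p))  # near 0 for a good model
```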

5
Q

What are the applications of logistic regression?

A

Despite its name, logistic regression is a model for classification rather than regression: it models the class-membership probabilities and assigns inputs to classes based on them.

6
Q

How is the loss function of logistic regression minimized?

A

Minimization of L(w, w_0) with respect to w and w_0 is carried out by an iterative minimization scheme such as gradient descent.
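A minimal sketch of such a scheme for logistic regression, using the standard cross-entropy gradient \sum_n (𝜎(w^T x_n + w_0) − y_n) x_n (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + w0)))  # predicted probabilities
        # gradient of the mean cross-entropy loss w.r.t. w and w_0
        grad_w = X.T @ (p - y) / len(y)
        grad_w0 = np.mean(p - y)
        w -= lr * grad_w       # step opposite the gradient
        w0 -= lr * grad_w0
    return w, w0
```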

7
Q

What is gradient descent?

A

The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. The procedure starts from a random initial point.
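As a toy illustration of the idea in one dimension (the function and step size are my own choices):

```python
# minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3)
x = 10.0   # arbitrary starting point
lr = 0.1   # step size (learning rate)
for _ in range(100):
    x -= lr * 2 * (x - 3)  # step in the direction of steepest descent
print(x)   # approaches the minimizer x = 3
```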

8
Q

What is a linear discriminant function?

A

The decision boundaries are linear functions of the input vector x and are therefore defined by (D-1)-dimensional hyperplanes within the D-dimensional input space. Linear classification means that the part that adapts is linear.
However, the adaptive part may be followed by a fixed non-linearity (such as a threshold or sigmoid) and may also be preceded by fixed non-linear basis functions.

9
Q

How is the linear discriminant function defined for the special case of two classes?

A

g(x) = g_1(x) - g_2(x)
= (w_1^T * x + w_{10}) - (w_2^T * x + w_{20})
= (w_1 - w_2)^T * x + (w_{10} - w_{20})
= w^T * x + w_0

Then choose C_1 if g(x) > 0, otherwise choose C_2.
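A minimal sketch of this rule (the weights are placeholders):

```python
import numpy as np

w = np.array([1.0, -2.0])  # w = w_1 - w_2
w0 = 0.5                   # w_0 = w_10 - w_20

def classify(x):
    g = w @ x + w0                    # g(x) = w^T x + w_0
    return "C_1" if g > 0 else "C_2"

print(classify(np.array([3.0, 1.0])))
```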

10
Q

What are the three approaches to learning the parameters of linear discriminant functions?

A

Least squares
* Each class is described by its own linear model
* Pleasant analytical properties
* Lack of robustness to outliers.

Fisher's linear discriminant
* view a linear classification model as dimensionality reduction

The perceptron algorithm (gradient descent)
* Rosenblatt's algorithm; the parameters are found iteratively by gradient descent

11
Q

Least squares for classification

A
  • It reduces classification to least squares regression, whose optimal weights can be solved with some matrix algebra.
  • It gives an exact closed-form solution for the discriminant function parameters.
  • When there are more than two classes, we treat each class as a separate problem.

However, reducing classification to regression in this way is not ideal, and it does not work as well as dedicated methods, because it lacks robustness to outliers.
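Still, a minimal sketch under these assumptions, using 1-of-K (one-hot) targets so that each class gets its own linear model, solved in closed form:

```python
import numpy as np

def fit_least_squares_classifier(X, labels, K):
    Xb = np.column_stack([X, np.ones(len(X))])  # append bias column
    T = np.eye(K)[labels]                       # one-hot target matrix
    # exact closed-form solution of the least-squares problem
    W, *_ = np.linalg.lstsq(Xb, T, rcond=None)
    return W

def predict(W, X):
    Xb = np.column_stack([X, np.ones(len(X))])
    return np.argmax(Xb @ W, axis=1)  # class with the largest score
```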

12
Q

Why is least squares not robust to outliers?

A

The least-squares method (or equivalently, maximum likelihood under Gaussian assumptions) may not be appropriate if the actual conditional distribution deviates from being Gaussian. In particular, squared error penalizes points that are "too correct" (far on the right side of the decision boundary) just as heavily as errors, so distant outliers can drag the boundary away from the bulk of the data.

13
Q

Fisher’s linear discriminant for classification

A

One way to view a linear classification model is in terms of dimensionality reduction: project the data down to one dimension and classify in the projected space. The Fisher linear discriminant chooses the projection that maximizes the separation between the class means relative to the within-class variance. It is solved by linear discriminant analysis, i.e. by taking the eigenvector corresponding to the largest eigenvalue.

However, Fisher's linear discriminant is more commonly used for dimensionality reduction before classification than directly for classification.
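A minimal two-class sketch of computing the projection, using the standard closed form w ∝ S_W^{-1}(m_2 − m_1), where S_W is the within-class scatter matrix (variable names are my own):

```python
import numpy as np

def fisher_direction(X1, X2):
    # X1, X2: samples from class 1 and class 2, one row per sample
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)  # class means
    # within-class scatter matrix S_W (assumed invertible here)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Fisher direction; project with y = w @ x, then threshold y
    return np.linalg.solve(S_W, m2 - m1)
```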

14
Q

What is the perceptron algorithm?

A

A supervised linear binary classifier model. It has a nonlinear sign activation function:
f(a) = +1, a ≥ 0,
f(a) = -1, a < 0

15
Q

How are the parameters for the perceptron determined?

A

There is a probability-based approach and a discriminant-based approach.

  • Probability-based: the parameters are the sufficient statistics of p(x | C_i) and p(C_i), and they are estimated by maximum likelihood.
  • Discriminant-based: the parameters are optimized to minimize the classification error on the training set. There is often no analytical solution, so we resort to iterative optimization methods, most commonly gradient descent.
16
Q

What are the different error functions for the perceptron and their impact on minimization?

A

The total number of misclassified patterns
* a piecewise constant function of w, whose gradient is zero almost everywhere, so it does not lead to a simple learning algorithm that can change w using the gradient of the error function

Perceptron criterion
* When using the t ∈ {-1, +1} target coding scheme, all patterns should satisfy
w^T ϕ(x_n) * t_n > 0
It is defined as
E_P(w) = - \sum_{n∈M} w^T * ϕ(x_n) * t_n
where M is the set of all misclassified patterns.

17
Q

What is the perceptron convergence procedure?

A

Introduce a bias: append an extra constant component to each feature vector, with weight set equal to minus the threshold of the activation function. Then, cycling through the patterns:
* If the pattern is correctly classified, the weight vector remains unchanged.
* If the output is -1 but should be +1, add the vector ϕ(x_n) to the weight vector.
* If the output is +1 but should be -1, subtract the vector ϕ(x_n) from the weight vector.

If the classes are linearly separable, then the Perceptron learning procedure will converge to a solution in a finite number of steps.
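A minimal sketch of this procedure, assuming the bias component is already appended to each feature vector (the names and epoch cap are my own):

```python
import numpy as np

def train_perceptron(Phi, t, max_epochs=100):
    # Phi: N x D feature vectors phi(x_n), t: targets in {-1, +1}
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:  # violates w^T phi(x_n) t_n > 0
                w += t_n * phi_n        # add or subtract phi(x_n)
                errors += 1
        if errors == 0:  # all patterns correct: converged
            break
    return w
```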

18
Q

What can perceptrons not learn?

A

It cannot learn XOR, i.e.
x1  x2  r
0   0   0
0   1   1
1   0   1
1   1   0

19
Q

How can a given problem be made linearly separable?

A

By mapping the inputs through nonlinear basis functions and then training on the transformed data points.
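For example, XOR from the previous card becomes linearly separable once the product x1 * x2 is added as an extra basis feature (the separating weights below were found by inspection):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
r = np.array([-1, 1, 1, -1])  # XOR with {-1, +1} target coding

# nonlinear basis: append the product feature x1 * x2 and a bias
Phi = np.column_stack([X, X[:, 0] * X[:, 1], np.ones(len(X))])

w = np.array([1.0, 1.0, -2.0, -0.5])  # one separating weight vector
print(np.sign(Phi @ w))               # matches r
```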

20
Q

Compare logistic regression with perceptron

A

Logistic regression:
Can handle non-linearly separable data using probabilities and regularization.
Provides probabilities of class membership.
Uses a cross-entropy loss function.
Robust to noisy and overlapping data.

Perceptron:
Converges only if the data is linearly separable (otherwise, nonlinear basis functions are needed).
Produces only hard class labels (−1 or +1).
Minimizes the perceptron criterion via a threshold-based update rule, with no probabilistic interpretation.
Sensitive to noisy or overlapping data.

21
Q

Compare PCA and LDA

A

Purpose: LDA focuses on class separation, while PCA focuses on capturing variance.
Supervision: LDA uses class labels; PCA does not.
Components: LDA is limited by the number of classes (C−1), while PCA is limited by the data dimensions.
Application: LDA is better suited for classification tasks, whereas PCA is more general-purpose for dimensionality reduction and data preprocessing.