Lecture 4 - Classification Flashcards

1
Q

How do linear models handle classification tasks?

A

Linear models for classification take an input vector x and assign it to one of K discrete classes by dividing the input space into decision regions separated by hyperplanes.

2
Q

How do linear models separate classes in a D-dimensional input space?

A

The classes are separated by (D−1)-dimensional hyperplanes.

3
Q

How are linear models represented in regression?

A

y(x) = w^T x + w_0

4
Q

What is the role of the activation function in classification with linear models?

A
  • The activation function f(⋅) maps the continuous output of the linear function w^T x + w_0 to a class label: y(x) = f(w^T x + w_0).
  • The decision surfaces remain linear in x, because y(x) = constant implies w^T x + w_0 = constant, but the model is no longer linear in its parameters because f(⋅) is nonlinear.
5
Q

How can a step function be used for classification?

A

A step function can assign

  1. y(x)>0 to “class 1”
  2. y(x)≤0 to “class 2”
6
Q

What is a discriminant function in classification tasks?

A

A discriminant function is a function that takes an input vector x and assigns it directly to one of the K classes; the points where the assignment changes form the decision boundary.

7
Q

What is the simplest form of a discriminant function for a 2-class classification problem?

A

y(x) = w^T x + w_0

8
Q

How is the decision boundary defined in classification?

A

The decision boundary is the set of all points x that satisfy:

  • y(x) = w^T x + w_0 = 0
9
Q

How are classes assigned based on the discriminant function?

A
  1. y(x)>0 to “class 1”
  2. y(x)<0 to “class 2”
    - y(x)=0 is the decision boundary
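
The assignment rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names `discriminant` and `classify` and the weight values are made up for the example.

```python
def discriminant(w, x, w0):
    """Linear discriminant y(x) = w^T x + w_0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(w, x, w0):
    """Assign x to a class from the sign of y(x)."""
    y = discriminant(w, x, w0)
    if y > 0:
        return "class 1"
    if y < 0:
        return "class 2"
    return "on the decision boundary"


# Illustrative (assumed) parameters: y(x) = x1 - x2 + 0.5
w, w0 = [1.0, -1.0], 0.5
print(classify(w, [2.0, 1.0], w0))  # y = 1.5  -> class 1
print(classify(w, [0.0, 2.0], w0))  # y = -1.5 -> class 2
print(classify(w, [0.5, 1.0], w0))  # y = 0.0  -> on the decision boundary
```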
10
Q

If points x_a and x_b lie on the decision surface, then:

A
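  1. y(x_a) = y(x_b) = 0

Subtracting the two equations gives

  2. w^T (x_a - x_b) = 0 (the dot product of w with the difference vector is zero)

Since x_a - x_b is an arbitrary vector lying within the decision surface, w is orthogonal to every such vector, so w determines the orientation of the surface.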
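
A quick numeric check of this orthogonality, using an assumed boundary 2·x1 + x2 − 4 = 0 and two hand-picked points on it (the helper `dot` and all values are illustrative only):

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Assumed example: decision boundary 2*x1 + x2 - 4 = 0
w, w0 = [2.0, 1.0], -4.0

# Two points chosen to lie on the boundary, so y(x_a) = y(x_b) = 0
x_a = [2.0, 0.0]
x_b = [0.0, 4.0]
y_a = dot(w, x_a) + w0
y_b = dot(w, x_b) + w0

# The difference vector lies within the decision surface
diff = [a - b for a, b in zip(x_a, x_b)]
print(y_a, y_b, dot(w, diff))  # all three are 0.0
```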

11
Q

How is the decision boundary equation interpreted in terms of projection and bias?

A
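  1. For a point x on the decision boundary, w^T x / ||w|| = -w_0 / ||w||.
  2. The left side is the projection of x onto the direction of w, i.e. how far x lies from the origin along the normal to the boundary.
  3. The right side shows that the bias w_0 sets the displacement of the decision boundary from the origin.
  4. For a general point x, the signed distance to the boundary is y(x) / ||w||.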
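
As a small sketch of the geometry, the signed distance of a point from the boundary is y(x) / ||w||, so the origin lies at distance -w_0 / ||w|| on the far side. The function name `signed_distance` and the numbers below are assumptions made for illustration.

```python
import math


def signed_distance(w, x, w0):
    """Signed perpendicular distance of x from the boundary w^T x + w_0 = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm


# Assumed example with ||w|| = 5
w, w0 = [3.0, 4.0], -10.0
print(signed_distance(w, [0.0, 0.0], w0))  # -2.0: the origin is 2 units from the boundary
print(signed_distance(w, [2.0, 1.0], w0))  # 0.0: this point lies on the boundary
```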
12
Q

Why is using multiple 2-class classifiers for K-class classification not ideal?

A

It can lead to ambiguous regions where boundaries overlap, making it unclear which class a point belongs to.

13
Q

What is the solution for well-defined decision boundaries in K-class classification?

A

Use a unified K-class classifier where each class C_k has its own discriminant linear function of the form y_k(x) = w_k^T x + w_k0

14
Q

What is the decision rule for assigning a point to a class in K-class classification?

A

A point x belongs to class C_k if y_k(x) > y_j(x) for all j ≠ k

15
Q

How is the decision boundary between two classes defined in K-class classification?

A
  • The boundary between classes C_k and C_j occurs when their scores are equal
  • y_k(x) = y_j(x)
  • this results in a (D-1) dimensional hyperplane
16
Q

What is the general form of the hyperplane between two classes C_k and C_j?

A

(w_k - w_j)^T x + (w_k0 - w_j0) = 0

17
Q

What is a property of the decision regions in K-class classification?

A

Because the discriminant functions are linear, the decision regions of a K-class classifier are always singly connected and convex.

18
Q

What are the steps for assigning a class using a K-class classifier?

A
  1. Define the linear discriminants for each class.
  2. Assign weights and biases for each class.
  3. Calculate the discriminant scores for a given point.
  4. Assign the point to the class with the highest score.
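
The four steps can be sketched as follows. The weights and biases are arbitrary values chosen for the example, and `scores` and `predict` are hypothetical helper names; ties are broken in favour of the lowest class index, which is a choice of this sketch rather than something the cards specify.

```python
def scores(W, b, x):
    """Discriminant scores y_k(x) = w_k^T x + w_k0 for every class k."""
    return [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k
            for w_k, b_k in zip(W, b)]


def predict(W, b, x):
    """Index of the class with the highest discriminant score."""
    s = scores(W, b, x)
    return max(range(len(s)), key=lambda k: s[k])


# Steps 1-2: one (assumed) weight vector and bias per class, K = 3, D = 2
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]

# Steps 3-4: score a point and pick the arg max
print(scores(W, b, [2.0, 1.0]))     # [2.0, 1.0, -3.0]
print(predict(W, b, [2.0, 1.0]))    # 0
print(predict(W, b, [0.5, 3.0]))    # 1
print(predict(W, b, [-2.0, -1.0]))  # 2
```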
19
Q

How are weights and biases assigned for each class in K-class classification?

A

Each class C_k is assigned a weight vector w_k and a bias w_k0 which define its discriminant function

20
Q

What does the discriminant score represent in a K-class classifier?

A

The discriminant score represents how strongly a data point is associated with a specific class.

21
Q

What is a perceptron?

A
  • The perceptron (Rosenblatt) is one of the earliest models whose weights are learned from data.
  • It is a linear model with a step activation function that classifies inputs into two classes.
22
Q

How does the step activation function in a perceptron work?

A
  • y(x) = f(w^T ϕ(x))
  • f(a) is the step function:
  • f(a) = +1 if a ≥ 0, indicating one class.
  • f(a) = -1 if a < 0, indicating the other class.
23
Q

What criterion is used for training a perceptron?

A

Training is done using the perceptron criterion, which focuses on minimizing the total error function E_p

24
Q

What does the total error function E_p represent in perceptron training?

A
  • E_p(w) = -∑_{n∈M} w^T ϕ_n t_n, where M is the set of misclassified points.
  • It is a score that tells how “wrong” the perceptron is: each misclassified point contributes the magnitude of its pre-activation score w^T ϕ_n.
  • Each term is the weight vector dotted with the feature vector ϕ_n, multiplied by the target t_n (+1 or -1); correctly classified points contribute zero.
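
A minimal sketch of computing the perceptron criterion, E_p(w) = -∑_{n∈M} w^T ϕ_n t_n, on a toy data set. The function name and the data are invented for the example; the first component of each feature vector is assumed to be a constant 1 that plays the role of the bias.

```python
def perceptron_criterion(w, phis, targets):
    """E_p(w): minus the sum of w^T phi_n * t_n over misclassified points."""
    total = 0.0
    for phi, t in zip(phis, targets):
        a = sum(wi * pi for wi, pi in zip(w, phi))
        # a * t < 0 means the score has the wrong sign for target t.
        # A point with a == 0 would contribute zero either way.
        if a * t < 0:
            total -= a * t
    return total


w = [1.0, -1.0]                                # assumed weights
phis = [[1.0, 2.0], [1.0, 0.5], [1.0, 3.0]]    # toy feature vectors
targets = [+1, +1, -1]

# Scores are -1.0, 0.5, -2.0; only the first point is misclassified.
print(perceptron_criterion(w, phis, targets))  # 1.0
```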
25
Q

Why is the total number of misclassified patterns not an effective error function for perceptron training?

A
  • Because of the step activation f(a), the number of misclassified points is a piecewise constant function of w: its gradient is zero almost everywhere and it jumps discontinuously when the boundary crosses a data point.
  • Gradient-based methods such as the perceptron learning rule need an error function whose gradient is informative.
  • E_p is a continuous, piecewise linear surrogate whose gradient can be used instead.
26
Q

Why does the error function E_p have a negative sign?

A

For a misclassified point, w^T ϕ_n t_n is negative, so the negative sign makes every term of E_p positive. E_p therefore grows with the severity of the misclassifications, and minimizing it drives the weights toward fewer errors.

27
Q

How does the perceptron update weights to reduce misclassification?

A

The perceptron updates its weights to minimize E_p, moving closer to correctly classifying the misclassified points.

28
Q

How is the perceptron criterion applied step by step?

A
  1. The perceptron checks the predicted score before applying the step function.
  2. For misclassified points, it multiplies the predicted score by the true label.
  3. E_p accumulates the negative product for all misclassified points.
29
Q

Why is it difficult to calculate the gradient over the total error function E_p directly?

A

E_p is piecewise linear: it is linear in w only while the set M of misclassified points stays fixed, and its gradient changes abruptly whenever a weight update changes which points are misclassified.

30
Q

How does stochastic gradient descent (SGD) handle the error function?

A

SGD takes just one misclassified point at a time, making the perceptron criterion for a single point linear and smooth, allowing direct gradient computation.

31
Q

What is the update rule for stochastic gradient descent in perceptron training?

A
  • w^{(τ+1)} = w^{(τ)} + η ϕ_n t_n, where τ indexes the update step (not to be confused with the target t_n)
  • η is the learning rate.
  • ϕ_n is the feature vector of the misclassified point.
  • t_n is the target label of the misclassified point (+1 or -1)
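
The update rule can be turned into a short training loop. This is a sketch under stated assumptions: the toy data (a logical AND with targets ±1 and a constant bias feature), the zero initialization, the `max_epochs` cap, and the function names are all choices made for the example.

```python
def step(a):
    """Step activation: +1 if a >= 0, otherwise -1."""
    return 1 if a >= 0 else -1


def train_perceptron(phis, targets, eta=1.0, max_epochs=100):
    """Cycle through the data, applying w <- w + eta * phi_n * t_n on each mistake."""
    w = [0.0] * len(phis[0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, t in zip(phis, targets):
            a = sum(wi * pi for wi, pi in zip(w, phi))
            if step(a) != t:
                w = [wi + eta * pi * t for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:  # every point is classified correctly
            break
    return w


# Linearly separable toy data: first feature is a constant bias term
phis = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
targets = [-1, -1, -1, +1]

w = train_perceptron(phis, targets)
predictions = [step(sum(wi * pi for wi, pi in zip(w, phi))) for phi in phis]
print(predictions == targets)  # True
```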
32
Q

Why does the update rule not directly depend on the full weight vector w?

A
  • The update step η ϕ_n t_n is not a function of w.
  • It depends only on the misclassified point being considered. Because multiplying w by a positive constant leaves the perceptron output unchanged, the learning rate η can be set to 1 without loss of generality.
33
Q

How does SGD nudge the weights for each misclassified point?

A
  • If a point is positive but classified as negative, the algorithm increases the weights to better capture the positive point.
  • The weights are updated in the correct direction to improve classification.
34
Q

What does the perceptron convergence theorem state?

A

If there exists an exact solution (i.e., the problem is linearly separable), perceptron learning will find a solution in a finite number of steps.

35
Q

How did Minsky and Papert’s criticism of the perceptron impact neural network research?

A

They pointed out that the perceptron cannot solve non-linearly separable problems, such as the XOR problem, causing a significant slowdown in neural computation research for nearly a decade.

36
Q

For which type of perceptrons was Minsky and Papert’s criticism valid?

A

Their criticism was only valid for single-layer perceptrons, as they cannot model non-linear decision boundaries.

37
Q

What is the first step in the perceptron algorithm visualization?

A

Select one of the misclassified points.

38
Q

How is the weight vector updated in the perceptron algorithm visualization?

A

The feature vector of the misclassified point, multiplied by its target t_n, is added to the current weight vector, forming a new weight vector.

39
Q

How does a change in the weight vector affect the decision boundary?

A

Since the decision boundary is perpendicular to the weight vector, changing the weight vector reorients the decision boundary.

40
Q

When does the perceptron algorithm stop updating the decision boundary?

A
  1. When all points are correctly classified.
  2. After a maximum number of iterations, if the data is not linearly separable.
41
Q

What are the key takeaways from the perceptron algorithm visualization?

A
  1. The algorithm updates the decision boundary iteratively by shifting the weight vector based on misclassified points.
  2. Learning stops when all points are correctly classified or after a set number of iterations.
  3. The weight vector w directly influences the orientation and position of the decision boundary.
42
Q

What is the objective of probabilistic generative models for classification?

A

To classify an input x by calculating the probability that it belongs to a certain class C_k, denoted as P(C_k|x)

43
Q

How do probabilistic generative models differ from models that directly draw decision boundaries?

A

Instead of directly drawing decision boundaries, they model the probability distributions of data in each class and use Bayes’ Rule to make decisions.

44
Q

How is the posterior probability of class C_1 calculated?

A
  • P(C_1|x) = P(x|C_1)P(C_1) / [P(x|C_1)P(C_1) + P(x|C_2)P(C_2)]
  • = 1 / (1 + exp(-a))
  • = σ(a)
  • where a is the log odds, and σ is the logistic sigmoid function
45
Q

What are the log odds?

A
  • for posterior probability P(C_1|x) = σ(a)
  • a = ln[(P(x|C_1)P(C_1)) / (P(x|C_2)P(C_2))] = ln[P(C_1|x) / P(C_2|x)]
  • so calculating P(C_1|x) requires evaluating the sigmoid function at the log odds a
46
Q

What is the logit function, and how is it related to the sigmoid function?

A

The logit function is the inverse of the sigmoid function

  • a = ln (σ/(1-σ))
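
A small numeric check ties these cards together: the posterior from Bayes' rule equals the sigmoid of the log odds, and the logit recovers the log odds from the posterior. The likelihood and prior values below are made up for illustration.

```python
import math


def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def logit(p):
    """Inverse of the sigmoid: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Assumed values of p(x|C_k) and p(C_k) for one fixed input x
px_c1, p_c1 = 0.6, 0.3
px_c2, p_c2 = 0.2, 0.7

# Posterior directly from Bayes' rule
posterior_bayes = px_c1 * p_c1 / (px_c1 * p_c1 + px_c2 * p_c2)

# Posterior as the sigmoid of the log odds a
a = math.log((px_c1 * p_c1) / (px_c2 * p_c2))
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)  # both approximately 0.5625
print(a, logit(posterior_bayes))           # the logit recovers the log odds
```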
47
Q

Why does calculating p(C_1∣x) require the sigmoid function?

A

Because the posterior probability is expressed as the sigmoid of the log odds a: p(C_1∣x) = σ(a).

48
Q

How is classification generalized to multiple classes?

A
  • For multiple classes, the softmax function is used.
  • It divides the exponentials of confidence scores by their total to produce proper probabilities.
49
Q

What is the formula for the softmax function?

A
  • p(C_k∣x)= [exp(a_k)] / [∑_j exp(a_j)]

Where a_k = ln(p(x∣C_k)p(C_k))
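
A minimal softmax sketch with a_k = ln(p(x|C_k) p(C_k)). The joint values are invented for the example, and subtracting the maximum before exponentiating is a common numerical-stability step added here; it does not change the result.

```python
import math


def softmax(a):
    """Softmax of a list of scores: exp(a_k) / sum_j exp(a_j)."""
    m = max(a)  # shift by the max to avoid overflow; the ratios are unchanged
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed joint values p(x|C_k) p(C_k) for three classes
joint = [0.18, 0.14, 0.08]
a = [math.log(j) for j in joint]

probs = softmax(a)
print(probs)       # approximately [0.45, 0.35, 0.20], i.e. joint / sum(joint)
print(sum(probs))  # approximately 1.0
```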

50
Q

How do the sigmoid and softmax functions differ in their use?

A
  1. The sigmoid function is used for binary classification.
  2. The softmax function is used for K-class classification.
51
Q

How do probabilistic generative models extend to continuous inputs?

A

By assuming that

  1. x follows a Gaussian (normal) distribution within each class C_k, and
  2. all classes share the same covariance matrix.
52
Q

What is the formula for the posterior probability p(C_1∣x) in continuous input models (binary classification)?

A
  • p(C_1∣x) = σ(w^⊤ x + w_0)
  • where w = Σ^−1 (μ_1 − μ_2)
  • where w_0 = -(1/2) μ_1^⊤ Σ^−1 μ_1 + (1/2) μ_2^⊤ Σ^−1 μ_2 + ln[p(C_1)/p(C_2)]
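
The linear form can be checked numerically in one dimension, where Σ reduces to a single shared variance and the prior term in w_0 is ln[p(C_1)/p(C_2)]. The means, variance, priors, and test point below are arbitrary values assumed for the sketch.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Assumed class means, shared variance, and priors
mu1, mu2, var = 2.0, 0.0, 1.5
p1, p2 = 0.4, 0.6

# 1-D versions of w = Sigma^-1 (mu_1 - mu_2) and w_0
w = (mu1 - mu2) / var
w0 = -mu1 ** 2 / (2 * var) + mu2 ** 2 / (2 * var) + math.log(p1 / p2)

x = 0.7
joint1 = gaussian_pdf(x, mu1, var) * p1
joint2 = gaussian_pdf(x, mu2, var) * p2
direct = joint1 / (joint1 + joint2)  # posterior from Bayes' rule
linear = sigmoid(w * x + w0)         # posterior from the linear form

print(direct, linear)  # the two values agree up to rounding
```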
53
Q

What parameters are used in modeling the Gaussian distribution for each class?

A
  1. μ_k : Mean vector for class C_k
  2. Σ: Covariance matrix (same for all classes).
  3. D: Number of features (dimensionality of x)
54
Q

What are the key takeaways from modeling with continuous inputs?

A
  1. Data is modeled using a normal distribution.
  2. Posterior probability is calculated using Bayes’ Rule.
  3. The decision boundary is linear because the quadratic terms in x cancel when the classes share a covariance matrix.
  4. Parameters of the sigmoid function are determined by the parameters of the normal distribution.
  5. the priors (p(C_k)) only enter via the bias parameters
55
Q

How is the model for continuous input generalized for multiple classes?

A

For K classes, the posterior probability is:

  • p(C_k∣x) = [exp(a_k)] / [∑_j exp(a_j)]
  • with a_k(x) = w_k^⊤ x + w_k0
  • where w_k = Σ^−1 μ_k
  • where w_k0 = -(1/2) μ_k^⊤ Σ^−1 μ_k + ln[p(C_k)]
56
Q

What simplifications occur when covariances are shared?

A
  1. Shared covariances result in linear boundaries (Linear Discriminant Analysis, LDA).
  2. Class-specific covariances result in quadratic boundaries (Quadratic Discriminant Analysis, QDA).
57
Q

What is the purpose of using maximum likelihood in probabilistic generative models?

A

It determines the values of the class means μ_k, the shared covariance Σ, and the class priors by maximizing the likelihood of the observed training data under the model.

58
Q

How is the parameter q (the prior p(C_1)) estimated in maximum likelihood?

A
  1. Differentiate the log-likelihood with respect to q.
  2. Set the derivative to zero and solve for q.
  3. The result is that q is the fraction of points in class 1.
  • q = N_1/(N_1+N_2) = N_1/N
59
Q

How is the parameter μ_1 (mean vector for class 1) estimated in maximum likelihood?

A
  1. Differentiate the log-likelihood with respect to μ_1.
  2. Set the derivative to zero and solve for μ_1.
  3. The result is that μ_1 is the average of the points in class 1.
  • μ_1 = (1/N_1) ∑_n t_n x_n, where t_n = 1 for points in class 1 and t_n = 0 otherwise
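
Both closed-form estimates, q = N_1/(N_1+N_2) and μ_1 = (1/N_1) ∑_n t_n x_n, are easy to compute directly. In the sketch below the data set is invented, `ml_estimates` is a hypothetical helper name, and the class-2 mean is computed by the analogous formula with (1 − t_n) in place of t_n.

```python
def ml_estimates(xs, ts):
    """Maximum likelihood prior q = p(C_1) and class means for targets t_n in {0, 1}."""
    n1 = sum(ts)
    n2 = len(ts) - n1
    q = n1 / (n1 + n2)
    d = len(xs[0])
    mu1 = [sum(t * x[i] for x, t in zip(xs, ts)) / n1 for i in range(d)]
    mu2 = [sum((1 - t) * x[i] for x, t in zip(xs, ts)) / n2 for i in range(d)]
    return q, mu1, mu2


# Toy data: two points in class 1 (t = 1), three in class 2 (t = 0)
xs = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [2.0, 0.0], [4.0, 2.0]]
ts = [1, 1, 0, 0, 0]

q, mu1, mu2 = ml_estimates(xs, ts)
print(q)    # 0.4, the fraction of points in class 1
print(mu1)  # [2.0, 3.0], the average of the class-1 points
print(mu2)  # approximately [2.0, 0.667]
```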
60
Q

How does logistic regression differ from models that use normal distributions?

A

Logistic regression directly provides a model for the probability of class membership using a sigmoid function, making it more compact than Gaussian models.

61
Q

Why is logistic regression considered more compact than Gaussian models?

A
  • Logistic regression requires only M adjustable parameters (the weights, including the bias).
  • Gaussian models require 2M parameters for the means and M(M+1)/2 additional parameters for the shared covariance, so the total grows quadratically with M.
62
Q

What function is used in logistic regression to model the probability of a class?

A
  • The sigmoid function is used, which maps a linear combination of the input features to a probability.
  • p(C_1|x) = y(ϕ) = σ(w^T ϕ)
63
Q

What is the likelihood function in logistic regression?

A

p(t|w) = ∏_n y_n^{t_n} (1 − y_n)^{1 − t_n}, where y_n = σ(w^T ϕ_n) and t_n ∈ {0, 1}

64
Q

What loss function does logistic regression minimize?

A
  • Logistic regression minimizes the negative log-likelihood, which yields the cross-entropy loss function.
65
Q

What is the cross-entropy loss function?

A
  • the negative log-likelihood
  • E(w) = -∑_n {t_n ln(y_n) + (1 − t_n) ln(1 − y_n)}
66
Q

What does the cross-entropy loss measure in logistic regression?

A

Cross-entropy measures the difference between the predicted probability distribution and the true distribution, penalizing incorrect predictions more harshly when the model is confident but wrong.
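
To see the penalty on confident mistakes, the cross-entropy E = -∑_n {t_n ln(y_n) + (1 − t_n) ln(1 − y_n)} can be evaluated on two sets of predictions for the same targets. The predicted probabilities below are illustrative values, not results from a fitted model.

```python
import math


def cross_entropy(ts, ys):
    """Negative log-likelihood for binary targets t_n and predicted probabilities y_n."""
    return -sum(t * math.log(y) + (1 - t) * math.log(1 - y)
                for t, y in zip(ts, ys))


ts = [1, 0, 1]

confident_right = cross_entropy(ts, [0.9, 0.1, 0.9])
unsure = cross_entropy(ts, [0.5, 0.5, 0.5])
confident_wrong = cross_entropy(ts, [0.1, 0.9, 0.1])

# The loss grows sharply as confident predictions turn out to be wrong.
print(confident_right, unsure, confident_wrong)
```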

67
Q

How does the gradient of the logistic regression loss resemble that of linear regression?

A

The gradient is ∇E(w) = ∑_n (y_n − t_n) ϕ_n: the prediction error times the feature vector, summed over the data. This is the same form as the gradient of the sum-of-squares error in linear regression, showing a close relationship between the two models.
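
A batch gradient descent sketch using the gradient ∑_n (y_n − t_n) ϕ_n of the cross-entropy error. The toy data (one input feature plus a constant bias feature, deliberately not linearly separable), the learning rate, and the step count are assumptions chosen so the example stays stable, not values from the lecture.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def predict(w, phi):
    """y = sigma(w^T phi)."""
    return sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))


def loss(w, phis, ts):
    """Cross-entropy error E(w)."""
    return -sum(t * math.log(predict(w, phi)) + (1 - t) * math.log(1 - predict(w, phi))
                for phi, t in zip(phis, ts))


def gradient(w, phis, ts):
    """Gradient of E(w): sum over n of (y_n - t_n) * phi_n."""
    grad = [0.0] * len(w)
    for phi, t in zip(phis, ts):
        err = predict(w, phi) - t
        for i, p in enumerate(phi):
            grad[i] += err * p
    return grad


def fit(phis, ts, eta=0.05, steps=2000):
    """Plain batch gradient descent from w = 0."""
    w = [0.0] * len(phis[0])
    for _ in range(steps):
        g = gradient(w, phis, ts)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


phis = [[1.0, float(x)] for x in range(6)]  # [bias, x] for x = 0..5
ts = [0, 0, 1, 0, 1, 1]

w = fit(phis, ts)
print(loss([0.0, 0.0], phis, ts), loss(w, phis, ts))  # the loss decreases
```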

68
Q

Why does gradient descent work well for logistic regression?

A

Although the sigmoid is nonlinear, the cross-entropy error is a convex function of w, so it has no local minima other than the global one and gradient descent with a suitable learning rate converges toward the optimal solution.

69
Q

What key points should be highlighted about logistic regression?

A
  1. Logistic regression models probabilities using the sigmoid function.
  2. It is more compact than Gaussian models, especially in high-dimensional data.
  3. The negative log-likelihood forms the basis of optimization, leading to cross-entropy loss.
70
Q

What is the derivative of the sigmoid function?

A
  • i.e., the derivative of y(ϕ) = σ(w^T ϕ) = σ(a) with respect to its argument a
  • dσ(a) / da = σ(a) (1 − σ(a))
  • = y(1 − y)
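
The identity can be sanity-checked against a central finite difference. The step size `h` and the test points are arbitrary choices for this sketch.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad(a):
    """Analytic derivative: sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


def numeric_grad(a, h=1e-6):
    """Central finite-difference approximation of d sigma / da."""
    return (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)


for a in [-2.0, 0.0, 1.5]:
    print(a, sigmoid_grad(a), numeric_grad(a))  # the two columns agree closely
```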
71
Q

What is the Newton-Raphson update rule used in IRLS?

A

It updates the weights by subtracting the inverse of the Hessian matrix multiplied by the gradient of the error function: w_new = w_old − H^−1 ∇E(w).

72
Q

Why does IRLS for linear regression converge in one step?

A

For linear regression, the error function is quadratic, and the curvature (Hessian matrix) is constant, allowing optimization to converge in a single step.

73
Q

How does IRLS differ when applied to logistic regression compared to linear regression?

A

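For logistic regression, IRLS requires iterative updates because the cross-entropy error function is not quadratic: the Hessian depends on w through the weighting matrix R, with R_nn = y_n(1 − y_n), so it must be recomputed from the current predictions at each step.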
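
Written out, the Newton-Raphson step and its two special cases look as follows. The notation is the standard one for this derivation and is assumed here rather than taken from the cards: Φ is the design matrix with rows ϕ_n^⊤, y is the vector of predictions, t is the vector of targets, and R is the diagonal weighting matrix.

```latex
\begin{align}
\mathbf{w}^{\text{new}} &= \mathbf{w}^{\text{old}} - \mathbf{H}^{-1}\,\nabla E(\mathbf{w}) \\[4pt]
\text{Linear regression:}\quad
\nabla E &= \Phi^\top\Phi\,\mathbf{w} - \Phi^\top\mathbf{t},
\qquad \mathbf{H} = \Phi^\top\Phi \\
\mathbf{w}^{\text{new}} &= \mathbf{w}^{\text{old}}
  - (\Phi^\top\Phi)^{-1}\bigl(\Phi^\top\Phi\,\mathbf{w}^{\text{old}} - \Phi^\top\mathbf{t}\bigr)
  = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{t} \\[4pt]
\text{Logistic regression:}\quad
\nabla E &= \Phi^\top(\mathbf{y} - \mathbf{t}),
\qquad \mathbf{H} = \Phi^\top\mathbf{R}\,\Phi,
\qquad R_{nn} = y_n(1 - y_n) \\
\mathbf{w}^{\text{new}} &= \mathbf{w}^{\text{old}}
  - (\Phi^\top\mathbf{R}\,\Phi)^{-1}\Phi^\top(\mathbf{y} - \mathbf{t})
\end{align}
```

In the linear case the old weights cancel, so a single step lands on the least-squares solution. In the logistic case y and R depend on the old weights, so the step has to be repeated.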

74
Q

Why does IRLS involve reweighting in logistic regression?

A

The weighting matrix R, with diagonal elements y_n(1 − y_n), depends on the current predictions, so each iteration solves a weighted least-squares problem with new weights until the solution converges.

75
Q

What are the key differences between gradient descent and IRLS?

A
  1. Gradient descent updates weights using the gradient iteratively, while IRLS solves weighted least squares problems iteratively.
  2. IRLS uses second-order information (the Hessian matrix) and typically converges in fewer iterations, though each iteration is more expensive.
  3. Gradient descent requires a learning rate parameter, whereas IRLS does not.
76
Q

Why is IRLS considered an efficient optimization method for logistic regression?

A

IRLS leverages the Newton-Raphson method, which uses second-order information, leading to faster convergence compared to gradient descent.