lecture 4 - classification Flashcards

Question

Why is direct misclassification using the total number of misclassified patterns not effective in perceptron training?

Answer 1

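- The step activation function f(a) is piecewise constant, so the number of misclassified points is a piecewise-constant function of w: its gradient is zero almost everywhere and undefined at the jumps. - Gradient-based methods such as the perceptron learning rule need an error function with a usable gradient. - The perceptron criterion E_p **indirectly approximates** the misclassification count with a piecewise-linear function of w that does have a usable gradient.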

Answer 2

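The **negative sign** makes every term of E_p positive: for a misclassified point the product w^T ϕ_n t_n is negative, so -w^T ϕ_n t_n > 0. E_p therefore grows with the extent of misclassification, and minimizing it guides the algorithm to reduce misclassification by updating the weights.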

Answer 3

The perceptron updates its weights to minimize E_p, moving closer to correctly classifying the misclassified points.

Answer 4

1. The perceptron checks the predicted score before applying the step function. 2. For misclassified points, it multiplies the predicted score by the true label. 3. E_p accumulates the negative product for all misclassified points.
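
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `perceptron_criterion` and the list-based data layout are assumptions, and targets are taken to be +1 or -1.

```python
def perceptron_criterion(w, phis, targets):
    """E_p(w) = -sum over misclassified n of (w . phi_n) * t_n, with t_n in {+1, -1}."""
    total = 0.0
    for phi, t in zip(phis, targets):
        # Step 1: predicted score w^T phi_n, before any step function is applied.
        score = sum(wi * xi for wi, xi in zip(w, phi))
        # Step 2: the product score * t is non-positive for a misclassified point
        # (a point exactly on the boundary contributes zero).
        if score * t <= 0:
            # Step 3: accumulate the negative product, which is non-negative.
            total += -score * t
    return total
```

For example, with w = [1, -1] and the points [2, 1], [1, 3], [0, 1] labelled +1, +1, -1, only the second point is misclassified (score -2, label +1), so E_p is 2.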

Answer 5

The total error function E_p is piecewise linear and not smooth because it **sums over all misclassified points**, causing abrupt changes when the set of misclassified points changes.

Answer 6

SGD takes just one misclassified point at a time, making the perceptron criterion for a single point linear and smooth, allowing direct gradient computation.

Answer 7

- w^{t+1} = w^{t} + η ϕ_n t_n - η is the learning rate. - ϕ_n is the feature vector of the misclassified point. - t_n is the target label of the misclassified point (+1 or -1)

Answer 8

- The update η ϕ_n t_n is not a function of w - It depends only on the misclassified point being considered. - The learning rate η can be set to 1 without loss of generality, because rescaling w by a positive constant does not change the sign of w^T ϕ and so does not change any classification.

Answer 9

- If a point is positive but classified as negative, the algorithm increases the weights to better capture the positive point. - The weights are updated in the correct direction to improve classification.

Answer 10

If there exists an exact solution (i.e., the problem is linearly separable), perceptron learning will find a solution in a finite number of steps.

Answer 11

They pointed out that the perceptron cannot solve non-linearly separable problems, such as the XOR problem, causing a significant slowdown in neural computation research for nearly a decade.

Answer 12

Their criticism was only valid for **single-layer perceptrons**, as they cannot model non-linear decision boundaries.

Answer 13

Select one of the misclassified points.

Answer 14

The arrow from the boundary to the misclassified point is added to the current weight vector, forming a new weight vector.

Answer 15

Since the decision boundary is perpendicular to the weight vector, changing the weight vector shifts the decision boundary.

Answer 16

1. When all points are correctly classified. 2. After a maximum number of iterations, if the data is not linearly separable.
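
A minimal sketch of a training loop with both stopping conditions, assuming targets in {+1, -1} and feature vectors that already include a constant bias feature. The name `train_perceptron` and the choice to count `max_iters` in full passes over the data are assumptions made for illustration.

```python
def train_perceptron(phis, targets, max_iters=1000, eta=1.0):
    """Return (w, converged) using the update w <- w + eta * phi_n * t_n."""
    w = [0.0] * len(phis[0])
    for _ in range(max_iters):              # stopping condition 2: iteration cap
        updated = False
        for phi, t in zip(phis, targets):
            score = sum(wi * xi for wi, xi in zip(w, phi))
            if score * t <= 0:              # misclassified (or on the boundary)
                w = [wi + eta * xi * t for wi, xi in zip(w, phi)]
                updated = True
        if not updated:                     # stopping condition 1: no mistakes left
            return w, True
    return w, False


# Logical AND is linearly separable; XOR is not. The first feature is the bias.
inputs = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
w_and, and_converged = train_perceptron(inputs, [-1, -1, -1, 1])
w_xor, xor_converged = train_perceptron(inputs, [-1, 1, 1, -1], max_iters=50)
```

On the AND labels the loop exits through the first condition; on the XOR labels every pass contains at least one mistake, so only the iteration cap ends it.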

Answer 17

1. The algorithm updates the decision boundary iteratively by shifting the weight vector based on misclassified points. 2. Learning stops when all points are correctly classified or after a set number of iterations. 3. The weight vector w directly influences the orientation and position of the decision boundary.

Answer 18

To classify an input x by calculating the probability that it belongs to a certain class C_k, denoted as P(C_k|x)

Answer 19

Instead of directly drawing decision boundaries, they model the probability distributions of data in each class and use Bayes' Rule to make decisions.

Answer 20

- P(C_1|x) = P(x|C_1)P(C_1) / [P(x|C_1)P(C_1) + P(x|C_2)P(C_2)] = 1/(1+exp(-a)) = σ(a) - where a = ln[P(x|C_1)P(C_1) / (P(x|C_2)P(C_2))] is the log odds, and σ is the logistic sigmoid function

Answer 21

- The posterior probability is P(C_1|x) = σ(a) - a = ln[P(x|C_1)P(C_1) / (P(x|C_2)P(C_2))] - so calculating P(C_1|x) requires evaluating the sigmoid function at the log odds a
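
A small numerical check that Bayes' rule and the sigmoid of the log odds give the same posterior. The function names and the example likelihood and prior values are made up for illustration; a is taken to be the natural log of the ratio of the two joint probabilities.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def posterior_c1(lik1, prior1, lik2, prior2):
    """P(C_1|x) computed directly from Bayes' rule."""
    return lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)


def log_odds(lik1, prior1, lik2, prior2):
    """a = ln[ P(x|C_1)P(C_1) / (P(x|C_2)P(C_2)) ]."""
    return math.log((lik1 * prior1) / (lik2 * prior2))


# Hypothetical values: P(x|C_1) = 0.2, P(x|C_2) = 0.05, equal priors.
direct = posterior_c1(0.2, 0.5, 0.05, 0.5)
via_sigmoid = sigmoid(log_odds(0.2, 0.5, 0.05, 0.5))
```

Both routes give a posterior of 0.8 for these values, up to floating-point rounding.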

Answer 22

The logit function is the inverse of the sigmoid function - a = ln (σ/(1-σ))

Answer 23

Because the probability is expressed as the sigmoid of the log odds a.

Answer 24

- For multiple classes, the softmax function is used. - It divides the exponentials of confidence scores by their total to produce proper probabilities.

Answer 25

- p(C_k∣x)= [exp(a_k)] / [∑_j exp(a_j)] Where a_k = ln(p(x∣C_k)p(C_k))
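
A minimal softmax sketch, assuming the scores a_k are passed as a plain list. Subtracting the largest score before exponentiating is a common numerical-stability trick that is not part of the card; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(a):
    """p_k = exp(a_k) / sum_j exp(a_j)."""
    m = max(a)                                # shift for numerical stability
    exps = [math.exp(ak - m) for ak in a]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical joint probabilities p(x|C_k)p(C_k) for three classes.
joint = [0.02, 0.06, 0.12]
posteriors = softmax([math.log(j) for j in joint])
```

With a_k = ln(p(x|C_k)p(C_k)), the softmax simply renormalizes the joint probabilities, so the example yields posteriors of 0.1, 0.3 and 0.6.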

Answer 26

1. The sigmoid function is used for binary classification. 2. The softmax function is used for K-class classification.

Answer 27

By assuming that 1. x follows a Gaussian (normal) distribution within each class C_k, and 2. all classes share the same covariance matrix.

Answer 28

- p(C_1∣x)=σ(w^⊤ x + w_0) - where w = Σ^−1 (μ_1−μ_2) - where w_0 = -(1/2) μ_1^⊤ Σ^−1 μ_1 + (1/2) μ_2^⊤ Σ^−1 μ_2 + ln[p(C_1)/p(C_2)]
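
A one-dimensional sanity check of this result, where the covariance matrix reduces to a single shared variance. All names and numbers are illustrative assumptions; the bias term uses ln[p(C_1)/p(C_2)], which is what the derivation from the log odds produces.

```python
import math


def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def posterior_bayes(x, mu1, mu2, var, prior1, prior2):
    """P(C_1|x) from Bayes' rule with Gaussian class-conditional densities."""
    j1 = gaussian_pdf(x, mu1, var) * prior1
    j2 = gaussian_pdf(x, mu2, var) * prior2
    return j1 / (j1 + j2)


def posterior_sigmoid(x, mu1, mu2, var, prior1, prior2):
    """P(C_1|x) = sigmoid(w * x + w0) with the closed-form w and w0 (1-D case)."""
    w = (mu1 - mu2) / var
    w0 = -0.5 * mu1 ** 2 / var + 0.5 * mu2 ** 2 / var + math.log(prior1 / prior2)
    return sigmoid(w * x + w0)
```

The two functions agree for any x because the x² terms of the two Gaussians cancel when the variance is shared, leaving an argument that is linear in x.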

Answer 29

1. μ_k : Mean vector for class C_k 2. Σ: Covariance matrix (same for all classes). 3. D: Number of features (dimensionality of x)

Answer 30

1. Data is modeled using a normal distribution. 2. Posterior probability is calculated using Bayes' Rule. 3. The decision boundary is linear because the quadratic terms in x cancel when the classes share a covariance matrix. 4. Parameters of the sigmoid function are determined by the parameters of the normal distribution. 5. The priors p(C_k) enter only via the bias parameters.

Answer 31

For K classes, the posterior is a softmax over linear discriminant functions: - p(C_k∣x) = [exp(a_k)] / [∑_j exp(a_j)] - where a_k(x) = w_k^⊤ x + w_k0 - where w_k = Σ^−1 μ_k - where w_k0 = -(1/2) μ_k^⊤ Σ^−1 μ_k + ln[p(C_k)]

Answer 32

1. A shared covariance matrix results in a linear boundary (Linear Discriminant Analysis, LDA). 2. Class-specific (unshared) covariance matrices result in a quadratic boundary (Quadratic Discriminant Analysis, QDA).

Answer 33

It helps **determine the values of the μ parameters and priors** by maximizing the probability that the data is described by the given parameters.

Answer 34

1. Differentiate the log-likelihood with respect to q. 2. Set the derivative to zero and solve for q. 3. The result is that q is the fraction of points in class 1. - q = N_1/(N_1+N_2) = N_1/N

Answer 35

1. Differentiate the log-likelihood with respect to μ_1. 2. Set the derivative to zero and solve for μ_1. 3. The result is that μ_1 is the average of the points in class 1. - μ_1 = (1/N_1) ∑_n t_n x_n, where t_n = 1 for points in class 1 and 0 otherwise
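
The closed-form maximum likelihood estimates can be sketched as follows for scalar inputs. The function name `fit_class_params` is hypothetical, and targets are assumed to be coded as 1 for class 1 and 0 for class 2, which is the coding under which the sum of t_n x_n picks out the class-1 points.

```python
def fit_class_params(xs, ts):
    """Return (q, mu1, mu2): the class-1 prior and the two class means."""
    n1 = sum(ts)                    # N_1: number of points in class 1
    n2 = len(ts) - n1               # N_2: number of points in class 2
    q = n1 / (n1 + n2)              # fraction of points in class 1
    mu1 = sum(t * x for x, t in zip(xs, ts)) / n1
    mu2 = sum((1 - t) * x for x, t in zip(xs, ts)) / n2
    return q, mu1, mu2


# Hypothetical data: two points in class 1, three in class 2.
q, mu1, mu2 = fit_class_params([1.0, 3.0, 10.0, 12.0, 14.0], [1, 1, 0, 0, 0])
```

Here q is 2/5, μ_1 is the mean of 1 and 3, and μ_2 is the mean of 10, 12 and 14.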

Answer 36

Logistic regression directly provides a model for the probability of class membership using a sigmoid function, making it more compact than Gaussian models.

Answer 37

- Logistic regression requires only M parameters for weights and bias - Gaussian models require 2M parameters for means and M(M+1)/2 additional parameters for shared covariance.
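
To make the gap concrete, the two counts can be compared for a given M. The helper names are hypothetical, and the Gaussian count follows the card in leaving out the one extra parameter for the class prior.

```python
def logistic_param_count(m):
    """Adjustable parameters for logistic regression in an M-dimensional feature space."""
    return m


def gaussian_param_count(m):
    """2M parameters for the two means plus M(M+1)/2 for the shared covariance."""
    return 2 * m + m * (m + 1) // 2


counts_for_100 = (logistic_param_count(100), gaussian_param_count(100))
```

For M = 100 this gives 100 parameters against 5250, showing linear versus quadratic growth in M.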

Answer 38

- The sigmoid function is used, which maps a linear combination of the input features to a probability. - p(C_1|x) = y(ϕ) = σ(w^T ϕ)

Answer 39

p(t|w) = ∏_n y_n^{t_n} (1-y_n)^{1-t_n}

Answer 40

- Logistic regression minimizes the negative log-likelihood, which yields the cross-entropy loss function.

Answer 41

- the negative log-likelihood - E(w) = -∑_n {t_n ln(y_n) + (1-t_n) ln(1-y_n)}

Answer 42

Cross-entropy measures the **difference between the predicted probability distribution and the true distribution**, penalizing incorrect predictions more harshly when the model is confident but wrong.
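
A minimal sketch of the cross-entropy error for binary targets, assuming t_n in {0, 1} and predictions y_n strictly between 0 and 1 (the logarithm is undefined at exactly 0 or 1). The function name is an illustrative choice.

```python
import math


def cross_entropy(ys, ts):
    """E = -sum_n [ t_n ln(y_n) + (1 - t_n) ln(1 - y_n) ]."""
    return -sum(t * math.log(y) + (1 - t) * math.log(1 - y) for y, t in zip(ys, ts))


confident_right = cross_entropy([0.99], [1])   # small penalty
unsure = cross_entropy([0.5], [1])             # ln 2
confident_wrong = cross_entropy([0.01], [1])   # large penalty
```

The three values increase in that order, which is the "confident but wrong is penalized most" behaviour described above.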

Answer 43

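The gradient for logistic regression, ∇E(w) = ∑_n (y_n - t_n) ϕ_n, has **the same form as the gradient of the sum-of-squares error in linear regression**: the prediction error (y_n - t_n) times the feature vector ϕ_n. This shows a close relationship between the two models.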

Answer 44

Although the sigmoid function is nonlinear, the cross-entropy error is convex in weight space, so it has no local minima other than the global one, and gradient descent with a suitable learning rate converges toward the optimal solution.
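
A minimal batch gradient descent sketch for logistic regression, using the gradient ∑_n (y_n - t_n) ϕ_n of the cross-entropy error. The function names, the learning rate, the step count and the toy data are all assumptions chosen so the example runs; the data are deliberately not linearly separable so that a finite minimum exists.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gradient(w, phis, ts):
    """Gradient of the cross-entropy error: sum_n (y_n - t_n) * phi_n."""
    g = [0.0] * len(w)
    for phi, t in zip(phis, ts):
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, phi)))
        for i, xi in enumerate(phi):
            g[i] += (y - t) * xi
    return g


def fit_logistic_gd(phis, ts, eta=0.1, steps=2000):
    w = [0.0] * len(phis[0])
    for _ in range(steps):
        g = gradient(w, phis, ts)
        w = [wi - eta * gi for wi, gi in zip(w, g)]   # step against the gradient
    return w


# Features are [bias, x]; targets are 0 or 1.
phis = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ts = [0, 1, 0, 1]
w_gd = fit_logistic_gd(phis, ts)
```

At the returned weights the gradient is numerically close to zero, which for a convex error function indicates the global minimum.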

Answer 45

1. Logistic regression models probabilities using the sigmoid function. 2. It is more compact than Gaussian models, especially in high-dimensional data. 3. The negative log-likelihood forms the basis of optimization, leading to cross-entropy loss.

Answer 46

- i.e., the derivative of the sigmoid function y(ϕ) = σ(w^T ϕ) = σ(a) with respect to its argument a - dσ(a)/da = σ(a)(1-σ(a)) = y(1-y)
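
The identity can be checked numerically against a central finite difference. This is only a sanity check under assumed names and an arbitrarily chosen test point and step size.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad(a):
    """Analytic derivative: sigma(a) * (1 - sigma(a))."""
    y = sigmoid(a)
    return y * (1.0 - y)


h = 1e-6
numeric = (sigmoid(0.7 + h) - sigmoid(0.7 - h)) / (2.0 * h)   # central difference at a = 0.7
analytic = sigmoid_grad(0.7)
```

The two values agree to well within the accuracy of the finite-difference approximation, and at a = 0 the derivative takes its maximum value of 0.25.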

Answer 47

It updates the weights by subtracting the inverse of the Hessian matrix multiplied by the gradient of the error function: w_new = w_old - H^−1 ∇E(w).

Answer 48

For linear regression, the **error function is quadratic**, and the curvature (Hessian matrix) is constant, allowing optimization to converge in a single step.

Answer 49

For logistic regression, IRLS requires iterative updates because the **cross-entropy error function is not quadratic**, and the weights depend on the current predictions.

Answer 50

IRLS reweights the basis functions based on the current predictions, requiring iterative updates to reach the optimal solution

Answer 51

1. Gradient descent updates weights using the gradient iteratively, while IRLS solves weighted least squares problems iteratively. 2. IRLS uses second-order information (Hessian matrix) and typically converges in fewer iterations. 3. Gradient descent requires a learning rate parameter, whereas IRLS does not.

Answer 52

IRLS leverages the Newton-Raphson method, which uses second-order information, leading to faster convergence compared to gradient descent.
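
A minimal Newton-Raphson sketch for a two-parameter logistic regression model y = σ(w_0 + w_1 x), with the 2x2 Hessian inverted by hand. The function names, the iteration count and the toy data are illustrative assumptions; the gradient is ∑_n (y_n - t_n) ϕ_n and the Hessian is ∑_n y_n (1 - y_n) ϕ_n ϕ_n^T with ϕ_n = (1, x_n).

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def newton_step(w, xs, ts):
    """Return (updated weights, gradient evaluated at the input w)."""
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for x, t in zip(xs, ts):
        y = sigmoid(w[0] + w[1] * x)
        r = y * (1.0 - y)            # per-point weight; changes as predictions change
        g0 += y - t
        g1 += (y - t) * x
        h00 += r
        h01 += r * x
        h11 += r * x * x
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det     # first component of H^-1 g
    d1 = (-h01 * g0 + h00 * g1) / det    # second component of H^-1 g
    return [w[0] - d0, w[1] - d1], (g0, g1)


def fit_logistic_newton(xs, ts, iters=10):
    w = [0.0, 0.0]
    for _ in range(iters):
        w, _ = newton_step(w, xs, ts)
    return w


xs = [0.0, 1.0, 2.0, 3.0]
ts = [0, 1, 0, 1]                         # not linearly separable, so a finite optimum exists
w_newton = fit_logistic_newton(xs, ts)
_, grad_at_fit = newton_step(w_newton, xs, ts)
```

No learning rate appears anywhere: each step is scaled by the inverse Hessian, and the weights r are recomputed from the current predictions at every iteration, which is the "reweighting" in IRLS.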

(76 cards)