lecture 4 - classification Flashcards
How do linear models handle classification tasks?
Linear models for classification take an input vector x and assign it to one of K discrete classes using a separating hyperplane (a linear decision boundary) in the input space.
How are classes separated in a D-dimensional input space?
They are separated by (D−1)-dimensional hyperplanes (the decision surfaces).
How are linear models represented in regression?
y(x) = w^T x + w_0
What is the role of the activation function in classification with linear models?
- The activation function f(⋅) maps the continuous output of the linear model to a discrete class label.
- This makes the model nonlinear in its outputs, while the underlying equation (and hence the decision boundary) remains linear in x.
How can a step function be used for classification?
A step function can assign
- y(x)>0 to “class 1”
- y(x)≤0 to “class 2”
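A minimal sketch of this rule in NumPy; the weight vector w and bias w0 below are illustrative values, not anything fitted:

```python
import numpy as np

# Illustrative parameters for y(x) = w^T x + w_0 (not learned).
w = np.array([2.0, -1.0])   # weight vector w
w0 = 0.5                    # bias w_0

def classify(x):
    y = w @ x + w0          # linear score y(x) = w^T x + w_0
    return "class 1" if y > 0 else "class 2"   # step rule: y(x) > 0 -> class 1

print(classify(np.array([1.0, 0.0])))   # y = 2.5 > 0  -> class 1
print(classify(np.array([0.0, 3.0])))   # y = -2.5 <= 0 -> class 2
```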
What is a discriminant function in classification tasks?
A discriminant function is a mathematical function used to separate data points into distinct classes by mapping input features to a decision boundary.
What is the simplest form of a discriminant function for a 2-class classification problem?
y(x) = w^T x + w_0
How is the decision boundary defined in classification?
The decision boundary is the set of all points x that satisfy:
- y(x) = w^T x + w_0 = 0
How are classes assigned based on the discriminant function?
- y(x)>0 to “class 1”
- y(x)<0 to “class 2”
- y(x)=0 is the decision boundary
If points x_a and x_b lie on the decision surface, then:
- y(x_a) = y(x_b) = 0, i.e. w^T x_a + w_0 = w^T x_b + w_0
therefore
- w^T (x_a - x_b) = 0 (the dot product of the two vectors is zero)
this indicates that w is orthogonal to every vector lying within the decision surface
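A quick numeric check of this claim, using an illustrative boundary x1 + x2 − 2 = 0 and two hand-picked points lying on it:

```python
import numpy as np

# Illustrative parameters: the decision surface is x1 + x2 - 2 = 0.
w, w0 = np.array([1.0, 1.0]), -2.0

x_a = np.array([2.0, 0.0])   # y(x_a) = 0, lies on the boundary
x_b = np.array([0.0, 2.0])   # y(x_b) = 0, lies on the boundary

assert w @ x_a + w0 == 0 and w @ x_b + w0 == 0
print(w @ (x_a - x_b))       # 0.0 -> w is orthogonal to x_a - x_b
```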
How is the decision boundary equation interpreted in terms of projection and bias?
- Rearranging y(x) = 0 as w^T x = −w_0, the left side is the projection of x onto w; the value y(x) determines how far x lies from the boundary and on which side.
- The right side, −w_0, represents the displacement or location of the decision boundary relative to the origin.
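Written out as equations, with r the signed distance of a point x from the boundary and d the normal distance of the boundary from the origin:

```latex
% Rearranging the boundary equation:
y(\mathbf{x}) = 0 \iff \mathbf{w}^{T}\mathbf{x} = -w_0
% Signed distance of an arbitrary point x from the boundary:
r = \frac{y(\mathbf{x})}{\lVert \mathbf{w} \rVert}
% Normal distance of the decision surface from the origin:
d = \frac{-w_0}{\lVert \mathbf{w} \rVert}
```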
Why is using multiple 2-class classifiers for K-class classification not ideal?
It can lead to ambiguous regions where boundaries overlap, making it unclear which class a point belongs to.
What is the solution for well-defined decision boundaries in K-class classification?
Use a unified K-class classifier where each class C_k has its own linear discriminant function of the form y_k(x) = w_k^T x + w_k0
What is the decision rule for assigning a point to a class in K-class classification?
A point x belongs to class C_k if y_k(x) > y_j(x) for all j ≠ k (i.e. the class with the highest discriminant score).
How is the decision boundary between two classes defined in K-class classification?
- The boundary between classes C_k and C_j occurs when their scores are equal
- y_k(x) = y_j(x)
- this results in a (D−1)-dimensional hyperplane
What is the general form of the hyperplane between two classes C_k and C_j?
(w_k - w_j)^T x + (w_k0 - w_j0) = 0
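This form follows in one step from setting the two discriminant scores equal:

```latex
y_k(\mathbf{x}) = y_j(\mathbf{x})
\;\Rightarrow\; \mathbf{w}_k^{T}\mathbf{x} + w_{k0} = \mathbf{w}_j^{T}\mathbf{x} + w_{j0}
\;\Rightarrow\; (\mathbf{w}_k - \mathbf{w}_j)^{T}\mathbf{x} + (w_{k0} - w_{j0}) = 0
```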
What is a property of the decision boundary in K-class classification?
Linearity of the discriminant functions makes each decision region of a K-class classifier singly connected and convex
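A sketch of the standard argument for why linearity gives convex regions:

```latex
% Take two points x_A, x_B in region R_k and any point on the segment between them:
\hat{\mathbf{x}} = \lambda \mathbf{x}_A + (1 - \lambda)\,\mathbf{x}_B, \qquad 0 \le \lambda \le 1
% Linearity of the discriminants gives
y_k(\hat{\mathbf{x}}) = \lambda\, y_k(\mathbf{x}_A) + (1 - \lambda)\, y_k(\mathbf{x}_B)
% Since y_k > y_j at both endpoints for all j \neq k, the same inequality holds
% at \hat{x}, so \hat{x} also lies in R_k: the region is convex.
```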
What are the steps for assigning a class using a K-class classifier?
- Define the linear discriminants for each class.
- Assign weights and biases for each class.
- Calculate the discriminant scores for a given point.
- Assign the point to the class with the highest score.
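A minimal sketch of these four steps in NumPy, with illustrative hand-picked weights for K = 3 classes and D = 2 inputs:

```python
import numpy as np

# Step 1-2: define discriminants by assigning illustrative weights and biases.
W = np.array([[ 1.0,  0.0],    # w_1
              [ 0.0,  1.0],    # w_2
              [-1.0, -1.0]])   # w_3
b = np.array([0.0, 0.2, 1.0])  # biases w_10, w_20, w_30

def assign_class(x):
    scores = W @ x + b              # Step 3: y_k(x) = w_k^T x + w_k0 for every k
    return int(np.argmax(scores))   # Step 4: class with the highest score

x = np.array([2.0, 0.5])
print(assign_class(x))   # scores: [2.0, 0.7, -1.5] -> class 0 (C_1)
```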
How are weights and biases assigned for each class in K-class classification?
Each class C_k is assigned a weight vector w_k and a bias w_k0, which together define its discriminant function y_k(x) = w_k^T x + w_k0
What does the discriminant score represent in a K-class classifier?
The discriminant score represents how strongly a data point is associated with a specific class.
What is a perceptron?
- The perceptron was one of the first models that could learn its parameters from data.
- It is a linear model with a step activation function, classifying inputs into two distinct categories.
How does the step activation function in a perceptron work?
- y(x) = f(w^T ϕ(x))
- the step function f(a):
- if a is positive or zero, it outputs +1, indicating one class.
- if a is negative, it outputs -1, indicating the other class.
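A minimal sketch of this forward pass in NumPy; ϕ is taken to be the identity feature map here, and w is an illustrative, already-learned weight vector:

```python
import numpy as np

w = np.array([0.5, -1.0, 0.3])     # illustrative weights (not trained here)

def predict(phi_x):
    a = w @ phi_x                  # pre-activation a = w^T phi(x)
    return 1 if a >= 0 else -1     # step function f(a)

print(predict(np.array([1.0, 0.2, 0.0])))   # a = 0.3  -> +1
print(predict(np.array([0.0, 1.0, 0.0])))   # a = -1.0 -> -1
```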
What criterion is used for training a perceptron?
Training is done using the perceptron criterion, which focuses on minimizing the total error function E_p
What does the total error function E_p represent in perceptron training?
- E_p is a score that tells how “wrong” the perceptron is on the points it misclassifies.
- It sums only over the set M of misclassified points:
- E_p(w) = −∑_{n∈M} w^T ϕ_n t_n, where w^T ϕ_n is the pre-activation score for point n and t_n ∈ {−1, +1} is its target.
Why is direct misclassification using the total number of misclassified patterns not effective in perceptron training?
- The step activation function f(a) is non-linear, so the count of misclassified points is a piecewise-constant function of w: its gradient is zero almost everywhere and undefined where the count changes.
- Gradient-based methods like the perceptron learning rule require an error function whose gradient can be computed.
- E_p replaces the raw count with a continuous, piecewise-linear surrogate error function.
Why does the error function E_p carry a negative sign?
For a misclassified point, w^T ϕ_n t_n < 0; negating it makes each contribution positive, so E_p increases when the perceptron misclassifies points, guiding the algorithm to reduce misclassification by updating the weights.
How does the perceptron update weights to reduce misclassification?
The perceptron updates its weights to minimize E_p via the rule w^(τ+1) = w^(τ) + η ϕ_n t_n for a misclassified point n, moving closer to correctly classifying it.
How is the perceptron criterion applied step by step?
- For each point, the perceptron computes the pre-activation score w^T ϕ_n before applying the step function.
- If the point is misclassified, it multiplies this score by the true label t_n.
- E_p accumulates the negated products over all misclassified points.
Why is it difficult to calculate the gradient over the total error function E_p directly?
The total error function E_p is piecewise linear and not smooth because it sums over all misclassified points, causing abrupt changes when the set of misclassified points changes.
How does stochastic gradient descent (SGD) handle the error function?
SGD takes just one misclassified point at a time, making the perceptron criterion for a single point linear and smooth, allowing direct gradient computation.
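A minimal sketch of this training loop in NumPy, with illustrative linearly separable data; the learning rate eta and the epoch cap are arbitrary choices:

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, epochs=100):
    Phi = np.hstack([X, np.ones((len(X), 1))])   # identity features plus a bias feature
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            # Treat a = 0 as misclassified so learning can start from w = 0.
            if (w @ phi_n) * t_n <= 0:
                w += eta * phi_n * t_n           # SGD step on one misclassified point
                updated = True
        if not updated:                          # no misclassifications left: converged
            break
    return w

# Illustrative linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 0.5]])
t = np.array([1, 1, -1, -1])
print(train_perceptron(X, t))
```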