Linear Models Flashcards
When is a function differentiable, and what is meant by a gradient?
A function is differentiable if it has a derivative at every point of its domain. A gradient is the generalization of the derivative to multiple dimensions: it is the vector of partial derivatives of the function.
Explain linear models in terms of their function and loss function.
Linear function: f_w(x) = Σ_{i=1}^{p} w_i·x_i + w_0
Where:
- w_i is the weight or coefficient for feature i
- x_i is the input value for feature i
- w_0 is the bias term
Training objective: argmin_w L(f_w(X)), where L is the loss function.
In this case the loss function defines the kind of model. For example, logistic regression uses cross-entropy loss, while linear SVMs use hinge loss. The linear formula itself is the same across these models; the loss function defines how the coefficients are updated during the training phase.
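As a minimal illustration of the linear function (the weights and input values below are made up for the example):

```python
import numpy as np

# f_w(x) = Σ_i w_i·x_i + w_0, evaluated for a single input x
w = np.array([0.5, -1.2, 3.0])   # example weights, one per feature
w0 = 0.1                         # bias term
x = np.array([1.0, 2.0, 0.5])    # one input with three feature values

print(w @ x + w0)                # 0.5 - 2.4 + 1.5 + 0.1 = -0.3
```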
Explain the loss function used in linear regression. Furthermore, provide two ways to optimize the coefficients during the training phase. For each, give an upside and a downside.
Linear regression, or ordinary least squares (OLS), takes the sum of squared errors as its loss function:
L(w) = Σ_{n=1}^{N} (y_n − (w·x_n + w_0))²
The SSE loss function measures the squared difference between the predicted and the actual output values, summed over all training examples.
The linear regression model can be trained by computing the unique closed-form solution. The upside is that you obtain the exact optimum; the downside is that it is slow when there are many features, since its time complexity is quadratic in the number of features: O(p²n).
Another way to minimize the loss function is gradient descent, which approximates the optimum by traversing the gradient of the function in steps. This is the faster approach for data with many features and rows. The downside is that it only approximates the solution and requires tuning hyperparameters such as the learning rate.
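A minimal NumPy sketch of the closed-form route (the synthetic data and its true weights are made up for the example):

```python
import numpy as np

# Synthetic data: 100 samples, 3 features, known true weights
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Closed form: append a column of ones for the bias w_0, then solve
# the normal equations (X^T X)·w = X^T y.
Xb = np.hstack([X, np.ones((100, 1))])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w)  # approximately [2.0, -1.0, 0.5, 3.0]
```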
Provide the steps for performing basic gradient descent. Also provide the two most important hyperparameters.
- Start with an initial, random set of weights.
- Given a differentiable loss function L, compute the gradient ∇L(w_s).
- Update all weights slightly in a 'downhill' direction using the update rule (implemented in the sketch after this answer):
Update rule: w_{s+1} = w_s − η·∇L(w_s)
Where:
- w_s is the current weight vector
- η is the learning rate
- ∇L is the gradient of the loss function
- w_{s+1} is the updated weight vector
The hyperparameters are:
- η: too small gives slow convergence; too large gives possible divergence.
- Number of iterations: too few gives no convergence; too many wastes resources.
- (Learning rate decay with rate k: used to shrink the learning rate over time.)
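A minimal sketch of these steps for the (mean) squared error loss, assuming NumPy arrays X of shape (n, p) and y of shape (n,); the default η and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, iterations=1000):
    n, p = X.shape
    w = np.random.default_rng(0).normal(size=p)  # 1. initial random weights
    w0 = 0.0
    for _ in range(iterations):
        residual = X @ w + w0 - y        # prediction error per sample
        grad_w = 2 / n * X.T @ residual  # 2. gradient of the loss
        grad_w0 = 2 * residual.mean()
        w = w - eta * grad_w             # 3. small step 'downhill'
        w0 = w0 - eta * grad_w0
    return w, w0
```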
What is the difference between gradient descent (GD) and stochastic gradient descent (SGD), and how is this expressed in the formulas?
Unlike GD, SGD does not compute the gradient on the entire dataset, but on a single data point i at a time.
GD update rule: w_{s+1} = w_s − η·∇L(w_s) = w_s − (η/n)·Σ_{i=1}^{n} ∇L_i(w_s)
SGD update rule: w_{s+1} = w_s − η·∇L_i(w_s), for one (randomly chosen) point i
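A minimal SGD variant of the sketch above; the only change is that every update uses the gradient ∇L_i of one sample i:

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=10):
    rng = np.random.default_rng(0)
    n, p = X.shape
    w, w0 = np.zeros(p), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):             # visit samples in random order
            residual = X[i] @ w + w0 - y[i]      # error on a single point
            w = w - eta * 2 * residual * X[i]    # gradient of L_i, not of L
            w0 = w0 - eta * 2 * residual
    return w, w0
```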
How is Ridge regression related to the least squares loss function? Also explain the most important hyperparameter and how tuning it can lead to overfitting and underfitting.
Ridge regression also uses the least squares loss function (like linear regression) but adds an additional penalty term to it, the (squared) L2 norm:
L2 norm: α·Σ_{i=1}^{p} w_i²
Where α is the shrinkage hyperparameter or regularization factor. This factor restricts the model to avoid overfitting: increasing α causes more regularization. A large α makes the magnitude of the coefficients smaller, pushing them towards zero. Because the coefficients are squared in the loss function, large coefficients are penalized heavily, which prevents overfitting driven by large coefficients. Models with a too-small α can therefore overfit, while models with a too-large α can underfit.
Ridge regression's loss function is still convex and has a closed-form solution (Cholesky), which should be used when datasets are small. When datasets are larger, gradient descent can be applied (SGD, GD, or conjugate gradient (CG)).
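A small scikit-learn sketch of the α effect (the synthetic dataset and the α grid are arbitrary choices for the example):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Larger alpha -> stronger L2 regularization -> smaller coefficients
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha, solver="cholesky").fit(X, y)
    print(alpha, abs(model.coef_).max())
```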
What are three methods to avoid overfitting besides regularization?
- Add more data
- Remove features
- Scale the data; this typically helps. E.g. put all values in the range [0, 1] (a minimal example follows below).
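For the scaling point, a minimal sketch with scikit-learn's MinMaxScaler (the toy matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(MinMaxScaler().fit_transform(X))  # each feature mapped into [0, 1]
```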
What is meant by L1 regularization?
The "L1 term" appears in some optimization problems and refers to the sum of the absolute values of the variables. In mathematical terms, the L1 term is defined as:
L1 = Σ_i |x_i|
where x is a vector of variables and the absolute value operator |·| gives the magnitude of a value without considering its sign.
The Lasso loss function is called L1 regularization because it adds an L1 norm as a penalty term to the least squares loss function:
L_Lasso(w) = Σ_{n=1}^{N} (y_n − (w·x_n + w_0))² + α·Σ_{i=1}^{p} |w_i|
The Lasso loss function has no closed-form solution. It is still convex, but not strictly convex, and it is not differentiable everywhere: the penalty term creates a 'kink' in the loss function, and at the kink the loss is not differentiable. The weights can be optimized using coordinate descent (CD).
L1 regularization prefers coefficients to be exactly zero, thereby creating sparse models: features whose weight is 0 are effectively removed from the linear function.
The most important hyperparameter is α, which if set too high leads to underfitting, but if set too low leads to overfitting.
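A minimal sketch of the sparsity effect (synthetic data where only 3 of 20 features are informative; α = 1.0 is an arbitrary choice):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)
model = Lasso(alpha=1.0).fit(X, y)
print((model.coef_ != 0).sum(), "of", len(model.coef_), "coefficients are non-zero")
```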
What is coordinate descent, and when should it be used over gradient descent? Explain how it is applied to the Lasso loss function.
Coordinate descent optimizes a single coordinate w_i per iteration. This approach is used over gradient descent when the loss function is not differentiable, as with the Lasso loss function. Coordinate descent has faster iterations and does not require a learning rate hyperparameter. Downsides are that it tends to converge more slowly and cannot be parallelized.
Because the L1 term is not differentiable but still convex, we can compute a subgradient: a generalization of the gradient that also exists at the points where the loss function is not differentiable.
∂_{w_i} L_Lasso(w_i) = ∂_{w_i} L_SSE(w_i) + ∂_{w_i} α|w_i|
0 = (w_i − ρ_i) + α·∂_{w_i} |w_i|
w_i = ρ_i − α·∂_{w_i} |w_i|
The Lasso solution therefore has the form of a soft-thresholding function S: w_i = S(ρ_i, α) = sign(ρ_i)·max(|ρ_i| − α, 0), where ρ_i is the value coordinate w_i would take without regularization.
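A minimal sketch of the soft-thresholding operator S, which is what the per-coordinate Lasso update applies:

```python
import numpy as np

def soft_threshold(rho, alpha):
    # Shrink rho toward zero by alpha; values with magnitude below alpha
    # become exactly zero, which is what makes Lasso models sparse.
    return np.sign(rho) * np.maximum(np.abs(rho) - alpha, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.5, 3.0]), 1.0))
# -> [-2. -0.  0.  2.]
```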
Explain the Elastic net loss function.
Elastic net combines both L1 and L2 regularization, adding them to the least squares error in its loss function:
L_Elastic(w) = Σ_{n=1}^{N} (y_n − (w·x_n + w_0))² + α·ρ·Σ_{i=1}^{p} |w_i| + α·(1−ρ)·Σ_{i=1}^{p} w_i²
- ρ: the L1 ratio. With ρ = 1, L_Elastic = L_Lasso, and with ρ = 0, L_Elastic = L_Ridge.
This allows learning sparse models (like Lasso) while maintaining L2 regularization benefits.
Weights can be optimized using coordinate descent.
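A minimal scikit-learn sketch; its l1_ratio parameter corresponds to ρ above (all values here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)
# l1_ratio = 1.0 would be pure Lasso, 0.0 pure Ridge
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print((model.coef_ != 0).sum(), "non-zero coefficients")
```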
Name three common techniques used for linear classification.
- Convert the target classes {neg, pos} to {0, 1} and treat the problem as a regression task.
  E.g. logistic regression (log loss), ridge classification (least squares + L2 penalty).
- Find the hyperplane that maximizes the margin between the classes.
  E.g. linear support vector machines (hinge loss).
- Neural networks without activation functions.
  E.g. the perceptron (perceptron loss).
Explain how logistic regression can be used as a classifier.
Logistic regression aims to predict the probability that a point belongs to the positive class. Target values (classes) are converted to numerical values such as {0, 1}. It then fits a logistic (or sigmoid) function through these points. In short, we are trying to learn a line that splits the data into two classes. We can then take a probability threshold (e.g. 0.5) to predict which class a data point belongs to.
Logistic regression: ŷ = logistic(f_θ(x)) = 1 / (1 + e^(−f_θ(x)))
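A minimal sketch on a synthetic binary dataset: predict_proba returns the logistic of the linear score, and thresholding it at 0.5 reproduces the built-in class predictions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]     # P(class = 1) = logistic(f(x))
pred = (proba >= 0.5).astype(int)      # apply the 0.5 threshold
print((pred == clf.predict(X)).all())  # True: same as the built-in predict
```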
What is cross-entropy loss and how can it be optimized?
Cross-entropy loss, or log loss, is defined by:
L_log(w) = − Σ_{n=1}^{N} Σ_{c=1}^{C} p_{n,c}·log(q_{n,c})
Where N is the number of instances, C is the number of classes, p_{n,c} is the true probability that instance n has class c (1 for the correct class, 0 otherwise), and q_{n,c} is the predicted probability.
The penalty grows very rapidly as the difference between p and q increases. E.g. if the prediction is 90% that the data point is of class 1, and that is indeed the case, almost no penalty is added to the loss. However, if we are 99% sure it was class 1, but the actual label is 0, we get a huge penalty.
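To make this concrete, a minimal binary log-loss sketch using the probabilities from the example above:

```python
import numpy as np

def log_loss(y_true, q):
    q = np.clip(q, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.sum(y_true * np.log(q) + (1 - y_true) * np.log(1 - q))

print(log_loss(np.array([1]), np.array([0.90])))  # ~0.105: confident and right
print(log_loss(np.array([0]), np.array([0.99])))  # ~4.605: confident and wrong
```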
L1 and L2 penalty terms can both be added in conjunction.
It can be optimized with:
- (Stochastic) gradient descent (only supports L2 regularization)
- Coordinate descent
- Newton-Raphson (only supports L2 regularization). This uses second-order derivatives to estimate the curvature of the loss function at a point. It works well if the problem is (near) convex, but is slow on large datasets.
- Quasi-Newton methods (only L2), which approximate the Hessian. Each step is faster to compute, but more steps are needed.
Data scaling helps convergence and minimizes differences between solvers.
Explain the basic elements of Support Vector Machines (SVMs).
SVMs have the following components:
1. A decision boundary (hyperplane), splitting the data into two classes
2. The support vectors are the training samples closest to the hyperplane
3. The margin is the distance between the separating hyperplane and support vectors
The objective is to maximize the margin in order to avoid overfitting.
In order to maximize the margin we have to find the weights that maximize 1/||w|| or 1/||w||², where ||w||² = Σ_{i=1}^{p} w_i², under the constraint that the classes are linearly separated by the hyperplane.
For example, we can put a constraint g on the hyperplane so that its output must be > 1 for all positive samples:
g(w) = w·x_i + w_0 > 1 ∀i with y_i = 1
We now need to find the weights w that satisfy the constraint g while maximizing the margin.
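A minimal sketch with scikit-learn on two separable blobs (C = 1.0 is an arbitrary regularization choice):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)  # the hyperplane: weights w and bias w_0
print(len(clf.support_vectors_))  # the samples closest to the hyperplane
```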
Explain how the optimization problem of SVMs can be solved using Lagrange multipliers.
- A weight a_i (called a dual coefficient) is assigned to every data point x_i. These weights reflect how much individual points influence the weight vector w. The points with non-zero a_i are the support vectors.
- Subsequently we have to solve the primal objective:
L_Primal = (1/2)·||w||² − Σ_{i=1}^{n} a_i·y_i·(w·x_i + w_0) + Σ_{i=1}^{n} a_i
where y_i = ±1 is the correct label of example x_i, such that
w = Σ_{i=1}^{n} a_i·y_i·x_i
a_i ≥ 0 and Σ_{i=1}^{n} a_i·y_i = 0
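Continuing the sketch above: scikit-learn exposes these dual coefficients directly, where dual_coef_ stores a_i·y_i for the support vectors and all other points have a_i = 0:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_)    # indices of the support vectors (non-zero a_i)
print(clf.dual_coef_)  # a_i·y_i for exactly those points
```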