Linear Models Flashcards

1
Q

When is a function differentiable and what is meant with a gradient?

A

A function is differentiable if it has a derivate in any point of it’s domain. A gradient is the derivate of a function in multiple dimensions. e.g. it is a vector of partial derivatives.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Explain linear models in terms of it’s function and loss function.

A

Linear function: fw (x) = Σ wi * xi + w0
Where :
- wi is the weight or coefficient for feature i
- xi is the input for feature i
- w0 is the bias term

Loss function: argminw L(fw(X))

In this case the loss function defines the kind of model. For example logistic regression uses cross-entropy loss, while Linear SVM’s use Hinge loss. The linear formula itself is equal across these models. The loss function defines how the coefficients are updated during the training phase.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Explain the loss function used in linear regression. Furthermore provide 2 ways to optimize the coefficients this way during the training phase. For each, give an upside an downside.

A

Linear regression, or Ordinary Least Squares, take as a loss function the sum of squared error:
L(w) = Σ (yn - (wnxn + w0))2

The SSE loss function measures the average squared difference between the predicted output values and the actual output values across all training examples.

The linear regression model can be trained by finding the unique closed-form solution. The upside is that you will obtain the best solution, but the downside of this is that it is very slow as its time complexity is quadratic in relation to the number of features. O(p2n)

Another way to find a solution for the loss function is to use gradient descent. Gradient descent approximates by traversing the gradient of the function in steps. This is a faster approach for data with many features and rows. The downsides is that it very easily overfits.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Provide the steps for performing basic gradient descent. Also provide the two most important hyperparameters.

A
  1. Start with initial random set of weights
  2. Given a differentiable loss function L compute the gradient. ∇ L(ws)
  3. Update all weights slightly in a ‘downhill’ direction with:
    Update rule: ws + 1 = ws - ηL(ws)

Where:
- ws is the old weight
- η is the learning rate
- ∇ L is the gradient
- ws + 1 is the new weight

The hyper parameters are:
- η: too small, slow convergence. Too large, possible divergence.
- Iterations: Too small, no convergence. Too large, waste resources.
- (Learning rate decay with rate k: used for changing the learning rate over time.)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the difference between gradient descent (GD) and stochastic gradient descent (SGD), how is this expressed in the formula’s?

A

SGD computes gradients not on the entire dataset, but on a single data point i at a time n, unlike GD.

GD update rule: ws + 1 = ws - ηL(ws) = ws - η/n Σi = 1 to nLi(ws)
SGD update rule: ws + 1 = ws - ηL(ws)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How is Ridge regression related to the least squares loss function? Also explain the most important hyper parameters and how tuning them can lead to overfitting and underfitting.

A

Ridge regression also used the Least Squares Loss function (like linear regression) but adds an additional term to it called the L2 norm:
L2 norm: α Σ(i = 1 to p) w2i

Where α is the shrinkage hyperparameter or regularization factor. This factor restrict a model to avoid overfitting. Increasing α causes more regularization. Having a large α makes the magnitude of the coefficients smaller, making them closer together. AS the coefficients are squared in the loss function. The loss function heavily penalizes large coefficients, and thus prevents overfitting on large coefficients. Models with smaller α are therefore overfitting, while models with a large α (too large) can be underfitting.

Ridge regression is still a convex function and has a closed form solution (Cholesky), which should be used when datasets are small. When datasets are larger, gradient descent can be applied (either SGD, GD or conjugate gradient (CG)).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are three methods to avoid overfitting next to regularization?

A
  1. Add more data
  2. Remove features
  3. scaling the data typically helps. E.g. putting all values in the range of [0, 1]
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is meant with L1 regularization?

A

The “L1 term” is a term that appears in some optimization problems and refers to the sum of the absolute values of the variables. In mathematical terms, the L1 term is defined as:

L1 = Σ|x|

where x is a vector of variables and the absolute value operator “| |” computes the magnitude of a value without considering its sign.

The Lasso Loss function is called L1 regularization because it has an L1 norm as a penalty term to the least squares loss function:

Lasso: Σ (yn - (wnxn + w0))2 + α Σ(i = 1 to p) |wi|

The Lasso function has no closed form solution, but it is still convex, though not strictly convex and therefore is not differentiable. This is because the penalty term creates a ‘kink’ in the loss function where the point at the ‘kink’ is not differentiable. Weight can be optimized by using Coordinate Descent (CD).

L1 prefers coefficients to be exactly zero, and therefore creating sparse models: some features are practically removed from the linear function because the weights are 0.

The most important hyper parameters is alpha, which if set to high leads to underfitting, but if set too low leads to overfitting.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is coordinate descent, and when should it be used over gradient descent? Explain how it is applied to the Lasso loss function.

A

Coordinate descent optimizes a single coordinate wi per iteration. This approach is used over gradient descent when the function is not differentiable anymore. An example would be the Lasso loss function. Coordinate ascent has faster iterations and does not require a learning rate hyperparameter. A downside is that it tends to converge more slowly and can’t be parallelized.

Because the L1 term is not differentiable, but still convex we can compute the subgradient. These are areas of the function where the loss function is differentiable.

δwi LLassowi = δwi LSSE (wi) + δwi α|wi|

0 = (wi - ρi)+ α ⋅ δwi |wi|

wi = ρi - α ⋅ δwi |wi|

Lasso solution has the form of a soft thresholding function S.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Explain the Elastic net loss function.

A

Elastic net combines both L1 and L2 regularization to the Least Squares error in it’s Loss function:
LElastic = Σn to N (yn - (wxn + w0))2 + α ρ Σi = 0 to p |wi| + α(1-ρ) + Σi = 0 to p wi2

  • ρ: the L1 ratio, with ρ = 1, LElastic = LLasso.
    and with ρ = 0, LElastic = LRidge

This allows learning sparse models (like Lasso) while maintaining L2 regularization benefits.

Weights can be optimized using coordinate descent.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Name three common techniques used for linear classification

A
  1. Convert target {neg, pos} classes to {0,1} and treat as a regression task.
    E.g. Logistic regression (Log loss), Ridge classification (Least squares + L2 loss).
  2. Find hyperplane that maximizes the margin between classes.
    E.g. Linear Support Vector Machines (Hinge Loss)
  3. Neural networks without activation functions.
    E.g. Perceptron (Perceptron Loss)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Explain how Logistics Regression can be used as a classifier.

A

Logistic regression aims to predict the probability that a point belongs to the positive class. Target values (classes) are converted to numerical values like {0, 1}. It then fits a logistic (or sigmoid) function through these points. In short we are trying to learn a line that splits the set of data into two classes. We can then take a probability theshold (e.g. 0.5) to predict to which class a data point belongs to.

Logistic regression: y^ = logistic(fθ(x)) = 1 / (1 + e-fθ(x)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is Cross-entropy loss and how can it be optimized.

A

Cross entropy loss or Log loss is defined by:
Llog(w) = - Σ(n=1 to N) Σ(c=1 to C) pn,clog(qn,c)

Where N are the number of instances and C are the number of classes.

The penalty grows exponentially as differences between p and q increases. E.g. if the prediction is 90% that the data point is of class 1, and that is indeed the case, there will be almost no penalty added as a loss. However if we are 99% sure it was class 1, but the actual label is 0, then we have a huge penalty.

Both L1 and L2 terms can be added in conjunction.

It can be optimized with:
- (Stochastic) Gradient descent (only supports for L2 regularization)
- Coordinate descent
- Newton-Rhapson (only supports L2 regularization). This calculates the second order gradient, where it estimates the curvature of the loss function at a point. This works well if the solution is near convex, but is slow on large datasets.
- Quasi-Newton methods (only L2), which are approximations of hessians, which are faster to compute but need multiple steps.

Data scaling helps convergence and minimizes differences between solvers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Explain the basic elements of Support Vector Machines (SVMs).

A

SVMs have the following components:
1. A decision boundary (hyperplane), splitting the data into two classes
2. The support vectors are the training samples closest to the hyperplane
3. The margin is the distance between the separating hyperplane and support vectors
The objective is to maximize the margin in order to avoid overfitting.

In order to maximize the margin we have to find the weights that maximize 1/||w|| or 1/||w||2, where ||w|| = Σ(i to p) wi2 under the constraint that each class can be linearly separated by a hyperplane.

For example we can put a constraint on the hyperplane so that it must be > 1 for all positive samples:
g(w) = wxi + w0 > 1 ∀i, y(i) = 1
We now need to find the weights w that satisfy g but maximize f.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Explain how to optimization problem of SVM’s can be solved using Lagrangian multipliers.

A
  • A weight ai (called a dual coefficient) is assigned to every data point xi. These weights reflect how much individual points influence the weight w. The points with non-zero ai are the support vectors.
  • Subsequently we have to solve the Primal objective:
    yi +- 1 is the correct example xi
    LPrimal = (1/2)||w||2 - Σ(i=1 to n)aiyi(wxi + w0) + Σ(i=1 to n) ai
    so that
    w = Σ(i=1 to n)aiyi = 0
    a >= 0 and Σ(i=1 to n)aiyi = 0
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How can SVM’s be used if the data is not linearly seperable?

A

We relax the constraint by allowing an error ξi =max(0, 1- yi(wxi + w0)), creating a soft-margin.

The sum over all points is called hinge loss: Σ(i to n) ξi

This gives us the objective with hyper parameter C:

L(w) = ||w||2 + C Σ(i to n)ξi

This means that if the actual class is 1 and the prediction value is >= 1, there will not be any loss. The loss is added linearly based on the prediction value if < 1. In practice this creates some slack. Hinge acts as L1 regularization, leading to sparse models.

C is the inverse regularization strength: Larger C means fewer support vectors, smaller margin and more overfitting. Smaller C means more support vectors, wider margin, less overfitting (but chance of underfitting).

We can also square ξi to obtain the squared hinge loss, meaning the loss will grow exponentially if < 1 (e.g. prediction value of 0.90 has a small error if the actual label is 1, but if the prediction value is -1.5, there will be a large loss added).

17
Q

How can we solve the objective of soft-margin SVMs?

A

Soft-margin SVMs can, alternatively to Lagrange multipliers, be solved with gradient descent.

  • Squared hinge loss is differentiable
  • Hinge loss is not differentiable but the function is still convex and has a subgradient.

Using gradient descent here is a good for large datasets, but it doesn not yield support vectors or kernalization.

18
Q

Name two ways in which linear classification models can be used for multi-class (non-binary) classification tasks.

A
  • One-vs-rest: Learn a binary model for each class vs. all other classes. Create as many binary models as there are classes.
  • One-vs-One: Learn a binary model for every combination of two classes. For C classes, this results in (C(C-1))/2 binary models. Each point is then classified according to a majority vote amongst all models.

One-vs-one requires more models than one-vs-rest but training each model is faster. Recommended for algorithms that learn well on small datasets.

19
Q

Explain the perceptron.

A

The Pereceptron represents a single neuron with inputs xi, as bis w0 and output y. Each connection has a weight wi. The node outputs:
y^: Σ(i to n) xiwi + w0.
The activation function predicts 1 if xw + w0 > 0, -1 otherwise.

Weights can be learned with gradient descent or Hinge loss, where weights are only updated on misclassification.