Linear Models Flashcards
When is a function differentiable, and what is meant by a gradient?
A function is differentiable if it has a derivative at every point of its domain. A gradient is the generalization of the derivative to multiple dimensions: it is the vector of partial derivatives of the function.
Explain linear models in terms of their function and loss function.
Linear function: f_w(x) = Σ_{i=1}^{p} w_i·x_i + w_0
Where:
- w_i is the weight or coefficient for feature i
- x_i is the input value for feature i
- w_0 is the bias term
Training objective: argmin_w L(f_w(X)), where L is the loss function.
In this case the loss function defines the kind of model. For example, logistic regression uses cross-entropy loss, while linear SVMs use hinge loss. The linear formula itself is the same across these models; the loss function defines how the coefficients are updated during the training phase.
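As a minimal illustration of the linear function (the weights and input values below are made up for the example):

```python
import numpy as np

# f_w(x) = Σ_i w_i·x_i + w_0, evaluated for a single input x
w = np.array([0.5, -1.2, 3.0])   # example weights, one per feature
w0 = 0.1                         # bias term
x = np.array([1.0, 2.0, 0.5])    # one input with three feature values

print(w @ x + w0)                # 0.5 - 2.4 + 1.5 + 0.1 = -0.3
```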
Explain the loss function used in linear regression. Furthermore, provide two ways to optimize the coefficients during the training phase. For each, give an upside and a downside.
Linear regression, or ordinary least squares (OLS), takes the sum of squared errors as its loss function:
L(w) = Σ_{n=1}^{N} (y_n − (w·x_n + w_0))²
The SSE loss function measures the squared difference between the predicted and the actual output values, summed over all training examples.
The linear regression model can be trained by computing the unique closed-form solution. The upside is that you obtain the exact optimum; the downside is that it is slow when there are many features, since its time complexity is quadratic in the number of features: O(p²n).
Another way to minimize the loss function is gradient descent, which approximates the optimum by traversing the gradient of the function in steps. This is the faster approach for data with many features and rows. The downside is that it only approximates the solution and requires tuning hyperparameters such as the learning rate.
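A minimal NumPy sketch of the closed-form route (the synthetic data and its true weights are made up for the example):

```python
import numpy as np

# Synthetic data: 100 samples, 3 features, known true weights
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Closed form: append a column of ones for the bias w_0, then solve
# the normal equations (X^T X)·w = X^T y.
Xb = np.hstack([X, np.ones((100, 1))])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w)  # approximately [2.0, -1.0, 0.5, 3.0]
```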
Provide the steps for performing basic gradient descent. Also provide the two most important hyperparameters.
- Start with an initial, random set of weights.
- Given a differentiable loss function L, compute the gradient ∇L(w_s).
- Update all weights slightly in a 'downhill' direction using the update rule (implemented in the sketch after this answer):
Update rule: w_{s+1} = w_s − η·∇L(w_s)
Where:
- w_s is the current weight vector
- η is the learning rate
- ∇L is the gradient of the loss function
- w_{s+1} is the updated weight vector
The hyperparameters are:
- η: too small gives slow convergence; too large gives possible divergence.
- Number of iterations: too few gives no convergence; too many wastes resources.
- (Learning rate decay with rate k: used to shrink the learning rate over time.)
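A minimal sketch of these steps for the (mean) squared error loss, assuming NumPy arrays X of shape (n, p) and y of shape (n,); the default η and iteration count are arbitrary choices:

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, iterations=1000):
    n, p = X.shape
    w = np.random.default_rng(0).normal(size=p)  # 1. initial random weights
    w0 = 0.0
    for _ in range(iterations):
        residual = X @ w + w0 - y        # prediction error per sample
        grad_w = 2 / n * X.T @ residual  # 2. gradient of the loss
        grad_w0 = 2 * residual.mean()
        w = w - eta * grad_w             # 3. small step 'downhill'
        w0 = w0 - eta * grad_w0
    return w, w0
```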
What is the difference between gradient descent (GD) and stochastic gradient descent (SGD), and how is this expressed in the formulas?
Unlike GD, SGD does not compute the gradient on the entire dataset, but on a single data point i at a time.
GD update rule: w_{s+1} = w_s − η·∇L(w_s) = w_s − (η/n)·Σ_{i=1}^{n} ∇L_i(w_s)
SGD update rule: w_{s+1} = w_s − η·∇L_i(w_s), for one (randomly chosen) point i
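A minimal SGD variant of the sketch above; the only change is that every update uses the gradient ∇L_i of one sample i:

```python
import numpy as np

def sgd(X, y, eta=0.01, epochs=10):
    rng = np.random.default_rng(0)
    n, p = X.shape
    w, w0 = np.zeros(p), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):             # visit samples in random order
            residual = X[i] @ w + w0 - y[i]      # error on a single point
            w = w - eta * 2 * residual * X[i]    # gradient of L_i, not of L
            w0 = w0 - eta * 2 * residual
    return w, w0
```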
How is Ridge regression related to the least squares loss function? Also explain the most important hyperparameter and how tuning it can lead to overfitting and underfitting.
Ridge regression also uses the least squares loss function (like linear regression) but adds an additional penalty term to it, the (squared) L2 norm:
L2 norm: α·Σ_{i=1}^{p} w_i²
Where α is the shrinkage hyperparameter or regularization factor. This factor restricts the model to avoid overfitting: increasing α causes more regularization. A large α makes the magnitude of the coefficients smaller, pushing them towards zero. Because the coefficients are squared in the loss function, large coefficients are penalized heavily, which prevents overfitting driven by large coefficients. Models with a too-small α can therefore overfit, while models with a too-large α can underfit.
Ridge regression's loss function is still convex and has a closed-form solution (Cholesky), which should be used when datasets are small. When datasets are larger, gradient descent can be applied (SGD, GD, or conjugate gradient (CG)).
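A small scikit-learn sketch of the α effect (the synthetic dataset and the α grid are arbitrary choices for the example):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Larger alpha -> stronger L2 regularization -> smaller coefficients
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha, solver="cholesky").fit(X, y)
    print(alpha, abs(model.coef_).max())
```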
What are three methods to avoid overfitting besides regularization?
- Add more data
- Remove features
- Scale the data; this typically helps. E.g. put all values in the range [0, 1] (a minimal example follows below).
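For the scaling point, a minimal sketch with scikit-learn's MinMaxScaler (the toy matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(MinMaxScaler().fit_transform(X))  # each feature mapped into [0, 1]
```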
What is meant by L1 regularization?
The "L1 term" appears in some optimization problems and refers to the sum of the absolute values of the variables. In mathematical terms, the L1 term is defined as:
L1 = Σ_i |x_i|
where x is a vector of variables and the absolute value operator |·| gives the magnitude of a value without considering its sign.
The Lasso loss function is called L1 regularization because it adds an L1 norm as a penalty term to the least squares loss function:
L_Lasso(w) = Σ_{n=1}^{N} (y_n − (w·x_n + w_0))² + α·Σ_{i=1}^{p} |w_i|
The Lasso loss function has no closed-form solution. It is still convex, but not strictly convex, and it is not differentiable everywhere: the penalty term creates a 'kink' in the loss function, and at the kink the loss is not differentiable. The weights can be optimized using coordinate descent (CD).
L1 regularization prefers coefficients to be exactly zero, thereby creating sparse models: features whose weight is 0 are effectively removed from the linear function.
The most important hyperparameter is α, which if set too high leads to underfitting, but if set too low leads to overfitting.
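A minimal sketch of the sparsity effect (synthetic data where only 3 of 20 features are informative; α = 1.0 is an arbitrary choice):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)
model = Lasso(alpha=1.0).fit(X, y)
print((model.coef_ != 0).sum(), "of", len(model.coef_), "coefficients are non-zero")
```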
What is coordinate descent, and when should it be used over gradient descent? Explain how it is applied to the Lasso loss function.
Coordinate descent optimizes a single coordinate w_i per iteration. This approach is used over gradient descent when the loss function is not differentiable, as with the Lasso loss function. Coordinate descent has faster iterations and does not require a learning rate hyperparameter. Downsides are that it tends to converge more slowly and cannot be parallelized.
Because the L1 term is not differentiable but still convex, we can compute a subgradient: a generalization of the gradient that also exists at the points where the loss function is not differentiable.
∂_{w_i} L_Lasso(w_i) = ∂_{w_i} L_SSE(w_i) + ∂_{w_i} α|w_i|
0 = (w_i − ρ_i) + α·∂_{w_i} |w_i|
w_i = ρ_i − α·∂_{w_i} |w_i|
The Lasso solution therefore has the form of a soft-thresholding function S: w_i = S(ρ_i, α) = sign(ρ_i)·max(|ρ_i| − α, 0), where ρ_i is the value coordinate w_i would take without regularization.
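A minimal sketch of the soft-thresholding operator S, which is what the per-coordinate Lasso update applies:

```python
import numpy as np

def soft_threshold(rho, alpha):
    # Shrink rho toward zero by alpha; values with magnitude below alpha
    # become exactly zero, which is what makes Lasso models sparse.
    return np.sign(rho) * np.maximum(np.abs(rho) - alpha, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.5, 3.0]), 1.0))
# -> [-2. -0.  0.  2.]
```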
Explain the Elastic net loss function.
Elastic net combines both L1 and L2 regularization, adding them to the least squares error in its loss function:
L_Elastic(w) = Σ_{n=1}^{N} (y_n − (w·x_n + w_0))² + α·ρ·Σ_{i=1}^{p} |w_i| + α·(1−ρ)·Σ_{i=1}^{p} w_i²
- ρ: the L1 ratio. With ρ = 1, L_Elastic = L_Lasso, and with ρ = 0, L_Elastic = L_Ridge.
This allows learning sparse models (like Lasso) while maintaining L2 regularization benefits.
Weights can be optimized using coordinate descent.
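A minimal scikit-learn sketch; its l1_ratio parameter corresponds to ρ above (all values here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)
# l1_ratio = 1.0 would be pure Lasso, 0.0 pure Ridge
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print((model.coef_ != 0).sum(), "non-zero coefficients")
```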
Name three common techniques used for linear classification.
- Convert the target classes {neg, pos} to {0, 1} and treat the problem as a regression task.
  E.g. logistic regression (log loss), ridge classification (least squares + L2 penalty).
- Find the hyperplane that maximizes the margin between the classes.
  E.g. linear support vector machines (hinge loss).
- Neural networks without activation functions.
  E.g. the perceptron (perceptron loss).
Explain how logistic regression can be used as a classifier.
Logistic regression aims to predict the probability that a point belongs to the positive class. Target values (classes) are converted to numerical values such as {0, 1}. It then fits a logistic (or sigmoid) function through these points. In short, we are trying to learn a line that splits the data into two classes. We can then take a probability threshold (e.g. 0.5) to predict which class a data point belongs to.
Logistic regression: ŷ = logistic(f_θ(x)) = 1 / (1 + e^(−f_θ(x)))
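A minimal sketch on a synthetic binary dataset: predict_proba returns the logistic of the linear score, and thresholding it at 0.5 reproduces the built-in class predictions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]     # P(class = 1) = logistic(f(x))
pred = (proba >= 0.5).astype(int)      # apply the 0.5 threshold
print((pred == clf.predict(X)).all())  # True: same as the built-in predict
```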
What is cross-entropy loss and how can it be optimized?
Cross-entropy loss, or log loss, is defined by:
L_log(w) = − Σ_{n=1}^{N} Σ_{c=1}^{C} p_{n,c}·log(q_{n,c})
Where N is the number of instances, C is the number of classes, p_{n,c} is the true probability that instance n has class c (1 for the correct class, 0 otherwise), and q_{n,c} is the predicted probability.
The penalty grows very rapidly as the difference between p and q increases. E.g. if the prediction is 90% that the data point is of class 1, and that is indeed the case, almost no penalty is added to the loss. However, if we are 99% sure it was class 1, but the actual label is 0, we get a huge penalty.
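To make this concrete, a minimal binary log-loss sketch using the probabilities from the example above:

```python
import numpy as np

def log_loss(y_true, q):
    q = np.clip(q, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.sum(y_true * np.log(q) + (1 - y_true) * np.log(1 - q))

print(log_loss(np.array([1]), np.array([0.90])))  # ~0.105: confident and right
print(log_loss(np.array([0]), np.array([0.99])))  # ~4.605: confident and wrong
```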
L1 and L2 penalty terms can both be added in conjunction.
It can be optimized with:
- (Stochastic) gradient descent (only supports L2 regularization)
- Coordinate descent
- Newton-Raphson (only supports L2 regularization). This uses second-order derivatives to estimate the curvature of the loss function at a point. It works well if the problem is (near) convex, but is slow on large datasets.
- Quasi-Newton methods (only L2), which approximate the Hessian. Each step is faster to compute, but more steps are needed.
Data scaling helps convergence and minimizes differences between solvers.
Explain the basic elements of Support Vector Machines (SVMs).
SVMs have the following components:
1. A decision boundary (hyperplane), splitting the data into two classes
2. The support vectors are the training samples closest to the hyperplane
3. The margin is the distance between the separating hyperplane and support vectors
The objective is to maximize the margin in order to avoid overfitting.
In order to maximize the margin we have to find the weights that maximize 1/||w|| or 1/||w||², where ||w||² = Σ_{i=1}^{p} w_i², under the constraint that the classes are linearly separated by the hyperplane.
For example, we can put a constraint g on the hyperplane so that its output must be > 1 for all positive samples:
g(w) = w·x_i + w_0 > 1 ∀i with y_i = 1
We now need to find the weights w that satisfy the constraint g while maximizing the margin.
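A minimal sketch with scikit-learn on two separable blobs (C = 1.0 is an arbitrary regularization choice):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)  # the hyperplane: weights w and bias w_0
print(len(clf.support_vectors_))  # the samples closest to the hyperplane
```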
Explain how the optimization problem of SVMs can be solved using Lagrange multipliers.
- A weight a_i (called a dual coefficient) is assigned to every data point x_i. These weights reflect how much individual points influence the weight vector w. The points with non-zero a_i are the support vectors.
- Subsequently we have to solve the primal objective:
L_Primal = (1/2)·||w||² − Σ_{i=1}^{n} a_i·y_i·(w·x_i + w_0) + Σ_{i=1}^{n} a_i
where y_i = ±1 is the correct label of example x_i, such that
w = Σ_{i=1}^{n} a_i·y_i·x_i
a_i ≥ 0 and Σ_{i=1}^{n} a_i·y_i = 0
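Continuing the sketch above: scikit-learn exposes these dual coefficients directly, where dual_coef_ stores a_i·y_i for the support vectors and all other points have a_i = 0:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_)    # indices of the support vectors (non-zero a_i)
print(clf.dual_coef_)  # a_i·y_i for exactly those points
```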