05_supervised learning methods Flashcards
What is a linear model?
they assume a linear relationship in the underlying data
rather simple, but they convey many of the concepts used in other, more complex models
e.g. linear regression, linear classification
What does a linear regression do?
find weights w0 and w1
so that the linear function f(x) = w1 · x + w0
with input x and output y
best fits the data containing ground-truth values y’
–> how can we learn w0 and w1 from data?
How do linear regression models learn the weights for function f?
minimize the squared errors of the predictions
with respect to the ground truth
for each data point:
least-squares fitting
for data point j: [y_j' - f(x_j; w0, w1)]^2
How does least squares fitting work?
1) define a loss (objective) function that is the sum of the squared errors over all data points
2) find the best-fit model parameters by minimizing the loss function with respect to those two model parameters
(set the first derivatives to zero –> closed-form expressions for the best-fit w0 and w1)
least squares + a linear model function: the resulting minimum of the loss function is GLOBAL (the squared loss is convex in the weights)
–> the model immediately learns the best-possible solution (see the sketch below)
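A minimal NumPy sketch of this closed-form fit (the function name fit_line and the toy data are illustrative, not from the card):

    import numpy as np

    def fit_line(x, y):
        # closed-form least-squares solution for f(x) = w1 * x + w0:
        # w1 = cov(x, y) / var(x), w0 = mean(y) - w1 * mean(x)
        w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        w0 = y.mean() - w1 * x.mean()
        return w0, w1

    # toy data: ground truth y' = 2x + 1 plus noise
    x = np.linspace(0.0, 10.0, 50)
    y = 2 * x + 1 + np.random.normal(scale=0.5, size=x.size)
    print(fit_line(x, y))  # roughly (1.0, 2.0)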
When can linear functions be used as a classifier?
When the data is linearly separable (and only then!)
How do linear functions work as a classifier?
1) define a decision boundary
f(x, w) = w · x = w0 + w1·x1 + w2·x2 (with x0 = 1, so w0 acts as the bias)
such that class 1: f(x, w) ≥ 0
and class 0: f(x, w) < 0
2) define class assignments through a threshold function, as in the sketch below
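A small sketch of such a threshold classifier (the weight vector here is a hypothetical decision boundary):

    import numpy as np

    def classify(x, w):
        # threshold function: class 1 if w . x >= 0, else class 0
        return 1 if np.dot(w, x) >= 0 else 0

    w = np.array([-1.0, 0.5, 0.5])  # hypothetical weights (w0, w1, w2)
    x = np.array([1.0, 2.0, 1.0])   # input with x0 = 1 for the bias term
    print(classify(x, w))           # -> 1, since w . x = 0.5 >= 0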
What is the perceptron learning rule?
for each training example, the weights are adjusted in the direction of the prediction error:
wi <- wi + α · (y' - f(x)) · xi, where the step size α is called the LEARNING RATE
by iteratively running this update over the training data multiple times, the weights can be learned so that the model performs properly
–> the solution is learned iteratively (a sketch follows below)
–> this does not imply that the model is learning something useful (e.g. the dataset might not be suitable)
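A minimal sketch of the perceptron update loop (the learning rate, epoch count, and toy AND-style data are assumptions):

    import numpy as np

    def perceptron(X, y, alpha=0.1, epochs=20):
        # X carries a leading column of 1s so w[0] is the bias w0
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xj, yj in zip(X, y):
                pred = 1 if np.dot(w, xj) >= 0 else 0
                w += alpha * (yj - pred) * xj  # perceptron learning rule
        return w

    # linearly separable toy data: class 1 only for x1 = x2 = 1 (logical AND)
    X = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)
    y = np.array([0, 0, 0, 1])
    print(perceptron(X, y))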
Why can linear models often not be applied?
they have low predictive capacity
–> they can only be applied to data whose underlying relationship is (approximately) linear
How can linear models be fit for more capacity?
the base function can be changed to a polynomial:
f(x) = w0 + w1·x + w2·x^2 + ... + wp·x^p
= Σ_i wi·x^i
the resulting regression problem is still linear in the weights to be found, so the same properties apply:
we can compute the parameters wi that minimize the loss with a closed-form expression; as before, this is the best-possible solution (see the sketch below)
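A sketch of the polynomial fit via a design matrix (the degree p and the toy data are assumptions; np.linalg.lstsq solves the least-squares problem):

    import numpy as np

    def fit_poly(x, y, p):
        # design matrix with columns x^0, x^1, ..., x^p; the model stays linear in w
        X = np.vander(x, p + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w  # w[i] multiplies x^i

    x = np.linspace(-1.0, 1.0, 30)
    y = 1 - 2 * x + 3 * x**2 + np.random.normal(scale=0.1, size=x.size)
    print(fit_poly(x, y, p=2))  # roughly [1, -2, 3]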
What can be changed in a polynomial model in order to get more capacity?
p, the degree of the polynomial (it determines how many weights and powers of x we have; we have to find the best p)
–> when p is too low, the fit looks almost like a constant and misses the structure (underfitting); when p is too high, the fit follows the noise and generalizes poorly (overfitting)
What does Occam’s Razor mean?
among models that explain the data equally well, the one with fewer parameters is to be preferred
What is the goal of any regression model?
to minimize the loss over the data, ie prediction errors
What is a way to prevent a model from overfitting?
regularize the loss based on the learned weights
L'(x, w) = (1/N) Σ_j L_j + α · R(w)
(mean loss over the N samples plus a weighted regularization term R(w))
What is the L2-norm?
the squared L2-norm is ||w||_2^2 = w · w = Σ_i wi^2
What happens when you add an L2 regularization term to a polynomial model?
the regularization term adds the squared magnitude of the weights to the loss function
–> the fit can no longer drive the loss down by making the weights arbitrarily large
with increasing α, all coefficients wi drop in magnitude, leading to smoother fits (see the sketch below)
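A sketch of ridge regression in closed form (α, the polynomial features, and the toy data are assumptions; the α·I term implements the L2 penalty):

    import numpy as np

    def fit_ridge(X, y, alpha=1.0):
        # closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y
        A = X.T @ X + alpha * np.eye(X.shape[1])
        return np.linalg.solve(A, X.T @ y)

    x = np.linspace(-1.0, 1.0, 30)
    X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features
    y = 1 - 2 * x + 3 * x**2 + np.random.normal(scale=0.1, size=x.size)
    print(fit_ridge(X, y, alpha=0.1))      # coefficients shrink as alpha grows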
What is L2 regularization also called?
ridge regression
What is L1 regularization also called?
LASSO regression
(least absolute shrinkage and selection operator)
What is the L1-norm?
||w||_1 = Σ_i |wi|
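A quick check of both norms with NumPy (the example vector is arbitrary):

    import numpy as np

    w = np.array([3.0, -4.0])
    print(np.linalg.norm(w, 1))       # L1: |3| + |-4| = 7
    print(np.linalg.norm(w, 2) ** 2)  # squared L2: 3^2 + (-4)^2 = 25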