05_supervised learning methods Flashcards

1
Q

What is a linear model?

A

they assume linearity in the underlying data

rather simple, but they convey many of the concepts used in other, more complex models

e.g. linear regression, linear classification

2
Q

What does a linear regression do?

A

find weights w0 and w1
so that the linear function f(x) = w1·x + w0,
with input x and output y = f(x),
best fits the data containing ground-truth values y’

–> how can we learn w0 and w1 from the data?

3
Q

How do linear regression models learn the weights for function f?

A

minimize the squared error of the prediction
with respect to the ground truth
for each data point:

least-squares fitting

for data point j: [yj’ − f(xj; w0, w1)]^2

4
Q

How does least squares fitting work?

A

1) define a loss (objective) function that is the sum of the squared errors over all data points

2) find the best-fit model parameters by minimizing the loss function with respect to those two model parameters
(first derivative –> closed-form expression for best-fit w0 and w1)

least squares + a linear model function: the resulting minimum of the loss function is GLOBAL (because of this combination)
–> the model immediately learns the best possible solution
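
A minimal sketch of this closed-form fit, assuming NumPy (not named in the card) and made-up toy data:

import numpy as np

# toy data roughly following y' = 2x + 1 (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.random.normal(0.0, 0.1, size=x.shape)

# closed-form least-squares solution for f(x) = w1*x + w0
X = np.column_stack([np.ones_like(x), x])      # design matrix with columns [1, x]
w0, w1 = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizes the sum of squared errors
print(w0, w1)                                  # close to (1, 2)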

5
Q

When can linear functions be used as a classifier?

A

When the data is linearly separable (and only then!)

6
Q

How do linear functions work as a classifier?

A

1) define decision boundary
f(x, w) = w · x = w0 + w1·x1 + w2·x2

such that class 1: f(x,w) ≥ 0
and class 0: f(x,w) < 0

2) we can define class assignments through a threshold function
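
A minimal sketch of such a threshold classifier, assuming NumPy; the weight values are made up for illustration:

import numpy as np

def linear_classify(x, w):
    # w = [w0, w1, w2], x = [x1, x2]; threshold function: class 1 if f(x, w) >= 0, else class 0
    f = w[0] + w[1] * x[0] + w[2] * x[1]
    return 1 if f >= 0 else 0

w = np.array([-1.0, 2.0, 0.5])          # hypothetical weights defining the decision boundary
print(linear_classify([1.0, 0.0], w))   # 1, since f = -1 + 2*1 + 0.5*0 = 1 >= 0
print(linear_classify([0.0, 0.0], w))   # 0, since f = -1 < 0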

7
Q

What is the perceptron learning rule?

A

weights are adjusted by a step size that is called the LEARNING RATE.
by iteratively running this algorithm over your training data multiple times, the weights can be learned so that the model performs properly

–> solution is learned iteratively

–> does not imply that the model is learning something useful (eg dataset might not be suitable)
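
A minimal sketch of the classic perceptron update rule, w ← w + η·(y’ − ŷ)·x, assuming NumPy and a bias folded into the weight vector; the data is made up:

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=10):
    # X: samples with a leading 1 for the bias term, y: labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                  # iterate over the training data multiple times
        for xj, yj in zip(X, y):
            pred = 1 if w @ xj >= 0 else 0   # threshold function on the current weights
            w += lr * (yj - pred) * xj       # adjust weights by the learning rate
    return w

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                   # linearly separable AND problem
print(perceptron_train(X, y))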

8
Q

Why can linear models often not be applied?

A

they have low predictive capacity
–> can only be applied to data that is linearly distributed (regression) or linearly separable (classification)

9
Q

How can linear models be extended for more capacity?

A

the base function can be changed to a polynomial:

f(x) = w0 + w1·x + w2·x^2 + … = Σ wi·x^i

the resulting regression problem is still linear in the weights to be found, therefore the same properties apply:

we can compute the parameters wi that minimize the loss with a closed-form expression; this is by construction the best possible solution
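
A minimal sketch of a polynomial fit that is still linear in the weights, assuming NumPy; the toy data is made up:

import numpy as np

x = np.linspace(-1.0, 1.0, 20)
y = 0.5 - x + 2.0 * x**2 + np.random.normal(0.0, 0.05, size=x.shape)

p = 2                                               # polynomial degree
X = np.column_stack([x**i for i in range(p + 1)])   # features 1, x, x^2 -> still linear in the weights
w = np.linalg.lstsq(X, y, rcond=None)[0]            # same closed-form least squares as before
print(w)                                            # close to [0.5, -1, 2]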

10
Q

What can be changed for a polynomial linear function model in order to get more capacity?

A

p, the polynomial degree (it determines how many weights and powers of x we have; we have to find the best p)

–> when p is too low, the fit is barely better than a constant; when p is too high, the fit is also poor on unseen data (underfitting vs. overfitting)

11
Q

What does Occam’s Razor mean?

A

among models that explain the data equally well, the one with fewer parameters is to be preferred

12
Q

What is the goal of any regression model?

A

to minimize the loss over the data, i.e. the prediction errors

13
Q

What is a way to prevent a model from overfitting?

A

regularize the loss based on the learned weights

L’(x,w) = mean loss over N samples + regularization term
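
A minimal sketch of such a regularized loss, assuming NumPy; the L2 penalty shown here is one possible choice, not prescribed by the card:

import numpy as np

def regularized_loss(y_true, y_pred, w, alpha=0.1):
    mean_loss = np.mean((y_true - y_pred) ** 2)   # mean loss over the N samples
    penalty = alpha * np.sum(w ** 2)              # regularization term based on the learned weights (L2 here)
    return mean_loss + penalty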

14
Q

What is the L2-norm?

A

||w||2^2 = w · w

15
Q

What happens when you add a L2 regularization term to a polynomial model?

A

the regularization term adds the (squared) weights to the loss function

–> the fit can no longer drive the loss as low as it wants by using large weights

with increasing alpha, all coefficients wi drop in magnitude, leading to smoother fits
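
A minimal sketch of how increasing alpha shrinks polynomial coefficients, assuming scikit-learn (which the card does not name):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-1.0, 1.0, 30).reshape(-1, 1)
y = np.sin(3 * x).ravel() + np.random.normal(0.0, 0.1, size=30)

X = PolynomialFeatures(degree=9).fit_transform(x)    # high-capacity polynomial features
for alpha in (1e-4, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.abs(model.coef_).max())           # coefficient magnitudes drop as alpha grows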

16
Q

What is L2 regularization also called?

A

ridge regression

17
Q

What is L1 regularization also called?

A

LASSO regression

least absolute shrinkage and selection operator

18
Q

What is the L1-norm?

A

||w||1 = Σ |wi|

19
Q

What does a regularization term consist of?

A

alpha multiplied by the L2- or L1-norm of the weights

alpha: regularization parameter (controls the strength of the regularization)

20
Q

What happens when you add a L1 regularization term to a polynomial model?

A

while L2 regularization modulates all coefficients wi in the same way,
L1 regularization aims to set less meaningful coefficients to zero

–> performs feature selection

tries to bring as many of the coefficients to 0 as it can
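
A minimal sketch of this feature-selection effect, assuming scikit-learn; the data is made up:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1]             # only the first two features carry information

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                         # coefficients of the three uninformative features are driven to ~0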

21
Q

How can we get closest to the minimum loss with a L2 norm in a 2D loss space spanned by w1 and w2?

A

we can only reach points on a circle; we look for the point on the circle that puts us closest to the global minimum of the unregularized loss

this would be suboptimal if the global minimum lies inside the circle, since we could not reach it exactly

22
Q

How can we get closest to the minimum loss with a L1 norm in a 2D loss space spanned by w1 and w2?

A

we can only reach points on a diamond (a square rotated by 45°) defined by |w1| + |w2| = const; its corners lie on the axes, which is why individual weights tend to be set exactly to zero

23
Q

Which type of regularization should we use? L1 or L2?

A
  • L2 regularization prevents the model from overfitting by modulating the impact of its input features in a homogeneous way
  • L1 regularization prevents the model from overfitting by focusing on those features which seem to be most important

both can be deployed in any machine learning model that minimizes a loss function

24
Q

What are pros for linear models? (4)

A
  • easy to understand and implement; resource efficient even for large and sparse data sets
  • least squares method always provides best-fit results if the data is appropriate
  • good interpretability due to linear nature of the model
  • easy to regularize
25
Q

What are cons for linear models? (2)

A
  • limited flexibility: data distribution must be brought into a form that is linear (regression) or linearly separable (classification)
  • susceptible to overfitting if not combined with a regularizer
26
Q

What is especially important when working with nearest-neighbor models?

A

Scaling of the data!

because the model relies on distances between data points

27
Q

What are nearest neighbor models?

A

they are non-parametric and simply rely on distances between data points

distances are defined through a metric

nearest-neighbor methods use these distances for classification and regression tasks

28
Q

What is a common distance metric in nearest neighbor models?

A

Euclidean distance

d = √(Δx1^2 + Δx2^2)

29
Q

What is k-nearest neighbor classification?

A

k-nearest neighbor (knn) classifiers predict class affiliation of an unseen data point based on majority voting of its k nearest neighbors in a seen data set with ground-truth labels

they are not trained in the usual sense: the distance of each unseen data point to all seen data points is calculated
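
A minimal sketch of knn classification, assuming scikit-learn; the seen data is made up:

from sklearn.neighbors import KNeighborsClassifier

X_seen = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]        # seen data with ground-truth labels
y_seen = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_seen, y_seen)     # "fitting" just stores the seen data
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))                      # majority vote of the 3 nearest neighbors -> [0 1]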

30
Q

What is the impact of hyperparameter k on knn-classification?

A

k has an impact on how well the model generalizes to unseen data:
we have to perform a hyperparameter search

31
Q

What can nearest-neighbor models also be implemented as?

A

as regressors that are able to interpolate between and smooth the available data
(you only have to be aware that this exists)

32
Q

How do we regularize nearest neighbor methods?

A

regularization is done by varying the hyperparameter k

a low k is susceptible to small-scale variations and noise

a high k may miss local details

33
Q

What are pros for knn classification? (3)

A
  • easy to understand, implement and results are highly interpretable
  • non-parametric
  • works reliably even with small data sets
34
Q

What are cons for knn classification? (2)

A
  • calculating distances computationally intensive for large data sets
  • performs poorly on sparse data sets; prone to the curse of dimensionality
35
Q

What is the curse of dimensionality?

A

the number of data points needed grows exponentially with the data dimensionality

if the feature space is insufficiently sampled, the model does not have enough data points to train properly –> this happens when we have too many features for the available data

36
Q

What is a decision tree?

A

a decision tree is a rule-based structure for prediction of scalar output from (potentially) multi-dimensional input data

37
Q

What are decision tree properties?

A

hyperparameters:

  • tree depth (how many layers do we have)
  • number of leaves (outputs, how many classes do we have)
38
Q

What is decision tree learning? How do decision trees learn?

A

a greedy divide-and-conquer strategy is adopted to train decision trees on a training data set in a recursive fashion (see the sketch after this list):

1) identify the “most important feature” (greedy)
2) split the samples across this feature (divide)
3) if all samples of a branch are of the same class, create a leaf, unless the maximum number of leaves has been reached, and stop
4) if not all samples of a branch are of the same class, recursively apply the algorithm to that branch until the maximum depth is reached
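
A minimal sketch of training a depth- and leaf-limited decision tree, assuming scikit-learn (the card does not prescribe a library); the toy data is made up:

from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 1], [0, 0], [0, 1]]     # toy samples; only the first feature matters
y = [1, 1, 0, 0]

# tree depth and number of leaves as hyperparameters
tree = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=4).fit(X, y)
print(tree.predict([[1, 1], [0, 0]]))    # [1 0]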

39
Q

What is the “most important feature” in decision trees?

A

generally, it means the feature that makes the most difference to the classification of a single sample

there are different implementations of this definition eg utilizing information entropy or other useful measures

40
Q

Can single decision trees generalize well?

A

No

Single decision trees typically generalize only to some extent due to their limited depth and size;
they have low model capacity

41
Q

How can decision trees be optimized so that they perform better?

A

ensemble methods - group the trees, which increases their capacity

by combining a large number of decision trees and letting them make decisions in an averaged vote or majority vote, we increase their capacity

42
Q

What are random forests?

A

trees in a random forest are intentionally kept shallower than in other decision tree models; they therefore act as “weak learners” that perform badly by themselves.

however, combining a large number of weak learners performs much better than the individual trees.

the intuition behind this is that weak learners “on average” compensate for each other’s individual shortcomings.
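
A minimal sketch of an ensemble of shallow trees, assuming scikit-learn; the data set and parameters are chosen only for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# many intentionally shallow ("weak") trees combined by majority vote
forest = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
print(cross_val_score(forest, X, y).mean())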

43
Q

What are gradient-boosted tree-based models?

A

decision tree ensembles that are built successively in such a way that every newly created tree compensates for the shortcomings of the previous trees

gradient boosting refers to the fact that new base learners (individual decision trees) are fitted to the model’s pseudo-residuals, based on the gradient of the loss of the ensemble

–> the loss decreases with each added tree

44
Q

What tasks can gradient-boosted models do?

A

they are very successful in regression and classification tasks
and still represent state-of-the-art in traditional ML

common implementations:
- XGBoost
- LightGBM
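
A minimal sketch of a gradient-boosted tree model, using scikit-learn’s GradientBoostingClassifier purely for illustration (the card names XGBoost and LightGBM instead); the data set is made up:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# each new tree is fitted to the pseudo-residuals of the current ensemble
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbt.fit(X, y)
print(gbt.score(X, y))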

45
Q

What are pros of tree-based models? (4)

A
  • extremely versatile and robust
  • can be trained on small amounts of data
  • non-parametric
  • interpretability: tree-based models are able to compute “feature importances”
46
Q

What are cons of the tree-based models? (1)

A
  • decision boundaries and regression predictions may be discrete instead of continuous