Week 1: Introduction & Overview Flashcards
What is a labelled dataset?
A dataset where we know the output y corresponding to each input x
What are input features?
The variables that make up the input x (its components x1, ..., xp), which the model uses to predict the output y.
What is the Euclidean distance?
Euclidean distance is a measure of the straight-line distance between two points in a p-dimensional space: the square root of the sum of the squared differences between corresponding coordinates.
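A minimal sketch of the definition above (function and variable names are illustrative, not from the course material):

```python
import math

def euclidean_distance(x, z):
    """Straight-line distance between two points in p-dimensional space:
    the square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

# Classic 3-4-5 right triangle: distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```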
What is the loss function?
A function that measures the discrepancy between a model's prediction and the true output y; training typically amounts to minimising the average loss (e.g. squared error) over the training data.
Why is kNN with k = 1 more erratic than with k = 3?
Because we view the output value y as a random variable prone to (modeling) noise. With k = 1 the prediction relies on a single data point and is thus more erratic; with k = 3 averaging over three neighbours smooths out some of that noise.
Why is kNN a nonparametric method?
Because its predictions are computed directly from the training data rather than from a fixed, finite set of parameters learned during training; the effective complexity of the model grows with the amount of training data.
Why is k in kNN a hyperparameter?
Since k isn’t learned by the kNN algorithm itself, but rather chosen beforehand. (Regular parameter values are learned when training the model).
Will k = 1 in kNN most often lead to over- or underfitting?
Overfitting.
Why does increasing the hyperparameter k in kNN often improve generalization beyond the training data?
Because the predictions will be less sensitive to peculiarities of the training data and therefore less overfitted.
If k in kNN is sufficiently large, what will be the (negative) consequence of the predictions?
The neighbourhood of k points will include all training data points, and the model reduces to predicting the mean of the training outputs for any new input.
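The two behaviours above (erratic k = 1 predictions, and k = N collapsing to the training mean) can be sketched with a tiny kNN regressor; data and names here are illustrative, not from the course:

```python
def knn_regress(X_train, y_train, x_new, k):
    """Predict by averaging the outputs of the k nearest training points
    (squared Euclidean distance gives the same ordering as Euclidean)."""
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(X_train[i], x_new)))
    neighbours = order[:k]
    return sum(y_train[i] for i in neighbours) / k

X = [[1.0], [2.0], [3.0], [4.0]]
y = [1.0, 2.0, 3.0, 4.0]

print(knn_regress(X, y, [1.2], 1))  # nearest point only → 1.0
print(knn_regress(X, y, [1.2], 4))  # k = N reduces to the mean of y → 2.5
```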
What is a systematic way of choosing a good k in kNN?
To use cross-validation.
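As a sketch of what "use cross-validation" means in practice, here is leave-one-out cross-validation on a toy, roughly linear dataset (all names and data are illustrative assumptions, not from the course):

```python
def loo_cv_error(X, y, k):
    """Leave-one-out CV: hold out each point in turn, predict it with kNN
    trained on the rest, and average the squared errors."""
    errors = []
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        order = sorted(range(len(X_tr)),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(X_tr[j], X[i])))
        pred = sum(y_tr[j] for j in order[:k]) / k
        errors.append((pred - y[i]) ** 2)
    return sum(errors) / len(errors)

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0.1, 1.2, 1.9, 3.1, 3.9, 5.0]

# Pick the k with the smallest estimated test error.
best_k = min(range(1, 5), key=lambda k: loo_cv_error(X, y, k))
```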
Why and when should we re-scale the input variables when using kNN?
Re-scaling should be done if the intervals within which the individual inputs range are on very different scales. An input ranging within, e.g., [1000, 1500] will contribute far more to the Euclidean distance than an input ranging within, e.g., [0, 2].
(General) Should we perform re-scaling (normalisation) on the training and/or test data?
Normalisation should be performed on the training data only. Then apply that same scaling to future test data points as well; never re-scale the full data set all at once.
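A minimal sketch of this train-only workflow with min-max scaling (function names and data are illustrative assumptions): the scaling parameters are fitted on the training data and then reused, unchanged, on test points.

```python
def fit_minmax(X_train):
    """Learn per-feature min and max from the TRAINING data only."""
    cols = list(zip(*X_train))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_minmax(X, mins, maxs):
    """Apply the previously fitted scaling to any data (e.g. test points)."""
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
            for row in X]

X_train = [[1000.0, 0.0], [1500.0, 2.0]]  # features on very different scales
X_test = [[1250.0, 1.0]]

mins, maxs = fit_minmax(X_train)          # fit on training data only
print(apply_minmax(X_test, mins, maxs))   # → [[0.5, 0.5]]
```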
What are the normal equations and how do we write them?
The normal equations are obtained by setting the partial derivatives of the sum of squared errors equal to zero. They read X^T X theta = X^T y, and their solution, theta_hat = (X^T X)^(-1) X^T y (when X^T X is invertible), gives the least-squares estimates of the regression model coefficients (theta).
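A sketch of solving the normal equations numerically on a toy dataset lying exactly on the line y = 1 + 2x (the data here is an illustrative assumption):

```python
import numpy as np

# Design matrix with a column of ones for the intercept term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])  # exactly y = 1 + 2x

# Solve the normal equations X^T X theta = X^T y for theta.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers [intercept, slope], approximately [1, 2]
```

In practice `np.linalg.lstsq` (or a QR decomposition) is preferred over forming X^T X explicitly, for numerical stability, but the normal equations are the conceptual starting point.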
What is the key goal of supervised (machine) learning?
Using training data containing examples of how an input x is related to an output y, learn a mathematical model that can predict the output for NEW test data where only x is known.