Lecture 4 - k-Nearest Neighbour Flashcards
What are the 3 ways generalisation error can be expressed?
- Bias: this part of the generalisation error is due to wrong assumptions, e.g. assuming the data is linear when it is actually quadratic. A high-bias model tends to underfit.
- Variance: this part of the generalisation error is due to the model's excessive sensitivity to small variations in the training data. A high-variance model tends to overfit.
- Irreducible error: typically due to the natural variability of the data.
What effect does changing a model's complexity have?
- Increasing a model’s complexity will typically increase its variance and reduce its bias.
- Reducing a model's complexity increases its bias and reduces its variance.
What is Bias - In regards to Bias/Variance Tradeoff?
Bias: The amount by which the expected model predictions differ from the true value or target over the training data.
High-Bias algorithms tend to be less flexible with stronger assumptions about the target function. They usually have lower predictive performance.
What is Underfitting the data?
Underfitting occurs when the model is too simple to learn the underlying structure of the data.
The main options to minimise this problem are:
- Selecting a more powerful model, with more parameters;
- Feeding better features to the learning algorithm (feature engineering);
- Reducing the constraints on the model (e.g., reducing the regularisation hyperparameter).
Underfitting is caused by high bias (see the sketch below for the first option).
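A minimal sketch of the first option (a more powerful model), assuming scikit-learn; the synthetic quadratic dataset and the degree-2 choice are illustrative, not from the lecture:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative quadratic data: a plain linear model is too simple for it.
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.5, size=100)

linear = LinearRegression().fit(X, y)                    # high bias: underfits
quadratic = make_pipeline(PolynomialFeatures(degree=2),  # more parameters
                          LinearRegression()).fit(X, y)

print("linear R^2:   ", linear.score(X, y))
print("quadratic R^2:", quadratic.score(X, y))
```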
What is Variance - In regards to Bias/Variance Tradeoff?
Variance: How much the model changes depending on (subtle) changes in the training data.
High-Variance algorithms tend to be very flexible with weaker assumptions about the target function. They may account for every single training example, thereby overfitting the training data.
What is Overfitting the data?
Overfitting occurs when the model is strongly influenced by the specifics of the
training data.
The main options to minimise this problem are:
- Getting more training data (provided it comes from the same data-generating mechanism);
- Employing techniques like k-fold cross-validation to assess model performance on multiple subsets of the data;
- Reducing the dimensionality of the problem (e.g. by using feature selection or dimensionality reduction methods; see the sketch below);
- Increasing the constraints on the model (e.g., increasing the regularisation hyperparameter).
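One of the options above, reducing dimensionality, might look like this in scikit-learn (a sketch; the digits dataset and the choice of 20 components are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# Project the 64 pixel features down to 20 principal components before fitting,
# shrinking the model's flexibility and hence its tendency to overfit.
model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())
```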
What is the overall outcome of High bias vs High Variance?
High bias: poor performance on the training set.
A larger or new set of features may help.
ML models with higher variance (more complex) may help.
=====
High variance: good performance on the training set but poor performance on the validation set.
Getting more training instances may help.
A smaller set of features may help.
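As a rough diagnostic sketch (assuming scikit-learn; the model and dataset are illustrative), compare the training score with the validation score:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Low training score                        -> high bias (underfitting)
# High training score, low validation score -> high variance (overfitting)
print("train:     ", model.score(X_train, y_train))
print("validation:", model.score(X_val, y_val))
```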
Why do we need to tune hyperparameters?
A hyperparameter is a parameter of a learning algorithm (not of the model). It is not affected by the learning algorithm itself, and it must be set prior to training and remains constant during training.
Using the test set to pick the best hyperparameter values tends to make the model not perform well on new data other than the test set; this is why we also need a separate validation set.
What is k-Fold Cross-Validation?
In holdout validation, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set) and select the model that performs best on the validation set.
In k-fold cross-validation, the training set is split into k subsets (folds). In each of the k iterations, the model is trained and validated on a different combination of these folds, and the validation scores are averaged.
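A minimal k-fold cross-validation sketch with scikit-learn (k = 5 and the Iris dataset are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fold is used once for validation while the model is
# trained on the remaining 4 folds; the 5 scores are then averaged.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())
```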
What is a Grid Search?
Grid search is a way of tuning hyperparameters: you list candidate values for each hyperparameter, train and evaluate a model for every combination of those values (typically with cross-validation), and keep the combination that performs best.
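A sketch of a grid search with scikit-learn's GridSearchCV (the hyperparameter grid and dataset are illustrative); it also covers the holdout step described below, because the best model is refit on the full training set and then evaluated on the test set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Try every combination of the listed hyperparameter values with 5-fold CV.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "p": [1, 2]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
# refit=True (the default) retrains the best model on the whole training set;
# the test score is then an estimate of the generalisation error.
print(search.score(X_test, y_test))
```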
What is the error rate?
The error rate on the test set is an estimation of the generalisation error (or out-of-sample error).
What do you do after the holdout validation process?
After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalisation error.
What is early stopping?
Another way to regularise iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.
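A sketch of early stopping with scikit-learn's SGDRegressor (the dataset, learning rate, and epoch count are illustrative); the idea is to train one epoch at a time and keep the model with the lowest validation error:

```python
from copy import deepcopy
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# warm_start=True makes each call to fit() continue from the current weights,
# so every loop iteration runs exactly one more training epoch.
sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                   learning_rate="constant", eta0=0.0005, random_state=42)

best_val_error, best_model = float("inf"), None
for epoch in range(500):
    sgd.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, sgd.predict(X_val))
    if val_error < best_val_error:      # keep the model with the lowest
        best_val_error = val_error      # validation error seen so far
        best_model = deepcopy(sgd)
```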
What are Regularised Linear Models?
A good way to reduce overfitting is to regularise the model (i.e., to constrain it): the fewer parameters it has, the harder it will be for it to overfit the data. For example, a simple way to regularise a polynomial model is to reduce the number of polynomial degrees.
For a linear model, regularisation is typically achieved by constraining the weights of the model. We will firstly look at three different ways to constrain the weights:
Ridge Regression
Lasso Regression
Elastic Net
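As a sketch (assuming scikit-learn; the alpha and l1_ratio values are illustrative, not from the lecture), the three regularised models look like this:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=42)

# alpha controls how strongly the weights are constrained (alpha=0 would be
# plain, unregularised Linear Regression).
ridge = Ridge(alpha=1.0).fit(X, y)                     # squared l2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                     # l1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of l1 and l2

print(ridge.coef_, lasso.coef_, enet.coef_, sep="\n")
```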
What is Ridge Regression?
Ridge Regression (the squared ℓ2 regularisation) is a regularised version of Linear Regression, with a regularisation term (REFER TO SLIDES FOR FORMULA) added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularisation term should only be added to the cost function during training. Once the model is trained, you evaluate the model's performance using the unregularised performance measure.
REFER TO SLIDES FOR FORMULA
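The slides have the lecture's exact notation; a standard way to write the Ridge cost function (the bias term θ₀ is conventionally not regularised, and some texts scale the penalty by 1/2) is:

```latex
J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \theta_i^{2}
```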
What is Lasso Regression?
Lasso Regression (Least Absolute Shrinkage and Selection Operator Regression) is a regularised version of Linear Regression that adds the ℓ1 norm of the weight vector to the cost function as the regularisation term.
REFER TO SLIDES FOR FORMULA
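Again, the slides have the lecture's notation; a standard way to write the Lasso cost function is:

```latex
J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert
```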
What is the difference between Ridge and Lasso regularisations (regressions)?
REFER TO SLIDES. In short, Ridge (ℓ2) shrinks all weights towards zero but rarely makes them exactly zero, whereas Lasso (ℓ1) tends to drive the weights of the least important features to exactly zero, performing a form of automatic feature selection.
What is Elastic Net?
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. It
combines both regularisation terms together.
REFER TO SLIDES FOR FORMULA
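The exact formula is on the slides; one common parameterisation, with a mixing ratio r between 0 (pure Ridge) and 1 (pure Lasso), is:

```latex
J(\theta) = \mathrm{MSE}(\theta)
          + r\,\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert
          + \frac{1 - r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^{2}
```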
What is Softmax Regression?
Logistic Regression can be generalised to support multiple classes directly. This is called Softmax Regression or Multinomial Logistic Regression.
Given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability p̂_k of each class by applying the softmax function (also called the normalised exponential) to the scores:
REFER TO SLIDES FOR FORMULA
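The slide has the lecture's notation; in a standard formulation, the score for class k is a linear function of x, and the softmax turns the scores into probabilities:

```latex
s_k(\mathbf{x}) = \boldsymbol{\theta}^{(k)\top} \mathbf{x},
\qquad
\hat{p}_k = \frac{\exp\!\bigl(s_k(\mathbf{x})\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(s_j(\mathbf{x})\bigr)}
```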
What is the Cost Function of Softmax Regression?
Similarly to the Logistic Regression, the objective of training the model is to estimate a high probability for the target class (and therefore a low probability for the other classes).
REFER TO SLIDES FOR FORMULA
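The slide has the exact form; the standard cost function here is the cross-entropy, averaged over the m training instances and summed over the K classes (y_k^(i) is 1 if class k is the target class of instance i, else 0):

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\!\bigl(\hat{p}_k^{(i)}\bigr)
```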
What are Petal Length and Petal Width?
What is k-Nearest Neighbour?
Basic idea: similar instances will be closer to each other in the feature space.
Based on measures of similarity: the distance between the instances in the feature
space is defined by a distance metric.
Used for classification and regression.
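A minimal classification sketch with scikit-learn (the Iris dataset and k = 5 are illustrative); KNeighborsRegressor works the same way but averages the neighbours' target values instead of voting:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Predict the majority class among the k closest training instances
# (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```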
What is the Minkowski Distance?
Minkowski distance can be used to calculate the distance between two data instances xi and xj with n features:
REFER TO SLIDES FOR FORMULA
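The slide has the lecture's notation; the standard form of the Minkowski distance of order p between x_i and x_j with n features is (p = 1 gives the Manhattan distance, p = 2 the Euclidean distance):

```latex
d(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{l=1}^{n} \lvert x_{i,l} - x_{j,l} \rvert^{p} \right)^{1/p}
```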
k-Nearest Neighbour Example
REFER TO SLIDES
What are some Advantages and Disadvantages (tradeoffs) of k-Nearest Neighbour?
- Memory-intensive, but simple and intuitive.
- Expensive testing or prediction.
- Works better when the density of the feature space is high and similar for each class.
- Requires a meaningful distance metric.
- Noise and outliers may have a negative effect.
Tradeoff:
- Small values of k: risk of overfitting.
- Higher values of k: risk of underfitting.
What is Feature Normalisation -> linked to Attribute-weighted k-NN?
Feature normalisation: the larger a feature's scale, the larger its influence on the distance metric, so features are rescaled to comparable ranges before computing distances.
REFER TO SLIDES FOR FORMULA
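A sketch of why this matters for k-NN, assuming scikit-learn (the wine dataset is illustrative; its features span very different scales):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)

raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

# Without normalisation, large-scale features dominate the distance metric.
print("raw:   ", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:", cross_val_score(scaled, X, y, cv=5).mean())
```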
Multiclass Classification Example
REFER TO SLIDES