lecture 2 - linear models Flashcards
What is the goal of linear regression?
To predict the value of a target variable t for a new input x, given a training dataset comprising N observations {x_n, t_n}.
How does the simplest approach to linear regression work?
In the simplest approach, linear regression involves directly constructing an appropriate function y(x) such that for new inputs x, the function predicts corresponding values of t.
How can linear regression be approached from a probabilistic perspective?
From a probabilistic perspective, we aim to model the predictive distribution p(t∣x), which expresses the uncertainty about the value of t for each value of x.
Why is the probabilistic approach useful in linear regression?
The probabilistic approach allows us to make predictions of t for any new value of x in a way that minimizes the expected value of a suitably chosen loss function.
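For example (a sketch of the standard formulation; the Gaussian noise model here is an assumption taken from the usual textbook treatment, not stated on this card):
- model the target as p(t | x, w, β) = N( t | y(x, w), β⁻¹ ), i.e. the underlying function plus Gaussian noise with precision β
- under the squared loss, the expected loss is minimized by predicting the conditional mean E[t | x], which for this model is simply y(x, w)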
What is the goal of (polynomial) curve fitting?
To exploit the training set to discover the underlying function and make predictions for new inputs, even though individual observations are corrupted by noise.
Why is polynomial curve fitting considered linear in the parameters?
- Although the function y(x,w) is nonlinear in x, it is linear in the parameters w.
- This means the model is a weighted sum of fixed basis functions (the powers of x), with the coefficients w entering only linearly.
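As a minimal sketch in NumPy (the names design_matrix and predict are mine, chosen for illustration): writing y(x, w) = w_0 + w_1 x + ... + w_M x^M shows that the prediction is a dot product between the fixed feature vector (1, x, x^2, ..., x^M) and the parameter vector w.

```python
import numpy as np

def design_matrix(x, M):
    """N x (M+1) matrix whose n-th row is the feature vector (1, x_n, x_n^2, ..., x_n^M)."""
    return np.vander(x, M + 1, increasing=True)

def predict(x, w):
    """y(x, w) = sum_j w_j * x^j: nonlinear in the input x, but linear in the parameters w."""
    return design_matrix(x, len(w) - 1) @ w
```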
How are the coefficients in polynomial curve fitting determined?
- by fitting the polynomial to the training data.
- this is done by minimizing an error function that measures the misfit between the model predictions and the target values.
What is the error function used in polynomial curve fitting?
- E(w) = (1/2) Σ_{n=1}^N ( y(x_n, w) - t_n )^2
- the sum runs over the N data points in the training set
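A sketch of how this error could be computed in NumPy (illustrative only; the helper name is mine):

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 over the N training points (x_n, t_n)."""
    y = np.vander(x, len(w), increasing=True) @ w   # y(x_n, w) for every input in x
    return 0.5 * np.sum((y - t) ** 2)
```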
How is the optimal polynomial determined in polynomial curve fitting?
- finding the value of w that minimizes the error function.
- Since the error function is quadratic in w, it is convex, ensuring a unique minimum that can be calculated directly.
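A sketch of the closed-form fit (assuming NumPy; np.linalg.lstsq computes the least-squares solution, which coincides with the unique minimizer of the quadratic error):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Return w* minimizing E(w). Because E is quadratic in w, w* is the
    least-squares solution of Phi w = t, where Phi is the design matrix."""
    Phi = np.vander(x, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_star
```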
What is model selection in polynomial curve fitting?
- Model selection is the process of choosing the order M of the polynomial.
- A higher-order polynomial may fit the training data better, but it can result in overfitting if it poorly represents the underlying function.
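One simple way to carry out this choice in practice (a sketch using a held-out validation set; this particular procedure is my illustration, not something prescribed on the card):

```python
import numpy as np

def select_order(x_train, t_train, x_val, t_val, max_M=9):
    """Pick the polynomial order M with the lowest error on held-out data."""
    best_M, best_err = 0, np.inf
    for M in range(max_M + 1):
        Phi = np.vander(x_train, M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)      # fit on the training set
        y_val = np.vander(x_val, M + 1, increasing=True) @ w   # predict the held-out targets
        err = 0.5 * np.sum((y_val - t_val) ** 2)               # validation error E(w*)
        if err < best_err:
            best_M, best_err = M, err
    return best_M
```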
What is overfitting in polynomial curve fitting?
- Overfitting occurs when the model fits the training data too closely, including noise, resulting in a poor ability to generalize to new data.
- This often happens with high-order polynomials.
How can overfitting be detected?
- Overfitting can be detected by comparing the error on a training set and a separate test set.
- Overfitted models show very low training error but high test error.
What metric is used to assess generalization performance?
- The root-mean-square error (RMSE)
- for each choice of M, we evaluate E_RMS = sqrt( 2 E(w*) / N ) separately on the training set and on the test set
- as M grows, the training-set RMSE keeps falling (the overfitted model matches the training points almost exactly), while the test-set RMSE eventually rises
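A sketch of the RMSE computation (the sqrt(2 E(w*) / N) form is the standard definition; the helper name is mine):

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w*) / N): measured on the same scale as t and
    comparable across data sets of different size N."""
    y = np.vander(x, len(w), increasing=True) @ w
    E = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2.0 * E / len(x))
```

Evaluating rms_error on the training set and on the test set for each M makes overfitting visible: the training curve keeps falling while the test curve eventually rises.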
Why does performance get worse as the polynomial order M increases?
As M increases:
- The magnitude of the coefficients increases significantly.
- Higher-order polynomials fit the training data exactly, including noise, rather than capturing the true trend.
- This results in poor generalization to new data.
What happens when M=9 (10 coefficients) and there are 10 data points in polynomial curve fitting?
- The training error goes to zero because the polynomial exactly fits all data points.
- The test error becomes very large due to overfitting, and the function exhibits wild oscillations.
What happens when M=0 and there are 10 data points in polynomial curve fitting?
- the model is a constant (a horizontal line)
- y(x) = w_0
What happens when M=1 and there are 10 data points in polynomial curve fitting?
- the model is a straight line
- y(x) = w_0 + w_1 x
What general principle can be learned from overfitting?
Overfitting is a general property of maximum likelihood estimation. It can also occur in deep learning when training on a small dataset, leading to poor generalization.
What is the effect of adding observations to the training set on training error?
- Adding more data provides more information for the model to learn from.
- Training error may increase slightly because the model has to fit a more diverse dataset, but this improves generalization and reduces overfitting.
What is the effect of adding observations to the training set on test error?
Test error typically decreases because a larger training set helps the model generalize better to unseen data, reducing overfitting.
What is the effect of removing observations from the training set on training error?
- Removing observations reduces the information available for training, making the model less robust.
- Training error might decrease because the model fits the reduced dataset better, but this often leads to overfitting and worse generalization.
What is the effect of removing observations from the training set on test error?
Test error typically increases because the model is more prone to overfitting the smaller training set.
What is the effect of adding observations to the test set?
- Adding observations to the test set does not affect the training error.
- It provides a more reliable estimate of the model’s generalization performance, as the test set becomes more representative of the underlying data distribution.
What is the effect of removing observations from the test set?
- Removing observations reduces the ability to accurately assess the generalization error.
- Test error might appear to improve due to fewer diverse samples, but this can give a misleading picture of the model’s true performance.
Summary: How do changes in the training and test sets affect model performance?
- Adding training data improves generalization and reduces test error.
- Removing training data increases overfitting and worsens test error.
- Adding test data provides a more reliable estimate of generalization.
- Removing test data reduces the ability to assess model performance accurately.
Why can we afford to use more complex models with larger datasets?
With larger datasets, the model has more diverse data points to learn from, which reduces the risk of overfitting and improves generalization performance, allowing more flexible models to be used.
What is the goal of ridge regression (L2 regularization)?
To discourage large coefficients by adding a penalty term to the error function, ensuring the model is simpler and more generalizable. This reduces overfitting.
How is the error function modified in ridge regression?
- The modified error function is \tilde{E}(w) = (1/2) Σ_{n=1}^N ( y(x_n, w) - t_n )^2 + (lambda/2) ||w||^2
- the penalty term is proportional to the squared norm of the coefficient vector, ||w||^2 = w_0^2 + w_1^2 + ... + w_M^2
- lambda controls the relative importance of the regularization term compared with the sum-of-squares error
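A sketch of the closed-form ridge solution implied by this error function (reusing the polynomial design matrix; note that in practice the bias coefficient w_0 is often left out of the penalty, which this sketch does not do):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Minimize E~(w) = 0.5 * sum_n (y(x_n, w) - t_n)^2 + 0.5 * lam * ||w||^2.
    Setting the gradient to zero gives the regularized normal equations
    (lam * I + Phi^T Phi) w = Phi^T t."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)
```

Larger lam shrinks the coefficients more aggressively; lam = 0 recovers the unregularized least-squares fit.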