lecture 2 - linear models Flashcards

1
Q

What is the goal of linear regression?

A

To predict the value of a target variable t for a new input x, given a training dataset comprising N observations {x_n, t_n}.

2
Q

How does the simplest approach to linear regression work?

A

In the simplest approach, linear regression involves directly constructing an appropriate function y(x) such that for new inputs x, the function predicts corresponding values of t.

3
Q

How can linear regression be approached from a probabilistic perspective?

A

From a probabilistic perspective, we aim to model the predictive distribution p(t∣x), which expresses the uncertainty about the value of t for each value of x.

4
Q

Why is the probabilistic approach useful in linear regression?

A

The probabilistic approach allows us to make predictions of t for any new value of x, minimizing the expected value of a suitably chosen loss function.

5
Q

What is the goal of (polynomial) curve fitting?

A

To exploit the training set to discover the underlying function and make predictions for new inputs, even though individual observations are corrupted by noise.

6
Q

Why is polynomial curve fitting considered linear in the parameters?

A
  • Although the function y(x,w) is nonlinear in x, it is linear in the parameters w.
  • This means the model is a weighted sum of fixed functions of x, so it is a linear function of the coefficients w.
7
Q

How are the coefficients in polynomial curve fitting determined?

A
  • by fitting the polynomial to the training data.
  • this is done by minimizing an error function that measures the misfit between the model predictions and the target values.
8
Q

What is the error function used in polynomial curve fitting?

A
  • E(w) = (1/2) Σ_{n=1}^{N} ( y(x_n, w) − t_n )^2
  • the sum runs over all N data points in the training set
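A minimal NumPy sketch of this sum-of-squares error for a polynomial model (the data and function names are illustrative, not from the lecture):

```python
import numpy as np

def poly_sse(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
    for a polynomial y(x, w) = w_0 + w_1 x + ... + w_M x^M."""
    y = np.polyval(w[::-1], x)   # np.polyval expects the highest-order coefficient first
    return 0.5 * np.sum((y - t) ** 2)

# toy data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
print(poly_sse(np.zeros(4), x, t))   # error of the all-zero cubic
```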
9
Q

How is the optimal polynomial determined in polynomial curve fitting?

A
  • finding the value of w that minimizes the error function.
  • Since the error function is quadratic in w, it is convex, ensuring a unique minimum that can be calculated directly.
10
Q

What is model selection in polynomial curve fitting?

A
  • Model selection is the process of choosing the order M of the polynomial.
  • A higher-order polynomial may fit the training data better, but it can result in overfitting if it poorly represents the underlying function.
11
Q

What is overfitting in polynomial curve fitting?

A
  • Overfitting occurs when the model fits the training data too closely, including noise, resulting in a poor ability to generalize to new data.
  • This often happens with high-order polynomials.
12
Q

How can overfitting be detected?

A
  • Overfitting can be detected by comparing the error on a training set and a separate test set.
  • Overfitted models show very low training error but high test error.
13
Q

What metric is used to assess generalization performance?

A
  • The root-mean-square error (RMSE), E_RMS = sqrt( 2 E(w*) / N )
  • for each choice of M, we evaluate E_RMS on the training set and the test set separately
  • as M grows, training RMSE keeps falling, but test RMSE rises once the model starts to overfit
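A sketch of this comparison, assuming a toy setup of noisy sine-curve samples (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def rmse(w, x, t):
    y = np.polyval(w[::-1], x)
    return np.sqrt(np.mean((y - t) ** 2))   # equals sqrt(2 E(w*) / N)

for M in range(10):
    # least-squares fit of an order-M polynomial (np.polyfit returns highest order first)
    w = np.polyfit(x_train, t_train, deg=M)[::-1]
    print(M, rmse(w, x_train, t_train), rmse(w, x_test, t_test))
```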
14
Q

Why does performance get worse as the polynomial order M increases?

A

As M increases:

  1. The magnitude of the coefficients increases significantly.
  2. Higher-order polynomials fit the training data exactly, including noise, rather than capturing the true trend.
  3. This results in poor generalization to new data.
15
Q

What happens when M=9 (10 coefficients) and there are 10 data points in polynomial curve fitting?

A
  1. The training error goes to zero because the polynomial exactly fits all data points.
  2. The test error becomes very large due to overfitting, and the function exhibits wild oscillations.
16
Q

What happens when M=0 and there are 10 data points in polynomial curve fitting?

A
  • the model is a constant (a horizontal line)
  • y(x,w) = w_0
17
Q

What happens when M=1 and there are 10 data points in polynomial curve fitting?

A
  • the model is a straight line
  • y(x,w) = w_0 + w_1 x
18
Q

What general principle can be learned from overfitting?

A

Overfitting is a general property of maximum likelihood estimation. It can also occur in deep learning when training on a small dataset, leading to poor generalization.

19
Q

What is the effect of adding observations to the training set on training error?

A
  • Adding more data provides more information for the model to learn from.
  • Training error may increase slightly because the model has to fit a more diverse dataset, but this improves generalization and reduces overfitting.
20
Q

What is the effect of adding observations to the training set on test error?

A

Test error typically decreases because a larger training set helps the model generalize better to unseen data, reducing overfitting.

21
Q

What is the effect of removing observations from the training set on training error?

A
  • Removing observations reduces the information available for training, making the model less robust.
  • Training error might decrease because the model fits the reduced dataset better, but this often leads to overfitting and worse generalization.
22
Q

What is the effect of removing observations from the training set on test error?

A

Test error typically increases because the model is more prone to overfitting the smaller training set.

23
Q

What is the effect of adding observations to the test set?

A
  • Adding observations to the test set does not affect the training error.
  • It provides a more reliable estimate of the model’s generalization performance, as the test set becomes more representative of the underlying data distribution.
24
Q

What is the effect of removing observations from the test set?

A
  • Removing observations reduces the ability to accurately assess the generalization error.
  • Test error might appear to improve due to fewer diverse samples, but this can give a misleading picture of the model’s true performance.
25
Q

Summary: How do changes in the training and test sets affect model performance?

A
  1. Adding training data improves generalization and reduces test error.
  2. Removing training data increases overfitting and worsens test error.
  3. Adding test data provides a more reliable estimate of generalization.
  4. Removing test data reduces the ability to assess model performance accurately.
26
Q

Why can we afford to use more complex models with larger datasets?

A

With larger datasets, the model has more diverse data points to learn from, which reduces the risk of overfitting and improves generalization performance, allowing more flexible models to be used.

27
Q

What is the goal of ridge regression (L2 regularization)?

A

To discourage large coefficients by adding a penalty term to the error function, ensuring the model is simpler and more generalizable. This reduces overfitting.

28
Q

How is the error function modified in ridge regression?

A
  • The modified error function \tilde{E}(w) includes
  1. a penalty term proportional to the squared norm of the coefficients
  2. lambda, which controls the strength of the regularization
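In symbols (a standard form following Bishop's notation, not necessarily the exact notation used in the lecture):

```latex
\tilde{E}(\mathbf{w}) \;=\; \frac{1}{2}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2
\;+\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^2,
\qquad \lVert \mathbf{w} \rVert^2 = \mathbf{w}^\mathsf{T}\mathbf{w}
```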
29
Q

What happens when lnλ=−∞ ?

A
  • No regularization
  • The polynomial overfits, fitting the noise in the data
  • The coefficients take on very large values
30
Q

What happens as lnλ increases?

A
  1. The polynomial becomes smoother, fitting the general trend without overfitting the noise.
  2. The coefficients shrink significantly, reducing model flexibility and overfitting.
31
Q

What happens when lnλ=0 ?

A
  • Too much regularization
  • The fit becomes very simple, underfitting the data and failing to capture the trend.
  • The coefficients become almost zero, resulting in a straight-line prediction.
32
Q

How does regularization affect the RMSE plot wrt lnλ?

A
  • At small λ, the model overfits, leading to high test error and low training error.
  • As λ increases, the training error rises slightly while the test error falls to a minimum, representing a balance between fitting the data and generalizing well (this is where the reliable models lie).
  • At very high λ, the test and training error both increase, showing underfitting of the model
33
Q

What is the trade-off in choosing λ in ridge regression?

A

The trade-off involves balancing bias and variance:

  • Low λ: Low bias but high variance (overfitting).
  • High λ: High bias but low variance (underfitting).
  • The optimal λ minimizes test error by achieving a good bias-variance trade-off.
34
Q

What is the general linear model?

A
  • The general linear model is a linear combination of basis functions
  • ϕ_j(x) are the basis functions.
  • w is the vector of coefficients (parameters).
  • w_0 acts as a bias term when ϕ_0(x)=1
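Written out (the standard form, as in Bishop):

```latex
y(\mathbf{x}, \mathbf{w}) \;=\; \sum_{j=0}^{M-1} w_j\,\phi_j(\mathbf{x})
\;=\; \mathbf{w}^\mathsf{T}\boldsymbol{\phi}(\mathbf{x}),
\qquad \phi_0(\mathbf{x}) = 1 \ \text{so that } w_0 \text{ acts as the bias.}
```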
35
Q

What is the purpose of basis functions in the general linear model?

A

Basis functions handle nonlinear relationships between input variables while maintaining the analytical simplicity of a model that is linear in the parameters.

36
Q

What are global basis functions?

A
  • Global basis functions are nonzero over the whole input range, so adjusting one coefficient changes the prediction everywhere.
  • Example: Polynomial basis function ϕ_j(x)=x^j
  • Disadvantage: Small approximation errors in specific areas affect the whole function.
37
Q

What are local basis functions?

A
  • Local basis functions affect only a limited region of the input space.
  • Example: Gaussian basis function ϕ_j(x) = exp( −(x − μ_j)^2 / (2s^2) )
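A small NumPy sketch evaluating Gaussian basis functions on a grid (the centres μ_j and width s are illustrative choices):

```python
import numpy as np

def gaussian_basis(x, mus, s):
    """Local Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).
    Returns an array of shape (len(x), len(mus)); each column is one basis function."""
    x = np.asarray(x)[:, None]        # shape (N, 1)
    mus = np.asarray(mus)[None, :]    # shape (1, M)
    return np.exp(-(x - mus) ** 2 / (2 * s ** 2))

x = np.linspace(0, 1, 5)
Phi = gaussian_basis(x, mus=np.linspace(0, 1, 4), s=0.2)
print(Phi.shape)   # (5, 4): 5 inputs, 4 local basis functions
```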
38
Q

What are sigmoidal basis functions?

A
  • Sigmoidal basis functions transition from 0 to 1 over a certain range of x
  • ϕ_j(x)= σ((x-μ_j)/s)
39
Q

How is the total number of parameters in the general linear model determined?

A

The total number of parameters is M, consisting of w_0 (bias) and M−1 coefficients for the basis functions

40
Q

How is the target variable t modeled in maximum likelihood estimation?

A
  • t=y(x,w)+ϵ

Where:

  1. y(x,w) is the deterministic model (e.g., linear model).
  2. ϵ is noise, which is normally distributed with mean 0 and variance β^−1
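A sketch of this generative assumption (the true function, noise precision, and sample size here are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 25.0                       # noise precision; the noise variance is 1/beta
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)    # stands in for the deterministic model y(x, w)
eps = rng.normal(loc=0.0, scale=np.sqrt(1.0 / beta), size=x.shape)
t = y_true + eps                  # t = y(x, w) + eps, with eps ~ N(0, beta^-1)
```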
41
Q

How is noise ϵ distributed in maximum likelihood estimation?

A
  • Noise ϵ is normally distributed with mean 0 and variance β^−1
  • p(ϵ∣β)=N(ϵ∣0,β^−1)
42
Q

What is the conditional probability of t given x,w, and β^−1?

A
  • follows a normal distribution that indicates that t is normally distributed around the model output y(x,w) with variance β^−1
  • p(t∣x,w,β^−1) = N(t∣y(x,w),β^−1)
43
Q

What does the conditional mean E[t∣x] represent?

A
  • the expected value of t given x, which is equal to the model output
  • E[t|x] = y(x,w)
  • the optimal prediction for a new value of x will be given by the conditional mean y(x,w) of the target variable
44
Q

What does it mean when the conditional distribution of t given x is unimodal?

A
  • A unimodal conditional distribution implies that the probability of observing t is highest around the model output y(x,w)
  • the variance β^−1 determines the spread of the distribution around this prediction.
45
Q

What is the likelihood function in the context of maximum likelihood estimation?

A

The likelihood function p(t∣X,w,β) is the product of the individual Gaussian likelihoods for each input vector
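In symbols (the standard form, assuming i.i.d. observations):

```latex
p(\mathbf{t}\mid \mathbf{X}, \mathbf{w}, \beta)
\;=\; \prod_{n=1}^{N} \mathcal{N}\!\bigl(t_n \,\big|\, \mathbf{w}^\mathsf{T}\boldsymbol{\phi}(\mathbf{x}_n),\; \beta^{-1}\bigr)
```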

46
Q

Why do we take the logarithm of the likelihood function?

A

Taking the logarithm of the likelihood function:

  1. Turns products into sums, which is numerically more stable (it avoids underflow from multiplying many small probabilities).
  2. Makes it easier to take derivatives, since differentiation of sums is simpler than products.
47
Q

What is the log-likelihood function?

A
  • ln p(t∣w,β) = (N/2) lnβ − (N/2) ln(2π) − βE_D(w)
  • where the error function E_D(w) = (1/2) Σ_{n=1}^{N} ( t_n − w^Tϕ(x_n) )^2
48
Q

What is the relationship between minimizing the error function and maximizing the log-likelihood?

A
  • Minimizing the error function E_D(w) is equivalent to maximizing the log-likelihood under the Gaussian noise assumption, because the remaining terms of the log-likelihood do not depend on w.
  • This provides a motivation for using the error function as a maximum likelihood solution.
49
Q

What is the gradient of the log-likelihood with respect to w?

A

∇_w ln p(t∣w,β) = β Σ_{n=1}^{N} ( t_n − w^Tϕ(x_n) ) ϕ(x_n)^T

50
Q

What is the closed-form solution for the maximum likelihood estimate of w?

A
  • w_ML = (Φ^T Φ)^−1 Φ^T t
  • the quantity Φ^† = (Φ^T Φ)^−1 Φ^T is the Moore–Penrose pseudo-inverse of the design matrix Φ
  • this is the Ordinary Least Squares (OLS) solution, which minimizes the sum of squared errors (SSE) between the predicted and target values
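A minimal NumPy sketch of the OLS solution with a polynomial design matrix (illustrative names; in practice np.linalg.lstsq or np.linalg.pinv is preferred over forming the inverse explicitly):

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix: Phi[n, j] = x_n ** j, j = 0..M-1 (column 0 is the bias)."""
    return np.vander(np.asarray(x), N=M, increasing=True)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = design_matrix(x, M=4)
# w_ML = (Phi^T Phi)^(-1) Phi^T t, solved without forming the inverse explicitly
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
# equivalently: w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml)
```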
51
Q

What is the design matrix Φ?

A
  • Φ is an N×M matrix with elements Φ_nj = ϕ_j(x_n)
  • each column corresponds to a different basis function ϕ_j
  • each row evaluates all the basis functions at one particular input x_n
52
Q

What are the key steps in deriving the closed-form solution for w?

A
  1. Start with the likelihood function for the dataset.
  2. Take the log-likelihood to simplify the product into a sum.
  3. Express the log-likelihood as the sum of squared errors (SSE).
  4. Compute the gradient of the SSE with respect to w and set it to zero.
  5. Rewrite the result in matrix form using the design matrix Φ.
  6. Solve for w to get the closed-form solution.
53
Q

What are the differences between a closed-form solution and numerical methods?

A
  • Closed-Form Solution:
  1. Directly computes the exact solution.
  2. Efficient for small problems.
  3. Requires matrix inversion.
  • Numerical Methods:
  1. Approximate the solution iteratively.
  2. Scale better for large or complex problems.
  3. Examples: gradient descent, Newton’s method.
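A sketch of the numerical alternative: batch gradient descent on the sum-of-squares error (step size and iteration count are arbitrary illustrative values):

```python
import numpy as np

def gradient_descent_ls(Phi, t, lr=0.01, n_iters=20000):
    """Minimize E(w) = 1/2 ||t - Phi w||^2 iteratively instead of solving in closed form."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - t)   # gradient of the sum-of-squares error
        w -= lr * grad                 # step size must be small enough for convergence
    return w

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)
print(gradient_descent_ls(Phi, t))     # gradually approaches the closed-form w_ML
```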
54
Q

How can the least-squares solution be represented geometrically?

A

The least-squares solution y = Φw_ML gives the predicted vector y, which is a linear combination of the basis-function columns of the design matrix Φ, weighted by the coefficients w_ML

55
Q

What is the subspace S in the context of the least-squares solution?

A

The subspace S is spanned by the basis functions, meaning it is the space where the predictions y lie. The target vector t represents the actual observed data.

56
Q

What does the predicted vector y represent geometrically?

A
  • The predicted vector y is the orthogonal projection of the target vector t onto the subspace S.
  • This projection minimizes the distance between t and S.
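A quick numerical check of this picture (a self-contained sketch with illustrative data): the residual t − Φw_ML should be orthogonal to every column of Φ.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)    # columns of Phi span the subspace S
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

y = Phi @ w_ml                     # orthogonal projection of t onto S
print(Phi.T @ (t - y))             # ~ [0, 0, 0, 0]: the residual is orthogonal to S
```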
57
Q

What is the role of w_ML in the geometrical interpretation?

A

The weights w_ML are chosen to minimize the distance between the target vector t and its projection onto the subspace S, ensuring the best fit to the data in a least-squares sense.

58
Q

How does the dimensionality M affect the least-squares solution?

A
  • The number of basis functions ϕ_m and corresponding weights w_m depends on the dimensionality M.
  • The weights adjust the model so that the predicted vector y lies as close as possible to the target vector t in the subspace S.
59
Q

What is the objective of regularized least squares?

A

Regularized least squares aims to minimize the sum of the error term and a regularization term to prevent overfitting.

60
Q

How is the error term in regularized least squares formulated?

A

The total error is the sum of the original sum-of-squares error E_D(w) and a quadratic regularization term λE_W(w), where E_W(w) = (1/2) w^T w.
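A sketch of the corresponding closed-form ridge solution, w = (λI + Φ^TΦ)^−1 Φ^T t (a standard result; the λ values below are arbitrary):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form regularized least squares: w = (lambda*I + Phi^T Phi)^(-1) Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=10, increasing=True)   # order-9 polynomial: overfits without regularization

for lam in [1e-8, 1e-3, 1.0]:
    w = ridge_fit(Phi, t, lam)
    print(lam, np.abs(w).max())    # coefficient magnitudes shrink as lambda grows
```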

61
Q

What is the purpose of adding a penalty term in Ridge Regression (L2 regularization)?

A

The penalty term discourages large coefficients, making the model simpler and more generalizable by minimizing both the error on the data and the size of the model’s parameters.

62
Q

How does L1 regularization (Lasso) differ from L2 regularization (Ridge)?

A
  • L1 regularization (Lasso, q=1) penalizes the absolute values of the weights, often resulting in some weights being exactly zero, promoting sparsity.
  • L2 regularization (Ridge, q=2) penalizes the squared values of the weights, shrinking them towards zero without making them exactly zero.
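A sketch of the difference using scikit-learn (assumed available; data and alpha are illustrative): with L1 several coefficients come out exactly zero, with L2 they are only shrunk.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
# the target depends on only 3 of the 10 features
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print(np.sum(lasso.coef_ == 0))   # several weights exactly zero (sparse model)
print(np.sum(ridge.coef_ == 0))   # typically none exactly zero, just small
```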
63
Q

What does a high value of the regularization parameter λ imply?

A

High λ implies a stronger penalty, shrinking the model weights more strongly toward zero.

64
Q

What happens when λ=0 in regularized least squares?

A

When λ=0, there is no regularization, and the problem reduces to normal least squares regression.

65
Q

What is the penalty applied to a weight w_j in L2 regularization when q=2?

A

The penalty for w_j is proportional to the square of the weight (w_j^2).

66
Q

q = 0.5

A
  • very aggressive regularization for q values lower than 1
  • sharp diamond plot: suggests that even small deviations from zero will be penalized heavily.
67
Q

q = 1

A
  • L1/lasso
  • pushing some weights exactly to zero, making the model “sparse” (some weights will be eliminated)
  • normal diamond plot
68
Q

q = 2

A
  • L2/quadratic
  • penalizes large weights but doesn’t push any of them to zero. Instead, it makes the weights smaller overall. (squared values)
  • round plot
69
Q

q = 4

A
  • as q increases, regularization becomes less aggressive, and large weights are penalized less severely.
  • rounded square plot
70
Q

Why is L1 regularization useful for feature selection?

A

L1 regularization drives some weights to exactly zero, effectively removing the corresponding features from the model.