lecture 2 - linear models Flashcards

1
Q

What is the goal of linear regression?

A

To predict the value of a target variable t for a new input x, given a training dataset comprising N observations {x_n, t_n}.

2
Q

How does the simplest approach to linear regression work?

A

In the simplest approach, linear regression involves directly constructing an appropriate function y(x) such that for new inputs x, the function predicts corresponding values of t.

3
Q

How can linear regression be approached from a probabilistic perspective?

A

From a probabilistic perspective, we aim to model the predictive distribution p(t∣x), which expresses the uncertainty about the value of t for each value of x.

4
Q

Why is the probabilistic approach useful in linear regression?

A

The probabilistic approach allows us to make predictions of t for any new value of x, minimizing the expected value of a suitably chosen loss function.

5
Q

What is the goal of (polynomial) curve fitting?

A

To exploit the training set to discover the underlying function and make predictions for new inputs, even though individual observations are corrupted by noise.

6
Q

Why is polynomial curve fitting considered linear in the parameters?

A
  • Although the function y(x,w) is nonlinear in x, it is linear in the parameters w.
  • This means the model is a weighted sum of fixed functions of x, so it is a linear function of the coefficients w.
7
Q

How are the coefficients in polynomial curve fitting determined?

A
  • by fitting the polynomial to the training data.
  • this is done by minimizing an error function that measures the misfit between the model predictions and the target values.
8
Q

What is the error function used in polynomial curve fitting?

A
  • E(w) = (1/2) Σ_{n=1}^{N} ( y(x_n, w) − t_n )^2
  • the sum runs over all N data points in the training set
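A minimal NumPy sketch of this sum-of-squares error for a polynomial model (the data and function names are illustrative, not from the lecture):

```python
import numpy as np

def poly_sse(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
    for a polynomial y(x, w) = w_0 + w_1 x + ... + w_M x^M."""
    y = np.polyval(w[::-1], x)   # np.polyval expects the highest-order coefficient first
    return 0.5 * np.sum((y - t) ** 2)

# toy data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
print(poly_sse(np.zeros(4), x, t))   # error of the all-zero cubic
```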
9
Q

How is the optimal polynomial determined in polynomial curve fitting?

A
  • finding the value of w that minimizes the error function.
  • Since the error function is quadratic in w, it is convex, ensuring a unique minimum that can be calculated directly.
10
Q

What is model selection in polynomial curve fitting?

A
  • Model selection is the process of choosing the order M of the polynomial.
  • A higher-order polynomial may fit the training data better, but it can result in overfitting if it poorly represents the underlying function.
11
Q

What is overfitting in polynomial curve fitting?

A
  • Overfitting occurs when the model fits the training data too closely, including noise, resulting in a poor ability to generalize to new data.
  • This often happens with high-order polynomials.
12
Q

How can overfitting be detected?

A
  • Overfitting can be detected by comparing the error on a training set and a separate test set.
  • Overfitted models show very low training error but high test error.
13
Q

What metric is used to assess generalization performance?

A
  • The root-mean-square error (RMSE), E_RMS = sqrt( 2 E(w*) / N )
  • for each choice of M, we evaluate E_RMS on the training set and the test set separately
  • as M grows, training RMSE keeps falling, but test RMSE rises once the model starts to overfit
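A sketch of this comparison, assuming a toy setup of noisy sine-curve samples (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def rmse(w, x, t):
    y = np.polyval(w[::-1], x)
    return np.sqrt(np.mean((y - t) ** 2))   # equals sqrt(2 E(w*) / N)

for M in range(10):
    # least-squares fit of an order-M polynomial (np.polyfit returns highest order first)
    w = np.polyfit(x_train, t_train, deg=M)[::-1]
    print(M, rmse(w, x_train, t_train), rmse(w, x_test, t_test))
```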
14
Q

Why does performance get worse as the polynomial order M increases?

A

As M increases:

  1. The magnitude of the coefficients increases significantly.
  2. Higher-order polynomials fit the training data exactly, including noise, rather than capturing the true trend.
  3. This results in poor generalization to new data.
15
Q

What happens when M=9 (10 coefficients) and there are 10 data points in polynomial curve fitting?

A
  1. The training error goes to zero because the polynomial exactly fits all data points.
  2. The test error becomes very large due to overfitting, and the function exhibits wild oscillations.
16
Q

What happens when M=0 and there are 10 data points in polynomial curve fitting?

A
  • the model is a constant (a horizontal line)
  • y(x,w) = w_0
17
Q

What happens when M=1 and there are 10 data points in polynomial curve fitting?

A
  • the model is a straight line
  • y(x,w) = w_0 + w_1 x
18
Q

What general principle can be learned from overfitting?

A

Overfitting is a general property of maximum likelihood estimation. It can also occur in deep learning when training on a small dataset, leading to poor generalization.

19
Q

What is the effect of adding observations to the training set on training error?

A
  • Adding more data provides more information for the model to learn from.
  • Training error may increase slightly because the model has to fit a more diverse dataset, but this improves generalization and reduces overfitting.
20
Q

What is the effect of adding observations to the training set on test error?

A

Test error typically decreases because a larger training set helps the model generalize better to unseen data, reducing overfitting.

21
Q

What is the effect of removing observations from the training set on training error?

A
  • Removing observations reduces the information available for training, making the model less robust.
  • Training error might decrease because the model fits the reduced dataset better, but this often leads to overfitting and worse generalization.
22
Q

What is the effect of removing observations from the training set on test error?

A

Test error typically increases because the model is more prone to overfitting the smaller training set.

23
Q

What is the effect of adding observations to the test set?

A
  • Adding observations to the test set does not affect the training error.
  • It provides a more reliable estimate of the model’s generalization performance, as the test set becomes more representative of the underlying data distribution.
24
Q

What is the effect of removing observations from the test set?

A
  • Removing observations reduces the ability to accurately assess the generalization error.
  • Test error might appear to improve due to fewer diverse samples, but this can give a misleading picture of the model’s true performance.
25
Q

Summary: How do changes in the training and test sets affect model performance?

A
  1. Adding training data improves generalization and reduces test error.
  2. Removing training data increases overfitting and worsens test error.
  3. Adding test data provides a more reliable estimate of generalization.
  4. Removing test data reduces the ability to assess model performance accurately.
26
Q

Why can we afford to use more complex models with larger datasets?

A

With larger datasets, the model has more diverse data points to learn from, which reduces the risk of overfitting and improves generalization performance, allowing more flexible models to be used.

27
Q

What is the goal of ridge regression (L2 regularization)?

A

To discourage large coefficients by adding a penalty term to the error function, ensuring the model is simpler and more generalizable. This reduces overfitting.

28
Q

How is the error function modified in ridge regression?

A
  • The modified error function \tilde{E}(w) includes
  1. a penalty term proportional to the squared norm of the coefficients
  2. lambda, which controls the strength of the regularization
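In symbols (a standard form following Bishop's notation, not necessarily the exact notation used in the lecture):

```latex
\tilde{E}(\mathbf{w}) \;=\; \frac{1}{2}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w}) - t_n\bigr)^2
\;+\; \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^2,
\qquad \lVert \mathbf{w} \rVert^2 = \mathbf{w}^\mathsf{T}\mathbf{w}
```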
29
Q

What happens when lnλ=−∞ ?

A
  • No regularization
  • The polynomial overfits, fitting the noise in the data
  • The coefficients take on very large values
30
Q

What happens as lnλ increases?

A
  1. The polynomial becomes smoother, fitting the general trend without overfitting the noise.
  2. The coefficients shrink significantly, reducing model flexibility and overfitting.
31
Q

What happens when lnλ=0 ?

A
  • Too much regularization
  • The fit becomes very simple, underfitting the data and failing to capture the trend.
  • The coefficients become almost zero, resulting in a straight-line prediction.
32
Q

How does regularization affect the RMSE plot wrt lnλ?

A
  • At small λ, the model overfits, leading to high test error and low training error.
  • As λ increases, the training error rises slightly while the test error falls to a minimum, representing a balance between fitting the data and generalizing well (this is where the reliable models lie).
  • At very high λ, the test and training error both increase, showing underfitting of the model
33
Q

What is the trade-off in choosing λ in ridge regression?

A

The trade-off involves balancing bias and variance:

  • Low λ: Low bias but high variance (overfitting).
  • High λ: High bias but low variance (underfitting).
  • The optimal λ minimizes test error by achieving a good bias-variance trade-off.
34
Q

What is the general linear model?

A
  • The general linear model is a linear combination of basis functions
  • ϕ_j(x) are the basis functions.
  • w is the vector of coefficients (parameters).
  • w_0 acts as a bias term when ϕ_0(x)=1
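Written out (the standard form, as in Bishop):

```latex
y(\mathbf{x}, \mathbf{w}) \;=\; \sum_{j=0}^{M-1} w_j\,\phi_j(\mathbf{x})
\;=\; \mathbf{w}^\mathsf{T}\boldsymbol{\phi}(\mathbf{x}),
\qquad \phi_0(\mathbf{x}) = 1 \ \text{so that } w_0 \text{ acts as the bias.}
```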
35
Q

What is the purpose of basis functions in the general linear model?

A

Basis functions handle nonlinear relationships between input variables while maintaining the analytical simplicity of a model that is linear in the parameters.

36
Q

What are global basis functions?

A
  • Global basis functions are nonzero over the whole input range, so adjusting one coefficient changes the prediction everywhere.
  • Example: Polynomial basis function ϕ_j(x)=x^j
  • Disadvantage: Small approximation errors in specific areas affect the whole function.
37
Q

What are local basis functions?

A
  • Local basis functions affect only a limited region of the input space.
  • Example: Gaussian basis function ϕ_j(x) = exp( −(x − μ_j)^2 / (2s^2) )
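A small NumPy sketch evaluating Gaussian basis functions on a grid (the centres μ_j and width s are illustrative choices):

```python
import numpy as np

def gaussian_basis(x, mus, s):
    """Local Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).
    Returns an array of shape (len(x), len(mus)); each column is one basis function."""
    x = np.asarray(x)[:, None]        # shape (N, 1)
    mus = np.asarray(mus)[None, :]    # shape (1, M)
    return np.exp(-(x - mus) ** 2 / (2 * s ** 2))

x = np.linspace(0, 1, 5)
Phi = gaussian_basis(x, mus=np.linspace(0, 1, 4), s=0.2)
print(Phi.shape)   # (5, 4): 5 inputs, 4 local basis functions
```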
38
Q

What are sigmoidal basis functions?

A
  • Sigmoidal basis functions transition from 0 to 1 over a certain range of x
  • ϕ_j(x)= σ((x-μ_j)/s)
39
Q

How is the total number of parameters in the general linear model determined?

A

The total number of parameters is M, consisting of w_0 (bias) and M−1 coefficients for the basis functions

40
Q

How is the target variable t modeled in maximum likelihood estimation?

A
  • t=y(x,w)+ϵ

Where:

  1. y(x,w) is the deterministic model (e.g., linear model).
  2. ϵ is noise, which is normally distributed with mean 0 and variance β^−1
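A sketch of this generative assumption (the true function, noise precision, and sample size here are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 25.0                       # noise precision; the noise variance is 1/beta
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)    # stands in for the deterministic model y(x, w)
eps = rng.normal(loc=0.0, scale=np.sqrt(1.0 / beta), size=x.shape)
t = y_true + eps                  # t = y(x, w) + eps, with eps ~ N(0, beta^-1)
```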
41
Q

How is noise ϵ distributed in maximum likelihood estimation?

A
  • Noise ϵ is normally distributed with mean 0 and variance β^−1
  • p(ϵ∣β)=N(ϵ∣0,β^−1)
42
Q

What is the conditional probability of t given x,w, and β^−1?

A
  • follows a normal distribution that indicates that t is normally distributed around the model output y(x,w) with variance β^−1
  • p(t∣x,w,β^−1) = N(t∣y(x,w),β^−1)
43
Q

What does the conditional mean E[t∣x] represent?

A
  • the expected value of t given x, which is equal to the model output
  • E[t|x] = y(x,w)
  • the optimal prediction for a new value of x will be given by the conditional mean y(x,w) of the target variable
44
Q

What does it mean when the conditional distribution of t given x is unimodal?

A
  • A unimodal conditional distribution implies that the probability of observing t is highest around the model output y(x,w)
  • the variance β^−1 determines the spread of the distribution around this prediction.
45
Q

What is the likelihood function in the context of maximum likelihood estimation?

A

The likelihood function p(t∣X,w,β) is the product of the individual Gaussian likelihoods for each input vector
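In symbols (the standard form, assuming i.i.d. observations):

```latex
p(\mathbf{t}\mid \mathbf{X}, \mathbf{w}, \beta)
\;=\; \prod_{n=1}^{N} \mathcal{N}\!\bigl(t_n \,\big|\, \mathbf{w}^\mathsf{T}\boldsymbol{\phi}(\mathbf{x}_n),\; \beta^{-1}\bigr)
```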

46
Q

Why do we take the logarithm of the likelihood function?

A

Taking the logarithm of the likelihood function:

  1. Turns products into sums, which is numerically more stable (it avoids underflow from multiplying many small probabilities).
  2. Makes it easier to take derivatives, since differentiation of sums is simpler than products.
47
Q

What is the log-likelihood function?

A
  • ln p(t∣w,β) = (N/2) lnβ − (N/2) ln(2π) − βE_D(w)
  • where the error function E_D(w) = (1/2) Σ_{n=1}^{N} ( t_n − w^Tϕ(x_n) )^2
48
Q

What is the relationship between minimizing the error function and maximizing the log-likelihood?

A
  • Minimizing the error function E_D(w) is equivalent to maximizing the log-likelihood under the Gaussian noise assumption, because the remaining terms of the log-likelihood do not depend on w.
  • This provides a motivation for using the error function as a maximum likelihood solution.
49
Q

What is the gradient of the log-likelihood with respect to w?

A

∇_w ln p(t∣w,β) = β Σ_{n=1}^{N} ( t_n − w^Tϕ(x_n) ) ϕ(x_n)^T

50
Q

What is the closed-form solution for the maximum likelihood estimate of w?

A
  • w_ML = (Φ^T Φ)^−1 Φ^T t
  • the quantity Φ^† = (Φ^T Φ)^−1 Φ^T is the Moore–Penrose pseudo-inverse of the design matrix Φ
  • this is the Ordinary Least Squares (OLS) solution, which minimizes the sum of squared errors (SSE) between the predicted and target values
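A minimal NumPy sketch of the OLS solution with a polynomial design matrix (illustrative names; in practice np.linalg.lstsq or np.linalg.pinv is preferred over forming the inverse explicitly):

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix: Phi[n, j] = x_n ** j, j = 0..M-1 (column 0 is the bias)."""
    return np.vander(np.asarray(x), N=M, increasing=True)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = design_matrix(x, M=4)
# w_ML = (Phi^T Phi)^(-1) Phi^T t, solved without forming the inverse explicitly
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
# equivalently: w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml)
```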
51
Q

What is the design matrix Φ?

A
  • Φ is an N×M matrix with elements Φ_nj = ϕ_j(x_n)
  • each column corresponds to a different basis function ϕ_j
  • each row evaluates all the basis functions at one particular input x_n
52
Q

What are the key steps in deriving the closed-form solution for w?

A
  1. Start with the likelihood function for the dataset.
  2. Take the log-likelihood to simplify the product into a sum.
  3. Express the log-likelihood as the sum of squared errors (SSE).
  4. Compute the gradient of the SSE with respect to w and set it to zero.
  5. Rewrite the result in matrix form using the design matrix Φ.
  6. Solve for w to get the closed-form solution.
53
Q

What are the differences between a closed-form solution and numerical methods?

A
  • Closed-Form Solution:
  1. Directly computes the exact solution.
  2. Efficient for small problems.
  3. Requires matrix inversion.
  • Numerical Methods:
  1. Approximate the solution iteratively.
  2. Scale better for large or complex problems.
  3. Examples: gradient descent, Newton’s method.
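A sketch of the numerical alternative: batch gradient descent on the sum-of-squares error (step size and iteration count are arbitrary illustrative values):

```python
import numpy as np

def gradient_descent_ls(Phi, t, lr=0.01, n_iters=20000):
    """Minimize E(w) = 1/2 ||t - Phi w||^2 iteratively instead of solving in closed form."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - t)   # gradient of the sum-of-squares error
        w -= lr * grad                 # step size must be small enough for convergence
    return w

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)
print(gradient_descent_ls(Phi, t))     # gradually approaches the closed-form w_ML
```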
54
Q

How can the least-squares solution be represented geometrically?

A

The least-squares solution y = Φw_ML gives the predicted vector y, which is a linear combination of the basis-function columns of the design matrix Φ, weighted by the coefficients w_ML

55
Q

What is the subspace S in the context of the least-squares solution?

A

The subspace S is spanned by the basis functions, meaning it is the space where the predictions y lie. The target vector t represents the actual observed data.

56
Q

What does the predicted vector y represent geometrically?

A
  • The predicted vector y is the orthogonal projection of the target vector t onto the subspace S.
  • This projection minimizes the distance between t and S.
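A quick numerical check of this picture (a self-contained sketch with illustrative data): the residual t − Φw_ML should be orthogonal to every column of Φ.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)    # columns of Phi span the subspace S
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

y = Phi @ w_ml                     # orthogonal projection of t onto S
print(Phi.T @ (t - y))             # ~ [0, 0, 0, 0]: the residual is orthogonal to S
```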
57
Q

What is the role of w_ML in the geometrical interpretation?

A

The weights w_ML are chosen to minimize the distance between the target vector t and its projection onto the subspace S, ensuring the best fit to the data in a least-squares sense.

58
Q

How does the dimensionality M affect the least-squares solution?

A
  • The number of basis functions ϕ_m and corresponding weights w_m depends on the dimensionality M.
  • The weights adjust the model so that the predicted vector y lies as close as possible to the target vector t in the subspace S.
59
Q

What is the objective of regularized least squares?

A

Regularized least squares aims to minimize the sum of the error term and a regularization term to prevent overfitting.

60
Q

How is the error term in regularized least squares formulated?

A

The total error is the sum of the original sum-of-squares error E_D(w) and a quadratic regularization term λE_W(w), where E_W(w) = (1/2) w^T w.
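A sketch of the corresponding closed-form ridge solution, w = (λI + Φ^TΦ)^−1 Φ^T t (a standard result; the λ values below are arbitrary):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form regularized least squares: w = (lambda*I + Phi^T Phi)^(-1) Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=10, increasing=True)   # order-9 polynomial: overfits without regularization

for lam in [1e-8, 1e-3, 1.0]:
    w = ridge_fit(Phi, t, lam)
    print(lam, np.abs(w).max())    # coefficient magnitudes shrink as lambda grows
```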

61
Q

What is the purpose of adding a penalty term in Ridge Regression (L2 regularization)?

A

The penalty term discourages large coefficients, making the model simpler and more generalizable by minimizing both the error on the data and the size of the model’s parameters.

62
Q

How does L1 regularization (Lasso) differ from L2 regularization (Ridge)?

A
  • L1 regularization (Lasso, q=1) penalizes the absolute values of the weights, often resulting in some weights being exactly zero, promoting sparsity.
  • L2 regularization (Ridge, q=2) penalizes the squared values of the weights, shrinking them towards zero without making them exactly zero.
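A sketch of the difference using scikit-learn (assumed available; data and alpha are illustrative): with L1 several coefficients come out exactly zero, with L2 they are only shrunk.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
# the target depends on only 3 of the 10 features
t = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, t)
ridge = Ridge(alpha=0.1).fit(X, t)
print(np.sum(lasso.coef_ == 0))   # several weights exactly zero (sparse model)
print(np.sum(ridge.coef_ == 0))   # typically none exactly zero, just small
```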
63
Q

What does a high value of the regularization parameter λ imply?

A

High λ implies a stronger penalty, shrinking the model weights more strongly toward zero.

64
Q

What happens when λ=0 in regularized least squares?

A

When λ=0, there is no regularization, and the problem reduces to normal least squares regression.

65
Q

What is the penalty applied to a weight w_j in L2 regularization when q=2?

A

The penalty for w_j is proportional to the square of the weight (w_j^2).

66
Q

q = 0.5

A
  • very aggressive regularization for q values lower than 1
  • sharp diamond plot: suggests that even small deviations from zero will be penalized heavily.
67
Q

q = 1

A
  • L1/lasso
  • pushing some weights exactly to zero, making the model “sparse” (some weights will be eliminated)
  • normal diamond plot
68
Q

q = 2

A
  • L2/quadratic
  • penalizes large weights but doesn’t push any of them to zero. Instead, it makes the weights smaller overall. (squared values)
  • round plot
69
Q

q = 4

A
  • as q increases, regularization becomes less aggressive, and large weights are penalized less severely.
  • rounded square plot
70
Q

Why is L1 regularization useful for feature selection?

A

L1 regularization drives some weights to exactly zero, effectively removing the corresponding features from the model.