lecture 3 - bayesian inference Flashcards

Question

How does regularization affect bias and variance?

Answer 1

Regularization decreases variance but increases bias, as it simplifies the model and reduces its flexibility.

Answer 2

Increasing the number of coefficients or basis functions adds complexity, which increases variance but decreases bias.

Answer 3

The red line represents a simple model (e.g., a flat line) that underfits the data, resulting in high bias.

Answer 4

The blue line represents a very complex model that overfits the data, resulting in high variance.

Answer 5

The sweet spot is a level of complexity where the bias and variance are balanced, minimizing the overall error, which can be achieved through proper regularization.

Answer 6

It provides useful **intuition for model selection and regularization** by explaining the tradeoff between underfitting and overfitting.

Answer 7

- In theory, the bias-variance tradeoff relies on analyzing an ensemble of datasets to compute the exact bias and variance.
- In real-world scenarios, we typically work with only a single dataset, not an ensemble.
- Bias and variance therefore cannot be calculated precisely in practice, which makes the tradeoff difficult to measure or apply directly.
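
The ensemble idea can be tried out on synthetic data, where many datasets are available by construction. The sketch below is only an illustration under assumptions that are not in the cards: a sine ground-truth function, a degree-9 polynomial basis, Gaussian noise, and a ridge penalty `lam` standing in for "regularization".

```python
import numpy as np

def bias_variance(lam, n_datasets=200, n_points=25, degree=9, noise_std=0.3, seed=0):
    """Estimate squared bias and variance of ridge-regularized polynomial fits
    by averaging over an ensemble of synthetic datasets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)           # fixed inputs shared by all datasets
    x_test = np.linspace(0.0, 1.0, 50)
    true_f = np.sin(2 * np.pi * x_test)           # assumed ground-truth function

    Phi = np.vander(x, degree + 1, increasing=True)
    Phi_test = np.vander(x_test, degree + 1, increasing=True)
    A = lam * np.eye(degree + 1) + Phi.T @ Phi

    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        # Each dataset differs only in its noise realization.
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, n_points)
        w = np.linalg.solve(A, Phi.T @ t)         # regularized least-squares weights
        preds[d] = Phi_test @ w

    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f) ** 2)  # squared bias, averaged over x_test
    variance = np.mean(preds.var(axis=0))         # spread across datasets, averaged over x_test
    return bias_sq, variance

bias_weak, var_weak = bias_variance(lam=1e-3)     # weak regularization
bias_strong, var_strong = bias_variance(lam=10.0) # strong regularization
```

With these settings, the strongly regularized fit should show lower variance and higher bias than the weakly regularized one, which matches the direction stated in the first card.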

Answer 8

Bayesian linear regression is a type of linear regression that incorporates Bayesian principles, allowing you to **quantify uncertainty in predictions by providing a distribution over possible parameters (weights)**.

Answer 9

Traditional linear regression provides a single "best" estimate for the parameters, whereas Bayesian regression provides a probability distribution over possible parameters, reflecting uncertainty.

Answer 10

Weights w are treated as **random variables with a distribution**, and the goal is to find the posterior distribution of the weights.

Answer 11

- The posterior distribution is proportional to the product of the likelihood and the prior.
- p(w∣t) ∝ p(t∣w) p(w)

Answer 12

The likelihood describes how well the data is explained given the current set of weights.

Answer 13

The prior encodes our initial belief about the parameters before observing the data.

Answer 14

The posterior represents the probability of a set of weights given the observed data, providing an **updated estimate of the weights** based on the prior and likelihood.

Answer 15

The key idea is to use a **probability distribution over the weights and update this distribution based on the observed data**, resulting in both the data and weights following probability distributions.

Answer 16

Using a conjugate prior simplifies the computation by ensuring that the posterior distribution is in the same family as the prior, allowing easy tracking of the mean and covariance.

Answer 17

The posterior distribution will be a normal distribution with mean m_N and covariance S_N.

Answer 18

- The prior is represented as a normal distribution p(w) = N(w|m_0, S_0).
- m_0 is the prior mean.
- S_0 is the prior covariance matrix.

Answer 19

- The posterior is represented as a normal distribution p(w|t) = N(w|m_N, S_N).
- m_N is the posterior mean.
- S_N is the posterior covariance matrix.

Answer 20

m_N = S_N(S_0^−1 m_0 + βΦ^T t)

Answer 21

S_N^−1 = S_0^−1 + βΦ^TΦ
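
The two update formulas for m_N and S_N translate almost line by line into NumPy. This is a minimal sketch; the function name `posterior`, the straight-line data, and the values of `alpha` and `beta` are illustrative assumptions, not part of the cards.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Return posterior mean m_N and covariance S_N for a Gaussian prior N(m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi           # S_N^-1 = S_0^-1 + beta * Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)     # m_N = S_N (S_0^-1 m_0 + beta * Phi^T t)
    return mN, SN

# Synthetic straight-line data (all values below are assumptions for illustration).
rng = np.random.default_rng(0)
true_w = np.array([-0.3, 0.5])
beta = 25.0                                        # noise precision, i.e. noise std 0.2
x = rng.uniform(-1.0, 1.0, 200)
Phi = np.column_stack([np.ones_like(x), x])        # basis functions phi(x) = (1, x)
t = Phi @ true_w + rng.normal(0.0, beta ** -0.5, x.size)

alpha = 2.0                                        # prior precision
mN, SN = posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2) / alpha, beta=beta)
```

With 200 points the posterior mean `mN` should land close to `true_w`, and `SN` is a symmetric 2x2 covariance matrix.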

Answer 22

To ensure that the model converges to an optimal set of weights with reduced uncertainty, improving confidence in predictions.

Answer 23

the **prior parameters** m_0 and S_0, and the **observed data**

Answer 24

- A common prior is p(w) = N(w∣0, α^−1 I).
- The mean of the weights is zero and the covariance is α^−1 multiplied by the identity matrix.

Answer 25

1. Likelihood: Irrelevant, as no data points have been observed.
2. Prior/Posterior: Broad circular distribution centered around zero, indicating high uncertainty.
3. Data Space: The possible lines span a wide range of slopes and intercepts with no specific fit.

Answer 26

1. Likelihood: Influences the posterior, forming a diagonal ridge based on the first data point.
2. Prior/Posterior: Becomes more concentrated, reducing uncertainty.
3. Data Space: Possible lines start converging toward the correct slope and intercept.

Answer 27

1. Likelihood: Becomes narrower and more informative, further restricting possible values.
2. Prior/Posterior: More concentrated, increasing confidence in weight estimates.
3. Data Space: Lines converge closer toward the actual relationship, though some uncertainty remains.

Answer 28

1. Likelihood: Highly concentrated, showing certainty about the correct slope and intercept.
2. Prior/Posterior: Tightly concentrated around the true values, with low uncertainty.
3. Data Space: Lines cluster closely around the true relationship, indicating a near-perfect fit.

Answer 29

1. **Prior Influence**: Initially dominates, with high uncertainty.
2. **Data Updates**: As more data points are observed, the posterior becomes more concentrated.
3. **Confidence Grows with Data**: Confidence in the weight estimates increases with more data.
4. **Role of Likelihood**: Each data point provides new information, refining the prior to create the posterior.

Answer 30

- The process reduces uncertainty about weights by updating the posterior as more data is observed.
- Initially, there's high uncertainty, but the **posterior narrows with each data point**, leading to more confident and accurate predictions.
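
The narrowing of the posterior can be checked numerically by feeding in one data point at a time and treating each posterior as the prior for the next point. The sketch below assumes a straight-line model with basis (1, x) and illustrative values for the precisions `alpha` and `beta`; it tracks the trace of the covariance as a rough measure of total uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                            # assumed prior and noise precisions
true_w = np.array([-0.3, 0.5])                     # assumed "true" intercept and slope
x = rng.uniform(-1.0, 1.0, 20)
Phi = np.column_stack([np.ones_like(x), x])
t = Phi @ true_w + rng.normal(0.0, beta ** -0.5, x.size)

m = np.zeros(2)                                    # prior mean m_0
S = np.eye(2) / alpha                              # prior covariance S_0
traces = [np.trace(S)]
for phi_n, t_n in zip(Phi, t):
    # The current posterior acts as the prior for the next data point.
    S_inv_old = np.linalg.inv(S)
    S = np.linalg.inv(S_inv_old + beta * np.outer(phi_n, phi_n))
    m = S @ (S_inv_old @ m + beta * phi_n * t_n)
    traces.append(np.trace(S))

# Batch posterior computed from all points at once, for comparison.
S_batch = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
```

The entries of `traces` shrink with every observation, and the sequential result agrees with the batch posterior up to floating-point error, so the order in which points arrive does not change the final belief.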

Answer 31

ln p(w∣t) = −(β/2) Σ_n (t_n − w^T ϕ(x_n))^2 − (α/2) w^T w + const

Answer 32

- −(β/2) Σ_n (t_n − w^T ϕ(x_n))^2
- Penalizes the difference between the predicted value w^T ϕ(x_n) and the actual observed value t_n.
- The further the prediction deviates from the data, the lower the posterior probability for that particular set of weights.

Answer 33

- −(α/2) w^T w
- Penalizes large weights, ensuring the model does not stray too far from the belief that weights should be small.
- This introduces (free) regularization, which helps prevent overfitting.

Answer 34

A stronger prior (larger α) penalizes large weights more heavily, so α sets the strength of the regularization and how strongly the posterior is pulled toward the prior.
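
Maximizing the log posterior with the zero-mean prior is the same as minimizing a sum-of-squares error with a quadratic penalty of strength λ = α/β. The sketch below checks this numerically; the polynomial basis, the data, and the helper names `posterior_mean` and `ridge` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 30)
Phi = np.vander(x, 6, increasing=True)             # assumed polynomial basis, degree 5
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, x.size)
beta = 25.0                                        # assumed noise precision

def posterior_mean(alpha):
    """Posterior mean under the zero-mean prior N(0, alpha^-1 I)."""
    SN_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    return beta * np.linalg.solve(SN_inv, Phi.T @ t)

def ridge(lam):
    """Regularized least squares: minimize sum of squared errors + lam * w^T w."""
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

alpha = 0.5
w_map = posterior_mean(alpha)
w_ridge = ridge(alpha / beta)                      # same weights as w_map
norms = [np.linalg.norm(posterior_mean(a)) for a in (0.01, 1.0, 100.0)]
```

`w_map` and `w_ridge` coincide, and the weight norm in `norms` decreases as α grows, which is the "larger α discourages large weights" effect.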

Answer 35

It does not depend on w, so it does not affect the optimization process and can be ignored when working with relative probabilities.

Answer 36

1. The log of the posterior is a combination of the likelihood (how well the model fits the data) and the prior (how much we trust our initial belief about the weights).
2. The posterior distribution becomes sharper (more peaked) around weight values that fit the data well and respect the prior.
3. The likelihood term pulls the weights toward values that fit the observed data, while the prior term pulls the weights toward zero or plausible values based on the prior.

Answer 37

- High uncertainty: the prior precision goes to 0.
- The influence of the prior vanishes, so the **posterior relies entirely on the data**, making the posterior mean equivalent to standard frequentist linear regression (OLS); it reverts to the **MLE**.
- There is **no regularization**, and the posterior is only influenced by observed data.

Answer 38

- Simplifies to the **ordinary least squares (OLS) solution**.
- m_N = (Φ^T Φ)^−1 Φ^T t
- This is the same as the maximum likelihood estimate (MLE) or least squares solution in traditional linear regression.

Answer 39

- Precision goes to infinity, and the **prior dominates entirely**.
- The posterior ignores the data, and predictions are determined solely by the prior.
- The posterior mean simplifies to **m_N = 0**, meaning the model is fully biased by the prior and does not learn from data.

Answer 40

- With infinite data, the posterior converges to the maximum likelihood estimate (MLE), and the impact of the prior diminishes.
- m_N = (Φ^T Φ)^−1 Φ^T t
- The **data dominates**, and the influence of the prior becomes negligible.
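
The three limiting cases (prior precision going to 0, prior precision going to infinity, and a very large dataset) can be approximated numerically. The sketch assumes the zero-mean prior N(0, α^−1 I), a straight-line model, and illustrative values; very small and very large α stand in for the exact limits.

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 25.0                                        # assumed noise precision
true_w = np.array([-0.3, 0.5])                     # assumed "true" weights

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    Phi = np.column_stack([np.ones_like(x), x])
    t = Phi @ true_w + rng.normal(0.0, beta ** -0.5, n)
    return Phi, t

def posterior_mean(alpha, Phi, t):
    SN_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    return beta * np.linalg.solve(SN_inv, Phi.T @ t)

def ols(Phi, t):
    return np.linalg.lstsq(Phi, t, rcond=None)[0]  # (Phi^T Phi)^-1 Phi^T t

Phi, t = make_data(20)
w_ols = ols(Phi, t)
w_flat_prior = posterior_mean(1e-8, Phi, t)        # alpha near 0: prior vanishes
w_tight_prior = posterior_mean(1e8, Phi, t)        # alpha very large: prior dominates

# Moderate prior, small versus large dataset: distance between posterior mean and MLE.
gap_small = np.linalg.norm(posterior_mean(2.0, Phi, t) - w_ols)
Phi_big, t_big = make_data(20000)
gap_big = np.linalg.norm(posterior_mean(2.0, Phi_big, t_big) - ols(Phi_big, t_big))
```

`w_flat_prior` matches the OLS weights, `w_tight_prior` is essentially zero, and `gap_big` is far smaller than `gap_small` because the data term grows with N while the prior term stays fixed.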

Answer 41

- p(t∣x,α,β) = N(t | m_N^T ϕ(x), σ^2_N(x))
- Models the distribution of t with a normal distribution whose mean m_N^T ϕ(x) comes from the posterior mean of the weights and whose variance is the predictive variance σ^2_N(x).

Answer 42

- σ^2_N(x) = 1/β + ϕ(x)^T S_N ϕ(x)
- This captures the inherent noise (1/β) and the model uncertainty (ϕ(x)^T S_N ϕ(x)).
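
A short sketch of the predictive mean and variance, assuming a straight-line basis ϕ(x) = (1, x), the zero-mean prior N(0, α^−1 I), and illustrative data; the function name `predictive` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0                            # assumed prior and noise precisions
x = rng.uniform(-1.0, 1.0, 15)
Phi = np.column_stack([np.ones_like(x), x])
t = Phi @ np.array([-0.3, 0.5]) + rng.normal(0.0, beta ** -0.5, x.size)

SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

def predictive(x_new):
    """Mean and variance of the predictive distribution at a new input x_new."""
    phi = np.array([1.0, x_new])
    mean = mN @ phi                                # m_N^T phi(x)
    var = 1.0 / beta + phi @ SN @ phi              # noise term + weight uncertainty
    return mean, var

mean_near, var_near = predictive(0.0)              # inside the observed input range
mean_far, var_far = predictive(5.0)                # far away from any observed data
```

The variance never drops below the noise floor 1/β, and it is larger at x = 5 than at x = 0 because the second term grows where the data say little about the line.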

Answer 43

1. A mean prediction (the most likely value).
2. A variance (uncertainty) for the new input x.

Answer 44

1. **Data Dependence**: The uncertainty is directly tied to the quantity and distribution of observed data points.
2. **Expressive Uncertainty**: It reflects high uncertainty in regions with sparse data and low uncertainty in densely populated regions.
3. **Gradual Learning**: As more data points become available, the posterior distribution sharpens, leading to more precise predictions and reduced uncertainty.

Answer 45

1. Maximum likelihood or least squares can lead to overfitting if complex models are trained on limited data.
2. Bayesian methods mitigate overfitting by quantifying uncertainty in the model parameters and averaging over it rather than committing to a single point estimate.

Answer 46

1. **Start with the Prior**: "We begin with a prior belief about model parameters, representing our knowledge before observing any data."
2. **Incorporate the Data**: "As data points are observed, the likelihood tells us how well each line fits the data. Combining it with the prior gives a new belief about the parameters."
3. **Predict with Uncertainty**: "For a new input, instead of predicting a single value, we provide a distribution that includes both the prediction and the uncertainty due to limited data."

Answer 47

because they involve minimizing the error related to a set of weights.

Answer 48

Convexity ensures that every local minimum is also a global minimum (and a strictly convex function has exactly one), so the optimization process cannot get stuck in a suboptimal solution.

Answer 49

1. The gradient indicates the direction of the steepest ascent.
2. Taking the negative gradient gives the direction of steepest descent.
3. Steps are taken in this direction, and the process is repeated until convergence.
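
These three steps can be sketched for a least-squares error. The data, the learning rate of 0.1, and the stopping threshold are illustrative assumptions, not values from the cards.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = Phi @ np.array([-0.3, 0.5]) + rng.normal(0.0, 0.2, x.size)

def gradient(w):
    """Gradient of E(w) = 1/(2N) * sum((Phi w - t)^2); points uphill."""
    return Phi.T @ (Phi @ w - t) / len(t)

w = np.zeros(2)
learning_rate = 0.1
for step in range(5000):
    g = gradient(w)
    if np.linalg.norm(g) < 1e-10:                  # converged: gradient is (almost) zero
        break
    w = w - learning_rate * g                      # step along the negative gradient

w_closed_form = np.linalg.lstsq(Phi, t, rcond=None)[0]
```

Because this error function is convex, the iterates end up at the same weights as the closed-form least-squares solution.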

Answer 50

1. The influence of the step size depends on the curvature of the parameter space.
2. In steep regions, a large step size may overshoot the minimum and cause oscillations, while in flat regions the same step size may result in slow convergence.

Answer 51

Feature scaling normalizes the features so that the error surface has similar curvature in all directions and one step size suits every weight, improving the efficiency and stability of the optimization process.
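
One way to see the effect is to compare the condition number of the least-squares curvature matrix before and after scaling. The two features and their ranges below are hypothetical, and z-score standardization is just one common choice of scaling.

```python
import numpy as np

rng = np.random.default_rng(6)
# Two hypothetical features on very different scales.
X = np.column_stack([rng.uniform(0.0, 1.0, 200), rng.uniform(0.0, 1000.0, 200)])

def curvature_condition_number(X):
    """Condition number of the least-squares Hessian X^T X / N."""
    return np.linalg.cond(X.T @ X / len(X))

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)    # z-score standardization
cond_raw = curvature_condition_number(X)
cond_scaled = curvature_condition_number(X_scaled)
```

A large condition number means the error surface is a long narrow valley, where a step size safe for the steep direction is far too small for the flat one; after scaling the number drops sharply.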

Answer 52

- w_j := w_j − α (∂E(w_0,w_1) / ∂w_j), where α here denotes the learning rate (step size), not the prior precision.

Answer 53

1. Calculate the updated values separately, without applying them yet, and store them in temporary variables.
2. Update all weights at the same time using the temporary values.

Answer 54

Storing updated values separately **ensures that each weight update is based on the original values**, preventing incorrect updates caused by using already updated weights.
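
The difference between a simultaneous and a one-at-a-time update shows up after a single step. The toy data and the helper names below are assumptions for illustration; the error is E(w_0, w_1) = 1/(2N) Σ (w_0 + w_1 x_n − t_n)^2.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])                 # toy inputs (assumed)
t = np.array([1.0, 3.0, 5.0, 7.0])                 # generated by t = 1 + 2x

def grad_w0(w0, w1):
    return np.mean(w0 + w1 * x - t)

def grad_w1(w0, w1):
    return np.mean((w0 + w1 * x - t) * x)

def simultaneous_step(w0, w1, lr):
    temp0 = w0 - lr * grad_w0(w0, w1)              # both temporaries use the old weights
    temp1 = w1 - lr * grad_w1(w0, w1)
    return temp0, temp1                            # assign both at once

def sequential_step(w0, w1, lr):
    w0 = w0 - lr * grad_w0(w0, w1)
    w1 = w1 - lr * grad_w1(w0, w1)                 # bug: sees the already-updated w0
    return w0, w1

w0, w1 = 0.0, 0.0
for _ in range(5000):
    w0, w1 = simultaneous_step(w0, w1, lr=0.1)
```

Starting from (0, 0), the two variants already disagree on w_1 after one step, and only the simultaneous version is a true gradient step. Repeating it recovers the generating line w_0 = 1, w_1 = 2.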

Answer 55

1. Too large: The values oscillate and may diverge.
2. Too small: Convergence becomes very slow.
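
Both failure modes appear even on the one-dimensional error E(w) = w^2. The three learning rates below are chosen purely for illustration.

```python
def run_gradient_descent(lr, w=1.0, steps=50):
    """Minimize E(w) = w^2 (gradient 2w) and return the final w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

w_too_large = run_gradient_descent(lr=1.1)         # every step overshoots and flips sign
w_too_small = run_gradient_descent(lr=0.001)       # barely moves in 50 steps
w_reasonable = run_gradient_descent(lr=0.3)        # shrinks quickly toward the minimum at 0
```

With `lr=1.1` each update multiplies w by −1.2, so the iterates oscillate with growing magnitude; with `lr=0.001` the factor is 0.998 and w is still close to its starting value after 50 steps.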

Answer 56

- Convexity is crucial because it **ensures global optimality**.
- In a convex parameter space, **gradient descent is guaranteed to find the global minimum**, provided the step size is small enough for it to converge.

Answer 57

The trajectory followed by gradient descent depends on the initialization point, which may lead to finding a local minimum instead of the global minimum.
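
The dependence on initialization is easy to reproduce on a small non-convex function. The quartic below is a made-up example with two minima, not a function from the lecture.

```python
def f(w):
    return w ** 4 - 3.0 * w ** 2 + w               # assumed non-convex toy function

def df(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0            # derivative of f

def descend(w, lr=0.01, steps=2000):
    for _ in range(steps):
        w = w - lr * df(w)
    return w

w_from_left = descend(-2.0)                        # ends in the minimum with w < 0
w_from_right = descend(2.0)                        # ends in the minimum with w > 0
```

Both runs stop at a point where the derivative is zero, but the run started at w = 2 settles in the shallower minimum, whose error f is higher than the one reached from w = −2.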

Answer 58

only in models with convex parameter spaces, such as linear models and logistic regression

(82 cards)