Lecture 3 - Bayesian Inference Flashcards

1
Q

What is a common pitfall when using frequentist statistics?

A

Frequentist statistics often minimize error based on averages without accounting for the uncertainty around predictions.

2
Q

How is data modeled in the frequentist framework?

A
  • The full dataset is modeled as y(x,w)
  • A datapoint is modeled as a distribution where the mean is given by y(x_0, w) for an input x_0
3
Q

What does p(t∣x) represent in the frequentist framework?

A
  • For each fixed x, p(t∣x) represents the probability of observing t, which is assumed to be normally distributed.
4
Q

What is the definition of the expected loss function E[L] in machine learning?

A

The average loss over the joint distribution of inputs x and targets t, weighted by their probability and expressed as a double integral: E[L] = ∫∫ L(t, y(x)) p(x,t) dx dt.

5
Q

What does minimizing the expected loss function involve?

A

Taking the derivative with respect to the model’s prediction y(x) and setting it to zero.

6
Q

What result do you get when taking the derivative of the expected loss function with respect to y(x)?

A
  • 2 ∫ {y(x)-t} p(t|x) dt
  • indicates that adjustments should be made to reduce the difference between the prediction y(x) and the target t.
7
Q

How do you find the optimal prediction y(x) after taking the derivative of the expected loss function?

A

By setting the derivative to zero and solving, you find that the optimal prediction y(x) is the expected value of t given x, denoted as E[t∣x].
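
As a quick numerical check of this result, the sketch below compares the conditional mean with a brute-force search over candidate predictions. The discrete distribution (`targets`, `probs`) is a made-up stand-in for p(t∣x) at one fixed x, not something from the lecture.

```python
# Hypothetical discrete conditional distribution p(t | x) at one fixed x.
targets = [0.0, 1.0, 2.0, 5.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expected_squared_loss(y):
    """E[(y - t)^2 | x] for a candidate prediction y."""
    return sum(p * (y - t) ** 2 for t, p in zip(targets, probs))

# Conditional mean E[t | x].
cond_mean = sum(p * t for t, p in zip(targets, probs))

# Brute-force search over a grid of candidate predictions from 0.00 to 5.00.
candidates = [i / 100 for i in range(0, 501)]
best = min(candidates, key=expected_squared_loss)

print(cond_mean, best)
```

The grid minimizer and the conditional mean should coincide, which is what setting the derivative to zero predicts for squared loss.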

8
Q

If you do not use a model, how can you estimate the best value of t for a given x?

A

By taking the expected value of t at that fixed value of x.

9
Q

What is the purpose of adding zero in the form of −E[t∣x]+E[t∣x]=0 when decomposing the expected loss function?

A

The purpose is to separate the prediction y(x) from the true expected value E[t∣x], which serves as the best predictor of t given x.

10
Q

What happens to the cross term 2(y(x)−E[t∣x])(E[t∣x]−t) when taking the expectation of the expanded loss function?

A

The cross term disappears because E[t∣x]−t has zero mean.

11
Q

What are the two main components left after taking the expected value of the expanded loss function?

A
  1. (y(x)−E[t∣x])^2: Squared error of the model in approximating the true conditional mean E[t∣x].
  2. Var(t∣x): Inherent variance in the data, independent of the model (noise).
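
The add-zero trick and the vanishing cross term from the preceding cards can be written out as one short derivation. This is a sketch for the squared loss only; the notation follows the cards.

```latex
\begin{aligned}
\{y(x)-t\}^2 &= \{y(x)-\mathbb{E}[t\mid x]+\mathbb{E}[t\mid x]-t\}^2\\
&= \{y(x)-\mathbb{E}[t\mid x]\}^2
 + 2\{y(x)-\mathbb{E}[t\mid x]\}\{\mathbb{E}[t\mid x]-t\}
 + \{\mathbb{E}[t\mid x]-t\}^2,\\
\mathbb{E}[L] &= \int \{y(x)-\mathbb{E}[t\mid x]\}^2\,p(x)\,dx
 + \int \operatorname{var}[t\mid x]\,p(x)\,dx .
\end{aligned}
```

The middle term drops out after integrating over t because E[t∣x]−t has zero mean for every fixed x.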
12
Q

Why is variance related to t always present in the model?

A
  • Variance related to t is intrinsic noise in the random variable t and cannot be changed, meaning there will always be some noise-related error.
  • Var(t∣x)
13
Q

What are the components of the bias-variance decomposition of the expected loss?

A
  • (Bias)^2: The squared difference between the expected model prediction and the true value.
  • Variance: The variability of the model prediction.
  • Noise: The inherent variability in the data that cannot be controlled by changing the model.
14
Q

How can bias and variance be controlled in a model?

A
  • Model bias and variance are related to each other: decreasing one typically increases the other.
  • Bias can be reduced by increasing model complexity, but this may increase variance.
  • Variance can be reduced by simplifying the model, but this may increase bias.
  • In deep learning, it’s possible to reduce both bias and variance through techniques like regularization and large datasets.
15
Q

What is the final form of the expected loss after decomposition?

A

Expected loss = (Bias)^2 + Variance + Noise

16
Q

How does regularization affect the bias-variance tradeoff in a model?

A

Regularization controls the tradeoff between bias and variance, influencing the model’s performance by adjusting flexibility.

17
Q

What happens when the regularization parameter λ is high?

A
  1. Little model variance, meaning minimal difference in the estimates produced by different models (low variance).
  2. High bias, as the model is too rigid and far from the true function.
18
Q

What are the effects of low regularization (λ is low)?

A
  1. High variance because the model has more flexibility to fit the data.
  2. Low bias, as the model better approximates the true function.
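
The effect of high and low λ can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the "model" is a hypothetical ridge-style estimate of a single constant, the true value and noise level are made up, and bias and variance are estimated over an ensemble of simulated datasets (possible here only because the data-generating process is known).

```python
import random

def simulate(lam, n_datasets=2000, n_points=10, true_mean=2.0, noise_sd=1.0, seed=0):
    """Estimate bias^2 and variance of a shrunken mean over many datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_datasets):
        data = [rng.gauss(true_mean, noise_sd) for _ in range(n_points)]
        # Ridge-style estimate of a constant: minimises sum (t - w)^2 + lam * w^2.
        estimates.append(sum(data) / (n_points + lam))
    avg = sum(estimates) / n_datasets
    bias_sq = (avg - true_mean) ** 2
    variance = sum((e - avg) ** 2 for e in estimates) / n_datasets
    return bias_sq, variance

for lam in (0.0, 1.0, 10.0, 100.0):
    b2, var = simulate(lam)
    print(f"lambda={lam:6.1f}  bias^2={b2:.4f}  variance={var:.4f}")
```

As λ grows the estimates are pulled toward zero, so the squared bias rises while the spread across datasets shrinks.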
19
Q

What happens when the model is under-regularized (extremely low λ)?

A

The model exhibits even more variability (high variance), allowing it to be highly flexible, but with very little bias.

20
Q

What happens to bias as regularization decreases?

A

Bias decreases as regularization decreases because the model becomes more flexible and can better fit the data.

21
Q

What happens to variance as regularization decreases?

A

Variance increases as regularization decreases because the model becomes more complex and can overfit the data.

22
Q

What is the relationship between test error, bias, and variance?

A

The test error behaves similarly to the combined bias squared and variance curve, but at a slightly higher error value.

23
Q

What is the goal when tuning the regularization parameter?

A

The goal is to find a balance where the sum of squared bias and variance is minimized, leading to the lowest test error.

24
Q

How does model complexity affect bias and variance?

A

Model complexity increases variance but decreases bias, as more complex models can fit the data more closely.

25
Q

How does regularization affect bias and variance?

A

Regularization decreases variance but increases bias, as it simplifies the model and reduces its flexibility.

26
Q

What happens when you increase the number of coefficients or basis functions in a model (e.g., linear regression or neural networks)?

A

Increasing the number of coefficients or basis functions adds complexity, which increases variance but decreases bias.

27
Q

In the context of bias-variance tradeoff, what does the red line in the plot represent?

A

The red line represents a simple model (e.g., a flat line) that underfits the data, resulting in high bias.

28
Q

In the context of bias-variance tradeoff, what does the blue line in the plot represent?

A

The blue line represents a very complex model that overfits the data, resulting in high variance.

29
Q

What is the “sweet spot” of model complexity?

A

The sweet spot is a level of complexity where the bias and variance are balanced, minimizing the overall error, which can be achieved through proper regularization.

30
Q

Why is the bias-variance tradeoff valuable in machine learning?

A

It provides useful intuition for model selection and regularization by explaining the tradeoff between underfitting and overfitting.

31
Q

Why is the bias-variance tradeoff of limited practical value?

A
  • In theory, the bias-variance tradeoff relies on analyzing an ensemble of datasets to compute the exact bias and variance.
  • In real-world scenarios, we typically work with only a single dataset, not an ensemble.
  • Bias and variance therefore cannot be computed precisely from a single dataset, which makes the tradeoff difficult to measure or apply directly.
32
Q

What is Bayesian linear regression?

A

Bayesian linear regression is a type of linear regression that incorporates Bayesian principles, allowing you to quantify uncertainty in predictions by providing a distribution over possible parameters (weights).

33
Q

How does Bayesian regression differ from traditional linear regression?

A

Traditional linear regression provides a single “best” estimate for the parameters, whereas Bayesian regression provides a probability distribution over possible parameters, reflecting uncertainty.

34
Q

In Bayesian regression, how are weights w treated?

A

Weights w are treated as random variables with a distribution, and the goal is to find the posterior distribution of the weights.

35
Q

What is the formula for the posterior distribution in Bayesian regression?

A
  • The posterior distribution is proportional to the product of the likelihood and the prior
  • p(w∣t)∝p(t∣w)p(w)
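
The proportionality can be made concrete by evaluating likelihood times prior on a grid of candidate weights and normalizing. The sketch below assumes a single weight w with model t = w·x plus Gaussian noise; the data, `alpha`, and `beta` are illustrative values, not from the lecture.

```python
import math

# Hypothetical setup: one weight w, model t = w * x + Gaussian noise.
beta = 4.0    # noise precision (assumed known)
alpha = 1.0   # prior precision, prior is N(w | 0, 1/alpha)
xs = [0.5, 1.0, 1.5, 2.0]
ts = [0.9, 2.1, 2.9, 4.2]

grid = [i / 100 for i in range(-300, 501)]  # candidate weights from -3 to 5

def log_prior(w):
    return -0.5 * alpha * w * w

def log_likelihood(w):
    return -0.5 * beta * sum((t - w * x) ** 2 for x, t in zip(xs, ts))

# Unnormalised posterior = likelihood * prior, then normalise over the grid.
unnorm = [math.exp(log_likelihood(w) + log_prior(w)) for w in grid]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

grid_mean = sum(w * p for w, p in zip(grid, posterior))

# Closed-form posterior mean for this conjugate Gaussian case.
s_n = 1.0 / (alpha + beta * sum(x * x for x in xs))
m_n = beta * s_n * sum(x * t for x, t in zip(xs, ts))
print(grid_mean, m_n)
```

The normalizing constant is just the sum over the grid, which is why the posterior only needs to be known up to proportionality.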
36
Q

What is the role of the likelihood p(t∣w) in Bayesian regression?

A

The likelihood describes how well the data is explained given the current set of weights.

37
Q

What is the role of the prior p(w) in Bayesian regression?

A

The prior encodes our initial belief about the parameters before observing the data.

38
Q

What does the posterior p(w∣t) represent?

A

The posterior represents the probability of a set of weights given the observed data, providing an updated estimate of the weights based on the prior and likelihood.

39
Q

What is the key idea behind Bayesian regression?

A

The key idea is to use a probability distribution over the weights and update this distribution based on the observed data, resulting in both the data and weights following probability distributions.

40
Q

What is the purpose of using a conjugate prior distribution in Bayesian linear regression?

A

Using a conjugate prior simplifies the computation by ensuring that the posterior distribution is in the same family as the prior, allowing easy tracking of the mean and covariance.

41
Q

What happens if the prior and likelihood are conjugate in Bayesian inference?

A

The posterior is in the same family as the prior. With a Gaussian prior and a Gaussian likelihood, it is a normal distribution with mean m_N and covariance S_N.

42
Q

How is the prior distribution p(w) typically represented in Bayesian linear regression?

A
  • The prior is represented as a normal distribution p(w)=N(w|m_0, S_0)
  • m_0 is the prior mean and
  • S_0 is the prior covariance matrix.
43
Q

How is the posterior distribution represented in Bayesian linear regression?

A
  • The posterior is represented as a normal distribution p(w|t)=N(w|m_N, S_N)
  • m_N is the posterior mean and
  • S_N is the posterior covariance matrix.
44
Q

How is the posterior mean m_N calculated?

A

m_N = S_N (S_0^−1 m_0 + βΦ^T t)

45
Q

How is the posterior covariance S_N calculated?

A

S_N^−1 = S_0^−1 + βΦ^TΦ
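
A minimal sketch of these two update formulas, assuming a single weight so that m_N and S_N are scalars and Φ^TΦ reduces to a sum of squared features. The function name `posterior`, the feature values, and the precisions are hypothetical. It also shows that feeding the data in two steps, with the first posterior reused as the next prior, gives the same result as one batch update.

```python
def posterior(m0, s0, beta, phis, ts):
    """Scalar version of S_N^-1 = S_0^-1 + beta * Phi^T Phi and
    m_N = S_N (S_0^-1 m_0 + beta * Phi^T t)."""
    s_n_inv = 1.0 / s0 + beta * sum(p * p for p in phis)
    s_n = 1.0 / s_n_inv
    m_n = s_n * (m0 / s0 + beta * sum(p * t for p, t in zip(phis, ts)))
    return m_n, s_n

beta = 25.0                      # assumed noise precision
phis = [0.2, 0.9, -0.5, 1.4]     # hypothetical feature values phi(x_n)
ts = [0.1, 0.5, -0.2, 0.6]       # hypothetical targets

# All four points at once, starting from the prior N(w | 0, 2).
m_batch, s_batch = posterior(0.0, 2.0, beta, phis, ts)

# Same data in two steps: the first posterior becomes the next prior.
m1, s1 = posterior(0.0, 2.0, beta, phis[:2], ts[:2])
m_seq, s_seq = posterior(m1, s1, beta, phis[2:], ts[2:])

print(m_batch, s_batch)
print(m_seq, s_seq)
```

The variance shrinks at each step because every data point adds a non-negative term to S_N^−1.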

46
Q

Why do we want the posterior to shift its mean m_N and shrink its covariance S_N as data is observed?

A

To ensure that the model converges to an optimal set of weights with reduced uncertainty, improving confidence in predictions.

47
Q

What do the posterior mean m_N and posterior covariance S_N depend on?

A

The prior parameters m_0 and S_0, the noise precision β, and the observed data (Φ and t).

48
Q

What is a common choice for the prior in Bayesian linear regression?

A
  • A common prior is p(w)=N(w∣0,α^−1 I)
  • The mean of the weights is zero and the covariance is α^−1 multiplied by the identity matrix
49
Q

What happens before observing any data (initial prior)?

A
  1. Likelihood: Irrelevant, as no data points have been observed.
  2. Prior/Posterior: Broad circular distribution centered around zero, indicating high uncertainty.
  3. Data Space: The possible lines span a wide range of slopes and intercepts with no specific fit.
50
Q

How does the posterior update after observing one data point?

A
  1. Likelihood: Influences the posterior, forming a diagonal ridge based on the first data point.
  2. Prior/Posterior: Becomes more concentrated, reducing uncertainty.
  3. Data Space: Possible lines start converging toward the correct slope and intercept.
51
Q

What changes after observing two data points?

A
  1. Likelihood: Becomes narrower and more informative, further restricting possible values.
  2. Prior/Posterior: More concentrated, increasing confidence in weight estimates.
  3. Data Space: Lines converge closer toward the actual relationship, though some uncertainty remains.
52
Q

What happens after observing 20 data points?

A
  1. Likelihood: Highly concentrated, showing certainty about the correct slope and intercept.
  2. Prior/Posterior: Tightly concentrated around the true values, with low uncertainty.
  3. Data Space: Lines cluster closely around the true relationship, indicating a near-perfect fit.
53
Q

What are the key takeaways from Bayesian linear regression’s updating process?

A
  1. Prior Influence: Initially dominates, with high uncertainty.
  2. Data Updates: As more data points are observed, the posterior becomes more concentrated.
  3. Confidence Grows with Data: Confidence in the weight estimates increases with more data.
  4. Role of Likelihood: Each data point provides new information, refining the prior to create the posterior.
54
Q

How does Bayesian linear regression adjust predictions with more data?

A
  • The process reduces uncertainty about weights by updating the posterior as more data is observed.
  • Initially, there’s high uncertainty, but the posterior narrows with each data point, leading to more confident and accurate predictions.
55
Q

How is the log of the posterior distribution expressed in Bayesian linear regression?

A

ln p(w∣t) = −(β/2) Σ_n (t_n − w^T ϕ(x_n))^2 − (α/2) w^T w + const

56
Q

What does the log-likelihood term in the log posterior do?

A
  • −(β/2) Σ_n (t_n − w^T ϕ(x_n))^2
  • Penalizes the difference between the predicted value w^T ϕ(x_n) and the actual observed value t_n.
  • the further the prediction deviates from the data, the lower the posterior probability for that particular set of weights.
57
Q

What is the role of the prior term in the log of the posterior?

A
  • − α/2 w^Tw
  • penalizes large weights, ensuring the model does not stray too far from the belief that weights should be small.
  • This introduces (free) regularization, which helps prevent overfitting.
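
To see why the prior term acts as regularization, the log posterior can be rewritten as a regularized sum-of-squares error. A short derivation, assuming the zero-mean isotropic prior N(w∣0, α^−1 I) and dividing through by β:

```latex
\begin{aligned}
\ln p(\mathbf{w}\mid\mathbf{t}) &= -\frac{\beta}{2}\sum_{n=1}^{N}\bigl(t_n-\mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n)\bigr)^2
 -\frac{\alpha}{2}\mathbf{w}^{\mathsf T}\mathbf{w}+\text{const},\\
\arg\max_{\mathbf{w}}\ \ln p(\mathbf{w}\mid\mathbf{t}) &= \arg\min_{\mathbf{w}}\
 \frac{1}{2}\sum_{n=1}^{N}\bigl(t_n-\mathbf{w}^{\mathsf T}\boldsymbol{\phi}(x_n)\bigr)^2
 +\frac{\lambda}{2}\mathbf{w}^{\mathsf T}\mathbf{w},
 \qquad \lambda=\frac{\alpha}{\beta}.
\end{aligned}
```

So the weights that maximize the posterior are the ridge-regression weights with regularization parameter λ = α/β.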
58
Q

How does the strength of the prior affect the model?

A

A stronger prior (larger α) discourages large weights more strongly, so it acts as stronger regularization and pulls the posterior toward the prior mean of zero.

59
Q

Why can the constant term in the log of the posterior be ignored when maximizing the posterior?

A

It is unrelated to w, so it does not affect the optimization process and can be ignored when working with relative probabilities

60
Q

What are the key takeaways from the log of the posterior formula?

A
  1. The log of the posterior is a combination of the likelihood (how well the model fits the data) and the prior (how much we trust our initial belief about the weights).
  2. The posterior distribution becomes sharper (more peaked) around weight values that fit the data well and respect the prior.
  3. The likelihood term pulls the weights toward values that fit the observed data, while the prior term pulls the weights toward zero or plausible values based on the prior.
61
Q

What happens in Bayesian linear regression if there is no prior information (α→0)?

A
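  • High prior uncertainty: the prior precision α goes to 0, so the prior becomes infinitely broad.
  • The influence of the prior vanishes, so the posterior relies entirely on the data, and the posterior mean reverts to the maximum likelihood estimate (MLE), which is the standard frequentist least-squares (OLS) solution.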
  • There is no regularization, and the posterior is only influenced by observed data.
62
Q

What is the posterior mean in the absence of prior information? (α→0)

A
  • simplifies to the ordinary least squares (OLS) solution
  • m_N = (Φ^T Φ)^−1 Φ^T t
  • This is the same as the maximum likelihood estimate (MLE) or least squares solution in traditional linear regression.
63
Q

What happens if the prior information is very precise (α→∞)?

A
  • Precision goes to infinity, and the prior dominates entirely.
  • The posterior ignores the data, and predictions are determined solely by the prior.
  • The posterior mean simplifies to m_N=0, meaning the model is fully biased by the prior and does not learn from data.
64
Q

What happens if there is infinite data (N→∞) in Bayesian regression?

A
  • With infinite data, the posterior converges to the maximum likelihood estimate (MLE), and the impact of the prior diminishes.
  • m_N = (Φ^T Φ)^−1 Φ^T t
  • The data dominates, and the influence of the prior becomes negligible.
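
The three limiting cases (α→0, α→∞, N→∞) can be checked numerically. This is a sketch for a single weight with ϕ(x) = x and a zero-mean prior; the function name `posterior_mean`, the data, and the value of `beta` are assumptions made for illustration.

```python
def posterior_mean(alpha, beta, xs, ts):
    """Posterior mean for a single weight with prior N(w | 0, 1/alpha)."""
    s_n = 1.0 / (alpha + beta * sum(x * x for x in xs))
    return beta * s_n * sum(x * t for x, t in zip(xs, ts))

beta = 10.0                       # assumed noise precision
xs = [1.0, 2.0, 3.0]              # hypothetical inputs, phi(x) = x
ts = [2.1, 3.9, 6.2]              # hypothetical targets

# Least-squares / maximum-likelihood solution for comparison.
w_ols = sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)

for alpha in (1e-9, 1.0, 1e9):
    print(f"alpha={alpha:g}  m_N={posterior_mean(alpha, beta, xs, ts):.6f}")

# More data with a fixed prior: repeat the dataset to mimic large N.
m_big = posterior_mean(1.0, beta, xs * 1000, ts * 1000)
print(f"w_ols={w_ols:.6f}  m_N with 3000 points={m_big:.6f}")
```

With a tiny α the posterior mean matches least squares, with a huge α it collapses to the prior mean of zero, and with many data points the fixed prior barely matters.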
65
Q

How is the predictive distribution for a new input x computed in Bayesian linear regression?

A

p(t∣x,α,β)= N(t|m_N^T ϕ(x),σ^2_N(x))

  • uses a normal distribution with the posterior mean and the variance prediction to model the distribution of t
66
Q

How is the predictive variance σ^2_N(x) computed?

A
  • σ^2_N(x) = 1/β + ϕ(x)^T S_N ϕ(x)
  • this captures the inherent noise (1/β) and model uncertainty (ϕ(x)^T S_N ϕ(x))
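
A minimal sketch of the predictive mean and variance, assuming a straight-line model with ϕ(x) = (1, x) and the zero-mean prior N(w∣0, α^−1 I). The 2×2 inverse is written out by hand to keep the example dependency-free; the function names, data, and precisions are hypothetical.

```python
def fit_posterior(alpha, beta, xs, ts):
    """Posterior N(w | m_N, S_N) for phi(x) = (1, x) and prior N(w | 0, I / alpha)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    # S_N^-1 = alpha * I + beta * Phi^T Phi, written out as a 2x2 matrix.
    a, b, d = alpha + beta * n, beta * sx, alpha + beta * sxx
    det = a * d - b * b
    s_n = [[d / det, -b / det], [-b / det, a / det]]
    # m_N = beta * S_N Phi^T t
    v0, v1 = beta * sum(ts), beta * sum(x * t for x, t in zip(xs, ts))
    m_n = [s_n[0][0] * v0 + s_n[0][1] * v1, s_n[1][0] * v0 + s_n[1][1] * v1]
    return m_n, s_n

def predict(x, m_n, s_n, beta):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    mean = m_n[0] + m_n[1] * x
    model_var = s_n[0][0] + 2 * s_n[0][1] * x + s_n[1][1] * x * x
    return mean, 1.0 / beta + model_var

alpha, beta = 2.0, 25.0                  # assumed precisions
xs = [-0.2, 0.0, 0.1, 0.3]               # hypothetical inputs clustered near 0
ts = [0.4, 0.5, 0.55, 0.7]               # hypothetical targets
m_n, s_n = fit_posterior(alpha, beta, xs, ts)

for x in (0.0, 1.0, 5.0):
    mean, var = predict(x, m_n, s_n, beta)
    print(f"x={x:4.1f}  mean={mean:.3f}  variance={var:.4f}")
```

The variance never drops below the noise floor 1/β, and it grows as x moves away from the region where data was observed.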
67
Q

What does the predictive distribution p(t∣x) provide?

A
  1. a mean prediction (the most likely value)
  2. a variance (uncertainty) for the new input x.
68
Q

What does Bayesian linear regression capture in terms of uncertainty?

A
  1. Data Dependence: The uncertainty is directly tied to the quantity and distribution of observed data points.
  2. Expressive Uncertainty: It reflects high uncertainty in regions with sparse data and low uncertainty in densely populated regions.
  3. Gradual Learning: As more data points become available, the posterior distribution sharpens, leading to more precise predictions and reduced uncertainty.
69
Q

What are the conclusions about Bayesian linear regression versus frequentist approaches?

A
  1. Maximum likelihood or least squares can lead to overfitting if complex models are trained on limited data.
  2. Bayesian methods avoid overfitting by quantifying uncertainty in model parameters.
70
Q

How can you explain the Bayesian approach for exam success?

A
  1. Start with the Prior:
    “We begin with a prior belief about model parameters, representing our knowledge before observing any data.”
  2. Incorporate the Data:
    “As data points are observed, the likelihood tells us how well each line fits the data. Combining it with the prior gives a new belief about the parameters.”
  3. Predict with Uncertainty:
    “For a new input, instead of predicting a single value, we provide a distribution that includes both the prediction and the uncertainty due to limited data.”
71
Q

Why are machine learning problems considered optimization problems?

A

Because they involve minimizing an error function with respect to a set of weights.

72
Q

What does the convexity of the parameter space imply in linear models?

A

Convexity ensures that there is a single global optimal solution, meaning the optimization process is guaranteed to find the best solution.

73
Q

How does the gradient guide the optimization process?

A
  1. The gradient indicates the direction of the steepest ascent.
  2. Taking the negative gradient gives the direction of steepest descent.
  3. Steps are taken in this direction, and the process is repeated until convergence.
74
Q

How does the step size affect the optimization process?

A
  1. The influence of the step size depends on the curvature of the parameter space.
  2. In steep regions, a large step size may overshoot the minimum and cause oscillations, while in flat regions the same step size may result in slow convergence.
75
Q

How does feature scaling help in optimization?

A

Feature scaling normalizes the data so that step sizes in all directions are approximately the same, improving the efficiency and stability of the optimization process.

76
Q

What is the general update rule for gradient descent?

A
  • w_j := w_j −α (∂E(w_0,w_1) / ∂w_j)
77
Q

How do you correctly update weights in gradient descent?

A
  1. Calculate the updated values separately without updating them yet and store them in temporary values
  2. Update all weights at the same time using the temporary values
78
Q

Why is it important to store the updated values separately during gradient descent?

A

Storing updated values separately ensures that each weight update is based on the original values, preventing incorrect updates caused by using already updated weights.
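
The simultaneous update can be sketched as follows, assuming a two-weight linear model and a mean-squared-error function E(w_0, w_1). The data, the variable names (`temp0`, `temp1`, `step`), and the iteration count are illustrative choices.

```python
def gradient(w0, w1, xs, ts):
    """Partial derivatives of E(w0, w1) = 1/(2N) * sum (w0 + w1*x - t)^2."""
    n = len(xs)
    errors = [w0 + w1 * x - t for x, t in zip(xs, ts)]
    d_w0 = sum(errors) / n
    d_w1 = sum(e * x for e, x in zip(errors, xs)) / n
    return d_w0, d_w1

xs = [0.0, 1.0, 2.0, 3.0]        # hypothetical inputs
ts = [1.0, 3.0, 5.0, 7.0]        # generated from t = 1 + 2x, no noise
w0, w1 = 0.0, 0.0
step = 0.1                       # learning rate (the alpha in the update rule)

for _ in range(5000):
    d_w0, d_w1 = gradient(w0, w1, xs, ts)   # both gradients use the old weights
    temp0 = w0 - step * d_w0
    temp1 = w1 - step * d_w1
    w0, w1 = temp0, temp1                   # simultaneous update

print(w0, w1)
```

Computing both gradients before touching either weight is what keeps the step pointing along the true negative gradient at the current point.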

79
Q

Why is step size important in gradient descent?

A
  1. Too large: The values oscillate and may diverge.
  2. Too small: Convergence becomes very slow.
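
Both failure modes show up even on a one-dimensional example. The sketch below runs gradient descent on the toy function E(w) = w^2, chosen only for illustration, with three hypothetical step sizes.

```python
def run(step, n_steps=50, w=1.0):
    """Gradient descent on E(w) = w^2, whose gradient is 2w."""
    history = [w]
    for _ in range(n_steps):
        w = w - step * 2 * w
        history.append(w)
    return history

too_large = run(1.1)    # each step multiplies w by (1 - 2.2) = -1.2
too_small = run(0.001)  # each step multiplies w by 0.998
reasonable = run(0.3)   # each step multiplies w by 0.4

print(abs(too_large[-1]), abs(too_small[-1]), abs(reasonable[-1]))
```

With the large step the iterate flips sign and grows every iteration, while with the tiny step it has barely moved from its starting point after 50 iterations.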
80
Q

Why is convexity important in optimization problems?

A
  • Convexity is crucial because it ensures global optimality.
  • In a convex parameter space, gradient descent with a suitable step size converges to the global minimum.
81
Q

What challenge arises when optimizing a non-convex parameter space?

A

The trajectory followed by gradient descent depends on the initialization point, so it may end in a local minimum instead of the global minimum.

82
Q

In which models does gradient descent guarantee global optimality?

A

Only in models with convex parameter spaces, such as linear models and logistic regression.