Bayesian Deep Learning Flashcards
Lecture 11
What is the motivation for a Bayesian approach in deep learning?
We want not only a classification and the confidence of a specific class relative to the other classes, but also the uncertainty of that confidence. The uncertainty in the parameters is propagated into the uncertainty of the predictions.
Simply put, it is a statistical approach to modelling the network's certainty, which is useful in any safety-critical neural network application.
How is uncertainty modeled in Bayesian classification?
The uncertainty of a prediction/classification is the variance of the predictive distribution, i.e. of p(y|x, w) averaged over the posterior p(w|D).
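As a worked form of this (assuming D denotes the training data), the predictive distribution averages p(y|x, w) over the posterior, and its variance is the reported uncertainty:

$$p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw$$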
Mention some other applications where a Bayesian approach can be used.
- Natural interpretation for regularization
- Model selection
- Input data selection (active learning)
Why does Narada go on a mathematical spree in his lecture when applying the Bayesian approach to deep learning?
- The denominator of the Bayesian posterior, ∫ p(Y|X, w) p(w) dw, is intractable, so we want methods for estimating this denominator (see the sketch below).
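A sketch of the Bayes' rule being applied to the weights, assuming the training data is D = (X, Y):

$$p(w \mid D) = \frac{p(Y \mid X, w)\, p(w)}{\int p(Y \mid X, w')\, p(w')\, dw'}$$

The integral in the denominator runs over the full weight space, which is what makes it intractable for deep networks.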
Mention some approaches for estimating the denominator of the Bayesian approach.
- Monte Carlo techniques (MCMC - Markov Chain Monte Carlo)
- Variational inference
- Introducing random elements in training (dropout)
Why do we bother trying to calculate p(w|D) when we are actually interested in the variance of p(y|x,w)?
Because the variance of the prediction depends on the posterior p(w|D) both directly and indirectly through the mean of the prediction, so we need p(w|D) to compute it.
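One way to make the "directly and through the mean" statement concrete is the law of total variance applied to the prediction (a general identity, not a formula quoted from the lecture):

$$\operatorname{Var}(y \mid x, D) = \mathbb{E}_{p(w \mid D)}\big[\operatorname{Var}(y \mid x, w)\big] + \operatorname{Var}_{p(w \mid D)}\big(\mathbb{E}[y \mid x, w]\big)$$

Both terms require the posterior p(w|D): the first averages over it directly, the second measures how the mean prediction varies under it.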
What is the goal of ELBO?
The same as the other approaches (e.g. the Monte Carlo approaches), i.e. to find (an approximation of) p(w|D).
What is the idea of ELBO?
Find a distribution q(w) such that q(w) is approximately p(w|D)
Formulate the ELBO approach (high-level formula).
argmin over q(w) of KL(q(w) || p(w|D)), i.e. find the parameters of q such that q(w) and p(w|D) are as similar as possible.
What is the problem with ELBO?
The high-level formulation uses p(w|D), the very distribution we are trying to approximate, inside the KL term, so it cannot be evaluated directly.
How do we actually calculate ELBO?
Note that the KL divergence and another term, E_{q(w)}[ln(p(w,D)/q(w))] (the ELBO), sum to a constant, ln p(D). Thus maximising the ELBO, which only involves the tractable joint p(w,D) = p(Y|X,w) p(w), is the same as minimising the KL divergence.
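A sketch of that decomposition in symbols:

$$\ln p(D) = \underbrace{\mathbb{E}_{q(w)}\Big[\ln \tfrac{p(w, D)}{q(w)}\Big]}_{\text{ELBO}} + \operatorname{KL}\big(q(w) \,\|\, p(w \mid D)\big)$$

Since ln p(D) does not depend on q, pushing the ELBO up necessarily pushes the KL term down.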
Mention and explain a practical Bayesian approach for modeling uncertainty in your network.
Apply dropout during the testing phase of a neural network: feed the same input through the network several times and calculate the variance of the outputs. This can be viewed as a Bayesian approach, since dropout can be seen as sampling the weight space; in particular, most of the samples will lie close to our local minimum, i.e. approximately sampling from p(w|D).
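A minimal sketch of this Monte Carlo dropout procedure, assuming PyTorch; the architecture, dropout rate, and number of forward passes below are illustrative choices, not values from the lecture:

```python
import torch
import torch.nn as nn

# Toy classifier with a dropout layer (illustrative sizes).
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # kept active at test time for MC dropout
    nn.Linear(64, 3),
)

model.train()            # keep dropout "on" during testing (MC dropout)

x = torch.randn(1, 10)   # a single test input
num_passes = 100         # number of stochastic forward passes

with torch.no_grad():
    # Each pass samples a different dropout mask, i.e. a different weight configuration.
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(num_passes)]
    )

mean_prediction = probs.mean(dim=0)  # approximate predictive distribution
uncertainty = probs.var(dim=0)       # spread across passes = model uncertainty
```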
What is the idea behind Monte Carlo techniques, and what are their problems?
We cannot sum/integrate over all possible weights, so we sample instead. The problem is that the samples often come from unimportant areas of the posterior distribution.
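A sketch of the Monte Carlo estimate of the predictive distribution, with S weight samples assumed drawn from the posterior:

$$p(y \mid x, D) \approx \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, w_s), \qquad w_s \sim p(w \mid D)$$

The estimate is only as good as the samples: if they miss the high-probability regions of p(w|D), the estimate is poor.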