Test Flashcards
What is a Gaussian Process (GP)?
An (infinite) collection of random variables indexed by some set X.
For each x ∈ X there is a random variable f(x) such that for every finite subset A = {x1,…,xm} ⊆ X, the vector f_A = [f(x1),…,f(xm)] is jointly Gaussian: f_A ~ N(μ_A, K_A).
What does the likelihood function P(data | f) represent?
The probability of observing the data given the function f.
What is the posterior distribution P(f | data)?
The probability of the function f given the observed data.
What are the components of a Gaussian Process?
- Mean function (µ)
- Covariance (kernel) function (k)
What is the formula for the predictive distribution in Gaussian Processes?
f | x1,…,xm, y1,…,ym ~ GP(f; μ′, k′); the posterior is again a Gaussian Process, with updated mean μ′ and covariance k′.
What is the closed form formula for prediction in Gaussian Processes?
μ′(x) = μ(x) + k_{x,A} K_A^{-1} (y_A − μ_A)
k′(x, x′) = k(x, x′) − k_{x,A} K_A^{-1} k_{A,x′}
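The closed-form update above can be sketched in NumPy. This is a minimal illustration assuming a zero prior mean, an RBF kernel, and a small noise term; the helper names are illustrative, not from the cards:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between point sets a, b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X_A, y_A, X_star, noise=1e-2):
    """Posterior mean and covariance, assuming a zero prior mean."""
    K_A = rbf(X_A, X_A) + noise * np.eye(len(X_A))  # K_A plus noise jitter
    k_xA = rbf(X_star, X_A)                         # k_{x,A}
    mu_star = k_xA @ np.linalg.solve(K_A, y_A)      # k_{x,A} K_A^{-1} (y_A - mu_A)
    cov_star = rbf(X_star, X_star) - k_xA @ np.linalg.solve(K_A, k_xA.T)
    return mu_star, cov_star

X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X[:, 0])
mu, cov = gp_predict(X, y, X)   # near-interpolates the data at small noise
```

Note that the linear systems are solved rather than forming K_A^{-1} explicitly, which is both cheaper and numerically more stable.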
What is the purpose of optimizing kernel parameters in Gaussian Processes?
To improve predictive performance.
What is one method for optimizing hyperparameters in Gaussian Processes?
Cross-validation on predictive performance.
What is the Bayesian perspective on optimizing hyperparameters?
Maximize the marginal likelihood of the data.
What does maximizing marginal likelihood help with in Gaussian Processes?
It helps guard against overfitting.
What is an Empirical Bayes method?
Estimating a prior distribution from data by maximizing marginal likelihood.
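Empirical Bayes can be sketched numerically: evaluate the log marginal likelihood log p(y | X) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log 2π over candidate hyperparameters and keep the best. A minimal sketch assuming a zero-mean GP with an RBF kernel; the lengthscale grid and noise level are illustrative:

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, noise=0.1):
    """log p(y | X) for a zero-mean GP with an RBF kernel (sketch)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / lengthscale**2) + noise**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)        # log |K|, computed stably
    quad = y @ np.linalg.solve(K, y)        # y^T K^{-1} y
    return -0.5 * (quad + logdet + len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (30, 1))
y = np.sin(2 * X[:, 0])
# Empirical Bayes: keep the hyperparameter with the largest marginal likelihood
grid = [0.05, 0.5, 5.0]
best = max(grid, key=lambda ell: log_marginal_likelihood(X, y, ell))
```

The data-fit term yᵀK⁻¹y and the complexity term log|K| trade off automatically, which is why maximizing the marginal likelihood guards against overfitting: a too-short lengthscale is penalized by complexity, a too-long one by poor data fit.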
What computational cost is associated with prediction using Gaussian Processes?
Θ(n^3) due to solving linear systems.
What are some basic approaches for accelerating Gaussian Process computations?
- Exploiting parallelism (GPU computations)
- Local GP methods
- Kernel function approximations (RFFs, QFFs, …)
- Inducing point methods (SoR, FITC, VFE, etc.)
True or False: The posterior covariance k’ in Gaussian Processes depends on the observed data yA.
False.
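This can be checked numerically: in the closed-form posterior, y_A enters the mean but not the covariance. A minimal sketch assuming a zero-mean GP with an RBF kernel (all names illustrative):

```python
import numpy as np

def gp_posterior(X, y, X_star, noise=0.1):
    """Closed-form GP posterior; note y appears only in the mean."""
    d2 = lambda a, b: ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    k = lambda a, b: np.exp(-0.5 * d2(a, b))        # RBF kernel, zero mean
    K = k(X, X) + noise**2 * np.eye(len(X))
    Ks = k(X_star, X)
    mu = Ks @ np.linalg.solve(K, y)                          # depends on y_A
    cov = k(X_star, X_star) - Ks @ np.linalg.solve(K, Ks.T)  # does not
    return mu, cov

rng = np.random.default_rng(0)
X, Xs = rng.uniform(0, 5, (10, 1)), rng.uniform(0, 5, (5, 1))
m1, cov1 = gp_posterior(X, np.zeros(10), Xs)
m2, cov2 = gp_posterior(X, rng.standard_normal(10), Xs)
# cov1 == cov2 even though the observations differ
```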
Fill in the blank: The covariance function in Gaussian Processes is also known as the ______.
[kernel function]
What is the effect of kernel parameters in Gaussian Processes?
They influence the shape and smoothness of the function being modeled.
What is the significance of the mean function in a Gaussian Process?
It represents the expected value of the function at each point.
What do inducing point methods in Gaussian Processes do?
They reduce the computational complexity by approximating the full GP model.
What is a common kernel function used in Gaussian Processes?
Squared exponential (Gaussian/RBF) kernel.
What is the relationship between Gaussian Processes and Bayesian linear regression in terms of computational complexity?
GP requires Θ(n^3), while Bayesian linear regression requires Θ(nd^2).
What is the computational cost of Bayesian linear regression with an m-dimensional feature map?
O(n m^2 + m^3) instead of O(n^3)
This is the cost of approximating the true kernel function with an explicit low-dimensional feature map and running Bayesian linear regression on it.
True or False: Fast GP methods do not address the cubic scaling in n.
True
Fast GP methods yield substantial speedup but still face cubic scaling issues.
What is a shift-invariant kernel?
A kernel 𝑘(𝑥, 𝑥’) is called shift-invariant if 𝑘(𝑥, 𝑥’) = 𝑘(𝑥 − 𝑥’)
What is the Fourier transform of a shift-invariant kernel?
k(x − x′) = ∫ p(ω) e^{iωᵀ(x − x′)} dω, where p(ω) is the kernel’s spectral density.
This relates the kernel to its frequency representation.
What is the key idea behind Random Fourier Features?
Interpret the kernel as an expectation: k(x − x′) = E_{ω,b}[2 cos(ωᵀx + b) cos(ωᵀx′ + b)], with ω ~ p(ω) and b ~ Unif[0, 2π].
What is the performance theorem for Random Fourier Features?
For a compact subset M ⊆ ℝ^d with diameter diam(M): Pr[sup_{x,x′ ∈ M} |z(x)ᵀz(x′) − k(x, x′)| ≥ ε] ≤ 2^8 (σ_p diam(M) / ε)^2 exp(−D ε^2 / (4(d + 2))), where D is the number of random features and σ_p^2 = E[ωᵀω].
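A minimal Random Fourier Features sketch for the unit-lengthscale Gaussian kernel, whose spectral density p(ω) is the standard normal; the feature count D and the seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 5000                 # input dimension, number of random features

# Sample frequencies from the spectral density p(omega), phases uniformly
omega = rng.standard_normal((D, d))
b = rng.uniform(0, 2 * np.pi, D)

def z(x):
    """Feature map with z(x) . z(x') ~= k(x - x') in expectation."""
    return np.sqrt(2.0 / D) * np.cos(x @ omega.T + b)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-0.5 * np.sum((x1 - x2) ** 2))   # Gaussian kernel value
approx = float(z(x1) @ z(x2))                   # Monte Carlo approximation
```

Bayesian linear regression on the D-dimensional features z(x) then approximates the full GP at cost O(n D^2 + D^3) rather than O(n^3).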
What is the primary function of inducing point methods?
Summarize data via function values of f at a set of m inducing points.
Fill in the blank: A kernel k(x, x′) for x, x′ ∈ ℝ^d is called ______ if k(x, x′) = k(x − x′).
shift-invariant
What are examples of stationary kernels?
- Gaussian
- Exponential
- Cauchy
True or False: The Gaussian kernel (with unit lengthscale) has as its Fourier transform the standard Gaussian distribution in d dimensions.
True
What do local GP methods exploit?
Covariance functions that decay with distance of points.
What is the outcome of applying Bayesian linear regression with explicit feature maps?
It approximates Gaussian Processes.
What is the computational cost of Gaussian Processes inference?
Θ(n^3): it requires solving linear systems with the n × n kernel matrix.
What modern GP libraries implement parallel GP inference?
- GPflow
- GPyTorch
What is the main idea behind kernel function approximation?
Construct explicit low-dimensional feature map that approximates the true kernel function.
What is the relationship between inducing points and data summarization?
Inducing points help to summarize data by function values at a set of inducing points.
What does the theorem by Bochner state regarding shift-invariant kernels?
A shift-invariant kernel is positive definite if and only if its Fourier transform p(ω) is nonnegative (i.e., a valid spectral density up to scaling).
What does SoR stand for in the context of Gaussian processes?
Subset of Regressors
What does the SoR approximation replace in the Gaussian process training conditional?
The exact training conditional p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − Q_{f,f}) is replaced by q_SoR(f | u) = N(K_{f,u} K_{u,u}^{-1} u, 0), where Q_{f,f} = K_{f,u} K_{u,u}^{-1} K_{u,f}.
What is the resulting model from the SoR approximation?
A degenerate GP with covariance function k_SoR(x, x′) = k(x, u) K_{u,u}^{-1} k(u, x′).
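The SoR covariance is a Nyström-style low-rank approximation Q_{f,f} = K_{f,u} K_{u,u}^{-1} K_{u,f}. A small sketch (the inducing-point count, grid placement, and jitter are illustrative) checks how close it is to the full kernel matrix:

```python
import numpy as np

def rbf(a, b):
    """Unit-lengthscale squared-exponential kernel matrix."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, (100, 1))           # n = 100 training inputs
U = np.linspace(0, 5, 15)[:, None]        # m = 15 inducing points on a grid

K_uu = rbf(U, U) + 1e-8 * np.eye(len(U))  # jitter for numerical stability
K_fu = rbf(X, U)
Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)   # SoR covariance K_fu K_uu^{-1} K_uf
err = np.abs(rbf(X, X) - Q_ff).max()      # small when U covers the input region
```

Only the m × m matrix K_{u,u} is inverted, which is where the cubic-in-m, linear-in-n cost of SoR/FITC comes from.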
What does FITC stand for in Gaussian processes?
Fully independent training conditional
What does the FITC approximation replace in the Gaussian process training conditional?
The exact training conditional p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − Q_{f,f}) is replaced by q_FITC(f | u) = N(K_{f,u} K_{u,u}^{-1} u, diag[K_{f,f} − Q_{f,f}]).
What is the computational cost for inducing point methods SoR and FITC dominated by?
The cost of inverting K_{u,u}.
What is the computational cost’s relationship to the number of inducing points and data points?
Cubic in the number of inducing points, linear in the number of data points
What are some methods for picking inducing points?
- Chosen randomly from the data
- Chosen greedily according to some criterion (e.g., posterior variance)
- Placed on a deterministic grid (equally spaced in the domain)
How can inducing points be optimized?
By treating 𝒖 as hyperparameters and maximizing marginal likelihood of the data
What must be ensured about the inducing points 𝒖?
They must be representative of the data and where predictions are made
What is the relationship between Gaussian processes and Bayesian Linear Regression?
Gaussian processes = kernelized Bayesian Linear Regression
What can be computed in closed form with Gaussian processes?
Marginals / conditionals
How are hyperparameters optimized in Gaussian processes?
By maximizing the marginal likelihood
What exists for fast approximations to exact Gaussian process inference?
Various fast approximations: parallel inference, local GP methods, kernel function approximations, and inducing point methods.
Which chapters of ‘Gaussian Processes for ML’ by Rasmussen & Williams should be read?
- Chapter 2: 2.1.1-2.3
- Chapter 4: up to 4.2
Which paper provides a unifying view of sparse approximate Gaussian process regression?
Quiñonero-Candela & Rasmussen: ‘A Unifying View of Sparse Approximate Gaussian Process Regression’, JMLR 2005
Which paper discusses random features for large-scale kernel machines?
Rahimi & Recht: ‘Random Features for Large-Scale Kernel Machines’, NeurIPS 2007