Test Flashcards
What is a Gaussian Process (GP)?
An (infinite) collection of random variables indexed by some set X.
For each x ∈ X there is a random variable f(x) such that for every finite subset A = {x1,…,xm} ⊆ X, the vector f_A = [f(x1),…,f(xm)] is jointly Gaussian: f_A ~ N(μ_A, K_A).
What does the likelihood function P(data | f) represent?
The probability of observing the data given the function f.
What is the posterior distribution P(f | data)?
The probability of the function f given the observed data.
What are the components of a Gaussian Process?
- Mean function (µ)
- Covariance (kernel) function (k)
What is the formula for the predictive distribution in Gaussian Processes?
f | x1,…,xm, y1,…,ym ~ GP(f; μ′, k′); the posterior is again a Gaussian Process, with updated mean μ′ and covariance k′.
What is the closed form formula for prediction in Gaussian Processes?
μ′(x) = μ(x) + k_{x,A} K_A^{-1} (y_A − μ_A)
k′(x, x′) = k(x, x′) − k_{x,A} K_A^{-1} k_{A,x′}
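The closed-form update above can be sketched in NumPy. This is a minimal illustration assuming a zero prior mean, an RBF kernel, and a small noise term; the helper names are illustrative, not from the cards:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between point sets a, b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X_A, y_A, X_star, noise=1e-2):
    """Posterior mean and covariance, assuming a zero prior mean."""
    K_A = rbf(X_A, X_A) + noise * np.eye(len(X_A))  # K_A plus noise jitter
    k_xA = rbf(X_star, X_A)                         # k_{x,A}
    mu_star = k_xA @ np.linalg.solve(K_A, y_A)      # k_{x,A} K_A^{-1} (y_A - mu_A)
    cov_star = rbf(X_star, X_star) - k_xA @ np.linalg.solve(K_A, k_xA.T)
    return mu_star, cov_star

X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X[:, 0])
mu, cov = gp_predict(X, y, X)   # near-interpolates the data at small noise
```

Note that the linear systems are solved rather than forming K_A^{-1} explicitly, which is both cheaper and numerically more stable.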
What is the purpose of optimizing kernel parameters in Gaussian Processes?
To improve predictive performance.
What is one method for optimizing hyperparameters in Gaussian Processes?
Cross-validation on predictive performance.
What is the Bayesian perspective on optimizing hyperparameters?
Maximize the marginal likelihood of the data.
What does maximizing marginal likelihood help with in Gaussian Processes?
It helps guard against overfitting.
What is an Empirical Bayes method?
Estimating a prior distribution from data by maximizing marginal likelihood.
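Empirical Bayes can be sketched numerically: evaluate the log marginal likelihood log p(y | X) = −½ yᵀK⁻¹y − ½ log|K| − (n/2) log 2π over candidate hyperparameters and keep the best. A minimal sketch assuming a zero-mean GP with an RBF kernel; the lengthscale grid and noise level are illustrative:

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, noise=0.1):
    """log p(y | X) for a zero-mean GP with an RBF kernel (sketch)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / lengthscale**2) + noise**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)        # log |K|, computed stably
    quad = y @ np.linalg.solve(K, y)        # y^T K^{-1} y
    return -0.5 * (quad + logdet + len(X) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (30, 1))
y = np.sin(2 * X[:, 0])
# Empirical Bayes: keep the hyperparameter with the largest marginal likelihood
grid = [0.05, 0.5, 5.0]
best = max(grid, key=lambda ell: log_marginal_likelihood(X, y, ell))
```

The data-fit term yᵀK⁻¹y and the complexity term log|K| trade off automatically, which is why maximizing the marginal likelihood guards against overfitting: a too-short lengthscale is penalized by complexity, a too-long one by poor data fit.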
What computational cost is associated with prediction using Gaussian Processes?
Θ(n^3) due to solving linear systems.
What are some basic approaches for accelerating Gaussian Process computations?
- Exploiting parallelism (GPU computations)
- Local GP methods
- Kernel function approximations (RFFs, QFFs, …)
- Inducing point methods (SoR, FITC, VFE, etc.)
True or False: The posterior covariance k’ in Gaussian Processes depends on the observed data yA.
False.
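This can be checked numerically: in the closed-form posterior, y_A enters the mean but not the covariance. A minimal sketch assuming a zero-mean GP with an RBF kernel (all names illustrative):

```python
import numpy as np

def gp_posterior(X, y, X_star, noise=0.1):
    """Closed-form GP posterior; note y appears only in the mean."""
    d2 = lambda a, b: ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    k = lambda a, b: np.exp(-0.5 * d2(a, b))        # RBF kernel, zero mean
    K = k(X, X) + noise**2 * np.eye(len(X))
    Ks = k(X_star, X)
    mu = Ks @ np.linalg.solve(K, y)                          # depends on y_A
    cov = k(X_star, X_star) - Ks @ np.linalg.solve(K, Ks.T)  # does not
    return mu, cov

rng = np.random.default_rng(0)
X, Xs = rng.uniform(0, 5, (10, 1)), rng.uniform(0, 5, (5, 1))
m1, cov1 = gp_posterior(X, np.zeros(10), Xs)
m2, cov2 = gp_posterior(X, rng.standard_normal(10), Xs)
# cov1 == cov2 even though the observations differ
```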
Fill in the blank: The covariance function in Gaussian Processes is also known as the ______.
[kernel function]
What is the effect of kernel parameters in Gaussian Processes?
They influence the shape and smoothness of the function being modeled.
What is the significance of the mean function in a Gaussian Process?
It represents the expected value of the function at each point.
What do inducing point methods in Gaussian Processes do?
They reduce the computational complexity by approximating the full GP model.
What is a common kernel function used in Gaussian Processes?
Squared exponential (Gaussian/RBF) kernel.
What is the relationship between Gaussian Processes and Bayesian linear regression in terms of computational complexity?
GP requires Θ(n^3), while Bayesian linear regression requires Θ(nd^2).
What is the computational cost of Bayesian linear regression with an m-dimensional feature map?
O(n m^2 + m^3) instead of O(n^3)
This is the cost of approximating the true kernel function with an explicit low-dimensional feature map and running Bayesian linear regression on it.
True or False: Fast GP methods do not address the cubic scaling in n.
True
Fast GP methods yield substantial speedup but still face cubic scaling issues.
What is a shift-invariant kernel?
A kernel 𝑘(𝑥, 𝑥’) is called shift-invariant if 𝑘(𝑥, 𝑥’) = 𝑘(𝑥 − 𝑥’)
What is the Fourier transform of a shift-invariant kernel?
k(x − x′) = ∫ p(ω) e^{iωᵀ(x − x′)} dω, where p(ω) is the kernel’s spectral density.
This relates the kernel to its frequency representation.
What is the key idea behind Random Fourier Features?
Interpret the kernel as an expectation: k(x − x′) = E_{ω,b}[2 cos(ωᵀx + b) cos(ωᵀx′ + b)], with ω ~ p(ω) and b ~ Unif[0, 2π].
What is the performance theorem for Random Fourier Features?
For a compact subset M ⊆ ℝ^d with diameter diam(M): Pr[sup_{x,x′ ∈ M} |z(x)ᵀz(x′) − k(x, x′)| ≥ ε] ≤ 2^8 (σ_p diam(M) / ε)^2 exp(−D ε^2 / (4(d + 2))), where D is the number of random features and σ_p^2 = E[ωᵀω].
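A minimal Random Fourier Features sketch for the unit-lengthscale Gaussian kernel, whose spectral density p(ω) is the standard normal; the feature count D and the seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 5000                 # input dimension, number of random features

# Sample frequencies from the spectral density p(omega), phases uniformly
omega = rng.standard_normal((D, d))
b = rng.uniform(0, 2 * np.pi, D)

def z(x):
    """Feature map with z(x) . z(x') ~= k(x - x') in expectation."""
    return np.sqrt(2.0 / D) * np.cos(x @ omega.T + b)

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-0.5 * np.sum((x1 - x2) ** 2))   # Gaussian kernel value
approx = float(z(x1) @ z(x2))                   # Monte Carlo approximation
```

Bayesian linear regression on the D-dimensional features z(x) then approximates the full GP at cost O(n D^2 + D^3) rather than O(n^3).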
What is the primary function of inducing point methods?
Summarize data via function values of f at a set of m inducing points.
Fill in the blank: A kernel k(x, x′) for x, x′ ∈ ℝ^d is called ______ if k(x, x′) = k(x − x′).
shift-invariant
What are examples of stationary kernels?
- Gaussian
- Exponential
- Cauchy
True or False: The Gaussian kernel (with unit lengthscale) has as its Fourier transform the standard Gaussian distribution in d dimensions.
True
What do local GP methods exploit?
Covariance functions that decay with distance of points.
What is the outcome of applying Bayesian linear regression with explicit feature maps?
It approximates Gaussian Processes.
What is the computational cost of Gaussian Processes inference?
Θ(n^3): it requires solving linear systems with the n × n kernel matrix.
What modern GP libraries implement parallel GP inference?
- GPflow
- GPyTorch
What is the main idea behind kernel function approximation?
Construct explicit low-dimensional feature map that approximates the true kernel function.
What is the relationship between inducing points and data summarization?
Inducing points help to summarize data by function values at a set of inducing points.
What does the theorem by Bochner state regarding shift-invariant kernels?
A shift-invariant kernel is positive definite if and only if its Fourier transform p(ω) is nonnegative (i.e., a valid spectral density up to scaling).
What does SoR stand for in the context of Gaussian processes?
Subset of Regressors
What does the SoR approximation replace in the Gaussian process training conditional?
The exact training conditional p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − Q_{f,f}) is replaced by q_SoR(f | u) = N(K_{f,u} K_{u,u}^{-1} u, 0), where Q_{f,f} = K_{f,u} K_{u,u}^{-1} K_{u,f}.
What is the resulting model from the SoR approximation?
A degenerate GP with covariance function k_SoR(x, x′) = k(x, u) K_{u,u}^{-1} k(u, x′).
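The SoR covariance is a Nyström-style low-rank approximation Q_{f,f} = K_{f,u} K_{u,u}^{-1} K_{u,f}. A small sketch (the inducing-point count, grid placement, and jitter are illustrative) checks how close it is to the full kernel matrix:

```python
import numpy as np

def rbf(a, b):
    """Unit-lengthscale squared-exponential kernel matrix."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, (100, 1))           # n = 100 training inputs
U = np.linspace(0, 5, 15)[:, None]        # m = 15 inducing points on a grid

K_uu = rbf(U, U) + 1e-8 * np.eye(len(U))  # jitter for numerical stability
K_fu = rbf(X, U)
Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)   # SoR covariance K_fu K_uu^{-1} K_uf
err = np.abs(rbf(X, X) - Q_ff).max()      # small when U covers the input region
```

Only the m × m matrix K_{u,u} is inverted, which is where the cubic-in-m, linear-in-n cost of SoR/FITC comes from.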
What does FITC stand for in Gaussian processes?
Fully independent training conditional
What does the FITC approximation replace in the Gaussian process training conditional?
The exact training conditional p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − Q_{f,f}) is replaced by q_FITC(f | u) = N(K_{f,u} K_{u,u}^{-1} u, diag[K_{f,f} − Q_{f,f}]).
What is the computational cost for inducing point methods SoR and FITC dominated by?
The cost of inverting K_{u,u}.
What is the computational cost’s relationship to the number of inducing points and data points?
Cubic in the number of inducing points, linear in the number of data points
What are some methods for picking inducing points?
- Chosen randomly from the data
- Chosen greedily according to some criterion (e.g., posterior variance)
- Placed on a deterministic grid (equally spaced in the domain)
How can inducing points be optimized?
By treating 𝒖 as hyperparameters and maximizing marginal likelihood of the data
What must be ensured about the inducing points 𝒖?
They must be representative of the data and where predictions are made
What is the relationship between Gaussian processes and Bayesian Linear Regression?
Gaussian processes = kernelized Bayesian Linear Regression
What can be computed in closed form with Gaussian processes?
Marginals / conditionals
How are hyperparameters optimized in Gaussian processes?
By maximizing the marginal likelihood
What exists for fast approximations to exact Gaussian process inference?
Various fast approximations: parallel inference, local GP methods, kernel function approximations, and inducing point methods.
Which chapters of ‘Gaussian Processes for ML’ by Rasmussen & Williams should be read?
- Chapter 2: 2.1.1-2.3
- Chapter 4: up to 4.2
Which paper provides a unifying view of sparse approximate Gaussian process regression?
Quiñonero-Candela & Rasmussen: ‘A Unifying View of Sparse Approximate Gaussian Process Regression’, JMLR 2005
Which paper discusses random features for large-scale kernel machines?
Rahimi & Recht: ‘Random Features for Large-Scale Kernel Machines’, NeurIPS 2007