Regression Flashcards
What is the basic model for linear regression?
Y = f(X) + ε, where f is a linear function modeling E[Y|X], and ε is a noise term.
In the Bayesian framework, how are parameters typically estimated?
Using the posterior distribution, often with the maximum a posteriori (MAP) estimate.
What is the ordinary least squares (OLS) estimate?
θ̂ = arg min_θ Σ_i (y_i - f_θ(x_i))^2, which minimizes the squared error between predictions and observations.
How is the OLS solution calculated when X^T X has full rank?
θ̂ = (X^T X)^-1 X^T y
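A minimal NumPy sketch of this closed form, assuming X^T X is invertible (the data and variable names are purely illustrative):

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares via the normal equations.

    Assumes X has full column rank so that X^T X is invertible.
    """
    # Solve (X^T X) theta = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)
theta_hat = ols_fit(X, y)
```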
What is ridge regression and how does it differ from OLS?
Ridge regression adds a penalty term: θ̂(λ) = arg min_θ ||Xθ - y||^2 + λ||θ||^2, where λ is the regularization strength.
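A sketch of the corresponding ridge closed form, θ̂(λ) = (X^T X + λI)^{-1} X^T y (names illustrative, reusing the conventions of the OLS sketch above):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: minimizes ||X theta - y||^2 + lam * ||theta||^2."""
    p = X.shape[1]
    # Closed form: (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```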
What is a kernel function?
A function κ(x_i, x_j) = φ(x_i)^T φ(x_j), where φ is a feature map.
What is the “kernel trick”?
The ability to compute κ(x_i, x_j) without explicitly computing the feature vectors φ(x_i) and φ(x_j).
Name three example kernel functions
Linear kernel, polynomial kernel, and radial basis function (RBF) kernel.
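Minimal sketches of these three kernels for single input vectors (the degree, offset, and bandwidth values are arbitrary choices):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, degree=3, c=1.0):
    return (x @ z + c) ** degree

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))
```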
What are hyperparameters in kernel regression?
The parameters of the kernel function (e.g. the RBF bandwidth) and the regularization strength λ (in kernel ridge regression).
What is a feature map in kernel regression?
A feature map in kernel regression transforms the input data into a higher-dimensional space to make it easier to find linear relationships.
What is the basic idea behind random features?
The basic idea behind random features in kernel regression is to approximate the kernel function using a finite set of random projections to reduce computational complexity.
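A sketch of random Fourier features approximating the RBF kernel, following the Rahimi–Recht construction (the number of features, bandwidth, and seed are illustrative choices):

```python
import numpy as np

def random_fourier_features(X, n_features=200, gamma=1.0, seed=0):
    """Map X (n x d) to z(X) (n x n_features) so that
    z(x) @ z(x') approximates exp(-gamma * ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the spectral density of the RBF kernel (a Gaussian).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

A linear (or ridge) regression on these features then approximates kernel ridge regression at a cost that scales with n_features rather than with the number of samples squared.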
What is the main limitation of kernel methods for large datasets?
The kernel matrix grows quadratically with the number of samples.
What is kernel regression best suited for?
High-dimensional data points in moderately sized datasets.
What is the difference between parameters and hyperparameters in a model?
Parameters control the likelihood function, while hyperparameters parametrize the prior distribution in a Bayesian setting.
What is the Bayesian interpretation of the ridge regression penalty?
The penalty λ||θ||^2 corresponds to a zero-mean Gaussian prior on θ; the ridge estimate is then the MAP estimate under that prior.
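A short derivation sketch, assuming Gaussian noise ε ~ N(0, σ²) and prior θ ~ N(0, τ²I):

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_\theta \, p(\theta \mid y)
  = \arg\min_\theta \big[ -\log p(y \mid \theta) - \log p(\theta) \big]
  = \arg\min_\theta \frac{1}{2\sigma^2}\lVert X\theta - y\rVert^2
      + \frac{1}{2\tau^2}\lVert\theta\rVert^2,
```

which is ridge regression with λ = σ²/τ².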
In what case does ridge regression converge to the minimum ℓ2-norm OLS solution?
As λ approaches 0.
What is polynomial regression?
A form of linear regression on polynomial features: Y = θ_1 + θ_2 X + θ_3 X^2 + θ_4 X^3 + … + ε
What is the general form of regression in feature space?
Y = φ(X)θ + ε, where φ is a feature map.
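A minimal sketch of a polynomial feature map that can be plugged into the OLS/ridge formulas above (the degree is an arbitrary choice):

```python
import numpy as np

def poly_features(x, degree=3):
    """Map a 1-D input vector x to the feature matrix [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x).ravel()
    return np.vander(x, N=degree + 1, increasing=True)

# Polynomial regression = linear regression on these features, e.g.
# theta_hat = ridge_fit(poly_features(x), y, lam)
```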
What is the kernel matrix K?
K = φ(X)φ(X)^T, or K_ij = κ(x_i, x_j)
How is prediction made in kernel regression?
ŷ = Σ_i κ(x_i, x_new) η̂_i, where the x_i are training points and η̂_i are the estimated (dual) coefficients. Prediction therefore requires the full training set.
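A sketch of kernel ridge regression fitting and prediction under these formulas, taking κ to be the RBF kernel (the regularization strength and bandwidth are illustrative):

```python
import numpy as np

def kernel_matrix(A, B, gamma=1.0):
    """K_ij = exp(-gamma * ||a_i - b_j||^2), the RBF kernel matrix between rows of A and B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X_train, y_train, lam=1e-3, gamma=1.0):
    K = kernel_matrix(X_train, X_train, gamma)
    # Dual coefficients: eta_hat = (K + lam * I)^{-1} y
    return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

def kernel_ridge_predict(X_train, eta_hat, X_new, gamma=1.0):
    # y_hat = sum_i kappa(x_i, x_new) * eta_hat_i -- needs the full training set.
    return kernel_matrix(X_new, X_train, gamma) @ eta_hat
```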
What is the main advantage of random feature approximation?
It allows kernel methods to be applied to large datasets by reducing computational complexity.
In what case should one consider using linear regression in feature space instead of kernel regression?
When there is a small number of features and the data is sparse in feature space.
If p > n, what happens to the OLS estimate?
The OLS problem has infinitely many solutions; by convention we take the θ with minimal length (the minimum ℓ2-norm solution, obtained via a constrained Lagrangian formulation or the pseudoinverse).
What is Moore-Penrose pseudo inverse?
The Moore-Penrose pseudo inverse is a generalization of the matrix inverse that can be applied to non-square or singular matrices to solve linear least squares problems.
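A sketch of computing the minimum ℓ2-norm OLS solution with the pseudoinverse in NumPy, for the p > n case above (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                    # more features than samples: infinitely many OLS solutions
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# The Moore-Penrose pseudoinverse picks the minimum L2-norm solution.
theta_min_norm = np.linalg.pinv(X) @ y

# Check: it interpolates the training data (residual ~ 0 when X has rank n).
residual = np.linalg.norm(X @ theta_min_norm - y)
```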
Benefits of feature maps?
Allows linear models to capture non-linear relationships
Can significantly improve model performance on complex datasets
Keeps the simplicity and interpretability of linear models
Drawbacks of feature maps?
Can lead to overfitting if too many features are created
May increase computational complexity
Requires careful selection of appropriate feature maps