Quiz 5 Flashcards
True or False
In Support Vector Machines, we maximize (║w║^2)/2 subject to the margin constraints.
False
True or False
In kernelized SVMs, the kernel matrix K has to be positive semidefinite.
True
True or False
If two random variables are independent, then they have to be uncorrelated.
True
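The converse fails, which is worth pinning down with a concrete (made-up) example: an uncorrelated pair that is clearly dependent.

```python
# Sketch (hypothetical example): independence implies zero correlation,
# but the converse fails. Take X uniform on {-1, 0, 1} and Y = X^2.
# Then E[X] = 0 and E[XY] = E[X^3] = 0, so Cov(X, Y) = 0,
# yet Y is a deterministic function of X (clearly dependent).

from fractions import Fraction

support = [-1, 0, 1]      # X uniform on these values
p = Fraction(1, 3)        # P(X = x) for each x

e_x  = sum(p * x        for x in support)   # E[X]
e_y  = sum(p * x * x    for x in support)   # E[Y] = E[X^2]
e_xy = sum(p * x * x**2 for x in support)   # E[XY] = E[X^3]

cov = e_xy - e_x * e_y
print(cov)  # 0 -> uncorrelated, despite Y = X^2 depending on X
```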
True or False
Isocontours of Gaussian distributions have axes whose lengths are proportional to the eigenvalues of the
covariance matrix.
False
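The axis lengths actually scale with the square roots of the eigenvalues; a small sketch with a made-up diagonal covariance:

```python
# Sketch: for covariance Sigma = diag(9, 1), the eigenvalues are 9 and 1.
# The isocontour x^T Sigma^{-1} x = c is the ellipse x1^2/9 + x2^2 = c,
# with semi-axes sqrt(9c) and sqrt(c): axis lengths scale with the SQUARE
# ROOTS of the eigenvalues (3:1), not the eigenvalues themselves (9:1).

import math

eigvals = [9.0, 1.0]   # eigenvalues of the diagonal covariance
c = 2.0                # any contour level works; the ratio is independent of c

semi_axes = [math.sqrt(lam * c) for lam in eigvals]
ratio = semi_axes[0] / semi_axes[1]
print(ratio)  # ~3.0, the ratio of sqrt-eigenvalues, not 9.0
```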
True or False
Cross validation will guarantee that our model does not overfit.
False
True or False
In logistic regression, the Hessian of the (non-regularized) log likelihood is positive definite.
False
True or False
Given a binary classification scenario with Gaussian class conditionals and equal prior probabilities, the
optimal decision boundary will be linear.
False
True or False
The hyperparameters in the regularized logistic regression model are η (learning rate) and λ (regularization
term).
False
The Bayes risk for a decision problem is zero when…
the class distributions P(X|Y) do not overlap and the prior probability for one class is 1.
Gaussian discriminant analysis…
models P(Y = y|X) as a logistic function, is an example of a generative model and can be used to classify points without ever computing an exponential.
Ridge regression…
reduces variance at the expense of higher bias.
Logistic regression…
minimizes a convex cost function and can be used with a polynomial kernel.
In least-squares linear regression, imposing a Gaussian prior on the weights is equivalent to…
L2 regularization
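A sketch of why, via MAP estimation (σ is the noise scale, τ the prior scale):

```latex
% MAP with Gaussian likelihood y_i ~ N(w^T x_i, sigma^2) and prior w ~ N(0, tau^2 I):
\hat{w} = \arg\max_w \; \log P(y \mid X, w) + \log P(w)
        = \arg\min_w \; \frac{1}{2\sigma^2}\sum_i (y_i - w^\top x_i)^2
                      + \frac{1}{2\tau^2}\lVert w \rVert_2^2,
% i.e. ridge regression with \lambda = \sigma^2 / \tau^2.
```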
In terms of the bias-variance trade-off, which of the following is/are substantially more harmful to the test error than the training error?
Bias
Loss
Variance
Risk
Variance
In Gaussian discriminant analysis, if two classes come from Gaussian distributions that have different means, may or may not have different covariance matrices, and may or may not have different priors, what are some of the possible decision boundary shapes?
1) hyperplane
2) a nonlinear quadric surface (quadric = the isosurface of a quadratic function)
3) the empty set (the classifier always returns the same class)
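These shapes fall out of the quadratic discriminant functions; a sketch of the argument:

```latex
% QDA decision rule: pick the class C maximizing the quadratic discriminant
Q_C(x) = -\tfrac{1}{2}(x-\mu_C)^\top \Sigma_C^{-1}(x-\mu_C)
         -\tfrac{1}{2}\ln\lvert\Sigma_C\rvert + \ln \pi_C.
% The boundary Q_1(x) = Q_2(x) is a quadric in general; the quadratic terms
% cancel when \Sigma_1 = \Sigma_2, leaving a hyperplane. If one class's
% discriminant dominates everywhere, the boundary is the empty set.
```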
Why might we prefer to minimize the sum of absolute residuals instead of the residual sum of squares for some data sets?
(Hint: What is one of the
flaws of least-squares regression?)
The sum of absolute residuals is less sensitive to outliers than the residual sum of squares.
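A minimal illustration with made-up data, fitting a constant model (where least squares gives the mean and least absolute deviations gives the median):

```python
# Sketch (hypothetical data): fit a constant model to y = [1, 2, 3, 100].
# The least-squares fit is the mean; the least-absolute-residuals fit is
# the median. One outlier drags the mean far more than the median.

import statistics

y = [1, 2, 3, 100]

ls_fit  = statistics.mean(y)    # minimizes sum of squared residuals -> 26.5
lad_fit = statistics.median(y)  # minimizes sum of absolute residuals -> 2.5

print(ls_fit, lad_fit)  # 26.5 2.5
```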
You train a linear classifier on 10,000 training points and discover that the training accuracy is only 67%. Which
of the following, done in isolation, has a good chance of improving your training accuracy?
Add novel features
Train on more data
Use linear regression
Train on less data
Add novel features
Train on less data
In least-squares linear regression, adding a regularization term can…
increase training error, and either increase or decrease validation error.
Recall the data model, yi = f(Xi) + εi, that justifies the least-squares cost function in regression.
The statistical assumptions of this model, for all i, are…
εi comes from a Gaussian distribution, all εi have the same mean, and all yi have the same variance.
How does ridge regression compare to linear regression with respect to the bias-variance tradeoff?
Ridge regression usually has higher bias, and its variance approaches zero as the regularization parameter λ → ∞.
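A sketch in one dimension, where ridge has a simple closed form (data are made up for illustration):

```python
# Sketch: in one dimension, ridge regression has the closed form
#   w(lambda) = sum(x_i * y_i) / (sum(x_i^2) + lambda).
# As lambda -> infinity, w -> 0, so the predictor becomes the constant 0
# regardless of the training sample: its variance across training sets
# vanishes, while its bias grows.

xs = [1.0, 2.0, 3.0]   # hypothetical inputs
ys = [1.1, 1.9, 3.2]   # hypothetical targets

def ridge_1d(lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

for lam in [0.0, 1.0, 100.0, 1e6]:
    print(lam, ridge_1d(lam))  # weight shrinks toward 0 as lambda grows
```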
Which of the following quantities affect the bias-variance trade-off?
λ, the regularization coefficient in ridge regression
C, the slack parameter in soft-margin SVM
d, the polynomial degree in least-squares regression
MLE, applied to estimate the mean parameter of a normal distribution N(μ, Σ) with a known covariance matrix Σ, returns…
the mean of the sample points.
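A sketch of the derivation:

```latex
% Log-likelihood of n samples x_1, ..., x_n under N(mu, Sigma), Sigma known:
\ell(\mu) = -\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^\top \Sigma^{-1}(x_i-\mu) + \text{const}
% Setting the gradient to zero:
\nabla_\mu \ell = \Sigma^{-1}\sum_{i=1}^{n}(x_i - \mu) = 0
\;\Longrightarrow\;
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i.
```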
Maximizing the log likelihood is equivalent to…
maximizing the likelihood.
What is the maximum number of points in the Bayes optimal decision boundary?
(Note: as the distribution is
discrete, we are really asking for the maximum number of integral values of k where the classifier makes a transition from predicting one class to the other.)
As f is linear in k, there is only one root, and the decision boundary is a single point.