11 Convex Optimization Flashcards
Extreme Values
If U ⊆ R^n is open and f: U → R is a twice continuously differentiable function of
multiple variables x = (x_1, …, x_n), then
* ∇f(x′) = 0 is a …
* The symmetric matrix H_f(x) = (∂²f(x) / ∂x_i ∂x_j)_{i,j} (Hessian) exists.
* ∇f(x′) = 0 and −H_f(x′) is … (or H_f(x′) negative definite) is a sufficient condition that x′ is a local maximum.
negative definite → …
positive definite → …
indefinite → Eigenvalues < 0 and > 0
positive semidefinite → Eigenvalues ≥ 0 → inconclusive
necessary condition for π₯β² to be a local minimum or maximum.
positive definite
Eigenvalues < 0 → maximum
Eigenvalues > 0 → minimum
saddle point
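A tiny NumPy sketch of the eigenvalue test above; the example function f(x, y) = x² − y² and its critical point (0, 0) are illustrative, not from the slides:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2; it is constant, so this is also H_f at the critical point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eig = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric Hessian

if np.all(eig > 0):
    print("positive definite -> local minimum")
elif np.all(eig < 0):
    print("negative definite -> local maximum")
elif np.any(eig > 0) and np.any(eig < 0):
    print("indefinite -> saddle point")   # this branch fires for the example above
else:
    print("semidefinite -> inconclusive")
```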
Write down the condition for convex functions btw :)
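As a reminder (standard definition; try to recall it before reading): f is convex on a convex domain if
f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y) for all x, y in the domain and all θ ∈ [0, 1].
For twice continuously differentiable f, this is equivalent to the Hessian H_f(x) being positive semidefinite everywhere.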
Momentum gradient descent: WHY and HOW?
Gradient descent keeps no memory of the past.
This can lead to inefficiencies.
x_{k+1} = x_k + v_k
v_k = −α ∇f(x_k) + γ v_{k−1}   (γ is the momentum constant)
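A minimal NumPy sketch of these updates; the quadratic objective, step size α = 0.1, and momentum constant γ = 0.9 are illustrative assumptions:

```python
import numpy as np

def grad_f(x):
    # gradient of the illustrative objective f(x) = 0.5 * ||x||^2
    return x

alpha, gamma = 0.1, 0.9      # step size and momentum constant (example values)
x = np.array([5.0, -3.0])    # starting point
v = np.zeros_like(x)

for k in range(200):
    v = -alpha * grad_f(x) + gamma * v   # v_k = -alpha * grad f(x_k) + gamma * v_{k-1}
    x = x + v                            # x_{k+1} = x_k + v_k

print(x)  # converges toward the minimizer (0, 0)
```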
Stochastic versus Batch gradient descent
In batch gradient descent, the gradient is determined based on the average of the gradients
for all n data points in a data set. An iteration or gradient update considers the entire data set.
The difference to vanilla gradient descent is that the gradient is taken for a single, randomly chosen data point i. The parameter vector x_{k+1} is then updated using that specific data point's gradient.
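A rough NumPy sketch of both updates for least squares (the toy data, step size, and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # toy data set, n = 100 points
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_point(w, i):
    # gradient of the squared error for data point i only
    return 2.0 * (X[i] @ w - y[i]) * X[i]

alpha = 0.01
w_batch, w_sgd = np.zeros(3), np.zeros(3)

for k in range(200):
    # batch: average the gradients over all n data points per iteration
    w_batch -= alpha * np.mean([grad_point(w_batch, i) for i in range(len(y))], axis=0)
    # stochastic: one randomly chosen data point i per iteration
    i = rng.integers(len(y))
    w_sgd -= alpha * grad_point(w_sgd, i)

print(w_batch, w_sgd)   # both head toward the true coefficients (1, -2, 0.5); the SGD estimate is noisier
```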
How did we deal with high-dimensional regression problems and many (possibly irrelevant) variables?
Subset Selection Methods
Find the globally optimal model, i.e. the best subset of independent variables: best subset regression (too computationally expensive)
Greedy search for the optimal model (practical):
* Forward stepwise selection
→ Begin with the empty set and sequentially add predictors (a short code sketch follows this list)
* Backward stepwise selection
→ Begin with the full model and sequentially delete predictors
* Stepwise selection: combination of forward and backward moves
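A hypothetical sketch of forward stepwise selection (greedy; scored here by training RSS for brevity, whereas in practice candidate models are compared via cross-validation, AIC/BIC, or adjusted R²):

```python
import numpy as np

def rss(X, y, cols):
    # OLS fit on the chosen columns, returning the residual sum of squares
    beta = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    return np.sum((y - X[:, cols] @ beta) ** 2)

def forward_stepwise(X, y, max_vars):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        # greedily add the predictor that reduces the RSS the most
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward stepwise selection works analogously, starting from the full model and greedily removing the predictor whose deletion hurts the fit least.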
Shrinkage (Regularization) Methods
The subset selection methods use OLS to fit a linear model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using a technique that …
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that …
Gradient descent can be used to find …
Why regularization?
constrains or regularizes the coefficient estimates (i.e. shrinks the coefficient estimates towards zero).
shrinking the coefficient estimates can significantly reduce their variance.
the parameters of regularized OLS estimators.
to find biased estimators with smaller MSE, i.e. a good bias-variance trade-off → ridge/lasso/subset selection.
Ridge regression
Write down the formula (see the sketch below).
SSR and penalty are both …
As λ increases, the …
Thus, when λ is extremely large, then … Best to apply ridge after standardizing the predictors (why?)
Cross-check with slide 36. CONVEX.
standardized ridge regression coefficients shrink to zero.
all of the ridge coefficient estimates are basically zero, i.e. the null model that contains no predictors.
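For the "write down the formula" card above, a sketch of the standard ridge objective (notation assumed; cross-check with slide 36):

minimize over β:  Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij})² + λ Σ_{j=1}^{p} β_j²,   with tuning parameter λ ≥ 0,

i.e. SSR plus an ℓ2 penalty (the intercept β_0 is usually left unpenalized). In matrix form, dropping the intercept, the minimizer is β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy.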
How to select tuning parameter lambda in ridge regression?
ridge regression estimates will be … than the OLS ones but have lower variance. This means, …
Ridge regression will work best in situations where …
Difference with Lasso?
Select a grid of potential values; use cross-validation to estimate the error rate on test data (for each value of λ) and select the value that gives the smallest error rate.
Finally, the model is re-fit using all of the available observations and the selected value of λ (a code sketch follows below).
more biased; it does not fit the training data as well as the OLS estimator, but might do better on unseen test data.
the OLS estimates have high variance.
Similar ideas can be applied to logistic regression.
Penalizes the absolute value of β instead of β² → the lasso has the effect of forcing some of the coefficients to be exactly equal to zero when the tuning parameter λ is sufficiently large. Feature selection method. More interpretable models.
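A scikit-learn sketch of the grid-plus-cross-validation recipe for λ described above (grid and fold count are arbitrary example choices; sklearn calls the penalty parameter alpha):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

lambdas = np.logspace(-3, 3, 50)        # grid of candidate penalty values
model = make_pipeline(
    StandardScaler(),                   # ridge is best applied to standardized predictors
    RidgeCV(alphas=lambdas, cv=5),      # 5-fold CV picks the value with the smallest estimated error
)
# with training data X, y:
# model.fit(X, y)                       # fit() already re-fits on all observations with the chosen value
# chosen_lambda = model[-1].alpha_
```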
Lasso vs. Ridge Regression
The lasso has a major advantage over ridge regression, in that it p…
The lasso leads to qualitatively similar behavior to ridge regression, in that as …
The lasso can generate …
Cross-validation can be used in order to determine which …
Subgradient methods can be used to compute regression coefficients with an ℓ1 regularizer in the lasso. Proximal gradient methods are even more effective (a soft-thresholding sketch follows after these cards).
produces simpler and more interpretable models that involve only a subset of predictors.
λ increases, the variance decreases and the bias increases.
more accurate predictions compared to ridge regression.
approach is better on a particular data set.
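A rough NumPy sketch of the proximal gradient (ISTA) iteration mentioned above; the step-size choice and fixed iteration count are simplifying assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1, applied elementwise
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2        # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)                   # gradient of 0.5 * ||y - X beta||^2
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

For a sufficiently large lam some entries of beta come out exactly zero, which is the feature-selection effect of the lasso noted above.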
How did we compute the parameters of a logistic regression?
Go through the maths on slides 43-49 on your own, independently as much as possible.
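Since the slides are not reproduced here, a generic sketch of fitting by maximizing the log-likelihood with gradient ascent (the slides may instead derive Newton's method / IRLS; labels y are assumed to be in {0, 1}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iter=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)              # predicted probabilities
        grad = X.T @ (y - p)            # gradient of the log-likelihood
        w += alpha * grad / len(y)      # averaged gradient ascent step
    return w
```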
Perceptron Error
We can't do gradient descent (see next class) with step functions, as …
f(x) = max(0, x)
it is non-differentiable at one point and otherwise the gradient is zero.
is the ReLU activation function
The sigmoid leads to a greater-than-zero error even on correct classifications
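A quick numeric illustration of that last card (the input value 2.0 and the target 1 are made up; whatever error function the slides use, it stays nonzero because the sigmoid output never reaches the target exactly):

```python
import numpy as np

z = 2.0                            # pre-activation of a correctly classified point with target 1
p = 1.0 / (1.0 + np.exp(-z))       # sigmoid output ~0.88: correct side of 0.5, but not exactly 1
print(p, -np.log(p))               # e.g. a cross-entropy style error ~0.13 is still > 0
```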
e) In our example, the training data are perfectly linearly separable. This is not always the case (e.g. assume we add point (8, 2) with class yi = 1 to the training data) and the optimization problem becomes infeasible. Which techniques could you use to extend SVMs to such data?
A classic approach is to penalize points on the "wrong side" of the hyperplane by some factor p in the objective function. One such modified objective function would be:
min_{z ≥ 0, β, β_0}  ‖β‖ + p Σ_i z_i   subject to   y_i(β^T x_i + β_0) + z_i ≥ 1 for all i
Alternatively, one could use other regularization techniques, or non-linear SVM classifiers (using the
so-called kernel trick).
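A scikit-learn sketch of both extensions; the C values are arbitrary, and sklearn's soft-margin (hinge-loss) formulation is equivalent in spirit to, though not literally identical with, the penalized objective above:

```python
from sklearn.svm import SVC

# soft margin: C plays the role of the penalty factor p on the slack variables z_i
linear_soft = SVC(kernel="linear", C=1.0)

# kernel trick: fit a non-linear boundary via an implicit feature map
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# with the training data X, y (including e.g. the extra point (8, 2) with label 1):
# linear_soft.fit(X, y)
# rbf_svm.fit(X, y)
```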
Review SVMs tutorial before exam!