11 Convex Optimization Flashcards
Extreme Values
If U ⊆ R^n is open and f: U → R is a twice continuously differentiable function of
multiple variables x = (x_1, …, x_n), then
* ∇f(x′) = 0 is a …
* The symmetric matrix H_f(x) = (∂²f(x) / ∂x_i ∂x_j)_{i,j} (Hessian) exists.
* ∇f(x′) = 0 and −H_f(x′) is … (or H_f(x′) negative definite) is a sufficient condition that x′ is a local maximum.
negative definite → …
positive definite → …
indefinite → Eigenvalues < 0 and > 0
positive semidefinite → Eigenvalues ≥ 0 → inconclusive
necessary condition for π₯β² to be a local minimum or maximum.
positive definite
Eigenvalues < 0 → maximum
Eigenvalues > 0 → minimum
saddle point
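A tiny NumPy sketch of the eigenvalue test above; the example function f(x, y) = x² − y² and its critical point (0, 0) are illustrative, not from the slides:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2; it is constant, so this is also H_f at the critical point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eig = np.linalg.eigvalsh(H)  # eigenvalues of the symmetric Hessian

if np.all(eig > 0):
    print("positive definite -> local minimum")
elif np.all(eig < 0):
    print("negative definite -> local maximum")
elif np.any(eig > 0) and np.any(eig < 0):
    print("indefinite -> saddle point")   # this branch fires for the example above
else:
    print("semidefinite -> inconclusive")
```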
Write down the condition for convex functions btw :)
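As a reminder (standard definition; try to recall it before reading): f is convex on a convex domain if
f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y) for all x, y in the domain and all θ ∈ [0, 1].
For twice continuously differentiable f, this is equivalent to the Hessian H_f(x) being positive semidefinite everywhere.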
Momentum gradient descent: WHY and HOW?
Gradient descent keeps no memory of the past.
This can lead to inefficiencies.
x_{k+1} = x_k + v_k
v_k = −α ∇f(x_k) + γ v_{k−1}   (γ is the momentum constant)
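A minimal NumPy sketch of these updates; the quadratic objective, step size α = 0.1, and momentum constant γ = 0.9 are illustrative assumptions:

```python
import numpy as np

def grad_f(x):
    # gradient of the illustrative objective f(x) = 0.5 * ||x||^2
    return x

alpha, gamma = 0.1, 0.9      # step size and momentum constant (example values)
x = np.array([5.0, -3.0])    # starting point
v = np.zeros_like(x)

for k in range(200):
    v = -alpha * grad_f(x) + gamma * v   # v_k = -alpha * grad f(x_k) + gamma * v_{k-1}
    x = x + v                            # x_{k+1} = x_k + v_k

print(x)  # converges toward the minimizer (0, 0)
```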
Stochastic versus Batch gradient descent
In batch gradient descent, the gradient is determined based on the average of the gradients
for all n data points in a data set. An iteration or gradient update considers the entire data set.
The difference to vanilla gradient descent is that the gradient is taken for a single, randomly chosen data point i. The parameter vector x_{k+1} is then updated using that specific data point's gradient.
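A rough NumPy sketch of both updates for least squares (the toy data, step size, and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # toy data set, n = 100 points
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_point(w, i):
    # gradient of the squared error for data point i only
    return 2.0 * (X[i] @ w - y[i]) * X[i]

alpha = 0.01
w_batch, w_sgd = np.zeros(3), np.zeros(3)

for k in range(200):
    # batch: average the gradients over all n data points per iteration
    w_batch -= alpha * np.mean([grad_point(w_batch, i) for i in range(len(y))], axis=0)
    # stochastic: one randomly chosen data point i per iteration
    i = rng.integers(len(y))
    w_sgd -= alpha * grad_point(w_sgd, i)

print(w_batch, w_sgd)   # both head toward the true coefficients (1, -2, 0.5); the SGD estimate is noisier
```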
How did we deal with high-dimensional regression problems and many (possibly irrelevant) variables?
Subset Selection Methods
Find the globally optimal model, i.e. the best subset of independent variables: best subset regression (too computationally expensive)
Greedy search for the optimal model (practical):
* Forward stepwise selection
→ Begin with the empty set and sequentially add predictors (a short code sketch follows this list)
* Backward stepwise selection
→ Begin with the full model and sequentially delete predictors
* Stepwise selection: combination of forward and backward moves
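A hypothetical sketch of forward stepwise selection (greedy; scored here by training RSS for brevity, whereas in practice candidate models are compared via cross-validation, AIC/BIC, or adjusted R²):

```python
import numpy as np

def rss(X, y, cols):
    # OLS fit on the chosen columns, returning the residual sum of squares
    beta = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    return np.sum((y - X[:, cols] @ beta) ** 2)

def forward_stepwise(X, y, max_vars):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        # greedily add the predictor that reduces the RSS the most
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward stepwise selection works analogously, starting from the full model and greedily removing the predictor whose deletion hurts the fit least.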
Shrinkage (Regularization) Methods
The subset selection methods use OLS to fit a linear model that contains a subset of the predictors.
As an alternative, we can fit a model containing all p predictors using a technique that …
It may not be immediately obvious why such a constraint should improve the fit, but it turns out that …
Gradient descent can be used to find …
Why regularization?
constrains or regularizes the coefficient estimates (i.e. shrinks the coefficient estimates towards zero).
shrinking the coefficient estimates can significantly reduce their variance.
the parameters of regularized OLS estimators.
to find biased estimators with smaller MSE, i.e. a good bias-variance trade-off → ridge/lasso/subset selection.
Ridge regression
Write down the formula (see the sketch below).
SSR and penalty are both …
As λ increases, the …
Thus, when λ is extremely large, then … Best to apply ridge after standardizing the predictors (why?)
Cross-check with slide 36. CONVEX.
standardized ridge regression coefficients shrink to zero.
all of the ridge coefficient estimates are basically zero, i.e. the null model that contains no predictors.
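For the "write down the formula" card above, a sketch of the standard ridge objective (notation assumed; cross-check with slide 36):

minimize over β:  Σ_{i=1}^{n} (y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij})² + λ Σ_{j=1}^{p} β_j²,   with tuning parameter λ ≥ 0,

i.e. SSR plus an ℓ2 penalty (the intercept β_0 is usually left unpenalized). In matrix form, dropping the intercept, the minimizer is β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy.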
How to select tuning parameter lambda in ridge regression?
ridge regression estimates will be … than the OLS ones but have lower variance. This means, …
Ridge regression will work best in situations where …
Difference with Lasso?
Select a grid of potential values; use cross-validation to estimate the error rate on test data (for each value of λ) and select the value that gives the smallest error rate.
Finally, the model is re-fit using all of the available observations and the selected value of λ (a code sketch follows below).
more biased; it does not fit the training data as well as the OLS estimator, but might do better on unseen test data.
the OLS estimates have high variance.
Similar ideas can be applied to logistic regression.
Penalizes the absolute value of β instead of β² → the lasso has the effect of forcing some of the coefficients to be exactly equal to zero when the tuning parameter λ is sufficiently large. Feature selection method. More interpretable models.
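A scikit-learn sketch of the grid-plus-cross-validation recipe for λ described above (grid and fold count are arbitrary example choices; sklearn calls the penalty parameter alpha):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV

lambdas = np.logspace(-3, 3, 50)        # grid of candidate penalty values
model = make_pipeline(
    StandardScaler(),                   # ridge is best applied to standardized predictors
    RidgeCV(alphas=lambdas, cv=5),      # 5-fold CV picks the value with the smallest estimated error
)
# with training data X, y:
# model.fit(X, y)                       # fit() already re-fits on all observations with the chosen value
# chosen_lambda = model[-1].alpha_
```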
Lasso vs. Ridge Regression
The lasso has a major advantage over ridge regression, in that it p…
The lasso leads to qualitatively similar behavior to ridge regression, in that as …
The lasso can generate …
Cross-validation can be used in order to determine which …
Subgradient methods can be used to compute regression coefficients with an ℓ1 regularizer in the lasso. Proximal gradient methods are even more effective (a soft-thresholding sketch follows after these cards).
produces simpler and more interpretable models that involve only a subset of predictors.
λ increases, the variance decreases and the bias increases.
more accurate predictions compared to ridge regression.
approach is better on a particular data set.
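A rough NumPy sketch of the proximal gradient (ISTA) iteration mentioned above; the step-size choice and fixed iteration count are simplifying assumptions:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1, applied elementwise
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2        # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)                   # gradient of 0.5 * ||y - X beta||^2
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

For a sufficiently large lam some entries of beta come out exactly zero, which is the feature-selection effect of the lasso noted above.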
How did we compute the parameters of a logistic regression?
Go through the maths on slides 43-49 on your own, independently as much as possible.
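Since the slides are not reproduced here, a generic sketch of fitting by maximizing the log-likelihood with gradient ascent (the slides may instead derive Newton's method / IRLS; labels y are assumed to be in {0, 1}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iter=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)              # predicted probabilities
        grad = X.T @ (y - p)            # gradient of the log-likelihood
        w += alpha * grad / len(y)      # averaged gradient ascent step
    return w
```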
Perceptron Error
We can't do gradient descent (see next class) with step functions, as …
f(x) = max(0, x)
it is non-differentiable at one point and otherwise the gradient is zero.
is the ReLU activation function
The sigmoid leads to a greater-than-zero error even on correct classifications
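A quick numeric illustration of that last card (the input value 2.0 and the target 1 are made up; whatever error function the slides use, it stays nonzero because the sigmoid output never reaches the target exactly):

```python
import numpy as np

z = 2.0                            # pre-activation of a correctly classified point with target 1
p = 1.0 / (1.0 + np.exp(-z))       # sigmoid output ~0.88: correct side of 0.5, but not exactly 1
print(p, -np.log(p))               # e.g. a cross-entropy style error ~0.13 is still > 0
```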
e) In our example, the training data are perfectly linearly separable. This is not always the case (e.g. assume we add point (8, 2) with class yi = 1 to the training data) and the optimization problem becomes infeasible. Which techniques could you use to extend SVMs to such data?
A classic approach is to penalize points on the "wrong side" of the hyperplane by some factor p in the objective function. One such modified objective function would be:
min_{z ≥ 0, β, β_0}  ‖β‖ + p Σ_i z_i   subject to   y_i(β^T x_i + β_0) + z_i ≥ 1 for all i
Alternatively, one could use other regularization techniques, or non-linear SVM classifiers (using the
so-called kernel trick).
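A scikit-learn sketch of both extensions; the C values are arbitrary, and sklearn's soft-margin (hinge-loss) formulation is equivalent in spirit to, though not literally identical with, the penalized objective above:

```python
from sklearn.svm import SVC

# soft margin: C plays the role of the penalty factor p on the slack variables z_i
linear_soft = SVC(kernel="linear", C=1.0)

# kernel trick: fit a non-linear boundary via an implicit feature map
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# with the training data X, y (including e.g. the extra point (8, 2) with label 1):
# linear_soft.fit(X, y)
# rbf_svm.fit(X, y)
```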
Review SVMs tutorial before exam!