Week 6: Optimization Flashcards

1
Q

Give examples of functions whose minimum point is not a stationary point. Draw graphs.

A

For example: a discontinuous function whose minimum sits at the point of discontinuity, and a function defined on a bounded (closed) interval whose minimum lies at the boundary of the interval. In neither case is the gradient zero at the minimum.
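
A concrete bounded-domain instance (a standard textbook example, not one of the drawn graphs): f(x) = x on the interval [0, 1] attains its minimum at x = 0, yet f'(0) = 1 ≠ 0, so the minimizer is not a stationary point.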

2
Q

What defines a stationary point?

A

Its first derivative (gradient) is zero.
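
In symbols (a standard formulation, assumed here rather than quoted from the course notes): a point \theta^* of a differentiable function f is stationary when

\nabla f(\theta^*) = 0,

i.e. every partial derivative of f vanishes at \theta^*.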

3
Q

Why would we not be able to use a decaying exponential function as an objective function?

A

Because it only approaches its minimum as x goes to infinity; the infimum is never attained at any finite point, so the function has no interior minimum.
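
A concrete instance (a standard example, assumed for illustration): f(x) = e^{-x} satisfies f(x) > 0 for all x while \inf_x f(x) = 0, so the infimum is only approached as x \to \infty and is never attained; no minimizer exists.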

4
Q

Why can we limit ourselves to minimization (and not also maximization) when we do optimization?

A

Since we can always rewrite the maximization problem as minimizing the negative objective function.
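
Written out as the standard identity:

\max_{\theta} f(\theta) = -\min_{\theta} \bigl(-f(\theta)\bigr), \qquad \arg\max_{\theta} f(\theta) = \arg\min_{\theta} \bigl(-f(\theta)\bigr).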

5
Q

What is a typical value for the learning rate gamma?

A

0.01 or 0.05.
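
A minimal sketch of where the learning rate gamma enters a gradient-descent update (the quadratic test function and gamma = 0.01 are illustrative assumptions, not from the course material):

import numpy as np

def gradient_descent(grad, theta0, gamma=0.01, n_iters=1000):
    # Plain gradient descent: theta <- theta - gamma * grad(theta).
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - gamma * grad(theta)
    return theta

# Illustrative objective f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta_hat = gradient_descent(lambda th: 2 * th, theta0=[3.0, -2.0], gamma=0.01)
print(theta_hat)  # approaches the minimizer [0, 0]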

6
Q

Name the two ways in which optimization is used in machine learning.

A

1) For training a model: we optimize the objective function J(theta), and the optimization variables are the model parameters theta.
2) For tuning hyperparameters, which are set before training: we optimize the objective function with the hyperparameters as the optimization variables.

7
Q

Why is a convex function a good function to optimize?

A

Because every local minimum of a convex function is also a global minimum (and for a strictly convex function that minimum is unique).

8
Q

Give examples of convex cost functions.

A

The cost functions for linear regression, logistic regression and L1-regularised linear regression.

9
Q

Give an example of a non-convex function.

A

The cost function for a deep neural network.

10
Q

State the optimization problem, i.e. the minimization of the cost function, for linear regression.

A

\hat{\theta} = \arg\min_{\theta} \; \frac{1}{n} \, \lVert X\theta - y \rVert_2^2
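
A minimal numerical sketch of solving this least-squares problem (the synthetic X and y are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # n = 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Minimizing (1/n) * ||X theta - y||_2^2 is an ordinary least-squares problem.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                                   # close to [1.0, -2.0, 0.5]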

11
Q

Why is coordinate descent particularly fast and efficient for optimizing an L1-regularized linear regression model?

A

Since the model tends to set many coefficients to zero, many of the updates in the coordinate descent will simply set theta_j = 0, due to the sparsity of the optimal theta_hat.
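
A rough sketch of coordinate descent for the L1-regularised (lasso) least-squares problem, showing how each coordinate update is a soft-thresholding step that can set theta_j exactly to zero (the function names and the regularisation strength lam are illustrative assumptions):

import numpy as np

def soft_threshold(z, t):
    # Soft-thresholding operator: returns 0 whenever |z| <= t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    # Coordinate descent for: min_theta (1/(2n)) * ||X theta - y||_2^2 + lam * ||theta||_1
    n, p = X.shape
    theta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n           # per-coordinate curvature
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with coordinate j left out of the current prediction.
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r_j / n
            theta[j] = soft_threshold(rho, lam) / col_scale[j]
    return theta

# Usage sketch: theta_hat = lasso_coordinate_descent(X, y, lam=0.1)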