Reducing Loss Flashcards

1
Q

How do we choose a set of model parameters that minimizes loss?

A

One way is to compute the gradient of the loss with respect to the parameters and repeatedly step the parameters in the direction of the negative gradient (gradient descent).
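
As an illustrative sketch (the one-parameter loss below is hypothetical, not from the cards), the gradient can even be estimated numerically with a central finite difference:

```python
# Estimate the slope dL/dw of a loss at a point with a finite difference.
def loss(w):
    return (w - 3.0) ** 2  # hypothetical one-parameter loss

def numeric_gradient(w, eps=1e-6):
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(numeric_gradient(0.0))  # about -6.0, matching the analytic 2 * (0 - 3)
```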

2
Q

What are hyperparameters?

A

The configuration settings used to tune how the model is trained, for example the learning rate, batch size, and number of epochs. They are set before training, unlike the model parameters, which are learned from the data.
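
As a hedged sketch (these names and values are hypothetical), hyperparameters are often gathered into a small configuration that the training loop reads:

```python
# Hypothetical hyperparameter settings, chosen by the practitioner,
# as opposed to model parameters, which the training process learns.
hyperparameters = {
    "learning_rate": 0.01,  # size of each gradient descent step
    "batch_size": 32,       # examples used per gradient estimate
    "epochs": 10,           # full passes over the training set
}
```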

3
Q

Why are iterative approaches for reducing loss so prevalent in ML?

A

Primarily because they scale so well to large data sets.

4
Q

What does it mean when a model has “converged”?

A

We have iterated until the overall loss stops changing, or at least changes extremely slowly.
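
A minimal sketch of a convergence check, assuming a hypothetical train_step() that performs one update and returns the current loss (the loss, update rule, and tolerance are all illustrative):

```python
# Hypothetical one-parameter training step on the toy loss (w - 3)^2.
state = {"w": 0.0}

def train_step():
    state["w"] -= 0.1 * 2.0 * (state["w"] - 3.0)  # one descent update
    return (state["w"] - 3.0) ** 2                # current loss

# Stop iterating once the change in loss falls below a small tolerance.
tolerance = 1e-9
prev_loss = float("inf")
while True:
    loss = train_step()
    if abs(prev_loss - loss) < tolerance:
        break  # converged: loss has (nearly) stopped changing
    prev_loss = loss

print(state["w"])  # converges with w near 3.0
```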

5
Q

How many minima do convex problems have?

A

Only one: there is a single place where the slope is exactly 0, and that minimum is where gradient descent converges.
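
For example, with the illustrative convex loss L(w) = (w - 3)^2, the slope 2 * (w - 3) is zero only at w = 3, which a quick check confirms:

```python
# The derivative of (w - 3)^2 is 2 * (w - 3); it is zero only at w = 3.
def slope(w):
    return 2.0 * (w - 3.0)

print([w for w in range(-10, 11) if slope(w) == 0.0])  # [3]
```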

6
Q

What is gradient descent used for in ML?

A

It is an algorithm for minimizing the loss: it repeatedly adjusts the model parameters in the direction of the negative gradient.
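
A minimal sketch of the loop, reusing the illustrative loss L(w) = (w - 3)^2 and a made-up learning rate:

```python
# Gradient descent: step the parameter against the gradient each iteration.
def gradient(w):
    return 2.0 * (w - 3.0)  # analytic derivative of (w - 3)^2

w = 0.0              # initial parameter guess
learning_rate = 0.1  # hyperparameter controlling step size
for _ in range(100):
    w -= learning_rate * gradient(w)

print(w)  # approaches 3.0, where the loss is minimal
```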

7
Q

What is differential calculus?

A

The branch of calculus that studies the rates at which quantities change (as opposed to integral calculus, which studies accumulation).

8
Q

How are gradients used in ML?

A

Gradients are used in gradient descent to minimize loss. We often have a loss function of many variables that we are trying to minimize, and we do this by following the negative of the gradient of the function.
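
A sketch of the many-variable case, assuming NumPy and the illustrative loss L(w) = w0^2 + w1^2 (all values here are made up):

```python
import numpy as np

# Illustrative loss of two variables: L(w) = w0^2 + w1^2.
def gradient(w):
    return 2.0 * w  # gradient is [2*w0, 2*w1]

w = np.array([4.0, -2.0])  # starting point
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * gradient(w)  # follow the negative gradient

print(w)  # approaches [0, 0], the minimum
```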

9
Q

What are ML libraries such as TensorFlow used for?

A

Their functions handle the mathematical computations for you (for example, gradient descent), so you rarely have to implement them from scratch.
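
For instance, TensorFlow can compute gradients automatically with tf.GradientTape; here is a sketch using the same illustrative toy loss as above:

```python
import tensorflow as tf

w = tf.Variable(0.0)  # a trainable model parameter
with tf.GradientTape() as tape:
    loss = (w - 3.0) ** 2      # illustrative loss, not a real model
grad = tape.gradient(loss, w)  # TensorFlow computes d(loss)/dw for us
w.assign_sub(0.1 * grad)       # one gradient descent step
print(w.numpy())               # 0.6 after one step toward the minimum
```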

10
Q

Is redundant data bad?

A

Not necessarily. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.

11
Q

What is stochastic gradient descent (SGD)?

A

It uses only a single example (a batch size of 1) per iteration, chosen at random, as a noisy estimate of the average gradient.
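
A sketch of the idea for a toy linear model y ≈ w * x, assuming NumPy and synthetic data (every name and value here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x + rng.normal(0, 0.1, size=1000)  # true weight is 3.0

w, learning_rate = 0.0, 0.1
for _ in range(1000):
    i = rng.integers(len(x))             # choose one example at random
    grad = 2 * (w * x[i] - y[i]) * x[i]  # squared-error gradient on it
    w -= learning_rate * grad            # one noisy descent step

print(w)  # roughly 3.0, despite the noise
```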

12
Q

Is SGD good?

A

Given enough iterations, it works, but it is very noisy because each step relies on a single example.

13
Q

What is mini-batch SGD?

A

A compromise between full-batch and SGD that uses batches of typically 10 to 1,000 examples per iteration. It reduces the noise of SGD while remaining more efficient than full-batch gradient descent.
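
Extending the SGD sketch above to mini-batches (again assuming NumPy and the same synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x + rng.normal(0, 0.1, size=1000)  # true weight is 3.0

w, learning_rate, batch_size = 0.0, 0.1, 32
for _ in range(500):
    idx = rng.integers(len(x), size=batch_size)         # random mini-batch
    grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])  # averaged gradient
    w -= learning_rate * grad

print(w)  # roughly 3.0, with less step-to-step noise than batch size 1
```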
