Reducing Loss, Regularization, Classification Flashcards

1
Q

feature

A

input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features.

2
Q

example

A

An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We break examples into two categories:

labeled examples
unlabeled examples
A labeled example includes both feature(s) and the label. That is:

labeled examples: {features, label}: (x, y)

In our spam detector example, the labeled examples would be individual emails that users have explicitly marked as “spam” or “not spam.”

3
Q

Training

A

Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.

4
Q

Inference

A

Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y’). For example, during inference, you can predict medianHouseValue for new unlabeled examples.

5
Q

Regression vs. classification

A

A regression model predicts continuous values. A classification model predicts discrete values.

6
Q

empirical risk minimization

A

In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.

7
Q

loss

A

Loss is a number indicating how bad the model’s prediction was on a single example.

8
Q

L2 loss

A

The squared loss for a single example is as follows:

= the square of the difference between the label and the prediction
= (observation - prediction(x))^2
= (y - y’)^2

9
Q

MSE

A

Mean Squared Error
Mean squared error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

MSE = (1/N) * sum of (y - prediction(x))^2 over all N examples (x, y)
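
As a small illustration of the squared loss and MSE definitions, here is a minimal Python sketch. The function names and the toy numbers are made up for this example and do not come from the cards.

```python
def squared_loss(y, y_pred):
    """L2 loss for a single example: (y - y')^2."""
    return (y - y_pred) ** 2


def mean_squared_error(labels, predictions):
    """Average squared loss over all examples."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    total = sum(squared_loss(y, p) for y, p in zip(labels, predictions))
    return total / len(labels)


# Toy data (assumed): squared losses are 1, 0 and 4, so MSE is 5/3.
labels = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]
print(mean_squared_error(labels, predictions))
```

Note that each difference is squared before summing; squaring the sum of the differences would give a different (and wrong) quantity.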

10
Q

What is a convex problem?

A

The function has the shape of a bowl, so it has only one minimum.

11
Q

Are neural nets convex?

A

No. There is more than one minimum.

12
Q

Mini-Batch Gradient Descent

A

Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch. Here the loss and gradients are averaged over the examples in each batch.
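
One way to picture mini-batch SGD is the following sketch, which fits a one-feature linear model to toy data. The model form, the batch size of 4, the learning rate, and the data are all assumptions chosen to keep the example small; they are not prescribed by the card.

```python
import random


def minibatch_sgd(xs, ys, batch_size=4, learning_rate=0.05, steps=2000, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        batch = rng.sample(indices, batch_size)  # examples chosen at random
        grad_w = grad_b = 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2 * error * xs[i]
            grad_b += 2 * error
        # Average the gradients over the batch, then step against them.
        w -= learning_rate * grad_w / batch_size
        b -= learning_rate * grad_b / batch_size
    return w, b


xs = [i / 10 for i in range(20)]      # toy feature values (assumed)
ys = [2.0 * x + 1.0 for x in xs]      # noiseless labels from y = 2x + 1
w, b = minibatch_sgd(xs, ys)
print(round(w, 2), round(b, 2))       # should land close to 2 and 1
```

Setting `batch_size=1` turns this into plain SGD, and setting it to `len(xs)` gives full-batch gradient descent.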

13
Q

When has the ML model converged?

A

Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.

14
Q

What is the gradient of a function?

A

The gradient of a function f, denoted ∇f, is the vector of partial derivatives with respect to all of the independent variables.
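
To make "vector of partial derivatives" concrete, the sketch below approximates a gradient numerically with central differences. The helper name `numerical_gradient`, the step `eps`, and the test function are illustrative assumptions; real frameworks compute gradients analytically or by automatic differentiation.

```python
def numerical_gradient(f, point, eps=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        # Partial derivative with respect to the i-th variable only.
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def f(v):
    x, y = v
    return x ** 2 + 3 * y   # analytic gradient: (2x, 3)


print(numerical_gradient(f, [1.0, 2.0]))  # approximately [2.0, 3.0]
```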

15
Q

Where does the gradient point to?

A

The gradient always points in the direction of greatest (steepest) increase of the function, here the loss function.

16
Q

Where does the negative gradient point to?

A

Points in the direction of greatest decrease of the function.

17
Q

How are the gradient and the loss function connected?

A

We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.

18
Q

What characteristics does a gradient have?

A

a direction

a magnitude

19
Q

gradient descent algorithm

A

The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.

20
Q

learning rate (also sometimes called step size)

A

Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
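
The update rule can be sketched as next point = current point - learning rate * gradient. The example below applies it to a toy one-variable loss, (w - 3)^2, which is an assumption made for illustration, and contrasts a reasonable learning rate with one that is too large.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


def grad_loss(w):
    return 2 * (w - 3.0)   # derivative of the toy loss (w - 3)^2


# A modest learning rate converges toward the minimum at w = 3.
print(gradient_descent(grad_loss, start=0.0, learning_rate=0.1, steps=100))
# A learning rate that is too large overshoots further on every step.
print(gradient_descent(grad_loss, start=0.0, learning_rate=1.1, steps=100))
```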

21
Q

Hyperparameters

A

Hyperparameters are the knobs that programmers tweak in machine learning algorithms.

22
Q

How to find a fitting learning rate for the gradient?

A

The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.

23
Q

batch

A

A batch is the set of examples you use to calculate the gradient in a single iteration; the number of examples in it is the batch size.

24
Q

Stochastic gradient descent (SGD)

A

What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme–it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term “stochastic” indicates that the one example comprising each batch is chosen at random.

25
Q

Basic assumptions of ML

A
  1. We draw examples independently and identically (i.i.d.) at random from the distribution
  2. The distribution is stationary: it doesn’t change over time
  3. We always pull from the same distribution: Including training, validation, and test sets
26
Q

generalization bounds

A

a statistical description of a model’s ability to generalize to new data based on factors such as:

  • the complexity of the model
  • the model’s performance on training data
27
Q

What does i.i.d. basically mean?

A

That examples don’t influence each other: each example is drawn at random, independently of the others and from the same distribution.

28
Q

What is a violation of stationarity?

A

Consider a data set that contains retail sales information for a year. Users’ purchases change seasonally, which would violate stationarity.

29
Q

How to use a validation set?

A

You train your model on the training set and then evaluate the model on the validation set. You tweak your model according to the results on the validation set. Then you pick the model that does best on the validation set. You then confirm your results on the test set.
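
The workflow starts from three disjoint partitions of the data. A minimal sketch of such a split is shown below; the 70/15/15 proportions, the function name, and the fixed seed are assumptions for illustration, not values given by the card.

```python
import random


def split_dataset(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples, then cut them into train, validation and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting is one way to make it more plausible that all three sets come from the same distribution.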

30
Q

What are the properties of a good feature?

A
  • Feature values should appear with non-zero values more than a small handful of times in the dataset.
31
Q

What is scaling?

A

Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1).
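
A common way to do this is min-max scaling, sketched below. The function name and the sample values are assumptions; other scaling schemes (for example z-scores) exist and are not shown.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from their natural range onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


print(min_max_scale([100.0, 300.0, 900.0]))        # [0.0, 0.25, 1.0]
print(min_max_scale([100.0, 900.0], -1.0, 1.0))    # [-1.0, 1.0]
```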

32
Q

What benefits does feature scaling provide when a feature set consists of multiple features?

A
  • Helps gradient descent converge more quickly
  • Helps avoid the NaN trap
  • Helps the model learn appropriate weights for each feature.
33
Q

What is regularization?

A

Penalizing complex models

34
Q

What should we do instead of aiming to minimize loss?

A

Instead of empirical risk minimization, we should do structural risk minimization, which means minimizing (loss + complexity).
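
As a rough sketch of the (loss + complexity) objective, the code below measures complexity as the sum of squared weights (an L2 penalty) scaled by a factor `lam`. Choosing L2 as the complexity term, the function names, and the numbers are assumptions for illustration.

```python
def l2_penalty(weights):
    """Complexity term: sum of the squared weights."""
    return sum(w ** 2 for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Structural risk: data loss plus lambda times the complexity term."""
    return data_loss + lam * l2_penalty(weights)


# The single large weight (5.0) dominates the penalty.
print(regularized_loss(0.5, [0.2, -0.5, 5.0], lam=0.1))  # about 3.029
```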

35
Q

What effect does L2 (Ridge) regularization have on a model?

A
  • encourages weight values toward zero, but not exactly zero.
  • encourages the mean of the weights toward zero, with a normal (bell-shaped or Gaussian) distribution.
36
Q

What if lambda is too high?

A

the model will be simple, but you run the risk of underfitting your data.

37
Q

What does the sigmoid do?

A

It gives us a value between 0 and 1.
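
For reference, the sigmoid is 1 / (1 + e^-z). The sketch below is one numerically careful way to write it in Python; the branch on the sign of `z` is an implementation choice made here to avoid overflow, not something the card specifies.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # safe: z is negative, so ez is at most 1
    return ez / (1.0 + ez)


print(sigmoid(0.0))                 # 0.5
print(sigmoid(-5.0), sigmoid(5.0))  # close to 0 and close to 1
```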

38
Q

Is regularization important for logistic regression?

A

Yes, it is super important. Otherwise the algorithm will try to drive the loss to 0 in high dimensions and overfit the data.

39
Q

What is good about logistic regression? (computation)

A

Very fast training and prediction times.

40
Q

Which strategies can regularization use to dampen model complexity?

A

L2 or L1 regularization

early stopping, that is, limiting the number of training steps or the learning rate

41
Q

What is the purpose of a threshold in logistic regression?

A

We use a threshold for discrete binary classification. E.g. an instance is classified as positive (1) when its predicted probability exceeds 0.8.

We must tune it.
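
Applying a threshold can be sketched in a line of Python. The function name and the probabilities are made up for illustration; the 0.8 threshold is the one used in the card's example.

```python
def classify(probabilities, threshold=0.5):
    """Label an example positive (1) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


probs = [0.95, 0.81, 0.80, 0.40]
print(classify(probs, threshold=0.8))   # [1, 1, 0, 0]
print(classify(probs, threshold=0.5))   # [1, 1, 1, 0]
```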

42
Q

What is accuracy?

A

The fraction of predictions we got right

Number of correct predictions / Total number of predictions

43
Q

When does accuracy fail?

A

When different kinds of mistakes have different costs. Typical cases include class imbalance, when positives or negatives are extremely rare
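
The class-imbalance case is easy to reproduce. In the sketch below (toy data, assumed for illustration), a model that always predicts the negative class scores 99% accuracy while never finding the one positive.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


labels = [1] * 1 + [0] * 99        # 1 positive, 99 negatives
always_negative = [0] * 100        # a "model" that never predicts positive
print(accuracy(labels, always_negative))  # 0.99
```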

44
Q

Precision

A

When the model said “positive” class, was it right?
Intuition: Did the model cry “wolf” too often?

TP/(TP + FP)

45
Q

Recall/ sensitivity

A

Out of all the possible positives, how many did the model correctly identify?
Intuition: Did it miss any wolves?
TP/(TP + FN)

46
Q

Specificity

A

Ratio of true negatives to total negatives.
TN/(TN + FP)
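
Precision, recall, and specificity can all be read off the four confusion-matrix counts. The sketch below uses a small made-up set of labels and predictions; the helper name is an assumption.

```python
def confusion_counts(labels, predictions):
    """Return (TP, FP, FN, TN) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(labels, predictions)

precision = tp / (tp + fp)      # 2 / 3
recall = tp / (tp + fn)         # 2 / 3
specificity = tn / (tn + fp)    # 4 / 5
print(precision, recall, specificity)
```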

47
Q

Consider a classification model that separates email into two categories: “spam” or “not spam.” If you raise the classification threshold, what will happen to precision?

A

Probably increase.

In general, raising the classification threshold reduces false positives, thus raising precision.

Raising the classification threshold typically increases precision; however, precision is not guaranteed to increase monotonically as we raise the threshold.

48
Q

Why do we use the ROC Curve?

A

Because if we choose a single classification threshold for logistic regression, we can calculate metrics such as recall and precision at that threshold, but we don’t know how the model performs across all possible thresholds.

-> The ROC curve shows the true positive rate against the false positive rate across all possible thresholds

49
Q

What does the AUC (Area under the ROC Curve) tell us?

A

If we pick a random positive and a random negative, what’s the probability that the model ranks them in the correct order, that is, scores the positive higher than the negative?

Intuition: AUC gives an aggregate measure of performance across all possible classification thresholds.
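
The ranking interpretation can be computed directly by comparing every positive with every negative. The sketch below does exactly that on toy scores; it is a brute-force illustration (ties counted as half), not how libraries typically compute AUC.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(auc_by_ranking(labels, scores))                     # 8 of 9 pairs correct
print(auc_by_ranking(labels, [2.0 * s for s in scores]))  # same value
```

Because only the ordering of the scores matters, multiplying every score by 2.0 leaves the result unchanged.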

50
Q

What is prediction bias in logistic regression?

A

We calculate it by comparing the average of all predictions to the average of all observations: prediction bias = average of predictions - average of labels.
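
A minimal sketch of that comparison, with made-up predictions and labels, looks like this:

```python
def prediction_bias(predictions, labels):
    """Average of the predictions minus average of the observed labels."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


predictions = [0.9, 0.8, 0.7, 0.6]   # model says positives are common
labels = [1, 0, 0, 0]                # but only 1 in 4 examples is positive
print(prediction_bias(predictions, labels))  # about 0.5
```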

51
Q

A model that produces no false negatives has which metric = 1?

A

Recall

52
Q

A model that produces no false positives has which metric = 1?

A

Precision.

53
Q

Does recall increase or decrease when we lower the classification threshold?

A
  • more false positives
  • fewer false negatives

-> recall increases

54
Q

In the game of roulette, a ball is dropped on a spinning wheel and eventually lands in one of 38 slots. Using visual features (the spin of the ball, the position of the wheel when the ball was dropped, the height of the ball over the wheel), an ML model can predict the slot that the ball will land in with an accuracy of 4%.

A

This ML model is making predictions far better than chance; a random guess would be correct 1/38 of the time—yielding an accuracy of 2.6%. Although the model’s accuracy is “only” 4%, the benefits of success far outweigh the disadvantages of failure.

55
Q

What is the true positive rate (TPR)?

A

The same as recall

TP/(TP + FN)

56
Q

What is the False Positive Rate (FPR)?

A

FP/ (FP + TN)

Predicted FP / Actual negatives

57
Q

Which model has an AUC of 1?

A

One that ranks all predictions correctly, that is, every positive above every negative.

58
Q

Why is AUC desirable?

A
  • AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  • AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.

-> but these can also be caveats.

Scale invariance is not always desirable. For example, sometimes we really do need well calibrated probability outputs, and AUC won’t tell us about that.

Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn’t a useful metric for this type of optimization.

59
Q

What does a model whose ROC curve has an AUC of 1 do well?

A

It ranks all positives above all negatives.

60
Q

How would multiplying all of the predictions from a given model by 2.0 (for example, if the model predicts 0.4, we multiply by 2.0 to get a prediction of 0.8) change the model’s performance as measured by AUC?

A

No change. AUC only cares about relative prediction scores.
AUC is based on the relative predictions, so any transformation of the predictions that preserves the relative ranking has no effect on AUC. This is clearly not the case for other metrics such as squared error, log loss, or prediction bias.

61
Q

What does a nonzero prediction bias tell you?

A

A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.

62
Q

Possible root causes of prediction bias are:

A
Incomplete feature set
Noisy data set
Buggy pipeline
Biased training sample
Overly strong regularization
63
Q

Why can zeroing out features be useful?

A

It can save RAM and reduce noise in the model

64
Q

What does L1 penalize?

A

The absolute value of each weight: |weight|

65
Q

What does L2 penalize?

A

The square of each weight: weight^2

66
Q

What is the derivative of L2?

A

The derivative of L2 is 2 * weight.

67
Q

What is the derivative of L1?

A

The derivative of L1 is k, a constant whose magnitude is independent of the weight (only its sign follows the sign of the weight).

68
Q

How can you think of the derivative of L2?

A

You can think of the derivative of L2 as a force that removes x% of the weight every time. As Zeno knew, even if you remove x percent of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.

69
Q

How can you think of the derivative of L1?

A

You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For example, if subtraction would have forced a weight from +0.1 to -0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight.
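
The two intuitions (L2 removes a percentage, L1 subtracts a constant and clips at 0) can be contrasted in a few lines. This is only a sketch of the intuition with assumed rates and step sizes, not the exact update rule of any particular optimizer.

```python
def l2_step(weight, rate):
    """Shrink the weight by a fixed fraction (rate) of its current value."""
    return weight * (1.0 - rate)


def l1_step(weight, amount):
    """Move the weight toward 0 by a fixed amount, clipping at exactly 0."""
    if weight > amount:
        return weight - amount
    if weight < -amount:
        return weight + amount
    return 0.0   # the step would cross 0, so the weight is zeroed out


w_l2 = w_l1 = 0.1
for _ in range(5):
    w_l2 = l2_step(w_l2, rate=0.3)
    w_l1 = l1_step(w_l1, amount=0.03)
print(w_l2, w_l1)   # w_l2 is small but positive, w_l1 is exactly 0.0

# The card's example: a step of 0.3 would push +0.1 to -0.2, so it clips to 0.
print(l1_step(0.1, 0.3))   # 0.0
```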

70
Q

L1 regularization may cause informative features to get a weight of exactly 0.0.

A

Be careful–L1 regularization may cause the following kinds of features to be given weights of exactly 0:
Weakly informative features.
Strongly informative features on different scales.
Informative features strongly correlated with other similarly informative features.

71
Q

When is a model non-linear?

A

When you can’t accurately predict a label with a model of the form b + w1x1 + w2x2

72
Q

What is the non-linear function in a neural network called?

A

Activation function

73
Q

What is ReLU?

A

Rectified linear unit activation function. It often works a little better than a smooth function like the sigmoid, while also being easier to compute.
F(x) = max(0,x)
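
In Python, the formula is a one-liner (the sample inputs are arbitrary):

```python
def relu(x):
    """Rectified linear unit: passes positive inputs through, zeroes the rest."""
    return max(0.0, x)


print([relu(v) for v in [-2.0, 0.0, 3.5]])  # [0.0, 0.0, 3.5]
```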

74
Q

What is collaborative filtering?

A

The task of making predictions about the interests of a user based on the interests of many other users.