Reducing Loss, Regularization, Classification Flashcards

Question

Basic assumptions of ML

Answer 1

1. We draw examples independently and identically (i.i.d.) at random from the distribution.
2. The distribution is stationary: it doesn't change over time.
3. We always pull from the same distribution, including for the training, validation, and test sets.

Answer 2

a statistical description of a model's ability to generalize to new data based on factors such as: - the complexity of the model - the model's performance on training data

Answer 3

That examples don't influence each other. Put another way, i.i.d. is a way of referring to the randomness of variables.

Answer 4

Consider a data set that contains retail sales information for a year. Users' purchases change seasonally, which would violate stationarity.

Answer 5

You train your model on the training set and then evaluate the model on the validation set. You tweak your model according to the results on the validation set. Then you pick the model that does best on the validation set. You then confirm your results on the test set.

Answer 6

- Feature values should appear with non-zero values more than a small handful of times in the dataset.

Answer 7

Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1).
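
A minimal sketch of the scaling described above, assuming simple min-max scaling (one of several possible scaling schemes). The function name `min_max_scale` and its parameters are hypothetical, chosen only for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from their observed range onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range to stretch; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Natural range 100 to 900, as in the card's example.
raw = [100.0, 300.0, 500.0, 900.0]
scaled_unit = min_max_scale(raw)               # standard range 0 to 1
scaled_signed = min_max_scale(raw, -1.0, 1.0)  # standard range -1 to +1
```

The smallest raw value maps to the lower bound and the largest to the upper bound; everything else lands proportionally in between.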

Answer 8

- Helps gradient descent converge more quickly - Helps avoid the NaN trap - Helps the model learn appropriate weights for each feature.

Answer 9

Penalizing complex models

Answer 10

Instead of empirical risk minimization we should do structural risk minimization. Which means we should minimize (loss + complexity).
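
As a rough illustration of minimizing (loss + complexity), the sketch below assumes complexity is measured as the sum of squared weights (the L2 choice) and is scaled by a regularization rate. The names `l2_complexity`, `regularized_objective`, and `lam` are hypothetical.

```python
def l2_complexity(weights):
    """Complexity term, assumed here to be the sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_objective(data_loss, weights, lam):
    """Structural risk: training loss plus a scaled complexity penalty."""
    return data_loss + lam * l2_complexity(weights)


# With lam = 0 this reduces to plain empirical risk minimization.
objective = regularized_objective(data_loss=1.0, weights=[1.0, 2.0], lam=0.5)
```

A larger `lam` makes complexity more expensive relative to fitting the training data.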

Answer 11

- encourages weight values toward zero, but not exactly zero. - encourages the mean of the weights toward zero, with a normal (bell-shaped or Gaussian) distribution.

Answer 12

the model will be simple, but you run the risk of underfitting your data.

Answer 13

It gives us a value between 0 and 1.
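
Assuming this card refers to the sigmoid (logistic) function used in logistic regression, a small sketch shows how any real input is squashed into the 0-to-1 range. The function name `sigmoid` is illustrative, and the branch on the sign of `z` is just one way to avoid overflow in `math.exp`.

```python
import math


def sigmoid(z):
    """Map a real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that avoids a huge exp(-z).
    e = math.exp(z)
    return e / (1.0 + e)


outputs = [sigmoid(z) for z in (-5.0, 0.0, 5.0)]
```

An input of 0 maps to exactly 0.5; large negative inputs approach 0 and large positive inputs approach 1.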

Answer 14

Yes, it is super important. Otherwise the algorithm will try to drive the loss to 0 in high dimensions and overfit the data.

Answer 15

Very fast regarding training and prediction times.

Answer 16

L2 or L1 regularization | Early stopping, that is, limiting the number of training steps or the learning rate

Answer 17

We use a threshold for discrete binary classification. E.g., an instance is positive (= 1) when its probability exceeds 0.8. We must tune it.
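
A minimal sketch of thresholding, using the 0.8 value from the card as the default. The function name `classify` is hypothetical, and "exceeds" is read as a strict greater-than comparison.

```python
def classify(probabilities, threshold=0.8):
    """Turn predicted probabilities into 0/1 labels using a threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


labels_at_08 = classify([0.1, 0.8, 0.81, 0.95])
labels_at_05 = classify([0.1, 0.8, 0.81, 0.95], threshold=0.5)
```

Lowering the threshold from 0.8 to 0.5 flips the second example to positive, which is why the threshold has to be tuned for the problem at hand.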

Answer 18

The fraction of predictions we got right: number of correct predictions / total number of predictions
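
The ratio above can be sketched in a few lines. The function name `accuracy` and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Three of the four predictions match the true labels.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```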

Answer 19

When different kinds of mistakes have different costs. Typical cases include class imbalance, when positives or negatives are extremely rare

Answer 20

When the model said "positive" class, was it right? Intuition: Did the model cry "wolf" too often? TP/(TP + FP)

Answer 21

Out of all the possible positives, how many did the model correctly identify. Intuition: Did it miss any wolves? TP/(TP + FN)
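
Precision and recall can both be computed from the same confusion counts. The sketch below uses hypothetical helper names and made-up labels; returning 0.0 when a denominator is zero is an assumption, not a universal convention.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    """TP / (TP + FP): when the model said positive, how often was it right?"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN): of all actual positives, how many were found?"""
    return tp / (tp + fn) if (tp + fn) else 0.0


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
p, r = precision(tp, fp), recall(tp, fn)
```

Here the model cries "wolf" three times and is right twice (precision 2/3), but it finds only two of the four wolves (recall 1/2).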

Answer 22

Ratio of true negatives to total negatives.

Answer 23

Probably increase. In general, raising the classification threshold reduces false positives, thus raising precision; however, precision is not guaranteed to increase monotonically as we raise the threshold.

Answer 24

Because if we choose a classification threshold for logistic regression, we can calculate recall and precision at that threshold, but we don't know their values across all possible thresholds. -> The ROC curve shows the true positive rate (recall) and the false positive rate across all possible thresholds.

Answer 25

If we pick a random positive and a random negative, what's the probability that my model ranks them in the correct order (scores the positive above the negative)? Intuition: gives an aggregate measure of performance across all possible classification thresholds.
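
The pairwise interpretation can be checked directly by comparing every positive with every negative. This is a brute-force sketch for small data, not an efficient implementation; the name `pairwise_auc` is hypothetical, and counting tied scores as half a win is an assumption.

```python
def pairwise_auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: ties count as half
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = pairwise_auc(scores, labels)

# Squaring the scores preserves their ranking, so the result is unchanged.
auc_squared = pairwise_auc([s * s for s in scores], labels)
```

Five of the six positive-negative pairs are ordered correctly, so the value is 5/6; only the ranking of the scores matters, not their magnitudes.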

Answer 26

We calculate it by comparing the average of all predictions to the average of all observations.
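
A small sketch of that comparison, taking the difference between the two averages. The function name `prediction_bias` and the toy values are illustrative.

```python
def prediction_bias(predictions, labels):
    """Average of predictions minus average of observed labels."""
    mean_prediction = sum(predictions) / len(predictions)
    mean_label = sum(labels) / len(labels)
    return mean_prediction - mean_label


# Predictions average 0.5 and labels average 0.5: no bias.
unbiased = prediction_bias([0.25, 0.5, 0.75, 0.5], [0, 1, 1, 0])

# Predictions average 0.5 but only 25% of labels are positive.
biased = prediction_bias([0.5, 0.5, 0.5, 0.5], [0, 0, 0, 1])
```

A value near zero means the model predicts positives about as often as they actually occur; the second case over-predicts by 0.25.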

Answer 27

Precision.

Answer 28

- more false positives - fewer false negatives -> recall increases

Answer 29

This ML model is making predictions far better than chance; a random guess would be correct 1/38 of the time—yielding an accuracy of 2.6%. Although the model's accuracy is "only" 4%, the benefits of success far outweigh the disadvantages of failure.

Answer 30

The same as recall | TP/(TP + FN)

Answer 31

FP / (FP + TN), i.e., false positives / actual negatives
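
Each point on an ROC curve pairs this false positive rate with the true positive rate at one threshold. The sketch below computes a single such point; the name `roc_point` is hypothetical, a score strictly above the threshold is treated as a positive prediction, and the data is assumed to contain at least one positive and one negative.

```python
def roc_point(scores, labels, threshold):
    """Return (fpr, tpr) when scores above threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s > threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # same as recall
    fpr = fp / (fp + tn)
    return fpr, tpr


scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
curve = [roc_point(scores, labels, t) for t in (1.0, 0.5, 0.0)]
```

Sweeping the threshold from high to low moves the point from (0, 0), where nothing is predicted positive, to (1, 1), where everything is.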

Answer 32

One that ranks all predictions correctly.

Answer 33

- AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
- AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
-> But these can also be caveats:
- Scale invariance is not always desirable. For example, sometimes we really do need well calibrated probability outputs, and AUC won’t tell us about that.
- Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.

Answer 34

It ranks all positives above all negatives.

Answer 35

No change. AUC is based on the relative prediction scores, so any transformation of the predictions that preserves the relative ranking has no effect on AUC. This is clearly not the case for other metrics such as squared error, log loss, or prediction bias (discussed later).

Answer 36

A significant nonzero prediction bias tells you there is a bug somewhere in your model, as it indicates that the model is wrong about how frequently positive labels occur.

Answer 37

- Incomplete feature set
- Noisy data set
- Buggy pipeline
- Biased training sample
- Overly strong regularization

Answer 38

It can save RAM and reduce noise in the model

Answer 39

The absolute value of each weight: |weight|

Answer 40

The square of each weight: weight^2

Answer 41

The derivative of L2 is 2 * weight.

Answer 42

The derivative of L1 is k (whose value is independent of weight)

Answer 43

You can think of the derivative of L2 as a force that removes x% of the weight every time. As Zeno knew, even if you remove x percent of a number billions of times, the diminished number will still never quite reach zero. (Zeno was less familiar with floating-point precision limitations, which could possibly produce exactly zero.) At any rate, L2 does not normally drive weights to zero.
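
The "removes x% every time" picture can be simulated with repeated gradient steps on the L2 penalty alone, ignoring the data loss (a simplifying assumption). The names `l2_shrink`, `learning_rate`, and `lam` are hypothetical.

```python
def l2_shrink(weight, learning_rate, lam, steps):
    """Apply repeated gradient steps on the penalty lam * weight**2 only."""
    for _ in range(steps):
        # The derivative of lam * w^2 is 2 * lam * w, so each step removes
        # a fixed fraction (2 * learning_rate * lam) of the current weight.
        weight -= learning_rate * lam * 2.0 * weight
    return weight


after_one = l2_shrink(1.0, learning_rate=0.1, lam=0.5, steps=1)
after_hundred = l2_shrink(1.0, learning_rate=0.1, lam=0.5, steps=100)
```

With these settings each step removes 10% of the weight, so after 100 steps the weight is tiny but still positive.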

Answer 44

You can think of the derivative of L1 as a force that subtracts some constant from the weight every time. However, thanks to absolute values, L1 has a discontinuity at 0, which causes subtraction results that cross 0 to become zeroed out. For example, if subtraction would have forced a weight from +0.1 to -0.2, L1 will set the weight to exactly 0. Eureka, L1 zeroed out the weight.
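
One common way to realize this behavior is a clipped update (often called soft thresholding): subtract a constant toward zero, and set the weight to exactly 0 if the step would cross zero. This is a sketch under that assumption, considering only the L1 penalty and not the data loss; the names `l1_step`, `learning_rate`, and `lam` are hypothetical.

```python
def l1_step(weight, learning_rate, lam):
    """Move weight toward zero by a constant, clipping at exactly 0."""
    step = learning_rate * lam  # constant, independent of the weight's size
    if weight > step:
        return weight - step
    if weight < -step:
        return weight + step
    # The step would reach or cross zero, so the weight is zeroed out.
    return 0.0


# The card's example: a step of 0.3 would push +0.1 to -0.2, so it becomes 0.
zeroed = l1_step(0.1, learning_rate=1.0, lam=0.3)
shrunk = l1_step(1.0, learning_rate=0.1, lam=1.0)
```

Unlike the proportional L2 shrinkage, the constant step reaches exactly zero after finitely many updates.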

Answer 45

Be careful--L1 regularization may cause the following kinds of features to be given weights of exactly 0: Weakly informative features. Strongly informative features on different scales. Informative features strongly correlated with other similarly informative features.

Answer 46

When you can't accurately predict a label with a model of the form b + w1*x1 + w2*x2

Answer 47

Activation function

Answer 48

Rectified linear unit activation function. It often works a little better than a smooth function like the sigmoid, while also being significantly easier to compute. F(x) = max(0, x)
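
The formula F(x) = max(0, x) translates directly into code. The function name `relu` is illustrative.

```python
def relu(x):
    """Rectified linear unit: pass positive inputs through, zero out the rest."""
    return max(0.0, x)


activations = [relu(x) for x in (-2.0, 0.0, 3.5)]
```

Negative inputs become 0 and positive inputs are returned unchanged, which is all the computation involves.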

Answer 49

The task of making predictions about the interests of one user based on the interests of many other users.

(74 cards)