Reducing Loss, Regularization, Classification Flashcards
feature
input variable—the x variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features.
example
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.) We break examples into two categories:
labeled examples
unlabeled examples
A labeled example includes both feature(s) and the label. That is:
labeled examples: {features, label}: (x, y)
In our spam detector example, the labeled examples would be individual emails that users have explicitly marked as “spam” or “not spam.”
Training
Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
Inference
Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y’). For example, during inference, you can predict medianHouseValue for new unlabeled examples.
Regression vs. classification
A regression model predicts continuous values. A classification model predicts discrete values.
empirical risk minimization
In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.
loss
loss is a number indicating how bad the model’s prediction was on a single example
L2 loss
The squared loss for a single example is as follows:
= the square of the difference between the label and the prediction
= (observation - prediction(x))^2
= (y - y’)^2
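For example (illustrative numbers), if the label y = 5 and the prediction y' = 3, the squared loss is (5 - 3)^2 = 4.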
MSE
Mean Squared Error
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
MSE = (1/N) * Σ (y - prediction(x))^2, summed over all N examples (x, y) in the dataset.
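A minimal sketch of this computation in Python (illustrative names; assumes labels and predictions are plain lists of equal length):

```python
# MSE: average the squared loss over all examples.
def mean_squared_error(labels, predictions):
    squared_losses = [(y - y_pred) ** 2 for y, y_pred in zip(labels, predictions)]
    return sum(squared_losses) / len(squared_losses)

# Illustrative numbers: labels vs. model predictions.
print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 3.5]))  # (0.25 + 0 + 1) / 3 ≈ 0.417
```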
What is a convex problem?
The function has the shape of a bowl.
Are neural nets convex?
No. There is more than one minimum.
Mini-Batch Gradient Descent
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch. Here the loss and gradients are averaged over the examples in the mini-batch.
When has the ML model converged?
Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.
What is the gradient of a function?
The gradient of a function f, denoted ∇f, is the vector of partial derivatives with respect to all of the independent variables: ∇f = (∂f/∂x1, ..., ∂f/∂xn).
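For example, for f(x, y) = x^2 + y^2, the partial derivatives are 2x and 2y, so ∇f = (2x, 2y); at the point (1, 2) the gradient is (2, 4).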
Where does the gradient point to?
The gradient points in the direction of greatest (steepest) increase of the function; for a loss function, this is the direction of steepest increase in loss.
Where does the negative gradient point to?
Points in the direction of greatest decrease of the function.
How are the gradient and the loss function connected?
We often have a loss function of many variables that we are trying to minimize, and we try to do this by following the negative of the gradient of the function.
What characteristics does a gradient have?
a direction
a magnitude
gradient descent algorithm
The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.
learning rate (also sometimes called step size)
Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point.
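A minimal sketch of the update rule on an illustrative one-dimensional loss (the loss and its gradient are made up for the example):

```python
# Illustrative loss(w) = (w - 3)^2, whose gradient with respect to w is 2 * (w - 3).
def gradient(w):
    return 2 * (w - 3)

learning_rate = 0.1  # step size hyperparameter
w = 0.0              # starting point
for _ in range(50):
    w = w - learning_rate * gradient(w)  # step in the direction of the negative gradient
print(w)  # converges toward 3, the minimum of the loss
```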
Hyperparameters
Hyperparameters are the knobs that programmers tweak in machine learning algorithms.
How to find a fitting learning rate for the gradient?
The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small, then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
batch
a batch is the total number of examples you use to calculate the gradient in a single iteration
Stochastic gradient descent (SGD)
What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit, noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme–it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term “stochastic” indicates that the one example comprising each batch is chosen at random.
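A minimal sketch of how batch size controls the gradient estimate (illustrative names and loss; batch_size = 1 corresponds to SGD, 10 to 1,000 to mini-batch SGD, and the whole dataset to full-batch):

```python
import random

# Estimate the gradient from a batch of examples chosen at random.
def estimate_gradient(examples, grad_fn, batch_size):
    batch = random.sample(examples, batch_size)  # random subset of the data
    grads = [grad_fn(ex) for ex in batch]
    return sum(grads) / len(grads)               # gradient averaged over the batch

# Illustrative per-example loss (w - x)^2, so the per-example gradient w.r.t. w is 2 * (w - x).
examples = [1.0, 2.0, 3.0, 4.0, 5.0]
w = 0.0
grad_fn = lambda x: 2 * (w - x)
print(estimate_gradient(examples, grad_fn, batch_size=1))  # noisy single-example estimate (SGD)
print(estimate_gradient(examples, grad_fn, batch_size=5))  # exact average over all five examples
```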
Basic assumptions of ML
- We draw examples independently and identically (i.i.d.) at random from the distribution
- The distribution is stationary: it doesn't change over time
- We always pull from the same distribution: including training, validation, and test sets
generalization bounds
a statistical description of a model’s ability to generalize to new data based on factors such as:
- the complexity of the model
- the model’s performance on training data
What does i.i.d. basically mean?
That examples don't influence each other: each one is an independent random draw from the same distribution.
What is a violation of stationarity?
Consider a data set that contains retail sales information for a year. Users' purchases change seasonally, which would violate stationarity.
How to use a validation set?
You train your model on the training set and then evaluate it on the validation set. You tweak your model according to the results on the validation set and pick the model that does best on it. You then confirm your results on the test set.
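A minimal sketch of that workflow using scikit-learn (the synthetic data and the Ridge candidates are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split into training (60%), validation (20%), and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Train candidate models (here, different regularization strengths) on the training set
# and keep the one that does best on the validation set.
best_model, best_val_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_model, best_val_mse = model, val_mse

# Finally, confirm the chosen model on the held-out test set.
print(mean_squared_error(y_test, best_model.predict(X_test)))
```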