2_Linear Regression Flashcards
Linear relationship
Using the equation for a line, you could write down this relationship as follows:
y=mx+b
where:
- y is the temperature in Celsius—the value we’re trying to predict.
- m is the slope of the line.
- x is the number of chirps per minute—the value of our input feature.
- b is the y-intercept.
By convention in machine learning, you’ll write the equation for a model slightly differently:
y′=b+w1x1
where:
- y′ is the predicted label (a desired output).
- b is the bias (the y-intercept), sometimes referred to as w0.
- w1 is the weight of feature 1. Weight is the same concept as the “slope” m in the traditional equation of a line.
- x1 is a feature (a known input).
Linear relationship
To infer (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the x1 value into this model.
Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (w1, w2, etc.). For example, a model that relies on three features might look as follows:
y′=b+w1x1+w2x2+w3x3
Training and loss
Training a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called empirical risk minimization.
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples. For example, Figure 3 shows a high loss model on the left and a low loss model on the right. Note the following about the figure:
- The arrows represent loss.
- The blue lines represent predictions.
Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.
You might be wondering whether you could create a mathematical function—a loss function—that would aggregate the individual losses in a meaningful fashion.
Squared loss: a popular function
The linear regression models we’ll examine here use a loss function called squared loss (also known as L2 loss). The squared loss for a single example is as follows:
- = the square of the difference between the label and the prediction
- = (observation - prediction(x))2
- = (y - y’)2
Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:
where:
- (x,y) is an example in which
- x is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
- y is the example’s label (for example, temperature).
- prediction(x) is a function of the weights and bias in combination with the set of features x.
- D is a data set containing many labeled examples, which are (x,y) pairs.
- N is the number of examples in D.
Although MSE is commonly-used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.