Train And Evaluate Regression Models Flashcards
Regression is a commonly used kind of machine learning for predicting numeric values
Regression is where models predict a number.
In machine learning, the goal of regression is to create a model that can predict a numeric, quantifiable value, such as a price, amount, size, or other scalar number.
Regression is a statistical technique of fundamental importance to science because of its ease of interpretation, robustness, and speed of calculation.
Regression models provide an excellent foundation for understanding how more complex machine learning techniques work
In real-world situations, particularly when little data is available, regression models are very useful for making predictions.
For example, if a company that rents bicycles wants to predict the expected number of rentals on any given day in the future, a regression model can predict this number.
You could create a model using existing data, such as the number of bicycles that were rented on days where the season, day of the week, and so on, were also recorded.
Regression works by establishing a relationship between variables in the data that represent characteristics (known as the features) of the thing being observed, and the variable we are trying to predict (known as the label).
To train the model we start with a data sample containing the features as well as the known values for the label
The data sample is split into two subsets:
A training dataset to which we apply an algorithm that determines a function encapsulating the relationship between the feature values and the known label values.
A validation or test dataset that we can use to evaluate the model by using it to generate predictions for the label and comparing them to the actual known label values.
The use of historic data with known label values to train a model makes regression an example of supervised machine learning.
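The workflow above can be sketched in a few lines of Python. This is a minimal illustration using scikit-learn; the feature values (daily temperature) and labels (rental counts) are made-up numbers, not real data.

```python
# A minimal sketch of the train/validate regression workflow.
# The data values here are illustrative, not from a real dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical historic data: one feature (temperature) and a known label (rentals)
X = np.array([[10], [12], [15], [18], [20], [22], [25], [28], [30], [32]])
y = np.array([40, 46, 55, 64, 70, 76, 85, 94, 100, 106])

# Split the sample into training and validation subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Apply an algorithm to the training subset to determine a function
# relating the feature values to the label values
model = LinearRegression().fit(X_train, y_train)

# Generate predictions for the held-out data, to be compared
# against the actual known label values
predictions = model.predict(X_test)
```

The held-out `predictions` can then be compared to `y_test` to evaluate the model, as described in the following sections.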
Note:
Machine learning is based in statistics and math, and it is important to be aware of specific terms that statisticians and mathematicians, and therefore data scientists, use.
You can think of the difference between a predicted label value and the actual label value as a measure of error.
However, in practice, the actual values are based on sample observations, which themselves might be subject to some random variance.
To make it clear that we are comparing a predicted value with an observed value, we refer to the differences between them as the residuals.
We can summarise the residuals of all of the validation data predictions to calculate the overall loss in the model as a measure of its predictive performance.
One of the most common ways to measure the loss is to square the individual residuals, sum the squares, and calculate the mean.
Squaring the residuals has the effect of basing the calculation on absolute values (ignoring whether the difference is negative or positive) and giving more weight to larger differences.
This metric is called the mean squared error (MSE).
Sometimes it is more useful to express the loss in the same unit of measurement as the predicted label value itself.
It is possible to do this by calculating the square root of the MSE, which produces a metric known, unsurprisingly, as the root mean squared error (RMSE).
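The residual, MSE, and RMSE calculations can be written out directly with NumPy. The actual and predicted values below are illustrative, not output from a real model.

```python
# Computing residuals, then MSE and RMSE, directly with NumPy.
# The actual/predicted arrays are illustrative values only.
import numpy as np

actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 190.0, 270.0])

residuals = actual - predicted      # per-observation differences
mse = np.mean(residuals ** 2)       # mean squared error
rmse = np.sqrt(mse)                 # root mean squared error, in label units
```

Note how the one residual of -20 contributes 400 to the sum of squares, four times as much as each residual of ±10, illustrating the extra weight given to larger differences.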
There are many other metrics that can be used to measure loss in a regression.
For example, R-squared (sometimes known as the coefficient of determination) is the correlation between x and y squared.
This produces a value between 0 and 1 that measures the amount of variance that can be explained by the model.
Generally the closer the value is to 1 the better the model predicts
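As a sketch of this definition, R-squared can be computed as the squared Pearson correlation between the actual and predicted values (this equality holds for a least-squares linear fit; the arrays below are illustrative numbers).

```python
# R-squared as the squared correlation between actual and predicted values.
# Illustrative arrays; for a least-squares linear fit this matches the
# usual coefficient-of-determination definition.
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 9.0, 12.0])
predicted = np.array([3.5, 4.5, 7.5, 8.5, 12.0])

r = np.corrcoef(actual, predicted)[0, 1]  # Pearson correlation coefficient
r_squared = r ** 2                        # a value between 0 and 1
```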
You can quantify the residuals by calculating a number of commonly used evaluation metrics.
Mean squared error (MSE):
The mean of the squared differences between predicted and actual values.
This yields a relative metric in which the smaller the value, the better the model's fit.
Root mean squared error (RMSE): the square root of the MSE.
This yields an absolute metric in the same unit as the label.
The smaller the value, the better the model (in a simplistic sense, it represents the average amount by which the predictions are wrong).
Coefficient of determination (usually known as R-squared):
A relative metric in which the higher the value, the better the model's fit.
In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.
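All three metrics are available directly in scikit-learn's metrics module. This is a sketch with illustrative arrays; `np.sqrt` is used for the RMSE so the snippet works across scikit-learn versions.

```python
# Computing MSE, RMSE, and R-squared with scikit-learn.
# The actual/predicted arrays are illustrative values only.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

actual = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

mse = mean_squared_error(actual, predicted)   # relative: smaller is better
rmse = np.sqrt(mse)                           # absolute, in label units
r2 = r2_score(actual, predicted)              # closer to 1 is better
```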