Ch4 training models Flashcards
What is the most common performance measure for a regression model and what is its formula?
The most common performance measure of a regression model is the mean squared error (MSE). Note that the MSE is a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. Its formula is:
MSE = (1/m) Σ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)², the average of the squared differences between the predictions ŷ⁽ⁱ⁾ and the targets y⁽ⁱ⁾ over the m training instances.
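A minimal sketch of computing the MSE by hand and with scikit-learn's mean_squared_error; the toy target and prediction values are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, just to show the formula in action
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)         # (1/m) * sum of squared errors
print(mse)                                    # 0.375
print(mean_squared_error(y_true, y_pred))     # same value
```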
What is the closed-form solution (In other words, a mathematical equation that gives the result directly) of the linear regression model? This equation is also called the normal equation.
The equation is: θ̂ = (XᵀX)⁻¹ Xᵀ y. In this equation, θ̂ represents the value of θ (a column vector) that minimizes the cost function (most often the MSE), X is the matrix of input features (one row per instance), and y is the column vector of target values containing y⁽¹⁾ to y⁽ᵐ⁾.
Note: The Normal Equation may not work if the matrix XᵀX is not invertible, for example if m < n (fewer training instances than features).
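A minimal NumPy sketch of the Normal Equation, checked against scikit-learn's LinearRegression; the toy data (y = 4 + 3x plus noise) and the seed are made-up illustrative values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy data: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

# Normal Equation: theta_hat = (X^T X)^-1 X^T y
X_b = np.c_[np.ones((100, 1)), X]          # add x0 = 1 (bias term) to each instance
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)                           # roughly [4, 3]

# scikit-learn's LinearRegression finds the same solution (via an SVD-based solver)
lin_reg = LinearRegression().fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)
```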
What is the computational complexity of finding the parameters, using the normal equation, for a linear regression model?
The scikit-learn LinearRegression class is about O(n²) with respect to the number of features n: if you double the number of features, you multiply the computation time by roughly 4. (Inverting XᵀX directly via the Normal Equation is typically O(n^2.4) to O(n^3), depending on the implementation.) Note, however, that once the model is trained, making a prediction is only O(n). Training the model is the computationally hard part.
What is the general idea of Gradient Descent?
Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. It tweaks the parameters iteratively in order to minimize a cost function.
The algorithm measures the local gradient of the error function with regard to the parameter vector theta, and it goes in the direction of descending gradient (maximum downward slope). Once the gradient is zero, you have reached a minimum!
An important parameter of Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too low, it will take a long time to converge to a solution. On the other hand, if it is too high, you may never find an optimal solution and the algorithm may even diverge.
What are the three main variants of Gradient Descent?
1) Batch Gradient Descent
2) Stochastic Gradient Descent
3) Mini-Batch Gradient Descent
What is the main idea of Batch Gradient Descent?
Batch Gradient Descent computes the gradient of the cost function (the vector of partial derivatives) with regard to each model parameter θ, using the whole training set at every step. If MSE is our cost function and the learning rate is η, the update rule is:
θ(next step) = θ − η ∇θ MSE(θ)
Note that at every step it computes ∇θ MSE(θ) over the full training set, which makes it very slow on large training sets. The learning rate η is also really important: if it is too high, the solution may bounce up and down without ever converging, and if it is too low, it will take much longer to find the solution.
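A minimal sketch of Batch Gradient Descent for linear regression with the MSE cost; the data, learning rate (eta = 0.1), and epoch count are made-up illustrative values:

```python
import numpy as np

# Made-up toy data: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

X_b = np.c_[np.ones((100, 1)), X]      # add bias term x0 = 1
m = len(X_b)
eta = 0.1                              # learning rate (eta)
theta = rng.normal(size=2)             # random initialization

for epoch in range(1000):
    # Gradient of the MSE over the *whole* training set: (2/m) X^T (X theta - y)
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)                           # close to the Normal Equation solution
```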
What is the main idea of Stochastic Gradient Descent?
The update rule is the same as for Batch Gradient Descent. For example, if using the MSE in a linear regression problem, the formula would be:
θ(next step) = θ − η ∇θ MSE(θ)
But Stochastic Gradient Descent (SGD) computes the gradient based on just a single training instance at each step. After one step it discards that instance, picks another one at random, and recomputes the gradient. This makes each step very cheap, at the cost of a much noisier (stochastic) descent.
It can be regarded as a stochastic approximation of Gradient Descent, since it replaces the actual gradient (calculated from the entire data set) by an estimate of it (calculated from a single randomly selected instance).
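A minimal SGD sketch for the same linear regression problem, with one random instance per step and a simple decaying learning schedule; the schedule constants and epoch count are made-up illustrative values:

```python
import numpy as np

# Made-up toy data: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)
theta = rng.normal(size=2)

def learning_schedule(t, t0=5, t1=50):
    """Gradually shrink the learning rate so the solution can settle."""
    return t0 / (t + t1)

for epoch in range(50):
    for i in range(m):
        idx = rng.integers(m)                      # pick one instance at random
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)   # gradient estimate from 1 instance
        theta = theta - learning_schedule(epoch * m + i) * gradients

print(theta)                                       # noisy, but roughly [4, 3]
```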
What is the main idea of mini-batch Gradient Descent?
Mini-Batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. The update rule is the same; for example, if using the MSE in a linear regression problem, the formula would be:
θ(next step) = θ − η ∇θ MSE(θ)
Mini-Batch Gradient Descent computes the gradient on small random subsets of instances called mini-batches. After each step it discards the current mini-batch, selects another one (randomly or from a pre-set schedule), and recomputes the gradient.
As with SGD, it can be regarded as a stochastic approximation of Gradient Descent, since it replaces the actual gradient (calculated from the entire data set) by an estimate calculated from a randomly selected mini-batch; the larger the mini-batch, the less noisy the estimate.
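A minimal Mini-Batch Gradient Descent sketch; batch_size = 20, eta = 0.1, and the epoch count are made-up illustrative values:

```python
import numpy as np

# Made-up toy data: y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

X_b = np.c_[np.ones((100, 1)), X]
m, batch_size, eta = len(X_b), 20, 0.1
theta = rng.normal(size=2)

for epoch in range(200):
    indices = rng.permutation(m)                  # shuffle instances each epoch
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = X_b[batch], y[batch]
        gradients = (2 / len(batch)) * xb.T @ (xb @ theta - yb)
        theta = theta - eta * gradients

print(theta)                                      # roughly [4, 3]
```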
Why is a polynomial regression model capable of finding relationships between features?
PolynomialFeatures adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².
Note that a plain Linear Regression model on the original features cannot capture these degree-2-and-above relationships between features, whereas a linear model trained on the polynomial features can. Also beware of the combinatorial explosion of the number of features!
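A minimal PolynomialFeatures sketch with two made-up feature values (a = 2, b = 3); it assumes a scikit-learn version recent enough to provide get_feature_names_out (1.0+):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two features a and b; degree-3 PolynomialFeatures adds powers and cross terms
X = np.array([[2.0, 3.0]])   # a = 2, b = 3 (made-up values for illustration)

poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["a", "b"]))
# ['a' 'b' 'a^2' 'a b' 'b^2' 'a^3' 'a^2 b' 'a b^2' 'b^3']
print(X_poly)
# [[ 2.  3.  4.  6.  9.  8. 12. 18. 27.]]
```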
Imagine you have a function, y = 0.5x₁² + x₁ + 2 + Gaussian noise, generating random data points. What would happen if you used a regression model of degree 1? Of degree 100?
The high-degree polynomial regression model would severely overfit the training data, while the linear (degree-1) model would underfit it.
Imagine you have a function, y = 0.5x₁² + x₁ + 2 + Gaussian noise, generating random data points. What would happen if you plotted the learning curves against the training set size for a regression model of degree 1? Of degree 10? Note: a learning curve here plots the model's error (the RMSE in this case) on the training set and on the validation set as a function of the training set size.
For the underfitting model, the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set does not make the average error much better or worse. The validation error decreases until it reaches a plateau close to the training curve, and both plateau at a fairly high error.
The curves of the overfitting model look similar to the previous ones, but there are two important differences (see the sketch after this list):
1) the error on the training data is much lower.
2) there is a big gap between the training curve and the validation curve, which is a hallmark of overfitting.
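A minimal sketch of plotting these learning curves for the underfitting degree-1 model; the seed and the 80/20 split are made-up illustrative choices, and wrapping the model in a Pipeline with PolynomialFeatures(degree=10) would produce the overfitting curves instead:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data from y = 0.5*x1^2 + x1 + 2 + Gaussian noise
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train on growing subsets and record train/validation error
train_errors, val_errors = [], []
for m in range(1, len(X_train)):
    model = LinearRegression().fit(X_train[:m], y_train[:m])
    train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

plt.plot(np.sqrt(train_errors), "r-+", label="train")
plt.plot(np.sqrt(val_errors), "b-", label="validation")
plt.xlabel("training set size"); plt.ylabel("RMSE"); plt.legend()
plt.show()
```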
What is Ridge Regression?
Ridge Regression is a regularized version of Linear Regression: a regularization term is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. The cost function is now:
Cost = MSE(θ) + α (1/2) Σ θᵢ²   (the sum runs over i = 1 to n, excluding the bias term θ₀)
Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to use the unregularized performance measure to evaluate the model’s performance. The hyperparameter α controls how much you want to regularize the model.
What do we need to do before performing Ridge Regression?
It is important to scale the data before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.
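A minimal Ridge sketch that scales the features first (via a scikit-learn Pipeline with StandardScaler); the toy data and alpha = 1.0 are made-up illustrative values:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with three features
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 3))
y = 4 + X @ np.array([3.0, -2.0, 0.5]) + rng.normal(size=100)

# Scale the inputs, then fit Ridge Regression
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_model.fit(X, y)
print(ridge_model.named_steps["ridge"].coef_)   # shrunk (but nonzero) weights
print(ridge_model.predict(X[:3]))
```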
What does regularization tend to do?
It tends to increase the bias but lower the variance.
What is the closed-form solution of Ridge Regression?
θ̂ = (XᵀX + αA)⁻¹ Xᵀ y
A is the (n+1)×(n+1) identity matrix, except with a 0 in the top-left cell, corresponding to the bias term. Note that θ̂ denotes the estimated parameter vector. The hyperparameter α controls how much you want to regularize the model.
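A minimal NumPy sketch of this closed form; the toy data and alpha = 1.0 are made-up illustrative values:

```python
import numpy as np

# Closed-form Ridge: theta_hat = (X^T X + alpha * A)^-1 X^T y,
# where A is the identity matrix with a 0 in the top-left cell (bias not regularized)
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 2))
y = 4 + X @ np.array([3.0, -2.0]) + rng.normal(size=100)

X_b = np.c_[np.ones((100, 1)), X]      # add bias column x0 = 1
alpha = 1.0
A = np.identity(X_b.shape[1])
A[0, 0] = 0.0                          # do not regularize the bias term

theta_hat = np.linalg.inv(X_b.T @ X_b + alpha * A) @ X_b.T @ y
print(theta_hat)
```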
What is Lasso Regression and what is its formula?
The Least absolute shrinkage and selection operator regression (usually simply called Lasso Regression) is a regularized version of Linear Regression. The cost function is given by:
Cost = MSE(θ) + α Σ |θᵢ|
What is the difference between Ridge Regression and Lasso Regression?
They are both regularized versions of Linear Regression. Ridge uses the L2 norm (the sum of the squared parameters) as its regularization term, while Lasso uses the L1 norm (the sum of the absolute values of the parameters).
What is the most important characteristic of the Lasso Regression?
It tends to eliminate the weights of the least important features (i.e., set them to zero).
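A minimal Lasso sketch illustrating this on made-up data where the third feature is pure noise; alpha = 0.1 and the data-generating weights are illustrative values, so the exact coefficients will vary:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: y depends on features 0 and 1; feature 2 is pure noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 4 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso_model.fit(X, y)
print(lasso_model.named_steps["lasso"].coef_)
# e.g. roughly [ 2.9, -1.9, 0.0 ] -- the useless feature's weight is eliminated
```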