Gradient Boosting Flashcards
DS
How does Gradient Boosting (aka Gradient Boosting Machines, GBM) differ from traditional decision tree algorithms?
Gradient boosting combines multiple weak predictors (shallow decision trees) to create a strong predictor. Specifically, it uses a differentiable loss function and computes the gradient of the loss with respect to the current predictions, then iteratively builds new decision trees that fit that gradient and reduce the current error. More and more trees are added to the current model to continue correcting the error until improvements fall below some minimum threshold or a predetermined number of trees has been created.
What hyperparameters can be tuned in Gradient Boosting, in addition to each individual tree's hyperparameters?
The main hyperparameters that can be tuned for a GBM are (a scikit-learn sketch follows this list):
Loss function: the loss used to calculate the gradient of the error
Learning rate: the rate at which the new trees correct/modify the existing predictor
n_estimators: the total number of trees to produce the final predictor
Additional hyperparameters: those specific to the loss_function
Some specific implementations, e.g. stochastic gradient boosting, may have additional hyperparameters such as subsample size (the subsample size controls the randomization in stochastic variants)
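For concreteness, here is a minimal sketch of these knobs using scikit-learn's GradientBoostingRegressor (parameter names differ in other libraries such as XGBoost or LightGBM; the values shown are illustrative defaults, not recommendations):

from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(
    loss="squared_error",   # loss function whose gradient each new tree fits
    learning_rate=0.1,      # how strongly each new tree corrects the current model
    n_estimators=100,       # total number of weak trees in the final predictor
    subsample=1.0,          # values < 1.0 give stochastic gradient boosting
    max_depth=3,            # per-tree hyperparameter (size of each weak learner)
)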
How can you reduce overfitting when doing Gradient Boosting?
To reduce variance, lowering the learning_rate or reducing the maximum number of estimators (n_estimators) are the two easiest ways to deal with GBM overfitting.
With stochastic gradient boosting, reducing subsample size is an additional way to reduce variance and combat overfitting.
Boosting algorithms tend to be vulnerable to overfitting, so knowing how to reduce overfitting is important.
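A hedged scikit-learn sketch of these controls (validation_fraction and n_iter_no_change enable early stopping; the specific numbers here are illustrative only):

from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(
    learning_rate=0.05,      # smaller steps -> less variance per tree
    n_estimators=200,        # cap on the number of trees
    subsample=0.8,           # stochastic GB: each tree sees 80% of the rows
    validation_fraction=0.1, # hold out 10% of the training data
    n_iter_no_change=10,     # stop early if 10 rounds bring no improvement
)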
Explain in English How Gradient Boosting Works
Say we have train data features [height, favorite color, gender] and target [weight].
Overview: Gradient Boost starts by making a single leaf, instead of a tree or stump; this leaf represents an INITIAL guess for the WEIGHTS of all the samples. The first guess is the average of the observed target weights.
Then Gradient Boost builds a tree, informed by the errors made by the previous trees. Unlike AdaBoost, the weak trees are usually larger than stumps, but Gradient Boost still builds weak trees of limited size (users typically fix the number of leaves somewhere between 8 and 32).
Like AdaBoost, Gradient Boost scales the trees; however, unlike AdaBoost, all of the weak trees are scaled by the same constant amount (the learning rate).
Then GB builds another tree based on the errors of the previous trees, scales it, and continues until it has built the specified number of weak trees or additional weak trees fail to improve the fit.
Step 1: initialize the predicted weight of every sample to the average target value, 71.2.
Step 2: build the next tree based on the first model's errors, where each sample's error from the previous model becomes a new column, pseudo-residual = (observed weight - predicted weight). ("Pseudo-residual" is a reminder that we are doing GB, not linear regression.)
Step 3 and on: build a tree using the same features [height, favorite color, gender], but use the previous pseudo-residuals as the target, so that we predict the residuals instead of the original [weight] target.
Because we restrict the total number of leaves, we get fewer leaves than pseudo-residuals, so several samples can land in the same leaf.
In that case a leaf replaces those samples' residuals with the average of the pseudo-residuals that land in it (a minimal sketch of the whole loop follows).
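For reference, a minimal from-scratch sketch of this loop for the squared-error case (function names and defaults are my own; a real implementation such as scikit-learn's also supports other loss functions, subsampling, and early stopping):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_leaves=8):
    # Step 1: the initial model is a single leaf predicting the average target.
    f0 = np.mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        # Pseudo-residuals for 1/2 squared-error loss: observed - current prediction.
        residuals = y - pred
        # Fit a small ("weak") tree to the residuals using the original features.
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
        tree.fit(X, residuals)
        # Every tree is scaled by the same learning rate before updating the model.
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    # Final prediction = initial leaf + scaled contributions of all weak trees.
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)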
Explain in English How Gradient Boosting Works for Regression
Consider the super-simple train data set X, y:
height  favorite color  gender  weight
1.6     blue            male    88
1.6     green           female  76
1.5     blue            female  56
Given a differentiable loss function suitable for gradient descent,
L(y_i, F(x)) = 1/2 (y_i - yhat_i)^2
we can evaluate the performance of the model by summing the loss over each (x_i, y_i), i.e. comparing RSS the same way we would for different linreg fits. Note: the loss is multiplied by 1/2 so that, when the chain rule brings down the 2 from the square, the factors cancel.
Step 1: INITIALIZE the model with a constant value: F_0(x) = argmin_gamma Sum_i L(y_i, gamma) = argmin_gamma Sum_i 1/2 (y_i - gamma)^2
The value of gamma (a single constant prediction) that minimizes this INITIAL loss is the average of the observed weights, (88 + 76 + 56)/3 = 73.3.
i.e. the initial model is one leaf, and the prediction for every sample's [weight] is the constant 73.3.
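Why the optimal constant is the average (a one-line derivation, not from the flashcard, assuming the 1/2 squared-error loss above):

\frac{d}{d\gamma}\sum_i \tfrac{1}{2}(y_i-\gamma)^2 = -\sum_i (y_i-\gamma) = 0
\;\Longrightarrow\;
\gamma = \frac{1}{n}\sum_i y_i = \frac{88+76+56}{3} \approx 73.3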
Step 2 is a loop over m = 1…M in which we build all of the trees.
In practice, most users set M = 100, i.e. 100 trees.
Step 2a: compute the pseudo-residuals, i.e. the negative gradient of the loss with respect to the current prediction:
r_{i,m} = -[dL(y_i, F(x_i)) / dF(x_i)]
        = -1 * [d/d(predicted) of 1/2 (observed - predicted)^2]
        = -1 * [-(observed - predicted)]
        = (observed - predicted) = residual
Next, evaluate this at the previous model by replacing "predicted" with F_{m-1}(x), giving r_{i,m} = (observed - F_{m-1}(x_i)), and continue to iterate.
So the first-round pseudo-residuals are (also checked numerically after the table below):
r_{1,1} = (observed - 73.3) = (88 - 73.3) = 14.7
r_{2,1} = (76 - 73.3) = 2.7
r_{3,1} = (56 - 73.3) = -17.3
height  favorite color  gender  weight  pseudo-resids
1.6     blue            male    88      14.7
1.6     green           female  76      2.7
1.5     blue            female  56      -17.3
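A quick numpy check of these first-round pseudo-residuals (variable names are my own):

import numpy as np

weights = np.array([88.0, 76.0, 56.0])   # observed target
F0 = weights.mean()                      # initial leaf, ~73.3
pseudo_residuals = weights - F0          # negative gradient of 1/2*(y - F)^2
print(pseudo_residuals)                  # ~[14.7, 2.7, -17.3] (exact: 14.67, 2.67, -17.33)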
Step 2b: fit a regression tree to predict the pseudo-residuals and create the "terminal regions" R_{j,m} (the leaves).
             [height < 1.55]
             /             \
        R_{1,1}            R_{2,1}
        {-17.3}         {14.7, 2.7}
Step 2c: determine the output value for each leaf. Specifically, since two residuals ended up in the right leaf, it is unclear what that leaf's output should be. So for each leaf in the new tree we compute an output value, gamma_{j,m}, where the output value for each leaf is the value that minimizes the following summation:
for j = 1..J_m compute gamma_{j,m} = argmin_gamma Sum_{x_i in R_{j,m}} L(y_i, F_{m-1}(x_i) + gamma)
This is similar to the step 1 initialization, but the small difference is that we take the PREVIOUS prediction into account (in step 1 there was no previous prediction in the equation).
Note: the optimization also differs because the condition (x_i in R_{j,m}) restricts the sum to the samples in that leaf: since only sample x3 terminates in the left leaf R_{1,1}, only x3 is used to calculate the output value for R_{1,1}, and only samples x1, x2 are used to calculate the output value for the right leaf R_{2,1}.
gamma_{1,1} = argmin_gamma 1/2 (y3 - (F_{m-1}(x3) + gamma))^2
            = argmin_gamma 1/2 (56 - 73.3 - gamma)^2
            = argmin_gamma 1/2 (-17.3 - gamma)^2
Apply the chain rule to solve for the optimal gamma:
d/d gamma [1/2 (-17.3 - gamma)^2] = (-1) * (-17.3 - gamma) = 17.3 + gamma
Set d/d gamma = 0 and solve for gamma: 17.3 + gamma = 0 ==> gamma = -17.3
So the value of gamma_{1,1} that minimizes the loss is gamma_{1,1} = -17.3, and the left leaf has an output value of -17.3.
Similarly for the right leaf's output value:
d/d gamma [1/2 (14.7 - gamma)^2 + 1/2 (2.7 - gamma)^2]
(apply the chain rule, and set the derivative to zero:)
-(14.7 - gamma) - (2.7 - gamma) = -14.7 - 2.7 + 2*gamma = 0
gamma = (14.7 + 2.7) / 2 = 8.7
i.e. we end up with the average of the residuals that landed in leaf R_{2,1}! So the leaf R_{2,1} has an output value of 8.7.
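A small scikit-learn sketch verifying that a one-split regression tree fit to the pseudo-residuals reproduces these leaf output values (height is used as the only feature, since that is the split in the example):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

height = np.array([[1.6], [1.6], [1.5]])
pseudo_residuals = np.array([14.7, 2.7, -17.3])

stump = DecisionTreeRegressor(max_depth=1)   # one split -> two terminal regions
stump.fit(height, pseudo_residuals)

# For 1/2 squared-error loss, the optimal leaf output is the mean of the
# residuals in that leaf, which is exactly what the regression tree stores:
print(stump.predict([[1.5], [1.6]]))         # [-17.3   8.7]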
Step 2d: make a new/updated prediction for each sample:
update: F_m(x) = F_{m-1}(x) + nu * Sum_j gamma_{j,m} * I(x in R_{j,m})
The summation says we add the output values gamma_{j,m} for all leaves R_{j,m} that a sample x_i can be found in (for a single tree, each sample falls in exactly one leaf).
"nu" is the learning rate, a value in (0,1); it reduces the effect each individual tree has on the final prediction and improves long-run accuracy.
set nu = 0.1
updating predictions:
F_1(x1) = F_0(x1) + nu * gamma_{2,1}   (sample 1 falls in the right leaf only)
        = 73.3 + 0.1*(8.7) = 74.2
(new resid 1 := 88-74.2 = 13.8)
F_1(x2) = 73.3 + 0.1*(8.7) = 74.2
(new resid 2 := 76-74.2 = 1.8)
F_1(x3) = 73.3 + 0.1*(-17.3) = 71.57
(new resid 3 := 56 - 71.57 = -15.57)
Note that all of the residuals improved compared to the prior residuals (a quick numeric check follows the table below)!
height  favorite color  gender  weight  new pseudo-resids
1.6     blue            male    88      13.8
1.6     green           female  76      1.8
1.5     blue            female  56      -15.57
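A short numpy check of this update step (the values match the hand calculation up to rounding, since the flashcard rounds F_1 to 74.2):

import numpy as np

nu = 0.1
F0 = 73.3                                   # rounded initial prediction from step 1
observed = np.array([88.0, 76.0, 56.0])
leaf_output = np.array([8.7, 8.7, -17.3])   # gamma of the leaf each sample falls in

F1 = F0 + nu * leaf_output                  # [74.17 74.17 71.57]
new_residuals = observed - F1               # [13.83  1.83 -15.57]
print(F1, new_residuals)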
If we only specified 2 rounds (M = 2), then we are done with the GB optimization and we can compute predictions on test data:
e.g. say we have new data
height  favorite color  gender  weight
1.4     green           male    ???
we can use F 2(x) to predict weight:
F_2(x) = F_0(x) + nu * (tree 1 leaf output) + nu * (tree 2 leaf output)
       = 73.3 + (0.1 * -17.3) + (0.1 * -15.57) = 70.01
(The new sample's height 1.4 < 1.55, so it falls in the left leaf of both trees, whose outputs are -17.3 and -15.57.)
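An end-to-end sketch of the same two-round fit using scikit-learn (height only, as in the example; assuming the default mean initialization for squared error, and noting the exact prediction differs slightly from 70.01 because the hand calculation rounds intermediate values):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

height = np.array([[1.6], [1.6], [1.5]])
weight = np.array([88.0, 76.0, 56.0])

gbm = GradientBoostingRegressor(
    loss="squared_error", n_estimators=2, learning_rate=0.1, max_depth=1
)
gbm.fit(height, weight)

print(gbm.predict([[1.4]]))   # roughly 70, matching the hand-calculated 70.01 up to rounding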