Regression Forest Flashcards
What are the pros and cons of CART?
Pros: great at capturing non-linear, complicated patterns, yielding a model that can be explained
Cons: It has poor prediction performance and does not avoid overfitting
What is the CART problem?
Predictions depend heavily on individual observations, and early splits may hinge on small differences between candidate choices
Define ensemble methods.
Combine the results of many imperfect models to produce a prediction
Explain the bootstrap process.
- Start with the original dataset and draw observations from it at random, with replacement
- Stop when the sample reaches the size of the original dataset
- Repeat to draw a new sample for each additional tree
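A minimal sketch of the bootstrap steps, using only the Python standard library. The function name `bootstrap_sample` and the toy numbers are made up for illustration.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)                 # fixed seed so the draws are repeatable
data = [3.1, 4.7, 5.0, 6.2, 8.9]       # toy dataset (illustrative values)

# One bootstrap sample per tree; here, three samples for three trees.
samples = [bootstrap_sample(data, rng) for _ in range(3)]
```

Because draws are made with replacement, some observations usually appear more than once in a sample and others not at all.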
Define bagging
B = bootstrap the sample (to create many samples)
Agg = aggregate (average) the results
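A small sketch of bagging under simplifying assumptions: one x variable, and each "tree" is a one-split stump rather than a full CART tree. `fit_stump` and `bagged_predict` are hypothetical helper names, and the data are toy values.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) by minimising squared error."""
    best = None
    for t in sorted(set(xs))[1:]:          # candidate thresholds; split is x < t
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                       # all x equal in this sample: predict the mean
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def bagged_predict(xs, ys, x_new, n_trees=50, seed=0):
    """B: fit one stump per bootstrap sample. Agg: average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap sample
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(stump(x_new))
    return mean(preds)                                         # aggregate

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]                  # toy data with a jump
pred_low = bagged_predict(xs, ys, 2)
pred_high = bagged_predict(xs, ys, 7)
```

Each stump alone is a crude model; the averaged prediction is less sensitive to which observations happened to land in any one sample.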
What are the benefits of the bagging process?
It increases the stability of results to create better out-of-sample performance
Why do we limit the number of x variables used for the random forest?
It helps reduce the risk of overfitting.
If one variable is much more prominent than the others, it can cause many trees to follow the same path and look similar, which leads to highly correlated trees
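A sketch of limiting the x variables, assuming the subset is redrawn at every split (the usual random forest design). The name `candidate_features` is hypothetical, and using about one third of the variables is a common rule of thumb for regression, not something fixed by these notes.

```python
import random

def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices to consider at a single split."""
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)
n_features = 9
m = max(1, n_features // 3)    # assumed rule of thumb: about p/3 variables

# Each split sees a different random subset, so a dominant variable
# cannot be chosen first in every tree.
splits = [candidate_features(n_features, m, rng) for _ in range(4)]
```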
T/F: By decorrelating trees, we are artificially making each model worse, but together the model is better.
True
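One way to see why: for n_trees predictions that each have variance sigma2 and pairwise correlation rho, the variance of their average is rho * sigma2 + (1 - rho) * sigma2 / n_trees. Adding trees shrinks only the second term, so lowering rho is what reduces the floor. The function name below is illustrative.

```python
def variance_of_average(sigma2, rho, n_trees):
    """Variance of the mean of n_trees equally correlated predictions."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

# With 100 trees, highly correlated trees barely benefit from averaging.
var_correlated = variance_of_average(1.0, 0.9, 100)
var_decorrelated = variance_of_average(1.0, 0.1, 100)
```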
What are the most important elements of a random forest?
Bagging: aggregating the predictions of many trees grown on bootstrap samples of data
Stopping Rule: use a less restrictive rule to let trees grow large
Decorrelate Trees: Only use a subset of the variables
What does a partial dependence plot show?
How the average y differs across values of x_i when all other x values are held the same
(partial because differences are conditional on all other x variables)
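A sketch of one common way to compute the curve: force x_i to each value of interest for every observation, leave the other x values as observed, and average the predictions. `partial_dependence` and `toy_model` are hypothetical names; any fitted model's prediction function could stand in for `toy_model`.

```python
from statistics import mean

def partial_dependence(predict, rows, j, grid):
    """Average prediction when feature j is set to each grid value in turn."""
    curve = []
    for v in grid:
        # Copy every row, overwriting only feature j with the grid value v.
        modified = [row[:j] + [v] + row[j + 1:] for row in rows]
        curve.append(mean(predict(r) for r in modified))
    return curve

def toy_model(row):
    # Stand-in for a fitted forest (assumption for illustration only).
    return 2.0 * row[0] + row[1] ** 2

rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
pd_curve = partial_dependence(toy_model, rows, 0, [0.0, 1.0, 2.0])
```

Plotting `pd_curve` against the grid values gives the partial dependence plot for that variable.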
How do we decide which variables are most useful in predictions?
Use a variable importance plot
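These notes do not say how importance is measured. One widely used option, shown here as an assumption, is permutation importance: shuffle one x variable and record how much the prediction error rises. All names and data below are illustrative.

```python
import random
from statistics import mean

def mse(predict, rows, ys):
    """Mean squared error of predict on the given rows."""
    return mean((predict(r) - y) ** 2 for r, y in zip(rows, ys))

def permutation_importance(predict, rows, ys, j, rng):
    """Increase in MSE after shuffling the values of feature j across rows."""
    base = mse(predict, rows, ys)
    col = [r[j] for r in rows]
    rng.shuffle(col)
    shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
    return mse(predict, shuffled, ys) - base

def toy_model(row):
    # Stand-in for a fitted forest; it ignores feature 1 entirely.
    return 3.0 * row[0]

rng = random.Random(0)
rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
ys = [toy_model(r) for r in rows]

# One importance score per variable; these are the bars of the plot.
importance = [permutation_importance(toy_model, rows, ys, j, rng) for j in range(2)]
```

A variable the model never uses gets a score of zero, while shuffling a variable the model relies on raises the error.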