Decision Trees Flashcards
DS
What are the common uses of decision tree algorithms?
- Classification
- Regression
- Measuring feature importance
- Feature selection
What are the main hyperparameters that you can tune for decision trees?
Generally speaking, we have the following parameters:
max_depth: maximum tree depth
min_samples_split: minimum number of samples for a node to be split
min_samples_leaf: minimum number of samples for each leaf node
max_leaf_nodes: the maximum number of leaf nodes in a tree
max_features: the maximum number of features that are evaluated for splitting at each node (only valid for algorithms that randomize features considered at each split)
Other, similar hyperparameters may be derived from the ones above.
The “traditional” decision tree algorithm is greedy and evaluates all features at each split point, but many modern implementations (including scikit-learn) can consider a randomized subset of features at each split, so max_features may or may not be a tunable parameter.
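As a quick illustration, these hyperparameters can be tuned with a grid search. A minimal sketch using scikit-learn’s DecisionTreeClassifier and GridSearchCV; the grid values below are arbitrary choices for demonstration:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# candidate values for the hyperparameters listed above (arbitrary demo grid)
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [None, 10, 50],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)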
Explain how each hyperparameter affects the model’s ability to learn.
Generally speaking,
max_depth: increasing max_depth will increase variance and decrease bias
min_samples_split: increasing min_samples_split (the minimum number of samples for a node to be split) decreases variance and increases bias (regulates overfitting)
min_samples_leaf: increasing min_samples_leaf (the minimum number of samples for each leaf node) decreases variance and increases bias (regulates overfitting)
max_leaf_nodes: increasing max_leaf_nodes increases variance and decreases bias
max_features: increasing max_features increases variance and decreases bias
There may be instances where changing a hyperparameter has no effect on the model, e.g. when another constraint (such as a small max_depth) already limits tree growth.
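To see the max_depth effect concretely, here is a minimal sketch (scikit-learn on the iris data; the depth values are arbitrary) comparing training accuracy with cross-validated accuracy as the tree is allowed to grow deeper:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# deeper trees fit the training data more closely (lower bias) but can
# generalize worse (higher variance): compare training accuracy with
# cross-validated accuracy as max_depth grows
for depth in [1, 2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = tree.fit(X, y).score(X, y)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()
    print(depth, round(train_acc, 3), round(cv_acc, 3))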
What metrics are usually used to compute splits?
Gini impurity or entropy. Both generally produce similar results.
What is Gini impurity?
Gini impurity (aka Gini index) measures how often a randomly chosen record would be incorrectly classified if it were labeled at random according to the class distribution of the set of examples (i.e. it is a metric of IMPURITY).
What do high and low Gini scores mean?
Low Gini (near 0) = most records in the sample are in the same class (LOW IMPURITY).
High Gini (approaching the maximum of 1 − 1/k for k classes) = records in the sample are spread evenly across classes (HIGH IMPURITY).
What is entropy?
Entropy is a measure of impurity (disorder) in a set of examples. It is very similar to Gini impurity in concept, but uses a slightly different calculation (based on logarithms of the class proportions).
What do high and low entropy mean?
Low entropy (near 0) = most records from the sample are in the same class.
High entropy (maximum of 1 for two classes, log2 k for k classes) = records from the sample are spread evenly across classes.
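Both measures are easy to compute directly. A minimal sketch in NumPy (the function names are my own):

import numpy as np

def gini_impurity(labels):
    # 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # -sum(p_i * log2(p_i)) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini_impurity([0, 0, 0, 0]))  # 0.0 (pure node)
print(gini_impurity([0, 1, 0, 1]))  # 0.5 (maximally mixed, two classes)
print(entropy([0, 1, 0, 1]))        # 1.0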
Are decision trees parametric or non-parametric models?
“Parametric” vs. “non-parametric” refers to whether the number of model parameters is fixed before training.
Decision trees are non-parametric: the number of model parameters (splits) is not determined before creating the model and grows with the training data.
What are some ways to reduce overfitting in decision trees?
To reduce the flexibility of a decision tree (see the sketch after this list):
- reduce maximum depth
- increase min_samples_split
- balance your data to prevent bias toward dominant classes
- increase the number of samples
- decrease the number of features
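For example, a minimal sketch (scikit-learn on the breast-cancer dataset; the train/test split and hyperparameter values are arbitrary) comparing the train/test accuracy of an unconstrained tree with a constrained one:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# unconstrained tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# constrained tree: limited depth and minimum samples per split/leaf
pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=20,
                                min_samples_leaf=10, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("constrained", pruned)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))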
How is feature importance evaluated in decision tree-based models?
The features that are split on most frequently and are closest to the top of the tree, thus affecting the largest number of samples, are considered the most important.
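In scikit-learn this is quantified via the total impurity decrease each feature contributes, exposed on a fitted tree as feature_importances_. A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# importances sum to 1; higher values mean the feature contributed
# more total impurity reduction across the tree's splits
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(name, round(importance, 3))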
Explain what node purity means.
In a binary classification tree, each leaf node yields a yes/no prediction.
When a leaf node is not 100% “yes” or 100% “no”, we call the node “impure”.
To decide which split is best, we need a way to measure and compare “impurity”.
How is Gini impurity computed for a leaf node?
e.g. given possible nodes [chest pain], [blood circulation], [blocked arteries]
Gini Impurity = 1 - P(“yes”)^2 - P(“no”)^2
e.g. splitting on [chest pain]:
left leaf: heart disease “yes” = 105, “no” = 39
right leaf: heart disease “yes” = 34, “no” = 125
left node Gini = 1 - (105 / (105+39))^2 - (39 / (105+39))^2
= 0.395
right node Gini = 1 - (34 / (34+125))^2 - (125 / (34+125))^2
= 0.336
Thus, total Gini impurity for using chest pain to separate patients with and without heart disease is the wtd avg of the leaf node impurities.
total number left = 144
total number right = 159
Gini impurity for chest pain = wtd avg of Gini impurities for the leaf nodes
= 0.395 * (144 / (144+159)) + 0.336 * (159 / (144+159))
= 0.364
fast forward:
Gini impurity Chest Pain = 0.364
Gini impurity Good Blood Circulation = 0.360
Gini impurity Blocked Arteries = 0.381
thus:
Good Blood Circulation has the LOWEST Gini impurity, so Good Blood Circulation will be chosen as the split at that node of the tree.
As attributes are selected for each node, the number of observations feeding into the next level is reduced as the nodes approach the leaves.
For continuous attributes, candidate split values are taken as the midpoints between adjacent observed numeric values.
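The arithmetic above can be checked with a few lines of plain Python (a minimal sketch; the counts are the ones from the example):

def gini(yes, no):
    # Gini impurity of a leaf with `yes`/`no` class counts
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

left = gini(105, 39)   # ~0.395
right = gini(34, 125)  # ~0.336

# weighted average by the number of samples in each leaf
n_left, n_right = 105 + 39, 34 + 125
total_gini = left * n_left / (n_left + n_right) + right * n_right / (n_left + n_right)
print(round(total_gini, 3))  # 0.364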
Explain decision tree classifiers simply and how to use them
One reason for the popularity of tree-based models is their interpretability. In fact, decision trees can literally be drawn out in their complete form to create a HIGHLY INTUITIVE MODEL. From this basic tree system comes a wide variety of extensions from random forests to stacking.
Training a decision tree:
Decision tree learners attempt to find the decision rule that produces the greatest DECREASE in IMPURITY AT A NODE. While there are a number of measurements of impurity, by default DecisionTreeClassifier uses GINI impurity, G(t) = 1 - sum_i p_i^2, where G(t) is the Gini impurity at node t and p_i is the PROPORTION of observations of class i at node t. This process of finding decision rules that decrease impurity is repeated recursively until all leaf nodes are PURE (CONTAIN ONLY ONE CLASS) or some arbitrary cut-off is reached.
If we want to use a different impurity measure than Gini, we can specify that in the “criterion” hyperparameter.
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
# load data
iris = datasets.load_iris()
X, y = iris.data, iris.target
# create and train the decision tree classifier
tree = DecisionTreeClassifier(random_state=0)
model = tree.fit(X, y)
# make a new observation
observation = [[5, 4, 3, 2]]
# get prediction for the new observation
print(model.predict(observation))  # array([1])
# view predicted class probabilities
print(model.predict_proba(observation))  # array([[0., 1., 0.]])
Explain how a tree regressor works
Decision tree regression works similarly to decision tree classification; however, instead of reducing Gini impurity or entropy, potential splits are by default measured on how much they reduce MSE.
Just like DecisionTreeClassifier, we can use the “criterion” hyperparameter to select the desired measurement of split quality, e.g. we can construct a tree whose splits reduce mean absolute error (MAE) by setting criterion="absolute_error" (called “mae” in older scikit-learn versions).
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets
# load data with two features
# (note: load_boston was removed in scikit-learn 1.2; kept here as in the original example)
boston = datasets.load_boston()
X, y = boston.data[:, 0:2], boston.target
# create decision tree regressor object
tree = DecisionTreeRegressor(random_state=0)
# train model
model = tree.fit(X, y)
# make a new observation
new_obs = [[0.02, 16]]
# predict the new observation's target
model.predict(new_obs)  # array([33.])