Decision Trees Flashcards
What are decision trees?
A decision tree is a supervised learning algorithm that is mostly used for classification problems, although it works for both categorical and continuous target variables.
The algorithm splits the population into two or more homogeneous sets, choosing the most significant attributes/independent variables so that the resulting groups are as distinct as possible.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable.
Various techniques can be used to choose the splits, such as Gini impurity, information gain (entropy), and chi-square.
How do we train decision trees?
1. Start at the root node.
2. For each variable X, find the split set S that minimizes the sum of the node impurities in the two child nodes, and choose the split {X, S} that gives the minimum over all X and S.
3. If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn.
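A toy sketch of one greedy split search (step 2 above) using Gini impurity may make the procedure concrete. It assumes numeric features in a NumPy array X and class labels in y; the function names are illustrative, not from any particular library.

```python
import numpy as np

def gini(y):
    # Gini impurity of a node: 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Returns (feature index, threshold, weighted child impurity) of the best split.
    best = (None, None, np.inf)
    n_samples, n_features = X.shape
    for j in range(n_features):
        for threshold in np.unique(X[:, j]):
            left = X[:, j] <= threshold
            right = ~left
            if left.all() or right.all():
                continue  # skip splits that leave one child empty
            impurity = (left.sum() * gini(y[left]) +
                        right.sum() * gini(y[right])) / n_samples
            if impurity < best[2]:
                best = (j, threshold, impurity)
    return best
```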
What are the main parameters of the decision tree model?
maximum tree depth
minimum samples per leaf node
impurity criterion
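These parameters map directly onto scikit-learn's DecisionTreeClassifier; the sketch below is only illustrative and the chosen values are not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # maximum tree depth
    min_samples_leaf=10,   # minimum samples per leaf node
    criterion="gini",      # impurity criterion ("gini" or "entropy")
)
# tree.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```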
How do we handle categorical variables in decision trees?
Some decision tree algorithms can handle categorical variables out of the box, while others cannot. However, we can transform categorical variables, e.g. with a binary or a one-hot encoder.
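A minimal sketch of one-hot encoding a categorical column before fitting a tree, assuming a pandas DataFrame with hypothetical columns "color", "size", and target "label":

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.5, 3.0, 2.0],
    "label": [0, 1, 1, 0],
})

# One-hot encode the categorical column, keep the numeric one as-is.
X = pd.get_dummies(df[["color", "size"]], columns=["color"])
y = df["label"]

tree = DecisionTreeClassifier().fit(X, y)
```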
What are the benefits of a single decision tree compared to more complex models?
easy to implement
fast training
fast inference
good explainability
How can we know which features are more important for the decision tree model?
Each split is chosen to minimize the sum of the node impurities, where the impurity criterion is a parameter of the tree; popular choices are the Gini impurity and the entropy, which measures information gain. A feature's importance can then be estimated as the total impurity decrease contributed by all splits made on that feature across the tree, so features used in strong, well-separating splits come out as more important.
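A minimal sketch of inspecting impurity-based feature importances in scikit-learn, using the built-in Iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(data.data, data.target)

# feature_importances_ sums the impurity decrease contributed by each feature.
for name, importance in sorted(
    zip(data.feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(f"{name}: {importance:.3f}")
```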
What is random forest?
Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem).
Why do we need randomization in random forest?
Random forest is an extension of the bagging algorithm, which draws random samples from the training dataset (with replacement), trains a model on each sample, and averages the predictions. In addition, each time a split in a tree is considered, random forest takes a random subset of m features out of the full set of n features (without replacement) and uses only this subset as candidates for the split (for example, m = sqrt(n)).
Training decision trees on random data samples from the training dataset reduces variance. Sampling features for each split in a decision tree decorrelates trees.
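The sketch below illustrates both sources of randomness: bootstrap sampling of rows and per-split feature subsampling (delegated here to scikit-learn's max_features="sqrt"). It assumes a NumPy feature matrix X and integer class labels y; the helper names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_toy_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n_samples, _ = X.shape
    trees = []
    for _ in range(n_trees):
        # 1) Bagging: bootstrap sample of the rows (with replacement).
        rows = rng.integers(0, n_samples, size=n_samples)
        # 2) Feature subsampling: each split considers only sqrt(n) random features.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees

def predict_toy_forest(trees, X):
    # Majority vote over the individual tree predictions (integer labels assumed).
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes
    )
```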
What are the main parameters of the random forest model?
max_depth: the maximum depth of a tree (longest path from the root node to a leaf)
min_samples_split: the minimum number of observations needed to split a given node
max_leaf_nodes: caps the number of leaf nodes and hence limits the growth of the trees
min_samples_leaf: the minimum number of samples required in a leaf node
n_estimators: the number of trees
max_samples: the fraction of the original dataset given to any individual tree
max_features: the maximum number of features considered when looking for the best split
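A hedged sketch wiring these parameters into scikit-learn's RandomForestClassifier (the names follow the sklearn API; the values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_depth=None,        # grow trees until other limits are hit
    min_samples_split=2,   # minimum observations required to split a node
    min_samples_leaf=1,    # minimum samples allowed in a leaf
    max_leaf_nodes=None,   # cap on the number of leaves per tree
    max_samples=None,      # fraction/count of rows drawn per tree (requires bootstrap=True)
    max_features="sqrt",   # features considered at each split
    n_jobs=-1,
    random_state=42,
)
# rf.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```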
How do we select the depth of the trees in random forest?
The greater the depth, the more information a tree extracts from the data. However, there is a limit: even though random forest is relatively robust to overfitting, very deep trees may start learning the noise present in the data and overfit to it. Hence, there is no hard rule for choosing the depth, but the literature suggests a few ways to tune it and prevent overfitting (see the sketch after this list):
limit the maximum depth of a tree
limit the number of test nodes
limit the minimum number of objects at a node required to split
do not split a node when, at least, one of the resulting subsample sizes is below a given threshold
stop developing a node if it does not sufficiently improve the fit.
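In practice, the depth (and related stopping parameters) is usually tuned by cross-validation. A minimal sketch, assuming X_train and y_train already exist and using an illustrative depth grid:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [3, 5, 10, 20, None]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_)
```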
How do we know how many trees we need in random forest?
The number of trees in a random forest is controlled by n_estimators. Adding more trees does not make a random forest overfit; it mainly reduces the variance of the predictions until the error flattens out, at the cost of extra computation. There is no fixed rule for choosing the number of trees; it is tuned on the data. One suggested starting point is the square of the number of features n, followed by tuning until the results stop improving.
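One common way to see where adding trees stops helping is to track the out-of-bag score as the forest grows. A minimal sketch, assuming X_train and y_train exist:

```python
from sklearn.ensemble import RandomForestClassifier

for n in [50, 100, 200, 400, 800]:
    rf = RandomForestClassifier(
        n_estimators=n,
        bootstrap=True,
        oob_score=True,   # evaluate each tree on the rows left out of its bootstrap sample
        random_state=0,
        n_jobs=-1,
    )
    rf.fit(X_train, y_train)
    print(n, rf.oob_score_)
```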