Decision Trees Flashcards
What are decision trees?
A decision tree is a supervised learning algorithm that is mostly used for classification problems, although it works for both categorical and continuous target variables.
The algorithm splits the population into two or more homogeneous sets, choosing the most significant attributes/independent variables so that the resulting groups are as distinct as possible.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a value for the target variable.
Various techniques can be used to choose the splits, such as Gini impurity, information gain (entropy), and chi-square.
How do we train decision trees?
1. Start at the root node.
2. For each variable X, find the split set S that minimizes the sum of the node impurities in the two child nodes, and choose the split {X, S} that gives the minimum over all X and S.
3. If a stopping criterion is reached, exit. Otherwise, apply step 2 to each child node in turn.
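A toy sketch of one greedy split search (step 2 above) using Gini impurity may make the procedure concrete. It assumes numeric features in a NumPy array X and class labels in y; the function names are illustrative, not from any particular library.

```python
import numpy as np

def gini(y):
    # Gini impurity of a node: 1 - sum_k p_k^2
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Returns (feature index, threshold, weighted child impurity) of the best split.
    best = (None, None, np.inf)
    n_samples, n_features = X.shape
    for j in range(n_features):
        for threshold in np.unique(X[:, j]):
            left = X[:, j] <= threshold
            right = ~left
            if left.all() or right.all():
                continue  # skip splits that leave one child empty
            impurity = (left.sum() * gini(y[left]) +
                        right.sum() * gini(y[right])) / n_samples
            if impurity < best[2]:
                best = (j, threshold, impurity)
    return best
```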
What are the main parameters of the decision tree model?
maximum tree depth
minimum samples per leaf node
impurity criterion
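These parameters map directly onto scikit-learn's DecisionTreeClassifier; the sketch below is only illustrative and the chosen values are not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # maximum tree depth
    min_samples_leaf=10,   # minimum samples per leaf node
    criterion="gini",      # impurity criterion ("gini" or "entropy")
)
# tree.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```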
How do we handle categorical variables in decision trees?
Some decision tree algorithms can handle categorical variables out of the box, while others cannot. However, we can transform categorical variables, e.g. with a binary or a one-hot encoder.
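A minimal sketch of one-hot encoding a categorical column before fitting a tree, assuming a pandas DataFrame with hypothetical columns "color", "size", and target "label":

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.5, 3.0, 2.0],
    "label": [0, 1, 1, 0],
})

# One-hot encode the categorical column, keep the numeric one as-is.
X = pd.get_dummies(df[["color", "size"]], columns=["color"])
y = df["label"]

tree = DecisionTreeClassifier().fit(X, y)
```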
What are the benefits of a single decision tree compared to more complex models?
easy to implement
fast training
fast inference
good explainability
How can we know which features are more important for the decision tree model?
Each split is chosen to minimize the sum of the node impurities, where the impurity criterion is a parameter of the tree; popular choices are the Gini impurity and the entropy, which measures information gain. A feature's importance can then be estimated as the total impurity decrease contributed by all splits made on that feature across the tree, so features used in strong, well-separating splits come out as more important.
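A minimal sketch of inspecting impurity-based feature importances in scikit-learn, using the built-in Iris dataset purely as an example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(data.data, data.target)

# feature_importances_ sums the impurity decrease contributed by each feature.
for name, importance in sorted(
    zip(data.feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(f"{name}: {importance:.3f}")
```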
What is random forest?
Random Forest is a machine learning method for regression and classification which is composed of many decision trees. Random Forest belongs to a larger class of ML algorithms called ensemble methods (in other words, it involves the combination of several models to solve a single prediction problem).
Why do we need randomization in random forest?
Random forest is an extension of the bagging algorithm, which draws random samples from the training dataset (with replacement), trains a model on each sample, and averages the predictions. In addition, each time a split in a tree is considered, random forest takes a random subset of m features out of the full set of n features (without replacement) and uses only this subset as candidates for the split (for example, m = sqrt(n)).
Training decision trees on random data samples from the training dataset reduces variance. Sampling features for each split in a decision tree decorrelates trees.
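The sketch below illustrates both sources of randomness: bootstrap sampling of rows and per-split feature subsampling (delegated here to scikit-learn's max_features="sqrt"). It assumes a NumPy feature matrix X and integer class labels y; the helper names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_toy_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n_samples, _ = X.shape
    trees = []
    for _ in range(n_trees):
        # 1) Bagging: bootstrap sample of the rows (with replacement).
        rows = rng.integers(0, n_samples, size=n_samples)
        # 2) Feature subsampling: each split considers only sqrt(n) random features.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(1_000_000)),
        )
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees

def predict_toy_forest(trees, X):
    # Majority vote over the individual tree predictions (integer labels assumed).
    votes = np.stack([t.predict(X) for t in trees])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes
    )
```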
What are the main parameters of the random forest model?
max_depth: the maximum depth of a tree (longest path from the root node to a leaf)
min_samples_split: the minimum number of observations needed to split a given node
max_leaf_nodes: caps the number of leaf nodes and hence limits the growth of the trees
min_samples_leaf: the minimum number of samples required in a leaf node
n_estimators: the number of trees
max_samples: the fraction of the original dataset given to any individual tree
max_features: the maximum number of features considered when looking for the best split
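A hedged sketch wiring these parameters into scikit-learn's RandomForestClassifier (the names follow the sklearn API; the values are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_depth=None,        # grow trees until other limits are hit
    min_samples_split=2,   # minimum observations required to split a node
    min_samples_leaf=1,    # minimum samples allowed in a leaf
    max_leaf_nodes=None,   # cap on the number of leaves per tree
    max_samples=None,      # fraction/count of rows drawn per tree (requires bootstrap=True)
    max_features="sqrt",   # features considered at each split
    n_jobs=-1,
    random_state=42,
)
# rf.fit(X_train, y_train)  # X_train / y_train are assumed to exist
```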
How do we select the depth of the trees in random forest?
The greater the depth, the more information a tree extracts from the data. However, there is a limit: even though random forest is relatively robust to overfitting, very deep trees may start learning the noise present in the data and overfit to it. Hence, there is no hard rule for choosing the depth, but the literature suggests a few ways to tune it and prevent overfitting (see the sketch after this list):
limit the maximum depth of a tree
limit the number of test nodes
limit the minimum number of objects at a node required to split
do not split a node when, at least, one of the resulting subsample sizes is below a given threshold
stop developing a node if it does not sufficiently improve the fit.
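In practice, the depth (and related stopping parameters) is usually tuned by cross-validation. A minimal sketch, assuming X_train and y_train already exist and using an illustrative depth grid:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [3, 5, 10, 20, None]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_)
```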
How do we know how many trees we need in random forest?
The number of trees in a random forest is controlled by n_estimators. Adding more trees does not make a random forest overfit; it mainly reduces the variance of the predictions until the error flattens out, at the cost of extra computation. There is no fixed rule for choosing the number of trees; it is tuned on the data. One suggested starting point is the square of the number of features n, followed by tuning until the results stop improving.
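One common way to see where adding trees stops helping is to track the out-of-bag score as the forest grows. A minimal sketch, assuming X_train and y_train exist:

```python
from sklearn.ensemble import RandomForestClassifier

for n in [50, 100, 200, 400, 800]:
    rf = RandomForestClassifier(
        n_estimators=n,
        bootstrap=True,
        oob_score=True,   # evaluate each tree on the rows left out of its bootstrap sample
        random_state=0,
        n_jobs=-1,
    )
    rf.fit(X_train, y_train)
    print(n, rf.oob_score_)
```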