lecture 9: decision trees and random forests Flashcards
what can decision trees be compared to visually?
a flowchart: data flows from the root through a sequence of decisions to a leaf
how do we make predictions using decision trees more accurate?
increase the depth of the tree (this improves training accuracy, though very deep trees can overfit)
what are the 3 types of nodes?
root node, decision node, and terminal node (leaf)
what does the depth number represent?
the number of decisions needed to go from the root node to the nodes at that depth
what is classification tree learning?
the construction of a classification tree given training data
we want to obtain the least complex tree possible while keeping training error low; how do we do so?
use a greedy algorithm that picks the best split at each node; it is not guaranteed to find the best tree (see the split-search sketch after the gini formulas below)
what can we aim for so that the classification tree has low training error?
node purity
what are the three node impurity measures?
gini, entropy and misclassification rate
how do we calculate the gini impurity in each node?
let pᵢ = the fraction of class i data samples in the node
let k = number of classes
Qᵤ = gini impurity in node u = 1 − ∑[i=1 to k] pᵢ²
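a minimal sketch of this formula in python (the function name gini_node and the toy labels are mine, not from the lecture):

```python
from collections import Counter

def gini_node(labels):
    """Gini impurity Q_u = 1 - sum_i p_i^2 for one node's labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# a pure node has impurity 0; a 50/50 binary node has impurity 0.5
print(gini_node(["a", "a", "a", "a"]))  # 0.0
print(gini_node(["a", "a", "b", "b"]))  # 0.5
```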
how do we calculate the overall gini impurity at a given depth n?
for the j nodes at depth n:
overall gini impurity = ∑[u=1 to j] (fraction of data samples in node u) × Qᵤ
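a sketch combining this weighted impurity with the greedy split search mentioned earlier: for a single feature, try every threshold and keep the split whose weighted gini over the two child nodes is lowest. all names (weighted_gini, best_split) and the toy data are illustrative assumptions:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(groups):
    """Overall impurity = sum over nodes of (fraction of samples) * Q_u."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups if g)

def best_split(xs, ys):
    """Greedy search: try each threshold on one feature, keep the split
    with the lowest weighted gini of the two child nodes."""
    best_t, best_q = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        q = weighted_gini([left, right])
        if q < best_q:
            best_t, best_q = t, q
    return best_t, best_q

xs = [1.0, 2.0, 3.0, 4.0]
ys = ["a", "a", "b", "b"]
print(best_split(xs, ys))  # (2.0, 0.0): splitting at x <= 2 separates the classes
```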
what is the depth at the root node?
0
what are the disadvantages of decision trees?
trees can become overly complex, leading to overfitting
trees can be unstable: small changes in the training data can produce very different trees
what are the methods to reduce overfitting? (see the sketch after this list)
- set a maximum depth for the tree
- set a minimum number of samples per leaf node
- set a minimum decrease in impurity for each split
- split by looking at a random subset of features instead of all the features
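a sketch of these four controls as scikit-learn parameters (the lecture does not name a library, so sklearn and the specific values here are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# each keyword maps to one of the overfitting controls above
clf = DecisionTreeClassifier(
    max_depth=3,                 # cap tree depth
    min_samples_leaf=5,          # minimum samples in a leaf node
    min_impurity_decrease=0.01,  # require each split to reduce impurity
    max_features="sqrt",         # consider a random subset of features per split
)
clf.fit(X, y)
print(clf.get_depth(), clf.score(X, y))
```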
how can we use decision trees for regression problems?
instead of minimising an impurity measure, seek to minimise a loss function such as MSE (see the regression sketch below)
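a minimal regression-tree sketch, again assuming scikit-learn; criterion="squared_error" makes each split minimise MSE (the noisy sine data is illustrative, not from the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# noisy samples of a sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# "squared_error" is the MSE criterion name in sklearn >= 1.0
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
reg.fit(X, y)
print(reg.predict([[1.5], [4.5]]))  # piecewise-constant predictions
```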
how do we reduce instability? (see the sketch after this list)
- average predictions over several trees
- use a random forest
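a sketch of both remedies, assuming scikit-learn: a random forest trains each tree on a bootstrap sample, splits on random feature subsets, and aggregates the trees' predictions by majority vote:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 trees, each fit on a bootstrap sample and splitting on a
# random subset of features; predictions are averaged by vote
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```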