Decision Trees Flashcards
What is a decision tree?
It is a tree-structured classifier that sorts examples by testing a sequence of parameters (features); the choice of parameters and the order in which they are tested define the tree.
What is the idea of Entropy of information?
The information content of a message is inversely related to its probability: the less likely a message is, the more information it carries.
What are the two equations for the entropy of information?
Information: I(x) = −log₂ P(x)
Entropy: H = E[I] = −∑ᵢ pᵢ log₂ pᵢ
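A minimal Python sketch of both equations (the helper function and the example probabilities are illustrative, not from the source):

```python
import math

def entropy(probs):
    # H = E[I] = -sum_i p_i * log2(p_i); terms with p = 0 contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# I(x) = -log2 P(x): the rarer the message, the more bits it carries
print(-math.log2(0.5))        # 1.0 bit for a fair coin flip
print(entropy([0.5, 0.5]))    # 1.0   -> maximal uncertainty
print(entropy([0.9, 0.1]))    # ~0.47 -> a biased source is more predictable
```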
What is the equation for total entropy?
It is the sum of the entropy contributions of each class: H = −∑ᵢ pᵢ log₂ pᵢ, where the sum runs over all classes i.
How can the entropy of information be applied to compression algorithms?
Entropy gives the theoretical lower bound on the average number of bits per symbol that a lossless compression scheme can achieve. In the binary case (two classes), the probabilities of 1 and 0 are related by p0 + p1 = 1, so the bound depends on a single probability.
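A sketch of that link (the sample message is made up for illustration):

```python
import math

def bound_bits_per_symbol(p0):
    # Entropy of a binary source with P(0) = p0 and P(1) = 1 - p0:
    # the lower bound any lossless code can reach, in bits/symbol.
    return -sum(p * math.log2(p) for p in (p0, 1 - p0) if p > 0)

msg = "0001000100000001000010000000"   # a made-up, zero-heavy message
p0 = msg.count("0") / len(msg)
print(f"p0 = {p0:.2f}, bound = {bound_bits_per_symbol(p0):.3f} bits/symbol")
# Well under 1 bit/symbol, so the message is highly compressible.
```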
How do you decide what parameters to use and in what order?
We pick the parameter whose test produces the greatest change (reduction) in entropy, and test it first.
What is the equation for information gain?
G(S,F) = H(S) - ∑f∈values(F) |Sf| / |S| × H(Sf)
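A small sketch of G(S, F) in Python; the (features, label) pair representation and the toy weather data are assumptions for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    # G(S, F) = H(S) - sum over f in values(F) of |Sf|/|S| * H(Sf)
    S = [label for _, label in examples]
    g = entropy(S)
    for f in {ex[feature] for ex, _ in examples}:
        Sf = [label for ex, label in examples if ex[feature] == f]
        g -= len(Sf) / len(S) * entropy(Sf)
    return g

# Toy data: (features, label) pairs -- an assumed representation.
data = [({"wind": "weak"}, "yes"), ({"wind": "strong"}, "no"),
        ({"wind": "weak"}, "yes"), ({"wind": "strong"}, "yes")]
print(information_gain(data, "wind"))   # ~0.311
```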
What is the ID3 philosophy?
Choose the split that yields the greatest information gain.
What is the ID3 algorithm?
IF: all examples have the same label => return a leaf with that label
ELSE IF: there are no features left to test => return a leaf with the most common label
ELSE: Choose the feature F that maximises the information gain on S to be the next node. Add a branch to the node for each possible value f in F.
For each branch:
> Calculate Sf, the subset of examples for which F takes the value f, and remove F from the set of features
> Recursively call the algorithm on Sf, so gains are computed relative to the current subset of examples
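Putting the three cases together, here is a hedged sketch of ID3 in Python (the (feature_dict, label) representation is an assumption, not part of the source):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, feature):
    S = [label for _, label in examples]
    g = entropy(S)
    for f in {ex[feature] for ex, _ in examples}:
        Sf = [label for ex, label in examples if ex[feature] == f]
        g -= len(Sf) / len(S) * entropy(Sf)
    return g

def id3(examples, features):
    """examples: list of (feature_dict, label); features: set of names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                  # all examples share a label
        return labels[0]
    if not features:                           # no features left to test
        return Counter(labels).most_common(1)[0][0]
    F = max(features, key=lambda feat: gain(examples, feat))
    tree = {F: {}}
    for f in {ex[F] for ex, _ in examples}:    # one branch per value f of F
        Sf = [(ex, label) for ex, label in examples if ex[F] == f]
        tree[F][f] = id3(Sf, features - {F})   # recurse without F
    return tree

data = [({"outlook": "sunny", "wind": "weak"}, "no"),
        ({"outlook": "sunny", "wind": "strong"}, "no"),
        ({"outlook": "rain", "wind": "weak"}, "yes")]
print(id3(data, {"outlook", "wind"}))
# e.g. {'outlook': {'sunny': 'no', 'rain': 'yes'}}
```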
What are the characteristics of the ID3 Algorithm?
> Greedy with respect to G: it always takes the split with the greatest information gain, which can leave it stuck at a locally optimal tree
> Deals with noisy data by assigning a leaf the label of its most common class
> Uses all available features, so it is prone to overfitting
- Uses pruning to reduce overfitting
- Can be extended to handle continuous variables
- Can deal with missing attributes
What is the CART algorithm?
It follows the same procedure as ID3, but uses the Gini impurity instead of entropy to choose splits.
What is the equation for the Gini impurity?
G(S) = ∑ᵢ pᵢ(1 − pᵢ), where the sum runs over the C classes
What is the equation for the Gini split?
G(S,F) = G(S) - ∑f∈values(F) |Sf| / |S| × G(Sf)
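Both Gini equations in one short sketch, mirroring the information-gain helper above (same assumed data representation):

```python
from collections import Counter

def gini(labels):
    # G(S) = sum_i p_i * (1 - p_i)  (equivalently 1 - sum_i p_i^2)
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def gini_split(examples, feature):
    # G(S, F) = G(S) - sum over f in values(F) of |Sf|/|S| * G(Sf)
    S = [label for _, label in examples]
    drop = gini(S)
    for f in {ex[feature] for ex, _ in examples}:
        Sf = [label for ex, label in examples if ex[feature] == f]
        drop -= len(Sf) / len(S) * gini(Sf)
    return drop

data = [({"wind": "weak"}, "yes"), ({"wind": "strong"}, "no"),
        ({"wind": "weak"}, "yes"), ({"wind": "strong"}, "yes")]
print(gini_split(data, "wind"))   # ~0.125
```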
What is the issue with ID3 and Gini?
Both algorithms are greedy, so they can get stuck at a locally optimal tree. It is therefore important to produce several different decision trees and combine them.
What is a random forest?
It is an ensemble of a large number of trees (often thousands), each grown on a different random subset of the data; the collection is called a forest.
Each tree uses a random fraction of the features and a random fraction of the training data points, and the trees' predictions are combined (e.g. by majority vote).
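A sketch using scikit-learn's RandomForestClassifier (assuming scikit-learn is available; the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample of the rows and considers
# a random subset of the features (sqrt of the total here) at each split.
forest = RandomForestClassifier(n_estimators=1000, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the majority vote
```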