05 Decision Trees Flashcards
- Multistage property: info([2,3,4]) = info([2,7]) + 7/9 · info([3,4])
- Entropy function scales from 0 up to a maximum of log2(n) (for n classes, reached when all classes are equally likely).
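A minimal sketch (plain Python; the helper name `info` is my own) that computes info([...]) as the entropy of a class-count list and checks the multistage property numerically:

```python
from math import log2

def info(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Multistage property: split [2,3,4] into [2,7] first, then [3,4] inside the 7.
lhs = info([2, 3, 4])
rhs = info([2, 7]) + 7 / 9 * info([3, 4])
print(round(lhs, 4), round(rhs, 4))  # both ~1.5305
```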
- Cross-Entropy: Explain the function cross-entropy(p, q).
  cross-entropy(p, q) = −Σ_i p_i log(q_i), where p is the ground-truth distribution and q is the model's outcome distribution. The cross-entropy is not symmetric!
  Cross-entropy is often used as a loss function in neural networks for classification.
- What about KL-Divergence?
  KL divergence measures the difference between two distributions, cross-entropy the difference between the predictions of a model and the actual data. KL divergence = cross-entropy MINUS entropy = Σ_i p_i log(p_i / q_i).
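A small NumPy sketch (helper names are my own) computing both quantities and checking the relation KL(p‖q) = cross-entropy(p, q) − entropy(p):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    p = np.asarray(p, float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); p = ground truth, q = model output."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # ground-truth distribution
q = np.array([0.5, 0.3, 0.2])   # model's outcome distribution
print(cross_entropy(p, q), cross_entropy(q, p))   # not symmetric
print(cross_entropy(p, q) - entropy(p))           # KL as cross-entropy minus entropy
print(np.sum(p * np.log(p / q)))                  # direct KL formula; same value
```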
Split for ID Code Attribute → What's the issue and how to remedy?
Entropy of the split = 0, since each leaf node is "pure" (it contains only one case), so information gain is maximal for the ID code.
Subsets are more likely to be pure if there is a large number of values.
* Information gain is biased towards choosing attributes with a large number of values.
* This may result in overfitting (selection of an attribute that is non-optimal for prediction).
Use gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes.
Gain ratio takes the number and size of branches into account when choosing an attribute.
It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much information do we need to tell which branch an instance belongs to); see the code sketch after this card.
However:
* "ID code" still has the greater gain ratio (0.246).
* Standard fix: an ad hoc test to prevent splitting on that type of attribute.
Problem with gain ratio: it may overcompensate.
* It may choose an attribute just because its intrinsic information is very low.
* Standard fix:
  − First, only consider attributes with greater than average information gain.
  − Then, compare them on gain ratio.
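A minimal plain-Python sketch (helper names and toy data are mine) of gain ratio = information gain / intrinsic information, contrasted on an ID-like attribute and an ordinary nominal attribute:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Information gain corrected by the intrinsic information of the split."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    split_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - split_entropy
    intrinsic = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    return info_gain / intrinsic if intrinsic > 0 else 0.0

labels  = ["yes", "yes", "no", "no", "yes", "no"]
id_code = [1, 2, 3, 4, 5, 6]                  # unique value per instance
outlook = ["s", "s", "r", "r", "o", "o"]      # ordinary nominal attribute
print(gain_ratio(id_code, labels), gain_ratio(outlook, labels))
```

In this toy data the ID-like attribute gets the maximal information gain (1 bit), but its large intrinsic information (log2 6 bits) pulls its gain ratio below that of the ordinary attribute.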
Select the split that decreases the Gini Index most. This is done over all possible places for a split and all possible variables to split on.
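A short Python sketch (function names are mine) that scans every candidate threshold of one numeric variable and returns the split with the largest decrease in weighted Gini impurity:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Try all thresholds on one variable; return (threshold, Gini decrease)."""
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between identical values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        decrease = parent - (len(left) / n * gini(left) + len(right) / n * gini(right))
        if best is None or decrease > best[1]:
            best = (thr, decrease)
    return best

print(best_numeric_split([2.0, 3.0, 10.0, 12.0], ["a", "a", "b", "b"]))  # (6.5, 0.5)
```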
Time Complexity of Basic Algorithm
* Let m be the number of attributes and n the number of instances.
* Time complexity: O(m · n · log n)
Scalability of DT Algorithms
* Need to design for large amounts of data.
* Two things to worry about:
  − large number of attributes
    o leads to a large tree
    o takes a long time
  − large amounts of data
    o Can the data be kept in memory?
    o Some new algorithms do not require all the data to be memory resident.
For an algorithm to be useful in a wide range of real-world applications it must (×3):
- permit numeric attributes
- allow missing values
- be robust in the presence of noise
Binary Splits on Numeric Attributes: disadvantage and remedy
Disadvantage: tree is hard to read.
Remedy:
* pre-discretize numeric attributes, or
* allow for multi-way splits instead of binary ones using the Information Gain criterion
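A minimal sketch of the first remedy, equal-width pre-discretization (the helper and the temperature values are made up for illustration):

```python
def equal_width_bins(values, n_bins=3):
    """Pre-discretize a numeric attribute into n_bins equal-width intervals,
    returning a bin index (0 .. n_bins-1) per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

temperature = [64, 65, 68, 69, 70, 71, 72, 75, 80, 83, 85]
print(equal_width_bins(temperature))    # [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```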
Handling Missing Values / Training
Ignore instances with missing values.
* pretty harsh, and the missing value might not be important
Ignore attributes with missing values.
* again, may not be feasible
Treat missing value as another nominal value.
* fine if missing a value has a significant meaning
Estimate missing values.
* data imputation: regression, nearest neighbor, mean, mode, etc.
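A toy sketch of the simplest imputation strategies from the last bullet (mean for numeric, mode for nominal; the helper name is mine, and regression or nearest-neighbor imputation would replace the fill rule):

```python
from statistics import mean, mode

def impute_column(values):
    """Replace None entries with the column mean (numeric) or mode (nominal)."""
    observed = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

print(impute_column([64, 70, None, 72]))                 # None -> mean (~68.67)
print(impute_column(["sunny", None, "sunny", "rainy"]))  # None -> mode "sunny"
```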
Decision trees are a classification technique.
The output of decision trees can be used for descriptive as well as predictive purposes.
They can represent any function in the form of propositional logic. Heuristics such as information gain are used to select relevant attributes. Pruning is used to avoid overfitting.
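To illustrate the pruning remark, a short scikit-learn sketch (assuming scikit-learn is available; the dataset and the cost-complexity strength ccp_alpha=0.02 are arbitrary choices, not values from the course):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree vs. a cost-complexity-pruned tree.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print(unpruned.tree_.node_count, pruned.tree_.node_count)   # pruned tree is smaller
print(unpruned.score(X_test, y_test), pruned.score(X_test, y_test))
```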