05 Decision Trees Flashcards
- Multistage property: info([2,3,4]) = info([2,7]) + 7/9 · info([3,4])
- Entropy function scales from 0 up to a maximum of log2(n) (for n classes, reached when all classes are equally likely).
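A minimal sketch (plain Python; the helper name `info` is my own) that computes info([...]) as the entropy of a class-count list and checks the multistage property numerically:

```python
from math import log2

def info(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Multistage property: split [2,3,4] into [2,7] first, then [3,4] inside the 7.
lhs = info([2, 3, 4])
rhs = info([2, 7]) + 7 / 9 * info([3, 4])
print(round(lhs, 4), round(rhs, 4))  # both ~1.5305
```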
- Cross-Entropy: Explain the function cross-entropy(p, q).
  cross-entropy(p, q) = −Σ_i p_i log(q_i), where p is the ground-truth distribution and q is the model's outcome distribution. The cross-entropy is not symmetric!
  Cross-entropy is often used as a loss function in neural networks for classification.
- What about KL-Divergence?
  KL divergence measures the difference between two distributions, cross-entropy the difference between the predictions of a model and the actual data. KL divergence = cross-entropy MINUS entropy = Σ_i p_i log(p_i / q_i).
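A small NumPy sketch (helper names are my own) computing both quantities and checking the relation KL(p‖q) = cross-entropy(p, q) − entropy(p):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    p = np.asarray(p, float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); p = ground truth, q = model output."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # ground-truth distribution
q = np.array([0.5, 0.3, 0.2])   # model's outcome distribution
print(cross_entropy(p, q), cross_entropy(q, p))   # not symmetric
print(cross_entropy(p, q) - entropy(p))           # KL as cross-entropy minus entropy
print(np.sum(p * np.log(p / q)))                  # direct KL formula; same value
```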
Split for ID Code Attribute → What's the issue and how to remedy?
Entropy of the split = 0, since each leaf node is "pure" (it contains only one case), so information gain is maximal for the ID code.
Subsets are more likely to be pure if there is a large number of values.
* Information gain is biased towards choosing attributes with a large number of values.
* This may result in overfitting (selection of an attribute that is non-optimal for prediction).
Use gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes.
Gain ratio takes the number and size of branches into account when choosing an attribute.
It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much information do we need to tell which branch an instance belongs to); see the code sketch after this card.
However:
* "ID code" still has the greater gain ratio (0.246).
* Standard fix: an ad hoc test to prevent splitting on that type of attribute.
Problem with gain ratio: it may overcompensate.
* It may choose an attribute just because its intrinsic information is very low.
* Standard fix:
  − First, only consider attributes with greater than average information gain.
  − Then, compare them on gain ratio.
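A minimal plain-Python sketch (helper names and toy data are mine) of gain ratio = information gain / intrinsic information, contrasted on an ID-like attribute and an ordinary nominal attribute:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Information gain corrected by the intrinsic information of the split."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    split_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - split_entropy
    intrinsic = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    return info_gain / intrinsic if intrinsic > 0 else 0.0

labels  = ["yes", "yes", "no", "no", "yes", "no"]
id_code = [1, 2, 3, 4, 5, 6]                  # unique value per instance
outlook = ["s", "s", "r", "r", "o", "o"]      # ordinary nominal attribute
print(gain_ratio(id_code, labels), gain_ratio(outlook, labels))
```

In this toy data the ID-like attribute gets the maximal information gain (1 bit), but its large intrinsic information (log2 6 bits) pulls its gain ratio below that of the ordinary attribute.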
Select the split that decreases the Gini Index most. This is done over all possible places for a split and all possible variables to split on.
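A short Python sketch (function names are mine) that scans every candidate threshold of one numeric variable and returns the split with the largest decrease in weighted Gini impurity:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Try all thresholds on one variable; return (threshold, Gini decrease)."""
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between identical values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        decrease = parent - (len(left) / n * gini(left) + len(right) / n * gini(right))
        if best is None or decrease > best[1]:
            best = (thr, decrease)
    return best

print(best_numeric_split([2.0, 3.0, 10.0, 12.0], ["a", "a", "b", "b"]))  # (6.5, 0.5)
```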
Time Complexity of Basic Algorithm
* Let m be the number of attributes and n the number of instances.
* Time complexity: O(m · n · log n)
Scalability of DT Algorithms
* Need to design for large amounts of data.
* Two things to worry about:
  − large number of attributes
    o leads to a large tree
    o takes a long time
  − large amounts of data
    o Can the data be kept in memory?
    o Some new algorithms do not require all the data to be memory resident.
For an algorithm to be useful in a wide range of real-world applications it must (×3):
- permit numeric attributes
- allow missing values
- be robust in the presence of noise
Binary Splits on Numeric Attributes: disadvantage and remedy
Disadvantage: tree is hard to read.
Remedy:
* pre-discretize numeric attributes, or
* allow for multi-way splits instead of binary ones using the Information Gain criterion
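A minimal sketch of the first remedy, equal-width pre-discretization (the helper and the temperature values are made up for illustration):

```python
def equal_width_bins(values, n_bins=3):
    """Pre-discretize a numeric attribute into n_bins equal-width intervals,
    returning a bin index (0 .. n_bins-1) per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

temperature = [64, 65, 68, 69, 70, 71, 72, 75, 80, 83, 85]
print(equal_width_bins(temperature))    # [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```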
Handling Missing Values / Training
Ignore instances with missing values.
* pretty harsh, and the missing value might not be important
Ignore attributes with missing values.
* again, may not be feasible
Treat missing value as another nominal value.
* fine if missing a value has a significant meaning
Estimate missing values.
* data imputation: regression, nearest neighbor, mean, mode, etc.
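A toy sketch of the simplest imputation strategies from the last bullet (mean for numeric, mode for nominal; the helper name is mine, and regression or nearest-neighbor imputation would replace the fill rule):

```python
from statistics import mean, mode

def impute_column(values):
    """Replace None entries with the column mean (numeric) or mode (nominal)."""
    observed = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

print(impute_column([64, 70, None, 72]))                 # None -> mean (~68.67)
print(impute_column(["sunny", None, "sunny", "rainy"]))  # None -> mode "sunny"
```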
Decision trees are a classification technique.
The output of decision trees can be used for descriptive as well as predictive purposes.
They can represent any function in the form of propositional logic. Heuristics such as information gain are used to select relevant attributes. Pruning is used to avoid overfitting.
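To illustrate the pruning remark, a short scikit-learn sketch (assuming scikit-learn is available; the dataset and the cost-complexity strength ccp_alpha=0.02 are arbitrary choices, not values from the course):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree vs. a cost-complexity-pruned tree.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print(unpruned.tree_.node_count, pruned.tree_.node_count)   # pruned tree is smaller
print(unpruned.score(X_test, y_test), pruned.score(X_test, y_test))
```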