Classification - Part 1 Flashcards

1
Q

What is the Goal of Classification

A

Previously unseen records should be assigned a class from a given set of classes as accurately as possible

2
Q

Explain the Approach of Classification

A
  • You have a training set consisting of
    • attributes
    • a class attribute (label) with positive and negative examples
  • The class attribute should be predicted based on the other attributes
  • Learn a model as a function of the values of the other attributes to predict the class attribute
3
Q

Which variants of Classification exist

A
  • Binary classification (fraud / no fraud)
  • Multi-class classification (low, medium, high)
  • Multi-label classification (more than one class per record e.g. user interests)
4
Q

What are the steps in the Model Learning and Model Application Process?

A

Training Set -> Induction (Learning Algorithm) -> Learn Model -> Model

Model + Unseen Records -> Apply Model (Deduction) -> Classified Records

5
Q

Give some classification examples

A
  • Credit Risk Assessment
    • Attributes: age, income, debts
    • Class: Credit (yes/no)
  • Marketing:
    • Attributes: previously bought products
    • Class: Target Customer (yes/no)
  • SPAM detection:
    • Attributes: words and header fields of e-mail
    • Class: Spam (yes/no)
6
Q

List seven classification techniques

A
  • K-nearest-neighbours
  • Decision Trees
  • Rule Learning
  • Naive Bayes
  • Support Vector Machines
  • Artificial Neural Networks
  • Deep Neural Networks
7
Q

Explain the approach of K-nearest-neighbor

A

It requires:

  • A set of stored records
  • Distance measure
  • k value (number of nearest neighbors to consider)

Approach:
For each unknown record:

  1. Compute the distance to each training record
  2. Identify the k nearest neighbors
  3. Use the class labels of the nearest neighbors to determine the class of the unknown record
    - majority vote, or
    - vote weighted according to distance (see the sketch below)

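A minimal sketch of this approach in Python; the arrays X_train and y_train and the choice of Euclidean distance are assumptions for illustration, not part of the card:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Classify one unknown record x by a majority vote of its k nearest neighbors."""
    # 1. Compute the (Euclidean) distance to each training record
    distances = np.linalg.norm(X_train - x, axis=1)
    # 2. Identify the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote over the class labels of the nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```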
8
Q

What is a k-nearest neighbor ?

A

For an unknown record x, the k-nearest neighbors are the k data points with the smallest distance to x

9
Q

How to choose a good value for K?

A

Rule of thumb: test k values between 1 and 20 (see the sketch below)

If k is too small: the result is sensitive to noise points
If k is too large: the neighborhood may include points from other classes

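A hedged sketch of such a test using scikit-learn cross-validation; the data set X, y and the 5-fold setting are assumptions:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X, y: feature matrix and class labels (assumed to be available)
# Try k = 1..20 and keep the value with the best cross-validated accuracy
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 21)
}
best_k = max(scores, key=scores.get)
```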
10
Q

What are the pros and cons of k-nearest-neighbor classification

A

+ Often very accurate (e.g. for optical character recognition)
- But slow (unseen records are compared to all training examples)

Neutral aspects:

  • Results depend on choosing a good proximity measure
  • KNN can handle decision boundaries which are not parallel to the axes
11
Q

Describe Lazy Learning

A
  • Instance-based learning approach
  • No explicit model is learned
  • -> Less time to learn, but a long time to classify

Single Goal: Classify unseen records as accurately as possible

Example: KNN

12
Q

Describe Eager Learning

A
  • Eager learning generates models that might be interpretable by humans
  • -> Longer time for learning, but less time to classify

Compared to lazy learning, eager learning has two goals:

  • classify unseen records
  • help humans understand the application domain

Examples: decision tree learning, rule learning

13
Q

What are the components of a decision tree classifier

A
  • Root of tree
  • Attribute tests (splitting attributes)
  • Leaf nodes (decision - no tests / splits)
14
Q

Why are the decision boundaries of a decision tree parallel to the axes?

A

Because each test condition involves a single attribute at a time

-> It is a step-by-step division of the space into areas whose borders are parallel to the axes

15
Q

How to learn a decision tree from training data?

A
  1. Training data with attributes and class attribute
  2. Apply Learning Algorithm
  3. Decision Tree Model
  • Finding an optimal decision tree is NP-hard
  • Algorithms use greedy, top-down, recursive partitioning to induce a reasonable solution

Example Algorithms

  • Hunt's Algorithm
  • ID3
  • CHAID
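None of these exact algorithms ships with scikit-learn, but the same greedy, top-down, recursive partitioning can be sketched with its CART-style learner; X_train and y_train are assumed to be the training attributes and class labels:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Greedy, top-down, recursive partitioning (CART-style implementation)
tree = DecisionTreeClassifier(criterion="gini")
tree.fit(X_train, y_train)   # 1.-2. learn the model from the training data
print(export_text(tree))     # 3. inspect the resulting decision tree model
```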
16
Q

Explain Hunt’s algorithm (decision tree)

A
  • You have a training set and need to generate either leaf nodes or attribute tests
    Let Dt be the set of training records that reach a node t
  1. If Dt only contains records of the same class, then t is a leaf node.
  2. If Dt contains records of more than one class, use an attribute test to split the data into subsets with higher purity.

Recursively apply the procedure to each subset

17
Q

What are the design issues for learning decision trees?

A
  1. How should training records be split?
    How to specify attribute test condition
    - Depends on number of splits: Binary, Multi-way split
    - Depends on attribute data type: nominal, ordinal, continuous
    How to determine the best split
    - Can use different purity measures
  2. When should the splitting procedure stop?
    - Shallow trees might generalize better to unseen records
    - Fully grown trees might overfit training data
18
Q

How can you split nominal attributes in decision trees

A
  • Multi-way split: use as many partitions as distinct values
  • Binary split: divides the values into two subsets

19
Q

How can you split ordinal attributes in decision trees

A
  • Multi-way split: use as many partitions as distinct values
  • Binary split: divides the values into two subsets while keeping the order

20
Q

How can you split continuous attributes in decision trees

A
  • Discretization: form an ordinal categorical attribute
    • equal-interval binning
    • equal-frequency binning
    • binning with user-provided boundaries
  • Binary decision: A > v or A <= v
    • often sufficient in practice
    • find the best splitting border v based on a purity measure
    • computationally intensive
21
Q

Explain equal-interval binning (decision tree splitting based on continuous attributes)

For values of the attribute age of a person:
0, 4, 12, 16, 16, 18, 24, 26, 28

A

Specify a bin width, e.g. 10:
Bin 1: [-,10)
Bin 2: [10,20)
Bin 3: [20,+)

22
Q

Explain equal-frequency binning (decision tree splitting based on continuous attributes)

For values of the attribute age of a person:
0, 4, 12, 16, 16, 18, 24, 26, 28

A

Specify a bin frequency (number of values per bin), e.g. 3:

Bin 1: [-,14)
Bin 2: [14,21)
Bin 3: [21,+)

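A small sketch of both binning variants using pandas; the bin boundaries passed to pd.cut mirror the card, while pd.qcut picks its own quantile-based borders:

```python
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-interval binning: fixed bin width of 10
equal_width = pd.cut(ages, bins=[0, 10, 20, 30], right=False)

# Equal-frequency binning: three bins with roughly the same number of values
equal_freq = pd.qcut(ages, q=3)
```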
23
Q

How to find the best attribute split for decision trees?

A

Greedy approach: test all splits and choose the one with the most homogeneous (pure) nodes (i.e. the one with the lowest impurity after splitting)

Requires: a measure of node impurity

Common measures:

  • GINI Index
  • Entropy
24
Q

Which bucket has the higher degree of impurity?

Bucket 1: C0: 5, C1: 5
Bucket 2: C0: 9, C1: 1

A

The left one (Bucket 1): its records are evenly split between the two classes, which is maximally impure

25
Q

How is the purity gain defined?

A

Gain = P - M

P = impurity measure before splitting
M = impurity measure after splitting (weighted over the resulting partitions)
26
Q

How is the GINI Index calculated?

A

GINI(t) = 1 - Σ_j [p(j|t)]², i.e. 1 minus the sum of the squared relative frequencies p(j|t) of each class j at node t

27
Q

How to interpret the GINI Index

A

Minimum: 0.0 (all records belong to one class)
Maximum: 1 - (1/nc) (records are equally distributed among all classes)
nc = number of classes

For two classes: 1 - 1/2 = 1/2

Gini increases if the buckets are less pure

28
Q

What is the difference between the GINI index and the GINI split

A

GINI Index: measure for the impurity of a single node

GINI Split: measures the quality of the overall split; the GINI Index of each partition is weighted according to its size:
GINI_split = Σ_i (n_i / n) · GINI(i), where n_i is the number of records in partition i and n the total number of records

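A hedged Python sketch of both computations; the class counts in the example calls are made up for illustration:

```python
import numpy as np

def gini(counts):
    """GINI index of one node: 1 - sum of squared relative class frequencies."""
    p = np.asarray(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def gini_split(partitions):
    """GINI split: GINI index of each partition, weighted by its relative size."""
    total = sum(sum(part) for part in partitions)
    return sum(sum(part) / total * gini(part) for part in partitions)

gini([5, 5])                   # maximally impure two-class node (0.5)
gini_split([[5, 1], [0, 4]])   # weighted impurity of the whole split
```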
29
Q

How to compute the GINI Index for categorical attributes

A

For each distinct attribute value, count the records of each class (count matrix); from these counts the GINI index of each partition and the weighted GINI split can be computed

30
Q

How to find the best binary split for continuous attributes by using the GINI index

A
  1. Sort the records on the attribute values
  2. Linearly scan the values, updating the count matrix and computing the GINI index at each candidate split position
  3. Choose the split position with the smallest GINI index
31
Q

What is Entropy?

A

An alternative impurity measure.

Entropy measures the homogeneity of a node:
Entropy(t) = - Σ_j p(j|t) · log2 p(j|t)

Minimum: 0.0 when all records belong to one class
Maximum: log2(nc) when records are equally distributed among all classes (nc = number of classes)
32
Q

How to split nodes based on information gain (with entropy)?

A
  • Information gain measures the entropy reduction achieved by a split
  • Choose the split with the largest reduction (maximal GAIN -> the partitioning with the smallest weighted entropy)

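A corresponding sketch for entropy and information gain, using the same class-count representation as the GINI sketch above; the counts are illustrative:

```python
import numpy as np

def entropy(counts):
    """Entropy of one node: -sum over the classes of p * log2(p)."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]                     # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def information_gain(parent, partitions):
    """Entropy reduction achieved by splitting the parent node into partitions."""
    total = sum(sum(part) for part in partitions)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - weighted

information_gain([10, 10], [[9, 1], [1, 9]])   # large gain -> good split
```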
33
Q

What is the disadvantage of splitting based on information gain (entropy)

A

It tends to prefer splits that result in a large number of partitions; each partition is small but pure

e.g. splitting on an ID attribute

34
Q

What is GainRATIO

A
  • Designed to overcome the tendency of information gain to generate a large number of small partitions
  • GainRatio adjusts the information gain by the entropy of the partitioning (SplitINFO):
    GainRatio = GAIN_split / SplitINFO, with SplitINFO = - Σ_i (n_i/n) · log2(n_i/n)

This penalizes a large number of small partitions (they have a higher entropy of the partitioning)

35
Q

What is the difference between GINI Index and Entropy

A

The GINI Index measures node impurity directly.

Entropy measures the homogeneity of a node and is used to compute the information gain of a split.

36
Q

What does overfitting mean

A

A model learns patterns that are specific to the training data set; thus the model performs poorly on unseen (test) data.

Goal: find a compromise between a specific and a general model

37
Q

What are symptoms and causes of an overfitted tree?

A

Symptoms:

  • Decision tree is too deep
  • Too many branches
  • Model works well on the training set but performs badly on the test set

Causes:

  1. Too little training data
  2. Noise / outliers in training data
  3. High model complexity

An overfitted model does not generalize well to unseen data.

38
Q

Describe the characteristics of underfitting and overfitting.

A

Underfitting: the model is too simple; both training and test errors are large
Overfitting: the model is too complex; the training error is small but the test error is large

39
Q

How can you prevent Overfitting?

A
  • Use more training data
  • Pre-Pruning
  • Post-Pruning
  • Ensembles
40
Q

How can you prevent overfitting with more training data?

A
  • If the training data is under-representative, the testing error increases with an increasing number of nodes
  • With more training data, the difference between training and testing errors at a given number of nodes can be reduced

This is an expensive prevention method

41
Q

How can you prevent overfitting with pre-pruning?

A

Stop the algorithm before the tree becomes fully grown, because shallower trees potentially generalize better (Occam's razor)

Early stopping conditions (= pre-pruning):

  • Number of instances in a leaf node below a threshold
  • Impurity improvement below a threshold
42
Q

How can you prevent overfitting with post-pruning?

A
  1. Grow the decision tree fully
  2. Trim the nodes of the decision tree bottom-up
  3. Estimate the generalization error before and after trimming (using test data)
  4. If the generalization error improves after trimming, replace the subtree (by a leaf node or its most frequently used branch)
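How pruning is configured depends on the library; a hedged sketch with scikit-learn's DecisionTreeClassifier, where the threshold values and the training data X_train, y_train are assumptions (ccp_alpha enables cost-complexity pruning, one specific post-pruning strategy):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: early stopping conditions while growing the tree
pre_pruned = DecisionTreeClassifier(
    min_samples_leaf=10,          # instances in a leaf node below threshold
    min_impurity_decrease=0.01,   # impurity improvement below a threshold
)

# Post-pruning: grow fully, then trim bottom-up (cost-complexity pruning)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02)

pre_pruned.fit(X_train, y_train)    # X_train, y_train: training data (assumed)
post_pruned.fit(X_train, y_train)
```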
43
Q

How to prevent overfitting with ensembles

A
  • Learn different models
  • Each model votes on the final classification decision

Idea: Wisdom of the crowds

  • A single classifier might focus too much on one aspect
  • Multiple classifiers can focus on different aspects
44
Q

What are random forests?

A

Ensemble consisting of a large number of different decision trees. They usually outperform single decision trees.

45
Q

How can you achieve independence of trees at a random forest

A

You need to use randomness in the learning process (see the sketch below):

  • Randomly use different attributes of the training set for each tree
  • Randomly sample the training records with replacement for each tree (bagging)
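Both sources of randomness are what typical random forest implementations provide; a sketch with scikit-learn, where X_train, y_train and X_test are assumed:

```python
from sklearn.ensemble import RandomForestClassifier

# bootstrap=True: each tree sees a random sample of the records (bagging)
# max_features="sqrt": each split considers a random subset of the attributes
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)
```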
46
Q

What are the advantages of decision trees?

A

Advantages:

  • Inexpensive to construct
  • Very fast at classifying unseen records
  • Easy to interpret for small trees
  • Can handle redundant or irrelevant attributes (feature selection is built into the learning process)
  • Good accuracy for low-dimensional data sets (less so for texts and images)
47
Q

What are the disadvantages of decision trees?

A

Disadvantages:

  • Space of possible decision trees is exponentially large (Greedy approaches are often unable to find the best tree)
  • Trees don’t take into account interactions between attributes