Data Mining, Classification Flashcards
Approaches to classification
- Decision trees
- Bayesian classification
- Classification rules
- Neural networks
- k-nearest neighbours
- SVM
Classification Requirements
- Accuracy
- Interpretability
- Scalability
- Noise and outlier management
Definition of classification
Given a collection of class labels and a collection of data objects labelled with a class, find a descriptive model that makes it possible to distinguish objects of one class from objects of another.
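A minimal sketch of this task in Python using scikit-learn; the library, dataset, and parameter choices are illustrative assumptions, not part of the card.

```python
# Learn a descriptive model from labelled objects and use it to distinguish classes.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                         # labelled data objects
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(max_depth=3)               # the descriptive model
model.fit(X_train, y_train)                               # learn from labelled objects
print(accuracy_score(y_test, model.predict(X_test)))      # how well it separates the classes
```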
Criteria to evaluate classification techniques
- Accuracy
- Efficiency
- Scalability
- Robustness
- Interpretability
Decision Trees, Hunt’s Algorithm
Dt -> set of training records that reach a node t.
Steps:
- If Dt contains records that belong to more than one class -> select the best attribute A on which to split Dt, label node t as A, split Dt into smaller subsets and recursively apply the procedure to each subset (sketched below).
- If Dt contains records that belong to the same class yt then t is a leaf node labeled as yt
- if Dt is empty then t is a leaf node labelled as the default (most common) class.
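A minimal Python sketch of these steps, assuming the records sit in a pandas DataFrame with a "label" column; best_split is a hypothetical helper that picks the splitting attribute (e.g. by an impurity measure).

```python
import pandas as pd

def hunt(Dt, attributes, default_class):
    if len(Dt) == 0:                        # empty partition -> leaf with default (most common) class
        return default_class
    if Dt["label"].nunique() == 1:          # all records in the same class -> leaf labelled with it
        return Dt["label"].iloc[0]
    if not attributes:                      # no attribute left to test -> majority-class leaf
        return Dt["label"].mode()[0]
    A = best_split(Dt, attributes)          # hypothetical helper: choose the best attribute
    default = Dt["label"].mode()[0]
    node = {"attribute": A, "children": {}}
    for value, subset in Dt.groupby(A):     # split Dt into smaller subsets, recurse on each
        node["children"][value] = hunt(
            subset, [a for a in attributes if a != A], default)
    return node
```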
Decision tree induction
Adopts a greedy strategy -> not global optimum
Issues:
- structure of test conditions (binary splits vs multi-way splits):
- depends on attribute type: nominal, ordinal, continuous
- selection of the best attribute for the split
- stopping condition for the algorithm
Selection of best attribute in decision tree
Attributes that produce nodes with a homogeneous class distribution are preferred, i.e. the degree of impurity must be low.
Node impurity needs a measure:
- gini index
- entropy
- misclassification error
Different algorithms rely on different measures of impurity.
GINI index
- Used to compute node impurity in decision trees
- Maximum (1-1/nc) when records are equally distributed among all classes.
- Minimum (0) when all records belong to one class.
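A minimal sketch, assuming the standard definition GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) is the relative frequency of class j at node t; the function name is illustrative.

```python
def gini(class_counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

print(gini([5, 5]))   # 0.5 -> maximum 1 - 1/nc for nc = 2 equally distributed classes
print(gini([10, 0]))  # 0.0 -> minimum, all records belong to one class
```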
Computing quality of a split using GINI
This technique is used in CART, SLIQ, SPRINT
When a node p is split into k partitions (children), the quality of the split is computed as GINI_split = sum over children i of (n_i / n) * GINI(i), where n_i is the number of records at child i and n is the number of records at node p.
Categorical attribute:
- For each distinct value, gather the counts for each class in the dataset
- Use the count matrix to make decisions
Continuous attribute:
- Binary decision on one splitting value
- Number of possible splits = number of distinct values
- Each splitting value has a count matrix
- Sort the attribute on its values
- Linearly scan the sorted values, each time updating the count matrix and computing the Gini index
- Choose the split position with the lowest Gini index (sketched below)
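A minimal sketch of this scan in Python, using the split quality GINI_split = sum_i (n_i / n) * GINI(i) from above; the function names are illustrative.

```python
from collections import Counter

def gini(counts):
    """Gini index from a mapping class -> count."""
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def gini_split(partitions):
    """Weighted Gini of the children: sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(p.values()) for p in partitions)
    return sum(sum(p.values()) / n * gini(p) for p in partitions)

def best_continuous_split(values, labels):
    """Sort on the attribute, scan candidate thresholds, keep the lowest Gini."""
    data = sorted(zip(values, labels))
    left, right = Counter(), Counter(l for _, l in data)
    best = (float("inf"), None)
    for i in range(len(data) - 1):
        v, l = data[i]
        left[l] += 1               # update the count matrix incrementally
        right[l] -= 1
        if data[i + 1][0] == v:    # only split between distinct attribute values
            continue
        threshold = (v + data[i + 1][0]) / 2
        best = min(best, (gini_split([left, right]), threshold))
    return best                    # (lowest Gini of the split, chosen threshold)
```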
Entropy impurity measure (INFO)
- Maximum (log nc) when records are equally distributed among all classes
- Minimum (0) when all records belong to one class
- Entropy computations are very similar to those of the Gini index
Information gain:
- measures reduction in entropy achieved because of the split.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each small but pure.
Gain Ratio:
- Adjusts the information gain by the entropy of the partitioning; high-entropy partitioning (many small partitions) is penalized.
- Designed to overcome shortcomings of information gain.
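A minimal sketch, assuming the standard definitions Entropy(t) = -sum_j p(j|t) * log2 p(j|t), Gain = Entropy(parent) - sum_i (n_i / n) * Entropy(i), and GainRatio = Gain / SplitINFO with SplitINFO = -sum_i (n_i / n) * log2(n_i / n); the function names are illustrative.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c) if n else 0.0

def information_gain(parent_counts, children_counts):
    """Reduction in entropy achieved by the split."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by the entropy of the partitioning."""
    n = sum(parent_counts)
    split_info = -sum(sum(c) / n * log2(sum(c) / n)
                      for c in children_counts if sum(c))
    gain = information_gain(parent_counts, children_counts)
    return gain / split_info if split_info else 0.0

# A 50/50 parent split into two pure children: maximum gain of 1 bit.
print(information_gain([10, 10], [[10, 0], [0, 10]]))  # 1.0
print(gain_ratio([10, 10], [[10, 0], [0, 10]]))        # 1.0
```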
Classification error impurity measure
Measures the classification error made by the current node.
Has the same maximum (1 - 1/nc) and minimum (0) as the Gini index.
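A minimal sketch, assuming the standard definition Error(t) = 1 - max_j p(j|t); the function name is illustrative.

```python
def classification_error(class_counts):
    """Error(t) = 1 - max_j p(j|t): error of the majority-class prediction at the node."""
    n = sum(class_counts)
    return 1.0 - max(class_counts) / n if n else 0.0

print(classification_error([5, 5]))   # 0.5 -> maximum 1 - 1/nc, like the Gini index
print(classification_error([10, 0]))  # 0.0 -> minimum, the node is pure
```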
Stopping criteria for tree induction
Stop expanding node when:
- All records belong to the same class
- All the records have similar attribute values
- Early termination:
- pre-pruning
- post-pruning
Overfitting vs Underfitting
Overfitting: the model is too complex and also fits the noise in the training data; the training error is low but the test set error is high.
Underfitting: the model is too simple; both training and test set errors are high.
Addressing overfitting
Pre-pruning (early stopping):
- Stop the algorithm before the tree is fully grown
- Typical stopping conditions for a node
- if all instances belong to same class
- all the attribute values are the same
- More restrictive conditions:
- Stop if the number of instances is less than some user-specified threshold
- Stop if the class distribution is independent of the available features
- Stop if expanding the current node does not improve impurity measures
Post-pruning:
- Grow decision tree to its entirety
- Trim the nodes of the decision tree in a bottom up fashion
- If generalization error improves after trimming, replace sub-tree by a leaf node.
- Class label of leaf node is determined from majority class of instances in the sub-tree
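A hedged sketch of bottom-up post-pruning; the dict-based tree format, the "majority_class" field, and the error_on(...) helper (estimated generalization error on a validation set) are illustrative assumptions.

```python
def prune(node, validation_set):
    if not isinstance(node, dict):            # already a leaf (a class label)
        return node
    # Trim bottom-up: prune the children first.
    node["children"] = {v: prune(child, validation_set)
                        for v, child in node["children"].items()}
    leaf = node["majority_class"]             # candidate replacement: majority class of the sub-tree
    # Replace the sub-tree by a leaf if the estimated generalization error does not get worse.
    if error_on(leaf, validation_set) <= error_on(node, validation_set):
        return leaf
    return node
```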
Issues of decisions tree
- Data fragmentation
- The number of instances gets smaller as you traverse down the tree and may become too small to make any statistically significant decision.
- Handling missing attribute values
- They affect how impurity measures are computed
- They affect how instances with missing values are distributed to child nodes
- They affect how a test instance with missing values is classified
- Search strategy
- Finding optimal decision tree is NP-hard
- The algorithms presented so far use a greedy, top-down, recursive partitioning strategy.
- Expressiveness
- Decision trees provide an expressive representation for learning discrete-valued functions (example: the parity function)
- Not expressive enough for modeling continuous variables, in particular when the test condition involves only a single attribute at a time.
- Tree replication
- The same subtree can appear in multiple branches of the tree, making it larger and harder to interpret.