Week 3: Regression & Classification (Tree-Based Models) Flashcards
Algorithm 1R
A method for computing classification rules based on the frequencies of individual attribute values.
- Loop through all attributes
1.1 Loop through each value v of the attribute
1.1.1 Build a rule that predicts the most frequent class for that value: c_freq(v) = most frequent class for attribute value v
1.2 Calculate the error rate of the attribute's rule set
- Choose the attribute whose rules have the smallest error rate (see the sketch below).
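A minimal Python sketch of 1R on a hypothetical toy dataset (the attribute values and labels are made up for illustration):

```python
from collections import Counter, defaultdict

def one_r(X, y):
    """1R: for each attribute, map every value to its most frequent class;
    keep the attribute whose rule set makes the fewest errors."""
    best_attr, best_rules, best_errors = None, None, float("inf")
    for a in range(len(X[0])):
        # Count classes per attribute value
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[row[a]][label] += 1
        # Rule: attribute value -> most frequent class for that value
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors = instances not belonging to the most frequent class
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if errors < best_errors:
            best_attr, best_rules, best_errors = a, rules, errors
    return best_attr, best_rules

# Hypothetical toy data: attributes = (outlook, windy), class = play
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
y = ["yes", "yes", "yes", "no"]
print(one_r(X, y))
```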
Accuracy for Classification Rules
accuracy = number of correctly predicted instances / number of all instances covered
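For example, with hypothetical numbers: a rule that covers 20 instances and predicts 17 of them correctly has accuracy = 17/20 = 0.85.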
Gini Coefficient/Impurity Measure
G = \sum_{k=1}^{C} Pr(k)^2, where Pr(k) is the proportion of instances of class k in the node (this is the convention used in this module; the standard Gini impurity is 1 - \sum_{k=1}^{C} Pr(k)^2). The higher the Gini coefficient, the purer the node, so splits should maximise it.
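A small sketch computing this quantity from raw class counts, using the sum-of-squares convention above (the counts are hypothetical):

```python
def gini(counts):
    """Sum of squared class proportions; 1.0 means a pure node."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))  # pure node -> 1.0
print(gini([5, 5]))   # maximally mixed two-class node -> 0.5
```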
Entropy
E = -\sum_{k=1}^{C} Pr(k) \cdot \log_2 (Pr(k))
The lower the entropy, the purer the split. Splits should minimise entropy. Higher entropy means higher disorder.
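A small sketch computing entropy from raw class counts (the example counts are hypothetical):

```python
import math

def entropy(counts):
    """Shannon entropy in bits; 0 means a pure node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))  # mixed node -> ~0.940
print(entropy([5, 5]))  # maximally mixed two-class node -> 1.0
```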
Information Gain
IG = E(v) - [f_1 E(v_1) + … + f_\ell E(v_{\ell})]
v = parent node
v_1,…,v_{\ell} = child nodes
f_k = proportion of the parent's instances that fall into child node v_k. Splits should maximise information gain (see the sketch below).
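A minimal sketch combining the two formulas above; the parent and child class counts are hypothetical:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """IG = E(parent) minus the weighted average of the child entropies."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Hypothetical split: parent [9 yes, 5 no] -> children [6, 2] and [3, 3]
print(information_gain([9, 5], [[6, 2], [3, 3]]))  # ~0.048
```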
Chi-square
\chi^2 = \sum_{k=1}^C \frac{(O_k - E_k)^2}{E_k}
Observed Frequency = observed cell count in the contingency table
Expected Frequency = (row total × column total) / grand total
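A small sketch computing the statistic from a contingency table of attribute values against classes (the table values are hypothetical):

```python
def chi_square(table):
    """Chi-square statistic for a 2D contingency table (rows x columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical table: rows = attribute values, columns = classes
print(chi_square([[30, 10], [15, 45]]))  # ~24.2
```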
Classification and Regression Trees (CART)
Input: Numerical and Categorical
Split Selection: Gini Index
Pruning: Adjusted Error Rate
This method grows binary trees as long as it can find new splits that increase purity. It then prunes, replacing any subtree whose leaves make the same classification with their common parent, exploiting the tradeoff between the number of leaves and the error rate (see the sketch below).
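A minimal sketch using scikit-learn's DecisionTreeClassifier, which grows CART-style binary trees; criterion="gini" selects splits by Gini index and ccp_alpha controls cost-complexity pruning, i.e. the leaves-versus-error tradeoff (the dataset and the alpha value are illustrative choices, not from the notes):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Gini-based splitting; ccp_alpha > 0 enables cost-complexity (weakest-link) pruning.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X, y)
print(tree.get_n_leaves(), tree.score(X, y))
```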
Iterative Dichotomiser (ID3)
Input: categorical only
Split Selection: Information Gain
Pruning: No
This method randomly chooses a subset of the training set and builds a decision tree using this subset. It then evaluates the tree on the remaining instances. If the error rate is sufficiently low, it terminates; otherwise, it adds the misclassified instances to the subset and repeats (see the sketch below).
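A sketch of this windowing idea, using scikit-learn's DecisionTreeClassifier with the entropy criterion as a stand-in for ID3 (ID3 proper takes only categorical inputs; the dataset, window size, and stopping threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Start from a small random window of the training data.
window = rng.choice(len(X), size=20, replace=False).tolist()

while True:
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X[window], y[window])
    outside = [i for i in range(len(X)) if i not in window]
    missed = [i for i in outside if tree.predict(X[i:i + 1])[0] != y[i]]
    # Stop once (almost) all remaining instances are classified correctly.
    if len(missed) <= 2:
        break
    window += missed  # add the misclassified instances and rebuild

print(len(window), tree.get_n_leaves())
```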
C5.0
Input: numerical and categorical
Split Selection: Information Gain
Pruning: Error rate computation with statistical confidence intervals
This method is similar to CART, but can make multiway splits on categorical variables.
Decision Trees
Pros:
1. Can handle mixed data types.
2. Can deal with missing values.
Cons:
1. Decision boundaries are typically parallel to input axes.
2. Overfitting might occur if not properly designed.
Leave-One-Out Cross-Validation
A special case of N-fold cross-validation in which N equals the number of instances: each instance in turn is held out as the validation set while the model is trained on the remaining N - 1 instances. Because no random sampling is involved, there is no sampling bias.
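A minimal sketch using scikit-learn's LeaveOneOut splitter (the dataset and model are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One fold per instance: train on N-1 instances, validate on the held-out one.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```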
Bootstrapping
Unlike cross-validation, which uses sampling without replacement, bootstrapping uses sampling with replacement. In each trial, n samples are drawn with replacement from the n instances; the drawn samples form the training set and the instances never drawn form the test set. Each draw picks a given instance with probability 1/n, so the probability of never being picked across n draws is (1 - 1/n)^n ≈ e^{-1} ≈ 36.8%. As n approaches infinity, the training set therefore contains approximately 63.2% of the distinct instances.
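A small simulation of one bootstrap trial, showing that roughly 63.2% of distinct instances end up in the training set (n is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap trial: draw n indices with replacement.
chosen = rng.integers(0, n, size=n)
train_fraction = len(np.unique(chosen)) / n        # distinct instances in training set
print(train_fraction)                              # ~0.632 for large n
test_indices = np.setdiff1d(np.arange(n), chosen)  # never-drawn (out-of-bag) instances
print(len(test_indices) / n)                       # ~0.368
```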
True Positive Rate
(True Positives) / (True Positives + False Negatives)
False Positive Rate
(False Positives) / (False Positives + True Negatives)
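A small sketch computing both rates from hypothetical confusion-matrix counts:

```python
def rates(tp, fp, tn, fn):
    """True positive rate (recall) and false positive rate."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr

# Hypothetical confusion-matrix counts
print(rates(tp=40, fp=10, tn=45, fn=5))  # (~0.889, ~0.182)
```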
Cohen’s Kappa
This metric measures the level of agreement between two classifiers, corrected for the agreement expected by chance.
\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)}
If there are two classifiers, C and C’, and two classes, A and B:
aa: C and C’ agree the instance is A
bb: C and C’ agree that the instance is B
ab, ba: when C and C’ disagree about the classification.
Assuming there are n instances covered by C and C’
Pr(a) = \frac{aa + bb}{n}
Pr(e) = (\frac{aa + ba}{n}) \times (\frac{aa + ab}{n}) + (\frac{bb + ab}{n}) \times (\frac{bb + ba}{n})
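A small sketch computing kappa from the agreement/disagreement counts defined above (the counts are hypothetical):

```python
def cohens_kappa(aa, bb, ab, ba):
    """Kappa from the pairwise agreement/disagreement counts defined above."""
    n = aa + bb + ab + ba
    pr_a = (aa + bb) / n
    pr_e = ((aa + ba) / n) * ((aa + ab) / n) + ((bb + ab) / n) * ((bb + ba) / n)
    return (pr_a - pr_e) / (1 - pr_e)

# Hypothetical counts: Pr(a) = 0.7, Pr(e) = 0.5, kappa = 0.4
print(cohens_kappa(aa=40, bb=30, ab=20, ba=10))
```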