Module 2 Flashcards
Nearest neighbour classifier
- assign an instance the class label of its nearest training instance
- non-parametric model
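A minimal sketch of 1-NN classification, assuming instances are NumPy feature vectors and Euclidean distance (both assumptions, not fixed by the card):

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    # Distance from the query point to every training instance (Euclidean).
    distances = np.linalg.norm(X_train - x, axis=1)
    # Return the class label of the single nearest training instance.
    return y_train[np.argmin(distances)]
```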
One nearest neighbour cons
- sensitive to noise
- overfit training data
Increasing k will make the classifier
- have a smoother decision boundary (higher bias)
- less sensitive to training data (lower variance)
Weighted k-NN
- assign a weight to each neighbour (based on how close they are)
- sum the weights per class in the neighbourhood (assign to the class with the largest sum)
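A sketch of the weighted vote, assuming Euclidean distance and inverse-distance weights (one common weighting choice; the card does not fix it):

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(x, X_train, y_train, k=5, eps=1e-12):
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]        # indices of the k closest neighbours
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (distances[i] + eps)   # weight = inverse distance
    # Assign to the class with the largest summed weight in the neighbourhood.
    return max(votes, key=votes.get)
```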
k-NN pros
- robust to noisy data
k-NN cons
- slow for large datasets
k-NN regression
Compute the mean value across k nearest neighbours
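A sketch under the same assumptions (NumPy arrays, Euclidean distance):

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=5):
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    # Predict the mean target value of the k nearest neighbours.
    return y_train[nearest].mean()
```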
Locally weighted regression
- distance-weighted k-NN for regression
- compute the weighted mean value across k nearest neighbours
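The distance-weighted variant, again assuming inverse-distance weights:

```python
import numpy as np

def weighted_knn_regress(x, X_train, y_train, k=5, eps=1e-12):
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] + eps)
    # Weighted mean: closer neighbours contribute more to the prediction.
    return np.dot(weights, y_train[nearest]) / weights.sum()
```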
Decision Tree learning
- search for an “optimal” splitting rule
- split your dataset
- repeat the first two steps on each newly created subset (see the sketch below)
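A minimal ID3-style sketch of this loop for categorical features, assuming the data is a list of feature dicts with a parallel list of labels (the representation and helper names are illustrative); it scores splits with entropy-based information gain, defined in the next cards:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

def information_gain(rows, labels, feature):
    # Weighted average entropy of the subsets produced by splitting on `feature`.
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(y)
    avg = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - avg

def build_tree(rows, labels, features):
    # Leaf: the node is pure or no features remain; predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: search for the "optimal" splitting rule (largest information gain).
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    # Step 2: split the dataset, one branch per value of the chosen feature.
    branches = {}
    for row, y in zip(rows, labels):
        branches.setdefault(row[best], ([], []))
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(y)
    # Step 3: repeat steps 1 & 2 on each new subset.
    rest = [f for f in features if f != best]
    return {best: {v: build_tree(r, ys, rest) for v, (r, ys) in branches.items()}}
```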
Entropy
A measure of the uncertainty of a random variable
Information Gain
Difference between the initial entropy and the (weighted) average entropy of the produced subsets
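A small worked example with made-up counts: a parent node with 10 positives and 6 negatives, split into two subsets of 8 examples each:

```python
import math

def H(p):
    # Entropy of a class distribution given as a list of probabilities.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

parent   = H([10/16, 6/16])                                  # initial entropy, ~0.95
children = (8/16) * H([7/8, 1/8]) + (8/16) * H([3/8, 5/8])   # weighted average entropy, ~0.75
gain     = parent - children                                  # information gain of this split, ~0.20
```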
Ordered values
- for each feature sort its values
- consider only split points that are between two examples with different class labels
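A sketch of generating candidate thresholds for one numeric feature: sort the examples by value and keep only midpoints between consecutive examples whose class labels differ (the data layout is an assumption):

```python
def candidate_thresholds(values, labels):
    # Sort examples by feature value, keeping labels aligned.
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        # Only boundaries between differently-labelled examples can improve purity.
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)
    return thresholds
```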
Categorical/Symbolic values
- find the most informative feature
- create as many branches as there are different values for this feature
Pruning
- consider every node whose children are all leaf nodes
- turn each into a leaf node, keeping the change only if it does not hurt accuracy (e.g. on a validation set)
- repeat until all such nodes have been tested
Random forests
- use many decision trees
- each tree is grown on a random sample of the training set & considers a random subset of features
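A sketch using scikit-learn's implementation, which follows the same recipe: each tree is fit on a bootstrap sample of the training set and considers a random subset of features at each split (the parameter values below are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the ensemble
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a random sample (with replacement) of the training set
)
# forest.fit(X_train, y_train); forest.predict(X_test)
```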