Part 2: K-nearest neighbors Flashcards
Examples of classification tasks
- Detecting abnormalities on screening mammograms.
- Classifying credit card transactions as legitimate or suspicious.
- Email: spam/not-spam
- Categorizing the sentiment of finance tweets as positive (+), negative (-), or neutral (0).
Classification
Given a set of records (the training set), find a model for the class attribute as a function of the values of the other attributes.
- In the training set, each record contains a set of attributes; one of these attributes is the class.
- A validation set is used to determine the accuracy of the model. Usually, the given dataset is divided into a training set and a validation set: the training set is used to build the model and the validation set to validate it (a minimal split sketch follows below).
Goal: previously unseen records should be assigned a class as accurately as possible.
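A minimal sketch of such a split on synthetic data (all names and numbers here are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 4))       # 100 records with 4 attributes
y = rng.integers(0, 2, size=100)    # the class attribute (two classes)

# Shuffle the record indices once, then hold out 25% for validation.
idx = rng.permutation(len(X))
split = int(0.75 * len(X))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_val, y_val = X[idx[split:]], y[idx[split:]]
```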
Nearest-Neighbor Classifiers
Requires 3 things:
- Set of stored records.
- Distance metric to compute distance between records.
- Value of k, the number of nearest neighbors to retrieve.
To classify an unknown record:
- Compute distance to other training records.
- Identify k nearest neighbors.
- Use the class labels of the nearest neighbors to determine the class label of the unknown record, e.g. by majority vote (see the sketch after this list).
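A minimal sketch of these three steps, assuming numeric attributes stored in NumPy arrays (the function name and array names are illustrative, not from the course):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # 1. Compute the (Euclidean) distance from x_new to every training record.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote over the neighbors' class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]
```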
K-nearest neighbors
The k-nearest neighbors of a record x are the data points that have the smallest distance to x. The predicted class depends on the number of neighbors considered: in the standard example, the 1-nearest neighbor votes -, with 2 nearest neighbors the vote is tied (neutral), and with 3 nearest neighbors the majority votes +.
To make sure that you do not get ties, use an odd value of k; then, in a two-class problem, there is always a majority.
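A toy illustration of why (the labels here are invented): with an even k the vote can split evenly, while with an odd k a two-class vote always has a majority.

```python
from collections import Counter

neighbor_labels = ['+', '-', '+']               # 3 nearest neighbors, closest first
print(Counter(neighbor_labels[:2]))             # k=2: one '+' and one '-' -> tie
print(Counter(neighbor_labels).most_common(1))  # k=3: [('+', 2)] -> clear majority
```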
Distance measures
- Euclidean distance
- Manhattan distance
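Both measures in a small sketch for two numeric records a and b (names and values are illustrative):

```python
import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```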
Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes (a scaling sketch follows below).
- If an attribute has the same distribution in every class, it does not discriminate.
- If an attribute has a different distribution across the classes, it should count in the distance:
  + a small distance implies that the discriminating attributes are equal, which implies that the records are probably in the same class;
  + a large distance implies that the discriminating attributes are different, which implies that the records are probably in different classes.
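One common way to scale is z-score standardization; a minimal sketch, assuming X_train from the split sketch above and some new record x_new to classify:

```python
# Per-attribute mean and standard deviation, computed on the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
x_new_scaled = (x_new - mu) / sigma   # apply the same transform to the new record
```

Computing mu and sigma on the training set only, and reusing them for new records, keeps the distance measure consistent between training and prediction.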
Choosing the value of k
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.
General method to find the weights and k
The attribute weights can be found by gradient descent on a validation set: this method keeps adjusting the weights until the error is minimized. Optimizing k is quite different: you cannot apply gradient descent, because k is a discrete variable. Therefore, you have to try many different values of k and look for the optimum (the validation error as a function of k is typically U-shaped, and the best k sits at its minimum; see the sketch below).
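A sketch of that search, reusing X_train, y_train, X_val, y_val and knn_classify from the earlier sketches:

```python
import numpy as np

# Try a range of odd k values and keep the one with the lowest validation error.
errors = {}
for k in range(1, 21, 2):
    preds = np.array([knn_classify(X_train, y_train, x, k=k) for x in X_val])
    errors[k] = np.mean(preds != y_val)

best_k = min(errors, key=errors.get)  # the k at the minimum of the U-shaped curve
```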
Curse of dimensionality
In high dimensions (many attributes) everything is far away: there are no points nearby unless a very large amount of data is available (exponential in the number of dimensions).
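A small self-contained demonstration on synthetic uniform data: as the number of dimensions grows, the nearest and the farthest point become almost equally far away, so "nearest" loses its meaning.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
for d in (2, 10, 100, 1000):
    pts = rng.uniform(size=(500, d))                  # 500 random points in d dimensions
    dists = np.linalg.norm(pts[1:] - pts[0], axis=1)  # distances from the first point
    print(d, round(dists.min() / dists.max(), 2))     # this ratio creeps toward 1
```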
Drawback: nearest-neighbor classifiers store all training records and use them to predict the class labels of unseen cases, which takes a lot of storage and computing power.
Non-parametric (instance based) methods
- Nearest neighbors
- Naive Bayes
Parametric (model based) methods
- Linear regression