Lecture 7 - K-Nearest Neighbors Flashcards
Classification
- Examine data for which the classification is unknown, using data with a known outcome
- Goal is to predict that classification
- Learn classification from the training data
- Relationship between predictors and outcome
- Finally, apply the selected model to the testing data, which also includes known outcomes
- Measure how well it will do on unknown data
Example classification
The k-Nearest Neighbors Classifier
- Identify the neighbours of the new record that we wish to classify
- I.e., the k records in the training dataset that are similar to / close by the new record
- Use the neighbours (i.e., these k records) to classify the new record into a class
- Assign the new record to the predominant class among these neighbors
Steps in the k-Nearest Neighbors Classifier
- Determining the item’s neighbors
- Choosing the number of neighbors, i.e., the value of k
- Computing classification (for a categorical outcome) or prediction (for a numerical outcome)
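These three steps can be sketched in a few lines of Python; the function and the toy records below are illustrative assumptions, not the lecture's own example:

```python
import numpy as np

def knn_classify(X_train, y_train, new_record, k=3):
    """Classify one new record by majority vote among its k nearest training records."""
    # Step 1: compute the Euclidean distance from the new record to every training record
    distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
    # Step 2: take the k closest training records (the neighbours)
    nearest = np.argsort(distances)[:k]
    # Step 3: assign the predominant class among these neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training data with two predictors and a categorical outcome
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(["owner", "owner", "non-owner", "non-owner"])
print(knn_classify(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> owner
```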
Determine record’s neighbors
Euclidean Distance
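For reference (standard definition, presumably the formula shown on this slide), the Euclidean distance between two records x = (x1, …, xp) and u = (u1, …, up) measured on p predictors is:

$$ d(x, u) = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2} $$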
Example KNN quiz
Euclidean Distance
- Highly scale dependent
- I.e., units of one variable can have a huge influence on the results, for example from cents to dollars
- Solution is normalising the values before computing distances
- This converts all measurements to the same scale
- → Subtract average and divide by standard deviation
Example
- Average sales amount across the 22 utilities is 8914.045
- Standard deviation is 3549.984
- Sales for Arizona Public Service is 9077
- Normalized sales is (9077-8914.045)/3549.984 = 0.046
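A quick check of this computation in Python (values taken from the example above):

```python
# Z-score normalisation: subtract the average and divide by the standard deviation
mean_sales = 8914.045   # average sales across the 22 utilities
std_sales = 3549.984    # standard deviation of sales
arizona_sales = 9077    # sales for Arizona Public Service

normalized_sales = (arizona_sales - mean_sales) / std_sales
print(round(normalized_sales, 3))  # 0.046
```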
Euclidean distance pt. 2
Choosing the value for k
- k is too low: may be fitting to the noise in the dataset
- k is too high: miss out on the method’s ability to capture the local structure in the dataset, one of its main advantages
- k is the number of records in the training dataset: assign all records to the majority class in the training data
- Balanced choice depends on the nature of the data
- E.g., the more complex and irregular the structure of the data, the lower the optimum value of k
- Typically: values of k fall in the range 1 to 20
- Use odd numbers to avoid ties
How is k chosen?
- Use the training data to classify the records in the testing dataset, trying different values for k
- Compute error rates for the various choices of k
- Choose the k with the best classification performance (see the sketch below)
BUT
- Testing dataset is now used as part of the training process (to set k)
- We need a new dataset to evaluate the model performance on data that it did not see
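A possible sketch of this search over k with scikit-learn; the dataset, split, and range of k are illustrative assumptions, and per the caveat above the final evaluation should use a separate dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Try odd values of k in the typical range 1..20 and record the error rate
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate = 1 - knn.score(X_test, y_test)         # score() is accuracy
    print(f"k={k:2d}  error rate={error_rate:.3f}")
```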
Numerical outcome
- Algorithm can be extended to predict continuous values, instead of categorical values
- First step remains unchanged, i.e., determine neighbours by computing distances
- Second step, i.e., determining the class through majority voting, must be modified
- Determine the prediction by taking the average outcome value of the k-nearest neighbors
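A minimal sketch of the numerical variant, assuming scikit-learn's KNeighborsRegressor and made-up data; the prediction is simply the average outcome of the k nearest neighbours:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one predictor, numerical outcome
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
# The prediction is the average outcome of the 3 nearest neighbours (x = 2.0, 3.0, 4.0)
print(knn.predict([[2.6]]))  # -> about 2.97
```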
Advantages
- Simplicity of the method
- Lack of parametric assumptions
Performs surprisingly well especially when
- There is a large enough training set present
- Each class is characterised by multiple combinations of predictor values
Shortcomings
- Computing the nearest neighbours can be time-consuming
Solutions: reduce the time taken to compute distances by working in fewer dimensions, generated using dimension reduction techniques; speed up identification of the nearest neighbours using specialized data structures (see the sketch after this list)
- For every record to be predicted, we compute its distance from the entire set of training records only at the time of prediction. Known as a “lazy learner”
→ This behaviour prohibits using this algorithm for real-time prediction of a large number of records simultaneously
- Number of records required in the training set to qualify as large increases exponentially with the number of predictors. Known as the “curse of dimensionality”
Solution: Reduce the number of predictors
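To illustrate the “specialized data structures” solution mentioned above, one option is scikit-learn's KDTree, which indexes the training records once so that neighbour lookups avoid scanning the whole training set; the data below is synthetic:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 5))   # synthetic training records

tree = KDTree(X_train)                   # build the index once, up front
dist, ind = tree.query(rng.normal(size=(1, 5)), k=5)   # fast lookup of the 5 nearest neighbours
print(ind)                               # indices of the nearest training records
```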
Classification with Python
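A minimal sketch of a typical scikit-learn workflow for k-NN classification (the dataset, k, and split are illustrative assumptions), combining normalisation with the classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Normalise the predictors (subtract mean, divide by std) and fit a 5-nearest-neighbours classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```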