Chapter 7 Flashcards
k-nearest neighbors
Relies on finding “similar” records in training data
These neighbors are used to derive a classification or prediction
by voting (for classification) or averaging (for numerical prediction)
how it works
Identify k records in the training data that are similar to a new record that we wish to classify
Then use the similar (neighboring) records to classify the new record into a class, assigning the new record to the predominant class among these neighbors
Look for records in the training data that are similar or “near” the record to be classified
Based on the class (of the proximate records), we assign a class to the new record
“Near” means
records with similar predictor values X1, X2, …, Xp (see the sketch below)
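A minimal from-scratch sketch of the idea; the training records, labels, and new record below are made up purely for illustration:

```python
import numpy as np
from collections import Counter

# Hypothetical training data: two predictors per record, plus a class label
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.8]])
y_train = np.array(["A", "A", "B", "B", "A"])

def knn_classify(x_new, X_train, y_train, k=3):
    """Classify x_new by majority vote among its k nearest training records."""
    # Euclidean distance from the new record to every training record
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_classify(np.array([1.1, 1.5]), X_train, y_train, k=3))  # -> "A"
```

For a numerical outcome, the majority vote would be replaced by the average of the neighbors' outcome values.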
characteristics of the model
Data-driven, not model-driven
Makes no assumptions about the form of the relationship between the class membership Y and the predictors X1, X2, … Xp
Nonparametric method because it does not involve estimation of parameters (coefficients) in an assumed functional form, such as the linear form assumed in linear regression
How to measure “nearby”?
Central issue: How to measure distance between records based on their predictor values.
The most popular distance measure: Euclidean distance
The Euclidean distance between two records (x1, x2, …, xp) and (u1, u2, …, up) is:
sqrt[(x1 − u1)^2 + (x2 − u2)^2 + … + (xp − up)^2]
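For example (made-up numbers), the distance between the records (1, 2) and (4, 6) is sqrt((1 − 4)^2 + (2 − 6)^2) = sqrt(25) = 5, which a short numpy check confirms:

```python
import numpy as np

record_a = np.array([1.0, 2.0])
record_b = np.array([4.0, 6.0])

# Euclidean distance between the two records
print(np.linalg.norm(record_a - record_b))  # 5.0
```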
The one-nearest-neighbor scheme has a misclassification rate
that is no more than twice the error rate achievable when we know exactly the probability density functions for each class
The one-nearest-neighbor idea can be extended to k > 1 as follows
Find the k nearest neighbors of the record to be classified
Use a majority decision rule to classify the record, where the record is classified as a member of the majority class of the k neighbors
Choosing k
Advantage of choosing k > 1: higher values of k provide smoothing that reduces the risk of overfitting
Balance between overfitting and ignoring predictor information
Typical values of k fall in the range 1–20 (often an odd number, to avoid ties)
Typically we choose the value of k with the lowest error rate on the validation data (see the sketch after this list)
k too low: may be fitting the noise in the data
captures local structure in the data (but also the noise)
k too high: may not capture the local structure of the data (one of k-NN's key advantages)
more smoothing, less noise, but may miss local structure
Extreme case: k = number of records in the training data: assign all records to the majority class
This is the naïve rule (a case of oversmoothing)
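A minimal sketch of choosing k by validation error with scikit-learn; the data here are synthetic stand-ins, and the odd k values from 1 to 19 follow the guideline above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for real training/validation sets
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

# Try odd k from 1 to 19 and keep the one with the lowest validation error
best_k, best_err = None, float("inf")
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    err = 1 - accuracy_score(y_valid, knn.predict(X_valid))
    if err < best_err:
        best_k, best_err = k, err

print(best_k, best_err)
```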
Advantages
Simple
No distributional assumptions (e.g., normality) required
Effective at capturing complex interactions among variables without having to define a statistical model
Shortcomings
Classifying a new record is slow: distances to all training records must be computed before the nearest one(s) can be identified
Required size of training set increases exponentially with # of predictors, p
“curse of dimensionality”
One remedy: reduce the dimension of the predictors (e.g., with PCA)
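A hedged sketch of that remedy: standardizing the predictors, projecting onto a few principal components, and then running k-NN on the reduced data. The dataset is synthetic and the choices of 2 components and k = 5 are arbitrary, for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data standing in for a real predictor matrix
X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# Standardize, project onto the first 2 principal components, then run k-NN
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.score(X, y))  # accuracy of the reduced-dimension k-NN on the same data
```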