Data Mining - Lecture k-Nearest Neighbor Flashcards
What is the k-Nearest Neighbor in short?
A classification model. You specify a number k. For every new record, the model looks at the k closest records in the training data; the predominant class among those neighbors becomes the class of the new record.
The distance can be computed with the Euclidean distance or the Manhattan distance.
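A minimal sketch of such a classifier in Python (not from the lecture; the function name and numpy-array inputs are my own assumptions):

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new record to every training record
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training records
    nearest = np.argsort(dists)[:k]
    # The predominant class among those k neighbors wins
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```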
What are the three steps of the Nearest Neighbor model?
- Determining the item’s neighbors
- Choosing the number of neighbors (k)
- Computing the classification (categorical) or prediction (numerical)
How is the distance between records written down?
If you have two records, record i and record j, these will be abbreviated to ri and rj.
The distance between those records is dij.
What are the properties of the distance between records?
- The distance must be non-negative (dij ≥ 0).
- The distance from a record to itself is 0 (dii = 0).
- Distance is symmetric (dij = dji).
- The distance between a pair can be no larger than the sum of the distances via a third record (triangle inequality: dij ≤ dik + dkj; see the quick check below).
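A quick numeric check of all four properties, using Python's built-in math.dist (Euclidean distance) on three made-up records:

```python
import math

a, b, c = (0, 0), (3, 4), (3, 0)  # three toy records, coordinates made up
assert math.dist(a, b) > 0                                   # non-negative (positive for distinct records)
assert math.dist(a, a) == 0                                  # distance to itself is 0
assert math.dist(a, b) == math.dist(b, a)                    # symmetric
assert math.dist(a, b) <= math.dist(a, c) + math.dist(c, b)  # triangle inequality
```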
What is the Euclidean distance and how do you compute it?
It is a measure of the distance between two records, computed from their variables.
If a record has multiple independent variables (x1 … xn), you compute it as:
dij = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + … )
= You subtract xj1 from xi1 and square the result. You do this for every variable, add the squares together, and take the square root of that sum.
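A small sketch of that computation in plain Python (function name is my own):

```python
import math

def euclidean_distance(ri, rj):
    # Square each per-variable difference, sum them, take the square root
    return math.sqrt(sum((xi - xj) ** 2 for xi, xj in zip(ri, rj)))

# euclidean_distance([1, 2], [4, 6]) -> 5.0
```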
What are the problems with the Euclidean distance and how do you solve them?
- It is highly scale dependent.
-> Normalize all measurements by subtracting the variable's average from each x and dividing by its standard deviation (a sketch follows below).
- It is sensitive to outliers.
-> Use the Manhattan distance.
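A minimal sketch of that normalization (z-scoring), assuming a numpy array with one record per row and no constant columns:

```python
import numpy as np

def normalize(X):
    # Subtract each column's mean and divide by its standard deviation,
    # so no variable dominates the distance just because of its scale
    return (X - X.mean(axis=0)) / X.std(axis=0)
```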
What is the Manhattan distance?
A distance measure that uses absolute differences instead of squared ones.
dij = |xi1 - xj1| + |xi2 - xj2| + …
= You subtract xj1 from xi1 and drop the sign in front of the number (take the absolute value). You do this for every variable and add them all together.
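The same kind of sketch as above, for the Manhattan distance:

```python
def manhattan_distance(ri, rj):
    # Sum of the absolute per-variable differences
    return sum(abs(xi - xj) for xi, xj in zip(ri, rj))

# manhattan_distance([1, 2], [4, 6]) -> 3 + 4 = 7
```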
What happens if K is too low?
The model picks up outliers and odd values (noise) in the training data, and will not classify new records well.
What happens if k is too high?
You miss out on the method's ability to capture the local structure in the dataset.
What happens if k is the exact same as the number of records in the training set?
All records will be assigned to the majority class in the training data -> the model ignores the predictors entirely and performs poorly.
What is the normal range for k?
Between 1 and 20.
You try to use odd numbers to avoid ties.
How do you choose k?
You run the k-nearest neighbor algorithm for several values of k and compare the performance on the testing data set.
The k with the best classification performance (and thus the lowest error rate) is the best k.
As you are then using the testing set to improve the model, you should actually use a separate validation set instead. We did not learn how to do that, though. A sketch of the search is given below.
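A sketch of that search (assuming the knn_classify function from above; whether the held-out set is a test or validation set depends on your setup):

```python
def best_k(X_train, y_train, X_held_out, y_held_out, candidates=range(1, 21, 2)):
    # Error rate on the held-out records for a given k
    def error_rate(k):
        predictions = [knn_classify(X_train, y_train, x, k) for x in X_held_out]
        return sum(p != y for p, y in zip(predictions, y_held_out)) / len(y_held_out)
    # Odd candidates between 1 and 20 avoid ties; pick the lowest error rate
    return min(candidates, key=error_rate)
```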
What do you do if you use k-nearest neighbor on numerical outcomes instead of categorical outcomes?
(So the class is not green or red, but a value)
You do everything the same, except you do not use majority voting among the neighbors.
-> Instead, you take the average of the neighbors' values and use that as the prediction.
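A sketch for numerical outcomes, reusing the neighbor search from the classifier above (numpy arrays assumed):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Same neighbor search as for classification
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # But the prediction is the average of the neighbors' outcome values
    return y_train[nearest].mean()
```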
What are the shortcomings of this model?
- Computing the nearest neighbors can be time consuming.
- The distance is only computed at prediction time -> k-NN is a "lazy learner".
- As the number of predictors grows, the number of records in the training set needs to grow exponentially to keep the neighbors close (the curse of dimensionality).