Lecture 7 - K-Nearest Neighbors Flashcards

1
Q

Classification

A
  • Examine data whose classification is unknown, using data with a known outcome
  • Goal is to predict that classification
  • Learn the classification from the training data
    • I.e., the relationship between predictors and outcome
  • Finally, apply the selected model to the testing data, which also includes known outcomes
    • Measure how well it will do on unknown data (sketched below)
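A minimal sketch of this train/test workflow in scikit-learn (the iris dataset, the 70/30 split, and the choice of k-NN as the model are my own placeholders, not from the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Learn the predictor-outcome relationship on the training part,
# then measure performance on the held-out testing part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = KNeighborsClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # estimate of performance on unseen data
```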
2
Q

Example classification

A
3
Q

The k-Nearest Neighbors Classifier

A
  • Identify the neighbors of the new record that we wish to classify
    • I.e., the k records in the training dataset that are most similar / closest to the new record
  • Use these neighbors (i.e., these k records) to classify the new record into a class
  • Assign the new record to the predominant class among these neighbors (sketched below)
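A from-scratch sketch of these two steps (the function name `classify_knn` and the use of Euclidean distance are my own illustration):

```python
import numpy as np
from collections import Counter

def classify_knn(X_train, y_train, new_record, k=3):
    """Classify new_record by majority vote among its k nearest training records."""
    # Step 1: Euclidean distance from the new record to every training record.
    distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
    # Step 2: take the k closest records and return their predominant class.
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0]])
y = np.array(["A", "A", "B"])
print(classify_knn(X, y, np.array([1.5, 1.5])))  # "A" wins the vote 2 to 1
```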
4
Q

Steps in the k-Nearest Neighbors Classifier

A
  • Determining the item’s neighbors
  • Choosing the number of neighbors, i.e., the value of k
  • Computing classification (for a categorical outcome) or prediction (for a numerical outcome)
5
Q

Determine record’s neighbors

A
6
Q

Euclidean Distance

A
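The original answer (likely a formula image) did not survive extraction. The standard definition, for two records $x$ and $y$ measured on $p$ predictors:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}$$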
7
Q

Example KNN quiz

A
8
Q

Euclidean Distance

A
  • Highly scale dependent
    • I.e., the units of one variable can have a huge influence on the results, e.g., a change from cents to dollars
  • Solution is normalizing the values before computing distances
  • This converts all measurements to the same scale
  • → Subtract the average and divide by the standard deviation

Example

  • Average sales amount across the 22 utilities is 8914.045
  • Standard deviation is 3549.984
  • Sales for Arizona Public Service is 9077
  • Normalized sales is (9077 - 8914.045) / 3549.984 = 0.046
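A quick check of this arithmetic (all figures taken from the example above):

```python
# z-score normalization: subtract the average, divide by the standard deviation
mean_sales = 8914.045    # average sales across the 22 utilities
sd_sales = 3549.984      # standard deviation of sales
arizona_sales = 9077     # sales for Arizona Public Service

z = (arizona_sales - mean_sales) / sd_sales
print(round(z, 3))  # 0.046
```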
9
Q

Euclidean distance pt. 2

A
10
Q

Choosing the value for k

A
  • k too low: may fit to the noise in the dataset
  • k too high: miss out on the method’s ability to capture the local structure in the dataset, one of its main advantages
  • k equal to the number of records in the training dataset: every record is assigned to the majority class in the training data
  • Balanced choice depends on the nature of the data
    • E.g., the more complex and irregular the structure of the data, the lower the optimum value of k
  • Typically: values of k fall in the range 1 to 20
  • Use odd values of k to avoid ties (in two-class problems)
11
Q

How is k chosen?

A
  • Use the training data to classify the records in the testing dataset, trying different values for k
  • Compute error rates for the various choices of k
  • Choose the k with the best classification performance

BUT

  • The testing dataset is now used as part of the training process (to set k)
  • We need a new dataset to evaluate the model’s performance on data that it did not see (see the sketch below)
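A sketch of this procedure in scikit-learn (the dataset, the three-way split, and accuracy as the performance measure are my choices; the lecture may use a different error measure):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Two splits: the validation set tunes k; the final test set stays untouched
# so the chosen model can still be judged on data it truly did not see.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_valid, y_valid)
          for k in range(1, 21)}              # typical range for k (see previous card)
best_k = max(scores, key=scores.get)          # k with the best classification performance

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))    # honest estimate on unseen data
```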
12
Q

Numerical outcome

A
  • The algorithm can be extended to predict continuous values instead of categorical values
  • The first step remains unchanged, i.e., determine the neighbors by computing distances
  • The second step, determining the class through majority voting, must be modified
    • Instead, determine the prediction by taking the average outcome value of the k nearest neighbors (sketched below)
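A minimal sketch of this change (scikit-learn's KNeighborsRegressor averages the neighbors' outcomes by default; the tiny dataset is made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.5, 2.5, 3.5, 10.0])

# Prediction = average outcome value of the k nearest neighbors
knr = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(knr.predict([[2.0]]))  # mean of 1.5, 2.5, 3.5 -> [2.5]
```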
13
Q

Advantages

A
  • Simplicity of the method
  • Lack of parametric assumptions

Performs surprisingly well, especially when

  • There is a large enough training set present
  • Each class is characterized by multiple combinations of predictor values
14
Q

Shortcomings

A
  1. Computing the nearest neighbors can be time-consuming

Solutions: reduce the time taken to compute distances by working on fewer dimensions, generated using dimension-reduction techniques; speed up the identification of the nearest neighbors using specialized data structures (both sketched below)

  2. For every record to be predicted, we compute its distance from the entire set of training records only at the time of prediction. Known as a “lazy learner”

→ This behavior prohibits using this algorithm for real-time prediction of a large number of records simultaneously

  3. The number of records required in the training set to qualify as large increases exponentially with the number of predictors. Known as the “curse of dimensionality”

Solution: Reduce the number of predictors
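A sketch of the two speed-related fixes in scikit-learn (PCA as the dimension-reduction technique and a KD-tree as the specialized data structure are my choices; the lecture does not prescribe specific tools):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Fewer dimensions -> cheaper distance computations;
# a KD-tree index -> faster identification of the nearest neighbors.
model = make_pipeline(
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree"),
)
model.fit(X, y)
```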

15
Q

Classification with Python

A
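The original answer did not survive extraction. A sketch of the full workflow from this lecture in scikit-learn (the wine dataset and the 70/30 split are placeholders):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# StandardScaler performs the normalization from the earlier cards
# (subtract the average, divide by the standard deviation) before distances are computed.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out records
```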