Lecture 7 - K-Nearest Neighbors Flashcards
Classification
- Examine data for which the classification is unknown, using data with a known outcome
- Goal is to predict that classification
- Learn classification from the training data
- Relationship between predictors and outcome
- Finally, apply the selected model to the testing data, which also includes known outcomes
- Measure how well it will do on unknown data
Example classification
The k-Nearest Neighbors Classifier
- Identify the neighbours of the new record that we wish to classify
- I.e., the k records in the training dataset that are similar to / close by the new record
- Use the neighbours (i.e., these k records) to classify the new record into a class
- Assign the new record to the predominant class among these neighbors
Steps in the k-Nearest Neighbors Classifier
- Determining the item’s neighbors
- Choosing the number of neighbors, i.e., the value of k
- Computing classification (for a categorical outcome) or prediction (for a numerical outcome)
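These three steps can be sketched in a few lines of Python; the function and the toy records below are illustrative assumptions, not the lecture's own example:

```python
import numpy as np

def knn_classify(X_train, y_train, new_record, k=3):
    """Classify one new record by majority vote among its k nearest training records."""
    # Step 1: compute the Euclidean distance from the new record to every training record
    distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
    # Step 2: take the k closest training records (the neighbours)
    nearest = np.argsort(distances)[:k]
    # Step 3: assign the predominant class among these neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training data with two predictors and a categorical outcome
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(["owner", "owner", "non-owner", "non-owner"])
print(knn_classify(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> owner
```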
Determine record’s neighbors
Euclidean Distance
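For reference (standard definition, presumably the formula shown on this slide), the Euclidean distance between two records x = (x1, …, xp) and u = (u1, …, up) measured on p predictors is:

$$ d(x, u) = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2} $$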
Example KNN quiz
Euclidean Distance
- Highly scale dependent
- I.e., units of one variable can have a huge influence on the results, for example from cents to dollars
- Solution is normalising the values before computing distances
- This converts all measurements to the same scale
- → Subtract average and divide by standard deviation
Example
- Average sales amount across the 22 utilities is 8914.045
- Standard deviation is 3549.984
- Sales for Arizona Public Service is 9077
- Normalized sales is (9077-8914.045)/3549.984 = 0.046
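A quick check of this computation in Python (values taken from the example above):

```python
# Z-score normalisation: subtract the average and divide by the standard deviation
mean_sales = 8914.045   # average sales across the 22 utilities
std_sales = 3549.984    # standard deviation of sales
arizona_sales = 9077    # sales for Arizona Public Service

normalized_sales = (arizona_sales - mean_sales) / std_sales
print(round(normalized_sales, 3))  # 0.046
```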
Euclidean distance pt. 2
Choosing the value for k
- k is too low: may be fitting to the noise in the dataset
- k is too high: miss out on the method’s ability to capture the local structure in the dataset, one of its main advantages
- k is the number of records in the training dataset: assign all records to the majority class in the training data
- Balanced choice depends on the nature of the data
- E.g., the more complex and irregular the structure of the data, the lower the optimum value of k
- Typically: values of k fall in the range 1 to 20
- Use odd numbers to avoid ties
How is k chosen?
- Use the training data to classify the records in the testing dataset, trying different values for k
- Compute error rates for the various choices of k
- Choose the k with the best classification performance (see the sketch below)
BUT
- Testing dataset is now used as part of the training process (to set k)
- We need a new dataset to evaluate the model performance on data that it did not see
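A possible sketch of this search over k with scikit-learn; the dataset, split, and range of k are illustrative assumptions, and per the caveat above the final evaluation should use a separate dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Try odd values of k in the typical range 1..20 and record the error rate
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate = 1 - knn.score(X_test, y_test)         # score() is accuracy
    print(f"k={k:2d}  error rate={error_rate:.3f}")
```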
Numerical outcome
- Algorithm can be extended to predict continuous values, instead of categorical values
- First step remains unchanged, i.e., determine neighbours by computing distances
- Second step, i.e., determining the class through majority voting, must be modified
- Determine the prediction by taking the average outcome value of the k-nearest neighbors
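A minimal sketch of the numerical variant, assuming scikit-learn's KNeighborsRegressor and made-up data; the prediction is simply the average outcome of the k nearest neighbours:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one predictor, numerical outcome
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
# The prediction is the average outcome of the 3 nearest neighbours (x = 2.0, 3.0, 4.0)
print(knn.predict([[2.6]]))  # -> about 2.97
```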
Advantages
- Simplicity of the method
- Lack of parametric assumptions
Performs surprisingly well especially when
- There is a large enough training set present
- Each class is characterised by multiple combinations of predictor values
Shortcomings
- Computing the nearest neighbours can be time-consuming
Solutions: reduce the time taken to compute distances by working in fewer dimensions, generated using dimension reduction techniques; speed up identification of the nearest neighbours using specialized data structures (see the sketch after this list)
- For every record to be predicted, we compute its distance from the entire set of training records only at the time of prediction. Known as a “lazy learner”
→ This behaviour prohibits using this algorithm for real-time prediction of a large number of records simultaneously
- Number of records required in the training set to qualify as large increases exponentially with the number of predictors. Known as the “curse of dimensionality”
Solution: Reduce the number of predictors
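To illustrate the “specialized data structures” solution mentioned above, one option is scikit-learn's KDTree, which indexes the training records once so that neighbour lookups avoid scanning the whole training set; the data below is synthetic:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 5))   # synthetic training records

tree = KDTree(X_train)                   # build the index once, up front
dist, ind = tree.query(rng.normal(size=(1, 5)), k=5)   # fast lookup of the 5 nearest neighbours
print(ind)                               # indices of the nearest training records
```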
Classification with Python
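A minimal sketch of a typical scikit-learn workflow for k-NN classification (the dataset, k, and split are illustrative assumptions), combining normalisation with the classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Normalise the predictors (subtract mean, divide by std) and fit a 5-nearest-neighbours classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```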