Data Mining - Chapter 7 (k-Nearest Neighbors) Flashcards
What is k-Nearest Neighbor?
It is an algorithm that can be used for classification of a categorical outcome or prediction of a numerical outcome.
-> It works by finding the most similar records in the training data and assigning the new record to their group.
What is voting in k-nearest neighbor?
Given a specific number k of neighbors, you look at what their categories are. Based on the majority of their outcomes, you set the category of the new record. For example, with k = 5, if three neighbors belong to class A and two to class B, the new record is classified as A.
What kind of method is the k-nearest neighbor?
It is a nonparametric method. That means it does not make any assumptions about the form of the relationship between the dependent variable and the predictor variables.
-> It looks at similarities between the predictor values of the records in the dataset.
What is the process of the k-Nearest Neighbor?
- Determine the item's neighbors
- Choose the number of neighbors to include - choose k
- Compute the classification
How do you determine a record's neighbors?
It is based on the similarity / closeness between records, which is usually measured as the distance between records.
How do you note the distance between records?
Record i = ri
Record j = rj
Distance between the two records: dij
-> So you compute the distance between two records (rows); each record can have multiple predictor variables.
What are the properties of distances?
P1: Non-negativity: the distance is non-negative: dij ≥ 0
P2: Self-proximity: dii = 0
P3: Symmetry: dij = dji
P4: Triangle inequality: dij <= dik + dkj
(distance between any pair cannot exceed the sum of distances between the other two pairs)
What is the Euclidean distance and how does it work?
Most popular distance measure for numerical values.
dij = √((Xi1 - Xj1)^2 + (Xi2 - Xj2)^2 + … + (Xip - Xjp)^2)
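As an illustration, a minimal numpy sketch (xi and xj are made-up example records with p = 3 predictors):

import numpy as np

xi = np.array([3.0, 4.0, 1.0])  # record i
xj = np.array([1.0, 1.0, 2.0])  # record j
d_euclidean = np.sqrt(((xi - xj) ** 2).sum())  # sqrt(4 + 9 + 1) ≈ 3.74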
What are some remarks on the Euclidean distance?
- Highly scale dependent: the units of one variable can have a huge influence on the results. -> Therefore it is sometimes necessary to standardize the values first (see the sketch below).
- Sensitive to outliers, because the differences are squared.
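A minimal z-score standardization sketch with numpy (X is a made-up feature matrix; note how the second column's larger units would otherwise dominate the distance):

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # each column: mean 0, std 1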
What is the Manhattan distance?
A distance measure that is more robust to outliers than the Euclidean distance: it sums the absolute differences instead of the squared differences.
dij = |Xi1 - Xj1| + |Xi2 - Xj2| +…+ |Xip - Xjp|
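Using the same made-up records as in the Euclidean sketch above:

import numpy as np

xi = np.array([3.0, 4.0, 1.0])
xj = np.array([1.0, 1.0, 2.0])
d_manhattan = np.abs(xi - xj).sum()  # 2 + 3 + 1 = 6.0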
What happens if K is too low?
We may be fitting to the noise (random variation) in the dataset. This can lead to overfitting.
What happens if K is too high?
You miss out on capturing the local structure of the data -> you average over too many records, which reduces the meaning of the classifications.
What happens if K = n ?
Every new record is assigned to the majority class of the entire dataset -> oversmoothing.
How do you decide on the value of k in a Python model?
- Build a k-nearest neighbors model for each value of k you want to try.
- For every model, compute the accuracy score: accuracy_score(test_y, model.predict(test_x)).
- Pick the k with the highest accuracy (see the sketch below).
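A minimal sketch with scikit-learn, assuming the data has already been split (train_X, train_y, test_X, test_y are placeholder names):

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Note: predictors should typically be standardized first, because the
# distance measure is scale dependent.
# Try k = 1..14 and report each model's accuracy on the holdout set.
for k in range(1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    acc = accuracy_score(test_y, model.predict(test_X))
    print(f"k={k}: accuracy={acc:.3f}")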
What changes if you use K nearest neighbors for numerical outcomes?
Everything stays the same, except that you do not determine the class through majority voting; instead, you take the average outcome value of the k-nearest neighbors as the prediction.
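A minimal sketch, assuming scikit-learn and the same placeholder data names as in the classification sketch above:

from sklearn.neighbors import KNeighborsRegressor

# With the default uniform weights, the prediction is the average outcome
# value of the k nearest neighbors.
model = KNeighborsRegressor(n_neighbors=5).fit(train_X, train_y)
predictions = model.predict(test_X)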