L2 - k-NN Flashcards

1
Q

What variables should always be excluded for machine learning?

A

ID variables (unique identifiers carry no predictive information)

2
Q

Define k-NN

A
"Birds of a feather flock together"
Nearest neighbour classifier: classifies an unlabelled example by assigning it the class of the most similar labelled examples
3
Q

How can we remove a variable?

A

Use subset() or dplyr's select()
Put a - sign in front of a column name to drop that column
Then store the result in a new object

4
Q

Name the fundamental steps in machine learning.

A

Exploring, preparing and transforming data (standardisation or normalisation; outcome variable)
Training and testing the model
Evaluating model performance
Improving model performance

5
Q

What if the outcome variable is a character string?

A

If the outcome to be predicted is a character data type, convert it to a factor

6
Q

Why use a normalisation function?

A

Rescales the features onto one common scale, so that features with large ranges do not dominate the distance calculation

7
Q

How do you create a function for normalisation?

A

Define it with function() and return()
The function takes a numeric vector as its input
Assign the newly created function to an object
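
The card describes R's function() and return(). As a rough illustration of the same idea, here is a minimal sketch of min-max normalisation in Python. The name `normalise` is hypothetical, and the handling of a constant vector is an assumption the card does not cover.

```python
def normalise(values):
    """Min-max normalisation: rescale a numeric vector to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant vector maps to all zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# The smallest value becomes 0, the largest becomes 1.
scaled = normalise([1, 2, 3, 4, 5])
```

The function is stored under a name so it can be reused on any numeric vector.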

8
Q

How is the normalisation function applied to the dataset?

A
On its own, the function handles one vector (one feature column) at a time
Use lapply() to apply it to every selected column of the data frame
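
The card refers to R's lapply(). The sketch below mimics that step in plain Python by applying a min-max function to every column in turn. The `normalise` helper, the dictionary layout and the third data row are assumptions made for illustration; the first two rows reuse the apple and grape values from a later card.

```python
def normalise(values):
    """Min-max normalisation of one numeric column (assumes the values are not all equal)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


# Each key is a feature column; each list holds one value per row.
features = {
    "sweetness": [5, 9, 1],
    "crunch": [9, 4, 4],
}

# Apply the function to every column, one column at a time.
normalised = {name: normalise(column) for name, column in features.items()}
```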
9
Q

What are the steps for training and testing a k-NN model?

A

Split data into two portions: (1) training set (2) test set

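Remove the outcome variable (label) from both sets and store the labels separately (n.b. make sure the labels come from the same rows as the features)

Use knn() from the class package
Cross-tabulate predicted against actual labels with CrossTable() from the gmodels package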
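
The steps above can be sketched end to end in plain Python. This is an illustration only, not R's knn() or CrossTable(): the function `knn_predict`, the made-up data and the choice of k = 3 are all assumptions.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, test_x, k):
    """Label each test row by majority vote among its k nearest training rows."""
    predictions = []
    for row in test_x:
        # Order the training rows by Euclidean distance to the test row.
        by_distance = sorted(
            zip(train_x, train_y),
            key=lambda pair: math.dist(pair[0], row),
        )
        nearest_labels = [label for _, label in by_distance[:k]]
        predictions.append(Counter(nearest_labels).most_common(1)[0][0])
    return predictions


# Made-up data: (sweetness, crunch) features already on a common 0-1 scale.
data = [
    ((0.9, 0.8), "fruit"), ((0.8, 0.9), "fruit"), ((0.9, 0.7), "fruit"),
    ((0.1, 0.2), "protein"), ((0.2, 0.1), "protein"), ((0.1, 0.1), "protein"),
    ((0.85, 0.85), "fruit"), ((0.15, 0.15), "protein"),
]

# 1. Split into a training set (first six rows) and a test set (last two rows).
train, test = data[:6], data[6:]

# 2. Separate the features from the labels, keeping the same rows for both.
train_x = [features for features, _ in train]
train_y = [label for _, label in train]
test_x = [features for features, _ in test]
test_y = [label for _, label in test]

# 3. Classify the test rows.
predicted = knn_predict(train_x, train_y, test_x, k=3)

# 4. Cross-tabulate actual against predicted labels.
cross_table = Counter(zip(test_y, predicted))
```

Entries of `cross_table` where the two labels match are correct predictions; any other entry is a misclassification.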
10
Q

What is the training set used for?

A

Supplies the machine learning algorithm (k-NN model) with labelled data, i.e. examples whose class is known together with their features.

11
Q

What is the testing set used for?

A

Will be used to estimate the predictive accuracy of the model

Supply the features without their labels to see how well the trained model predicts the labels for this unlabelled data

12
Q

What is the purpose of k-NN?

A

To classify unlabelled data (test set) according to how similar it is to labelled data (training set)

13
Q

How would we train a model in practice?

A

Use historic, pre-existing labelled data as the training set

Compare the features of new unlabelled data against that labelled data

14
Q

What are the training and test sets? Explain how they work.

A

The training set is used to build the k-NN model
The test set is used to estimate the predictive accuracy of the model
The training set acts as a reference library of labelled observations and their features, against which new test data is compared
Each test observation is compared with the training set, and its label is predicted from the information the training set already holds

15
Q

What are nearest neighbours?

A

The nearest neighbours are the labelled observations closest to the unlabelled one; k is the parameter for how many of them you count

The counts go towards a vote (e.g. 2 nearby fruit and 1 nearby protein means the unlabelled observation is classified as a fruit)
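
The vote can be sketched in a few lines of Python. The function name `vote` is hypothetical, and the labels are the ones from the example on this card.

```python
from collections import Counter


def vote(neighbour_labels):
    """Return the most common class label among the k nearest neighbours."""
    return Counter(neighbour_labels).most_common(1)[0][0]


# k = 3: two fruit neighbours outvote one protein neighbour.
label = vote(["fruit", "protein", "fruit"])
```

With two classes, an odd k avoids tied votes.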

16
Q

Can we use z-score standardisation?

A
Although normalization is traditionally used for kNN classification, it may not always be the most appropriate way to rescale features. Because z-score standardized values have no predefined minimum and maximum, extreme values are not compressed towards the center.
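
To make the contrast concrete, here is a minimal Python sketch of z-score standardisation. The function name `z_score` and the sample values are assumptions; the sample standard deviation (n - 1 denominator) is used, as R's sd() does.

```python
import statistics


def z_score(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]


# Made-up vector with one extreme value.
standardised = z_score([2, 4, 6, 8, 100])
```

Unlike min-max normalisation, the result is not confined to the range 0 to 1, so the extreme value stays well above the rest.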
17
Q

Explain k-NN using a simple table with labels.

A
Data sample (e.g. ingredient) | Feature/variable 1 (e.g. sweetness) | Feature/variable 2 (e.g. crunch) | Class label (food type)
18
Q

Explain k-NN using a two axis diagram?

A

Each of the two features is plotted on its own axis

Observations with similar features fall close together, so clusters can be drawn around observations of the same class

19
Q

Explain nearest neighbours with a two axis diagram?

A

A new unlabelled observation is classified by a vote among its k nearest neighbours
If the 3 nearest neighbours are all fruit then the unlabelled observation will now be classified as fruit

20
Q

Explain distance/similarity measures? What is the best method?

A

Different measures navigate multidimensional spaces differently
The best method is still debated in academia; select the distance measure that is appropriate for the data.

21
Q

What is the Euclidean distance formula? Explain how Euclidean distance works with the formula?

A

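dist(p, q) = √((p1 - q1)^2 + (p2 - q2)^2)
p1 and p2 are the two feature values of one observation; q1 and q2 are the same two features for the other observation
Each pair of observations has a separate calculation
Apple (Sweet 5 and Crunch 9) Grape (Sweet 9 and Crunch 4)
√((5-9)^2 + (9-4)^2) = √(16 + 25) = √41 ≈ 6.4
Each feature value is subtracted from the matching feature value of the other observation; the differences are squared, summed and square-rooted
This gives the distance between the two observations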
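
The same calculation can be checked with a short Python sketch. The function name `euclidean_distance` is hypothetical; the apple and grape values are the ones on this card.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared differences between matching features."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


apple = (5, 9)  # (sweetness, crunch)
grape = (9, 4)

# (5 - 9)^2 + (9 - 4)^2 = 16 + 25 = 41, so the distance is sqrt(41).
distance = euclidean_distance(apple, grape)
```

Because the function loops over matching features, it works unchanged with more than two features.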

22
Q

Why does distance matter with k-NN?

A

The smaller the distance computed by the distance measure, the nearer (more similar) the neighbour
Closest neighbours get the “vote”

23
Q

Pros and cons of k-NN

A

Pros:
+ Simple and effective - often used as a baseline against which improved models are judged
+ Makes no assumptions about the distribution (non-parametric)
+ Fast training phase
Cons:
- Requires selection of appropriate k
- Slow classification (testing phase)
- Nominal features require additional processing
- Euclidean distance does not work on categorical data (it must be dummy coded first)

24
Q

What is scaling in k-NN?

A

Part of data processing
Put the data on a common scale
Min-max normalisation
Z-score standardisation

25
Q

How are dummy variables coded?

A

Binary - 0 and 1
Nominal - a combination of 0/1 variables (n - 1 dummy variables for n categories)
Ordinal - ordered numbers (but only if the intervals between categories are equal)
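
As an illustration of the n - 1 rule for nominal features, here is a minimal Python sketch. The function `dummy_code`, the example feature and the choice of reference category are all assumptions.

```python
def dummy_code(values, reference):
    """Code a nominal feature as n - 1 columns of 0/1, leaving out the reference category."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical nominal feature with three categories, so two dummy variables.
temperature = ["hot", "medium", "cold", "hot"]
dummies = dummy_code(temperature, reference="cold")
```

A row that is 0 in every dummy column belongs to the reference category, which is why only n - 1 columns are needed.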