L2 - k-NN Flashcards
What variables should always be excluded for machine learning?
ID variables
Define k-NN
"Birds of a feather flock together." Nearest neighbour classifiers classify an unlabelled example by assigning it the class of the most similar labelled examples
How can we remove a variable?
Use subset() or select()
A minus sign can be used to remove one column
Then store the result in a new object
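A minimal sketch in R (the data frame and column names are illustrative):

```r
# Toy data frame with an ID column that should be excluded before modelling
df <- data.frame(id = 1:3, sweetness = c(5, 9, 2), crunch = c(9, 4, 8))

# Base R: a minus sign on the column index removes that column
df_clean <- df[, -1]

# subset() accepts a minus sign on the column name; store in a new object
df_clean2 <- subset(df, select = -id)

# dplyr's select() works the same way: dplyr::select(df, -id)
```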
What are the fundamental steps in machine learning?
Exploring, preparing and transforming data (standardisation or normalisation; outcome variable)
Training and testing the model
Evaluating model performance
Improving model performance
What if the outcome variable is character strings?
If the outcome to be predicted is a character data type, convert it to a factor
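A small sketch of the conversion (the variable name and level codes are illustrative):

```r
# Character outcome -> factor with informative labels
diagnosis <- c("B", "M", "B")
diagnosis <- factor(diagnosis, levels = c("B", "M"),
                    labels = c("Benign", "Malignant"))
```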
Why use a normalisation function?
Rescales the features so they are on one common/standard scale
How to create a function for normalisation?
Use function() and return()
The function takes a vector as its input
Save the newly created function in an object
How will the normalised function be applied to the dataset?
Applied directly, the function would handle one vector (column) at a time. Use lapply() to apply it across a wider range of selected columns
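The two cards above can be sketched together in R (data frame and column names are illustrative):

```r
# Min-max normalisation: rescales a vector onto the [0, 1] range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

df <- data.frame(sweetness = c(5, 9, 2), crunch = c(9, 4, 8))

# lapply() applies normalize() to each column; rewrap the list as a data frame
df_n <- as.data.frame(lapply(df, normalize))
```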
What are the steps for training, testing and machine learning?
Split data into two portions: (1) training set (2) test set
Remove outcome variable (label) from both sets (n.b. make sure you subset the same rows and columns)
Use knn(); visualise the results with CrossTable()
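The steps above can be sketched with knn() from the class package (the toy data and split sizes are illustrative; gmodels::CrossTable() would give a fuller breakdown than table()):

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(1)
# Toy data: two numeric features and a two-class label
df <- data.frame(sweetness = runif(150, 0, 10), crunch = runif(150, 0, 10))
label <- factor(ifelse(df$sweetness > df$crunch, "fruit", "vegetable"))

# Split into training and test portions; keep the labels separate
train <- df[1:100, ]
test  <- df[101:150, ]

# knn() returns a factor of predicted labels for the test set
pred <- knn(train = train, test = test, cl = label[1:100], k = 5)

# Compare predictions with the true test labels
# (CrossTable(x = label[101:150], y = pred, prop.chisq = FALSE) in gmodels)
table(actual = label[101:150], predicted = pred)
```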
What is the training set used for?
Supply the machine learning algorithm (KNN model) with labelled data, which includes the associated data characteristics.
What is the testing set used for?
Will be used to estimate the predictive accuracy of the model
Supply unlabelled data with its features to see how well the trained model can predict the labels for the unlabelled data
What is the purpose of k-NN?
To measure how similar unlabelled data (the test set) is to labelled data (the training set)
How could we train a set in reality?
Compare the new unlabelled data to pre-existing labelled data
Use historic labelled data as the reference against which the features of new unlabelled data are compared
What are the test and train sets? Explain how they work.
Training set is used to build the kNN model
Test dataset is used to estimate the predictive accuracy of the model
The training set acts as a library or resource of observations and their characteristics; it is the reference for the new test data
The test data is then supplied to the model, which predicts labels based on the information the training set already holds
What are nearest neighbours?
A parameter (k) for the number of nearby labelled observations that are counted
Each neighbour's class counts as a vote (e.g. being near 2 fruits and 1 protein means the unlabelled observation is classified as a fruit)
Can we use a z normalisation?
Yes. Although min-max normalisation is traditionally used for kNN classification, it may not always be the most appropriate way to rescale features. Because z-score standardised values have no predefined minimum and maximum, extreme values are not compressed towards the centre.
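A minimal sketch of z-score standardisation using base R's scale() (the data frame is illustrative):

```r
# z-score standardisation: (x - mean(x)) / sd(x); scale() does this per column
df <- data.frame(sweetness = c(5, 9, 2), crunch = c(9, 4, 8))
df_z <- as.data.frame(scale(df))

# Each column now has mean 0 and standard deviation 1,
# but no fixed minimum or maximum, unlike min-max normalisation
```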
Explain k-NN using a simple table with labels?
Data sample (e.g. ingredient) | Feature 1 (e.g. sweetness) | Feature 2 (e.g. crunch) | Class label (food type)
Explain k-NN using a two axis diagram?
The two features will be on either axis
Clusters are drawn around observations based on their features
Explain nearest neighbours with a two axis diagram?
The new unlabelled data point is classified according to the vote of its k nearest neighbours
If the 3 nearest neighbours are all fruit then the unlabelled observation will now be classified as fruit
Explain distance/similarity measures? What is the best method?
Different measures navigate multidimensional spaces differently
Still debated in academia; select a distance measure based on its appropriateness for the data
What is the Euclidean distance formula? Explain how Euclidean distance works with the formula?
√((p1-q1)^2 + (p2-q2)^2)
p1 and p2 are the features of the first observation; q1 and q2 are the corresponding features of the second
Each observation has a separate calculation
Apple (sweetness 5, crunch 9); grape (sweetness 9, crunch 4)
√((5-9)^2 + (9-4)^2) = √(16 + 25) = √41 ≈ 6.40
Each feature is subtracted from the corresponding feature of the other observation
This gives the distance between the two observations
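The worked example above in R (the ingredient vectors are illustrative):

```r
# Euclidean distance between apple (5, 9) and grape (9, 4)
apple <- c(sweetness = 5, crunch = 9)
grape <- c(sweetness = 9, crunch = 4)

# sqrt((5-9)^2 + (9-4)^2) = sqrt(41)
d <- sqrt(sum((apple - grape)^2))
```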
Why does distance matter with k-NN?
The smaller the computed distance, the nearer the neighbour
Closest neighbours get the “vote”
Pros and cons of k-NN
Pros:
+ Simple and effective - used as a baseline (used to improve models)
+ Makes no assumptions about the distribution (non-parametric)
+ Fast training phase
Cons:
- Requires selection of appropriate k
- Slow classification (testing phase)
- Nominal features require additional processing
- Euclidean distance does not work on categorical data (must be coded)
What is scaling in k-NN?
Part of data pre-processing
Puts the data on a common scale
Min-max normalisation
Z-score standardisation
How to code dummy variables
Binary - 0 and 1
Nominal - combination of 0/1 dummy variables (n categories need n - 1 dummies)
Ordinal - ordered numbers (but only used if intervals between categories are equal)
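A sketch of nominal dummy coding with base R's model.matrix() (the category names are illustrative):

```r
# A nominal feature with n = 3 categories
food <- data.frame(type = factor(c("fruit", "protein", "vegetable")))

# model.matrix() builds 0/1 dummies; dropping the intercept column leaves
# n - 1 = 2 dummy columns, with the first level as the reference category
dummies <- model.matrix(~ type, data = food)[, -1]
```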