L2 - k-NN Flashcards
What variables should always be excluded for machine learning?
ID variables
Define k-NN
"Birds of a feather flock together": a nearest neighbour classifier labels an unlabelled example by assigning it the class of the most similar labelled examples
How can we remove a variable?
Use subset() or select() (select() comes from the dplyr package)
Can use a - sign to remove one column
Then store in new object
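A minimal sketch of dropping an ID column in base R. The data frame `patients` and its column names are invented for illustration; with dplyr loaded, `select(patients, -id)` would do the same job.

```r
# Hypothetical data frame; the column names are assumptions for illustration
patients <- data.frame(
  id = 1:4,
  radius = c(12.1, 20.3, 14.8, 17.5),
  diagnosis = c("B", "M", "B", "M")
)

# subset() with a minus sign drops the named column;
# the result is stored in a new object
patients_no_id <- subset(patients, select = -id)

# Equivalent base R indexing: drop the first column by position
patients_no_id2 <- patients[, -1]

names(patients_no_id)
```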
Name the fundamental steps in machine learning.
Exploring, preparing and transforming data (standardising or normalising the features; preparing the outcome variable)
Training and testing the model
Evaluating model performance
Improving model performance
What if the outcome variable is stored as character strings?
Convert it to a factor, because knn() expects the class labels as a factor
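A short sketch of the conversion, assuming a hypothetical outcome vector coded "B" and "M"; the level names and labels are illustrative choices, not taken from the lecture.

```r
# Hypothetical outcome column stored as character strings
diagnosis <- c("B", "M", "B", "B", "M")

# Convert to a factor; levels and labels are chosen here for illustration
diagnosis_f <- factor(diagnosis,
                      levels = c("B", "M"),
                      labels = c("Benign", "Malignant"))

# Count how many examples fall in each class
table(diagnosis_f)
```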
Why use a normalisation function?
Rescales the features onto one common scale, so that features with large ranges do not dominate the distance calculation
How to create a function for normalisation?
Define it with function() and pass back the result with return()
The function takes a numeric vector as its argument
Save the newly created function into an object
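One possible version, assuming min-max normalisation (a common choice, though the card does not say which formula was used). The function name `normalise` is an assumption.

```r
# Min-max normalisation: rescales a numeric vector to the range 0 to 1
normalise <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Two vectors on different scales end up with the same rescaled values
normalise(c(1, 2, 3, 4, 5))
normalise(c(10, 20, 30, 40, 50))
```

Testing the function on small vectors like these is a quick way to check it behaves as intended before applying it to real data.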
How will the normalisation function be applied to the dataset?
On its own, the function takes one vector (one column) at a time
Use lapply() to apply it to every selected column, then convert the result back to a data frame with as.data.frame()
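A sketch of applying the function across columns. The `features` data frame and the min-max `normalise` function are assumptions for illustration.

```r
# Assumed min-max normalisation function
normalise <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical numeric features on very different scales
features <- data.frame(
  radius = c(10, 15, 20),
  area = c(300, 700, 1100)
)

# lapply() calls normalise() once per column and returns a list;
# as.data.frame() turns that list back into a data frame
features_n <- as.data.frame(lapply(features, normalise))

# Every column should now run from 0 to 1
summary(features_n)
```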
What are the steps for training and testing a k-NN model?
Split data into two portions: (1) training set (2) test set
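Remove the outcome variable (label) from both sets and store the labels separately (n.b. make sure the labels are subset with the same rows as the features)
Use knn() to classify the test set
Compare predicted and actual labels with CrossTable()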
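The steps above can be sketched as follows. The built-in `iris` data stands in for the lecture dataset, and the 100/50 split, the seed and `k = 5` are arbitrary choices. `knn()` comes from the `class` package, which is usually installed with R. Base `table()` is used here so the sketch has no extra dependencies; the `CrossTable()` call from the `gmodels` package is shown as a comment.

```r
library(class)  # provides knn()

# Assumed min-max normalisation function
normalise <- function(x) (x - min(x)) / (max(x) - min(x))

# Features are normalised; the outcome (Species) is kept separately
features <- as.data.frame(lapply(iris[, 1:4], normalise))
labels <- iris$Species

# Random split: 100 rows for training, the remaining 50 for testing
set.seed(1)
train_rows <- sample(nrow(iris), 100)

train_x <- features[train_rows, ]
test_x <- features[-train_rows, ]

# Labels use the same row indices as the features
train_labels <- labels[train_rows]
test_labels <- labels[-train_rows]

# knn() returns a factor of predicted classes for the test rows
pred <- knn(train = train_x, test = test_x, cl = train_labels, k = 5)

# Actual vs predicted classes, and the share predicted correctly
table(actual = test_labels, predicted = pred)
mean(pred == test_labels)

# With gmodels installed, a richer table would be:
# gmodels::CrossTable(x = test_labels, y = pred, prop.chisq = FALSE)
```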
What is the training set used for?
Supply the machine learning algorithm (KNN model) with labelled data, which includes the associated data characteristics.
What is the testing set used for?
Used to estimate the predictive accuracy of the model
Supply the features without their labels to see how well the trained model can predict the labels for the unlabelled data
What is the purpose of k-NN?
Classifying unlabelled data (testing set) by how similar it is to labelled data (training set)
How could we apply k-NN in reality?
Compare the new unlabelled data to pre-existing labelled data
Using historic labelled data to compare the features of unlabelled data
What are the training and test sets? Explain how they work.
Training set is used to build the kNN model
Test dataset is used to estimate the predictive accuracy of the model
The training set acts as a library of labelled examples and their features, which serves as the reference for the new test data
Each test example is compared against this training set, and its label is predicted from the information the training set already holds
What are nearest neighbours?
The labelled examples closest to the unlabelled example; the parameter k sets how many of them are counted
Each neighbour counts as one vote (e.g. being near 2 fruits and 1 protein means the unlabelled example is classified as a fruit)
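The vote can be worked through by hand. This sketch assumes Euclidean distance and k = 3; the foods and their two features (sweetness, crunchiness) are made-up values chosen so that the three nearest neighbours are 2 fruits and 1 protein.

```r
# Hypothetical labelled foods with two made-up features
train_x <- data.frame(
  sweetness = c(10, 9, 1, 2, 8),
  crunchiness = c(9, 5, 1, 2, 7)
)
train_labels <- c("fruit", "fruit", "protein", "protein", "fruit")

# The unlabelled example to classify
new_food <- c(sweetness = 6, crunchiness = 4)

# Euclidean distance from the new food to every labelled food
distances <- sqrt((train_x$sweetness - new_food[["sweetness"]])^2 +
                  (train_x$crunchiness - new_food[["crunchiness"]])^2)

# Take the k closest labelled examples and count their labels
k <- 3
nearest <- order(distances)[1:k]
votes <- table(train_labels[nearest])

# The label with the most votes wins
predicted <- names(votes)[which.max(votes)]
predicted
```

An odd k is a common choice for two-class problems because it avoids tied votes.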