L2 - k-NN Flashcards
What variables should always be excluded for machine learning?
ID variables
Define k-NN
"Birds of a feather flock together." Nearest neighbour classifiers classify an unlabelled example by assigning it the class of the most similar labelled examples
How can we remove a variable?
Use subset() or select()
A minus sign can be used to remove one column
Then store the result in a new object
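A minimal sketch in R (the data frame and column names are illustrative):

```r
# Toy data frame with an ID column that should be excluded before modelling
df <- data.frame(id = 1:3, sweetness = c(5, 9, 2), crunch = c(9, 4, 8))

# Base R: a minus sign on the column index removes that column
df_clean <- df[, -1]

# subset() accepts a minus sign on the column name; store in a new object
df_clean2 <- subset(df, select = -id)

# dplyr's select() works the same way: dplyr::select(df, -id)
```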
What are the fundamental steps in machine learning?
Exploring, preparing and transforming data (standardisation or normalisation; outcome variable)
Training and testing the model
Evaluating model performance
Improving model performance
What if the outcome variable is character strings?
If the outcome to be predicted is a character data type, convert it to a factor
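A small sketch of the conversion (the variable name and level codes are illustrative):

```r
# Character outcome -> factor with informative labels
diagnosis <- c("B", "M", "B")
diagnosis <- factor(diagnosis, levels = c("B", "M"),
                    labels = c("Benign", "Malignant"))
```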
Why use a normalisation function?
Rescales the features so they are on one common/standard scale
How to create a function for normalisation?
Use function() and return()
The function takes a vector as its input
Save the newly created function in an object
How will the normalised function be applied to the dataset?
Applied directly, the function would handle one vector (column) at a time. Use lapply() to apply it across a wider range of selected columns
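The two cards above can be sketched together in R (data frame and column names are illustrative):

```r
# Min-max normalisation: rescales a vector onto the [0, 1] range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

df <- data.frame(sweetness = c(5, 9, 2), crunch = c(9, 4, 8))

# lapply() applies normalize() to each column; rewrap the list as a data frame
df_n <- as.data.frame(lapply(df, normalize))
```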
What are the steps for training, testing and machine learning?
Split data into two portions: (1) training set (2) test set
Remove outcome variable (label) from both sets (n.b. make sure you subset the same rows and columns)
Use knn(); visualise the results with CrossTable()
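The steps above can be sketched with knn() from the class package (the toy data and split sizes are illustrative; gmodels::CrossTable() would give a fuller breakdown than table()):

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(1)
# Toy data: two numeric features and a two-class label
df <- data.frame(sweetness = runif(150, 0, 10), crunch = runif(150, 0, 10))
label <- factor(ifelse(df$sweetness > df$crunch, "fruit", "vegetable"))

# Split into training and test portions; keep the labels separate
train <- df[1:100, ]
test  <- df[101:150, ]

# knn() returns a factor of predicted labels for the test set
pred <- knn(train = train, test = test, cl = label[1:100], k = 5)

# Compare predictions with the true test labels
# (CrossTable(x = label[101:150], y = pred, prop.chisq = FALSE) in gmodels)
table(actual = label[101:150], predicted = pred)
```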
What is the training set used for?
Supply the machine learning algorithm (KNN model) with labelled data, which includes the associated data characteristics.
What is the testing set used for?
Will be used to estimate the predictive accuracy of the model
Supply unlabelled data with its features to see how well the trained model can predict the labels for the unlabelled data
What is the purpose of k-NN?
To measure how similar unlabelled data (the test set) is to labelled data (the training set)
How could we train a set in reality?
Compare the new unlabelled data to pre-existing labelled data
Use historic labelled data as the reference against which the features of new unlabelled data are compared
What are the test and train sets? Explain how they work.
Training set is used to build the kNN model
Test dataset is used to estimate the predictive accuracy of the model
The training set acts as a library or resource of observations and their characteristics; it is the reference for the new test data
The test data is then supplied to the model, which predicts labels based on the information the training set already holds
What are nearest neighbours?
A parameter (k) for the number of nearby labelled observations that are counted
Each neighbour's class counts as a vote (e.g. being near 2 fruits and 1 protein means the unlabelled observation is classified as a fruit)
Can we use a z normalisation?
Yes. Although min-max normalisation is traditionally used for kNN classification, it may not always be the most appropriate way to rescale features. Because z-score standardised values have no predefined minimum and maximum, extreme values are not compressed towards the centre.
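A minimal sketch of z-score standardisation using base R's scale() (the data frame is illustrative):

```r
# z-score standardisation: (x - mean(x)) / sd(x); scale() does this per column
df <- data.frame(sweetness = c(5, 9, 2), crunch = c(9, 4, 8))
df_z <- as.data.frame(scale(df))

# Each column now has mean 0 and standard deviation 1,
# but no fixed minimum or maximum, unlike min-max normalisation
```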
Explain k-NN using a simple table with labels?
Data sample (e.g. ingredient) | Feature 1 (e.g. sweetness) | Feature 2 (e.g. crunch) | Class label (food type)
Explain k-NN using a two axis diagram?
The two features will be on either axis
Clusters are drawn around observations based on their features
Explain nearest neighbours with a two axis diagram?
The new unlabelled data point is classified according to the vote of its k nearest neighbours
If the 3 nearest neighbours are all fruit then the unlabelled observation will now be classified as fruit
Explain distance/similarity measures? What is the best method?
Different measures navigate multidimensional spaces differently
Still debated in academia; select a distance measure based on its appropriateness for the data
What is the Euclidean distance formula? Explain how Euclidean distance works with the formula?
√((p1-q1)^2 + (p2-q2)^2)
p1 and p2 are the features of the first observation; q1 and q2 are the corresponding features of the second
Each observation has a separate calculation
Apple (sweetness 5, crunch 9); grape (sweetness 9, crunch 4)
√((5-9)^2 + (9-4)^2) = √(16 + 25) = √41 ≈ 6.40
Each feature is subtracted from the corresponding feature of the other observation
This gives the distance between the two observations
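The worked example above in R (the ingredient vectors are illustrative):

```r
# Euclidean distance between apple (5, 9) and grape (9, 4)
apple <- c(sweetness = 5, crunch = 9)
grape <- c(sweetness = 9, crunch = 4)

# sqrt((5-9)^2 + (9-4)^2) = sqrt(41)
d <- sqrt(sum((apple - grape)^2))
```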
Why does distance matter with k-NN?
The smaller the computed distance, the nearer the neighbour
Closest neighbours get the “vote”
Pros and cons of k-NN
Pros:
+ Simple and effective - used as a baseline (used to improve models)
+ Makes no assumptions about the distribution (non-parametric)
+ Fast training phase
Cons:
- Requires selection of appropriate k
- Slow classification (testing phase)
- Nominal features require additional processing
- Euclidean distance does not work on categorical data (must be coded)
What is scaling in k-NN?
Part of data pre-processing
Puts the data on a common scale
Min-max normalisation
Z-score standardisation
How to code dummy variables
Binary - 0 and 1
Nominal - combination of 0/1 dummy variables (n categories need n - 1 dummies)
Ordinal - ordered numbers (but only used if intervals between categories are equal)
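A sketch of nominal dummy coding with base R's model.matrix() (the category names are illustrative):

```r
# A nominal feature with n = 3 categories
food <- data.frame(type = factor(c("fruit", "protein", "vegetable")))

# model.matrix() builds 0/1 dummies; dropping the intercept column leaves
# n - 1 = 2 dummy columns, with the first level as the reference category
dummies <- model.matrix(~ type, data = food)[, -1]
```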