Week 3 Flashcards
1
Q
Nearest Neighbor Algorithm
A
- Nearest neighbor classifiers classify unlabeled cases based on their similarity to labeled cases
- for example, if users A and B exhibit the same purchasing behavior, the items purchased by A are also recommended to user B (see the sketch after this list)
- Nearest neighbor algorithms are used for
- recommendations
- identifying patterns in data
- suited when the data is homogeneous (features of the same type)
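A minimal sketch of the recommendation idea above, assuming purchases are encoded as hypothetical 0/1 vectors over a shared item catalog (the users, items, and vectors are made up for illustration):

```python
import numpy as np

catalog = ["book", "laptop", "mouse", "desk"]
purchases = {
    "A": np.array([1, 1, 1, 0]),  # user A bought book, laptop, mouse
    "B": np.array([1, 1, 0, 0]),  # user B bought book, laptop
}

# find the labeled user most similar to B (smallest Euclidean distance)
others = {u: v for u, v in purchases.items() if u != "B"}
nearest = min(others, key=lambda u: np.linalg.norm(others[u] - purchases["B"]))

# recommend items the nearest user bought that B has not bought yet
recommendations = [item for item, a, b in zip(catalog, purchases[nearest], purchases["B"])
                   if a == 1 and b == 0]
print(nearest, recommendations)  # A ['mouse']
```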
2
Q
K-NN
A
- k-NN stands for the k-Nearest Neighbors algorithm
- a simple machine learning algorithm
3
Q
Strengths of KNN
A
- simple and effective
- makes no assumptions about the underlying data distribution
- fast training phase
4
Q
Weaknesses of KNN
A
- does not produce a model, limiting the ability to understand how the features are related to the class
- requires selection of an appropriate k
- slow classification phase
- nominal features and missing data require additional processing
5
Q
Letter K in K-NN Algorithm
A
- the letter k is a variable indicating the number of neighbors to use
- for each unlabeled record in the test dataset, k-NN identifies the k records in the training data that are the “nearest” in similarity (see the sketch after this card)
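A minimal sketch of this k-nearest-records idea, assuming a tiny hypothetical training set with two numeric features and made-up class labels (data and function name are illustrative only):

```python
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array(["red", "red", "blue", "blue"])

def knn_predict(x_new, k=3):
    # distance from the unlabeled record to every training record
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k training records nearest in similarity
    nearest = np.argsort(dists)[:k]
    # majority vote among the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([1.1, 0.9])))  # red
```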
6
Q
Euclidean Distance
A
- Euclidean distance is used to measure the similarity between two instances
- dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
- where p and q are the examples to be compared, each having n features
- the term p1 refers to the value of the first feature of example p, while q1 refers to the value of the first feature of example q
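A minimal sketch of the distance formula above, assuming each example is given as an equal-length list of numeric feature values (the function name and sample values are illustrative):

```python
import math

def euclidean_distance(p, q):
    # square root of the sum of squared per-feature differences
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```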
7
Q
Choosing a value for K in K-NN
A
- the value of k determines how well the model will generalize to future data
- the balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff
- choosing a large k reduces the impact of variance caused by noisy data, but can bias the learner so that it runs the risk of ignoring small but important patterns
- choosing a smaller k lets noisy data points have a greater influence on the result (see the comparison sketch after this list)
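A minimal sketch of comparing candidate values of k on held-out data; it assumes scikit-learn and its bundled iris dataset, neither of which the card mentions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 3, 5, 11, 21):
    # train a k-NN classifier and report held-out accuracy for this k
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(k, round(acc, 3))  # small k: low bias / high variance; large k: the reverse
```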
8
Q
Commonly used technique for choosing K
A
- k = sqrt(N), where N stands for the number of samples in your training dataset
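A minimal numeric sketch of this rule of thumb, assuming a hypothetical training-set size; how to round the square root is a choice the card does not specify:

```python
import math

n_samples = 150                    # hypothetical number of training samples
k = round(math.sqrt(n_samples))    # rule of thumb: k = sqrt(N)
print(k)                           # 12
```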