Week 3 Flashcards

1
Q

Nearest Neighbor Algorithm

A
  • Nearest neighbor classifiers classify unlabeled cases based on their similarity to labeled cases
  • for example, if users A and B exhibit the same purchasing behavior, the items purchased by A are also recommended to user B (see the sketch below)
  • nearest neighbor algorithms are used for
    • recommendations
    • identifying patterns in data
  • best suited when the features are of a homogeneous type
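A minimal sketch of the recommendation idea above, assuming a toy NumPy purchase matrix (the users, items, and counts are all invented for illustration):

  import numpy as np

  # Rows = users, columns = items; values are purchase counts.
  purchases = np.array([
      [2, 0, 1, 3],   # user A
      [2, 0, 1, 0],   # user B
      [0, 4, 0, 1],   # user C
  ])

  def recommend(user_idx, purchases):
      """Recommend items the most similar other user bought but this user has not."""
      target = purchases[user_idx]
      dists = np.linalg.norm(purchases - target, axis=1)  # similarity by distance
      dists[user_idx] = np.inf            # exclude the user themselves
      neighbor = int(np.argmin(dists))    # nearest neighbor by behavior
      return np.where((purchases[neighbor] > 0) & (target == 0))[0]

  print(recommend(1, purchases))  # -> [3]: user A also bought item 3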
2
Q

K-NN

A
  • k-NN stands for the k-Nearest Neighbors algorithm
  • a simple machine learning algorithm
3
Q

Strengths of KNN

A
  • simple and effective
  • makes no assumptions about the underlying data distribution
  • fast training phase
4
Q

Weakness of KNN

A
  • does not produce a model, limiting the ability to understand how the features are related to the class
  • requires selection of an appropriate k
  • slow classification phase
  • nominal features and missing data require additional processing
5
Q

Letter K in K-NN Algorithm

A
  • the letter k is a variable specifying the number of neighbors to use
  • for each unlabeled record in the test dataset, k-NN identifies the k records in the training data that are “nearest” in similarity (a minimal sketch follows below)
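A minimal sketch of that lookup, assuming NumPy arrays X_train and y_train (hypothetical names) hold the labeled training data:

  import numpy as np
  from collections import Counter

  def knn_predict(X_train, y_train, x_new, k=3):
      """Label x_new by majority vote among its k nearest training records."""
      # Euclidean distance from x_new to every training record
      dists = np.linalg.norm(X_train - x_new, axis=1)
      nearest = np.argsort(dists)[:k]          # indices of the k closest records
      votes = Counter(y_train[i] for i in nearest)
      return votes.most_common(1)[0][0]        # the most common neighboring label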
6
Q

Euclidean Distance

A
  • Euclidean distance is used to measure the similarity between two instances:
    dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
  • where p and q are the examples to be compared, each having n features
  • the term p1 refers to the value of the first feature of example p, while q1 refers to the value of the first feature of example q (see the function below)
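The same formula as a small Python function (the sample points are just an illustration):

  import math

  def euclidean_distance(p, q):
      """dist(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2)"""
      return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

  print(euclidean_distance([1, 2], [4, 6]))  # -> 5.0 (the classic 3-4-5 triangle)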
7
Q

Choosing a value for K in K-NN

A
  • the value of k determines how well the model will generalize to future data
  • the balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff
  • choosing a large k reduces the impact of variance caused by noisy data, but can bias the learner so that it runs the risk of ignoring small but important patterns
  • choosing a small k lets noisy data have a higher influence on the result (see the sketch below)
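One way to see the tradeoff empirically, assuming scikit-learn is available (the iris dataset and the particular k values are just illustrative choices):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import cross_val_score
  from sklearn.neighbors import KNeighborsClassifier

  X, y = load_iris(return_X_y=True)

  # A small k tracks noise (high variance); a large k smooths over
  # local patterns (high bias). Cross-validation exposes the tradeoff.
  for k in (1, 5, 15, 51):
      model = KNeighborsClassifier(n_neighbors=k)
      score = cross_val_score(model, X, y, cv=5).mean()
      print(f"k={k:>2}: mean accuracy = {score:.3f}")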
8
Q

Commonly Used Technique for Choosing k

A

k = sqrt(N), where N stands for the number of samples in your training dataset
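A quick sketch of the rule with a hypothetical N; rounding up to an odd k is a common extra step (it avoids tied votes in binary problems), not part of the card:

  import math

  n_samples = 400                     # hypothetical training-set size
  k = round(math.sqrt(n_samples))     # k = sqrt(N) rule of thumb -> 20
  if k % 2 == 0:
      k += 1                          # prefer an odd k to avoid voting ties
  print(k)                            # -> 21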
