Topic 5: Machine Learning: Classification & Clustering Flashcards

1
Q

Calculate the Euclidean distance

A

SQRT((Xb - Xa)^2 + (Yb - Ya)^2)
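
A minimal Python sketch of the formula (the function name and 2-D tuple inputs are just illustrative):

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points given as (x, y) tuples.
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0 (a 3-4-5 triangle)
```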

2
Q

Define nearest neighbors and combining function.

A

Nearest neighbours are the most similar instances; a combining function turns their known values into a prediction (through voting for classification or averaging for regression).

3
Q

Explain how combining functions can be used for classification.

A

Look at the k nearest neighbours and apply a combining function, such as a majority vote, to decide which class the new instance belongs to.

4
Q
Calculate the probability of belonging to a class based on nearest neighbor classification.
A

Number of the k nearest neighbours that belong to that class / k (the total number of neighbours considered).
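
A minimal sketch covering the last two cards, majority vote plus the vote-count probability (the helper name and labels are made up):

```python
from collections import Counter

def knn_vote(neighbor_labels, k):
    # Majority vote over the k nearest neighbors' class labels,
    # plus the class-membership probability: votes for the class / k.
    votes = Counter(neighbor_labels[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k

# 5 nearest neighbors, nearest first: 3 of 5 vote "cat" -> P(cat) = 0.6
print(knn_vote(["cat", "dog", "cat", "cat", "dog"], k=5))  # ('cat', 0.6)
```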

5
Q

Explain weighted voting (scoring) or similarity moderated voting (scoring)

A

Weighted voting (scoring), also called similarity-moderated voting (scoring): each neighbour's vote (or contribution to a score) is weighted by its similarity to the instance, so a neighbour's influence drops the further it is from the instance. A common weight is the inverse of the squared distance; see the sketch below.
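
A sketch of similarity-moderated voting, assuming inverse squared distance as the weight (other weighting schemes work too):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, label); each vote is weighted by
    # 1 / distance^2, so closer neighbors have more influence.
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance ** 2 + 1e-9)  # epsilon guards distance 0
    return max(scores, key=scores.get)

# One very close "dog" outweighs two farther "cat"s.
print(weighted_vote([(0.5, "dog"), (2.0, "cat"), (2.5, "cat")]))  # dog
```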

6
Q

Explain how k in k-NN can be used to address overfitting.

A

1-NN memorizes the training data (a very complex model), so it overfits. To address overfitting, try different values of k, choose the one that performs best on held-out data (e.g., via cross-validation on the training set), and then evaluate the chosen model on the test data.
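
A sketch of that tuning loop, assuming scikit-learn and its bundled iris dataset (any labelled dataset would do):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pick k by cross-validation on the training data, never on the test set.
best_k = max(range(1, 21), key=lambda k: cross_val_score(
    KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5).mean())

model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, model.score(X_test, y_test))  # final check on held-out data
```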

7
Q

Discuss issues with nearest-neighbor methods with a focus on
• Intelligibility
• Dimensionality and domain knowledge
• Computational efficiency.

A
  • Intelligibility - if intelligibility and decision justification are critical, nearest-neighbor methods should be avoided (there is no explicit model to inspect)
  • Dimensionality and domain knowledge - the curse of dimensionality: all attributes add to the distance, but not all attributes are relevant, so domain knowledge is needed to select or weight features
  • Computational efficiency - training is very fast (just store the data), but prediction/classification of a new instance is inefficient/costly, since it must be compared against the stored instances.
8
Q

Describe feature selection.

A

Selecting the features that should be included in the model; it can be done manually by someone with domain (industry) knowledge.

9
Q

Define and discuss the curse of dimensionality.

A

Some features are irrelevant, but all of the features add to the distance calculations, which misleads and confuses the model. The more irrelevant dimensions there are, the less meaningful the distances become.

10
Q

Calculate the Manhattan distance and the Cosine distance

A

Manhattan distance = |Xa - Xb| + |Ya - Yb| (the sum of the absolute differences along each dimension).

Cosine distance = 1 - (A . B) / (||A|| ||B||), i.e., one minus the cosine of the angle between the two feature vectors.
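
Both distances in plain Python, as a rough sketch (vectors as equal-length tuples):

```python
import math

def manhattan(a, b):
    # Sum of absolute differences along each dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

print(manhattan((1, 2), (4, 6)))        # 7
print(cosine_distance((1, 0), (0, 1)))  # 1.0 (orthogonal vectors)
```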

11
Q

Define the Jaccard distance

A

Jaccard similarity = overlapping items / total unique items, i.e., |A intersect B| / |A union B|; Jaccard distance = 1 - Jaccard similarity.

12
Q

Calculate edit distance or the Levenshtein metric

A

The number of changes it takes to turn one text into another using three actions: insert, modify, or delete. It is used when order is important.

CAT to FAT (one modify action)

LD = 1
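
A sketch of the standard dynamic-programming computation (one of several ways to implement it):

```python
def levenshtein(s, t):
    # d[i][j] = edits needed to turn the first i chars of s into the
    # first j chars of t, built up from the empty-string base cases.
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i                                   # i deletes
    for j in range(len(t) + 1):
        d[0][j] = j                                   # j inserts
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # modify if chars differ
            d[i][j] = min(d[i - 1][j] + 1,            # delete
                          d[i][j - 1] + 1,            # insert
                          d[i - 1][j - 1] + cost)     # modify / match
    return d[-1][-1]

print(levenshtein("CAT", "FAT"))  # 1
```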

13
Q

Define clustering, hierarchical clustering, and dendrogram

A

clustering: unsupervised segmentation of the data into groups of similar instances

hierarchical clustering: clustering in which clusters are nested, so one cluster can contain other clusters

dendrogram: a tree diagram showing the hierarchy of the clusters

14
Q

Describe how a dendrogram can help decide the number of clusters.

A

A horizontal line can cut across the dendrogram at any height; the number of branches it crosses is the number of clusters, so you cut wherever it yields the desired number.
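
A sketch assuming SciPy, where fcluster plays the role of the horizontal cut (the two-blob data is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (10, 2)),   # blob near the origin
                    rng.normal(5, 0.5, (10, 2))])  # blob near (5, 5)

Z = linkage(points, method="average")              # build the hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")    # "cut" into 2 clusters
print(labels)                                      # cluster id per point
```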

15
Q

Describe the advantage of hierarchical clustering.

A

It allows you to see the groupings at every level of granularity (i.e., the landscape of data similarity) before committing to a number of clusters.

16
Q

Define linkage functions.

A

Distance functions between clusters or individual instances (e.g., the minimum, maximum, or average distance between cluster members), used to decide which clusters to merge.

17
Q

Describe how distance measures can be used to decide the number of clusters in a dendrogram.

A
  1. choose the cut that yields the most clusters while cutting across the longest distances (the largest gaps between merge heights)
  2. instances that only merge at very long distances are outliers (usually forming their own cluster)
18
Q

Define “cluster center” or centroid and k-means clustering

A

cluster center (centroid): the geometric center of a group of instances

k-means clustering: clustering into k groups whose "means" are the centroids, i.e., the arithmetic mean of the values along each dimension for the instances in the cluster.

19
Q

Compare and contrast k-means clustering with hierarchical clustering

A

k-means starts with a desired number of clusters k and returns exactly that many groups, whereas hierarchical clustering builds the whole hierarchy and lets you pick the number of clusters afterwards (e.g., by cutting the dendrogram).

20
Q

Describe the k-means algorithm

A
  1. assign each point to the closest of the k chosen centers (often initialized randomly)
  2. recompute each cluster's center as the actual centroid of the points assigned to it in the first step
  3. repeat the two steps until the assignments stop changing (see the sketch below)
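
A minimal NumPy sketch of those steps (it ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # random start
    for _ in range(iters):
        # step 1: assign every point to its nearest center
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        assign = dists.argmin(axis=1)
        # step 2: move each center to the mean of its assigned points
        new = np.array([points[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):   # stop once the centers settle
            break
        centers = new
    return centers, assign
```
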
21
Q

Describe the reason for running the k-means algorithm many times.

A

The result of a single run is a local optimum that depends on the initial centroid locations; running k-means many times with different starting centroids and keeping the best clustering reduces this sensitivity.

22
Q

Define a cluster’s distortion

A

The sum of the squared differences (distances) between each data point and its corresponding centroid.

23
Q

Describe the method for selecting k in the k-means algorithm.

A
  1. experiment with different k-values
  2. plot the distortion for each k (an elbow plot) and select the k where the decrease levels off, i.e., where stabilization begins (see the sketch below)
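
A sketch of the elbow experiment, assuming scikit-learn (KMeans reports the distortion as inertia_):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # should drop steeply until k=4, then flatten
```
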
24
Q

Define and calculate the accuracy and error rate

A

A general measure of classifier performance

Accuracy = number of correct decisions / total number of decisions
(equal to 1 - error rate)

25
Q

Describe a confusion matrix.

A

A summary of prediction results on a classification problem: an n x n matrix whose rows and columns correspond to actual and predicted classes, with each cell counting the instances of that (actual, predicted) combination.

26
Q

Define false positives and false negatives

A

false positives: negative instances incorrectly classified as positive

false negatives: positive instances incorrectly classified as negative

27
Q

Describe unbalanced data and the problems with unbalanced data

A

Unbalanced data: data in which one class is rare. Evaluation based on plain accuracy breaks down, because always predicting the majority class already yields high accuracy.

28
Q

Discuss the problems with unequal costs and benefits of errors.

A

Simple classification accuracy as metric makes no distinction between false positives and false negatives (they are equally important). Ideally you would estimate the cost or benefit for each decision a classifier can make.

29
Q

Calculate expected value and expected benefit.

A

expected benefit = p_response(x) * (value of response) + (1 - p_response(x)) * (value of no response), where p_response(x) is the estimated probability that customer x responds.
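
A worked example with made-up numbers (a response worth $99, a wasted contact costing $1):

```python
def expected_benefit(p_response, value_response=99, value_no_response=-1):
    # Probability-weighted value of targeting one customer.
    return p_response * value_response + (1 - p_response) * value_no_response

# 0.05 * 99 + 0.95 * (-1) = 4.0 > 0, so targeting pays off on average.
print(expected_benefit(0.05))  # ~4.0
```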

30
Q

Describe how expected value can be used to frame classifier use

A

If the expected value (benefit) of targeting a customer is greater than 0, target that customer; otherwise do not.

31
Q

Describe how expected value can be used to frame classifier evaluation

A

You can use expected value to compare models: compute each model's expected value on the same evaluation data and prefer the one with the higher value, rather than comparing raw accuracies.

32
Q

Define and interpret precision and recall.

A

precision: TP / (TP + FP), i.e., out of all the instances the model predicted as positive (e.g., predicted cancer patients), how many actually are positive
recall: TP / (TP + FN), i.e., out of all actual positives (e.g., actual cancer patients), how many the model correctly identified

33
Q

Calculate the value of the F-measure.

A

F-measure (F1) = 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall.
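
A sketch computing all three metrics from raw confusion-matrix counts (the counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)                          # of predicted positives, how many were right
    recall = tp / (tp + fn)                             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# 80 true positives, 20 false positives, 40 false negatives
print(precision_recall_f1(80, 20, 40))  # (0.8, 0.667, 0.727) approximately
```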