Topic 5: Machine Learning: Classification & Clustering Flashcards
Calculate the Euclidean distance
SQRT((Xb - Xa)^2 + (Yb - Ya)^2)
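A minimal Python sketch of the two-dimensional case (the example points are illustrative):

    import math

    def euclidean_distance(a, b):
        """Straight-line distance between points a and b, given as (x, y) tuples."""
        return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

    # Example: distance between A = (1, 2) and B = (4, 6) is 5.0
    print(euclidean_distance((1, 2), (4, 6)))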
Define nearest neighbors and combining function.
Nearest neighbours are the instances most similar to the new instance; a combining function turns their known values into a prediction (through voting or averaging).
Explain how combining functions can be used for classification.
Look at the nearest neighbours and use a combining function, such as majority vote, to determine which class the new instance belongs to.
Calculate the probability of belonging to a class based on nearest neighbor classification.
Number of the k nearest neighbours belonging to that class / k (the total number of neighbours considered).
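A minimal sketch of turning neighbour votes into a class probability (the neighbour labels are illustrative):

    # k = 5 nearest neighbours; their known class labels (illustrative values)
    neighbour_labels = ["yes", "yes", "no", "yes", "no"]
    k = len(neighbour_labels)
    # Probability of class "yes" = neighbours voting "yes" / k
    p_yes = neighbour_labels.count("yes") / k
    print(p_yes)  # 0.6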
Explain weighted voting (scoring) or similarity moderated voting (scoring)
Weighted scoring: the influence of a neighbour drops the further it is from the instance being scored.
Similarity-moderated voting: each neighbour's vote is weighted by its similarity to the instance, so close neighbours count more than distant ones.
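A rough sketch of similarity-moderated scoring, assuming similarity is taken as the inverse of distance (a common choice, but not the only one); the neighbour data is illustrative:

    # (distance to neighbour, neighbour's class) pairs -- illustrative values
    neighbours = [(1.0, "yes"), (2.0, "yes"), (4.0, "no")]
    scores = {}
    for dist, label in neighbours:
        weight = 1.0 / dist            # closer neighbours get more influence
        scores[label] = scores.get(label, 0.0) + weight
    # Normalise the weighted votes into class scores
    total = sum(scores.values())
    print({label: w / total for label, w in scores.items()})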
Explain how k in k-NN can be used to address overfitting.
1-NN memorizes the training data (a very complex model), so it scores perfectly on the training set. To address overfitting, try different values of k, choose the one that gives the best performance on held-out (validation) data, and then evaluate that choice on the test data.
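A sketch of choosing k on held-out data with scikit-learn; the dataset (iris) and the candidate k values are illustrative, not from the flashcards:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # Evaluate several k values with cross-validation rather than on the training data
    for k in (1, 3, 5, 11, 21):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(k, scores.mean())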
Discuss issues with nearest-neighbor methods with a focus on
• Intelligibility
• Dimensionality and domain knowledge
• Computational efficiency.
- Intelligibility - if intelligibility and justification of decisions are critical, nearest-neighbour methods should be avoided (there is no explicit model to inspect).
- Dimensionality and domain knowledge - the curse of dimensionality: all attributes add to the distance, but not all attributes are relevant, so domain knowledge (feature selection) is needed.
- Computational efficiency - training is very fast (just store the data), but prediction/classification of a new instance is very inefficient/costly, since distances to the stored instances must be computed.
Describe feature selection.
Selecting which features should be included in the model; it can be done manually by someone with domain/industry knowledge.
Define and discuss the curse of dimensionality.
With many features, some are irrelevant to the task, yet all of them contribute to the distance calculations, which misleads and confuses the model.
Calculate the Manhattan distance and the Cosine distance
Manhattan distance = |Xa - Xb| + |Ya - Yb| (the sum of the absolute differences along each dimension)
Cosine distance = 1 - (A · B) / (||A|| * ||B||), i.e. one minus the cosine of the angle between the two vectors (it ignores vector length).
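A small sketch of both distances (the example vectors are illustrative):

    import math

    def manhattan_distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return 1 - dot / (norm_a * norm_b)

    print(manhattan_distance((1, 2), (4, 6)))   # 7
    print(cosine_distance((1, 0), (0, 1)))      # 1.0 (orthogonal vectors)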
Define the Jaccard distance
Jaccard distance = 1 - (overlapping items / total unique items), i.e. one minus the size of the intersection divided by the size of the union of the two sets.
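A one-line sketch using Python sets (the item sets are illustrative):

    a = {"milk", "bread", "eggs"}
    b = {"milk", "eggs", "beer"}
    jaccard_distance = 1 - len(a & b) / len(a | b)
    print(jaccard_distance)  # 1 - 2/4 = 0.5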

Calculate edit distance or the Levenshtein metric
The number of edits it takes to change one text into another using three actions:
insert, modify (substitute), or delete. It is used when order is important.
CAT to FAT (one modify action)
LD = 1
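A compact dynamic-programming sketch of the Levenshtein distance:

    def levenshtein(s, t):
        # dp[i][j] = edits needed to turn s[:i] into t[:j]
        dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(len(s) + 1):
            dp[i][0] = i
        for j in range(len(t) + 1):
            dp[0][j] = j
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                               dp[i][j - 1] + 1,        # insert
                               dp[i - 1][j - 1] + cost) # modify
        return dp[len(s)][len(t)]

    print(levenshtein("CAT", "FAT"))  # 1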
Define clustering, hierarchical clustering, and dendrogram
clustering: unsupervised segmentation
hierarchical clustering: clusters are nested, one inside another, forming a hierarchy from individual instances up to a single all-encompassing cluster
dendrogram: a tree diagram showing the hierarchy of the clusters
Describe how a dendrogram can help decide the number of clusters.
A horizontal line can cut across the dendrogram at any height; the number of branches it crosses is the number of clusters, so the cut can be placed to obtain the desired number.
Describe the advantage of hierarchical clustering.
It lets you see the groupings (i.e. the landscape of data similarity) at any level of granularity, without committing to a number of clusters in advance.
Define linkage functions.
Functions that define the distance between clusters (not just between individual instances), e.g. the minimum, maximum, or average distance between their members.
Describe how distance measures can be used to decide the number of clusters
in a dendrogram.
- Cut the dendrogram across the longest merge distances, i.e. where clusters are joined only at large distances, and keep the clusters below the cut.
- Instances merged only at very long distances are outliers (they usually form their own cluster).
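A sketch with SciPy, assuming Ward linkage and a distance threshold picked by eye from the dendrogram; the data and the threshold are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    X = np.random.rand(20, 2)              # illustrative 2-D data
    Z = linkage(X, method="ward")          # build the cluster hierarchy
    dendrogram(Z)                          # plot the hierarchy (requires matplotlib)
    labels = fcluster(Z, t=0.7, criterion="distance")  # cut at distance 0.7
    print(labels)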
Define “cluster center” or centroid and k-means clustering
cluster-center: geometric center of a group of instances
k-means clustering: clustering around k centroids; the "means" are the centroids, i.e. the arithmetic mean of the values along each dimension for the instances in the cluster.
Compare and contrast k-means clustering with hierarchical clustering
- k-means starts with a desired number of clusters, k, and produces a single flat partition; hierarchical clustering does not need k in advance and produces a full hierarchy that can be cut at any level.
- k-means is generally more efficient on large datasets, while hierarchical clustering is easier to explore via the dendrogram.
Describe the k-means algorithm
- Assign each point to the closest of the k chosen cluster centers (the initial centers are often random).
- Recompute each center as the actual center (mean) of the points assigned to it in the first step.
- Repeat the two steps until the cluster assignments stop changing.
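A bare-bones NumPy sketch of the two alternating steps; the initialisation and stopping rule are kept deliberately simple, and it assumes no cluster ends up empty:

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(n_iter):
            # Step 1: assign each point to its closest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: move each center to the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # stop when centers no longer move
                break
            centers = new_centers
        return centers, labels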
Describe the reason for running the k-means algorithm many times.
The result of a single run is a local optimum that depends on the initial centroid locations; running the algorithm many times (and keeping the best result) reduces this sensitivity.
Define a cluster’s distortion
sum of the squared differences between each data point and its corresponding centroid.
Describe the method for selecting k in the k-means algorithm.
- Experiment with different values of k.
- Plot the distortion for each k (an elbow plot) and select the k at the "elbow", where further increases in k stop producing large improvements.
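A sketch of an elbow plot with scikit-learn, where the inertia_ attribute is the distortion; the generated data is illustrative:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    ks = range(1, 10)
    distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(ks, distortions, marker="o")   # look for the "elbow" where the curve flattens
    plt.xlabel("k")
    plt.ylabel("distortion (inertia)")
    plt.show()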
Define and calculate the accuracy and error rate
A general measure of classifier performance
Accuracy = Number of correct decisions / Total Number Of Decisions
(equal to 1-Error rate)
Describe a confusion matrix.
A summary of prediction results on a classification problem: an n x n matrix whose rows and columns separate the decisions by actual vs predicted class, distinguishing correct decisions from the different kinds of errors.
Define false positives and false negatives
false positives: negative instances incorrectly classified as positive
false negatives: positive instances incorrectly classified as negative
Describe unbalanced data and the problems with unbalanced data
Unbalanced (skewed) data: one class is much rarer than the other. Evaluation based on plain accuracy then breaks down, because a classifier can score highly simply by always predicting the majority class.
Discuss the problems with unequal costs and benefits of errors.
Simple classification accuracy as metric makes no distinction between false positives and false negatives (they are equally important). Ideally you would estimate the cost or benefit for each decision a classifier can make.
Calculate expected value and expected benefit.
expected benefit = p_response(x) * (value of response) + [1 - p_response(x)] * (value of no response)
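A tiny worked sketch of the targeting calculation (the numbers are illustrative):

    p_response = 0.05          # model's estimated probability that the customer responds
    value_response = 100.0     # profit if they respond
    value_no_response = -1.0   # cost of contacting a non-responder
    expected_benefit = p_response * value_response + (1 - p_response) * value_no_response
    print(expected_benefit)    # 0.05*100 + 0.95*(-1) = 4.05 -> positive, so target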
Describe how expected value can be used to frame classifier use
If the expected value (benefit) of targeting the customer is greater than 0, target the customer.
Describe how expected value can be used to frame classifier evaluation
Compute each model's overall expected value (e.g. expected profit) from its confusion-matrix rates and the costs/benefits of each outcome, and compare models on that basis.
Define and interpret precision and recall.
precision: TP / (TP + FP) -> of all the instances the model predicted as positive (e.g. cancer patients), how many were actually positive.
recall: TP / (TP + FN) -> of all the actually positive instances (e.g. cancer patients), how many did the model correctly predict.
Calculate the value of the F-measure.
F-measure = 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall.
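A short sketch computing all three metrics from raw counts (the counts are illustrative):

    tp, fp, fn = 30, 10, 20    # illustrative confusion-matrix counts
    precision = tp / (tp + fp)             # 0.75
    recall = tp / (tp + fn)                # 0.6
    f_measure = 2 * precision * recall / (precision + recall)
    print(precision, recall, f_measure)    # F-measure is about 0.667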