Topic 5: Machine Learning: Classification & Clustering Flashcards
Calculate the Euclidean distance
SQRT((Xb - Xa)^2 + (Yb - Ya)^2)
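A minimal Python sketch of the two-dimensional case (the example points are illustrative):

    import math

    def euclidean_distance(a, b):
        """Straight-line distance between points a and b, given as (x, y) tuples."""
        return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

    # Example: distance between A = (1, 2) and B = (4, 6) is 5.0
    print(euclidean_distance((1, 2), (4, 6)))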
Define nearest neighbors and combining function.
Nearest neighbours are the instances most similar to the new instance; a combining function turns their known values into a prediction (through voting or averaging).
Explain how combining functions can be used for classification.
Look at the nearest neighbours and use a combining function, such as majority vote, to determine which class the new instance belongs to.
Calculate the probability of belonging to a class based on nearest neighbor classification.
Number of the k nearest neighbours belonging to that class / k (the total number of neighbours considered).
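A minimal sketch of turning neighbour votes into a class probability (the neighbour labels are illustrative):

    # k = 5 nearest neighbours; their known class labels (illustrative values)
    neighbour_labels = ["yes", "yes", "no", "yes", "no"]
    k = len(neighbour_labels)
    # Probability of class "yes" = neighbours voting "yes" / k
    p_yes = neighbour_labels.count("yes") / k
    print(p_yes)  # 0.6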
Explain weighted voting (scoring) or similarity moderated voting (scoring)
Weighted scoring: the influence of a neighbour drops the further it is from the instance being scored.
Similarity-moderated voting: each neighbour's vote is weighted by its similarity to the instance, so close neighbours count more than distant ones.
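A rough sketch of similarity-moderated scoring, assuming similarity is taken as the inverse of distance (a common choice, but not the only one); the neighbour data is illustrative:

    # (distance to neighbour, neighbour's class) pairs -- illustrative values
    neighbours = [(1.0, "yes"), (2.0, "yes"), (4.0, "no")]
    scores = {}
    for dist, label in neighbours:
        weight = 1.0 / dist            # closer neighbours get more influence
        scores[label] = scores.get(label, 0.0) + weight
    # Normalise the weighted votes into class scores
    total = sum(scores.values())
    print({label: w / total for label, w in scores.items()})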
Explain how k in k-NN can be used to address overfitting.
1-NN memorizes the training data (a very complex model), so it scores perfectly on the training set. To address overfitting, try different values of k, choose the one that gives the best performance on held-out (validation) data, and then evaluate that choice on the test data.
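A sketch of choosing k on held-out data with scikit-learn; the dataset (iris) and the candidate k values are illustrative, not from the flashcards:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # Evaluate several k values with cross-validation rather than on the training data
    for k in (1, 3, 5, 11, 21):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(k, scores.mean())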
Discuss issues with nearest-neighbor methods with a focus on
• Intelligibility
• Dimensionality and domain knowledge
• Computational efficiency.
- Intelligibility - if intelligibility and justification of decisions are critical, nearest-neighbour methods should be avoided (there is no explicit model to inspect).
- Dimensionality and domain knowledge - the curse of dimensionality: all attributes add to the distance, but not all attributes are relevant, so domain knowledge (feature selection) is needed.
- Computational efficiency - training is very fast (just store the data), but prediction/classification of a new instance is very inefficient/costly, since distances to the stored instances must be computed.
Describe feature selection.
Selecting which features should be included in the model; it can be done manually by someone with domain/industry knowledge.
Define and discuss the curse of dimensionality.
With many features, some are irrelevant to the task, yet all of them contribute to the distance calculations, which misleads and confuses the model.
Calculate the Manhattan distance and the Cosine distance
Manhattan distance = |Xa - Xb| + |Ya - Yb| (the sum of the absolute differences along each dimension)
Cosine distance = 1 - (A · B) / (||A|| * ||B||), i.e. one minus the cosine of the angle between the two vectors (it ignores vector length).
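A small sketch of both distances (the example vectors are illustrative):

    import math

    def manhattan_distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return 1 - dot / (norm_a * norm_b)

    print(manhattan_distance((1, 2), (4, 6)))   # 7
    print(cosine_distance((1, 0), (0, 1)))      # 1.0 (orthogonal vectors)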
Define the Jaccard distance
Jaccard distance = 1 - (overlapping items / total unique items), i.e. one minus the size of the intersection divided by the size of the union of the two sets.
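A one-line sketch using Python sets (the item sets are illustrative):

    a = {"milk", "bread", "eggs"}
    b = {"milk", "eggs", "beer"}
    jaccard_distance = 1 - len(a & b) / len(a | b)
    print(jaccard_distance)  # 1 - 2/4 = 0.5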

Calculate edit distance or the Levenshtein metric
The number of edits it takes to change one text into another using three actions:
insert, modify (substitute), or delete. It is used when order is important.
CAT to FAT (one modify action)
LD = 1
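A compact dynamic-programming sketch of the Levenshtein distance:

    def levenshtein(s, t):
        # dp[i][j] = edits needed to turn s[:i] into t[:j]
        dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(len(s) + 1):
            dp[i][0] = i
        for j in range(len(t) + 1):
            dp[0][j] = j
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                               dp[i][j - 1] + 1,        # insert
                               dp[i - 1][j - 1] + cost) # modify
        return dp[len(s)][len(t)]

    print(levenshtein("CAT", "FAT"))  # 1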
Define clustering, hierarchical clustering, and dendrogram
clustering: unsupervised segmentation
hierarchical clustering: clusters are nested, one inside another, forming a hierarchy from individual instances up to a single all-encompassing cluster
dendrogram: a tree diagram showing the hierarchy of the clusters
Describe how a dendrogram can help decide the number of clusters.
A horizontal line can cut across the dendrogram at any height; the number of branches it crosses is the number of clusters, so the cut can be placed to obtain the desired number.
Describe the advantage of hierarchical clustering.
It lets you see the groupings (i.e. the landscape of data similarity) at any level of granularity, without committing to a number of clusters in advance.
Define linkage functions.
Functions that define the distance between clusters (not just between individual instances), e.g. the minimum, maximum, or average distance between their members.
Describe how distance measures can be used to decide the number of clusters
in a dendrogram.
- Cut the dendrogram across the longest merge distances, i.e. where clusters are joined only at large distances, and keep the clusters below the cut.
- Instances merged only at very long distances are outliers (they usually form their own cluster).
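A sketch with SciPy, assuming Ward linkage and a distance threshold picked by eye from the dendrogram; the data and the threshold are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    X = np.random.rand(20, 2)              # illustrative 2-D data
    Z = linkage(X, method="ward")          # build the cluster hierarchy
    dendrogram(Z)                          # plot the hierarchy (requires matplotlib)
    labels = fcluster(Z, t=0.7, criterion="distance")  # cut at distance 0.7
    print(labels)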
Define “cluster center” or centroid and k-means clustering
cluster-center: geometric center of a group of instances
k-means clustering: clustering around k centroids; the "means" are the centroids, i.e. the arithmetic mean of the values along each dimension for the instances in the cluster.
Compare and contrast k-means clustering with hierarchical clustering
- k-means starts with a desired number of clusters, k, and produces a single flat partition; hierarchical clustering does not need k in advance and produces a full hierarchy that can be cut at any level.
- k-means is generally more efficient on large datasets, while hierarchical clustering is easier to explore via the dendrogram.
Describe the k-means algorithm
- Assign each point to the closest of the k chosen cluster centers (the initial centers are often random).
- Recompute each center as the actual center (mean) of the points assigned to it in the first step.
- Repeat the two steps until the cluster assignments stop changing.
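A bare-bones NumPy sketch of the two alternating steps; the initialisation and stopping rule are kept deliberately simple, and it assumes no cluster ends up empty:

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(n_iter):
            # Step 1: assign each point to its closest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: move each center to the mean of its assigned points
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # stop when centers no longer move
                break
            centers = new_centers
        return centers, labels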
Describe the reason for running the k-means algorithm many times.
The result of a single run is a local optimum that depends on the initial centroid locations; running the algorithm many times (and keeping the best result) reduces this sensitivity.
Define a cluster’s distortion
sum of the squared differences between each data point and its corresponding centroid.
Describe the method for selecting k in the k-means algorithm.
- Experiment with different values of k.
- Plot the distortion for each k (an elbow plot) and select the k at the "elbow", where further increases in k stop producing large improvements.
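A sketch of an elbow plot with scikit-learn, where the inertia_ attribute is the distortion; the generated data is illustrative:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
    ks = range(1, 10)
    distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(ks, distortions, marker="o")   # look for the "elbow" where the curve flattens
    plt.xlabel("k")
    plt.ylabel("distortion (inertia)")
    plt.show()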
Define and calculate the accuracy and error rate
A general measure of classifier performance
Accuracy = Number of correct decisions / Total Number Of Decisions
(equal to 1-Error rate)
Describe a confusion matrix.
A summary of prediction results on a classification problem: an n x n matrix whose rows and columns separate the decisions by actual vs predicted class, distinguishing correct decisions from the different kinds of errors.
Define false positives and false negatives
false positives: negative instances incorrectly classified as positive
false negatives: positive instances incorrectly classified as negative
Describe unbalanced data and the problems with unbalanced data
Unbalanced (skewed) data: one class is much rarer than the other. Evaluation based on plain accuracy then breaks down, because a classifier can score highly simply by always predicting the majority class.
Discuss the problems with unequal costs and benefits of errors.
Simple classification accuracy as metric makes no distinction between false positives and false negatives (they are equally important). Ideally you would estimate the cost or benefit for each decision a classifier can make.
Calculate expected value and expected benefit.
expected benefit = p_response(x) * (value of response) + [1 - p_response(x)] * (value of no response)
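A tiny worked sketch of the targeting calculation (the numbers are illustrative):

    p_response = 0.05          # model's estimated probability that the customer responds
    value_response = 100.0     # profit if they respond
    value_no_response = -1.0   # cost of contacting a non-responder
    expected_benefit = p_response * value_response + (1 - p_response) * value_no_response
    print(expected_benefit)    # 0.05*100 + 0.95*(-1) = 4.05 -> positive, so target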
Describe how expected value can be used to frame classifier use
If the expected value (benefit) of targeting the customer is greater than 0, target the customer.
Describe how expected value can be used to frame classifier evaluation
Compute each model's overall expected value (e.g. expected profit) from its confusion-matrix rates and the costs/benefits of each outcome, and compare models on that basis.
Define and interpret precision and recall.
precision: TP / (TP + FP) -> of all the instances the model predicted as positive (e.g. cancer patients), how many were actually positive.
recall: TP / (TP + FN) -> of all the actually positive instances (e.g. cancer patients), how many did the model correctly predict.
Calculate the value of the F-measure.
F-measure = 2 * (precision * recall) / (precision + recall), the harmonic mean of precision and recall.
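A short sketch computing all three metrics from raw counts (the counts are illustrative):

    tp, fp, fn = 30, 10, 20    # illustrative confusion-matrix counts
    precision = tp / (tp + fp)             # 0.75
    recall = tp / (tp + fn)                # 0.6
    f_measure = 2 * precision * recall / (precision + recall)
    print(precision, recall, f_measure)    # F-measure is about 0.667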