Topic 5: Machine Learning: Classification & Clustering Flashcards
Calculate the Euclidean distance
SQRT((Xb-Xa)^2 + (Yb-Ya)^2)
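The formula above can be sketched in Python; the example points are hypothetical and chosen to form a 3-4-5 triangle:

```python
from math import sqrt

def euclidean(a, b):
    # Square each coordinate difference, sum them, then take the square root.
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((1, 2), (4, 6)))  # 3-4-5 triangle -> 5.0
```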
Define nearest neighbors and combining function.
Nearest neighbours are the most similar instances; a combining function turns their known values into a prediction (through voting/averaging).
Explain how combining functions can be used for classification.
Look at the nearest neighbours and use a combining function, such as majority vote, to determine which class the new instance belongs to.
Calculate the probability of belonging to a class based on nearest neighbor classification.
number of the k nearest neighbours belonging to that class / k (total number of neighbours)
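A minimal sketch of this probability estimate, assuming the neighbours' class labels have already been retrieved (the labels here are made up):

```python
def class_probability(neighbour_labels, target_class):
    # count of the k neighbours in the target class, divided by k
    return neighbour_labels.count(target_class) / len(neighbour_labels)

# 2 of the k=3 nearest neighbours are "+", so P(+) = 2/3
print(class_probability(["+", "+", "-"], "+"))
```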
Explain weighted voting (scoring) or similarity moderated voting (scoring)
Weighted scoring: a neighbour's influence on the prediction drops the further it is from the instance.
Similarity moderated voting: the same idea applied to voting — each neighbour's vote is weighted by its similarity to the instance, so closer (more similar) neighbours count for more.
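A minimal sketch of similarity-weighted voting, assuming inverse-squared-distance weights (one common choice, not the only one):

```python
def weighted_vote(neighbours, target_class):
    # neighbours: list of (distance, label) pairs for the k nearest neighbours.
    # Each neighbour votes with weight 1/d^2, so closer neighbours count more.
    weights = {}
    for d, label in neighbours:
        weights[label] = weights.get(label, 0.0) + 1.0 / d ** 2
    return weights.get(target_class, 0.0) / sum(weights.values())

# One close "+" neighbour outweighs two distant "-" neighbours:
# weights are + = 1.0, - = 0.25 + 0.25 = 0.5, so P(+) = 1.0 / 1.5
print(weighted_vote([(1.0, "+"), (2.0, "-"), (2.0, "-")], "+"))
```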
Explain how k in k-NN can be used to address overfitting.
1-NN memorizes the training data (a very complex model). To address overfitting, try different values of k, choose the one that gives the best performance on held-out (validation) data, and then evaluate the chosen model on the test data.
Discuss issues with nearest-neighbor methods with a focus on
• Intelligibility
• Dimensionality and domain knowledge
• Computational efficiency.
- Intelligibility - if intelligibility and justification are critical, NN should be avoided
- Dimensionality and domain knowledge - curse of dimensionality (all attributes add to the distance, not all attributes are relevant)
- Computational efficiency - training is very fast, but prediction/classification of a new instance is very inefficient/costly.
Describe feature selection.
Selecting features that should be included in the model, can be done manually by someone with industry knowledge.
Define and discuss the curse of dimensionality.
Some features are irrelevant but all of the features add to the distance calculations (misleading and confusion of the model).
Calculate the Manhattan distance and the Cosine distance
Manhattan distance = |Xa-Xb| + |Ya-Yb| (sum of absolute coordinate differences)
Cosine distance = 1 - (A·B) / (|A||B|), i.e. one minus the cosine of the angle between the two vectors
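Both distances can be sketched in Python; the example vectors are hypothetical:

```python
from math import sqrt

def manhattan(a, b):
    # Sum of absolute per-coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(manhattan((1, 2), (4, 6)))        # |1-4| + |2-6| = 7
print(cosine_distance((1, 0), (0, 1)))  # orthogonal vectors -> 1.0
```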
Define the Jaccard distance
Jaccard similarity = overlapping items / total unique items (intersection over union); Jaccard distance = 1 - Jaccard similarity.
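A minimal sketch using Python sets; the two example baskets are made up:

```python
def jaccard_distance(a, b):
    # 1 minus intersection-over-union of the two item sets.
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Overlap = {milk} (1 item), union = {milk, bread, eggs} (3 items),
# so similarity = 1/3 and distance = 2/3.
print(jaccard_distance({"milk", "bread"}, {"milk", "eggs"}))
```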
Calculate edit distance or the Levenshtein metric
Number of changes it takes to turn one text into another using three actions:
insert, modify, or delete. It is used when order is important.
CAT to FAT (one modify action)
LD = 1
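The CAT → FAT example can be checked with the standard dynamic-programming version of the Levenshtein metric, sketched here:

```python
def levenshtein(s, t):
    # Classic dynamic programming: prev holds edit distances for the
    # previous row of the (len(s)+1) x (len(t)+1) table.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,          # delete
                           cur[j - 1] + 1,       # insert
                           prev[j - 1] + cost))  # modify (or match)
        prev = cur
    return prev[-1]

print(levenshtein("CAT", "FAT"))  # 1 (one modify action)
```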
Define clustering, hierarchical clustering, and dendrogram
clustering: unsupervised segmentation
hierarchical clustering: clusters nested inside other clusters, forming a hierarchy from individual points up to a single all-inclusive cluster
dendrogram: a diagram showing the hierarchy of the clusters
Describe how a dendrogram can help decide the number of clusters.
Cutting the dendrogram with a horizontal line at a chosen height yields a clustering; raising or lowering the cut gives the desired number of clusters.
Describe the advantage of hierarchical clustering.
It allows you to see the groupings at different levels of granularity (i.e. the landscape of data similarity).