Chapter 6 Flashcards
What are the use cases for similarity?
- Predictive modelling: classification & regression based on similar instances
- Recommendation (e.g., in eCommerce): suggest items based on similarity
How can nearest neighbours be used for predictive modelling in the case of classification, probability estimation and regression?
Classification: take the class of the nearest neighbor(s), e.g., the majority class among the k nearest neighbors
Probability estimation: use the proportion of neighbors belonging to each class as a score
Regression: use the average or the median of the neighbors' target values
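A minimal Python sketch of the three prediction modes, using scikit-learn and a made-up toy dataset (all data values below are illustrative, not from the chapter):

```python
# Sketch: k-NN for classification, probability estimation, and regression
# (assumes scikit-learn; the toy data is purely illustrative).
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_train = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9]]
y_class = [0, 0, 0, 1, 1]            # class labels
y_value = [1.0, 1.2, 1.1, 7.5, 8.0]  # numeric target

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
print(clf.predict([[1.5, 1.5]]))        # classification: majority class of the 3 NN
print(clf.predict_proba([[1.5, 1.5]]))  # probability estimation: class proportions among the 3 NN

reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_value)
print(reg.predict([[1.5, 1.5]]))        # regression: mean target value of the 3 NN
```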
How many neighbours are needed and how can you solve the issue of points further away having the same influence as those close by?
- Nearest neighbor algorithms: k-NN (k = # of neighbors)
- The greater k, the more estimates are smoothed out among the neighbors
- No strict rules for choosing the # of neighbors
- An odd k is useful to yield a clear majority (avoids ties in binary classification)
- The distance of a neighbor to the target instance can also be taken into account
- Weighted voting / similarity-moderated voting: reduces the risk of tied outcomes by scaling each neighbor's contribution by its similarity to the target instance
- A neighbor's contribution drops the further away it is from the target instance (see the sketch below)
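A brief sketch of similarity-moderated voting, assuming scikit-learn's built-in distance weighting; the toy data is made up:

```python
# Sketch: distance-weighted voting, so closer neighbors contribute more
# than distant ones (assumes scikit-learn).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y_train = [0, 0, 0, 1, 1, 1]

# weights="distance" scales each neighbor's vote by 1 / distance,
# instead of giving all k neighbors equal influence ("uniform").
weighted_knn = KNeighborsClassifier(n_neighbors=4, weights="distance")
weighted_knn.fit(X_train, y_train)
print(weighted_knn.predict_proba([[2.5, 2.5]]))
```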
How does kNN overfitting vary with k?
- k-NN's k acts as a complexity parameter
- High complexity: k = 1
	- Decision boundaries are not straight lines and follow no recognizable pattern; they look essentially random
	- 1-NN predicts perfectly for all training data and reasonably well for new cases (based on similar observations in the training data)
	- Complicated boundaries; every training example effectively gets its own boundary region
- Low complexity: k = n
	- Takes all instances into account = no complexity
	- Predicts the average (or majority) value in the dataset for every case
- In between (e.g., k = 30): takes only the 30 closest neighbors into account
- The smaller k, the more the model overfits; the larger k, the smoother (more averaged) the estimates
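A rough sketch of k as a complexity knob, assuming scikit-learn and synthetic data; the exact accuracies will vary:

```python
# Sketch: 1-NN fits the training data perfectly; very large k smooths
# predictions toward the overall majority (synthetic data, illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 30, len(X_tr)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>3}  train acc={knn.score(X_tr, y_tr):.2f}  test acc={knn.score(X_te, y_te):.2f}")
```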
What can you use to choose the best value for k?
- Choose the best value of k with:
	- Cross-validation
	- Nested holdout testing on the training set
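One way to do this in practice, sketched with scikit-learn's cross-validated grid search (the candidate k values are arbitrary):

```python
# Sketch: picking k by cross-validation on the training set
# (assumes scikit-learn; the data is synthetic and illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 15, 25, 51]},
    cv=5,  # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```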
What are the issues with nearest neighbour methods?
- Intelligibility of the entire model
	- Nearest-neighbor methods build no explicit, interpretable model; whether that matters depends on the use case
	- If model intelligibility & justification are critical, nearest-neighbor methods should be avoided!
- Justification of a specific decision
	- An explanation of why & how a recommendation was made cannot always be given
	- Whether that is adequate depends on the case at hand (Netflix recommendation vs. credit denial)
- Dimensionality & domain knowledge
	- With complex & heterogeneous (differing) attributes things become more complicated (scale, units, etc.)
	- Care must be taken that the similarity / distance computation is meaningful for the application
	- Curse of dimensionality: too many and irrelevant attributes (sometimes with vastly different numeric ranges) contribute to the distance calculation
		- Solved by feature selection: mostly done manually by the data miner
		- Solved by tuning the distance function: assigning different weights to the attributes (see the sketch after this list)
- Computational efficiency
	- Some applications require extremely fast predictions
	- Nearest-neighbor methods are not ideally suited for those applications
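A hedged sketch of one way to make the distance meaningful when attributes have very different scales: standardize them and assign attribute weights (the weights, attribute names, and data below are made up for illustration), assuming scikit-learn:

```python
# Sketch: standardize attributes so one scale does not dominate the
# Euclidean distance, then weight attributes by assumed importance.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[25, 50_000], [40, 52_000], [35, 120_000], [50, 125_000]])  # e.g., age, income
y_train = [0, 0, 1, 1]

attribute_weights = np.array([2.0, 1.0])  # hypothetical weights chosen by the data miner
scaled = StandardScaler().fit_transform(X_train) * attribute_weights

knn = KNeighborsClassifier(n_neighbors=3).fit(scaled, y_train)
print(knn.predict(scaled[:1]))  # new instances must be scaled and weighted the same way
```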
Describe Hierarchical Clustering
- Focuses on the similarities between the individual instances & how similarities link them together
- Allows the data analyst to see the groupings (the landscape) of data similarity before deciding on the number of clusters to extract
- Dendrogram shows where natural clusters occur
- For hierarchical clustering we need a distance function between clusters
- Linkage function: distance function between clusters. Individual instances are the smallest clusters
- E.g.: the linkage function could be the Euclidean distance between the closest points of each cluster
- Complete-linkage takes the maximum distance between clusters.
- Single-linkage takes the minimum distance between clusters.
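A small sketch of single vs. complete linkage using SciPy; the points and the cut level are illustrative:

```python
# Sketch: hierarchical clustering with different linkage functions (assumes SciPy).
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = [[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]]

Z_single = linkage(points, method="single")      # minimum distance between clusters
Z_complete = linkage(points, method="complete")  # maximum distance between clusters

# Cut the tree into 2 clusters; dendrogram(Z_complete) would draw the tree.
print(fcluster(Z_complete, t=2, criterion="maxclust"))
```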
What are the steps to perform k-means clustering?
- Step 1: Select k centers (possibly at random) and assign each point to the closest chosen center.
- Step 2: Find the actual center of each cluster = the centroid, which does not need to be an actual observation
	- The centroid has most probably shifted from the initial center
- Step 3: Run several iterations (compute new cluster centers > reassign the points that are closest > recompute the centers) until the clusters stop changing
- Normally the whole procedure is run several times, each time with different random centers at the start. Results can be compared with numeric measures such as a cluster's distortion (sum of the squared differences between instances & centroid); see the sketch below.
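A minimal sketch of these steps using scikit-learn's KMeans (the toy points are made up); `n_init` covers the repeated random restarts and `inertia_` reports the distortion:

```python
# Sketch: k-means with multiple random restarts (assumes scikit-learn).
from sklearn.cluster import KMeans

points = [[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)  # final centroids (not necessarily actual observations)
print(km.labels_)           # cluster assignment of each point
print(km.inertia_)          # distortion of the resulting clustering
```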
How do k-means and hierarchical clustering compare in terms of efficiency?
- K-means algorithm is efficient & relatively fast even with multiple runs
- Distance calculated between cluster points and the center
- Hierarchical clustering generally slower than k-means clustering
- Distance calculated between all pairs of clusters on each iteration
How do you determine a good value for k?
- Simply experiment with k-values and examine the results
- A good minimum value for k is where stabilization begins (the quality metric becomes stable / plateaus)
- Value for k can be adapted depending on if clusters are
- too large (too broad) or
- too small (overly specific)
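A sketch of the experiment described above (looking for the plateau, often called the "elbow"), assuming scikit-learn and synthetic blob data:

```python
# Sketch: try several k values and watch where the distortion stops
# dropping sharply (synthetic data, illustrative only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  distortion={km.inertia_:.1f}")
# Choose the smallest k after which the distortion curve flattens out.
```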
How do we understand the results of clustering?
How can we interpret the results and its implications?
- Specific data mining results
- Hierarchical clustering: Dendrogram
- Centroid-based clustering (e.g., k-means): set of cluster centers + the data points of each cluster
- Implication of results
- Difficult to understand what, if anything, the clustering reveals in the end
- It is often not obvious how the clusters can be exploited for a benefit
- Creativity in the evaluation stage of the data mining process is key
What are Characteristic and Differential descriptions?
- Characteristic description
- It describes what is typical or characteristic of the cluster, ignoring whether other clusters might share some of these characteristics
- Example technique: Lapointe and Legendre's whiskey cluster descriptions
- Differential description:
- It describes only what differentiates this cluster from the others, ignoring the characteristics that may be shared by whiskeys within it.
- Example technique: Decision tree
- Characteristic descriptions concentrate on intragroup commonalities, whereas differential descriptions concentrate on intergroup differences. Neither is inherently better; it depends on what you're using it for.
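A hedged sketch of a differential description: train a decision tree on the cluster labels so its rules describe what separates the clusters (the data, cluster count, and attribute names are made up for illustration):

```python
# Sketch: decision tree as a differential description of clusters
# (assumes scikit-learn; synthetic data only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["attr_1", "attr_2"]))  # rules that tell clusters apart
```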