Chapter 6 Flashcards
What are the use cases for similarity?
- Predictive modelling: classification & regression based on similar instances
- Recommendation (e.g., in eCommerce): suggest items based on similarity
How can nearest neighbours be used for predictive modelling in the case of classification, probability estimation and regression?
Classification: take the class of the nearest neighbor(s), e.g., the majority class among the k nearest neighbors
Probability estimation: use the proportion of neighbors belonging to each class as a score
Regression: use the average or the median of the neighbors' target values
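A minimal Python sketch of the three prediction modes, using scikit-learn and a made-up toy dataset (all data values below are illustrative, not from the chapter):

```python
# Sketch: k-NN for classification, probability estimation, and regression
# (assumes scikit-learn; the toy data is purely illustrative).
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_train = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9]]
y_class = [0, 0, 0, 1, 1]            # class labels
y_value = [1.0, 1.2, 1.1, 7.5, 8.0]  # numeric target

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
print(clf.predict([[1.5, 1.5]]))        # classification: majority class of the 3 NN
print(clf.predict_proba([[1.5, 1.5]]))  # probability estimation: class proportions among the 3 NN

reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_value)
print(reg.predict([[1.5, 1.5]]))        # regression: mean target value of the 3 NN
```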
How many neighbours are needed and how can you solve the issue of points further away having the same influence as those close by?
- Nearest neighbor algorithms: k-NN (k = # of neighbors)
- The greater k, the more estimates are smoothed out among the neighbors
- No strict rules for choosing the # of neighbors
- An odd k is useful to yield a clear majority (avoids ties in binary classification)
- The distance of a neighbor to the target instance can also be taken into account
- Weighted voting / similarity-moderated voting: reduces the risk of tied outcomes by scaling each neighbor's contribution by its similarity to the target instance
- A neighbor's contribution drops the further away it is from the target instance (see the sketch below)
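A brief sketch of similarity-moderated voting, assuming scikit-learn's built-in distance weighting; the toy data is made up:

```python
# Sketch: distance-weighted voting, so closer neighbors contribute more
# than distant ones (assumes scikit-learn).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y_train = [0, 0, 0, 1, 1, 1]

# weights="distance" scales each neighbor's vote by 1 / distance,
# instead of giving all k neighbors equal influence ("uniform").
weighted_knn = KNeighborsClassifier(n_neighbors=4, weights="distance")
weighted_knn.fit(X_train, y_train)
print(weighted_knn.predict_proba([[2.5, 2.5]]))
```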
How does kNN overfitting vary with k?
- k-NN's k acts as a complexity parameter
- High complexity: k = 1
	- Decision boundaries are not straight lines and follow no recognizable pattern; they look essentially random
	- 1-NN predicts perfectly for all training data and reasonably well for new cases (based on similar observations in the training data)
	- Complicated boundaries; every training example effectively gets its own boundary region
- Low complexity: k = n
	- Takes all instances into account = no complexity
	- Predicts the average (or majority) value in the dataset for every case
- In between (e.g., k = 30): takes only the 30 closest neighbors into account
- The smaller k, the more the model overfits; the larger k, the smoother (more averaged) the estimates
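A rough sketch of k as a complexity knob, assuming scikit-learn and synthetic data; the exact accuracies will vary:

```python
# Sketch: 1-NN fits the training data perfectly; very large k smooths
# predictions toward the overall majority (synthetic data, illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 30, len(X_tr)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>3}  train acc={knn.score(X_tr, y_tr):.2f}  test acc={knn.score(X_te, y_te):.2f}")
```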
What can you use to choose the best value for k?
- Choose the best value of k with:
	- Cross-validation
	- Nested holdout testing on the training set
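One way to do this in practice, sketched with scikit-learn's cross-validated grid search (the candidate k values are arbitrary):

```python
# Sketch: picking k by cross-validation on the training set
# (assumes scikit-learn; the data is synthetic and illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 15, 25, 51]},
    cv=5,  # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```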
What are the issues with nearest neighbour methods?
- Intelligibility of the entire model
	- Nearest-neighbor methods build no explicit, interpretable model; whether that matters depends on the use case
	- If model intelligibility & justification are critical, nearest-neighbor methods should be avoided!
- Justification of a specific decision
	- An explanation of why & how a recommendation was made cannot always be given
	- Whether that is adequate depends on the case at hand (Netflix recommendation vs. credit denial)
- Dimensionality & domain knowledge
	- With complex & heterogeneous (differing) attributes things become more complicated (scale, units, etc.)
	- Care must be taken that the similarity / distance computation is meaningful for the application
	- Curse of dimensionality: too many and irrelevant attributes (sometimes with vastly different numeric ranges) contribute to the distance calculation
		- Solved by feature selection: mostly done manually by the data miner
		- Solved by tuning the distance function: assigning different weights to the attributes (see the sketch after this list)
- Computational efficiency
	- Some applications require extremely fast predictions
	- Nearest-neighbor methods are not ideally suited for those applications
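A hedged sketch of one way to make the distance meaningful when attributes have very different scales: standardize them and assign attribute weights (the weights, attribute names, and data below are made up for illustration), assuming scikit-learn:

```python
# Sketch: standardize attributes so one scale does not dominate the
# Euclidean distance, then weight attributes by assumed importance.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[25, 50_000], [40, 52_000], [35, 120_000], [50, 125_000]])  # e.g., age, income
y_train = [0, 0, 1, 1]

attribute_weights = np.array([2.0, 1.0])  # hypothetical weights chosen by the data miner
scaled = StandardScaler().fit_transform(X_train) * attribute_weights

knn = KNeighborsClassifier(n_neighbors=3).fit(scaled, y_train)
print(knn.predict(scaled[:1]))  # new instances must be scaled and weighted the same way
```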
Describe Hierarchical Clustering
- Focuses on the similarities between the individual instances & how similarities link them together
- Allows the data analyst to see the groupings (the landscape) of data similarity before deciding on the number of clusters to extract
- Dendrogram shows where natural clusters occur
- For hierarchical clustering we need a distance function between clusters
- Linkage function: distance function between clusters. Individual instances are the smallest clusters
- E.g.: the linkage function could be the Euclidean distance between the closest points of each cluster
- Complete-linkage takes the maximum distance between clusters.
- Single-linkage takes the minimum distance between clusters.
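A small sketch of single vs. complete linkage using SciPy; the points and the cut level are illustrative:

```python
# Sketch: hierarchical clustering with different linkage functions (assumes SciPy).
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = [[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]]

Z_single = linkage(points, method="single")      # minimum distance between clusters
Z_complete = linkage(points, method="complete")  # maximum distance between clusters

# Cut the tree into 2 clusters; dendrogram(Z_complete) would draw the tree.
print(fcluster(Z_complete, t=2, criterion="maxclust"))
```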
What are the steps to perform k-means clustering?
- Step 1: Select k centers (possibly at random) and assign each point to the closest chosen center.
- Step 2: Find the actual center of each cluster = the centroid, which does not need to be an actual observation
	- The centroid has most probably shifted from the initial center
- Step 3: Run several iterations (compute new cluster centers > reassign the points that are closest > recompute the centers) until the clusters stop changing
- Normally the whole procedure is run several times, each time with different random centers at the start. Results can be compared with numeric measures such as a cluster's distortion (sum of the squared differences between instances & centroid); see the sketch below.
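A minimal sketch of these steps using scikit-learn's KMeans (the toy points are made up); `n_init` covers the repeated random restarts and `inertia_` reports the distortion:

```python
# Sketch: k-means with multiple random restarts (assumes scikit-learn).
from sklearn.cluster import KMeans

points = [[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)  # final centroids (not necessarily actual observations)
print(km.labels_)           # cluster assignment of each point
print(km.inertia_)          # distortion of the resulting clustering
```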
How do k-means and hierarchical clustering compare in terms of efficiency?
- K-means algorithm is efficient & relatively fast even with multiple runs
- Distance calculated between cluster points and the center
- Hierarchical clustering generally slower than k-means clustering
- Distance calculated between all pairs of clusters on each iteration
How do you determine a good value for k?
- Simply experiment with k-values and examine the results
- A good minimum value for k is where stabilization begins (the quality metric becomes stable / plateaus)
- Value for k can be adapted depending on if clusters are
- too large (too broad) or
- too small (overly specific)
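A sketch of the experiment described above (looking for the plateau, often called the "elbow"), assuming scikit-learn and synthetic blob data:

```python
# Sketch: try several k values and watch where the distortion stops
# dropping sharply (synthetic data, illustrative only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  distortion={km.inertia_:.1f}")
# Choose the smallest k after which the distortion curve flattens out.
```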
How do we understand the results of clustering?
How can we interpret the results and its implications?
- Specific data mining results
- Hierarchical clustering: Dendrogram
- Centroid-based clustering (e.g., k-means): set of cluster centers + the data points of each cluster
- Implication of results
- Difficult to understand what, if anything, the clustering reveals in the end
- It is often not obvious how the clusters can be exploited for a benefit
- Creativity in the evaluation stage of the data mining process is key
What are Characteristic and Differential descriptions?
- Characteristic description
- It describes what is typical or characteristic of the cluster, ignoring whether other clusters might share some of these characteristics
- Example technique: Lapointe and Legendre's whiskey cluster descriptions
- Differential description:
- It describes only what differentiates this cluster from the others, ignoring the characteristics that may be shared by whiskeys within it.
- Example technique: Decision tree
- Characteristic descriptions concentrate on intragroup commonalities, whereas differential descriptions concentrate on intergroup differences. Neither is inherently better; it depends on what you're using it for.
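A hedged sketch of a differential description: train a decision tree on the cluster labels so its rules describe what separates the clusters (the data, cluster count, and attribute names are made up for illustration):

```python
# Sketch: decision tree as a differential description of clusters
# (assumes scikit-learn; synthetic data only).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["attr_1", "attr_2"]))  # rules that tell clusters apart
```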