8: Clustering, association and sequence rules Flashcards
Question 1
Level: easy
Which of the following statements is FALSE? Give answer D if all statements are true.
a) A scree plot shows the intra-cluster distance on the Y-axis versus the number of clusters on the X-axis, and can be used to choose the number of clusters (by looking for the "elbow" where the curve flattens).
b) The K-Means clustering method has the disadvantage that the initialization may be important, requiring one to reinitialize and re-run the method to test the robustness of the result. Moreover, K-Means is not suitable for categorical data.
c) The K-Means clustering method has the advantage that it is efficient in processing large datasets, since it can be parallelized, but it is not robust with respect to outliers.
d) All of the above statements are true.
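The scree (elbow) plot from option (a) can be sketched as follows; a minimal sketch, assuming synthetic blob data and a basic Lloyd's k-means (all names here are illustrative, not from the flashcards):

```python
import numpy as np

def sse_for_k(X, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means; returns the within-cluster sum of squared
    errors (the intra-cluster distance plotted on a scree plot's Y-axis)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centroid.
        d = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        c = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                      for j in range(k)])
    d = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

# Three well-separated synthetic blobs: the SSE drops sharply up to k = 3
# and flattens afterwards, which is the "elbow" used to pick k.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.1, size=(10, 2)) for loc in (0.0, 5.0, 10.0)])
sse = {k: sse_for_k(X, k) for k in (1, 2, 3, 4)}
```

Plotting `sse` against `k` gives the scree plot described in option (a).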
Question 1
What is the downward closure property?
The downward-closure (anti-monotonicity) property of support guarantees that all subsets of a frequent itemset are also frequent. Equivalently, if an itemset is infrequent, every superset of it is infrequent, which is what the Apriori algorithm exploits to prune candidates.
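The property can be checked directly on a toy basket dataset; a minimal sketch, with hypothetical transactions invented for illustration:

```python
from itertools import combinations

# Hypothetical toy transactions (illustrative data only).
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Downward closure: every subset of an itemset is at least as frequent.
freq = {"bread", "milk"}
s = support(freq, transactions)
for r in range(1, len(freq)):
    for sub in combinations(freq, r):
        assert support(sub, transactions) >= s  # holds for every subset
```

This is why Apriori can discard any candidate k-itemset that has an infrequent (k-1)-subset without counting it.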
Question 2
What is the difference between association and sequence rules, and how can the Apriori association rule mining method be extended to retrieve sequence rules?
Association rules and sequence rules are both techniques used in data mining to discover interesting patterns or relationships within datasets.
*Association Rules:
= Identify relationships between different items within the same transaction, e.g., "If a customer buys bread and milk, they are likely to also buy eggs."
*Sequence Rules:
= Reveal patterns in the order of occurrences, e.g., a customer views a product page, adds the product to the cart, and then makes a purchase; the order of the events matters.
The Apriori method mines frequent itemsets by looking at support and confidence; it can be extended to sequence rules (as in GSP, Generalized Sequential Patterns) by generating and counting candidate ordered subsequences instead of unordered itemsets, reusing the same downward-closure pruning on support.
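The two measures Apriori uses can be computed in a few lines; a minimal sketch, again using hypothetical basket data invented for illustration:

```python
# Hypothetical basket data (illustrative only).
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(items, transactions):
    """Fraction of transactions containing all of the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(X -> Y) = support(X union Y) / support(X)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3
```

Apriori keeps only rules whose support and confidence exceed user-chosen minimum thresholds.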
Question 3
Explain the k-means clustering method.
K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping subsets (clusters). The goal of K-means is to group similar data points together and discover underlying patterns or structures within the data. The algorithm operates iteratively, alternating between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, and is based on minimizing the sum of squared distances between data points and their assigned cluster's centroid.
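The assign/update loop described above can be sketched directly; a minimal Lloyd's-algorithm implementation on toy data (the data and function names are assumptions for illustration):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's k-means sketch: assign points to the nearest
    centroid, then recompute each centroid as its cluster's mean."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: mean of the points in each cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs should split cleanly into two clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans(X, k=2)
```

Note the random initialization: this is exactly why re-running with different seeds is recommended to check robustness.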
Question 4
What is the difference between agglomerative and divisive clustering methods?
(1) Hierarchical clustering
a. Agglomerative: start with single elements and aggregate them into clusters (bottom-up).
b. Divisive: start with the complete dataset and divide it into parts (top-down).
(2) Partitional (non-hierarchical) clustering: builds various partitions and uses some criterion to evaluate them (objective-function based).
K-means (based on centroids = cluster means): + good for large datasets, simple, parallelizable
- initialization may be important, not suitable for categorical data, not robust with respect to outliers
PAM (Partitioning Around Medoids; uses medoids, i.e., actual data points, instead of means): less sensitive to outliers
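The centroid-versus-medoid distinction above, and why medoids are less sensitive to outliers, can be shown on a tiny 1-D example (the numbers are invented for illustration):

```python
import numpy as np

# Tiny 1-D cluster with one extreme outlier (toy data).
cluster = np.array([1.0, 2.0, 3.0, 100.0])

# Centroid (mean): dragged far from the bulk of the points by the outlier.
centroid = cluster.mean()  # 26.5

# Medoid: must be an actual member of the cluster, namely the point with
# the smallest total distance to all other points.
dist_sums = np.abs(cluster[:, None] - cluster[None, :]).sum(axis=1)
medoid = cluster[dist_sums.argmin()]  # 2.0, unaffected by the outlier
```

This is the intuition behind PAM's robustness: the representative point stays inside the dense part of the cluster.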
Question 5
What is a linkage measure and why is such a measure needed?
A linkage measure is a metric that quantifies the similarity or dissimilarity between two clusters when merging or forming hierarchical clusters. It is needed because ordinary distance metrics are defined between individual points, while hierarchical methods must repeatedly compare whole clusters of points.
Single Linkage: the minimum distance between any two points, one from each cluster.
Complete Linkage: the maximum distance between any two points, one from each cluster.
Average Linkage: the average distance over all pairs of points, one from each cluster.
Centroid Linkage: the distance between the clusters' centroids (the mean of the data points in each cluster).
Ward's Linkage: merges the pair of clusters that minimizes the increase in variance, aiming to keep the overall within-cluster variance small.
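The first four linkage measures above can be computed from the pairwise distance matrix of two clusters; a minimal sketch on hypothetical 2-D clusters (data chosen for illustration):

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances between points of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

# Two hypothetical 2-D clusters (toy data).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])

d = pairwise(A, B)
single   = d.min()                                # 3.0: closest pair
complete = d.max()                                # 5.0: farthest pair
average  = d.mean()                               # 4.0: mean over all pairs
centroid = np.linalg.norm(A.mean(0) - B.mean(0))  # 4.0: between cluster means
```

Single linkage tends to produce elongated "chained" clusters, while complete linkage favors compact ones; which measure is appropriate depends on the expected cluster shapes.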