Data Clustering Flashcards
Supervised Learning
Discovers patterns in the data that relate data attributes to a target (class) attribute. These patterns are then used to predict the value of the target attribute in future data instances. This requires a labelled dataset.
Unsupervised Learning
The data has no target attribute. We explore the data to find intrinsic structures or patterns that allow us to group the data into related subsets. The dataset here is not labelled.
Clustering
Given a collection of objects (characterized by feature vectors), a clustering algorithm detects the presence of distinct groups (clusters) and assigns class labels to the members of each group. It is used to find similarities in data: instances that are similar to each other are grouped into clusters. It is an unsupervised learning task, since no class values denoting a grouping of the data instances are given.
Clustering Examples
Group people of similar sizes to make T-shirts in sizes S, M, and L. A tailor-made shirt for each person is too expensive, while one-size-fits-all is a poor fit.
In marketing, segment customers according to interests and similarities.
Given a collection of text documents, organize them according to content similarities.
Aspects of Clustering
Types of Clustering
A distance function (see the sketch after this card)
Clustering qualities
The quality of the clustering result depends on the algorithm, distance function, input data and the application.
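To make the distance-function aspect concrete, here is a minimal Python sketch of the commonly used Euclidean distance; the function name is illustrative, as the flashcards do not fix a specific metric:

```python
# A minimal sketch of one common distance function choice: Euclidean distance.
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Example: two 3-dimensional data points at distance 5.
print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
```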
Types of Clustering
Partitional Clustering, Hierarchical Clustering (agglomerative/divisive)
Clustering qualities
Inter-cluster distance is maximized; intra-cluster distance is minimized.
Is Clustering Ill-defined?
Ambiguity in problems: Unlike supervised learning tasks that have clear objective functions (like minimizing loss), clustering often lacks such explicit goals.
Variability in Shapes and Densities: The correct number of clusters or their shapes can vary significantly depending on the data and the domain-specific needs, leading to ambiguity.
Subjectivity: The definition of what constitutes a good cluster can be subjective and application-specific.
Feature Space: Choice of feature representation and distance metric can significantly affect the clustering outcome, but there is often no universal rule for making these choices.
K-Means Algorithm
K-means is a partitional clustering algorithm. Let the set of data points (instances) D be {x1, x2, …, xn}, where each xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
The K-Means algorithm partitions the data into K clusters:
Each cluster has a cluster center called centroid
K is specified by the user
K-means forms the groups so as to minimize the sum of squared distances between the data points and their cluster's centroid.
K-Means Clustering
Given k, the k-means algorithm works as follows:
1. Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2. Assign each data point to the closest centroid (given some distance function).
3. Re-compute the centroids using the current cluster memberships.
4. If a convergence criterion is not met, go to step 2.
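A minimal, self-contained Python sketch of these four steps, using Euclidean distance; the function names and toy data are illustrative, not from the flashcards:

```python
# A minimal sketch of the k-means loop described above.
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(data, k, max_iters=100):
    # Step 1: randomly choose k data points (seeds) as the initial centroids.
    centroids = random.sample(data, k)
    for _ in range(max_iters):
        # Step 2: assign each data point to the closest centroid.
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k), key=lambda j: euclidean(point, centroids[j]))
            clusters[nearest].append(point)
        # Step 3: re-compute each centroid as the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep old centroid for an empty cluster
        # Step 4: stop when the centroids no longer change (convergence).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Example usage on a toy 2-D dataset:
points = [[1, 1], [1.5, 2], [8, 8], [9, 9]]
centroids, clusters = kmeans(points, k=2)
```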
Disk Version of K-Means
K-Means can be implemented with the data points on disk:
1. In each iteration, it scans the data once.
2. The centroids can be computed incrementally.
It can be used to cluster large datasets that do not fit in main memory. We need to control the number of iterations (in practice, fewer than 50). It is not the best method; there are better scale-up algorithms such as BIRCH.
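A minimal sketch of one such disk-based iteration, assuming the data is streamed from a CSV-like file one point per line; the file format and function names are illustrative:

```python
# One disk-based k-means iteration: the data is streamed from a file,
# so only the k running sums and counts (not the dataset) must fit in memory.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def disk_kmeans_iteration(path, centroids):
    k = len(centroids)
    sums = [[0.0] * len(centroids[0]) for _ in range(k)]
    counts = [0] * k
    with open(path) as f:  # one scan of the data per iteration
        for line in f:
            point = [float(v) for v in line.split(",")]
            nearest = min(range(k), key=lambda j: euclidean(point, centroids[j]))
            # Accumulate incrementally instead of storing cluster members.
            for d, v in enumerate(point):
                sums[nearest][d] += v
            counts[nearest] += 1
    # New centroid = accumulated sum / count (keep old centroid if cluster is empty).
    new_centroids = []
    for j in range(k):
        if counts[j]:
            new_centroids.append([s / counts[j] for s in sums[j]])
        else:
            new_centroids.append(centroids[j])
    return new_centroids
```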
Strengths of K-Means
Simple and easy to understand. Efficient: time complexity is O(tkn), where t is the number of iterations, k the number of clusters, and n the number of data points.
It is the most popular clustering algorithm. It terminates at a local optimum when SSE (sum of squared errors) is used as the objective; the global optimum is hard to find due to the required computational complexity.
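One way to write the SSE objective mentioned here, with C_j the j-th cluster and m_j its centroid:

```latex
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^2
```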
Weaknesses of K-Means
The algorithm is only applicable when the mean is defined; for categorical data, k-modes is used instead, where the centroid is represented by the most frequent values (see the sketch after this card). The user needs to specify k. The algorithm is sensitive to outliers: data points that are very far away from the other data points. These could be errors in data collection or special data points with very different values.
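A minimal sketch of the k-modes centroid update mentioned above; the function name and toy data are illustrative:

```python
# k-modes centroid update: for categorical data the centroid takes the most
# frequent value (the mode) of each attribute across the cluster's members.
from collections import Counter

def mode_centroid(members):
    """members: list of records, each a tuple of categorical attribute values."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))

# Example: a cluster of three categorical records.
cluster = [("red", "S"), ("red", "M"), ("blue", "M")]
print(mode_centroid(cluster))  # ('red', 'M')
```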
Methods to deal with outliers
Remove data points that are much further away from the centroids than other data points during the clustering process. To be safe, we may monitor the possible outliers over a few iterations before deciding to remove them.
Another method is to use random sampling: since sampling chooses only a small subset of the data points, the chance of selecting an outlier is small. Cluster the sample, then assign the remaining data points to the clusters by distance or similarity, or by classification.
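A minimal sketch of the first approach (distance-based removal), assuming centroids from a k-means pass; the 3x-mean threshold is an illustrative choice, not from the flashcards:

```python
# Distance-based outlier removal: drop points whose distance to their nearest
# centroid is far beyond the typical distance (threshold choice is illustrative).
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def filter_outliers(data, centroids, factor=3.0):
    # Distance of each point to its nearest centroid.
    dists = [min(euclidean(p, c) for c in centroids) for p in data]
    threshold = factor * (sum(dists) / len(dists))
    kept = [p for p, d in zip(data, dists) if d <= threshold]
    outliers = [p for p, d in zip(data, dists) if d > threshold]
    return kept, outliers
```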
K-Means Summary
Despite its weaknesses, k-means is the most popular clustering algorithm because it is efficient, simple, and ubiquitous. There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications. Comparing different clustering algorithms is a difficult task: no one knows what the correct clusters should be.