Data Clustering Flashcards

1
Q

Supervised Learning

A

Discovers patterns in data that relate data attributes to a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances. Supervised learning uses a labelled dataset.

2
Q

Unsupervised Learning

A

The data has no target attribute. We explore the data to find intrinsic structures or patterns that allow us to cluster the data into related subsets. The dataset here is not labelled.

3
Q

Clustering

A

Given a collection of objects (characterized by feature vectors), a clustering algorithm detects the presence of distinct groups (clusters) and assigns labels to the members of each group. Clustering is used to find similarities in data: it groups instances that are similar to each other into clusters. It is an unsupervised learning task, since no class values denoting a grouping of the data instances are given.

4
Q

Clustering Examples

A

Group people of similar sizes to make T-shirt sizes S, M and L: a shirt tailor-made for one person is too expensive, while one size fits all gives a poor fit.
In marketing, segment customers according to their interests and similarities.
Given a collection of text documents, organize them according to their content similarities.

5
Q

Aspects of Clustering

A

- Types of clustering
- A distance function
- Clustering qualities

The quality of a clustering result depends on the algorithm, the distance function, the input data and the application.

6
Q

Types of Clustering

A

Partitional Clustering, Hierarchical Clustering (agglomerative/divisive)

7
Q

Clustering qualities

A

Inter-cluster distance is maximized; intra-cluster distance is minimized.

8
Q

Is Clustering Ill-defined?

A

Ambiguity in objectives: Unlike supervised learning tasks, which have clear objective functions (such as minimizing a loss), clustering often lacks such explicit goals.

Variability in shapes and densities: The correct number of clusters or their shapes can vary significantly depending on the data and the domain-specific needs, leading to ambiguity.

Subjectivity: The definition of what constitutes a good cluster can be subjective and application-specific.

Feature space: The choice of feature representation and distance metric can significantly affect the clustering outcome, but there is often no universal rule for making these choices.

9
Q

K-Means Algorithm

A

K-means is a partitional clustering algorithm. Let the set of data points (instances) be $D = \{x_1, x_2, \ldots, x_n\}$, where each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is a vector in a real-valued space $X \subseteq \mathbb{R}^r$, and $r$ is the number of attributes (dimensions) in the data.

The K-Means algorithm partitions the data into K clusters:
Each cluster has a cluster center called centroid
K is specified by the user
K-means clustering forms the groups so as to minimize the sum of squared distances between the data points and their cluster centroids (the SSE).
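For reference, the standard SSE objective, where $m_j$ is the centroid of cluster $C_j$:

```latex
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^{2}
```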

10
Q

K-Means Clustering

A

Given k, the k-means algorithm works as follows:
1 Randomly choose k data points (seeds) to be the initial centroids (cluster centers)
2 Assign each data point to the closest centroid (given some distance function)
3 Re-compute the centroids using the current cluster memberships
4 If a convergence criterion is not met, go to step 2
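
A minimal NumPy sketch of these four steps (the function name and defaults are illustrative, not part of the notes):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic k-means. X: (n, r) array of n points with r attributes."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points (seeds) as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid from its current cluster members.
        # (Empty clusters are not handled in this sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: convergence check -- stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```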

11
Q

Disk Version of K-Means

A

K-means can be implemented with the data points on disk:
- In each iteration, it scans the data once
- The centroids can be computed incrementally (sketched below)
This allows it to cluster large datasets that do not fit in main memory. The number of iterations must be controlled (in practice, limited to fewer than 50). It is not the best method for this; there are better scale-up algorithms such as BIRCH.
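
A sketch of the incremental centroid computation, assuming points are streamed from disk one at a time (all names are illustrative):

```python
import numpy as np

def assign_and_update(point_stream, centroids):
    """One disk scan: assign each streamed point to its nearest centroid
    and accumulate running sums/counts, so centroids are computed
    incrementally without holding the data in memory."""
    k, r = centroids.shape
    sums = np.zeros((k, r))
    counts = np.zeros(k, dtype=int)
    for x in point_stream:  # e.g. points read one at a time from disk
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))
        sums[j] += x
        counts[j] += 1
    # New centroid = mean of the points assigned to it in this scan.
    return sums / np.maximum(counts, 1)[:, None]
```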

12
Q

Strengths of K-Means

A

Simple and easy to understand. Efficient: time complexity is O(tkn), where t is the number of iterations, k the number of clusters and n the number of data points.
It is the most popular clustering algorithm. It terminates at a local optimum when the SSE (sum of squared errors) is used as the objective; the global optimum is hard to find due to the computational complexity required.

13
Q

Weaknesses of K-Means

A

The algorithm is only applicable when the mean is defined; for categorical data, k-modes is used instead, where the centroid is represented by the most frequent attribute values. The user needs to specify k. The algorithm is sensitive to outliers: data points that are very far away from other data points. Outliers could be errors in the data collection, or special data points with very different values.

14
Q

A method to deal with outliers

A

One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points. To be safe, we may monitor the possible outliers over a few iterations before deciding to remove them.

Another method is to use random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is small. The remaining data points are then assigned to the clusters by distance or similarity, or by classification.

15
Q

K-Means Summary

A

Despite its weaknesses, k-means is the most popular clustering algorithm because it is efficient, simple and ubiquitous. There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications. Comparing different clustering algorithms is difficult: no one knows what the correct clusters should be.

16
Q

Representation of Clusters

A

Use the centroid of each cluster to represent the cluster:
- Compute the radius and the standard deviation of the cluster to determine its spread in each dimension (see the sketch below)
- The centroid representation alone works well if the clusters are of hyper-spherical shape and Euclidean distance is used
- If clusters are elongated or of other shapes, centroids may not work well
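
A small NumPy sketch of these summary statistics for one cluster (`pts` is an illustrative (m, r) array of the cluster's members):

```python
import numpy as np

pts = np.random.rand(50, 3)               # illustrative cluster members
centroid = pts.mean(axis=0)               # cluster center
radius = np.linalg.norm(pts - centroid, axis=1).max()  # distance to furthest member
spread = pts.std(axis=0)                  # standard deviation per dimension
```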

17
Q

Hierarchical Clustering

A

Produces a nested sequence of clusters: a tree, also called a dendrogram.

18
Q

Types of Hierarchical Clustering

A

Agglomerative (bottom-up) clustering:
- Builds the dendrogram (tree) from the bottom level
- Merges the most similar (or nearest) pair of clusters
- Stops when all the data points are merged into a single cluster (the root cluster)

Divisive (top-down) clustering:
- Starts with all data points in one cluster, the root
- Splits the root into a set of child clusters; each child cluster is recursively divided further
- Stops when only singleton clusters of individual data points remain, i.e. each cluster contains only a single point

19
Q

Agglomerative Clustering

A

More popular than divisive methods:
- At the start, each data point forms a cluster (also called a node)
- Merge the nodes/clusters that have the least distance between them (the closest)
- Keep merging until eventually all nodes belong to one cluster

The distance between two clusters can be measured in many ways, resulting in different variations of the algorithm (single link, complete link, …).
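
A brief sketch comparing these variations using SciPy's `linkage` and `fcluster` (real SciPy APIs; the toy data and cluster count are arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # toy data: 20 points in 2 dimensions

# Each method defines the cluster-to-cluster distance differently,
# giving a different variation of agglomerative clustering.
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # bottom-up dendrogram
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, labels)
```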

20
Q

Single Linkage Method

A

Also known as the minimum or nearest-neighbour method.
The distance between two clusters is the distance between the two closest data points in the two clusters, one from each cluster. It can find arbitrarily shaped clusters, but may suffer from an undesirable chaining effect caused by noisy points.

21
Q

Complete Linkage Method

A

Also known as the maximum or farthest-neighbour method.
The distance between two clusters is the distance between the two furthest data points in the two clusters. It is sensitive to outliers, because they are far away.

22
Q

Average Linkage Method

A

A compromise between:
- The sensitivity of complete-link clustering to outliers and;
- The tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects
In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.

23
Q

The Centroid Method

A

The distance between the two centroids is taken as the distance between the two clusters.

24
Q

Time Complexity

A

All the algorithms are at least O(n^2), where n is the number of data points.
Single link can be done in O(n^2).
Complete and average link can be done in O(n^2 log n).
Due to this complexity, hierarchical clustering is not feasible for large datasets; sampling or scale-up methods such as BIRCH are used instead.

25
Q

Distance Functions

A

The selection of the distance function is of key importance in clustering, since clustering is directly based on the distances between data items. Similarity and dissimilarity are commonly used terms. There are numerous distance functions (metrics) for:
- Different types of data: numeric and nominal/categorical data
- Different specific applications

26
Q

Most commonly used distance functions are:

A

Euclidean distance and
Manhattan distance.
We denote the distance between two data points (vectors) by $\mathrm{dist}(x_i, x_j)$. Both are special cases of the Minkowski distance, given below.
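
The standard Minkowski form, where h = 2 gives the Euclidean distance and h = 1 gives the Manhattan distance:

```latex
\mathrm{dist}(x_i, x_j) = \left( \sum_{l=1}^{r} \lvert x_{il} - x_{jl} \rvert^{h} \right)^{1/h}
```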

27
Q

Distance function for Text Documents

A

A text document consists of a sequence of sentences, and each sentence consists of a sequence of words. In document clustering, a document is usually treated as a bag of words: sequence and position are ignored. A document is represented by a vector, just like a normal data point.
It is common to use similarity rather than distance to compare two documents. The most commonly used similarity function is the cosine similarity.
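
Its standard definition, where $d_i$ and $d_j$ are document vectors:

```latex
\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}
```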

28
Q

Data Standardization

A

In Euclidean space, standardization of attributes may be required so that all attributes have an equal impact on the computation of distances.
Standardization forces the attributes into a common value range; it is sometimes called normalization or scaling.
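
Two standard choices (general formulas, not specific to these notes): range normalization, which maps each attribute to [0, 1], and z-score standardization, where $\mu$ and $\sigma$ are the attribute's mean and standard deviation:

```latex
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
\qquad \text{or} \qquad
x' = \frac{x - \mu}{\sigma}
```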

29
Q

Which Clustering Algorithm to Use

A

Every algorithm has limitations and works well only with certain data distributions. It is very hard to know which distribution the application data follow, and the data may not fully follow any ideal structure required by the algorithms. One also needs to decide how to standardize the data, how to choose a suitable distance function and how to select other parameter values.

Due to these complexities, the common practice is to run several algorithms with different distance functions and parameter settings, and then to carefully analyze and compare the results. The interpretation of the results must be based on insight into the meaning of the original data, together with knowledge of the algorithm used. Clustering is highly application-dependent and, to a certain extent, subjective.