Clustering Flashcards
Clustering Process
- Initialization
- Compute similarity between objects/clusters
- Iteratively cluster or assign objects based on similarity between objects/clusters
- Stop if a stopping condition (threshold) is met, or return to step 2
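To make the four steps concrete, here is a minimal Python sketch of the loop. It assumes, as illustrative choices not prescribed by the card, single-linkage merging as the similarity rule and a distance threshold as the stopping condition:

```python
from itertools import combinations

def cluster(points, dist, threshold):
    # Step 1: initialization -- every object starts as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Step 2: compute similarity between every pair of clusters
        # (here: minimum pairwise object distance, i.e., single linkage)
        (i, j), d = min(
            (((i, j), min(dist(a, b) for a in clusters[i] for b in clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        # Step 4: stop once even the closest pair exceeds the threshold
        if d > threshold:
            break
        # Step 3: merge the two most similar clusters, then repeat
        clusters[i].extend(clusters.pop(j))
    return clusters

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(cluster([(0, 0), (0, 1), (5, 5)], manhattan, threshold=2))
# -> [[(0, 0), (0, 1)], [(5, 5)]]
```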
What type of clustering does SimpleK-means use?
SimpleK-means: distance-based partitioning
Types of Clustering Methods
• SimpleK-means: distance-based partitioning
• EM (Expectation Maximization): statistical modeling
• Farthest-first: uses the farthest-first traversal algorithm (Hochbaum & Shmoys, 1985) with a distance-based partitioning method
• Density-based: density-based clustering
• Cobweb: model-based conceptual clustering
• Hierarchical Clusterer: hierarchical clustering
Taxonomy of Clustering
• Distance-based versus density-based versus model-based clustering (and stopping criteria)
  • Distance-based: reduce intra-cluster distance and/or increase inter-cluster distance (illustrated in the sketch after this card)
  • Density-based: increase density within a cluster
  • Model-based: cluster based on a certain mathematical model, e.g., a probability model or a neural network
• Partitioning versus merging clustering
  • Partitioning: divide objects into clusters iteratively
  • Merging: merge clusters into larger clusters
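As a rough illustration of the distance-based criterion, the snippet below computes average intra-cluster and inter-cluster distances for two hypothetical clusters; the data and function names are illustrative:

```python
from itertools import combinations
import math  # math.dist requires Python 3.8+

def mean_intra_distance(cluster):
    # average distance between all pairs of objects inside one cluster
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_inter_distance(c1, c2):
    # average distance between objects in different clusters
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

tight = [(0, 0), (1, 0), (0, 1)]
far = [(10, 10), (11, 10)]
print(mean_intra_distance(tight))       # small: objects within a cluster are close
print(mean_inter_distance(tight, far))  # large: the clusters are well separated
```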
Important Factors Affecting Distance-based Clustering
• Selection of object attributes during pre-processing
• Selection of clustering method
• Selection of similarity measure
• Selection of other method parameters (e.g., number of clusters) and output parameters
Similarity and Distance
• An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender, etc.)
• To measure similarity between two objects, we measure similarity between these objects’ attribute values based on a distance function.
Distance Measure
– how dissimilar (similar) objects are
• Non-negative
• Distance between the same objects = 0
• Symmetric
• Triangle inequality: the distance between two objects A and B is no larger than the sum of the distance from A to another object C and the distance from C to B
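A small Python check of the four properties, using Manhattan distance (defined in the next card) as the example measure; the helper names are illustrative:

```python
from itertools import product

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def check_metric(points, dist):
    for a, b, c in product(points, repeat=3):
        assert dist(a, b) >= 0                         # non-negative
        assert (dist(a, b) == 0) == (a == b)           # zero only for the same object
        assert dist(a, b) == dist(b, a)                # symmetric
        assert dist(a, b) <= dist(a, c) + dist(c, b)   # triangle inequality

check_metric([(0, 0), (3, 4), (1, 2)], manhattan)
print("all four properties hold on the sample points")
```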
Distance Measures for Numeric Variables
• Manhattan distance
• Euclidean distance
Manhattan Distance
• For two objects X and Y with n numeric variables, Manhattan distance is defined as:
d(X, Y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
where x1, …, xn are the values of the variables of object X and y1, …, yn are the values of the variables of object Y
Example: Manhattan Distance
• E.g., Manhattan distance(Sue, Carl) = |21 − 27| + |2300 − 2600| = 6 + 300 = 306
Euclidean Distance
• For two objects X and Y with n numeric variables, Euclidean distance is defined as:
d(X, Y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
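Both formulas in code, reproducing the Sue/Carl numbers from the Manhattan example above:

```python
import math

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

sue, carl = (21, 2300), (27, 2600)   # (age, spending) from the example
print(manhattan(sue, carl))   # |21 - 27| + |2300 - 2600| = 306
print(euclidean(sue, carl))   # sqrt(6**2 + 300**2), about 300.06
```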
Distance: Binary Variables

NAME  Married  Gender  Home Internet
SUE   N        F       Y
CARL  N        M       Y
Distance: Nominal/Ordinal Variables

NAME  Income Level  Internet Usage Level  State
SUE   Low           10                    UT
CARL  Low           10                    CA
Variable Transformation
We can create dummy variables to dummy code a categorical variable (recall dummy variables in regression models).
• We assign 0/1 based on exact-match criteria, as sketched below. E.g.,
  • Same state = 0, different state = 1
  • Same gender = 0, different gender = 1
  • Same marital status = 0, different status = 1
  • Same home Internet = 0, different home Internet = 1
• We can also rank an attribute. E.g.,
  • High income = 3, medium = 2, low = 1
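A sketch of both transformations, applied to the binary and nominal attributes from the Sue/Carl tables; the dictionary layout and function names are illustrative:

```python
def exact_match_distance(a, b, attrs):
    # dummy coding per attribute: same value -> 0, different value -> 1,
    # summed over the chosen attributes
    return sum(0 if a[k] == b[k] else 1 for k in attrs)

sue = {"married": "N", "gender": "F", "home_internet": "Y", "state": "UT", "income": "Low"}
carl = {"married": "N", "gender": "M", "home_internet": "Y", "state": "CA", "income": "Low"}

attrs = ["married", "gender", "home_internet", "state"]
print(exact_match_distance(sue, carl, attrs))  # 2: they differ on gender and state

# ranking an ordinal attribute, as in the last bullet
income_rank = {"High": 3, "Medium": 2, "Low": 1}
print(abs(income_rank[sue["income"]] - income_rank[carl["income"]]))  # 0: both Low
```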
Normalizing Variables
Min-max normalization of variable values:
• In the previous example, spending dominates the distance because its range is much larger than age’s
• Rescale each dimension so that its minimum and maximum values are the same across dimensions (e.g., 0-100)
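A minimal sketch of min-max rescaling to a 0-100 range, assuming the standard formula v' = (v − min) / (max − min) × 100; the third customer's values are hypothetical, added so the ranges are nontrivial:

```python
def min_max(values, new_min=0, new_max=100):
    # rescale so the smallest value maps to new_min and the largest to new_max
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

ages = [21, 27, 35]             # third value is hypothetical
spendings = [2300, 2600, 5000]  # third value is hypothetical
print(min_max(ages))       # [0.0, 42.86..., 100.0]
print(min_max(spendings))  # [0.0, 11.11..., 100.0]
# after rescaling, age and spending contribute comparably to the distance
```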
Z-score Standardization
• Standardized value = (original value − mean value) / mean absolute deviation
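The formula in code. Note that this card divides by the mean absolute deviation, not the standard deviation; the sample spending values are hypothetical:

```python
def standardize(values):
    mean = sum(values) / len(values)
    # mean absolute deviation, per the card (not the standard deviation)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]

spendings = [2300, 2600, 5000]
print(standardize(spendings))  # e.g., 2300 -> (2300 - 3300) / 1133.33, about -0.88
```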
Clustering Around Centroids
• The most popular centroid-based clustering algorithm is called k-means clustering.
  • “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster.
  • k is the number of clusters that one would like to find in the data.
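A from-scratch sketch matching this description: random initialization, assignment of each instance to its nearest centroid, recomputing each centroid as the per-dimension mean of its cluster, and stopping at convergence. A production choice would typically be a library implementation instead:

```python
import math
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)  # simple random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # update step: each centroid becomes the mean of its members
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: assignments no longer change
            break
        centroids = new
    return centroids, clusters

random.seed(0)  # for a reproducible run
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)  # two centroids, near (1.33, 1.33) and (8.33, 8.33)
```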
What is clustering?
Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items, without being told ahead of time how the groups should look. Because we may not even know what we are looking for, clustering is used for knowledge discovery rather than prediction: it provides insight into the natural groupings found within the data.