Clustering Flashcards
Clustering Process
- Initialization
- Compute similarity between objects/clusters
- Iteratively cluster or assign objects based on similarity
between objects/clusters - Stop if a stopping condition (threshold) is met or return
to step 1
What type partitioning Kmeans
SimpleK-means: distance-based partitioning
Types of Clustering Methods
SimpleK-means: distance-based partitioning
• EM (Expectation Maximization): statistical
modeling
• Farthest-first: use Farthest-first traversal algorithm
(Hochbaum & Shmoys, 1985) with distance-based
partitioning method
• Density-based: density-based
• Cobweb: model-based conceptual clustering
• Hierarchical Clusterer: hierarchical
Taxonomy of Clustering
• Distance-versus-density-versus-model based clustering and
stopping criteria
• Distance-based: reduce intra-cluster distance and/or increase
inter-cluster distance
• Density-based: increase density in a cluster
• Model-based: cluster based on a certain mathematic model, e.g.,
a probability model or a neural network
• Partitioning-versus-merging clustering
• Partitioning: divide objects into clusters iteratively
• Merging: merge clusters into larger clusters
Important Factors Affecting Distancebased
Clustering
• Selection of object attributes during pre-processing
• Selection of clustering method
• Selection of similarity measure
• Selection of other method (e.g., # of clusters) and output
parameters
Similarity and Distance
• An object (e.g., a customer) has a list of
variables (e.g., attributes of a customer such as
age, spending, gender etc.)
• To measure similarity between two objects we
measure similarity between these objects’
attribute values based on a distance function.
Distance measure
– how dissimilar (similar) objects are
• Non-negative
• Distance between the same objects = 0
• Symmetric
• The distance between two objects, A & B, is smaller than the sum
of the distance from A to another object C and the distance from
C to B
2 Distance Variables
• Numeric variables
Manhattan distance
• Euclidean distance
Manhattan
Distance
• For two objects X and Y with n numeric variables,
distance is defined as:
and … are values of variables of object Y
where … are values of variables of object X
equation example for manhattan Distance
Distance
• E.g., Manhattan distance (Sue, Carl)=
|21-27|+|2300-2600|=306
Distance Euclidean
• For two objects X and Y with n numeric variables, Euclidean
distance is defined as:
d(X,Y ) = h (x1 − y1)2 +(x2 − y2 )2 +…+(xn − yn )2
Distance
• Binary variables
Distance • Binary variables NAME Married Gender Home Internet SUE N F Y CARL N M Y
Distance
• Nominal/ordinal variables
Distance • Nominal/ordinal variables NAME Income Internet State Level Usage Level SUE Low 10 UT CARL Low 10 CA
Variable Transformation
We can create dummy variables to dummy
code a categorical variable. (recall dummy
variables in regression models)
• We assign 0/1 based on exact-match criteria.
E.g.,
• Same state = 0, different state = 1
• Same gender = 0, different gender = 1
• Same marital status=0, different status=1
• Same home Internet=0, different home=1
• We can also rank an attribute. E.g.,
• High income =3, med = 2, low = 1
Normalizing variable
Min-Max normalization of variable values:
• In the previous example, spending is dominant
• Set the minimum and maximum distance values for each dimension to be the same
(e.g., 0 - 100)