Clustering Flashcards

1
Q

Clustering Process

A
  1. Initialization
  2. Compute similarity between objects/clusters
  3. Iteratively cluster or assign objects based on the similarity between objects/clusters
  4. Stop if a stopping condition (threshold) is met; otherwise return to step 1
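As a concrete illustration of these four steps, here is a minimal k-means-style sketch in Python (the function name, the sample points, and the use of squared Euclidean distance as the similarity measure are assumptions for the example, not part of the card):

```python
import random

def simple_kmeans(points, k, max_iters=100):
    """Minimal k-means-style loop: initialize, compute distances, assign, repeat."""
    # 1. Initialization: pick k objects as the starting centroids
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # 2. Compute similarity: squared Euclidean distance from each object to each centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            # 3. Assign each object to its closest (most similar) cluster
            clusters[dists.index(min(dists))].append(p)
        # Recompute centroids as the mean of each cluster (an empty cluster keeps its centroid)
        new_centroids = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # 4. Stop if the centroids no longer change; otherwise iterate again
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Illustrative 2-D points
data = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.2), (8.0, 8.0), (9.0, 9.0)]
print(simple_kmeans(data, k=2))
```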
2
Q

What type of partitioning does K-means use?

A

SimpleKMeans: distance-based partitioning

3
Q

Types of Clustering Methods

A

• SimpleKMeans: distance-based partitioning
• EM (Expectation Maximization): statistical modeling
• Farthest-first: uses the farthest-first traversal algorithm (Hochbaum & Shmoys, 1985) with a distance-based partitioning method
• Density-based: density-based clustering (groups objects in dense regions)
• Cobweb: model-based conceptual clustering
• Hierarchical Clusterer: hierarchical
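These names refer to clusterers available in Weka. For readers working in Python, roughly analogous scikit-learn estimators (an assumption; there is no direct scikit-learn counterpart for Farthest-first or Cobweb) could be compared like this:

```python
# Rough scikit-learn analogues of some of the clusterers named above (an assumption;
# the card itself refers to Weka's implementations).
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

models = {
    "k-means (distance-based partitioning)": KMeans(n_clusters=3, n_init=10, random_state=0),
    "EM / Gaussian mixture (statistical modeling)": GaussianMixture(n_components=3, random_state=0),
    "DBSCAN (density-based)": DBSCAN(eps=1.0, min_samples=5),
    "Agglomerative (hierarchical)": AgglomerativeClustering(n_clusters=3),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, "->", len(set(labels)), "distinct labels")
```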

4
Q

Taxonomy of Clustering

A

• Distance- versus density- versus model-based clustering (and corresponding stopping criteria)
  • Distance-based: reduce intra-cluster distance and/or increase inter-cluster distance
  • Density-based: increase density within a cluster
  • Model-based: cluster based on a certain mathematical model, e.g., a probability model or a neural network
• Partitioning versus merging clustering
  • Partitioning: divide objects into clusters iteratively
  • Merging: merge clusters into larger clusters

5
Q

Important Factors Affecting Distance-based Clustering

A

• Selection of object attributes during pre-processing
• Selection of clustering method
• Selection of similarity measure
• Selection of other method parameters (e.g., # of clusters) and output parameters

6
Q

Similarity and Distance

A

• An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender, etc.)
• To measure similarity between two objects, we measure similarity between these objects' attribute values based on a distance function.

7
Q

Distance measure

A

– A distance measure quantifies how dissimilar (or similar) objects are. A valid distance satisfies:
• Non-negativity
• The distance from an object to itself is 0
• Symmetry: the distance from A to B equals the distance from B to A
• Triangle inequality: the distance between two objects A and B is no greater than the sum of the distance from A to a third object C and the distance from C to B
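A quick way to sanity-check these four properties for a candidate distance function is to assert them on sample points; the sketch below does this for Euclidean distance (the sample points are made up for illustration):

```python
import math
import itertools

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0)]

for a, b, c in itertools.permutations(points, 3):
    assert euclidean(a, b) >= 0                                   # non-negativity
    assert euclidean(a, a) == 0                                   # distance to itself is 0
    assert euclidean(a, b) == euclidean(b, a)                     # symmetry
    assert euclidean(a, b) <= euclidean(a, c) + euclidean(c, b)   # triangle inequality
print("All four distance properties hold on the sample points.")
```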

8
Q

Two Distance Measures for Numeric Variables

A

• Numeric variables:
  • Manhattan distance
  • Euclidean distance

9
Q

Manhattan Distance

A

• For two objects X and Y with n numeric variables, Manhattan distance is defined as:

  d(X, Y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n|

  where x_1, ..., x_n are the values of the variables of object X and y_1, ..., y_n are the values of the variables of object Y.
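A minimal Python sketch of this definition (function name and sample values are illustrative assumptions):

```python
def manhattan_distance(x, y):
    """Manhattan distance: sum of absolute differences over the n variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Illustrative usage with two 3-variable objects
print(manhattan_distance((1, 2, 3), (4, 6, 8)))  # |1-4| + |2-6| + |3-8| = 12
```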

10
Q

Equation example for Manhattan distance

A

• E.g., Manhattan distance(Sue, Carl) = |21 − 27| + |2300 − 2600| = 6 + 300 = 306
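The same arithmetic can be reproduced in a couple of lines of Python (ages 21 and 27, spending 2300 and 2600, as in the card's example):

```python
# Reproducing the card's arithmetic: (age, spending) for Sue (21, 2300) and Carl (27, 2600)
sue, carl = (21, 2300), (27, 2600)
print(sum(abs(a - b) for a, b in zip(sue, carl)))  # |21-27| + |2300-2600| = 6 + 300 = 306
```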

11
Q

Euclidean Distance

A

• For two objects X and Y with n numeric variables, Euclidean distance is defined as:

  d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}
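A minimal Python sketch of this definition (the sample values reuse the earlier Sue/Carl numbers for illustration):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative usage with the earlier Sue/Carl values
print(euclidean_distance((21, 2300), (27, 2600)))  # ≈ 300.06
```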

12
Q

Distance: Binary Variables

A
• Binary variables, e.g.:

  NAME   Married   Gender   Home Internet
  SUE    N         F        Y
  CARL   N         M        Y
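One way to turn such binary attributes into a distance, following the exact-match 0/1 coding described later in these cards, is to count mismatches; a small sketch using the table's values:

```python
# Count mismatching binary attributes (0 if the value is the same, 1 if different, per attribute)
sue = {"Married": "N", "Gender": "F", "Home Internet": "Y"}
carl = {"Married": "N", "Gender": "M", "Home Internet": "Y"}

distance = sum(1 for attr in sue if sue[attr] != carl[attr])
print(distance)  # 1 -> Sue and Carl differ only on Gender
```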
13
Q

Distance: Nominal/Ordinal Variables

A
• Nominal/ordinal variables, e.g.:

  NAME   Income Level   Internet Usage Level   State
  SUE    Low            10                     UT
  CARL   Low            10                     CA
14
Q

Variable Transformation

A

We can create dummy variables to dummy-code a categorical variable (recall dummy variables in regression models).
• We assign 0/1 based on exact-match criteria. E.g.:
  • Same state = 0, different state = 1
  • Same gender = 0, different gender = 1
  • Same marital status = 0, different status = 1
  • Same home Internet = 0, different home Internet = 1
• We can also "rank" an attribute. E.g.:
  • High income = 3, med = 2, low = 1 (both codings are sketched below)
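A sketch of both codings (attribute names and values beyond the card's examples are illustrative):

```python
# Exact-match dummy coding and ordinal ranking (a sketch; the objects are illustrative)

def exact_match_code(a, b):
    """0 if the two objects share the attribute value, 1 otherwise."""
    return 0 if a == b else 1

INCOME_RANK = {"high": 3, "med": 2, "low": 1}  # ordinal "rank" coding

sue = {"state": "UT", "gender": "F", "income": "low"}
carl = {"state": "CA", "gender": "M", "income": "low"}

print(exact_match_code(sue["state"], carl["state"]))            # 1 (different state)
print(exact_match_code(sue["gender"], carl["gender"]))          # 1 (different gender)
print(INCOME_RANK[sue["income"]], INCOME_RANK[carl["income"]])  # 1 1 (both low income)
```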

15
Q

Normalizing Variables

A

Min-max normalization of variable values:
• In the previous example, "spending" dominates the distance because its values are on a much larger scale.
• Set the minimum and maximum values for each dimension to the same range (e.g., 0 - 100).
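A minimal min-max normalization sketch (the 0-100 target range comes from the card; the sample values are illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=100.0):
    """Rescale values linearly so their minimum maps to new_min and their maximum to new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

ages = [21, 27, 45]           # illustrative
spending = [2300, 2600, 900]  # illustrative; dominates unnormalized distances
print(min_max_normalize(ages))      # [0.0, 25.0, 100.0]
print(min_max_normalize(spending))  # [82.35..., 100.0, 0.0]
```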

16
Q

Z-score Standardization

A

• Standardized value = (original value − mean value) / mean absolute deviation
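A sketch of this standardization, dividing by the mean absolute deviation exactly as the card states (sample values are illustrative):

```python
def z_standardize(values):
    """Standardize: (value - mean) / mean absolute deviation, as defined on the card."""
    mean = sum(values) / len(values)
    mad = sum(abs(v - mean) for v in values) / len(values)  # mean absolute deviation
    return [(v - mean) / mad for v in values]

spending = [2300, 2600, 900]  # illustrative
print(z_standardize(spending))  # ≈ [0.53, 0.97, -1.5]
```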

17
Q

Clustering Around Centroids

A

• The most popular centroid-based clustering algorithm is called k-means clustering.
  • The "means" are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster.
  • k is the number of clusters that one would like to find in the data.
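As a concrete illustration, the sketch below runs scikit-learn's KMeans (an assumed library choice, not named on the card) and prints the centroids, i.e., the per-dimension means of each cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: two numeric attributes (e.g., age and spending) per instance
X = np.array([[21, 2300], [27, 2600], [45, 900], [50, 800], [23, 2500]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the "means": per-dimension averages of each cluster
print(kmeans.labels_)           # cluster assignment of each instance
```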

18
Q

What is clustering?

A

Clustering is an unsupervised machine learning task that automatically divides the
data into clusters, or groups of similar items. It does this without having been told
how the groups should look ahead of time. As we may not even know what we’re
looking for, clustering is used for knowledge discovery rather than prediction. It
provides an insight into the natural groupings found within data.