Clustering Flashcards
Clustering Process
- Initialization
- Compute similarity between objects/clusters
- Iteratively cluster or assign objects based on similarity between objects/clusters
- Stop if a stopping condition (threshold) is met, or return to step 2
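To make the four steps concrete, here is a minimal Python sketch of the loop. It assumes, as illustrative choices not prescribed by the card, single-linkage merging as the similarity rule and a distance threshold as the stopping condition:

```python
from itertools import combinations

def cluster(points, dist, threshold):
    # Step 1: initialization -- every object starts as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Step 2: compute similarity between every pair of clusters
        # (here: minimum pairwise object distance, i.e., single linkage)
        (i, j), d = min(
            (((i, j), min(dist(a, b) for a in clusters[i] for b in clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        # Step 4: stop once even the closest pair exceeds the threshold
        if d > threshold:
            break
        # Step 3: merge the two most similar clusters, then repeat
        clusters[i].extend(clusters.pop(j))
    return clusters

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(cluster([(0, 0), (0, 1), (5, 5)], manhattan, threshold=2))
# -> [[(0, 0), (0, 1)], [(5, 5)]]
```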
What type of clustering does SimpleK-means use?
SimpleK-means: distance-based partitioning
Types of Clustering Methods
• SimpleK-means: distance-based partitioning
• EM (Expectation Maximization): statistical modeling
• Farthest-first: uses the farthest-first traversal algorithm (Hochbaum & Shmoys, 1985) with a distance-based partitioning method
• Density-based: density-based clustering
• Cobweb: model-based conceptual clustering
• Hierarchical Clusterer: hierarchical clustering
Taxonomy of Clustering
• Distance-based versus density-based versus model-based clustering (and stopping criteria)
  • Distance-based: reduce intra-cluster distance and/or increase inter-cluster distance (illustrated in the sketch after this card)
  • Density-based: increase density within a cluster
  • Model-based: cluster based on a certain mathematical model, e.g., a probability model or a neural network
• Partitioning versus merging clustering
  • Partitioning: divide objects into clusters iteratively
  • Merging: merge clusters into larger clusters
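As a rough illustration of the distance-based criterion, the snippet below computes average intra-cluster and inter-cluster distances for two hypothetical clusters; the data and function names are illustrative:

```python
from itertools import combinations
import math  # math.dist requires Python 3.8+

def mean_intra_distance(cluster):
    # average distance between all pairs of objects inside one cluster
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_inter_distance(c1, c2):
    # average distance between objects in different clusters
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

tight = [(0, 0), (1, 0), (0, 1)]
far = [(10, 10), (11, 10)]
print(mean_intra_distance(tight))       # small: objects within a cluster are close
print(mean_inter_distance(tight, far))  # large: the clusters are well separated
```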
Important Factors Affecting Distance-based Clustering
• Selection of object attributes during pre-processing
• Selection of clustering method
• Selection of similarity measure
• Selection of other method parameters (e.g., number of clusters) and output parameters
Similarity and Distance
• An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender, etc.)
• To measure similarity between two objects, we measure similarity between these objects’ attribute values based on a distance function.
Distance Measure
– how dissimilar (similar) objects are
• Non-negative
• Distance between the same objects = 0
• Symmetric
• Triangle inequality: the distance between two objects A and B is no larger than the sum of the distance from A to another object C and the distance from C to B
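A small Python check of the four properties, using Manhattan distance (defined in the next card) as the example measure; the helper names are illustrative:

```python
from itertools import product

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def check_metric(points, dist):
    for a, b, c in product(points, repeat=3):
        assert dist(a, b) >= 0                         # non-negative
        assert (dist(a, b) == 0) == (a == b)           # zero only for the same object
        assert dist(a, b) == dist(b, a)                # symmetric
        assert dist(a, b) <= dist(a, c) + dist(c, b)   # triangle inequality

check_metric([(0, 0), (3, 4), (1, 2)], manhattan)
print("all four properties hold on the sample points")
```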
Distance Measures for Numeric Variables
• Manhattan distance
• Euclidean distance
Manhattan Distance
• For two objects X and Y with n numeric variables, Manhattan distance is defined as:
d(X, Y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
where x1, …, xn are the values of the variables of object X and y1, …, yn are the values of the variables of object Y
Example: Manhattan Distance
• E.g., Manhattan distance(Sue, Carl) = |21 − 27| + |2300 − 2600| = 6 + 300 = 306
Euclidean Distance
• For two objects X and Y with n numeric variables, Euclidean distance is defined as:
d(X, Y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
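Both formulas in code, reproducing the Sue/Carl numbers from the Manhattan example above:

```python
import math

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

sue, carl = (21, 2300), (27, 2600)   # (age, spending) from the example
print(manhattan(sue, carl))   # |21 - 27| + |2300 - 2600| = 306
print(euclidean(sue, carl))   # sqrt(6**2 + 300**2), about 300.06
```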
Distance: Binary Variables

NAME  Married  Gender  Home Internet
SUE   N        F       Y
CARL  N        M       Y
Distance: Nominal/Ordinal Variables

NAME  Income Level  Internet Usage Level  State
SUE   Low           10                    UT
CARL  Low           10                    CA
Variable Transformation
We can create dummy variables to dummy code a categorical variable (recall dummy variables in regression models).
• We assign 0/1 based on exact-match criteria, as sketched below. E.g.,
  • Same state = 0, different state = 1
  • Same gender = 0, different gender = 1
  • Same marital status = 0, different status = 1
  • Same home Internet = 0, different home Internet = 1
• We can also rank an attribute. E.g.,
  • High income = 3, medium = 2, low = 1
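A sketch of both transformations, applied to the binary and nominal attributes from the Sue/Carl tables; the dictionary layout and function names are illustrative:

```python
def exact_match_distance(a, b, attrs):
    # dummy coding per attribute: same value -> 0, different value -> 1,
    # summed over the chosen attributes
    return sum(0 if a[k] == b[k] else 1 for k in attrs)

sue = {"married": "N", "gender": "F", "home_internet": "Y", "state": "UT", "income": "Low"}
carl = {"married": "N", "gender": "M", "home_internet": "Y", "state": "CA", "income": "Low"}

attrs = ["married", "gender", "home_internet", "state"]
print(exact_match_distance(sue, carl, attrs))  # 2: they differ on gender and state

# ranking an ordinal attribute, as in the last bullet
income_rank = {"High": 3, "Medium": 2, "Low": 1}
print(abs(income_rank[sue["income"]] - income_rank[carl["income"]]))  # 0: both Low
```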
Normalizing Variables
Min-max normalization of variable values:
• In the previous example, spending dominates the distance because its range is much larger than age’s
• Rescale each dimension so that its minimum and maximum values are the same across dimensions (e.g., 0-100)
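A minimal sketch of min-max rescaling to a 0-100 range, assuming the standard formula v' = (v − min) / (max − min) × 100; the third customer's values are hypothetical, added so the ranges are nontrivial:

```python
def min_max(values, new_min=0, new_max=100):
    # rescale so the smallest value maps to new_min and the largest to new_max
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

ages = [21, 27, 35]             # third value is hypothetical
spendings = [2300, 2600, 5000]  # third value is hypothetical
print(min_max(ages))       # [0.0, 42.86..., 100.0]
print(min_max(spendings))  # [0.0, 11.11..., 100.0]
# after rescaling, age and spending contribute comparably to the distance
```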
Z-score Standardization
• Standardized value = (original value − mean value) / mean absolute deviation
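The formula in code. Note that this card divides by the mean absolute deviation, not the standard deviation; the sample spending values are hypothetical:

```python
def standardize(values):
    mean = sum(values) / len(values)
    # mean absolute deviation, per the card (not the standard deviation)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]

spendings = [2300, 2600, 5000]
print(standardize(spendings))  # e.g., 2300 -> (2300 - 3300) / 1133.33, about -0.88
```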
Clustering Around Centroids
• The most popular centroid-based clustering algorithm is called k-means clustering.
  • “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster.
  • k is the number of clusters that one would like to find in the data.
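A from-scratch sketch matching this description: random initialization, assignment of each instance to its nearest centroid, recomputing each centroid as the per-dimension mean of its cluster, and stopping at convergence. A production choice would typically be a library implementation instead:

```python
import math
import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)  # simple random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # update step: each centroid becomes the mean of its members
        new = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged: assignments no longer change
            break
        centroids = new
    return centroids, clusters

random.seed(0)  # for a reproducible run
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)  # two centroids, near (1.33, 1.33) and (8.33, 8.33)
```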
What is clustering?
Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items, without being told ahead of time how the groups should look. Because we may not even know what we are looking for, clustering is used for knowledge discovery rather than prediction: it provides insight into the natural groupings found within the data.