Part 2: Clustering Flashcards
Clustering (unsupervised)
- Similarity
- Partitioning (combinatorial and k-means)
- Hierarchical
- Model based
Applications of clustering
- Market segmentation: groups of customers based on past purchasing behavior, demographic characteristics or other features.
- City-planning: groups of houses according to their house type, value, and geographical location.
- Medical: divide lab data into two clusters: healthy and suspicious.
Objective of clustering
Divide the objects in the data into natural groups (clusters) such that:
- Objects in the same cluster are similar.
- Objects in different clusters are dissimilar.
Unsupervised: there is no class label; the examples themselves determine how the data can be grouped together.
Cluster analysis
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Distances between data points within a cluster are minimized, and distances between data points of different clusters are maximized.
- Goal is to group together similar data; what does that mean?
  + The similarity measure is often more important than the clustering algorithm used.
  + Define a distance function as in k-nearest neighbors, but in clustering no class label is available to pick discriminating attributes.
Distance
- For numerical values:
  + Euclidean distance
  + Manhattan distance
- For nominal/binary attributes:
  + Proportion of unequal attributes out of the total number of attributes.
When you have mixed attributes, you need to use weights, so that no single attribute dominates the distance.
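A minimal sketch of these distance measures in Python (function names and the simple weighting scheme are illustrative, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance for numeric attributes."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    """Manhattan (city-block) distance for numeric attributes."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def nominal_mismatch(x, y):
    """Proportion of attributes with unequal values (nominal/binary)."""
    x, y = np.asarray(x), np.asarray(y)
    return np.mean(x != y)

def mixed_distance(x_num, y_num, x_nom, y_nom, w_num=0.5, w_nom=0.5):
    """Weighted combination so that neither attribute type dominates.
    Numeric attributes are assumed to be scaled to [0, 1] beforehand."""
    return w_num * manhattan(x_num, y_num) / len(x_num) + w_nom * nominal_mismatch(x_nom, y_nom)

print(euclidean([0, 0], [3, 4]))                     # 5.0
print(manhattan([0, 0], [3, 4]))                     # 7
print(nominal_mismatch(["red", "S"], ["red", "M"]))  # 0.5
```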
Types of clustering methods
- Partitioning methods
- Hierarchical methods
- Model based clustering
Partitioning - Combinatorial
- Assign each object to one of k clusters.
- Minimize sum of distances within clusters
- Maximize sum of distances between clusters
The distances (combinatorial)
- W is the sum of all distances between points in the same cluster.
- B is the sum of all distances between points in different clusters.
- W + B = constant -> combining W and B gives the sum of all pairwise distances (counted twice if both orderings are summed), which does not depend on the clustering.
The algorithm minimizes W and maximizes B. Minimizing W takes exponential time in the number of data points.
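A small sketch, assuming W and B are computed from Euclidean distances and each unordered pair is counted once, showing that W + B stays the same for any cluster assignment (function and variable names are my own):

```python
import numpy as np
from itertools import combinations

def within_between(points, labels):
    """Sum of within-cluster (W) and between-cluster (B) pairwise distances."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    W = B = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = np.linalg.norm(points[i] - points[j])
        if labels[i] == labels[j]:
            W += d          # same cluster -> within
        else:
            B += d          # different clusters -> between
    return W, B

pts = [[0, 0], [1, 0], [10, 0], [11, 0]]
W, B = within_between(pts, [0, 0, 1, 1])
total = sum(np.linalg.norm(np.subtract(a, b)) for a, b in combinations(pts, 2))
print(W, B, np.isclose(W + B, total))   # W is small, B is large, W + B is the fixed total
```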
K-means clustering
Because minimizing W exactly takes too much time, k-means clustering was developed. It is a good approximation: not exact, but fast.
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid.
Steps:
- Select K points as the initial centroids.
- Repeat until the centroids no longer change:
  + Form K clusters by assigning each point to its closest centroid.
  + Recompute the centroid of each cluster.
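A minimal k-means sketch following these steps (plain NumPy; it does not handle the rare case of an empty cluster, and the names are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid of each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```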
How many clusters are needed/wanted?
- Algorithmic approach
- Decide yourself
Selecting the right K
- Try different k, looking at the change in the average distance to centroids as k increases.
- The average falls rapidly until the right k, then changes little.
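A sketch of this elbow heuristic, assuming scikit-learn's KMeans is available; the synthetic three-blob data is only for illustration, so the elbow should appear around k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Average distance from each point to its assigned centroid.
    avg_dist = np.mean(np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1))
    print(k, round(avg_dist, 3))   # drops sharply up to k = 3, then flattens
```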
Hierarchical clustering
- Agglomerative -> bottom-up
- Divisive -> top-down
Hierarchical clustering - Bottom-up
- Start with many clusters: every data point begins in its own cluster.
- At each level, two clusters are merged into one.
- From the distance matrix, find the two nearest clusters and merge them.
- Update distance matrix.
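A bare-bones bottom-up sketch with single linkage (names are my own); instead of shrinking the distance matrix after each merge, it recomputes cluster distances from the original point-to-point matrix, which is equivalent but slower:

```python
import numpy as np

def agglomerative(X, n_clusters=2):
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]               # each point starts in its own cluster
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distance matrix

    def cluster_dist(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(D[i, j] for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two nearest clusters and merge them.
        pairs = [(cluster_dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

print(agglomerative([[0, 0], [0, 1], [5, 5], [5, 6]], n_clusters=2))
# [[0, 1], [2, 3]]
```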
Hierarchical clustering - cluster distance
- Single linkage
- Complete linkage
- Average linkage
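A sketch comparing the three linkage criteria, assuming SciPy's scipy.cluster.hierarchy is available; the merge tree is built bottom-up and then cut into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # bottom-up merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```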
Hierarchical clustering - Top-down
- Apply 2-means (k-means with k = 2) to the root.
- Apply k-means to the new clusters.
- Repeat the k-means splits until the desired number of clusters is reached.
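A rough top-down (bisecting) sketch using scikit-learn's KMeans; always splitting the largest remaining cluster is just one reasonable choice, not prescribed by the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters=3, seed=0):
    X = np.asarray(X, dtype=float)
    clusters = [np.arange(len(X))]                    # start with everything in the root cluster
    while len(clusters) < n_clusters:
        # Pick the largest cluster and split it with 2-means.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in ([0, 0], [4, 4], [8, 0])])
print([len(c) for c in bisecting_kmeans(X, n_clusters=3)])   # roughly [20, 20, 20]
```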