Part 2: Clustering Flashcards

1
Q

Clustering (unsupervised)

A
  • Similarity
  • Partitioning (combinatorial and k-means)
  • Hierarchical
  • Model-based
2
Q

Application of clustering

A
  • Market segmentation: groups of customers based on past purchasing behavior, demographic characteristics or other features.
  • City-planning: groups of houses according to their house type, value, and geographical location.
  • Medical: divide lab data into two clusters: healthy and suspicious.
3
Q

Objective of clustering

A

Divide the objects in the data into natural groups (clusters) such that:
- Objects in the same cluster are similar.
- Objects in different clusters are dissimilar.
Unsupervised: there is no class label; the examples themselves determine how the data can be grouped together.

4
Q

Cluster analysis

A
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The distances between data points within a cluster are minimized and the distances between data points of different clusters are maximized.
- The goal is to group together similar data, but what does that mean?
+ The similarity measure is often more important than the clustering algorithm used.
+ Define a distance function as in k-nearest neighbors, but note that in clustering no class label is available to pick out the discriminating attributes.
5
Q

Distance

A
- For numerical values:
+ Euclidean distance
+ Manhattan distance
- For nominal/binary attributes:
+ Proportion of unequal attributes out of the total number of attributes.

When you have mixed attributes, you need to use weights, because you don’t want one attribute to dominate everything (see the sketch below).
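
A minimal Python sketch of these distance functions; the mixed-attribute helper and its per-attribute weights are illustrative assumptions, not a fixed formula from the cards:

```python
import math

def euclidean(x, y):
    # Straight-line distance for numerical attributes.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # City-block distance for numerical attributes.
    return sum(abs(a - b) for a, b in zip(x, y))

def nominal_mismatch(x, y):
    # Proportion of unequal attributes for nominal/binary data.
    return sum(a != b for a, b in zip(x, y)) / len(x)

def mixed_distance(x, y, numeric_idx, weights):
    # Hypothetical weighted combination for mixed attributes:
    # the weights keep any single attribute from dominating.
    total = 0.0
    for i, w in enumerate(weights):
        if i in numeric_idx:
            total += w * abs(x[i] - y[i])
        else:
            total += w * (x[i] != y[i])
    return total
```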

6
Q

Types of clustering methods

A
  • Partitioning methods
  • Hierarchical methods
  • Model-based clustering
7
Q

Partitioning - Combinatorial

A
  • Assign a cluster to each object (k clusters in total).
  • Minimize the sum of distances within clusters.
  • Maximize the sum of distances between clusters.
8
Q

The distances (combinatorial)

A
  • W is the sum of all distances between points in the same cluster.
  • B is the sum of all distances between points in different clusters.
  • W + B = constant -> combining W and B gives the sum of all distances, counted twice.
    The algorithm minimizes W and maximizes B; minimizing W exactly takes time exponential in the number of data points (see the sketch below).
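
A small sketch, assuming an arbitrary point-level distance function `dist`, that computes W and B; for any assignment of labels, W + B equals the same constant, so minimizing W is equivalent to maximizing B:

```python
from itertools import combinations

def within_between(points, labels, dist):
    # W: sum of distances over pairs in the same cluster.
    # B: sum of distances over pairs in different clusters.
    # (The card counts each pair twice; here each unordered pair
    # is counted once, which only rescales the constant.)
    W = B = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        if labels[i] == labels[j]:
            W += d
        else:
            B += d
    return W, B  # W + B is the same for every labeling
```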
9
Q

K-means clustering

A

Because minimizing W exactly takes too much time, k-means clustering was developed. It is a good approximation: not exact, but fast.

  • Partitional clustering approach.
  • Each cluster is associated with a centroid (center point).
  • Each point is assigned to the cluster with the closest centroid.

Steps:

  • Select K points as the initial centroids.
  • Repeat:
  - Form K clusters by assigning all points to their closest centroids.
  - Recompute the centroid of each cluster.
  • Stop when the centroids no longer change (sketched below).
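
A from-scratch sketch of these steps, assuming the data points are numeric tuples and using squared-Euclidean distance for the assignment step; library implementations (e.g., scikit-learn's KMeans) add smarter initialization, but the loop is the same:

```python
import random

def kmeans(points, k, iters=100):
    # Step 1: select K points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as the cluster mean
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids no longer change
            break
        centroids = new_centroids
    return centroids, clusters
```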
10
Q

How many clusters are needed/wanted?

A
  • Algorithmic approach

- Decide yourself

11
Q

Selecting the right K

A
  • Try different values of k, looking at the change in the average distance to the centroids as k increases.
  • The average falls rapidly until the right k, then changes little (the "elbow"; see the sketch below).
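
A sketch of this heuristic, reusing the `kmeans` sketch from the k-means card; `data` is an assumed list of numeric tuples:

```python
import math

def avg_distance_to_centroid(points, k):
    # Mean distance from each point to its own cluster centroid.
    centroids, clusters = kmeans(points, k)  # sketch from the k-means card
    total, n = 0.0, 0
    for c, cluster in zip(centroids, clusters):
        for p in cluster:
            total += math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
            n += 1
    return total / n

# Try increasing k; the average falls rapidly until the right k,
# then changes little:
# for k in range(1, 10):
#     print(k, avg_distance_to_centroid(data, k))
```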
12
Q

Hierarchical clustering

A
  • Agglomerative -> bottom-up
  • Divisive -> top-down

13
Q

Hierarchical clustering - Bottom-up

A
  • Start with many clusters: initially, every data point is its own cluster.
  • At each level, decide to merge two clusters into one.
  • From the distance matrix, compute the nearest pair of clusters and merge them.
  • Update the distance matrix (see the sketch below).
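
One common way to run this merge loop in practice is SciPy's hierarchical-clustering routines; a sketch, assuming the data points are numeric vectors:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def agglomerative(points, n_clusters, method="single"):
    # pdist computes the initial pairwise distance matrix;
    # linkage repeatedly merges the two nearest clusters,
    # updating the distances after every merge.
    merges = linkage(pdist(points), method=method)
    # Cut the merge tree so that n_clusters clusters remain.
    return fcluster(merges, t=n_clusters, criterion="maxclust")
```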
14
Q

Hierarchical clustering - cluster distance

A
  • Single linkage
  • Complete linkage
  • Average linkage
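
These three criteria define the distance between two clusters A and B from the pairwise point distances; a direct sketch, assuming a point-level distance function `dist`:

```python
def single_linkage(A, B, dist):
    # Distance between the two closest points of A and B.
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B, dist):
    # Distance between the two farthest points of A and B.
    return max(dist(a, b) for a in A for b in B)

def average_linkage(A, B, dist):
    # Mean of all pairwise distances between A and B.
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))
```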
15
Q

Hierarchical clustering - Top-down

A
  • Apply k-means with k = 2 to the root (the whole data set).
  • Apply 2-means again to each of the new clusters.
  • Repeat recursively (see the sketch below).
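
A recursive sketch of this top-down (divisive) procedure, reusing the `kmeans` sketch from the k-means card; the `depth` parameter, an assumption here, limits how often we split:

```python
def divisive(points, depth):
    # Split the current cluster in two with k-means (k = 2),
    # then recurse on each half.
    if depth == 0 or len(points) < 2:
        return [points]
    _, clusters = kmeans(points, 2)
    if any(len(c) == 0 for c in clusters):
        return [points]  # degenerate split: stop here
    return [sub for c in clusters for sub in divisive(c, depth - 1)]
```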
16
Q

Association/Clustering

A

Unsupervised: no class label.

  • Association looks at the different attributes of one object -> which attributes are strongly related?
  • Clustering looks at different objects across their attributes -> which objects are similar?