Lecture 8 - Unsupervised Learning I Flashcards

1
Q

What do patterns tell us?

A

Patterns describe or summarize the data set or parts of it

2
Q

What is cluster analysis?

A

Identifying groups of “similar” data objects

3
Q

What are association rules?

A

Finding associations between attributes or typical attribute combinations, e.g. if demand = high and supply = low, then price = high

4
Q

What is deviation analysis?

A

Finding groups that deviate from the rest of the data, e.g. men under 30 differing from the data set as a whole.

5
Q

What is hierarchical clustering?

A

Hierarchical clustering builds clusters step by step

6
Q

What is agglomerative hierarchical clustering?

A

A bottom-up strategy: first consider each data object as a separate cluster, then step by step merge the closest clusters together
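
A minimal sketch of this bottom-up merging for numeric data, assuming single or complete linkage (defined on later cards); the function name and example data are only illustrative:

```python
import numpy as np

def agglomerative(points, k, linkage="single"):
    """Naive bottom-up clustering: start with one cluster per point,
    then repeatedly merge the two closest clusters until k remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster

    def cluster_distance(a, b):
        # all pairwise distances between the members of the two clusters
        d = np.linalg.norm(points[a][:, None, :] - points[b][None, :, :], axis=-1)
        return d.min() if linkage == "single" else d.max()  # single vs. complete linkage

    while len(clusters) > k:
        # find and merge the pair of clusters with the smallest distance
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# agglomerative([[0, 0], [0, 1], [5, 5], [5, 6]], k=2)  ->  [[0, 1], [2, 3]]
```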

7
Q

What is divisive hierarchical clustering?

A

Starting with the whole data set as one cluster and then splitting it into smaller ones. Seldom used, because already the first split offers on the order of 2^(n-1) possible ways to divide the data into two clusters

8
Q

What is an isotropic distance?

A

The distance grows equally fast in all directions (like the Euclidean distance)

9
Q

What is a non-isotropic distance?

A

The distance weights different directions differently
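
A small illustration of the difference, with made-up weights: the plain Euclidean distance is isotropic, while a weighted Euclidean distance stretches some directions more than others:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Isotropic: every direction contributes equally.
euclidean = np.sqrt(np.sum((a - b) ** 2))                # 5.0

# Non-isotropic: each direction gets its own weight (illustrative weights).
w = np.array([1.0, 4.0])
weighted_euclidean = np.sqrt(np.sum(w * (a - b) ** 2))   # sqrt(9 + 64) ≈ 8.54
```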

10
Q

What are the b, n and x in the dissimilarity measures?

A

b = number of (binary) attributes that hold in both records, n = number that hold in neither record, x = number that hold in exactly one of the two records
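
The card does not quote the lecture's exact formulas, so as an assumed illustration, here are two common binary dissimilarity measures expressed with these counts (simple matching and a Jaccard/Tanimoto-style measure):

```python
def binary_counts(r1, r2):
    """For two records of binary attributes, count:
    b = attributes that hold in both, n = attributes that hold in neither,
    x = attributes that hold in exactly one of the two records."""
    b = sum(1 for u, v in zip(r1, r2) if u and v)
    n = sum(1 for u, v in zip(r1, r2) if not u and not v)
    x = sum(1 for u, v in zip(r1, r2) if bool(u) != bool(v))
    return b, n, x

def simple_matching_dissimilarity(r1, r2):
    b, n, x = binary_counts(r1, r2)
    return x / (b + n + x)        # fraction of attributes on which the records disagree

def jaccard_dissimilarity(r1, r2):
    b, n, x = binary_counts(r1, r2)
    return x / (b + x)            # ignores attributes that hold in neither record

# simple_matching_dissimilarity([1, 1, 0, 0], [1, 0, 0, 1])  -> 0.5
# jaccard_dissimilarity([1, 1, 0, 0], [1, 0, 0, 1])          -> 0.666...
```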

11
Q

What is single linkage?

A

The dissimilarity of the two most similar data objects of the two clusters (so the two closest points determine the cluster distance)

12
Q

What is complete linkage?

A

The dissimilarity of the two most dissimilar data objects of the two clusters (so the two points that are furthest apart determine the cluster distance)

13
Q

What is average linkage?

A

The average dissimilarity over all pairs of points, one from each of the two clusters

14
Q

What is centroid linkage?

A

Distance between two centroids (mean value vectors)
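
A small sketch that computes all four linkages from the previous cards for two clusters of numeric points, using the Euclidean distance (data and function name are illustrative):

```python
import numpy as np

def linkages(cluster_a, cluster_b):
    """Single, complete, average and centroid linkage between two clusters
    (one data point per row), using the Euclidean distance."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # all pairwise distances
    return {
        "single":   d.min(),    # closest pair of points
        "complete": d.max(),    # furthest pair of points
        "average":  d.mean(),   # average over all pairs
        "centroid": np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)),  # distance of the mean vectors
    }

# linkages([[0, 0], [0, 1]], [[3, 0], [4, 0]])
```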

15
Q

What are dendrograms?

A

The cluster merging process arranges the data points in a binary tree: draw the data tuples at the bottom and, for each merge, draw a connection between the merged clusters at a height corresponding to their distance. Cutting the dendrogram at a specific distance then yields the clusters for that distance.

16
Q

What are some approaches to choosing the clusters?

A

Simplest approach: specify a minimum desired distance between clusters and stop merging as soon as the closest remaining clusters are farther apart than that.

Visual approach: merge clusters until all data points are combined into one cluster, draw the dendrogram and find a good cut level (the cut does not have to be horizontal).

More sophisticated approaches: analyse the sequence of merge distances and find a step where the distance increase is much larger than in the previous steps (see the sketch after this list).
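
One possible realisation of these ideas with SciPy's hierarchical clustering routines (the data, linkage method and cut distance are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# toy data: two well-separated groups
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

Z = linkage(X, method="single")            # agglomerative merging (single linkage)

# simplest approach: cut at a fixed distance -> one cluster label per data point
labels = fcluster(Z, t=2.0, criterion="distance")

# visual approach: draw the dendrogram and pick a cut level by eye
# dendrogram(Z)                            # requires matplotlib for plotting

# more sophisticated approach: look for an unusually large jump in the merge distances
merge_distances = Z[:, 2]
jumps = np.diff(merge_distances)
```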

17
Q

What is k-Means clustering?

A

The k-means algorithm partitions the data points into exactly k clusters; k must be chosen in advance

The objective is to minimize the total intra-cluster variance

18
Q

How does k-means work?

A

Initialize the cluster centers by randomly selecting k data points, assign each data point to the closest center, update each center to the mean of its assigned points, and repeat until the assignments converge (see the sketch below)
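
A minimal numpy sketch of exactly these steps (the function name, seed handling and stopping rule are only illustrative):

```python
import numpy as np

def k_means(X, k, seed=None, max_iter=100):
    """Minimal k-means: random initial centers taken from the data, then
    alternate assignment and center update until the assignment stops changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k random data points
    assignment = None
    for _ in range(max_iter):
        # assign each data point to the closest cluster center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_assignment = distances.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                             # converged
        assignment = new_assignment
        # move every center to the mean of the points assigned to it
        for j in range(k):
            if np.any(assignment == j):
                centers[j] = X[assignment == j].mean(axis=0)
    return centers, assignment

# centers, labels = k_means([[0, 0], [0, 1], [5, 5], [5, 6]], k=2, seed=0)
```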

19
Q

What’s the problem with k-means?

A

The result is fairly sensitive to the initial positions of the cluster centers, so a bad initialisation may lead to a poor clustering

20
Q

What is the silhouette coefficient?

A

The silhouette value measures how similar an object is to its own cluster compared to the other clusters. It ranges from -1 to +1; a high value indicates that the object is well matched to its own cluster.
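
Sketched from the usual definition s(i) = (b - a) / max(a, b), where a is the mean distance of object i to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster (the lecture may write this slightly differently):

```python
import numpy as np

def silhouette_value(i, X, labels):
    """Silhouette of object i: (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X - X[i], axis=1)                  # distances of i to all objects
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = dist[own].mean()                                      # mean distance within own cluster
    b = min(dist[labels == c].mean()                          # nearest other cluster
            for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)

# silhouette_value(0, [[0, 0], [0, 1], [5, 5], [5, 6]], [0, 0, 1, 1])  ->  ~0.87
```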

21
Q

What is density-based clustering (DBSCAN)?

A

With numerical data it is possible to use density-based clustering such as DBSCAN (see the sketch below):

  1. Find a data point where the density is high, i.e. within distance x there are at least y other points
  2. All points within distance x of that neighbourhood are considered to belong to the same cluster
  3. Keep expanding the cluster as long as the newly reached points again have at least y points within distance x
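
A minimal sketch of this procedure, where eps plays the role of the distance x and min_pts the role of the required number of points within that distance (counting the point itself); names and data are illustrative, and label -1 marks points that end up in no dense region:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Density-based clustering in the spirit of DBSCAN."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # includes i itself
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_pts:
            continue                          # already assigned, or not a dense point
        labels[i] = cluster                   # start a new cluster at a dense point
        frontier = list(neighbours[i])
        while frontier:                       # expand the cluster
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbours[j]) >= min_pts:     # only dense points keep expanding
                    frontier.extend(neighbours[j])
        cluster += 1
    return labels

# dbscan([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [9, 9]], eps=1.5, min_pts=3)
# -> array([ 0,  0,  0,  1,  1,  1, -1])
```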