Clustering Flashcards

1
Q

Clustering

A

A set of techniques for finding subgroups, or clusters, in a data set. A good clustering is one in which the observations within a group are similar to each other, while observations in different groups are very different.

2
Q

high intra-class similarity

A

cohesive within clusters

3
Q

low inter-class similarity

A

distinctive between clusters

4
Q

What factors determine the quality of a clustering method?

A

– The similarity measure used by the method
– Its implementation
– Its ability to discover some or all of the hidden patterns

5
Q

Centroid based Clustering

A

Represents each cluster through its center, which is called the centroid.

6
Q

K-Means

A

A centroid-based algorithm. The “means” are the centroids, computed as the arithmetic means (averages) of the values along each dimension for the instances in the cluster.
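
A minimal sketch of the idea in Python with NumPy (the function name k_means and all variable names are illustrative, not from the cards): each iteration assigns every observation to its nearest centroid and then recomputes each centroid as the arithmetic mean of its assigned points.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups; centroids are per-cluster means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct observations at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each observation to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean along each dimension of its cluster.
        # (Sketch assumes no cluster ever ends up empty.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids
```

In practice a library implementation such as scikit-learn's KMeans would normally be used, since it also restarts from several random initialisations and keeps the best result.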

7
Q

What is clusters’ distortion?

A

The sum of squared differences between each data point and its corresponding centroid. The clustering with the lowest distortion value can be deemed the best clustering
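
Written as a formula, with x_i denoting observation i and \mu_{c(i)} the centroid of the cluster it is assigned to (this notation is assumed here, not taken from the card):

```latex
\text{distortion} = \sum_{i=1}^{n} \left\lVert x_i - \mu_{c(i)} \right\rVert^2
```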

8
Q

K-Means weaknesses

A

– Sensitive to noisy data and outliers
– Weak at clustering non-convex shapes

9
Q

K-Means strengths

A

– Fast and efficient
– Often terminates at a locally optimal solution
– Applicable to continuous n-dimensional space

10
Q

How are observations clustered in K-Means?

A

One way of achieving this is to minimize the sum of all the pair-wise squared Euclidean distances between the observations in each cluster.
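
In symbols, one common textbook formulation of this objective is the following (C_k denotes the set of observations in cluster k, p the number of dimensions, and x_{ij} the value of dimension j for observation i; the 1/|C_k| scaling is part of that formulation, not stated on the card):

```latex
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \frac{1}{|C_k|}
  \sum_{i,\,i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2
```

Minimising this is equivalent, up to a constant factor, to minimising the clusters' distortion defined earlier.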

11
Q

The general equation for Euclidean distance in n dimensions is as below:

A

d(A, B) = \sqrt{(d_{1,A} - d_{1,B})^2 + (d_{2,A} - d_{2,B})^2 + \cdots + (d_{n,A} - d_{n,B})^2}
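
A quick check of the same formula in NumPy (the vectors A and B are made up for illustration):

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 6.0, 3.0])

# Square the difference along each dimension, sum, then take the square root.
d = np.sqrt(((A - B) ** 2).sum())
print(d)                      # 5.0
print(np.linalg.norm(A - B))  # same value via NumPy's built-in norm
```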

12
Q

What is a dendrogram?

A

A tree-based representation of the observations; hierarchical clustering has the added advantage that it produces one.

13
Q

Hierarchical Clustering

A

Produces a tree-based representation of the observations, called a dendrogram. Does not require the number of clusters to be chosen in advance.
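
A minimal sketch with SciPy (the toy data and the choice of complete linkage are illustrative): linkage builds the hierarchy bottom-up, dendrogram draws the tree, and the number of clusters is chosen only afterwards by cutting the tree.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))        # 30 toy observations in 2 dimensions

Z = linkage(X, method="complete")   # build the hierarchy (complete linkage here)
dendrogram(Z)                       # tree-based representation of the observations
plt.show()

# Cut the tree afterwards, e.g. into 3 clusters:
labels = fcluster(Z, t=3, criterion="maxclust")
```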

14
Q

Hierarchical Clustering Weaknesses

A

– Not very scalable
– The distance matrix can be huge to compute
– Cannot undo what was done previously

15
Q

Hierarchical Clustering Strengths

A

– Availability of a dendrogram
– Smaller clusters may be generated.

16
Q

4 Linkage methods

A

There are four options (defined in the sketch below):
– Complete Linkage
– Single Linkage
– Average Linkage
– Centroid Linkage
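
For two clusters A and B with a pointwise dissimilarity d(a, b), the four options can be written as follows (standard definitions, stated here for reference rather than taken from the card):

```latex
\begin{aligned}
\text{Complete:} \quad & \max_{a \in A,\; b \in B} d(a, b) \\
\text{Single:}   \quad & \min_{a \in A,\; b \in B} d(a, b) \\
\text{Average:}  \quad & \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b) \\
\text{Centroid:} \quad & d(\bar{a}, \bar{b}), \text{ where } \bar{a} \text{ and } \bar{b} \text{ are the centroids of } A \text{ and } B
\end{aligned}
```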

17
Q

How do we define dissimilarity?

A
18
Q

Even-sized clusters are produced by which linkage methods?

A

Complete and Average Linkage

19
Q

Dissimilarity measures

A

– Euclidean distance
– Correlation-based distance (both sketched below)
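
A short sketch of the two measures with SciPy (the vectors x and y are made up for illustration); correlation-based distance treats two observations as similar when their feature profiles are highly correlated, even if they are far apart in Euclidean terms:

```python
import numpy as np
from scipy.spatial.distance import euclidean, correlation

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 20.0, 30.0, 40.0])   # same pattern as x, different scale

print(euclidean(x, y))     # large: the points are far apart in space
print(correlation(x, y))   # 0.0: correlation-based distance (1 - Pearson r) sees them as similar
```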