Clustering Flashcards
Clustering
A set of techniques for finding subgroups, or clusters, in a data set. A good clustering is one in which the observations within a group are similar to each other, while observations in different groups are very different
High intra-class similarity:
cohesive within clusters
Low inter-class similarity:
distinctive between clusters
Question: What factors determine the quality of a clustering method?
Answer:
The similarity measure used by the method
Its implementation
Its ability to discover some or all of the hidden patterns
Centroid-based Clustering
Represents each cluster through its center, called the centroid
K-Means
A centroid-based algorithm. The “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster
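A minimal sketch of the K-Means idea (Lloyd's algorithm) in plain NumPy; the toy data points and the simplistic initialization are made up for illustration:

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = X[:k].astype(float).copy()  # simplistic init: first k points
    for _ in range(n_iter):
        # squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster, per dimension
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments stable: converged
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups, interleaved so the first two
# points (the initial centroids) come from different groups
X = np.array([[1.0, 1.0], [8.0, 8.0], [1.2, 0.8],
              [8.2, 7.9], [0.9, 1.1], [7.8, 8.1]])
centroids, labels = kmeans(X, 2)
```

Real implementations (e.g. scikit-learn's `KMeans`) add smarter initialization and restarts, since the result depends on where the centroids start.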
What is a clustering’s Distortion?
The sum of squared differences between each data point and its corresponding centroid. The clustering with the lowest distortion value can be deemed the best clustering
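The distortion of a given assignment can be computed directly; the points, labels, and centroids below are hypothetical values for illustration:

```python
import numpy as np

# Hypothetical 2-D points, their cluster assignments, and the two centroids
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.1, 0.9], [8.1, 7.95]])

# Distortion: sum of squared differences between each point and its centroid
distortion = ((X - centroids[labels]) ** 2).sum()
```

Running K-Means for several values of k and comparing distortions is a common way to pick the number of clusters.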
K-Means Weaknesses
- Sensitive to noisy data and outliers
- Weak in clustering non-convex shapes
K-Means Strengths
- Fast and efficient
- Often terminates at a local optimum (the global optimum is not guaranteed)
- Applicable to continuous n-dimensional space
How are observations clustered in K-Means?
One way is to minimize the sum of all pairwise squared Euclidean distances between the observations in each cluster.
The general equation for Euclidean distance in n dimensions is:
d(A, B) = sqrt( (d_{1,A} - d_{1,B})^2 + (d_{2,A} - d_{2,B})^2 + ... + (d_{n,A} - d_{n,B})^2 )
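The formula above translates directly into code; a small sketch using only the standard library:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two n-dimensional points a and b:
    the square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# In 2-D this reduces to the familiar Pythagorean form
d = euclidean((0, 0), (3, 4))  # → 5.0
```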
What is a Dendrogram?
A tree-based representation of the observations, produced by hierarchical clustering; producing one is an added advantage of that method
Hierarchical Clustering
Produces a tree-based representation of the observations, called a dendrogram, and doesn’t require choosing the number of clusters in advance.
Hierarchical Clustering Weaknesses
- Not very scalable
- Distance matrix can be huge to calculate
- Cannot undo what was done previously
Hierarchical Clustering Strengths
- Availability of a dendrogram
- Smaller clusters may be generated
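A sketch of hierarchical (agglomerative) clustering using SciPy, assuming SciPy is installed; the toy points are made up. Note that the number of clusters is chosen only after the tree is built, by cutting the dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Build the merge tree bottom-up; Z encodes the dendrogram
Z = linkage(X, method="complete")

# Cut the tree into 2 clusters; the cut point can be changed without
# re-running the clustering, unlike K-Means where k is fixed up front
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would plot the tree itself.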