Clustering Flashcards
Clustering Criteria
1) Distance
2) conceptual (shared attributes)
3) density
Applications of Clustering
Pattern recognition, spatial data analysis, image processing, economic science (market research), WWW
Good clustering characteristics (what is optimized)
High intra-class similarity - minimize intra-cluster distances
Low inter-class similarity - maximize inter-cluster distances
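A minimal NumPy sketch of measuring both quantities; the 2-D points and cluster labels here are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical 2-D points with hypothetical cluster labels (illustration only)
X = np.array([[1, 1], [1, 2], [2, 1],     # cluster 0
              [8, 8], [8, 9], [9, 8]])    # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# Intra-cluster distance: average distance from each point to its own centroid (minimize)
intra = np.mean([np.linalg.norm(x - centroids[k]) for x, k in zip(X, labels)])

# Inter-cluster distance: distance between the two centroids (maximize)
inter = np.linalg.norm(centroids[0] - centroids[1])

print(f"intra = {intra:.3f}, inter = {inter:.3f}")
```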
Major clustering approaches
- partitioning algorithms (k-means)
- hierarchical methods - cluster tree
- density-based methods (e.g. DBSCAN)
Partitioning Clustering
Must define the number of clusters you want
Global optimum - exhaustively enumerate all possible partitions (intractable for real data)
Heuristic methods - k-means / k-medoids
Hierarchical agglomerative clustering
Every point starts as its own cluster. At each step, the two closest clusters merge, until only one big cluster is left at the top. This produces a dendrogram that you can cut at any height; the height of a merge indicates how far apart the joined clusters are.
agglomerative = bottom up
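A short sketch of bottom-up merging, assuming SciPy's hierarchical-clustering routines and invented toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data: two loose groups, invented for illustration
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]])

# Bottom-up merging; each point starts as its own cluster
Z = linkage(X, method="average")

# Each row of Z records one merge: [cluster_i, cluster_j, merge_height, new_size]
print(Z)
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree; low merges = close items
```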
Similarity measures in clustering
Distance-based - Euclidean, Manhattan, Minkowski
Correlation distance - the degree to which two variables are related
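These measures are sketched below with SciPy's distance module; the example vectors are made up:

```python
from scipy.spatial import distance

a, b = [0.0, 0.0, 1.0], [1.0, 2.0, 3.0]   # two made-up vectors

print(distance.euclidean(a, b))       # straight-line distance
print(distance.cityblock(a, b))       # Manhattan: sum of absolute differences
print(distance.minkowski(a, b, p=3))  # Minkowski generalizes both (p=1 Manhattan, p=2 Euclidean)
print(distance.correlation(a, b))     # 1 - Pearson correlation between the vectors
```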
Inter cluster similarity
Inter - between clusters.
Min, max, group average, distance between centroids, or some other novel measure (see the sketch below)
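A quick NumPy/SciPy sketch of the four standard measures, computed on two hypothetical clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
A = np.array([[0, 0], [0, 1], [1, 0]])
B = np.array([[4, 4], [5, 5]])

D = cdist(A, B)  # all pairwise distances between the two clusters

print(D.min())   # min (single-link)
print(D.max())   # max (complete-link)
print(D.mean())  # group average
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # distance between centroids
```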
Hierarchical clustering issues
Distinct clusters are not produced directly
Methods to cut the dendrogram into clusters exist, but they are somewhat arbitrary (see the sketch below)
If the original data has no hierarchical structure, this approach may be completely the wrong fit
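One common (if still somewhat arbitrary) way to cut is SciPy's fcluster, sketched here on toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])  # toy data
Z = linkage(X, method="average")

# Cut by a desired number of clusters...
labels_k = fcluster(Z, t=2, criterion="maxclust")
# ...or by a distance threshold; either choice is a judgment call
labels_d = fcluster(Z, t=3.0, criterion="distance")

print(labels_k, labels_d)
```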
K Means clustering algorithm
Step 0: Start with a random partition into k clusters (pick k data points as the starting cluster centers)
Step 1: Generate a new clustering by assigning each data point to its closest cluster center
Step 2: Compute new centroids
Step 3: Repeat steps 1 and 2 until there is no change in membership
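A minimal NumPy sketch of these steps (Lloyd's algorithm); the data is invented and the empty-cluster edge case is ignored:

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: pick k data points as the starting cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    while True:
        # Step 1: assign each point to its closest center
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop once membership no longer changes
        if np.array_equal(new_labels, labels):
            return labels, centers
        labels = new_labels
        # Step 2: recompute each centroid (empty-cluster edge case ignored here)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels, centers)
```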
K-means - optimizing the number of clusters
Elbow plot
Plot the reduction in variation (y-axis) against the number of clusters (x-axis); the curve rises steeply, then flattens. Pick k at the "elbow" where adding more clusters gives diminishing returns.
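A sketch of how the elbow curve is typically computed, assuming scikit-learn's KMeans and synthetic blob data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ((0, 0), (5, 5), (0, 5))])

# Inertia = within-cluster sum of squares; its drop flattens past the "true" k
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
# Plot inertia vs. k and pick the elbow where the curve bends
```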
Properties of K-means
Guaranteed to converge
Guaranteed to reach a local optimum, but not necessarily the global optimum
Pros
Low computational complexity (roughly linear in the number of points per iteration)
Cons
Must specify K
Sensitive to noise / outliers
Clusters are sensitive to the initial random points chosen - different starts can yield different clusterings (see the sketch below)
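A quick demonstration of this sensitivity, again assuming scikit-learn (n_init restarts k-means and keeps the best run):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(20, 2)) for c in ((0, 0), (3, 0), (1.5, 3))])

# Single random starts can land in different local optima with different inertias
for seed in range(3):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2))

# Restarts (n_init > 1) and k-means++ seeding are the usual mitigation
best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("best of 10 restarts:", round(best.inertia_, 2))
```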
Density-based clustering methods
Local cluster criterion
Major features:
can discover clusters of arbitrary shape
handles noisy data well
needs only one scan of the data
needs density parameters as a termination condition
DBSCAN Process
Define the neighborhood radius N (often called eps)
Define minpts (the minimum number of neighbors required for a point to seed a cluster)
Categorize points as:
Core: has at least minpts points in its N-neighborhood
Border: has at least one core point in its neighborhood, but does not itself meet the minpts criterion
Noise: all other points
Grow clusters with a DFS from each core point, assigning density-reachable points to the same cluster (see the sketch below)
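A compact sketch assuming scikit-learn's DBSCAN, where eps plays the role of the radius N and the data is synthetic:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.3, size=(40, 2)),  # dense blob
    rng.normal((4, 4), 0.3, size=(40, 2)),  # second dense blob
    rng.uniform(-2, 6, size=(10, 2)),       # sparse background noise
])

db = DBSCAN(eps=0.6, min_samples=5).fit(X)  # eps = N, min_samples = minpts
print(set(db.labels_))                      # cluster ids; -1 marks noise points
# Core points are listed in db.core_sample_indices_
```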
Density Clustering Drawbacks
Very sensitive to the user-chosen parameters - N (eps) and minpts (see the sketch below)
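A small demonstration of that sensitivity on synthetic data, varying eps while holding minpts fixed:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = np.vstack([rng.normal((0, 0), 0.3, size=(50, 2)),
               rng.normal((3, 3), 0.3, size=(50, 2))])

# Small changes to eps (N) can merge clusters or turn points into noise
for eps in (0.1, 0.3, 0.6, 1.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {(labels == -1).sum()} noise points")
```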