Clustering Flashcards
K-means clustering algorithm?
- Randomly assign a cluster number to each observation
- Until the clusters stop changing:
  - for each cluster, compute its centroid
  - assign each observation to the cluster whose centroid is closest (Euclidean distance)
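The steps above can be sketched in NumPy (a minimal Lloyd's-algorithm sketch; empty clusters are re-seeded with a random data point, a detail the card omits):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm sketch: alternate centroid update and reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))  # random initial cluster numbers
    centroids = np.empty((k, X.shape[1]))
    for _ in range(n_iter):
        # for each cluster, compute its centroid (re-seed empty clusters)
        for j in range(k):
            pts = X[labels == j]
            centroids[j] = pts.mean(axis=0) if len(pts) else X[rng.integers(len(X))]
        # assign each observation to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # clusters stopped changing
            break
        labels = new_labels
    return labels, centroids
```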
What kind of distance does KMeans Clustering require and why?
- Euclidean distance
- k-means minimizes squared Euclidean error
- without a Euclidean space, the centroid (mean) is meaningless
How to use kmeans with Manhattan distance?
Use PAM (Partitioning Around Medoids)
- medoid: an actual data point of the cluster that minimizes the total distance to the other points (works with any distance, including Manhattan)
How to use kmeans with Cosine distance?
First L2-normalize every data row, then apply k-means (data points end up on the unit sphere, where Euclidean distance is monotone in cosine distance)
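A quick numerical check of why this works: after L2-normalizing, ||u - v||^2 = 2(1 - cos(u, v)) for any two rows, so minimizing Euclidean error on the normalized data minimizes cosine error (data values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                        # hypothetical data rows
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row

# all rows are now unit vectors, so for any pair u, v:
# ||u - v||^2 = 2 * (1 - cos(u, v))
cos_uv = Xn[0] @ Xn[1]
sq_euclid = np.sum((Xn[0] - Xn[1]) ** 2)
```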
What are the different specifications for linkage in hierarchical clustering?
- Single linkage: distance between the closest pair of points across the two clusters
- Complete linkage: distance between the farthest pair of points
- Average linkage: mean distance over all cross-cluster pairs of points
- Ward distance: the within-cluster sum of squares of the merged cluster minus the total within-cluster sum of squares of the two clusters kept separate
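A small NumPy sketch of the four linkage definitions, on two toy clusters (data values are illustrative):

```python
import numpy as np

A = np.array([[0.0, 0.0], [0.0, 1.0]])  # cluster A (toy data)
B = np.array([[3.0, 0.0], [4.0, 0.0]])  # cluster B (toy data)

# all pairwise distances between the two clusters
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = D.min()    # distance between the closest pair
complete = D.max()  # distance between the farthest pair
average = D.mean()  # mean distance over all cross-cluster pairs

def wcss(C):
    """Within-cluster sum of squares around the centroid."""
    return float(((C - C.mean(axis=0)) ** 2).sum())

# Ward distance: increase in total WCSS caused by merging A and B
ward = wcss(np.vstack([A, B])) - wcss(A) - wcss(B)
```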
Which linkage methods need the distance to be metric?
- Ward and centroid-based linkage need Euclidean distance (they minimize squared error)
- single, complete, and average linkage only need symmetry and non-negativity
DBSCAN core point?
A point with >= t points (including itself) within distance e
DBSCAN border point?
A point with < t points within distance e, but with at least one core point within distance e
DBSCAN noise point?
A point with < t points within distance e and no core point within distance e
DBSCAN algorithm?
Determine core, border, noise points
Connect core points within distance e
Find all connected components
Assign border points to closest connected component
Return all components as clusters
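The steps above can be sketched in NumPy; `eps` and `min_pts` play the roles of e and t, and border points are assigned to the first component that reaches them rather than strictly the closest one (a simplification of the card's last step):

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Label points as core/border/noise, then grow clusters from connected cores."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]  # includes i itself
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)  # -1 = noise (never reached by any core point)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # BFS over core points within eps: one connected component = one cluster
        queue = deque([i])
        labels[i] = cluster
        while queue:
            j = queue.popleft()
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster   # border points join the cluster too
                    if core[nb]:
                        queue.append(nb)   # only core points extend the component
        cluster += 1
    return labels, core
```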
Does DBScan need Euclidean?
No
Should we normalize observations/features first?
No, not by default! Normalization changes the distances between points.
Centroid-based clustering issues?
- can get stuck in local minima
- initialization can be problematic
- streaming data is hard to handle
- cannot handle non-spherical clusters
Hierarchical, density, graph based clustering issues?
- requires computing the full pairwise distance matrix -> expensive (O(n^2))
- streaming data is hard to handle
Describe kmeans++
- first centroid c chosen uniformly at random from the data points
- for each data point x, compute D(x), the distance to the nearest already-chosen centroid
- choose the next centroid at random with probability proportional to D(x)^2; repeat until k centroids are chosen
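A sketch of this seeding procedure (the function name is mine, not from the source):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: spread initial centroids out proportionally to D(x)^2."""
    rng = np.random.default_rng(seed)
    # first centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # pick the next centroid with probability proportional to D(x)^2
        idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[idx])
    return np.array(centroids)
```

Already-chosen points have D(x)^2 = 0, so they can never be picked again.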
Describe CURE
- sample a few data points to represent the dataset
- run an expensive clustering algorithm (e.g., DBSCAN) on the sample to find an initial clustering
- select representative points for each cluster that capture the cluster's shape
- maintain mini-clusters that are later merged into larger ones
Potential problems of CURE?
Representatives may lie too far apart and represent outliers; fix: move (shrink) the representatives towards the cluster centroid
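The shrinking fix can be sketched as moving each representative a fraction alpha of the way toward the centroid (all values below are illustrative):

```python
import numpy as np

cluster = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [1.0, 1.0]])  # toy cluster
centroid = cluster.mean(axis=0)              # here: (1.0, 1.0)
reps = cluster[[0, 1, 2]]                    # extreme points chosen as representatives
alpha = 0.2                                  # shrink factor (hypothetical value)
shrunk = reps + alpha * (centroid - reps)    # move each rep 20% toward the centroid
```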
Does CURE need euclidean distance?
Yes, because of the shrinking modification: moving representatives towards the centroid assumes a Euclidean space
What are the three sets used by BFR?
- Discard set: points close enough to an existing cluster; they update its statistics and are then discarded
- Compressed set: mini-clusters formed from the remaining points
- Retained set: outliers
Only the retained set is stored in memory as actual points; the other two are kept only as summary statistics.
What sufficient statistics do we need for a cluster?
- N: the number of data points
- SUM: vector with the per-feature sums of all values
- SUMSQ: vector with the per-feature sums of squared values
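From these three statistics alone the centroid and per-feature variance of a cluster can be recovered, which is why BFR never needs to keep the raw points (toy data below):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # points in one cluster (toy data)

# sufficient statistics: constant size no matter how many points stream in
n = len(X)
SUM = X.sum(axis=0)
SUMSQ = (X ** 2).sum(axis=0)

centroid = SUM / n
variance = SUMSQ / n - centroid ** 2  # per-feature variance from the stats alone
```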
BFR clustering algorithm
For each batch of points:
- if a point is close enough to an existing cluster: update that cluster's statistics, discard the point
- else:
  - use a standard in-memory clustering to form mini-clusters
  - compute statistics for the mini-clusters, discard their points
  - keep unclustered points in the retained set
- merge new mini-clusters and retained points with older mini-clusters, update statistics
Does BFR need Euclidean?
Theoretically yes (uses centroids), but has been done without, if careful
What does Silhouette coefficient tell us?
- value between -1 and 1
- large positive values: good clustering
- values near 0 or negative: the point lies between clusters or is likely misassigned
How to evaluate kmeans clusters?
Using WCV (within-cluster variation): the total sum of squared distances from each point to its cluster centroid; lower is better for a fixed k
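A minimal sketch of computing WCV for a labelled dataset (the function name is mine):

```python
import numpy as np

def wcv(X, labels):
    """Total within-cluster variation: sum of squared distances
    from each point to its cluster centroid."""
    total = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        total += np.sum((pts - pts.mean(axis=0)) ** 2)
    return total
```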
Can BFR represent non spherical clusters?
No
can CURE represent non spherical clusters?
Yes
Does BFR discard outliers (points that do not belong to pre-existing clusters)?
No, they are kept in the retained set.
t-SNE advantages?
-can capture nonlinear relationships
-preserves local structure well for visualization (global structure less reliably)
-effective in practice
t-SNE disadvantages?
-only used for visualization
-parameters need to be set by hand (sensitive to initial conditions)
-slow for large datasets
what does the k in UMAP do?
k = number of nearest neighbours considered
- large k preserves global structure
- small k preserves local structure
UMAP advantages?
-control over local vs global preservation
-fast
-well separated clusters
UMAP disadvantages?
-k should be optimized
-distance info still lost (only visualization)
-sensitive to optim. parameters and initial conditions