Clustering Flashcards
Learning curve
Shows how accuracy changes with varying sample sizes.
Requires a sampling schedule for creating the learning curve.
Small sample size -> bias and variance in the accuracy estimate
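A minimal sketch of how such a curve could be computed, assuming scikit-learn; the digits dataset, the decision tree and the sampling schedule are only illustrative:

```python
# Sketch: accuracy vs. training-set size (assumes scikit-learn is available).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Sampling schedule: fractions of the training data used at each step.
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, acc in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(n):5d} training samples -> accuracy ~ {acc:.3f}")
```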
Methods of estimation
Partitioning labeled data into:
- training set for model building
- test set for model evaluation
Partitioning techniques:
- holdout
- cross-validation
Holdout
Fixed partitioning: reserve 2/3 for training and 1/3 for testing
Appropriate for large datasets: may be repeated several times (repeated holdout)
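A possible sketch of repeated holdout, assuming scikit-learn; the dataset and classifier are placeholders:

```python
# Sketch: repeated holdout (2/3 train, 1/3 test), averaged over several random splits.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
scores = []
for seed in range(10):  # repeated holdout: repeat the random split several times
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
print(f"accuracy ~ {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```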
Crossvalidation
Cross-validation:
- partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one, repeat for all folds
- reliable accuracy estimation, but not appropriate for very large datasets
- cross-validation is used only for estimating accuracy; at the end the model is retrained on the whole labeled dataset
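A minimal sketch of k-fold cross-validation followed by retraining on all data, assuming scikit-learn; k = 10 and the classifier are illustrative choices:

```python
# Sketch: 10-fold cross-validation accuracy estimate, then retrain on the whole dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # k = 10 folds
print(f"estimated accuracy ~ {scores.mean():.3f}")

final_model = model.fit(X, y)  # final model: retrained on all labeled data
```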
Leave-one-out
- cross-validation with k = n (one object per fold)
- only appropriate for very small datasets
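A sketch of leave-one-out estimation, assuming scikit-learn's LeaveOneOut splitter; dataset and classifier are illustrative:

```python
# Sketch: leave-one-out = k-fold cross-validation with k equal to the dataset size.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy ~ {scores.mean():.3f}")  # n = 150 fits, one per object
```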
Limitations of accuracy
Cardinality of class 0: 9900
Cardinality of class 1: 100
Model: () -> 0 (always predicts class 0), accuracy = 9900/10000 = 99%
Accuracy is not appropriate for unbalanced class label distributions and when classes have different relevance.
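A tiny sketch of this effect, assuming scikit-learn metrics; the label counts mirror the example above:

```python
# Sketch: a trivial "always predict class 0" model looks excellent under accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 9900 + [1] * 100)   # unbalanced labels, as in the example
y_pred = np.zeros_like(y_true)              # model: () -> 0

print(accuracy_score(y_true, y_pred))                       # 0.99, despite never finding class 1
print(f1_score(y_true, y_pred, zero_division=0))            # 0.0 for the rare, relevant class
```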
Recall,Precision
Recall = number of objects correctly assigned to class c / number of objects belonging to class c
Precision = number of objects correctly assigned to class c / number of objects assigned to class c
We need to balance the two:
maximize ( F-measure = 2rp/(r+p) )
The F-measure is not always the right target: sometimes we want to maximize only one of the two, or another combined measure.
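A short sketch computing the three measures, assuming scikit-learn; the labels are made up for illustration:

```python
# Sketch: recall, precision and F-measure from predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

r = recall_score(y_true, y_pred)     # correctly assigned to c / belonging to c
p = precision_score(y_true, y_pred)  # correctly assigned to c / assigned to c
print(p, r, 2 * r * p / (r + p), f1_score(y_true, y_pred))  # the last two are equal
```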
ROC
y: TPR (True positive rate): TP/(TP+FN)
x: FPR (False positive rate): FP/(FP+TN)
TP: true positives, FN: false negatives, FP: false positives, TN: true negatives
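A short sketch of the TPR/FPR pairs behind a ROC curve, assuming scikit-learn; labels and scores are made up:

```python
# Sketch: ROC points from class-1 scores; each threshold gives one (FPR, TPR) pair.
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # x = FP/(FP+TN), y = TP/(TP+FN)
print(list(zip(fpr, tpr)))
print("AUC =", roc_auc_score(y_true, y_scores))
```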
Hierarchical clustering, intro
Produces a set of nested clusters organized as a hierarchical tree.
At each iteration we merge clusters with unclustered points or with other clusters; the algorithm ends when only one cluster remains. We can then cut the dendrogram at a lower level to obtain more than one cluster.
The dendrogram shows on the y axis the distance at which clusters were merged, according to the chosen proximity measure.
Memory complexity: O(n^2)
Time complexity: O(n^3), sometimes reduced to O(n^2 log n)
Hierarchical clustering, algo
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
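A minimal sketch of agglomerative clustering and dendrogram cutting, assuming SciPy; the random data and the single-linkage choice are illustrative:

```python
# Sketch: agglomerative clustering with SciPy; single linkage and 3 clusters are arbitrary choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # toy data: 30 points in 2D

Z = linkage(X, method="single")                   # repeatedly merges the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (requires matplotlib)
```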
MIN, MAX, Group Average
Proximity measures between clusters. Note that the first merge step is the same for MIN and MAX: when all clusters are single points, both reduce to finding the minimum distance between points; the measures differ only once clusters contain more than one point.
MIN (single link): the proximity of two clusters is the minimum distance between points in the two clusters.
- Pros: can handle non-elliptical shapes (which k-means cannot)
- Cons: sensitive to noise and outliers
MAX (complete link): the proximity of two clusters is the maximum distance between points in the two clusters; the pair of clusters with the smallest such value is merged.
- Pros: less susceptible to noise and outliers
- Cons: tends to break large clusters, and is biased towards globular clusters
Group average: the proximity of two clusters is the average of the pairwise distances between points in the two clusters; a compromise between MIN and MAX.
- Pros: less susceptible to noise and outliers
- Cons: biased toward globular clusters
Ward’s Method:
- Similarity of two clusters is based on the increase in squared error when two clusters are merged
- Hierarchical analogue of K-means; can be used to initialize K-means
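A quick sketch comparing the four cluster-proximity measures, assuming scikit-learn's AgglomerativeClustering; the data and number of clusters are illustrative:

```python
# Sketch: the same data clustered with MIN, MAX, group average and Ward linkages.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

for link in ("single", "complete", "average", "ward"):  # MIN, MAX, group average, Ward
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, labels[:10])
```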
DBSCAN
Its parameters are Eps and MinPts: it uses density to define clusters, and the number of clusters is decided by the algorithm itself.
MinPts is the minimum number of points within radius Eps required for a point to be a core point.
Density: number of points within a specified radius (Eps)
- A point is a core point if it has more than a specified number of points (MinPts) within Eps: These are points that are at the interior of a cluster
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
- A noise point is any point that is not a core point or a border point.
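A minimal DBSCAN sketch, assuming scikit-learn; the eps and min_samples values are illustrative:

```python
# Sketch: DBSCAN with scikit-learn; eps and min_samples correspond to Eps and MinPts.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-globular shapes

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                            # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)     # chosen by the algorithm
print("clusters found:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```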
Pros:
- Resistant to Noise
- Can handle clusters of different shapes and sizes
Cons:
- Struggles with clusters of non-uniform densities and with high-dimensional data.
- Varying densities can be mitigated by running the algorithm twice: a first pass with a higher density threshold, then a second pass with a lower density threshold on the noise points of the first pass.
Determining Eps: compute the distance of every point to its k-th nearest neighbour, sort these distances, and pick Eps near the "elbow" of the resulting curve.
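A sketch of this k-distance computation, assuming scikit-learn; k = 4 is an illustrative choice (often tied to MinPts):

```python
# Sketch: the sorted k-distance curve used to pick Eps.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

k = 4
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)  # column 0 is the point itself
k_dist = np.sort(dists[:, k])        # distance of every point to its k-th nearest neighbour
print(k_dist[::50])                  # plot this curve and pick Eps near the "elbow"
```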
Measures of cluster validity
Cluster cohesion: measures how closely related the objects in a cluster are, e.g. via the within-cluster sum of squares (WSS)
Cluster separation: measures how distinct or well-separated a cluster is from the other clusters, e.g. via the between-cluster sum of squares (BSS)
Silhouette: a succinct measure to evaluate how well each object lies within its cluster
- for each object i:
- a(i) is the average dissimilarity of i with all other data within the same cluster (the smaller the value, the better the assignment)
- b(i) is the lowest average dissimilarity of i to any other cluster of which i is not a member
- the average s(i) over all objects of a cluster measures how tightly grouped the data in that cluster are
- s(i) = (b(i)-a(i))/max{a(i),b(i)}
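A sketch of silhouette computation, assuming scikit-learn; the k-means labels are only used to have a clustering to evaluate:

```python
# Sketch: per-object silhouette values and their mean for a given clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)     # s(i) = (b(i) - a(i)) / max(a(i), b(i)) per object
print("mean silhouette:", silhouette_score(X, labels), "worst object:", s.min())
```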
The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Application of cluster analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations
- Summarization: reduce size of data sets
Types of clustering
Partitional: division into non-overlapping subsets, such that each data object is in exactly one subset
Hierarchical: a set of nested clusters
Exclusive vs non-exclusive: in non-exclusive clustering a point may belong to more than one cluster
Fuzzy: a point belongs to every cluster with some weight
Partial versus complete: cluster only some data
Heterogeneous: widely different clusters
K-means
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple:
1. Select k points as initial centroids
2. Repeat
3.   Form k clusters by assigning each point to the closest centroid
4.   Recompute the centroid of each cluster
5. Until the centroids don't change
Choosing the initial centroids is very important, together with k, the number of clusters.
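A minimal k-means sketch, assuming scikit-learn; k = 3 and the k-means++ initialization are illustrative choices:

```python
# Sketch: k-means; k and the initialization strategy must be chosen by the user.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init restarts with different initial centroids, keeping the best (lowest SSE) run.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("SSE (inertia):", km.inertia_)
```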