Clustering Flashcards
Learning curve
Shows how accuracy changes with varying sample sizes.
Requires a sampling schedule for creating the learning curve.
Small sample size -> bias and variance in the accuracy estimate
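A minimal sketch of how such a curve could be computed, assuming scikit-learn; the digits dataset, the decision tree and the sampling schedule are only illustrative:

```python
# Sketch: accuracy vs. training-set size (assumes scikit-learn is available).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Sampling schedule: fractions of the training data used at each step.
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, acc in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(n):5d} training samples -> accuracy ~ {acc:.3f}")
```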
Methods of estimation
Partitioning labeled data into:
- training set for model building
- test set for model evaluation
Partitioning techniques:
- holdout
- cross-validation
Holdout
Fixed partitioning: reserve 2/3 for training and 1/3 for testing
Appropriate for large datasets: may be repeated several times (repeated holdout)
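A possible sketch of repeated holdout, assuming scikit-learn; the dataset and classifier are placeholders:

```python
# Sketch: repeated holdout (2/3 train, 1/3 test), averaged over several random splits.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
scores = []
for seed in range(10):  # repeated holdout: repeat the random split several times
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
print(f"accuracy ~ {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```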
Crossvalidation
Cross-validation:
- partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one, repeat for all folds
- reliable accuracy estimation, but not appropriate for very large datasets
- cross-validation is used only for estimating accuracy; at the end the model is retrained on the whole labeled dataset
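A minimal sketch of k-fold cross-validation followed by retraining on all data, assuming scikit-learn; k = 10 and the classifier are illustrative choices:

```python
# Sketch: 10-fold cross-validation accuracy estimate, then retrain on the whole dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # k = 10 folds
print(f"estimated accuracy ~ {scores.mean():.3f}")

final_model = model.fit(X, y)  # final model: retrained on all labeled data
```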
Leave-one-out
- cross-validation with k = n (one object per fold)
- only appropriate for very small datasets
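A sketch of leave-one-out estimation, assuming scikit-learn's LeaveOneOut splitter; dataset and classifier are illustrative:

```python
# Sketch: leave-one-out = k-fold cross-validation with k equal to the dataset size.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(f"leave-one-out accuracy ~ {scores.mean():.3f}")  # n = 150 fits, one per object
```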
Limitations of accuracy
Cardinality of class 0: 9900
Cardinality of class 1: 100
Model: () -> 0 (always predicts class 0), accuracy = 9900/10000 = 99%
Accuracy is not appropriate for unbalanced class label distributions and when classes have different relevance.
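A tiny sketch of this effect, assuming scikit-learn metrics; the label counts mirror the example above:

```python
# Sketch: a trivial "always predict class 0" model looks excellent under accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 9900 + [1] * 100)   # unbalanced labels, as in the example
y_pred = np.zeros_like(y_true)              # model: () -> 0

print(accuracy_score(y_true, y_pred))                       # 0.99, despite never finding class 1
print(f1_score(y_true, y_pred, zero_division=0))            # 0.0 for the rare, relevant class
```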
Recall,Precision
Recall = number of objects correctly assigned to class c / number of objects belonging to class c
Precision = number of objects correctly assigned to class c / number of objects assigned to class c
We need to balance the two:
maximize ( F-measure = 2rp/(r+p) )
The F-measure is not always the right target: sometimes we want to maximize only one of the two, or another combined measure.
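A short sketch computing the three measures, assuming scikit-learn; the labels are made up for illustration:

```python
# Sketch: recall, precision and F-measure from predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

r = recall_score(y_true, y_pred)     # correctly assigned to c / belonging to c
p = precision_score(y_true, y_pred)  # correctly assigned to c / assigned to c
print(p, r, 2 * r * p / (r + p), f1_score(y_true, y_pred))  # the last two are equal
```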
ROC
y: TPR (True positive rate): TP/(TP+FN)
x: FPR (False positive rate): FP/(FP+TN)
TP: true positives, FN: false negatives, FP: false positives, TN: true negatives
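A short sketch of the TPR/FPR pairs behind a ROC curve, assuming scikit-learn; labels and scores are made up:

```python
# Sketch: ROC points from class-1 scores; each threshold gives one (FPR, TPR) pair.
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # x = FP/(FP+TN), y = TP/(TP+FN)
print(list(zip(fpr, tpr)))
print("AUC =", roc_auc_score(y_true, y_scores))
```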
Hierarchical clustering, intro
Produces a set of nested clusters organized as a hierarchical tree.
At each iteration we merge clusters with unclustered points or with other clusters; the algorithm ends when only one cluster remains. We can then cut the dendrogram at a lower level to obtain more than one cluster.
The dendrogram shows on the y axis the distance at which clusters were merged, according to the chosen proximity measure.
Memory complexity: O(n^2)
Time complexity: O(n^3), sometimes reduced to O(n^2 log n)
Hierarchical clustering, algo
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
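A minimal sketch of agglomerative clustering and dendrogram cutting, assuming SciPy; the random data and the single-linkage choice are illustrative:

```python
# Sketch: agglomerative clustering with SciPy; single linkage and 3 clusters are arbitrary choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # toy data: 30 points in 2D

Z = linkage(X, method="single")                   # repeatedly merges the two closest clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (requires matplotlib)
```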
MIN, MAX, Group Average
Proximity measures between clusters. Note that the first merge step is the same for MIN and MAX: when all clusters are single points, both reduce to finding the minimum distance between points; the measures differ only once clusters contain more than one point.
MIN (single link): the proximity of two clusters is the minimum distance between points in the two clusters.
- Pros: can handle non-elliptical shapes (which k-means cannot)
- Cons: sensitive to noise and outliers
MAX (complete link): the proximity of two clusters is the maximum distance between points in the two clusters; the pair of clusters with the smallest such value is merged.
- Pros: less susceptible to noise and outliers
- Cons: tends to break large clusters, and is biased towards globular clusters
Group average: the proximity of two clusters is the average of the pairwise distances between points in the two clusters; a compromise between MIN and MAX.
- Pros: less susceptible to noise and outliers
- Cons: biased toward globular clusters
Ward’s Method:
- Similarity of two clusters is based on the increase in squared error when two clusters are merged
- Hierarchical analogue of K-means; can be used to initialize K-means
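A quick sketch comparing the four cluster-proximity measures, assuming scikit-learn's AgglomerativeClustering; the data and number of clusters are illustrative:

```python
# Sketch: the same data clustered with MIN, MAX, group average and Ward linkages.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

for link in ("single", "complete", "average", "ward"):  # MIN, MAX, group average, Ward
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, labels[:10])
```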
DBSCAN
Its parameters are Eps and MinPts: it uses density to define clusters, and the number of clusters is decided by the algorithm itself.
MinPts is the minimum number of points within radius Eps required for a point to be a core point.
Density: number of points within a specified radius (Eps)
- A point is a core point if it has more than a specified number of points (MinPts) within Eps: These are points that are at the interior of a cluster
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
- A noise point is any point that is not a core point or a border point.
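A minimal DBSCAN sketch, assuming scikit-learn; the eps and min_samples values are illustrative:

```python
# Sketch: DBSCAN with scikit-learn; eps and min_samples correspond to Eps and MinPts.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-globular shapes

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                            # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)     # chosen by the algorithm
print("clusters found:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```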
Pros:
- Resistant to Noise
- Can handle clusters of different shapes and sizes
Cons:
- Struggles with clusters of non-uniform densities and with high-dimensional data.
- Varying densities can be mitigated by running the algorithm twice: a first pass with a higher density threshold, then a second pass with a lower density threshold on the noise points of the first pass.
Determining Eps: compute the distance of every point to its k-th nearest neighbour, sort these distances, and pick Eps near the "elbow" of the resulting curve.
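A sketch of this k-distance computation, assuming scikit-learn; k = 4 is an illustrative choice (often tied to MinPts):

```python
# Sketch: the sorted k-distance curve used to pick Eps.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

k = 4
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)  # column 0 is the point itself
k_dist = np.sort(dists[:, k])        # distance of every point to its k-th nearest neighbour
print(k_dist[::50])                  # plot this curve and pick Eps near the "elbow"
```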
Measures of cluster validity
Cluster cohesion: measures how closely related the objects in a cluster are, e.g. via the within-cluster sum of squares (WSS)
Cluster separation: measures how distinct or well-separated a cluster is from the other clusters, e.g. via the between-cluster sum of squares (BSS)
Silhouette: a succinct measure to evaluate how well each object lies within its cluster
- for each object i:
- a(i) is the average dissimilarity of i with all other data within the same cluster (the smaller the value, the better the assignment)
- b(i) is the lowest average dissimilarity of i to any other cluster of which i is not a member
- the average s(i) over all objects of a cluster measures how tightly grouped the data in that cluster are
- s(i) = (b(i)-a(i))/max{a(i),b(i)}
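A sketch of silhouette computation, assuming scikit-learn; the k-means labels are only used to have a clustering to evaluate:

```python
# Sketch: per-object silhouette values and their mean for a given clustering.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)     # s(i) = (b(i) - a(i)) / max(a(i), b(i)) per object
print("mean silhouette:", silhouette_score(X, labels), "worst object:", s.min())
```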
The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Application of cluster analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, group stocks with similar price fluctuations
- Summarization: reduce size of data sets
Types of clustering
Partitional: division into non-overlapping subsets, such that each data object is in exactly one subset
Hierarchical: a set of nested clusters
Exclusive vs non-exclusive: in non-exclusive clustering a point may belong to more than one cluster
Fuzzy: a point belongs to every cluster with some weight
Partial versus complete: cluster only some data
Heterogeneous: widely different clusters
K-means
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple:
1. Select k points as initial centroids
2. Repeat
3.   Form k clusters by assigning each point to the closest centroid
4.   Recompute the centroid of each cluster
5. Until the centroids don't change
Choosing the initial centroids is very important, together with k, the number of clusters.
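A minimal k-means sketch, assuming scikit-learn; k = 3 and the k-means++ initialization are illustrative choices:

```python
# Sketch: k-means; k and the initialization strategy must be chosen by the user.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init restarts with different initial centroids, keeping the best (lowest SSE) run.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)
print("SSE (inertia):", km.inertia_)
```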