Week 4: Clustering Flashcards
Clustering
Clusters can be overlapping or non-overlapping. Generally, clusters are formed by finding groupings of instances with small intra-cluster distances and large inter-cluster distances.
Dendrogram
A tree diagram that represents a hierarchy of clusters.
K-Means Clustering
Instances are assigned to the closest centroid, and each centroid is then recomputed as the mean of all the members in its cluster; the two steps repeat until the assignments stabilise.
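A minimal NumPy sketch of the assign/recompute loop (the function name, defaults, and initialisation are my own, not from the course material):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: assign each point to its closest centroid,
    then recompute each centroid as the mean of its members.
    Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid per point.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids
```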
Density
The density of a data point x is the number of data points within a specified radius Eps, including x itself.
Categorisation of Data Points
Core Point: It has density at least the threshold MinPts, i.e. at least MinPts points (including itself) within Eps.
Border Point: It’s not a core point, but within the neighbourhood of a core point.
Noise Point: It’s neither a core point nor a border point.
DBSCAN Algorithm
A density-based clustering algorithm that identifies high-density regions separated by regions of low density. It ignores noise points.
Finding suitable Eps and MinPts values is a key issue. The elbow rule can be used: plot each point's distance to its i-th nearest neighbour, with points sorted by that distance, and pick Eps at the elbow of the curve.
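A sketch of both ideas with scikit-learn (the data, the eps value, and MinPts are placeholder assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data
min_pts = 5

# k-distance plot for choosing Eps: sort every point's distance to its
# min_pts-th nearest neighbour and look for the elbow in the curve.
# (kneighbors on the training set counts the point itself first, which
# matches sklearn's min_samples convention of including the point.)
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])
# e.g. plt.plot(k_dist) and read Eps off the elbow

labels = DBSCAN(eps=0.3, min_samples=min_pts).fit_predict(X)
# labels == -1 marks noise points; other values are cluster ids
```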
Graph-based Algorithm
Given a dataset, construct a proximity matrix, build a sparse graph from it, and partition the graph, with the partitions becoming clusters.
Sparsification
Filter out low-weight edges from the proximity graph (e.g. keep only each node's links to its most similar neighbours) so that cluster structure is accentuated in the sparsified graph before it is partitioned.
Graph Partitioning
Graph-cutting algorithms can be utilised, with the goal of finding a subset of edges whose deletion separates the graph into k connected components while minimising the total weight of the deleted edges (the minimum k-cut).
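As one concrete instance of this pipeline, scikit-learn's SpectralClustering builds a nearest-neighbour affinity graph over the instances and partitions it; note it optimises a normalised-cut style objective rather than the exact minimum k-cut (the data and parameter values below are placeholders):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.default_rng(0).normal(size=(100, 2))  # placeholder data

# Build a k-nearest-neighbour proximity graph over the instances and
# partition it into 3 components; the partitions become the clusters.
model = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                           n_neighbors=10, random_state=0)
labels = model.fit_predict(X)
```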
Incremental Clustering
It processes instances one by one in an online manner, without needing simultaneous access to all instances.
Main points:
1. At each stage, the overall clustering forms a tree.
2. The root node represents the entire dataset.
3. Leaves correspond to instances.
4. Nodes with the same parent form a cluster.
Steps:
1. Initialise a tree containing only the root node.
2. When a new instance arrives, either a leaf for the new instance is added, or the tree is restructured by merging/splitting two nodes. Decisions are made based on the category utility metric, which measures the overall quality of a partition of instances into clusters (see the sketch after this list).
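A minimal sketch of category utility for nominal data, assuming each cluster is given as a list of equal-length value tuples (the function and data representation are my own, not from the course):

```python
from collections import Counter

def category_utility(clusters):
    """CU = (1/K) * sum_k P(C_k) * sum_{i,j} [P(A_i=v_ij|C_k)^2 - P(A_i=v_ij)^2]
    `clusters` is a list of clusters, each a list of tuples of
    nominal attribute values."""
    all_rows = [row for cluster in clusters for row in cluster]
    n, n_attrs = len(all_rows), len(all_rows[0])

    def sq_prob_sum(rows):
        # Sum over attributes and values of P(attribute = value)^2.
        return sum((count / len(rows)) ** 2
                   for i in range(n_attrs)
                   for count in Counter(row[i] for row in rows).values())

    base = sq_prob_sum(all_rows)
    weighted = sum(len(c) / n * (sq_prob_sum(c) - base) for c in clusters)
    return weighted / len(clusters)
```

A partition with a higher CU value groups instances so that attribute values are more predictable within clusters than overall.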
Fuzzy Cluster
A cluster whose elements have degrees of membership, rather than an all-or-nothing assignment.
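A minimal fuzzy c-means sketch in NumPy, assuming Euclidean distance and fuzzifier m = 2 (these defaults are my own assumptions, not the only choice):

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Each point gets a membership in [0, 1] for every cluster,
    summing to 1 per point; m > 1 controls how fuzzy memberships are."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # normalise memberships per point
    for _ in range(n_iter):
        W = U ** m
        # Weighted centroids: v_k = sum_i u_ik^m x_i / sum_i u_ik^m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2) + 1e-12
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
    return U, centroids
```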
Distance Metrics
3 Main Types:
- Distance between the values of 2 different instances with respect to an individual attribute.
- Distance between two different instances which aggregates all attributes.
- Distance between two different clusters which aggregates all instances.
Cluster Centroid for Numerical Data
Calculate the arithmetic mean for each attribute using all cluster points.
Cluster Medoid for Nominal Data
The most representative actual point of a cluster, e.g. the cluster member with the smallest total distance to all other members.
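A small sketch of picking the medoid, assuming the nominal attributes are integer-encoded in a 2-D NumPy array so a Hamming distance applies (the metric choice is an assumption):

```python
import numpy as np
from scipy.spatial.distance import cdist

def medoid(points, metric="hamming"):
    """The medoid is the actual cluster member with the smallest
    total distance to all other members."""
    D = cdist(points, points, metric=metric)
    return points[D.sum(axis=1).argmin()]
```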
Cluster Diameter
The largest distance between any two points in a cluster.
Single-Link/Single Linkage
The minimum distance between any two points in the two different clusters.
Complete-Link/Complete-Linkage
The maximum distance between any two points in the two different clusters.
Average-Link/Average-Linkage/Group Average
The average distance over all pairs of instances with one instance from each of the two clusters.
Centroid-Link/Centroid-Linkage
The distance between the centroids of the two clusters.
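All four linkage criteria are available in SciPy's hierarchical clustering; each `method` value below matches one of the definitions above (the data and cut level are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))  # placeholder data

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # the merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
```

scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding dendrogram.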
Within Cluster Score
A measure of cluster compactness: the sum of squared distances between each point and its cluster centroid.
Between Cluster Score
A measure of cluster separation: the sum of squared distances between each pair of cluster centroids.
Calinski-Harabasz Index
Based on the ratio of between-cluster to within-cluster sums of squared distances. It is high when the clusters are dense and well separated.
CH = \frac{BC}{WC} \times \frac{n - K}{K - 1}
n = number of instances
K = number of clusters
BC = between cluster sum of squared distances
WC = within cluster sum of squared distances
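A sketch checking the formula against scikit-learn, which defines the score this way; note the BC used by the index is the size-weighted dispersion of cluster centroids around the overall mean, not the pairwise-centroid sum above (the data and labels are placeholders):

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(100, 2))
labels = (X[:, 0] > 0).astype(int)  # placeholder clustering

n, K = len(X), len(np.unique(labels))
overall = X.mean(axis=0)
# Within-cluster sum of squared distances to each cluster's centroid.
WC = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
         for k in np.unique(labels))
# Between-cluster dispersion, weighted by cluster size.
BC = sum((labels == k).sum() * ((X[labels == k].mean(axis=0) - overall) ** 2).sum()
         for k in np.unique(labels))
ch = (BC / WC) * ((n - K) / (K - 1))
assert np.isclose(ch, calinski_harabasz_score(X, labels))
```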
Silhouette Coefficient
s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}
s_i = 1, for dense and well-separated clusters
s_i = 0, for overlapping clusters
s_i = -1, for incorrect clustering
a_i = \frac{1}{m_k - 1} \sum_{i' \in C_k \setminus \{i\}} d(\boldsymbol{x}_i, \boldsymbol{x}_{i'})
a_i is the mean distance between \boldsymbol{x}_i and all other data points in the same cluster C_k, where m_k is the number of points in C_k.
b_i = \frac{1}{m_{k'}} \sum_{i' \in C_{k'}} d(\boldsymbol{x}_i, \boldsymbol{x}_{i'})
b_i is the mean distance between \boldsymbol{x}_i and all points in the nearest other cluster C_{k'}.
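Both the per-instance values and the mean score are available in scikit-learn (the data and labels below are placeholders):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(0).normal(size=(100, 2))
labels = (X[:, 0] > 0).astype(int)  # placeholder clustering

s = silhouette_samples(X, labels)             # one s_i per instance
print(s.mean(), silhouette_score(X, labels))  # mean of s_i == overall score
```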
Minimum Description Length (MDL)
The best model is the simplest one that explains the data: the fewer bits it takes to encode the model, together with the data given the model, the better. Equivalently, we want to maximise the posterior probability \Pr(h \mid D), i.e. the chance that the model h is correct given the set of examples D.
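The standard link between the posterior and description length, via Bayes' rule and negative logs:
\Pr(h \mid D) \propto \Pr(D \mid h) \Pr(h)
-\log_2 \Pr(h \mid D) = -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) + \text{const}
So maximising \Pr(h \mid D) is the same as minimising L(D \mid h) + L(h): the bits to encode the data given the model plus the bits to encode the model itself.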
Outlier/Anomaly Detection
In supervised learning, human experts provide labels for normal and outlier instances, so detection can be handled as a classification problem.
In semi-supervised learning, collecting data containing outliers is more difficult, but normal instances are more easily obtainable, so a model of normal behaviour is learned.
In unsupervised learning, assumptions are needed to define when a data point is far from the rest of the data.
Model-Based Outlier
An observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.
Proximity-Based Outlier
An object whose distance to its k-th nearest neighbour is very high: O = \{\boldsymbol{x}_i : d_k(\boldsymbol{x}_i) \ge T,\ 1 \le i \le n\}, where T is a distance threshold.
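A sketch of the k-th nearest neighbour distance test with scikit-learn (k, the threshold T, and the data are placeholder assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data
k, T = 5, 1.0                                       # assumed values

# Query k + 1 neighbours because the nearest "neighbour" of each
# training point is the point itself; the last column is d_k(x).
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
d_k = dists[:, -1]
outliers = np.where(d_k >= T)[0]  # indices i with d_k(x_i) >= T
```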
Density
The inverse of the average distance to the k nearest neighbours:
\delta_k(\boldsymbol{x}) = \left( \frac{1}{\left\lvert N_k(\boldsymbol{x}) \right\rvert} \sum_{\boldsymbol{x}' \in N_k(\boldsymbol{x})} d(\boldsymbol{x}, \boldsymbol{x}') \right)^{-1}
Relative Density
This helps find outliers even when density differs across different regions.
r_k(\boldsymbol{x}) = \frac{\delta_k(\boldsymbol{x})}{\frac{1}{\left\lvert N_k(\boldsymbol{x}) \right\rvert} \sum_{\boldsymbol{x}' \in N_k(\boldsymbol{x})} \delta_k(\boldsymbol{x}')}
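scikit-learn's Local Outlier Factor works in the same spirit as r_k(x), scoring each point's density against its neighbours' densities; it uses reachability distances, so it is not the identical formula (data and k are placeholders):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data

lof = LocalOutlierFactor(n_neighbors=5)
pred = lof.fit_predict(X)                # -1 flags outliers
scores = -lof.negative_outlier_factor_   # larger => more outlying
```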
Cluster-Based Outlier
An object that doesn’t strongly belong to any cluster.
Probabilistic Outlier
An object that has low probability with respect to a probabilistic model of the data.