Week 4: Clustering Flashcards
Clustering
Clusters can form overlapping or non-overlapping regions. Generally, clusters are found by grouping instances so that intra-cluster distances are small and inter-cluster distances are large.
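As a rough illustration of the intra- vs inter-cluster idea, a minimal sketch on made-up 2-D points (the data and the use of mean Euclidean distance are assumptions for the example):

```python
import numpy as np

# Toy 2-D data: two obvious groups (purely illustrative values).
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

def mean_pairwise(a, b):
    """Mean Euclidean distance between every point in a and every point in b."""
    return np.mean(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))

# Self-distances of zero are included in the intra average; fine for illustration.
intra = np.mean([mean_pairwise(points[labels == k], points[labels == k])
                 for k in np.unique(labels)])
inter = mean_pairwise(points[labels == 0], points[labels == 1])

print(f"average intra-cluster distance: {intra:.2f}")  # small
print(f"average inter-cluster distance: {inter:.2f}")  # large
```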
Dendrogram
Represents a hierarchy of clusters.
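For example, SciPy's hierarchical-clustering utilities can build and draw one; a minimal sketch on made-up data using the standard `scipy.cluster.hierarchy` calls:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# Two loose groups of illustrative 2-D points.
data = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(4, 0.5, (5, 2))])

Z = linkage(data, method="average")  # agglomerative merges, average linkage
dendrogram(Z)                        # draws the hierarchy of clusters
plt.xlabel("instance index")
plt.ylabel("merge distance")
plt.show()
```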
K-Means Clustering
Instances are assigned to the closest centroid, and each centroid is then recomputed as the average of all members of its cluster; the two steps alternate until the assignments stop changing.
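A minimal sketch of those two alternating steps, written from the description above rather than any library implementation (the function name and defaults are made up for the example):

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its members."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: mean of the members of each cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids
```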
Density
The density of a data point x is the number of data points within a specified radius Eps of x, including x itself.
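In code, using Euclidean distance (just one possible choice of metric):

```python
import numpy as np

def density(points, i, eps):
    """Number of points within radius eps of points[i], including the point itself."""
    dists = np.linalg.norm(points - points[i], axis=1)
    return int(np.sum(dists <= eps))
```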
Categorisation of Data Points
Core Point: It has a density of at least the threshold MinPts.
Border Point: It’s not a core point, but within the neighbourhood of a core point.
Noise Point: It’s neither a core point nor a border point.
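A sketch that labels points according to those three rules, reusing the density idea above (Euclidean distance and the parameter names eps/min_pts are assumptions):

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Return 'core', 'border' or 'noise' for each point."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    density = (dists <= eps).sum(axis=1)       # neighbourhood counts, self included
    is_core = density >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif np.any(is_core & (dists[i] <= eps)):  # within Eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```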
DBSCAN Algorithm
It’s a density-based clustering algorithm. It identifies high-density regions separated by regions of low density, and it leaves noise points unassigned (they belong to no cluster).
Finding suitable values for Eps and MinPts is a key issue. One approach is the elbow rule: plot each point's distance to its i-th nearest neighbour, with the points sorted by that distance, and choose Eps near the elbow of the curve.
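If scikit-learn is available, the following sketch shows both ideas: the sorted k-th-nearest-neighbour distance plot used for the elbow rule, and DBSCAN itself (the data and parameter values are illustrative guesses, not values from the course):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

# Elbow rule: distance to the k-th nearest neighbour, sorted in increasing order.
# Ask for k+1 neighbours because the first returned neighbour is the point itself.
k = 4
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(data).kneighbors(data)
plt.plot(np.sort(dist[:, -1]))
plt.xlabel("points sorted by distance to k-th nearest neighbour")
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.show()  # pick Eps near the 'elbow' of this curve

# DBSCAN with the chosen parameters; label -1 marks noise points.
labels = DBSCAN(eps=0.3, min_samples=k).fit_predict(data)
print(np.unique(labels, return_counts=True))
```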
Graph-based Algorithm
Given a dataset, construct a proximity matrix, derive a sparse graph from it, and then partition the graph, with the partitions becoming the clusters.
Sparsification
Filter out weak edges from the proximity graph (for example, keep only each node's nearest-neighbour edges) so that the cluster structure is accentuated in the sparsified graph before it is partitioned.
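One concrete (assumed) way to build such a sparse graph is scikit-learn's `kneighbors_graph`, which keeps only each point's k nearest-neighbour edges:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 2))   # illustrative points

# Keep only each node's 3 nearest-neighbour edges; all other edges are dropped.
sparse_graph = kneighbors_graph(data, n_neighbors=3, mode="distance")
print(sparse_graph.nnz, "edges kept out of", 20 * 19, "possible directed edges")
```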
Graph Partitioning
Graph-cutting algorithms can be utilised, with the goal of finding a subset of edges whose deletion splits the graph into k connected components while minimising the total weight of the deleted edges (the minimum k-cut).
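The card does not name a specific partitioning algorithm; as one hedged, concrete example, scikit-learn's spectral clustering builds a sparse nearest-neighbour graph internally and partitions it into the requested number of groups:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.6, (30, 2)), rng.normal(2, 0.6, (30, 2))])

# affinity="nearest_neighbors" builds a sparse k-NN graph (sparsification),
# which is then partitioned into n_clusters groups; the partitions are the clusters.
labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            n_neighbors=10, random_state=0).fit_predict(data)
print(labels)
```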
Incremental Clustering
It processes instances one by one in an online manner, without needing simultaneous access to all instances.
Main points:
1. At each stage, the overall clustering forms a tree.
2. The root node represents the entire dataset.
3. Leaves correspond to instances.
4. Nodes with the same parent form a cluster.
Steps:
1. Initialise a tree containing only the root node.
2. When a new instance arrives, either a leaf for the new instance is added, or the tree is restructured by merging or splitting two nodes. Decisions are made using the category utility metric, which measures the overall quality of a partition of instances into clusters (see the sketch below).
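The card does not spell out the metric itself; the sketch below computes the commonly used category utility definition for nominal attributes, which is an assumption about the variant the course intends:

```python
from collections import Counter

def category_utility(clusters):
    """Category utility for nominal data.

    `clusters` is a list of clusters; each cluster is a list of instances,
    and each instance is a tuple of attribute values.
    CU = (1/k) * sum_k P(C_k) * ( sum_{i,j} P(A_i=V_ij | C_k)^2
                                  - sum_{i,j} P(A_i=V_ij)^2 )
    """
    all_instances = [x for c in clusters for x in c]
    n, k = len(all_instances), len(clusters)
    n_attrs = len(all_instances[0])

    def sum_sq_probs(instances):
        # Sum over attributes of the squared value probabilities within `instances`.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(x[i] for x in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    base = sum_sq_probs(all_instances)
    return sum(len(c) / n * (sum_sq_probs(c) - base) for c in clusters) / k

# Illustrative toy partition of four two-attribute instances.
print(category_utility([[("red", "s"), ("red", "s")],
                        [("blue", "l"), ("blue", "l")]]))  # -> 0.5
```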
Fuzzy Clusters
Clusters whose elements have degrees of membership.
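A minimal sketch of the idea, using inverse-distance membership degrees relative to two fixed centroids; this is just one assumed way to produce soft memberships, not necessarily the method covered in the course (e.g. fuzzy c-means):

```python
import numpy as np

centroids = np.array([[0.0, 0.0], [4.0, 4.0]])   # fixed, illustrative centroids
point = np.array([1.0, 1.0])

# Membership degree in each cluster, proportional to inverse distance,
# normalised so the degrees sum to 1.
d = np.linalg.norm(centroids - point, axis=1)
membership = (1.0 / d) / np.sum(1.0 / d)
print(membership)   # roughly [0.75, 0.25]: mostly cluster 0, partly cluster 1
```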
Distance Metrics
3 Main Types:
- Distance between the values of 2 different instances with respect to an individual attribute.
- Distance between two different instances, which aggregates all attributes.
- Distance between two different clusters which aggregates all instances.
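The sketch below gives one concrete (assumed) choice at each of the three levels: absolute difference per numeric attribute, Euclidean distance between instances, and single-link distance between clusters.

```python
import numpy as np

def attr_distance(a, b, i):
    """Distance between two instances on a single numeric attribute i."""
    return abs(a[i] - b[i])

def instance_distance(a, b):
    """Euclidean distance aggregating all attributes of two instances."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def cluster_distance(c1, c2):
    """Single-link: smallest instance-to-instance distance between two clusters."""
    return min(instance_distance(a, b) for a in c1 for b in c2)
```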
Cluster Centroid for Numerical Data
Calculate the arithmetic mean for each attribute using all cluster points.
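In NumPy this is just an attribute-wise mean over the cluster's points (the values below are illustrative):

```python
import numpy as np

cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = cluster.mean(axis=0)   # attribute-wise arithmetic mean -> [3., 4.]
```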
Cluster Medoid for Nominal Data
The most representative member of the cluster; commonly the point with the smallest total distance to the other members.
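A minimal sketch of that, using a simple mismatch count as the nominal distance (the distance choice is an assumption):

```python
def mismatch(a, b):
    """Simple nominal distance: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def medoid(cluster):
    """Cluster member with the smallest total distance to all other members."""
    return min(cluster, key=lambda a: sum(mismatch(a, b) for b in cluster))

print(medoid([("red", "s"), ("red", "l"), ("blue", "s")]))  # -> ('red', 's')
```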
Cluster Diameter
The largest distance between any two points in a cluster.
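In code, the maximum pairwise distance (Euclidean is an assumed choice of metric):

```python
import numpy as np

def diameter(cluster):
    """Largest distance between any two points in the cluster."""
    pts = np.asarray(cluster, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return float(dists.max())

print(diameter([[0, 0], [3, 4], [1, 1]]))  # -> 5.0
```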