Week 4: Clustering Flashcards
Clustering
Clusters can be overlapping or non-overlapping. Generally, clusters are formed by finding groupings of instances with small intra-cluster distances and large inter-cluster distances.
Dendrogram
A tree diagram that represents a hierarchy of clusters.
K-Means Clustering
Instances are assigned to the closest centroid, and each centroid is then recomputed as the mean of all the members in its cluster; the two steps repeat until the assignments stabilise.
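A minimal NumPy sketch of the assign/recompute loop (the function name, defaults, and initialisation are my own, not from the course material):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: assign each point to its closest centroid,
    then recompute each centroid as the mean of its members.
    Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid per point.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids
```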
Density
The density of a data point x is the number of data points within a specified radius Eps, including x itself.
Categorisation of Data Points
Core Point: It has density at least the threshold MinPts, i.e. at least MinPts points (including itself) within Eps.
Border Point: It’s not a core point, but within the neighbourhood of a core point.
Noise Point: It’s neither a core point nor a border point.
DBSCAN Algorithm
A density-based clustering algorithm that identifies high-density regions separated by regions of low density. It ignores noise points.
Finding suitable Eps and MinPts values is a key issue. The elbow rule can be used: plot each point's distance to its i-th nearest neighbour, with points sorted by that distance, and pick Eps at the elbow of the curve.
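A sketch of both ideas with scikit-learn (the data, the eps value, and MinPts are placeholder assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data
min_pts = 5

# k-distance plot for choosing Eps: sort every point's distance to its
# min_pts-th nearest neighbour and look for the elbow in the curve.
# (kneighbors on the training set counts the point itself first, which
# matches sklearn's min_samples convention of including the point.)
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])
# e.g. plt.plot(k_dist) and read Eps off the elbow

labels = DBSCAN(eps=0.3, min_samples=min_pts).fit_predict(X)
# labels == -1 marks noise points; other values are cluster ids
```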
Graph-based Algorithm
Given a dataset, construct a proximity matrix, build a sparse graph from it, and partition the graph, with the partitions becoming clusters.
Sparsification
Filter out low-weight edges from the proximity graph (e.g. keep only each node's links to its most similar neighbours) so that cluster structure is accentuated in the sparsified graph before it is partitioned.
Graph Partitioning
Graph-cutting algorithms can be utilised, with the goal of finding a subset of edges whose deletion separates the graph into k connected components while minimising the total weight of the deleted edges (the minimum k-cut).
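As one concrete instance of this pipeline, scikit-learn's SpectralClustering builds a nearest-neighbour affinity graph over the instances and partitions it; note it optimises a normalised-cut style objective rather than the exact minimum k-cut (the data and parameter values below are placeholders):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.default_rng(0).normal(size=(100, 2))  # placeholder data

# Build a k-nearest-neighbour proximity graph over the instances and
# partition it into 3 components; the partitions become the clusters.
model = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                           n_neighbors=10, random_state=0)
labels = model.fit_predict(X)
```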
Incremental Clustering
It processes instances one by one in an online manner, without needing simultaneous access to all instances.
Main points:
1. At each stage, the overall clustering forms a tree.
2. The root node represents the entire dataset.
3. Leaves correspond to instances.
4. Nodes with the same parent form a cluster.
Steps:
1. Initialise a tree containing only the root node.
2. When a new instance arrives, either a leaf for the new instance is added, or the tree is restructured by merging/splitting two nodes. Decisions are made based on the category utility metric, which measures the overall quality of a partition of instances into clusters (see the sketch after this list).
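A minimal sketch of category utility for nominal data, assuming each cluster is given as a list of equal-length value tuples (the function and data representation are my own, not from the course):

```python
from collections import Counter

def category_utility(clusters):
    """CU = (1/K) * sum_k P(C_k) * sum_{i,j} [P(A_i=v_ij|C_k)^2 - P(A_i=v_ij)^2]
    `clusters` is a list of clusters, each a list of tuples of
    nominal attribute values."""
    all_rows = [row for cluster in clusters for row in cluster]
    n, n_attrs = len(all_rows), len(all_rows[0])

    def sq_prob_sum(rows):
        # Sum over attributes and values of P(attribute = value)^2.
        return sum((count / len(rows)) ** 2
                   for i in range(n_attrs)
                   for count in Counter(row[i] for row in rows).values())

    base = sq_prob_sum(all_rows)
    weighted = sum(len(c) / n * (sq_prob_sum(c) - base) for c in clusters)
    return weighted / len(clusters)
```

A partition with a higher CU value groups instances so that attribute values are more predictable within clusters than overall.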
Fuzzy Cluster
A cluster whose elements have degrees of membership, rather than an all-or-nothing assignment.
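A minimal fuzzy c-means sketch in NumPy, assuming Euclidean distance and fuzzifier m = 2 (these defaults are my own assumptions, not the only choice):

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Each point gets a membership in [0, 1] for every cluster,
    summing to 1 per point; m > 1 controls how fuzzy memberships are."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # normalise memberships per point
    for _ in range(n_iter):
        W = U ** m
        # Weighted centroids: v_k = sum_i u_ik^m x_i / sum_i u_ik^m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2) + 1e-12
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
    return U, centroids
```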
Distance Metrics
3 Main Types:
- Distance between the values of 2 different instances with respect to an individual attribute.
- Distance between two different instances which aggregates all attributes.
- Distance between two different clusters which aggregates all instances.
Cluster Centroid for Numerical Data
Calculate the arithmetic mean for each attribute using all cluster points.
Cluster Medoid for Nominal Data
The most representative actual point of a cluster, e.g. the cluster member with the smallest total distance to all other members.
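A small sketch of picking the medoid, assuming the nominal attributes are integer-encoded in a 2-D NumPy array so a Hamming distance applies (the metric choice is an assumption):

```python
import numpy as np
from scipy.spatial.distance import cdist

def medoid(points, metric="hamming"):
    """The medoid is the actual cluster member with the smallest
    total distance to all other members."""
    D = cdist(points, points, metric=metric)
    return points[D.sum(axis=1).argmin()]
```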
Cluster Diameter
The largest distance between any two points in a cluster.
Single-Link/Single Linkage
The minimum distance between any two points in the two different clusters.
Complete-Link/Complete-Linkage
The maximum distance between any two points in the two different clusters.
Average-Link/Average-Linkage/Group Average
The average distance over all pairs of instances with one instance from each of the two clusters.
Centroid-Link/Centroid-Linkage
The distance between the centroids of the two clusters.
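All four linkage criteria are available in SciPy's hierarchical clustering; each `method` value below matches one of the definitions above (the data and cut level are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))  # placeholder data

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # the merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
```

scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding dendrogram.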
Within Cluster Score
A measure of cluster compactness: the sum of squared distances between each point and its cluster centroid.
Between Cluster Score
A measure of cluster separation: the sum of squared distances between each pair of cluster centroids.
Calinski-Harabasz Index
Based on the ratio of between-cluster to within-cluster sums of squared distances. It is high when the clusters are dense and well separated.
CH = \frac{BC}{WC} \times \frac{n - K}{K - 1}
n = number of instances
K = number of clusters
BC = between cluster sum of squared distances
WC = within cluster sum of squared distances
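A sketch checking the formula against scikit-learn, which defines the score this way; note the BC used by the index is the size-weighted dispersion of cluster centroids around the overall mean, not the pairwise-centroid sum above (the data and labels are placeholders):

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(100, 2))
labels = (X[:, 0] > 0).astype(int)  # placeholder clustering

n, K = len(X), len(np.unique(labels))
overall = X.mean(axis=0)
# Within-cluster sum of squared distances to each cluster's centroid.
WC = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
         for k in np.unique(labels))
# Between-cluster dispersion, weighted by cluster size.
BC = sum((labels == k).sum() * ((X[labels == k].mean(axis=0) - overall) ** 2).sum()
         for k in np.unique(labels))
ch = (BC / WC) * ((n - K) / (K - 1))
assert np.isclose(ch, calinski_harabasz_score(X, labels))
```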
Silhouette Coefficient
s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}
s_i = 1, for dense and well-separated clusters
s_i = 0, for overlapping clusters
s_i = -1, for incorrect clustering
a_i = \frac{1}{m_k - 1} \sum_{i' \in C_k \setminus \{i\}} d(\boldsymbol{x}_i, \boldsymbol{x}_{i'})
a_i is the mean distance between \boldsymbol{x}_i and all other data points in the same cluster C_k, where m_k is the number of points in C_k.
b_i = \frac{1}{m_{k'}} \sum_{i' \in C_{k'}} d(\boldsymbol{x}_i, \boldsymbol{x}_{i'})
b_i is the mean distance between \boldsymbol{x}_i and all points in the nearest other cluster C_{k'}.
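Both the per-instance values and the mean score are available in scikit-learn (the data and labels below are placeholders):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(0).normal(size=(100, 2))
labels = (X[:, 0] > 0).astype(int)  # placeholder clustering

s = silhouette_samples(X, labels)             # one s_i per instance
print(s.mean(), silhouette_score(X, labels))  # mean of s_i == overall score
```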
Minimum Description Length (MDL)
The best model is the simplest one that explains the data: the fewer bits it takes to encode the model, together with the data given the model, the better. Equivalently, we want to maximise the posterior probability \Pr(h \mid D), i.e. the chance that the model h is correct given the set of examples D.
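The standard link between the posterior and description length, via Bayes' rule and negative logs:
\Pr(h \mid D) \propto \Pr(D \mid h) \Pr(h)
-\log_2 \Pr(h \mid D) = -\log_2 \Pr(D \mid h) - \log_2 \Pr(h) + \text{const}
So maximising \Pr(h \mid D) is the same as minimising L(D \mid h) + L(h): the bits to encode the data given the model plus the bits to encode the model itself.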
Outlier/Anomaly Detection
In supervised learning, human experts provide labels for normal and outlier instances, so detection can be handled as a classification problem.
In semi-supervised learning, collecting data containing outliers is more difficult, but normal instances are more easily obtainable, so a model of normal behaviour is learned.
In unsupervised learning, assumptions are needed to define when a data point is far from the rest of the data.
Model-Based Outlier
An observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.
Proximity-Based Outlier
An object whose distance to its k-th nearest neighbour is very high: O = \{\boldsymbol{x}_i : d_k(\boldsymbol{x}_i) \ge T,\ 1 \le i \le n\}, where T is a distance threshold.
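A sketch of the k-th nearest neighbour distance test with scikit-learn (k, the threshold T, and the data are placeholder assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data
k, T = 5, 1.0                                       # assumed values

# Query k + 1 neighbours because the nearest "neighbour" of each
# training point is the point itself; the last column is d_k(x).
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
d_k = dists[:, -1]
outliers = np.where(d_k >= T)[0]  # indices i with d_k(x_i) >= T
```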
Density
The inverse of the average distance to the k nearest neighbours:
\delta_k(\boldsymbol{x}) = \left( \frac{1}{\left\lvert N_k(\boldsymbol{x}) \right\rvert} \sum_{\boldsymbol{x}' \in N_k(\boldsymbol{x})} d(\boldsymbol{x}, \boldsymbol{x}') \right)^{-1}
Relative Density
This helps find outliers even when density differs across different regions.
r_k(\boldsymbol{x}) = \frac{\delta_k(\boldsymbol{x})}{\frac{1}{\left\lvert N_k(\boldsymbol{x}) \right\rvert} \sum_{\boldsymbol{x}' \in N_k(\boldsymbol{x})} \delta_k(\boldsymbol{x}')}
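scikit-learn's Local Outlier Factor works in the same spirit as r_k(x), scoring each point's density against its neighbours' densities; it uses reachability distances, so it is not the identical formula (data and k are placeholders):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data

lof = LocalOutlierFactor(n_neighbors=5)
pred = lof.fit_predict(X)                # -1 flags outliers
scores = -lof.negative_outlier_factor_   # larger => more outlying
```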
Cluster-Based Outlier
An object that doesn’t strongly belong to any cluster.
Probabilistic Outlier
An object that has low probability with respect to a probabilistic model of the data.