k-means / k-medoids Flashcards
k-means/ k-medoids is supervised/ unsupervised learning
Unsupervised learning
Primary goal of k-means algo
Primary goal of k-means clustering is to minimise the distance between the points in the same cluster.
While the algo indirectly increases the separation between clusters by minimising intra-cluster distances, its direct objective is not to maximise inter-cluster distances.
The goal of k-means is to minimise the sum of the squared distances between each data point and its cluster centroid.
How the k-means algo groups data
It groups the data into k-clusters by minimising the variance within each cluster.
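The grouping loop can be sketched in pure Python — a minimal illustration of Lloyd's algorithm on toy 2-D points (the `kmeans` function and the data are made up for this example, not from any library):

```python
import math
import random

def kmeans(points, k, iters=10, seed=0):
    """Minimal sketch of Lloyd's algorithm: assign each point to its
    nearest centroid, then recompute each centroid as the cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # pick k data points as initial centroids
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # which minimises the variance within that cluster.
        for j, c in enumerate(clusters):
            if c:
                centroids[j] = tuple(sum(dim) / len(c) for dim in zip(*c))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the two tight groups end up in separate clusters after a couple of iterations.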
k means clustering is __________
k-means clustering is a partitioning clustering method.
SSE in k-means clustering
SSE stands for sum of squared error
SSE = ∑ d(x, c)²
here d: distance, x: data point, c: centroid of the cluster x belongs to
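The formula translates directly into code (the `sse` function and the toy points are illustrative):

```python
import math

def sse(points, centroids, assignment):
    """SSE = sum of squared distances d(x, c)^2 from each data point x
    to the centroid c of the cluster it is assigned to."""
    return sum(math.dist(p, centroids[assignment[i]]) ** 2
               for i, p in enumerate(points))

points = [(0, 0), (2, 0), (10, 0)]
centroids = [(1, 0), (10, 0)]
assignment = [0, 0, 1]          # cluster index of each point
print(sse(points, centroids, assignment))  # 1 + 1 + 0 = 2.0
```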
Explain SSE for k=1, k=2 and k=3
As k increases from 1 to 3, each point lies closer to its nearest centroid, so SSE decreases; the drop is largest for the first few increases in k.
k value and SSE graph
Elbow curve
Point remember in k-means problem solving
If a data point is at the same distance from the first centroid and the second centroid, you can assign it to either cluster. In the next iteration it will automatically be classified properly, because the mean values change.
Which is more efficient (k means/ hierarchical clustering)
k means
Time complexity of k-means algo
O(tkn)
Here
n: number of data points
k: number of clusters
t: number of iterations
Since both k and t are small, k-means is considered a linear algorithm.
Stopping/ convergence criterion of k-means algo
The algorithm stops when one of the following holds:
* no (or minimal) re-assignment of data points to different clusters,
* no (or minimal) change of centroids, or
* minimal decrease in SSE (or a maximum number of iterations is reached).
k-means for categorical data
For categorical data, we use k-modes instead of k-means.
The centroid is represented by the most frequent value (mode) of each attribute.
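The mode "centroid" is easy to compute per attribute (a small sketch; `mode_centroid` and the colour/size data are illustrative):

```python
from collections import Counter

def mode_centroid(rows):
    """k-modes 'centroid': the most frequent value (mode) of each
    categorical attribute across the cluster's rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_centroid(cluster))  # → ('red', 'small')
```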
Outliers vs k-means
k-means is sensitive to outliers.
* Outliers are data points that are very far away from other data points.
* Outliers could be errors in data recording or special data points with very different values.
WCSS in k-means
* full form
* Definition
* Low WCSS and high WCSS means
WCSS- Within cluster sum of squares
WCSS is a metric used to evaluate the quality of the clusters formed by the k-means clustering algorithm.
It measures the sum of squared distances between each data point and the centroid of the cluster to which it belongs.
A lower WCSS value indicates that the data points are closer to their respective cluster centroids, which means the clusters are more compact and better defined.
WCSS formulas (2)
Also relation between both formulas with example
For a single cluster Cᵢ with centroid cᵢ: WCSSᵢ = ∑ d(x, cᵢ)² over the points x in Cᵢ
For all k clusters: WCSS = WCSS₁ + WCSS₂ + … + WCSSₖ
The total WCSS is just the sum of the per-cluster values, e.g. if WCSS₁ = 3 and WCSS₂ = 5 for k = 2, the total WCSS = 8.
Goal of k-means in terms of WCSS
Minimization of WCSS is the goal of the k-means algorithm. The algorithm iteratively tries to adjust the positions of the centroids to minimise the WCSS.
Application of WCSS
WCSS is used to determine the optimal number of clusters via the elbow method, which helps find a balance between the number of clusters and how well the data is clustered.
Methods to determine the optimal number of clusters (K) in k-means algo
1.) The elbow method
2.) The silhouette method
The elbow method (definition)
It is based on the idea that increasing the number of clusters K will reduce WCSS, but after a certain point, the improvement will diminish, forming an ‘elbow’ in the curve.
How to identify the elbow
The elbow is where the plot starts to bend or flatten out; this point indicates the optimal number of clusters.
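To see the elbow shape numerically, one can compute the best possible WCSS for several values of k. The sketch below uses brute-force search over all assignments instead of actual k-means, which is feasible only for toy 1-D data (all names here are illustrative):

```python
import math
from itertools import product

def wcss(clusters):
    """Within-cluster sum of squares for a list of 1-D clusters."""
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)              # 1-D centroid
        total += sum((x - mean) ** 2 for x in c)
    return total

def best_wcss(points, k):
    """Lowest achievable WCSS for k clusters, found by brute force
    over every possible assignment (toy data only)."""
    best = math.inf
    for labels in product(range(k), repeat=len(points)):
        clusters = [[p for p, lbl in zip(points, labels) if lbl == i]
                    for i in range(k)]
        if any(len(c) == 0 for c in clusters):
            continue                        # every cluster must be non-empty
        best = min(best, wcss(clusters))
    return best

data = [1.0, 1.2, 5.0, 5.3, 9.0]
curve = [best_wcss(data, k) for k in (1, 2, 3, 4)]
```

With three natural groups in `data`, the WCSS values drop sharply up to k = 3 and then flatten, which is exactly the elbow.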
The silhouette method
It measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).
It computes the silhouette coefficient for each point, which quantifies how well a point fits into its assigned cluster.
The average silhouette score across all points can be used to evaluate different values of k.
silhouette coefficient
For a point: a = mean distance to the other points in its own cluster, b = mean distance to the points of the nearest other cluster; then s = (b − a) / max(a, b).
silhouette coefficient range
[-1,1]
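The coefficient and its range can be checked with a tiny sketch (1-D data for brevity; the `silhouette` function and the numbers are illustrative):

```python
def silhouette(point, own, others):
    """Silhouette coefficient of one point:
    a = mean distance to the other points in its own cluster,
    b = mean distance to the nearest other cluster,
    s = (b - a) / max(a, b), which always lies in [-1, 1]."""
    a = sum(abs(point - q) for q in own) / len(own)
    b = min(sum(abs(point - q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

# A point sitting tightly inside its own cluster, far from the other
# cluster, scores close to +1.
s = silhouette(1.0, own=[1.2, 0.9], others=[[8.0, 8.5]])
```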
Three points about silhouette coefficient
* s close to 1: the point is well matched to its own cluster.
* s near 0: the point lies on the boundary between two clusters.
* s close to −1: the point has likely been assigned to the wrong cluster.
Comparison of elbow and silhouette methods
The elbow method plots WCSS against k and picks k visually at the bend, so it is simple but somewhat subjective; the silhouette method scores cohesion vs separation for each k and picks the k with the highest average silhouette score, so it is more quantitative but more expensive to compute.
k-medoids - supervised/unsupervised
unsupervised
How k-medoid different from k-means
k-medoids is an improved version of the k-means algo, mainly designed to deal with its sensitivity to outliers.
Compare k-medoids with other partitioning algos
Compared to other partitioning algos, the algorithm is simple, fast and easy to implement.
methods difference of k-means and k-medoid
Unlike k-means, which minimises the sum of squared distances, the k-medoids clustering method minimises the sum of pairwise dissimilarities between each point and its cluster medoid.
What k-medoids uses instead of the centroids we have in k-means
Also define it
Instead of centroids, k-medoid approach makes use of medoids.
Medoids are points in the dataset whose sum of distances to the other points in their cluster is minimal.
Unlike K-Means, where clusters are represented by centroids (which may not be actual data points), K-Medoids selects actual data points as cluster centers.
A medoid is the data point in a cluster that has the smallest total dissimilarity to all other points in that cluster.
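Selecting a medoid is a one-liner over the cluster (a sketch; the `medoid` function and the toy points are illustrative):

```python
import math

def medoid(points):
    """Pick the actual data point whose total distance to all other
    points in the cluster is smallest (a medoid, unlike a centroid,
    is always a real member of the dataset)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

cluster = [(0, 0), (2, 0), (1, 2), (8, 8)]
print(medoid(cluster))  # → (1, 2), the most central actual point
```

Note that the outlier (8, 8) pulls a mean-based centroid toward it, but cannot become the medoid, which is why k-medoids is more robust to outliers.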