k-means / k-medoids Flashcards

1
Q

k-means / k-medoids: supervised or unsupervised learning?

A

Unsupervised learning

2
Q

Primary goal of k-means algo

A

The primary goal of k-means clustering is to minimise the distance between points within the same cluster.
While the algo indirectly increases the separation between clusters by minimising intra-cluster distances, its direct objective is not to maximise inter-cluster distances.

Formally, k-means aims to minimise the sum of squared distances between each data point and its cluster centroid.

3
Q

How does the k-means algo group data?

A

It groups the data into k clusters by minimising the variance within each cluster.
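A minimal sketch of this grouping, assuming scikit-learn and NumPy are available; the random 2-D data and the choice of k = 3 are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                       # toy 2-D data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)                   # cluster index for each point
print(labels[:10])
print(kmeans.cluster_centers_)                   # the k learned centroids
```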

4
Q

k means clustering is __________

A

k-means clustering is a partitioning clustering method.

5
Q

SSE in k-means clustering

A

SSE stands for sum of squared errors.
SSE = ∑ d(x, c)²
where d is the distance, x a data point, and c the centroid of the cluster x belongs to.
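A small worked sketch of the SSE formula in NumPy; the data points, centroids and cluster labels are made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])   # toy points
centroids = np.array([[1.25, 1.9], [8.5, 8.75]])                  # toy centroids
labels = np.array([0, 0, 1, 1])                                   # assignments

# SSE = sum over all points of the squared distance to the assigned centroid
sse = np.sum((X - centroids[labels]) ** 2)
print(sse)
```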

6
Q

Explain SSE for k=1, k=2 and k=3

A
7
Q

k value and SSE graph

A

Elbow curve

8
Q

Point to remember in k-means problem solving

A

If a data point's distance to the first centroid and its distance to the second centroid are equal, you can assign the point to either cluster. In the next iteration it will automatically be classified properly, because the mean values (centroids) will have changed.

9
Q

Which is more efficient: k-means or hierarchical clustering?

A

k-means

10
Q

Time complexity of k-means algo

A

O(tkn)
Here
n: number of data points
k: number of clusters
t: number of iterations
Since both k and t are usually small, k-means is considered a linear (in n) algorithm.

11
Q

Stopping/ convergence criterion of k-means algo

A
12
Q

k-means for categorical data

A

For categorical data, we use k-modes instead of k-means.
The centroid is represented by the most frequent values (modes).
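A tiny illustration of the mode-based "centroid" idea, not a full k-modes implementation; the categorical records are made up.

```python
from collections import Counter

# Toy cluster of categorical records (made-up values)
cluster = [
    ("red",  "small", "circle"),
    ("red",  "large", "square"),
    ("blue", "small", "circle"),
]

# The k-modes "centroid" is the per-attribute mode (most frequent value)
mode_centroid = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))
print(mode_centroid)   # ('red', 'small', 'circle')
```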

13
Q

Outliers vs k-means

A

k-means is sensitive to outliers, as the sketch below illustrates.
* Outliers are data points that are very far away from the other data points.
* Outliers could be errors in data recording, or special data points with very different values.
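A quick sketch of why this matters: on toy 1-D data, a single outlier drags the mean far more than the medoid (the actual point with the smallest total distance to the others).

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is an outlier
mean = points.mean()                              # pulled towards the outlier
# Medoid: the actual point with the smallest total distance to all others
medoid = points[np.argmin([np.abs(points - p).sum() for p in points])]
print(mean, medoid)   # 22.0 3.0
```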

14
Q

WCSS in k-means
* full form
* Definition
* What low WCSS and high WCSS mean

A

WCSS: within-cluster sum of squares.
WCSS is a metric used to evaluate the quality of the clusters formed by the k-means clustering algorithm.
It measures the sum of squared distances between each data point and the centroid of the cluster to which it belongs.

A lower WCSS value indicates that the data points are closer to their respective cluster centroids, which means the clusters are more compact and better defined; conversely, a higher WCSS indicates looser, less well-defined clusters.
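A minimal sketch, assuming scikit-learn: the fitted KMeans model exposes WCSS as its inertia_ attribute; the toy data and k = 2 are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                        # toy data
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.inertia_)                             # WCSS of the fitted clustering
```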

15
Q

WCSS formulas (2)
Also relation between both formulas with example

A
16
Q

Goal of k-means in terms of WCSS

A

Minimisation of WCSS is the goal of the k-means algorithm. The algorithm iteratively adjusts the positions of the centroids to minimise the WCSS.

17
Q

Application of WCSS

A

WCSS is used to determine the optimal number of clusters via the elbow method, which helps find a balance between the number of clusters and how well the data is clustered.

18
Q

Methods to determine the optimal number of clusters (K) in k-means algo

A

1.) The elbow method
2.) The silhouette method

19
Q

The elbow method (definition)

A

It is based on the idea that increasing the number of clusters K will reduce WCSS, but after a certain point, the improvement will diminish, forming an ‘elbow’ in the curve.

20
Q

How to identify the elbow

A

The elbow is where the plot starts to bend or flatten out. This indicates the optimal number of clusters.
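A sketch of the elbow method in practice, assuming scikit-learn and matplotlib: compute WCSS (inertia_) for a range of k values on toy data and plot it to spot the bend.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                        # toy data
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")                    # look for the bend ("elbow")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.show()
```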

21
Q

The silhouette method

A

It measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation).
It computes the silhouette coefficient for each point, which quantifies how well a point fits into its assigned cluster.
The average silhouette score across all points can be used to evaluate different values of k.
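A sketch of the silhouette method, assuming scikit-learn: compute the average silhouette score for several candidate k on toy data and prefer the k with the highest score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)                        # toy data
for k in range(2, 7):                             # silhouette needs k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))         # average score over all points
```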

22
Q

silhouette coefficient

A
23
Q

silhouette coefficient range

A

[-1,1]

24
Q

Three points about the silhouette coefficient

A
25
Q

Comparison of elbow and silhouette methods

A
26
Q

k-medoids - supervised/unsupervised

A

unsupervised

27
Q

How is k-medoids different from k-means?

A

k-medoids is an improved version of the k-means algo, mainly designed to deal with its sensitivity to outlier data.

28
Q

Compare k-medoids with other partitioning algos

A

Compared to other partitioning algos, the algorithm is simple, fast and easy to implement.

29
Q

Difference in method between k-means and k-medoids

A

Unlike k-means, which minimises the sum of squared distances, the k-medoids clustering method works by minimising a sum of pairwise dissimilarities between each point and its cluster's medoid.

30
Q

What do we have in k-medoids instead of centroids?
Also define it.

A

Instead of centroids, the k-medoids approach makes use of medoids.
Medoids are points in the dataset whose sum of distances to the other points in the cluster is minimal.
Unlike k-means, where clusters are represented by centroids (which may not be actual data points), k-medoids selects actual data points as cluster centres.
A medoid is the data point in a cluster that has the smallest total dissimilarity to all other points in that cluster.
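A minimal NumPy sketch of selecting the medoid for one cluster; the toy points are made up.

```python
import numpy as np

# Toy cluster (made-up points); the medoid must be one of these points
cluster = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [10.0, 10.0]])

# Pairwise Euclidean distances, then pick the point with the smallest row sum
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
medoid = cluster[dists.sum(axis=1).argmin()]
print(medoid)
```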
