Clustering Flashcards
What is clustering?
Clustering is grouping data into categories based on similarity, i.e. grouping similar objects together.
Which types of clustering have we worked with?
K-means clustering and K-modes
What does unsupervised clustering remind us of?
Classification in supervised learning.
What is K-means clustering?
K-means clustering works on parametric data (interval and ratio scales). Notice the "means" in the name: the cluster centers are computed as means.
How does K-means clustering work?
First you choose K, the number of clusters. Then you initialize K random cluster centroids, assign each data point to its nearest centroid by Euclidean distance, and recompute each centroid as the mean of its assigned points. This repeats until convergence, i.e. until the assignments stop changing.
Note: "means" because the centroids are the means of their clusters.
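A minimal sketch of that loop in Python/NumPy; the data, K, the iteration cap, and the function name are illustrative assumptions, not part of the card:

```python
# Minimal K-means sketch: random centroids, assign each point to the
# nearest centroid by Euclidean distance, recompute centroids as the
# mean of their assigned points, repeat until nothing changes.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # distance from every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # nearest-centroid assignment
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```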
What is K-modes clustering?
It is like K-means, but it works on non-parametric (categorical) data.
Where do we use K-modes clustering?
We use it in data mining when we want to cluster non-parametric data such as gender.
With K-means clustering we might not have labels. Can we use K-means clustering to create labels?
Yes, to some extent, but the resulting labels are not precise; confirming them would require supervised review.
What is the difference between how K-means and K-modes work?
K-means works with means, while K-modes works with modes.
With modes, we compare features directly: if two values of a feature are the same, the difference is 0, and if they differ, it is 1. The distance between two objects is then the number of mismatching features, as sketched below.
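A small sketch of that matching (0/1 mismatch) dissimilarity on categorical rows; the records and feature names are made-up example data:

```python
# Matching dissimilarity used by K-modes: a feature contributes 0 if it
# matches and 1 if it does not; the distance is the sum of mismatches.
def matching_distance(a, b):
    return sum(0 if x == y else 1 for x, y in zip(a, b))

# Illustrative categorical records (assumed example data).
r1 = ("female", "student", "urban")
r2 = ("female", "employed", "urban")
print(matching_distance(r1, r2))  # -> 1, since only one feature differs
```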
Can you use clustering on surveys?
Yes. If you have many responses on a Likert scale, you can cluster the answers and use PCA to visualize the clusters.
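A hedged sketch of that workflow with scikit-learn; the fake Likert matrix, the choice of 3 clusters, and 2 PCA components are illustrative assumptions:

```python
# Cluster Likert-scale survey responses, then project them to 2-D with
# PCA so the clusters can be plotted and inspected.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 10))   # fake 1-5 Likert answers

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(responses)
coords = PCA(n_components=2).fit_transform(responses)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```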
What is the danger of using clustering on surveys?
You might not notice that an outlier cluster is caused by bad data.
How can you find the number of clusters you should use?
The elbow method.
What is paradoxical with the clustering problem?
If you only minimize the error, the "optimal" solution is x clusters for x responses, since each cluster then describes one data point exactly (1:1), which tells you nothing about the structure of the data.
What is the elbow method?
The elbow method plots the sum of (squared) errors against the number of clusters and finds the point where the curve becomes less steep.
In short: the point where one extra cluster starts reducing the error by less.
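A short sketch of the elbow plot with scikit-learn; the example data and the range of k values tried are illustrative assumptions:

```python
# Elbow method: plot the within-cluster sum of squared errors (inertia)
# against k and look for the point where the curve flattens out.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                      # assumed example data

ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster SSE (inertia)")
plt.show()
```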