04. K-Means Clustering Flashcards
What is clustering
Clustering is an unsupervised technique: labels are not prescribed in advance. It is exploratory analysis that does not predict; instead it determines the objects of interest and how best to group them. Clustering methods find similarities between objects according to their attributes and group similar objects into clusters.
What does k-means clustering do
K-Means clustering partitions n objects into k clusters in which each object belongs to the cluster with the nearest mean.
List use cases for k-means clustering
Image Processing, Medical Attributes, Customer Segmentation
List the steps of the k-means algorithm
1. Choose or calculate the number of clusters k
2. Select k points at random as the initial cluster centres
3. Calculate the distance between each object and every centre (in n dimensions)
4. Create clusters by assigning each object to its closest centre
5. Calculate the centroid (mean) of all objects in each cluster
6. Repeat steps 3, 4 & 5 until the same points are assigned to each cluster in consecutive iterations
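The steps above can be sketched in a few lines of R. This is only an illustration of the loop, not how the built-in kmeans() function is implemented; simple_kmeans and its max_iter argument are hypothetical names introduced here, and a cluster that ends up empty simply keeps its previous centre.

```r
# Hypothetical helper, not part of base R: a bare-bones k-means loop.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick k distinct rows at random as the initial centres
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every object to every centre
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
    # Step 4: assign each object to its closest centre
    new_assignment <- max.col(-d, ties.method = "first")
    # Step 6: stop once the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 5: recompute each centre as the mean of its members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centres[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centres)
}

# Usage on made-up data: two loose groups of points in two dimensions
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.1), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
```

Squared distances are used in step 3 because they give the same nearest centre as the distances themselves and avoid a square root.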
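Name the distance measure used in k-means clustering and the theorem it is based on
Euclidean distance. In two dimensions it is the same as the Pythagorean theorem, SQRT(x^2 + y^2), where x and y are the differences between the two points along each axis. In three dimensions this becomes SQRT(x^2 + y^2 + z^2), and so on for more dimensions.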
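A quick check of the formula in R, as a minimal sketch. The function name euclidean is hypothetical; dist() is base R and uses the Euclidean measure by default.

```r
# Hypothetical helper: Euclidean distance between two numeric vectors
euclidean <- function(a, b) sqrt(sum((a - b)^2))

euclidean(c(0, 0), c(3, 4))          # two dimensions: 5
euclidean(c(0, 0, 0), c(2, 3, 6))    # three dimensions: 7

# Base R's dist() gives the same result for the rows of a matrix
dist(rbind(c(0, 0), c(3, 4)))
```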
In R what is the syntax for k-means clustering
OutputDataFrame = kmeans(InputDataFrame, 3)
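A runnable version of the call, assuming the built-in iris data set stands in for InputDataFrame (that choice is an assumption made here for illustration). Despite the variable name, kmeans() returns a list-like object of class "kmeans" rather than a data frame.

```r
# Assumption: the built-in iris data set stands in for InputDataFrame
InputDataFrame <- iris[, 1:4]    # keep only the four numeric columns
set.seed(42)                     # make the random starting centres reproducible
OutputDataFrame <- kmeans(InputDataFrame, centers = 3)
class(OutputDataFrame)           # "kmeans"
```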
In k-means clustering what does WSS stand for
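Within Sum of Squares: the sum of squared distances between each object and the centroid of its cluster. Comparing the WSS for different numbers of clusters helps in assessing whether it is worth increasing or decreasing the chosen number of clusters.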
In k-means clustering when looking at a “Within Sum of Squares” (WSS) graph how do you determine the optimal number of clusters
At clusters = 1 the WSS is high. It reduces rapidly as the number of clusters increases, but the slope then flattens out. Look for the elbow where the slope flattens.
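One way to draw such a graph, as a sketch: fit k-means for a range of k and plot the total WSS, which kmeans() reports as tot.withinss. The iris data, the range 1 to 8 and nstart = 10 are assumptions chosen for illustration.

```r
x <- scale(iris[, 1:4])          # assumption: iris as example data, rescaled
set.seed(42)
k_values <- 1:8
# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Within Sum of Squares")
```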
Which three things remain as choices for the user to define for a k-means clustering
What object attributes should be included in the analysis?
What unit of measure (for example, miles or kilometers) should be used for each attribute?
Do the attributes need to be rescaled so that one attribute does not have a disproportionate effect on the results?
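A small sketch of rescaling with base R's scale(), which centres each column to mean 0 and scales it to standard deviation 1. The customers data frame and its column names are made up for illustration.

```r
# Hypothetical data: one attribute in kilometres, one in years
customers <- data.frame(distance_km = c(1200, 300, 4500, 800),
                        age_years   = c(25, 61, 34, 47))
scaled <- scale(customers)       # each column now has mean 0 and sd 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

Without rescaling, distance_km would dominate the distance calculation simply because its values are numerically larger.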
How can you improve the underlying data set for a k-means clustering
Reduce the number of attributes where possible. See if there are any strong correlations between attributes (do a scatter plot and look for linear relationships) and either remove or combine these attributes (in a ratio).
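A sketch of that check in R, again assuming iris as example data. cor() and pairs() are base R; the Petal.Ratio column is a hypothetical combined attribute introduced here.

```r
x <- iris[, 1:4]
round(cor(x), 2)                 # values near 1 or -1 suggest redundant attributes
pairs(x)                         # scatter plot matrix: look for linear relationships

# Petal.Length and Petal.Width are strongly correlated in this data set,
# so one option is to combine them into a single ratio
x$Petal.Ratio <- x$Petal.Length / x$Petal.Width
x_reduced <- x[, c("Sepal.Length", "Sepal.Width", "Petal.Ratio")]
```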
Why does R run k-means multiple times to answer one question
The randomly chosen starting centroids can affect the answer, so the algorithm needs to be run a number of times with different starting points to get a more reliable result.
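In kmeans() the number of random starts is set with the nstart argument, which defaults to 1; the run with the lowest total WSS is kept. A minimal comparison, assuming iris as example data and 25 starts as an arbitrary choice:

```r
x <- scale(iris[, 1:4])
set.seed(1)
single <- kmeans(x, centers = 3, nstart = 1)     # one random start
multi  <- kmeans(x, centers = 3, nstart = 25)    # best of 25 random starts
c(single = single$tot.withinss, multi = multi$tot.withinss)
```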
What types of data does k-means handle well
Numerical data, but not categorical data (ordinal, nominal, binomial)
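In practice this usually means dropping or converting non-numeric columns before calling kmeans(). A sketch, assuming iris as example data (its Species column is categorical):

```r
# Keep only the numeric columns before clustering
is_num <- sapply(iris, is.numeric)
numeric_only <- iris[, is_num]
names(numeric_only)
```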
What is the R syntax for reviewing a k-means clustering model
OutputDataFrame…
…$centers cluster means
…$size objects per cluster
…$cluster a vector giving the cluster assigned to each object
…$betweenss the between-cluster sum of squares
…$withinss the within-cluster sum of squares, one value per cluster
…$totss the total sum of squares (the sum of withinss + betweenss)
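A short sketch of inspecting these components, assuming iris as example data and fit as a hypothetical variable name. The last two lines check how the sums of squares relate to each other.

```r
set.seed(42)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
fit$centers          # cluster means, one row per cluster
fit$size             # number of objects in each cluster
head(fit$cluster)    # cluster number assigned to the first few objects
# Total = sum of the per-cluster within values + between
all.equal(fit$totss, sum(fit$withinss) + fit$betweenss)
all.equal(fit$tot.withinss, sum(fit$withinss))
```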