04. K-Means Clustering Flashcards

Question 1

Q

What is clustering?

Answer

A

Clustering is an unsupervised technique: labels are not prescribed in advance. It is exploratory analysis that does not predict; instead it determines the objects of interest and how best to group them. Clustering methods find similarities between objects according to their attributes and group similar objects into clusters.

Question 2

Q

What does k-means clustering do?

Answer

A

K-means clustering partitions n objects into k clusters in which each object belongs to the cluster with the nearest mean.

Question 3

Q

List use cases for k-means clustering

Answer

A

Image Processing, Medical Attributes, Customer Segmentation

Question 4

Q

List the steps of the k-means algorithm

Answer

A

1. Choose/calculate the number of clusters k
2. Select k points at random as the initial cluster centres
3. Calculate the distance between each object and every centre (in n dimensions)
4. Create clusters by assigning each object to its closest centre
5. Calculate the centroid (mean) of all objects in each cluster
6. Repeat steps 3, 4 & 5 until the same points are assigned to each cluster in consecutive iterations
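
The steps above can be sketched in a few lines of code. This is a minimal illustration in Python (not the R used elsewhere in this deck), using only the standard library. The function name `kmeans`, its `seed` and `max_iter` parameters, and the sample points are assumptions made for the example, not part of any library.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct points at random as the initial centres.
    centres = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 3 and 4: measure the distance to every centre and
        # assign each point to the nearest one.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Step 6: stop once two consecutive iterations agree.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 5: recompute each centre as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, assignment

# Made-up sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centres, assignment = kmeans(points, k=2)
print(centres)
print(assignment)
```

Step 1 (choosing k) is left to the caller, which matches the deck: the number of clusters is an input to the algorithm, not something it works out.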

Question 5

Q

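Name the distance measure used in k-means clustering and the theorem it is based on

Answer

A

Euclidean distance, which is based on the Pythagorean theorem. In two dimensions the distance is SQRT(x^2 + y^2), where x and y are the differences between the two points along each axis. In three dimensions this becomes SQRT(x^2 + y^2 + z^2), and so on for more dimensions.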
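
As a small worked example of the formula, here is a sketch in Python (standard library only). The helper name `euclidean` is chosen for illustration; it works for points with any number of dimensions.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences along each axis."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))        # 5.0  (SQRT(9 + 16))
print(euclidean((0, 0, 0), (2, 3, 6)))  # 7.0  (SQRT(4 + 9 + 36))
```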

Question 6

Q

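In R, what is the syntax for k-means clustering?

Answer

A

OutputDataFrame = kmeans(InputDataFrame, 3)
The second argument (centers) is the number of clusters k, here 3. Despite the variable name, the result is an object of class "kmeans" rather than a data frame.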

Question 7

Q

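In k-means clustering, what does WSS stand for?

Answer

A

Within Sum of Squares: the sum of squared distances between each object and the centroid of its cluster. Comparing the WSS for different values of k helps in assessing whether it is worth increasing or decreasing the chosen number of clusters.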
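
To make the quantity concrete, here is a hedged sketch in Python (standard library only). The function name `wss`, the four sample points and the hand-picked assignments are all assumptions for illustration; each centre is simply the mean of the points assigned to it.

```python
def wss(points, assignment, centres):
    """Sum of squared distances from each point to its assigned centre."""
    total = 0.0
    for p, a in zip(points, assignment):
        total += sum((x - c) ** 2 for x, c in zip(p, centres[a]))
    return total

# Made-up sample data: two pairs of points that sit far apart.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 8.0), (9.0, 10.0)]

# k = 1: every point belongs to one cluster centred on the overall mean.
print(wss(points, [0, 0, 0, 0], [(5.0, 5.5)]))              # 117.0
# k = 2: each pair gets its own cluster centred on the pair's mean.
print(wss(points, [0, 0, 1, 1], [(1.0, 2.0), (9.0, 9.0)]))  # 4.0
```

Moving from one cluster to two cuts the WSS sharply because each point is now much closer to its own centre.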

Question 8

Q

In k-means clustering, when looking at a “Within Sum of Squares” (WSS) graph, how do you determine the optimal number of clusters?

Answer

A

With one cluster the WSS is high. It falls rapidly as the number of clusters increases, but the slope eventually flattens out. Look for the elbow, the point where the slope flattens; the number of clusters at the elbow is a reasonable choice for k.

Question 9

Q

Which three choices remain for the user to make in a k-means clustering?

Answer

A

What object attributes should be included in the analysis?
What unit of measure (for example, miles or kilometers) should be used for each attribute?
Do the attributes need to be rescaled so that one attribute does not have a disproportionate effect on the results?
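
The rescaling question can be illustrated with a short sketch in Python (standard library only). It standardises each attribute to mean 0 and standard deviation 1, which is one common choice among several. The function name `standardise` and the sample income and age values are made up for the example.

```python
import statistics

def standardise(column):
    """Rescale a list of numbers to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in column]

# Made-up attributes on very different scales.
income = [20000.0, 40000.0, 60000.0]
age = [25.0, 35.0, 45.0]

print(standardise(income))  # [-1.0, 0.0, 1.0]
print(standardise(age))     # [-1.0, 0.0, 1.0]
```

Before rescaling, the distance between two objects would be dominated by income, simply because its numbers are larger. Afterwards both attributes contribute on the same scale.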

Question 10

Q

How can you improve the underlying data set for a k-means clustering?

Answer

A

Reduce the number of attributes where possible. Check for strong correlations between attributes (draw a scatter plot and look for linear relationships) and either remove one of the correlated attributes or combine them (for example, as a ratio).
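
Alongside a scatter plot, a correlation coefficient gives a quick numeric check. The sketch below is in Python (standard library only) and computes the Pearson correlation by hand. The function name `pearson` and the sample attributes are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up example: the same distance recorded in miles and in kilometres.
miles = [1.0, 2.0, 3.0, 4.0]
km = [m * 1.609344 for m in miles]

r = pearson(miles, km)
print(r)  # very close to 1: the two attributes carry the same information
```

A coefficient near 1 or -1 suggests that one of the two attributes is redundant and is a candidate for removal or combination.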

Question 11

Q

Why is k-means run multiple times in R to answer a single question?

Answer

A

The randomly chosen starting centroids can affect the answer, so the algorithm needs to be run several times with different starting points to get a more reliable result. In R, the nstart argument of kmeans sets how many random starts are tried.

Question 12

Q

What types of data does k-means handle well?

Answer

A

Numerical data, but not categorical data (ordinal, nominal, binomial)

Question 13

Q

What is the R syntax for reviewing a k-means clustering model?

Answer

A

OutputDataFrame…
…$centers: cluster means
…$size: number of objects per cluster
…$cluster: vector giving the cluster assigned to each object
…$betweenss: the between-cluster sum of squares
…$withinss: the within-cluster sum of squares, one value per cluster
…$totss: the total sum of squares (sum of withinss + betweenss)

(13 cards)