10 - Clustering Flashcards
What is clustering?
Clustering refers to the grouping of records, observations, or cases into classes of similar objects.
How does clustering differ from classification?
Clustering does not have a target variable, while classification does.
What is a cluster?
A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.
What is the goal of clustering algorithms?
To segment the entire data set into relatively homogeneous subgroups or clusters.
What is an example of a clustering application in business?
Target marketing of a niche product for a small-capitalization business.
What is the significance of within-cluster and between-cluster variation?
Clusters should have small within-cluster variation compared to the between-cluster variation.
What is the k-means clustering algorithm?
A straightforward and effective algorithm for finding clusters in data.
What is the first step in the k-means clustering algorithm?
Ask the user how many clusters k the data set should be partitioned into.
What is the ‘nearest’ criterion commonly used in k-means clustering?
Euclidean distance.
What is the centroid in k-means clustering?
The center of gravity of the points in a cluster.
What does the k-means algorithm do when it converges?
It terminates when the centroids no longer change.
What are the predictor variables used in the white wine clustering example?
Alcohol and sugar.
What is important to avoid bias in clustering results?
Do not include the target variable as an input to the clustering algorithm.
What does standardizing or normalizing predictors help with?
It ensures that the greater variability of one predictor does not dominate the cluster construction process.
What are the two clusters identified in the white wine example?
- Cluster 1: Sweet Wines
- Cluster 2: Dry Wines
What is a key cluster validation method?
Reapply the k-means algorithm to the test data set and compare results.
What should be done to validate clustering results?
Perform two-sample t-tests to compare means.
How do you load the required packages for k-means clustering in Python?
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans
What command is used to standardize predictor variables in Python?
stats.zscore()
What does the fit() command do in the k-means algorithm?
Runs the specified k-means algorithm on the data set.
What command separates records into two groups based on cluster membership in Python?
Xz.loc[cluster == 0] and Xz.loc[cluster == 1]
What is the R command to subset predictor variables?
subset(wine_train, select = c(‘alcohol’, ‘sugar’))
What command in R standardizes variables?
scale()
What is the purpose of running k-means clustering on both training and test data sets?
To validate the clustering results.