10 - Clustering Flashcards by Kaman Hung

What is clustering?

Clustering refers to the grouping of records, observations, or cases into classes of similar objects.

How well did you know this?

Not at all

Perfectly

How does clustering differ from classification?

Clustering does not have a target variable, while classification does.

How well did you know this?

Not at all

Perfectly

What is a cluster?

A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.

How well did you know this?

Not at all

Perfectly

What is the goal of clustering algorithms?

To segment the entire data set into relatively homogeneous subgroups or clusters.

How well did you know this?

Not at all

Perfectly

What is an example of a clustering application in business?

Target marketing of a niche product for a small-capitalization business.

How well did you know this?

Not at all

Perfectly

What is the significance of within-cluster and between-cluster variation?

Clusters should have small within-cluster variation compared to the between-cluster variation.

How well did you know this?

Not at all

Perfectly

What is the k-means clustering algorithm?

A straightforward and effective algorithm for finding clusters in data.

How well did you know this?

Not at all

Perfectly

What is the first step in the k-means clustering algorithm?

Ask the user how many clusters k the data set should be partitioned into.

How well did you know this?

Not at all

Perfectly

What is the ‘nearest’ criterion commonly used in k-means clustering?

Euclidean distance.

How well did you know this?

Not at all

Perfectly

What is the centroid in k-means clustering?

The center of gravity of the points in a cluster.

How well did you know this?

Not at all

Perfectly

What does the k-means algorithm do when it converges?

It terminates when the centroids no longer change.

How well did you know this?

Not at all

Perfectly

What are the predictor variables used in the white wine clustering example?

Alcohol and sugar.

How well did you know this?

Not at all

Perfectly

What is important to avoid bias in clustering results?

Do not include the target variable as an input to the clustering algorithm.

How well did you know this?

Not at all

Perfectly

What does standardizing or normalizing predictors help with?

It ensures that the greater variability of one predictor does not dominate the cluster construction process.

How well did you know this?

Not at all

Perfectly

What are the two clusters identified in the white wine example?

Cluster 1: Sweet Wines
Cluster 2: Dry Wines

How well did you know this?

Not at all

Perfectly

What is a key cluster validation method?

Reapply the k-means algorithm to the test data set and compare results.

How well did you know this?

Not at all

Perfectly

What should be done to validate clustering results?

Study These Flashcards

Perform two-sample t-tests to compare means.

How do you load the required packages for k-means clustering in Python?

Study These Flashcards

import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans

What command is used to standardize predictor variables in Python?

Study These Flashcards

stats.zscore()

What does the fit() command do in the k-means algorithm?

Study These Flashcards

Runs the specified k-means algorithm on the data set.

What command separates records into two groups based on cluster membership in Python?

Study These Flashcards

Xz.loc[cluster == 0] and Xz.loc[cluster == 1]

What is the R command to subset predictor variables?

Study These Flashcards

subset(wine_train, select = c(‘alcohol’, ‘sugar’))

What command in R standardizes variables?

Study These Flashcards

scale()

What is the purpose of running k-means clustering on both training and test data sets?

Study These Flashcards

To validate the clustering results.

What is the purpose of the scale() command in R for clustering?

The scale() command turns the variables in X into their respective z-scores ## Footnote Standardization is essential for clustering as it ensures that all variables contribute equally to the distance calculations.

What is the output format required for running the kmeans() command?

Data frame format ## Footnote The kmeans() function in R requires inputs to be in a data frame to perform clustering.

What does the kmeans() function require as inputs?

The data frame and the number of clusters ## Footnote In this example, the number of clusters specified is 2.

What does the command as.factor() do in the context of k-means clustering?

It saves the cluster membership as a factor ## Footnote This allows for categorical representation of cluster memberships.

How do you separate records into clusters in R?

Using the which() command ## Footnote This command selects records based on their cluster membership.

What command is used to obtain descriptive statistics of each cluster in R?

summary() command ## Footnote This command provides statistical summaries for the specified clusters.

True or False: k-means clustering automatically selects the optimal number of clusters.

False ## Footnote The user must specify the number of clusters beforehand.

Why is it important to standardize numerical predictors before clustering?

To ensure that all variables contribute equally to the clustering process ## Footnote Without standardization, variables with larger ranges could disproportionately influence the results.

What is a centroid in the context of clustering?

The mean point of all points in a cluster ## Footnote The centroid represents the center of a cluster in k-means clustering.

What is the first step in validating clusters in k-means clustering?

Inputting the test data set and performing variable standardization ## Footnote This ensures that the same preprocessing is applied to both training and test datasets.

Fill in the blank: The command used to save the clustering algorithm output is _______.

kmeans01 ## Footnote This variable stores the results of the k-means clustering process.

What is the purpose of the colnames() command in the clustering process?

To edit the column names of the standardized data frame ## Footnote This helps to indicate that the variables are now standardized.

What does the output of kmeans01$cluster represent?

Each record’s cluster membership ## Footnote The output indicates which cluster each record belongs to (1 or 2 in this case).

What are the two main data sets mentioned for clustering analysis?

white_wine_training and white_wine_test ## Footnote These datasets are used for training and validating the clustering model.

What is the significance of the variable 'centers' in the kmeans() function?

It specifies the number of clusters to form ## Footnote In the provided example, it is set to 2.

What does the subset() function do in the context of the test data set?

It selects specific variables, alcohol and sugar ## Footnote This prepares the data for clustering by focusing on relevant predictors.

10 - Clustering Flashcards

(40 cards)