Data Science using Python and R - 10 Flashcards
What is clustering?
Clustering refers to the grouping of records, observations, or cases into classes of similar objects.
How does clustering differ from classification?
Clustering does not have a target variable, while classification does.
What is a cluster?
A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.
What is the goal of clustering algorithms?
To segment the entire data set into relatively homogeneous subgroups or clusters.
What is an example of a clustering application in business?
Target marketing of a niche product for a small-capitalization business.
What is the significance of within-cluster and between-cluster variation?
Clusters should have small within-cluster variation compared to the between-cluster variation.
What is the k-means clustering algorithm?
A straightforward and effective algorithm for finding clusters in data.
What is the first step in the k-means clustering algorithm?
Ask the user how many clusters k the data set should be partitioned into.
What is the ‘nearest’ criterion commonly used in k-means clustering?
Euclidean distance.
What is the centroid in k-means clustering?
The center of gravity of the points in a cluster.
What does the k-means algorithm do when it converges?
It terminates when the centroids no longer change.
What are the predictor variables used in the white wine clustering example?
Alcohol and sugar.
What is important to avoid bias in clustering results?
Do not include the target variable as an input to the clustering algorithm.
What does standardizing or normalizing predictors help with?
It ensures that the greater variability of one predictor does not dominate the cluster construction process.
What are the two clusters identified in the white wine example?
- Cluster 1: Sweet Wines
- Cluster 2: Dry Wines
What is a key cluster validation method?
Reapply the k-means algorithm to the test data set and compare results.
What should be done to validate clustering results?
Perform two-sample t-tests to compare means.
How do you load the required packages for k-means clustering in Python?
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans
What command is used to standardize predictor variables in Python?
stats.zscore()
What does the fit() command do in the k-means algorithm?
Runs the specified k-means algorithm on the data set.
What command separates records into two groups based on cluster membership in Python?
Xz.loc[cluster == 0] and Xz.loc[cluster == 1]
What is the R command to subset predictor variables?
subset(wine_train, select = c(‘alcohol’, ‘sugar’))
What command in R standardizes variables?
scale()
What is the purpose of running k-means clustering on both training and test data sets?
To validate the clustering results.
What is the purpose of the scale() command in R for clustering?
The scale() command turns the variables in X into their respective z-scores
Standardization is essential for clustering as it ensures that all variables contribute equally to the distance calculations.
What is the output format required for running the kmeans() command?
Data frame format
The kmeans() function in R requires inputs to be in a data frame to perform clustering.
What does the kmeans() function require as inputs?
The data frame and the number of clusters
In this example, the number of clusters specified is 2.
What does the command as.factor() do in the context of k-means clustering?
It saves the cluster membership as a factor
This allows for categorical representation of cluster memberships.
How do you separate records into clusters in R?
Using the which() command
This command selects records based on their cluster membership.
What command is used to obtain descriptive statistics of each cluster in R?
summary() command
This command provides statistical summaries for the specified clusters.
True or False: k-means clustering automatically selects the optimal number of clusters.
False
The user must specify the number of clusters beforehand.
Why is it important to standardize numerical predictors before clustering?
To ensure that all variables contribute equally to the clustering process
Without standardization, variables with larger ranges could disproportionately influence the results.
What is a centroid in the context of clustering?
The mean point of all points in a cluster
The centroid represents the center of a cluster in k-means clustering.
What is the first step in validating clusters in k-means clustering?
Inputting the test data set and performing variable standardization
This ensures that the same preprocessing is applied to both training and test datasets.
Fill in the blank: The command used to save the clustering algorithm output is _______.
kmeans01
This variable stores the results of the k-means clustering process.
What is the purpose of the colnames() command in the clustering process?
To edit the column names of the standardized data frame
This helps to indicate that the variables are now standardized.
What does the output of kmeans01$cluster represent?
Each record’s cluster membership
The output indicates which cluster each record belongs to (1 or 2 in this case).
What are the two main data sets mentioned for clustering analysis?
white_wine_training and white_wine_test
These datasets are used for training and validating the clustering model.
What is the significance of the variable ‘centers’ in the kmeans() function?
It specifies the number of clusters to form
In the provided example, it is set to 2.
What does the subset() function do in the context of the test data set?
It selects specific variables, alcohol and sugar
This prepares the data for clustering by focusing on relevant predictors.