Data Science using Python and R - 10 Flashcards

1
Q

What is clustering?

A

Clustering refers to the grouping of records, observations, or cases into classes of similar objects.

2
Q

How does clustering differ from classification?

A

Clustering does not have a target variable, while classification does.

3
Q

What is a cluster?

A

A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters.

4
Q

What is the goal of clustering algorithms?

A

To segment the entire data set into relatively homogeneous subgroups or clusters.

5
Q

What is an example of a clustering application in business?

A

Target marketing of a niche product for a small-capitalization business.

6
Q

What is the significance of within-cluster and between-cluster variation?

A

Clusters should have small within-cluster variation compared to the between-cluster variation.
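
One way to make this concrete is to compute both quantities as sums of squares. The sketch below uses hypothetical toy points and helper names (`centroid`, `sq_dist`, `variation`) that are not from the source; it only illustrates the idea that good clusters have a small within-cluster total relative to the between-cluster total.

```python
from statistics import mean

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(mean(coord) for coord in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def variation(clusters):
    """Return (within, between) sums of squares for a list of clusters."""
    all_points = [p for c in clusters for p in c]
    grand = centroid(all_points)
    # Within: squared distance of each record to its own cluster centroid.
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    # Between: squared distance of each centroid to the grand centroid,
    # weighted by cluster size.
    between = sum(len(c) * sq_dist(centroid(c), grand) for c in clusters)
    return within, between

# Hypothetical toy data: two tight, well-separated groups.
clusters = [
    [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)],
    [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)],
]
within, between = variation(clusters)
print(within, between)
```

For these points the within-cluster total is far smaller than the between-cluster total, which is the pattern the card describes.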

7
Q

What is the k-means clustering algorithm?

A

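A straightforward and effective algorithm that partitions the records into k clusters by repeatedly assigning each record to its nearest cluster center and then moving each center to the centroid of its assigned records.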

8
Q

What is the first step in the k-means clustering algorithm?

A

Ask the user how many clusters k the data set should be partitioned into.

9
Q

What is the ‘nearest’ criterion commonly used in k-means clustering?

A

Euclidean distance.
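
As a minimal sketch of how "nearest" can be evaluated, the snippet below uses Python's standard `math.dist` to find the closest center to a record. The helper name `nearest_center` and the data are hypothetical and only for illustration.

```python
import math

def nearest_center(record, centers):
    """Index of the center closest to `record` by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(record, centers[i]))

# Hypothetical cluster centers and one record.
centers = [(0.0, 0.0), (3.0, 4.0)]
record = (1.0, 1.0)

d0 = math.dist(record, centers[0])  # distance to the first center
d1 = math.dist(record, centers[1])  # distance to the second center
idx = nearest_center(record, centers)
print(d0, d1, idx)
```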

10
Q

What is the centroid in k-means clustering?

A

The center of gravity of the points in a cluster.
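
A small sketch of the calculation, assuming the centroid is taken as the coordinate-wise mean of the cluster's points (the function name and data are hypothetical):

```python
from statistics import mean

def centroid(points):
    """Center of gravity: the mean of each coordinate across the points."""
    return tuple(mean(coord) for coord in zip(*points))

# Hypothetical cluster of three records with two predictors each.
cluster = [(1.0, 1.0), (1.0, 3.0), (4.0, 2.0)]
center = centroid(cluster)
print(center)
```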

11
Q

What does the k-means algorithm do when it converges?

A

It terminates: convergence means the centroids no longer change from one iteration to the next.
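
The assign, update, and stop-when-stable cycle can be sketched in plain Python. This is an illustrative toy version, not the implementation used by scikit-learn or R; the function name `k_means`, the starting centers, and the data are all assumptions made for the example.

```python
import math
from statistics import mean

def k_means(points, centers, max_iter=100):
    """Toy k-means: assign, recompute centroids, stop when centroids are stable."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the centroid of its records
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(mean(coord) for coord in zip(*members)) if members else old
            for members, old in zip(clusters, centers)
        ]
        # Convergence check: stop once no centroid has moved.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data with k = 2 and arbitrary starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, centers=[(1.0, 1.0), (2.0, 1.0)])
print(centers)
```

With these starting centers the centroids move during the first two passes and are unchanged on the third, at which point the loop exits.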

12
Q

What are the predictor variables used in the white wine clustering example?

A

Alcohol and sugar.

13
Q

What is important to avoid bias in clustering results?

A

Do not include the target variable as an input to the clustering algorithm.

14
Q

What does standardizing or normalizing predictors help with?

A

It ensures that the greater variability of one predictor does not dominate the cluster construction process.
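
To see the effect, the sketch below compares how much of the squared distance between two records comes from the more variable predictor, before and after converting to z-scores. The alcohol and sugar values are made up, and `z_scores` and `sugar_share` are hypothetical helpers. The population standard deviation is used here, which matches the default of `scipy.stats.zscore`; R's `scale()` uses the sample standard deviation instead.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a list of numbers to mean 0 and (population) sd 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def sugar_share(p, q):
    """Fraction of the squared Euclidean distance that comes from sugar."""
    d_alcohol = (p[0] - q[0]) ** 2
    d_sugar = (p[1] - q[1]) ** 2
    return d_sugar / (d_alcohol + d_sugar)

# Hypothetical values: sugar varies far more than alcohol.
alcohol = [9.0, 10.0, 11.0, 12.0]
sugar = [1.0, 20.0, 5.0, 60.0]

raw = list(zip(alcohol, sugar))
scaled = list(zip(z_scores(alcohol), z_scores(sugar)))

raw_share = sugar_share(raw[0], raw[3])           # sugar dominates the raw distance
scaled_share = sugar_share(scaled[0], scaled[3])  # roughly balanced after scaling
print(round(raw_share, 3), round(scaled_share, 3))
```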

15
Q

What are the two clusters identified in the white wine example?

A
  • Cluster 1: Sweet Wines
  • Cluster 2: Dry Wines
16
Q

What is a key cluster validation method?

A

Reapply the k-means algorithm to the test data set and compare results.
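
A rough sketch of this check, assuming scikit-learn and NumPy are available. The data are synthetic (not the wine data) and are generated already on a comparable scale; `make_blobs` and `sorted_centers` are hypothetical helpers defined here. Because cluster labels are arbitrary, the centers are sorted before they are compared.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def make_blobs(n_per_group):
    """Synthetic two-predictor data with one low group and one high group."""
    low = rng.normal(loc=-1.0, scale=0.3, size=(n_per_group, 2))
    high = rng.normal(loc=1.0, scale=0.3, size=(n_per_group, 2))
    return np.vstack([low, high])

train = make_blobs(100)  # stands in for the training set
test = make_blobs(50)    # stands in for the test set

def sorted_centers(data):
    """Fit k-means with k = 2 and return the centers ordered by first predictor."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    centers = km.cluster_centers_
    return centers[np.argsort(centers[:, 0])]

train_centers = sorted_centers(train)
test_centers = sorted_centers(test)

# Largest coordinate-wise gap between matching training and test centers.
max_gap = np.abs(train_centers - test_centers).max()
print(train_centers, test_centers, max_gap)
```

Similar centers in the two data sets suggest the clusters reflect structure in the data and not noise in one sample.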

17
Q

What should be done to validate clustering results?

A

Perform two-sample t-tests comparing the mean of each variable in corresponding training and test clusters.
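
One possible way to run such a test in Python is `scipy.stats.ttest_ind`. The values below are invented for illustration, and the choice of Welch's version (`equal_var=False`) is an assumption made here, not something the source specifies.

```python
from scipy import stats

# Hypothetical standardized alcohol values for the same cluster
# in the training set and in the test set.
train_cluster_alcohol = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]
test_cluster_alcohol = [10.0, 10.3, 9.7, 10.1, 10.2, 9.9]

# Welch's two-sample t-test (does not assume equal variances).
result = stats.ttest_ind(train_cluster_alcohol, test_cluster_alcohol, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(t_stat, p_value)
```

A large p-value gives no evidence that the cluster means differ between the two data sets, which is consistent with the clusters being validated.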

18
Q

How do you load the required packages for k-means clustering in Python?

A

import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans

19
Q

What command is used to standardize predictor variables in Python?

A

stats.zscore()
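
A short usage sketch, assuming pandas and SciPy are installed. The data frame contents and the `_z` column names are hypothetical; the values are passed as a NumPy array so the result can be wrapped in a new data frame with new column names.

```python
import pandas as pd
from scipy import stats

# Hypothetical predictors.
X = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "sugar": [1.0, 20.0, 5.0, 60.0],
})

# Standardize each column (axis=0) and keep the result as a data frame.
Xz = pd.DataFrame(stats.zscore(X.to_numpy(), axis=0), columns=["alcohol_z", "sugar_z"])
print(Xz)
```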

20
Q

What does the fit() command do in the k-means algorithm?

A

Runs the specified k-means algorithm on the data set.
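
A minimal sketch of the call with scikit-learn, using a tiny made-up array in place of the standardized wine predictors. The name `kmeans01` is borrowed from the deck's R cards for readability; `n_init` and `random_state` are set here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical standardized predictors: two well-separated groups of three records.
Xz = np.array([
    [-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9],
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
])

# Specify the algorithm (k = 2), then run it on the data with fit().
kmeans01 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xz)

cluster = kmeans01.labels_           # cluster index (0 or 1) for each record
centers = kmeans01.cluster_centers_  # one centroid per cluster
print(cluster, centers)
```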

21
Q

What command separates records into two groups based on cluster membership in Python?

A

Xz.loc[cluster == 0] and Xz.loc[cluster == 1]
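
As an illustration, the snippet below applies the same indexing to a small made-up data frame and label array, then summarizes one group with pandas' `describe()`. The names `Cluster1`, `Cluster2`, and `summary1` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical standardized predictors and cluster labels (one label per record).
Xz = pd.DataFrame({
    "alcohol_z": [-1.2, -0.8, 0.9, 1.1, -1.0],
    "sugar_z": [1.0, 0.7, -0.9, -1.1, 1.3],
})
cluster = np.array([0, 0, 1, 1, 0])

# Boolean masks select the records belonging to each cluster.
Cluster1 = Xz.loc[cluster == 0]
Cluster2 = Xz.loc[cluster == 1]

# Descriptive statistics for the first cluster.
summary1 = Cluster1.describe()
print(summary1)
```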

22
Q

What is the R command to subset predictor variables?

A

subset(wine_train, select = c("alcohol", "sugar"))

23
Q

What command in R standardizes variables?

A

scale()

24
Q

What is the purpose of running k-means clustering on both training and test data sets?

A

To validate the clustering results.

25
Q

What is the purpose of the scale() command in R for clustering?

A

The scale() command converts the variables in X to their respective z-scores.

Standardization is essential for clustering as it ensures that all variables contribute equally to the distance calculations.

26
Q

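What input format is used for the kmeans() command in this example?

A

Data frame format

In this example the standardized variables are stored in a data frame before being passed to kmeans(). More generally, kmeans() in R accepts a numeric matrix or any object that can be coerced to one, such as a data frame whose columns are all numeric.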

27
Q

What does the kmeans() function require as inputs?

A

The data frame and the number of clusters

In this example, the number of clusters specified is 2.

28
Q

What does the command as.factor() do in the context of k-means clustering?

A

It converts the cluster membership vector to a factor so it can be saved as a categorical variable

This allows for categorical representation of cluster memberships.

29
Q

How do you separate records into clusters in R?

A

Using the which() command

This command selects records based on their cluster membership.

30
Q

What command is used to obtain descriptive statistics of each cluster in R?

A

summary() command

This command provides statistical summaries for the specified clusters.

31
Q

True or False: k-means clustering automatically selects the optimal number of clusters.

A

False

The user must specify the number of clusters beforehand.

32
Q

Why is it important to standardize numerical predictors before clustering?

A

To ensure that all variables contribute equally to the clustering process

Without standardization, variables with larger ranges could disproportionately influence the results.

33
Q

What is a centroid in the context of clustering?

A

The mean point of all points in a cluster

The centroid represents the center of a cluster in k-means clustering.

34
Q

What is the first step in validating clusters in k-means clustering?

A

Inputting the test data set and performing variable standardization

This ensures that the same preprocessing is applied to both training and test datasets.

35
Q

Fill in the blank: The object in which the clustering algorithm output is saved is named _______.

A

kmeans01

This variable stores the results of the k-means clustering process.

36
Q

What is the purpose of the colnames() command in the clustering process?

A

To edit the column names of the standardized data frame

This helps to indicate that the variables are now standardized.

37
Q

What does the output of kmeans01$cluster represent?

A

Each record’s cluster membership

The output indicates which cluster each record belongs to (1 or 2 in this case).

38
Q

What are the two main data sets mentioned for clustering analysis?

A

white_wine_training and white_wine_test

These datasets are used for training and validating the clustering model.

39
Q

What is the significance of the ‘centers’ argument in the kmeans() function?

A

It specifies the number of clusters to form

In the provided example, it is set to 2.

40
Q

What does the subset() function do in the context of the test data set?

A

It selects specific variables, alcohol and sugar

This prepares the data for clustering by focusing on relevant predictors.