Cluster Analysis Flashcards

1
Q

Cluster analysis is mainly used to group CASES, not FIELDS

A

TRUE. Some programs (like SPSS) also allow the user to group FIELDS, but this is far from being a standard use. CLUSTER is used to create groups of OBJECTS (cases), not variables.

2
Q

TWO STEP cluster is able to balance, automatically, the size of resulting clusters

A

FALSE. The sizes we get for the different groups depend on the natural dispersion of the groups under our settings (variables, distance metric, ...) and CANNOT be balanced automatically.

3
Q

Either Euclidean distance or Log-Likelihood can be used in Two-Step Cluster as the distance measure, no matter the type / measurement level of the clustering variables we are using

A

FALSE. When we mix categorical and metric variables in a TWO STEP procedure, we SHOULD use Log-Likelihood.

4
Q

We need homogeneous sets of data for hierarchical clusters (scale or categorical, but not mixed)

A

TRUE. At least with SPSS we can only use one type of variable, because the distance measures we may select depend on the type of variables we have (scale, counts or binary).

5
Q

A dendrogram can be used to identify outliers

A

TRUE. Outlier cases would appear in the dendrogram as cases/objects far from any others (such as ME in the MATCHING example from class).

6
Q

A dendrogram does not provide any support for selecting the proper number of clusters

A

FALSE. We CAN explore the clustering structure in order to get a visual idea of the “ideal” number of clusters in terms of homogeneity. For instance, a dendrogram may show that a solution with THREE clusters entails MIXING two different groups that are far from each other, thus suggesting that FOUR clusters is more appropriate.

7
Q

The choice of the distance measure in hierarchical clusters depends, basically, on the number of variables

A

FALSE. It depends on the measurement level of the variables.

8
Q

A Two-Step cluster can produce solutions based on mixtures of continuous and categorical variables

A

TRUE. This is in fact one of its benefits compared to other algorithms.

9
Q

One of the problems of Two-Step is its inherent inability to handle outliers

A

FALSE. In fact, this is another advantage of Two-Step: the algorithm is able to find and filter outlier cases automatically (see the class doc for more info on this).

10
Q

The standardization of variables is quite common as a preliminary step in cluster exercises

A

TRUE. If we don’t standardize, the scale of each variable will affect the distance metric and thus the cluster result (remember what we saw in a class example about the importance of AGE compared to number of CHILDREN).
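
A minimal sketch of this effect, using made-up customers (the ages and numbers of children below are hypothetical, not the class data). Before standardization AGE dominates the Euclidean distance, so customer A looks closer to B despite a three-child difference; after converting both variables to z-scores, A is closer to C.

```python
import math
import statistics

# Hypothetical customers: (age in years, number of children).
customers = {"A": (30, 0), "B": (36, 3), "C": (42, 0), "D": (60, 1), "E": (22, 2)}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(rows):
    """Convert each column to z-scores (mean 0, sample standard deviation 1)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

names = list(customers)
z = dict(zip(names, standardize([customers[n] for n in names])))

# Raw units: the age gap drives the distance.
raw_ab = euclidean(customers["A"], customers["B"])
raw_ac = euclidean(customers["A"], customers["C"])

# Standardized units: both variables contribute on a comparable scale.
z_ab = euclidean(z["A"], z["B"])
z_ac = euclidean(z["A"], z["C"])

print(f"raw: A-B={raw_ab:.2f}  A-C={raw_ac:.2f}")
print(f"z:   A-B={z_ab:.2f}  A-C={z_ac:.2f}")
```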

11
Q

A balanced distribution of cluster sizes is COMPULSORY

A

FALSE. It is NOT compulsory, even if balanced sizes across clusters are sometimes good news.

12
Q

A hierarchical cluster normally makes little sense when we have thousands of records to cluster

A

TRUE. Hierarchical clustering is normally used when we want to track the aggregation “path” or “process” for a few selected cases of special interest among a limited number of other cases.

13
Q

Technical quality of a CLUSTER output can be evaluated with the Silhouette measure

A

TRUE, although this is only a technical assessment. It measures the compactness of each group and the separation between groups.
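
As an illustration, the classic per-case silhouette is s = (b - a) / max(a, b), where a is the mean distance from a case to the other members of its own cluster (cohesion) and b is the mean distance to the members of the nearest other cluster (separation). The sketch below uses that textbook definition with Euclidean distance on hypothetical points; the value reported by SPSS may be computed differently in detail.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_silhouette(points, labels):
    """Average silhouette over all cases; assumes at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Distances from case i to every other case, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(euclidean(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette is 0 by convention
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two compact, well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])  # labels follow the groups
bad = mean_silhouette(points, [0, 1, 0, 1, 0, 1])   # labels ignore the groups

print(f"good labelling: {good:.2f}, bad labelling: {bad:.2f}")
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or badly assigned cases.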

14
Q

The size of clusters is NOT a relevant matter at all

A

FALSE. It is normally a relevant matter. We usually look for groups large enough for an action to pay off, so very small groups are not normally good news.

15
Q

A cluster exercise is mainly a PREDICTIVE analytical exercise

A

NOT at all. We are not PREDICTING anything and we don’t have any “target”. CLUSTER is an UNSUPERVISED technique, used for exploratory analysis.

16
Q

Normally, a good cluster solution is built with a LARGE number of clustering variables (no fewer than 15)

A

FALSE. Normally it’s the other way around. A cluster solution with lots of “dimensions” or input variables is rather impractical, so we typically prefer a solution described by a limited number of input fields.

17
Q

A good cluster produces clusters with low SEPARATION and high COHESION

A

FALSE. A perfect group structure means HIGH cohesion (within each group) and HIGH separation (between groups).

18
Q

The SILHOUETTE measure in a Cluster output is mainly used to evaluate the optimum number of clusters to get.

A

FALSE. It is used to evaluate the overall technical quality of our result in terms of SEPARATION and COHESION and this technical value depends on every cluster setting (selected variables, measure of distance,…) and not ONLY on the number of groups. It is true that the NUMBER of clusters will also impact this technical quality but SILHOUETTE is NOT the standard tool to select the NUMBER of clusters.

19
Q

TWO STEP Cluster is called that way because the initial cluster solution is internally validated in a holdout sample during a second stage

A

FALSE. The name “two-step” comes from the idea of a two-stage process when producing clusters (see the technical annex in the class doc).

20
Q

A field/variable can be very relevant to define/distinguish a SPECIFIC CLUSTER without being of great importance for the solution as a whole

A

TRUE. Imagine a solution with 4 groups of consumers (A, B, C and D). Groups C and D are BOTH different from A and B in terms of age, average expenditure and recency, BUT similar to each other; the only difference is that group D consists of consumers living in rural areas. The “rural areas” variable may then be of interest ONLY to define group D, and not to understand A, B or C.

21
Q

TWO STEP Cluster is called this way because it mixes two different algorithms to produce a final solution

A

TRUE. See technical annex in the class doc for more info.

22
Q

EUCLIDEAN distance can be used with TWO-STEP under certain assumptions

A

TRUE. When we only use scale variables we can still use Euclidean distance.

23
Q

The selection of the clustering VARIABLES highly conditions the cluster solution we get

A

TRUE, TOTALLY. The groups we get are completely conditional on the variables we use to define similarity or distance.

24
Q

The measure of DISTANCE does not have any impact on the cluster solution

A

FALSE. It has an impact, and not only for technical reasons BUT also because, from a conceptual point of view, DISTANCE is the way we understand dissimilarity. By changing our notion of difference vs. similarity, we change our understanding of groups.
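
A small sketch of how the choice of measure changes which case counts as "closest". The points are hypothetical; Euclidean and Manhattan (city-block) distance are used here simply as two common measures for scale variables.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical cases measured on two scale variables.
reference = (0, 0)
candidates = {"P": (3, 3), "Q": (0, 5)}

def nearest(distance):
    """Name of the candidate closest to the reference under the given measure."""
    return min(candidates, key=lambda name: distance(reference, candidates[name]))

nearest_euclidean = nearest(euclidean)  # P is about 4.24 away, Q is 5.0 away
nearest_manhattan = nearest(manhattan)  # P is 6 away, Q is 5 away

print(nearest_euclidean, nearest_manhattan)
```

Since the nearest neighbour of the same case differs between the two measures, any algorithm that merges the closest cases first can end up with different groups.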

25
Q

In a Two-Step cluster, outliers will be detected automatically

A

TRUE. The option “Outlier treatment - use noise handling” in the second screenshot makes SPSS filter out outlier cases automatically (see the class document about Two-Step for further details).

26
Q

We are using log-likelihood distance but Euclidean would have been a better choice

A

FALSE. Euclidean is NOT an option when we mix categorical and scale variables.

27
Q

Considering that we have 10,000 customers in our dataset, a hierarchical cluster would have been a better option than TWO STEP

A

FALSE. Precisely because we have thousands of records, a hierarchical cluster does not look like a good option.

28
Q

It is especially useful to find groups in a multivariate context

A

Yes. Finding groups without a cluster algorithm is only feasible in univariate / bivariate contexts; in the presence of several features, we normally need an algorithm in order to find compact and well-separated groups.

29
Q

It is supervised, in the sense that we will always consider a clear leading variable to define or discover groups

A

No. “Supervised” means that we have a TARGET variable to be explained or predicted. In a cluster exercise we may have several inputs (clustering variables) but not a TARGET variable.

30
Q

Can only be used with scale/continuous variables in our dataset

A

No. There are distance measures able to process categorical variables, scale variables, and even a mix of both.
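
One well-known example of a mixed-type measure is a Gower-style distance (an illustration chosen here; it is not the log-likelihood distance used by Two-Step). A minimal sketch, assuming hypothetical customer records: scale variables contribute a range-normalised absolute difference, categorical variables contribute 0 if equal and 1 otherwise, and the result is the average over all variables.

```python
def gower_distance(x, y, numeric_ranges):
    """Average per-variable dissimilarity between two records (dicts).

    numeric_ranges maps each scale field to its (min, max) in the dataset;
    every other field is treated as categorical (0 if equal, 1 otherwise).
    """
    total = 0.0
    for field in x:
        if field in numeric_ranges:
            lo, hi = numeric_ranges[field]
            total += abs(x[field] - y[field]) / (hi - lo) if hi > lo else 0.0
        else:
            total += 0.0 if x[field] == y[field] else 1.0
    return total / len(x)

# Hypothetical records and variable ranges.
ranges = {"age": (20, 60), "spend": (0, 400)}
a = {"age": 30, "spend": 200, "area": "urban"}
b = {"age": 30, "spend": 200, "area": "rural"}
c = {"age": 40, "spend": 400, "area": "urban"}

d_ab = gower_distance(a, b, ranges)  # differs only in the categorical field
d_ac = gower_distance(a, c, ranges)  # differs only in the scale fields

print(round(d_ab, 3), round(d_ac, 3))
```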

31
Q

A cluster exercise is more about exploring grouping structure than about confirming the existence of certain groups

A

Yes. It is an exploratory tool and, in fact, we normally obtain, inspect and compare several cluster solutions before deciding on the one that fits our needs.

32
Q

Cluster is of great interest in the context of Market Segmentation

A

Yes. It is not only used for Market segmentation but its use in this specific context is even considered “a must”.

33
Q

There is no way of evaluating a cluster solution from a technical point of view

A

False. Even if technical quality is not the most important feature to be evaluated in a cluster solution, there are several metrics that we can use in order to assess this technical quality (Silhouette, for instance).

34
Q

Cluster can be mainly used to reduce the number of variables in a dataset

A

No. This technique is not about dimension reduction; it explores homogeneity across individuals or objects (rows in a dataset), not across variables (columns in a dataset).

35
Q

We don’t need a target variable to run a Cluster analysis

A

Right. This is not a supervised technique.

36
Q

To some extent, a good or bad cluster result is not only about technical features but also depends on the context of our analysis

A

Exactly. A cluster solution should be always aligned with our expectations and be useful in terms of action even if it is not the optimum one in technical terms.

37
Q

The number of clusters to get is usually clear from the beginning of a cluster exercise

A

Not at all. Sometimes we may have a preliminary idea about the optimum number we would like to get, but not about the number we will finally get when we use a given algorithm.

38
Q

It is useful to use as many variables as you can at the same time to improve cluster definition

A

FALSE. Including LOTS OF VARIABLES is normally a bad strategy and, of course, does not imply a better solution.

39
Q

We don’t necessarily need to standardize metric variables IF ALL OF THEM are comparable because they are expressed in similar units / scales

A

TRUE. Standardization is very frequent BUT it is only needed when scale variables are expressed in different units.

40
Q

Euclidean distance is a standard in the presence of metric variables or binary variables

A

TRUE. Its main limitation is in handling categorical variables.

41
Q

It is useful and common to drop or exclude atypical or anomalous cases to find a clearer Cluster solution

A

TRUE. This is mentioned in the class doc and it is always GOOD advice for any analytical procedure you may want to use. Concentrate on what is “normal” without paying attention to odd cases.

42
Q

There is a variety of distance measures we can use, and they can produce very different cluster results

A

TRUE. Distance affects results (among other settings).

43
Q

The distance measure to choose depends critically on the NUMBER of variables we need to use for our clustering exercise

A

FALSE. It depends on the measurement level of the variables.

44
Q

We run a hierarchical cluster when we want to find a clustering solution for thousands of records

A

FALSE. We run a hierarchical cluster for a limited number of objects when we want to explore the agglomeration schedule for some objects/subjects of interest.

45
Q

We can use a hierarchical cluster combining metric and categorical variables

A

FALSE. We can only use one type of measurement, not a mix of both.

46
Q

We are usually interested in the agglomeration process of a hierarchical clustering solution

A

TRUE. This is exactly the aim when we use a hierarchical algorithm

47
Q

If we want to explore and understand similarities between our 25 competitors in the market, a hierarchical cluster looks like a good solution

A

TRUE. This would be a good example. Our interest is to position our brand compared to others, so we may need to inspect distances / proximities brand by brand.

48
Q

A hierarchical cluster always uses an agglomeration procedure

A

FALSE. The algorithm might be divisive (check the illustration in the second class-doc)

49
Q

A hierarchical cluster using “N” observations will always offer a range of cluster solutions from “N” to “1” clusters

A

TRUE. At the beginning of the procedure, each case is its own group (N groups), and the process always ends with every case in the same group (1 group).
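
The N-to-1 sequence can be sketched with a tiny agglomerative procedure. This is only an illustration on hypothetical points, assuming single linkage (distance between clusters = distance between their closest members); other linkage rules follow the same loop.

```python
import math

def single_linkage_history(points):
    """Merge the two closest clusters repeatedly; return the partition at each stage."""
    clusters = [[i] for i in range(len(points))]  # start: every case is its own group
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        history.append([list(c) for c in clusters])
    return history

# Hypothetical cases; the last one lies far from all the others.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
history = single_linkage_history(points)

for partition in history:
    print(len(partition), partition)
```

The printed partitions go from 5 groups down to 1. The distant case (index 4) stays alone until the very last merge, which is how an outlier shows up in a dendrogram.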

50
Q

A dendrogram graph can be used to:

A

Detect outliers

51
Q

The Distance Matrix can be used to:

A
  • Explore similarity between individuals/objects
  • Show the multivariate distance between objects/individuals
  • Run the agglomeration procedure, for which it is essential
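
A minimal sketch of such a matrix, assuming three hypothetical brands described by three scale variables and Euclidean distance. The closest pair in the matrix is the one an agglomerative procedure would merge first.

```python
import math
from itertools import combinations

# Hypothetical brands described by three scale variables.
brands = {
    "Alpha": (1.0, 2.0, 3.0),
    "Beta": (1.5, 2.0, 3.5),
    "Gamma": (6.0, 7.0, 1.0),
}
names = list(brands)

# Square matrix of pairwise multivariate (Euclidean) distances.
matrix = [[math.dist(brands[r], brands[c]) for c in names] for r in names]

# Most similar pair of brands: the smallest off-diagonal entry.
i, j = min(combinations(range(len(names)), 2), key=lambda ij: matrix[ij[0]][ij[1]])
closest_names = (names[i], names[j])

for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
print("closest pair:", closest_names)
```
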
52
Q

We could use a crosstab to check gender distribution across clusters

A

TRUE. Both variables are categorical
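
A crosstab is just a table of counts for each combination of the two categorical variables. A minimal sketch with hypothetical cases (cluster membership, gender):

```python
from collections import Counter

# Hypothetical cases: (cluster membership, gender).
cases = [(1, "F"), (1, "F"), (1, "M"), (2, "M"), (2, "M"), (2, "F"), (3, "F")]

counts = Counter(cases)
clusters = sorted({c for c, _ in cases})
genders = sorted({g for _, g in cases})

# Rows are clusters, columns are genders, cells are case counts.
table = {c: {g: counts[(c, g)] for g in genders} for c in clusters}

print("cluster", *genders)
for c in clusters:
    print(c, *(table[c][g] for g in genders))
```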

53
Q

We could use a Box Plot Graph to explore age (in years) across clusters

A

TRUE. Age (if it is measured in years) is a metric variable and cluster membership is categorical, so we may use a Box-Plot.

54
Q

We would use ANOVA to check statistical differences in age (in years) across clusters if we have more than two clusters in our solution

A

TRUE. This is the test for the combination “metric variable & categorical variable” when the categorical variable has more than two categories. If we only have TWO clusters we could use a t-test.
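
A sketch of the statistic behind a one-way ANOVA: the F ratio compares the variance between cluster means with the variance within clusters. The ages below are hypothetical, and the sketch stops at F (turning it into a p-value requires the F distribution, which is not shown here).

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of numeric values."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-groups sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: spread of the cases around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical ages (in years) of the cases in three clusters.
ages_by_cluster = [[22, 25, 27], [40, 42, 44], [60, 63, 66]]
f_stat = one_way_anova_f(ages_by_cluster)

print(f"F = {f_stat:.1f}")
```

A large F means the clusters differ in mean age far more than cases differ within each cluster.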

55
Q

We cannot use a linear correlation analysis because our cluster membership variable is not metric

A

TRUE. We can only use it for two METRIC variables.

56
Q

TSC (Two-Step Cluster) could be a good option if we have thousands of records to cluster

A

TRUE. Hierarchical is not an option for thousands of records.

57
Q

TSC output shows a dendrogram if needed

A

FALSE. It is not an option and, besides, it would be impractical with thousands of records in the dataset.

58
Q

TSC helps us to automatically identify the “optimum” number of clusters (from a technical point of view)

A

TRUE. At least in SPSS, this is an option.

59
Q

The name TWO-STEP comes from the fact that TSC internally combines two different procedures in two different stages.

A

TRUE. Check technical annex in the class doc if you are interested in details.

60
Q

Two Step can only produce solutions for continuous variables.

A

FALSE. In fact, this is one of the advantages of TWO-STEP (the ability to deal with a mix of continuous and categorical variables).

61
Q

There is commonly a penalty to the cluster quality for assuming a certain probability distribution for continuous and categorical variables

A

TRUE. Check the class doc (disadvantage number 3, page 14).

62
Q

TSC is part of the hierarchical clustering algorithms family

A

FALSE. It is not hierarchical.

63
Q

TSC can NOT handle outliers automatically

A

FALSE. In fact, this is one of the advantages of TWO-STEP.

64
Q

We had better standardize our METRIC clustering variables when using TSC in SPSS if their units are not comparable.

A

TRUE, if we want to avoid the units having an impact on the cluster output.

65
Q

If you want to explore different traveller profiles in your large database of trip reservations from the last year, which algorithm would you use?

A

Two Step Cluster looks good. If we have a large dataset of transactions and, besides, we assume that we have to deal with a combination of metric and non-metric variables, a hierarchical cluster is not a good option.