Cluster Analysis Flashcards
Cluster analysis is mainly used to group CASES, not FIELDS
TRUE. Some programs (like SPSS) also allow the user to group FIELDS, but this is far from standard use. CLUSTER is used to create groups of OBJECTS (cases), not variables.
TWO STEP cluster is able to balance, automatically, the size of resulting clusters
FALSE. The sizes we get for the different groups depend on the natural dispersion of the data under our settings (variables, distance metric, ...) and CANNOT be balanced automatically.
Both Euclidean distance and Log-Likelihood can be used as distance measures in Two-Step Cluster, no matter the type / measurement level of the clustering variables we are using
FALSE. When we mix categorical and metric variables in a TWO STEP procedure, we SHOULD use Log-Likelihood.
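A small illustration of why Euclidean distance is a poor fit once a categorical field is involved. This is only a sketch, not the log-likelihood distance itself: the `region` categories and both numeric codings are hypothetical. It shows that the "distance" between unordered categories depends entirely on how we happen to code them.

```python
# Two arbitrary numeric codings of the same unordered (hypothetical) field "region".
coding_a = {"north": 1, "south": 2, "east": 3}
coding_b = {"north": 1, "east": 2, "south": 3}

def code_distance(coding, x, y):
    """Euclidean distance in one dimension between the numeric codes of x and y."""
    return abs(coding[x] - coding[y])

# Under coding A, north looks closer to south; under coding B, closer to east.
for name, coding in (("A", coding_a), ("B", coding_b)):
    print(name,
          "north-south:", code_distance(coding, "north", "south"),
          "north-east:", code_distance(coding, "north", "east"))
```

Since neither coding is more "right" than the other, a distance built on these codes says nothing reliable about similarity.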
We need homogeneous sets of data for hierarchical clusters (scale or categorical, but not mixed)
TRUE. At least in SPSS, we can only use one type of variable, because the distance measures available depend on the type of variables we have (scale, counts or binary).
A dendrogram can be used to identify outliers
TRUE. Outlier cases appear in the dendrogram as cases/objects far from all the others (such as ME in the MATCHING example from class).
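A minimal sketch of what "far from any others" means in a dendrogram. It assumes single linkage on made-up one-dimensional values; the function name and the data are illustrative, not taken from the class example. The outlier is the case that joins last, alone, at a much greater height than every other merge.

```python
def single_linkage_merges(values):
    """Naive single-linkage agglomeration on 1-D values.

    Returns one (height, left_members, right_members) tuple per merge,
    in the order the merges happen (bottom of the dendrogram first).
    """
    clusters = [[i] for i in range(len(values))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(values[i] - values[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

values = [1, 2, 3, 10, 11, 40]          # the case with value 40 is the outlier
merges = single_linkage_merges(values)
heights = [h for h, _, _ in merges]
last_height, left, right = merges[-1]
outlier_side = left if len(left) == 1 else right

print("merge heights:", heights)
print("last to join:", values[outlier_side[0]], "at height", last_height)
```

Here the last merge attaches a single case at a height several times larger than any earlier merge, which is the pattern to look for in the plot.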
A dendrogram does not provide any support for selecting the proper number of clusters
FALSE. We CAN explore the clustering structure to get a visual idea of the “ideal” number of clusters in terms of homogeneity. For instance, a dendrogram may show that a solution with THREE clusters entails MIXING two different groups that are far from each other, thus suggesting that FOUR clusters is more appropriate.
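The visual reading described above can be approximated numerically. This is a rough heuristic sketch, not a standard SPSS output: the merge heights are invented and the function name is hypothetical. The idea is to stop merging just before the largest jump in merge height, because that jump is where two distant groups would be forced together.

```python
def clusters_before_largest_jump(merge_heights):
    """Number of clusters left if we stop just before the biggest jump in height.

    merge_heights must be sorted from the first (lowest) merge to the last.
    """
    n_cases = len(merge_heights) + 1
    gaps = [merge_heights[i + 1] - merge_heights[i]
            for i in range(len(merge_heights) - 1)]
    i = gaps.index(max(gaps))        # the jump lies between merge i and merge i + 1
    merges_kept = i + 1              # merges performed before the jump
    return n_cases - merges_kept

# Invented heights for 8 cases: going from 4 to 3 clusters costs far more (0.9 -> 6.0).
heights = [0.5, 0.6, 0.7, 0.9, 6.0, 6.5, 7.0]
suggested_k = clusters_before_largest_jump(heights)
print("suggested number of clusters:", suggested_k)
```

This is only a starting point; the final choice should still be checked against how interpretable the groups are.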
The choice of the distance measure in hierarchical clusters depends, basically, on the number of variables
FALSE. It depends on the measurement level of the variables (scale, counts or binary).
A Two-Step cluster can produce solutions based on mixtures of continuous and categorical variables
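To make the point concrete for binary variables, here is a short sketch of two common similarity measures for 0/1 data, simple matching and Jaccard. The two cases are invented, and returning 0.0 when neither case has any 1 is just a convention chosen for this sketch. Both measures count matches instead of computing a geometric distance, and they can disagree strongly.

```python
def simple_matching(a, b):
    """Share of positions where both cases have the same value (0-0 or 1-1)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def jaccard(a, b):
    """Share of 1-1 matches among positions where at least one case has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0   # convention for this sketch only

case_a = [1, 0, 0, 0, 0]
case_b = [0, 1, 0, 0, 0]

print("simple matching:", simple_matching(case_a, case_b))
print("jaccard:", jaccard(case_a, case_b))
```

The two cases share three joint absences, so simple matching treats them as fairly similar, while Jaccard ignores joint absences and treats them as having nothing in common.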
TRUE. This is in fact one of its benefits compared to other algorithms.
One of the problems of Two-Step is its inherent inability to handle outliers
FALSE. This is actually another advantage of Two-Step: the algorithm is able to find and filter outlier cases automatically (see class doc for more info on this).
The standardization of variables is quite common as a preliminary step in cluster exercises
TRUE. If we don’t standardize, the scale of each variable will affect the distance metric and thus the cluster result (remember what we saw in a class example about the importance of AGE compared to number of CHILDREN).
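A small sketch of the AGE versus CHILDREN effect, using invented numbers (not the data from the class example). It measures what share of the squared Euclidean distance between two cases comes from age, before and after z-score standardization.

```python
from statistics import mean, pstdev

# Invented data: six people, age in years and number of children.
ages = [25, 35, 45, 55, 30, 50]
children = [0, 3, 1, 2, 0, 3]

def z_scores(xs):
    """Standardize a column to mean 0 and (population) standard deviation 1."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def age_share(age_col, child_col, i, j):
    """Share of the squared Euclidean distance between cases i and j due to age."""
    d_age = (age_col[i] - age_col[j]) ** 2
    d_child = (child_col[i] - child_col[j]) ** 2
    return d_age / (d_age + d_child)

# Cases 0 and 1 differ by 10 years and 3 children.
raw_share = age_share(ages, children, 0, 1)
std_share = age_share(z_scores(ages), z_scores(children), 0, 1)

print(f"age share of distance, raw data:          {raw_share:.2f}")
print(f"age share of distance, standardized data: {std_share:.2f}")
```

On the raw data, age accounts for almost all of the distance simply because it is measured on a wider scale; after standardization the three-child difference carries most of the weight.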
A balanced distribution of cluster sizes is COMPULSORY
FALSE. It is NOT COMPULSORY, even if balanced sizes across clusters are sometimes good news.
A hierarchical cluster normally makes little sense when we have thousands of records to cluster
TRUE. Hierarchical is normally used when we want to track aggregation “path” or “process” ONLY for some selected cases of special interest among a limited number of other cases.
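One practical side of this, sketched under the assumption that the procedure starts from a full matrix of pairwise distances (as a naive agglomerative algorithm does): the number of distances grows roughly with the square of the number of cases, and so does the number of leaves we would have to read in the dendrogram. The function name is illustrative.

```python
def pairwise_distances(n_cases):
    """Number of distinct case-to-case distances among n_cases cases."""
    return n_cases * (n_cases - 1) // 2

for n in (50, 1_000, 10_000):
    print(f"{n:>6} cases -> {pairwise_distances(n):>12,} pairwise distances")
```

With 50 cases the structure is still readable; with 10,000 cases there are close to 50 million distances and a dendrogram nobody can inspect case by case.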
Technical quality of a CLUSTER output can be evaluated with the Silhouette measure
TRUE. Even if this is only about a technical assessment, it is true. It measures the compactness of each group and the separation between groups.
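A minimal sketch of how a silhouette value combines compactness and separation, written from the usual definition s = (b - a) / max(a, b), where a is the mean distance to the case's own cluster and b is the mean distance to the nearest other cluster. The points and labels are invented, the function assumes at least two clusters, and scoring a single-case cluster as 0 is the usual convention.

```python
from math import dist

def mean_silhouette(points, labels):
    """Average silhouette over all cases (assumes at least two clusters)."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:                      # single-case cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)          # cohesion: mean distance within own cluster
        b = min(                         # separation: nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])   # groups match the two tight pairs
bad = mean_silhouette(points, [0, 1, 0, 1])    # groups cut across the pairs

print(f"good grouping: {good:.2f}")
print(f"bad grouping:  {bad:.2f}")
```

Values close to 1 indicate compact, well-separated groups; values near or below 0 indicate cases that sit closer to another group than to their own.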
The size of clusters is NOT a relevant matter at all
FALSE. It is normally a relevant matter. We normally look for large groups where an action can pay off, so very small groups are not normally good news.
A cluster exercise is mainly a PREDICTIVE analytical exercise
NOT at all. We are not PREDICTING anything. We don’t have any “target”. CLUSTER is an UNSUPERVISED technique, just about an exploratory analysis.
Normally, a good cluster solution is shaped with a LARGE number of clustering variables (not less than 15)
FALSE. Normally it’s the other way around. A Cluster solution with lots of “dimensions” or input variables is a bit impractical so we typically prefer a solution depicted by a limited number of input fields.
A good cluster produces clusters with low SEPARATION and high COHESION
FALSE. A perfect group structure means HIGH cohesion (within each group) and HIGH separation (between groups).
The SILHOUETTE measure in a Cluster output is mainly used to evaluate the optimum number of clusters to get.
FALSE. It is used to evaluate the overall technical quality of our result in terms of SEPARATION and COHESION, and this technical value depends on every cluster setting (selected variables, distance measure, ...) and not ONLY on the number of groups. It is true that the NUMBER of clusters will also affect this technical quality, but SILHOUETTE is NOT the standard tool to select the NUMBER of clusters.
TWO STEP Cluster is called that way because the initial cluster solution is internally validated in a holdout sample during a second stage
FALSE. The name “two-step” comes from the idea of a two-stage process for producing clusters (see technical annex in the class doc).
A field/variable can be very relevant to define/ distinguish a SPECIFIC CLUSTER without being of great importance for the solution as a whole
TRUE. Imagine a solution with 4 groups of consumers (A, B, C & D). Imagine that two groups (Group C & Group D) are BOTH different from A and B in terms of age, average expenditure and recency, BUT similar to each other, and the only difference is that group D is made up of consumers living in rural areas. This “rural areas” field may be of interest ONLY to define group D, and not to understand A, B or C.
TWO STEP Cluster is called this way because it mixes two different algorithms to produce a final solution
TRUE. See technical annex in the class doc for more info.
EUCLIDEAN distance can be used with TWO-STEP under certain assumptions
TRUE. When we only use scale variables we can still use Euclidean distance.
The selection of the clustering VARIABLES highly conditions the cluster solution we get
TRUE, TOTALLY. The groups we get are completely conditional on the variables we use to understand similarity or distance.
The measure of DISTANCE does not have any impact on the cluster solution
FALSE. It has an impact, and not only for technical reasons BUT also because, from a conceptual point of view, DISTANCE is the way we understand dissimilarity, so by changing our notion of difference versus similarity we change our understanding of groups.
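A tiny sketch of how the notion of distance changes who counts as "similar". The points are invented; Euclidean and Chebyshev (maximum coordinate difference) are used here only as two examples of distance measures for scale variables.

```python
def euclidean(p, q):
    """Straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def chebyshev(p, q):
    """Largest difference on any single variable."""
    return max(abs(a - b) for a, b in zip(p, q))

reference = (0, 0)
candidates = {"P": (3, 3), "Q": (0, 4)}

def nearest(metric):
    """Name of the candidate closest to the reference case under the given metric."""
    return min(candidates, key=lambda name: metric(reference, candidates[name]))

print("nearest by Euclidean:", nearest(euclidean))
print("nearest by Chebyshev:", nearest(chebyshev))
```

The same reference case has a different nearest neighbour under each measure, so any grouping built on those distances can change as well.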
A Two Step cluster - Outliers will be detected automatically
TRUE. The option “Outlier treatment - use noise handling” in the second screenshot makes SPSS filter out outlier cases automatically (see class document about Two-Step for further details).
We are using log-likelihood distance but Euclidean would have been a better choice
FALSE. Euclidean is NOT an option when we mix categorical and scale variables.