Cluster Analysis Flashcards by Ana María Aldana

Cluster analysis is mainly used to group CASES, not FIELDS

TRUE. Some programs (like SPSS) also allow the user to group FIELDS, but this is far from standard use. CLUSTER is used to create groups of OBJECTS (cases), not variables.

TWO STEP cluster is able to balance, automatically, the size of resulting clusters

FALSE. The sizes we get for the different groups depend on the natural dispersion of the groups under our settings (variables, distance metric, ...) and CANNOT be balanced automatically.

Both Euclidean distance and Log-Likelihood can be used in Two-Step Cluster as distance measures, no matter the type / measurement level of the clustering variables we are using

FALSE. When we mix categorical and metric variables in a TWO STEP procedure, we SHOULD use Log-Likelihood.

We need homogeneous sets of data for hierarchical clusters (scale or categorical, but not mixed)

TRUE. At least in SPSS, we can only use one type of variable, because the distance measures available depend on the type of variables we have (scale, counts or binary).

A dendrogram can be used to identify outliers

TRUE. Outlier cases appear in the dendrogram as cases/objects far from all others (such as ME in the MATCHING example from class).

A dendrogram does not provide any support for selecting the proper number of clusters

FALSE. We CAN explore the clustering structure to get a visual idea of the “ideal” number of clusters in terms of homogeneity. For instance, a dendrogram may show that a solution with THREE clusters entails MIXING two different groups that are far from each other, thus suggesting that FOUR clusters is more appropriate.

The choice of the distance measure in hierarchical clusters depends, basically, on the number of variables

FALSE. It depends on the measurement level of the variables.

A Two-Step cluster can produce solutions based on mixtures of continuous and categorical variables

TRUE. This is in fact one of its benefits compared to other algorithms.

One of the problems of Two-Step is its inherent inability to handle outliers

FALSE. In fact, this is another advantage of Two-Step: the algorithm is able to automatically find and filter outlier cases (see class doc for more info on this).

Standardizing variables is quite common as a preliminary step in cluster exercises

TRUE. If we don’t standardize, the scale of each variable will affect the distance metric and thus the cluster result (remember what we saw in a class example about the importance of AGE compared to number of CHILDREN).
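
The effect of units on distance can be sketched with a few lines of Python. This is only an illustration: the three customers and their values are made up, and z-scores are just one common way to standardize.

```python
import math
import statistics

# Hypothetical customers: (age in years, number of children)
customers = {"A": (30, 0), "B": (36, 4), "C": (40, 0)}

def standardize(rows):
    """Convert each column to z-scores (population standard deviation)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

names = list(customers)
raw = customers
z = dict(zip(names, standardize([customers[n] for n in names])))

# Euclidean distances before and after standardizing
raw_ab = math.dist(raw["A"], raw["B"])
raw_ac = math.dist(raw["A"], raw["C"])
z_ab = math.dist(z["A"], z["B"])
z_ac = math.dist(z["A"], z["C"])

print(f"raw:          A-B = {raw_ab:.2f}, A-C = {raw_ac:.2f}")
print(f"standardized: A-B = {z_ab:.2f}, A-C = {z_ac:.2f}")
```

With these made-up values, A is closer to B in raw units (age dominates because its numbers are larger), but closer to C once both variables are standardized and the gap of four children carries comparable weight.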

A balanced distribution of cluster sizes is COMPULSORY

FALSE. IT IS NOT COMPULSORY, even if balanced sizes across clusters are sometimes good news.

A hierarchical cluster normally makes little sense when there are thousands of records to cluster

TRUE. Hierarchical is normally used when we want to track the aggregation “path” or “process” ONLY for some selected cases of special interest among a limited number of other cases.

Technical quality of a CLUSTER output can be evaluated with the Silhouette measure

TRUE. Even if this is only about a technical assessment, it is true. It measures the compactness of each group and the separation between groups.
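
As a rough sketch of what the measure computes, the snippet below implements the usual silhouette definition by hand: for each case, a is the mean distance to the other cases in its own cluster (cohesion), b is the mean distance to the nearest other cluster (separation), and the score is (b - a) / max(a, b). The points and labels are invented, and the function assumes every cluster has at least two cases; SPSS may report the measure differently.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width; assumes every cluster has at least two cases."""
    scores = []
    for i, p in enumerate(points):
        # Distances from case i to every other case, grouped by cluster label
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster[labels[i]]
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(sum(d) / len(d)  # separation: nearest other cluster
                for lab, d in by_cluster.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two visibly separate groups of three cases
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])  # labels follow the real groups
bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # labels mix the groups

print(f"good labelling: {good:.2f}, mixed labelling: {bad:.2f}")
```

The labelling that follows the real groups scores close to 1, while the labelling that mixes them scores near 0.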

The size of clusters is NOT a relevant matter at all

FALSE. It is normally a relevant matter. We normally look for large groups where an action can be profitable, so very small groups are not normally good news.

A cluster exercise is mainly a PREDICTIVE analytical exercise

NOT at all. We are not PREDICTING anything. We don’t have any “target”. CLUSTER is an UNSUPERVISED technique used for exploratory analysis.

Normally, a good cluster solution is shaped with a LARGE number of clustering variables (not less than 15)

FALSE. Normally it’s the other way around. A cluster solution with lots of “dimensions” or input variables is rather impractical, so we typically prefer a solution described by a limited number of input fields.

A good cluster produces clusters with low SEPARATION and high COHESION

FALSE. A perfect group structure means HIGH cohesion (within each group) and HIGH separation (between groups).

The SILHOUETTE measure in a Cluster output is mainly used to evaluate the optimum number of clusters to get.

FALSE. It is used to evaluate the overall technical quality of our result in terms of SEPARATION and COHESION, and this technical value depends on every cluster setting (selected variables, measure of distance, ...) and not ONLY on the number of groups. It is true that the NUMBER of clusters will also affect this technical quality, but SILHOUETTE is NOT the standard tool to select the NUMBER of clusters.

TWO STEP Cluster is called that way because the initial cluster solution is internally validated in a holdout sample during a second stage

FALSE. The name “two-step” comes from the idea of a two-stage process when producing clusters (see technical annex in the class doc).

A field/variable can be very relevant to define/distinguish a SPECIFIC CLUSTER without being of great importance for the solution as a whole

TRUE. Imagine a solution with 4 groups of consumers (A, B, C & D). Imagine that two groups (Group C & Group D) are BOTH different from A and B in terms of age, average expenditure and recency, BUT similar to each other, and the only difference between them is that group D consists of consumers living in rural areas. This “rural areas” variable is relevant to define group D, but ONLY of marginal interest for understanding A, B or C.

TWO STEP Cluster is called this way because it mixes two different algorithms to produce a final solution

TRUE. See technical annex in the class doc for more info.

EUCLIDEAN distance can be used in TWO-STEP under certain assumptions

TRUE. When we only use scale variables, we can still use Euclidean distance.

The selection of the clustering VARIABLES highly conditions the cluster solution we get

TRUE, TOTALLY. The groups we get are completely conditional on the variables we use to understand similarity or distance.

The measure of DISTANCE does not have any impact on the cluster solution

FALSE. It has an impact, and not only for technical reasons BUT also because, from a conceptual point of view, DISTANCE is the way we understand dissimilarity, so by changing our notion of difference vs. similarity we change our understanding of groups.
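
A tiny sketch of how the notion of distance changes who counts as "similar": the same target case has a different nearest neighbour under Euclidean and city-block (Manhattan) distance. The coordinates are invented purely for illustration.

```python
import math

def city_block(p, q):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical cases measured on two scale variables
target = (0, 0)
candidates = {"a": (3, 3), "b": (0, 5)}

nearest_euclidean = min(candidates, key=lambda k: math.dist(target, candidates[k]))
nearest_city_block = min(candidates, key=lambda k: city_block(target, candidates[k]))

print("Euclidean nearest: ", nearest_euclidean)
print("City-block nearest:", nearest_city_block)
```

Under Euclidean distance the target is closest to "a"; under city-block distance it is closest to "b", so an algorithm that merges nearest cases first would start differently.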

A Two Step cluster - Outliers will be detected automatically

TRUE. The option “Outlier treatment - use noise handling” in the second screenshot makes SPSS filter out outlier cases automatically (see class document about Two-Step for further details).

We are using log-likelihood distance but Euclidean would have been a better choice

FALSE. Euclidean is NOT an option when we mix categorical and scale variables.

Considering that we have 10,000 customers in our dataset, a hierarchical cluster would have been a better option than TWO STEP

FALSE. Precisely because we have thousands of records, a hierarchical cluster does not look like a good option.

It is especially useful to find groups in a multivariate context

Yes. Finding groups without a cluster algorithm is only feasible in univariate / bivariate contexts; in the presence of several features we normally need a technical algorithm in order to find compact and well-separated groups.

It is supervised, in the sense that we will always consider a clear leading variable to define or discover groups

No. "Supervised" means that we have a TARGET variable to be explained or predicted. In a cluster exercise we may have several inputs (clustering variables) but not a TARGET variable.

Can only be used with scale/continuous variables in our dataset

No. There are distance measures able to process categorical variables, scale variables, and even a mix of them.
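
One well-known measure of this kind is Gower's dissimilarity, sketched below under simple assumptions: scale variables contribute their absolute difference divided by the variable's range, categorical variables contribute 0 for a match and 1 for a mismatch, and the parts are averaged. Note that this is NOT the log-likelihood distance used by Two-Step in SPSS; it is swapped in here only because it is easy to show by hand. The customers and variable names are hypothetical.

```python
def gower_distance(x, y, numeric_ranges):
    """Average per-variable dissimilarity between two cases given as dicts."""
    parts = []
    for key in x:
        if key in numeric_ranges:
            # Scale variable: absolute difference rescaled to 0..1 by the range
            parts.append(abs(x[key] - y[key]) / numeric_ranges[key])
        else:
            # Categorical variable: 0 if the categories match, 1 otherwise
            parts.append(0.0 if x[key] == y[key] else 1.0)
    return sum(parts) / len(parts)

# Hypothetical customers mixing scale and categorical variables
cases = [
    {"age": 25, "spend": 100, "area": "urban"},
    {"age": 27, "spend": 110, "area": "urban"},
    {"age": 26, "spend": 105, "area": "rural"},
    {"age": 60, "spend": 400, "area": "rural"},
]

# Observed range of each scale variable in this small dataset
ranges = {
    key: max(c[key] for c in cases) - min(c[key] for c in cases)
    for key in ("age", "spend")
}

d01 = gower_distance(cases[0], cases[1], ranges)
d02 = gower_distance(cases[0], cases[2], ranges)
d23 = gower_distance(cases[2], cases[3], ranges)
print(f"d(0,1) = {d01:.3f}, d(0,2) = {d02:.3f}, d(2,3) = {d23:.3f}")
```

Cases 0 and 1 are similar on every variable and get a small distance; case 2 is close to case 0 on age and spend but differs in area, so its distance is larger.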

A cluster exercise is more about exploring grouping structure than about confirming the existence of certain groups

Yes. It is an exploratory tool and, in fact, we normally get, inspect and compare several cluster solutions before deciding on the one that fits our needs.

Cluster is of great interest in the context of Market Segmentation

Yes. It is not only used for Market segmentation but its use in this specific context is even considered "a must".

There is no way of evaluating a cluster solution from a technical point of view

False. Even if technical quality is not the most important feature to be evaluated in a cluster solution, there are several metrics that we can use in order to assess this technical quality (Silhouette, for instance).

Cluster can be mainly used to reduce the number of variables in a dataset

No. This technique is not about dimension reduction; it explores homogeneity across individuals or objects (rows in a dataset), not across variables (columns in a dataset).

We don't need a target variable to run a Cluster analysis

Right. This is not a supervised technique.

To some extent, a good or bad cluster result is not only about technical features but also depends on the context of our analysis

Exactly. A cluster solution should be always aligned with our expectations and be useful in terms of action even if it is not the optimum one in technical terms.

The number of clusters to get is usually clear from the beginning of a cluster exercise

Not at all. Sometimes we may have a preliminary idea about the optimum number we would like to get, but not about the number we will finally get when we use a given algorithm.

It is useful to use as many variables as you can at the same time to improve clusters definition

FALSE. Including LOTS OF VARIABLES is normally a bad strategy and, of course, does not imply a better solution.

We don’t necessarily need to standardize metric variables IF ALL OF THEM are comparable because they are expressed in similar units / scales

TRUE. Standardization is very frequent BUT it is only needed when scale variables are expressed in different units.

Euclidean distance is a standard in the presence of metric variables or binary variables

TRUE. Its main limitation is in handling categorical variables.

It is useful and common to drop or exclude atypical or anomalous cases to find a clearer Cluster solution

TRUE. This is mentioned in the class doc and it is always GOOD advice for every single analytical procedure you may want to use. Concentrate on what is “normal” without paying attention to odd cases.

There is a variety of distance measures we can use, and they can produce very different cluster results

TRUE. Distance affects results (among other settings).

The distance measure to choose depends critically on the NUMBER of variables we need to use for our clustering exercise

FALSE. It depends on the measurement level of the variables.

We run a hierarchical cluster when we want to find a clustering solution for thousands of records

FALSE. We run a hierarchical cluster for a limited number of objects when we want to explore the agglomeration schedule for some objects/subjects of interest.

We can use a hierarchical cluster combining metric and categorical variables

FALSE. We can only use one type of measurement, not a mix of both.

We are usually interested in the agglomeration process of a hierarchical clustering solution

TRUE. This is exactly the aim when we use a hierarchical algorithm.

If we want to explore and understand similarities between our 25 competitors in the market, a hierarchical cluster looks like a good solution

TRUE. This would be a good example. Our interest is to position our brand compared to others, so we may need to inspect distances / proximities brand by brand.

A hierarchical cluster always uses an agglomeration procedure

FALSE. The algorithm might be divisive (check the illustration in the second class doc).

A hierarchical cluster using "N" observations will always offer a range of cluster solutions from "N" to "1" clusters

TRUE. At the beginning of the procedure, each case is its own group (N groups) and the process always ends with every case in the same group (1 group).
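
The path from N groups down to 1 can be traced with a small hand-written agglomerative procedure. The sketch below uses single linkage (distance between clusters = distance between their closest members) only because it is the shortest to write; it is not meant to reproduce the SPSS procedure, and the five points are invented.

```python
import math

def single_linkage(points):
    """Return the agglomeration schedule: (clusters left, merge distance, members)."""
    clusters = [[i] for i in range(len(points))]  # start: every case is a cluster
    schedule = []
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        schedule.append((len(clusters), d, sorted(merged)))
    return schedule

# Hypothetical cases: two tight pairs plus one case far from everything
points = [(1, 1), (1, 2), (5, 5), (5, 6), (20, 20)]
schedule = single_linkage(points)

for n_clusters, distance, members in schedule:
    print(f"{n_clusters} clusters after merging {members} at distance {distance:.2f}")
```

Starting from 5 single-case clusters, each merge leaves one cluster fewer (4, 3, 2, 1). The isolated case joins only in the last step, at a much larger distance, which is how an outlier shows up in a dendrogram.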

A dendrogram graph can be used to:

Detect outliers

The Distance Matrix can be used to:

- Explore similarity between individuals/objects
- Show the multivariate distance between objects/individuals
- Run the agglomeration procedure, for which it is essential
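
For a concrete picture, a distance matrix is simply every pairwise distance laid out in a square table. A minimal sketch with three hypothetical brands rated on two attributes, using Euclidean distance:

```python
import math

# Hypothetical brands rated on two attributes (for example price and quality)
brands = {"A": (7, 3), "B": (6, 4), "C": (2, 9)}
names = list(brands)

# Square matrix: row r, column c holds the distance between brand r and brand c
matrix = [[math.dist(brands[r], brands[c]) for c in names] for r in names]

print("   " + "".join(f"{n:>7}" for n in names))
for name, row in zip(names, matrix):
    print(f"{name:>3}" + "".join(f"{d:7.2f}" for d in row))
```

The diagonal is zero, the matrix is symmetric, and the smallest off-diagonal entry (A and B here) marks the pair an agglomerative procedure would merge first.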

We could use a crosstab to check gender distribution across clusters

TRUE. Both variables are categorical.
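
A crosstab is just a count of cases for each combination of categories. A minimal sketch with invented memberships for eight customers:

```python
from collections import Counter

# Hypothetical cluster membership and gender for eight customers
cluster = [1, 1, 1, 2, 2, 2, 3, 3]
gender = ["F", "M", "F", "M", "M", "M", "F", "F"]

# Count how many cases fall in each (cluster, gender) cell
counts = Counter(zip(cluster, gender))
clusters = sorted(set(cluster))
genders = sorted(set(gender))

print("cluster", *genders)
for c in clusters:
    print(f"{c:>7}", *(counts[(c, g)] for g in genders))
```

Cells that never occur (such as women in cluster 2 here) simply count as zero, which already hints at a cluster profile.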

We could use a Box Plot Graph to explore age (in years) across clusters

TRUE. Age (if it is measured in years) is a metric variable and cluster membership is categorical, so we may use a Box-Plot.

We would use ANOVA to check statistical differences in age (in years) across clusters if we have more than two clusters in our solution

TRUE. This is the test for the combination “metric variable & categorical variable” when the categorical variable has more than two categories. If we only have TWO clusters we could use a t-test.
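
The F statistic behind a one-way ANOVA compares variation between cluster means with variation inside clusters. The sketch below computes it by hand on invented ages for three clusters; it does not compute the p-value, which requires the F distribution (a statistics package such as SPSS, or `scipy.stats.f_oneway` in Python, reports both).

```python
import statistics

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the cases around their own group mean
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical ages (in years) of the cases in three clusters
ages_by_cluster = [[23, 25, 27], [40, 42, 44], [60, 62, 64]]
f_stat = one_way_anova_f(ages_by_cluster)
print(f"F = {f_stat:.2f}")
```

A large F, as with these clearly separated age groups, suggests that mean age differs across clusters.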

We cannot use a linear correlation analysis because our cluster membership variable is not metric

TRUE. We can only use it for two METRIC variables.

TSC could be a good option if we have thousands of records to cluster

TRUE. Hierarchical is not an option for thousands of records.

TSC output shows a dendrogram if needed

FALSE. It is not an option and, besides, it would be impractical with thousands of records in the dataset.

TSC helps us to automatically identify the "optimum" number of clusters (from a technical point of view)

TRUE. Yes, at least in SPSS this is an option.

The name TWO-STEP is because TSC combines internally two different procedures in two different stages.

TRUE. Check technical annex in the class doc if you are interested in details.

Two Step can only produce solutions for continuous variables.

FALSE. In fact, this is one of the advantages of TWO-STEP (the ability to deal with a mix of continuous and categorical variables).

There is commonly a penalty to the cluster quality for assuming a certain probability distribution for continuous and categorical variables

TRUE. Check the class doc (disadvantage number 3, page 14).

TSC is part of the hierarchical clustering algorithms family

FALSE. It is not hierarchical.

TSC can NOT handle outliers automatically

FALSE. In fact, this is one of the advantages of TWO-STEP.

When using TSC in SPSS, we should standardize our METRIC clustering variables if they are not comparable in units.

TRUE, if we want to avoid an impact of units on the cluster output.

If you want to explore different traveller profiles from your large database of trip reservations during the last year, which algorithm would you use?

Two Step Cluster looks good. If we have a large dataset of transactions and, besides, we assume that we have to deal with a combination of metric and non-metric variables, a hierarchical cluster is not a good option.

Cluster Analysis Flashcards

(65 cards)