12 Quantitative Methods - Cluster Analysis Flashcards

1
Q

Questions on Ettensperger, Felix and Elina Schleutker (2022): “Identification of Cross-Country
Similarities and Differences in Regulation of Religion Between 2000-2014 with Help of Cluster
Analysis”. Politics and Religion 15: 526-558.
(1) Read the introduction: What is the research question?

A

“comprehensive classification of both democratic and authoritarian countries in 2000 and
2014 when it comes to regulation of religion”

2
Q

(2) Read the definition of regulation of religion on pages 527-528. What are the different
types of regulation of religion?

A

It is common to distinguish between, on the one hand, government regulation of religion
and, on the other hand, social regulation of religion (i.e., regulation imposed on religious
groups or individuals by non-governmental actors).
As for the government regulation, it is further customary to distinguish between two
different dimensions, namely positive endorsement of and negative restrictions on
religion.

3
Q

(3) Read the methodology-section.
(3.1) Why do the authors find cluster analysis to be particularly suited for their study?

A

We find cluster analysis particularly suited for our study, as it enables the classification of
a large number of countries and further makes it possible to study if and how the
clustering of the countries changes over time and when a different set of indicators is
used. Thus, a comparison of the results from various cluster analyses makes it possible
to detect robust cluster patterns within the data, find out which countries change cluster
affinity over time and identify both outliers as well as borderline cases between two
clusters.

4
Q

(3.2) What are dendrograms and how should they be read (see also question 5.1)?

A

The clustering is visualized in a dendrogram, showing us the closeness and relationship
between country-cases in our sample. The closer the cases are connected via the
branches of the tree diagram, the higher the similarities between these individual cases
are. This allows us to compare our results not only to quantitative, but also to qualitative
studies in the domain of regulation of religion. We can evaluate if previous observations
about the similarities and differences in state–religion relationships are reflected in the
empirical data by evaluating the distance of cases within clusters.

5
Q

(3.3) Make a list of the different steps the authors take in their cluster analysis. (You will not understand much of the different steps for now, and should read the next reading by König and Jäckle, which will clarify these technicalities)

A

Regarding the cluster analysis, the first step of the empirical investigations was to study what the mathematically best number of clusters is. This is important, as the identification of the mathematically optimal number of clusters will minimize the number of countries with low cluster affinity.
The results from each cluster analysis are shown in dendrogram format, which makes it possible to study how the cluster trees are generated, and how the internal structure, the existence of sub-clusters, and the proximity of cases inside of clusters are constituted.
To study the quality of the formed clusters, we employ silhouette analysis (see Rousseeuw 1987). The silhouette width of an individual country can vary between −1 and 1. Values close to 1 indicate a good fit (the country is very similar to the other countries in the cluster), whereas values close to −1 indicate a poor fit (the country is very dissimilar from the other countries in the cluster).
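As a minimal sketch of the silhouette width described above: for a case i, s(i) = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the lowest mean distance to the members of any other cluster. The one-dimensional toy scores, the labels, and the use of the absolute difference as distance are illustrative assumptions, not the authors' data or setup.

```python
def silhouette_widths(points, labels):
    """Return the silhouette width s(i) for every point.

    points: list of numbers (hypothetical 1-D toy data)
    labels: list of cluster labels, same length as points
    """
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from point i to the members of each cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(abs(p - q))
        if own not in by_cluster:
            # Convention: the only member of a cluster gets width 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the own cluster.
        a = sum(by_cluster[own]) / len(by_cluster[own])
        # b: lowest mean distance to the members of any other cluster.
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical regulation scores; the last case is deliberately misplaced.
scores = [1.0, 1.2, 0.9, 5.0, 5.3, 3.5]
clusters = ["low", "low", "low", "high", "high", "low"]
widths = silhouette_widths(scores, clusters)
for score, cluster, width in zip(scores, clusters, widths):
    print(f"{score:4.1f} {cluster:5s} {width:+.2f}")
```

In this toy data the first case fits its cluster well (width close to 1), while the last case sits closer to the other cluster and gets a negative width.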

6
Q

(3.4) Briefly explain where the data comes from and what the case selection is.

A

The measurement of regulation of religion is based on the third round of the Religion and State project, RAS3 (Fox 2019). The RAS3 dataset includes altogether 36 variables on discrimination against minority religions, 29 types of regulation of and restrictions on the majority religion and all religions, and 27 types of non-government discrimination, harassment, acts of prejudice and violence against minority religions. All these variables are coded from 0 to 3.

7
Q

(5) Read section “results for authoritarian regimes”. In essence, the authors find that it is possible to distinguish between three different groups of authoritarian regimes based on the levels of regulation.
(5.1) Which countries belong to the different groups (to get a detailed list of the countries in each group, you can study the dendrograms)?

A

Cluster 1 consists of 19 countries mostly located in the MENA region. Cluster 2 is the largest cluster (40 countries, mainly located in Sub-Saharan Africa) with low average levels of regulation. Finally, in the third cluster, we find 18 countries from various geographical locations. With the exception of Myanmar, Syria, and Turkey, all countries in this cluster have experienced communist rule.

8
Q

(5.3) What are the authors’ main results when it comes to the comparison between years and to previous research?

A

As already mentioned above, the clustering of the countries changes somewhat in 2000 depending on the indicators that are included in the cluster analysis, whereas in 2014 the clustering is almost the same regardless of the included indicators.
A comparison of our results to the studies listed in Table 1 shows that in general, our results are similar to previous attempts to cluster authoritarian countries, and consequently also compatible with the theoretical frameworks that underlie these classifications. In contrast to these previous studies, however, our classification provides the empirically most rigorous findings, demonstrates that the clusters (especially in 2014) are relatively stable regardless of which indicators are studied, and allows us to identify countries that are borderline cases between two clusters and thus difficult to classify.

9
Q

Questions on König, Pascal D. and Sebastian Jäckle (2017): “Clusteranalyse”. In: Jäckle, Sebastian (ed.): “Neue Trends in den Sozialwissenschaften: Innovative Techniken für qualitative und quantitative Forschung”. Springer VS. (pages to read: 51-84).
(1) Read sections 1 and 2. What is cluster analysis and what is it good for?

A

Cluster analysis is an inductive method for finding and building group structures in data. It shows how observations can be classified into groups based on qualitative and/or quantitative differences. The groups should be internally homogeneous and clearly distinguishable from one another. They are not defined beforehand but are derived from the data.

10
Q

There is no single form of clustering. The main types are hierarchical methods and partitioning methods (e.g., k-means).
(2.1) What is the difference between agglomerative clustering (bottom-up) and divisive clustering (top-down)?

A

Both agglomerative (bottom-up) and divisive (top-down) clustering are hierarchical methods. Agglomerative clustering starts with every observation as its own cluster and merges (“melts”) them step by step into larger clusters. Divisive clustering starts with one cluster containing all observations and splits it step by step until only single objects remain.

11
Q

(2.2) What is the purpose of similarity measures (Ähnlichkeitsmaßen) in clustering?

A

Similarity measures (Ähnlichkeitsmaße) and distance measures are computed for pairs of objects on the features of interest. They make it possible to assess whether two objects are similar enough to be put into the same cluster. Examples of similarity measures are the matching coefficient, the phi coefficient, and the Rogers-Tanimoto coefficient.
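The three coefficients named above can be sketched for two objects described by binary (0/1) features. The sketch assumes the textbook definitions based on the four match counts a (both 1), b (only the first is 1), c (only the second is 1) and d (both 0); the function names and the example vectors are hypothetical.

```python
import math


def match_counts(x, y):
    """Count the four combinations of two binary feature vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def simple_matching(x, y):
    # Share of features on which both objects agree.
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def rogers_tanimoto(x, y):
    # Like simple matching, but disagreements count double.
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + d + 2 * (b + c))


def phi_coefficient(x, y):
    # Correlation of two binary variables; ranges from -1 to 1.
    # Undefined (division by zero) if one vector is constant.
    a, b, c, d = match_counts(x, y)
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical presence/absence of six regulations in two countries.
country_x = [1, 1, 0, 0, 1, 0]
country_y = [1, 0, 0, 0, 1, 1]
print(simple_matching(country_x, country_y))
print(rogers_tanimoto(country_x, country_y))
print(phi_coefficient(country_x, country_y))
```

The two vectors agree on four of six features, so the measures rank the pair as moderately similar, with Rogers-Tanimoto being the stricter of the two matching-based measures.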

12
Q

(2.3) The authors describe several clustering methods, which are illustrated in Table 3. Try to make sense of each of these methods, and their purpose.

A

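Single linkage: the distance between two clusters is the shortest distance between any two of their objects (one from each cluster).
Complete linkage: the distance between two clusters is the longest distance between any two of their objects.
Average linkage: the distance between two clusters is the average distance over all pairs of their objects.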
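A small sketch of the three linkage rules, assuming one-dimensional toy values and the absolute difference as distance (both are illustrative choices, as are the function names):

```python
from itertools import product


def pair_distances(cluster_a, cluster_b):
    """All distances between one object from each cluster."""
    return [abs(p - q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    # Distance of the two closest objects.
    return min(pair_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    # Distance of the two most distant objects.
    return max(pair_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    # Mean distance over all cross-cluster pairs.
    distances = pair_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


left = [1.0, 2.0]
right = [4.0, 7.0]
print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 6.0
print(average_linkage(left, right))   # 4.0
```

The same pair of clusters is therefore "closer" or "further apart" depending on the linkage rule, which is why the choice of method changes the order in which clusters are merged.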

13
Q

(2.4) What are the strengths and weaknesses of hierarchical clustering?

A

There are no firmly established criteria for hierarchical clustering, so it remains more exploratory. This makes it more flexible and more open to interpretation, but also more complex and less clear-cut. It is usually limited to a maximum of several hundred objects, i.e., a smaller database. With so few objects, however, a handful of outliers can strongly affect the dendrogram and its structure.

14
Q

(2.5) What is a dendrogram and what does it show?

A

Dendrogram: based on distance or similarity measures, it shows the “clustering levels” of the objects and the connections between objects and groups. Because each object can only be allocated to one superordinate group, the structure is hierarchical. The length of the branches indicates cluster homogeneity: the longer a branch runs before the next merge, the more homogeneous the cluster. The optimal number of clusters is found by drawing a line across the dendrogram at a level where all cut branches are far enough from the next merge point (“melting point”). In the example, a two-cluster solution is best, followed by a four-cluster solution.
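The merge levels that a dendrogram draws can be sketched with a small agglomerative procedure. This is an illustration under stated assumptions (one-dimensional toy values, single linkage, and a simple "largest gap between merge levels" rule for choosing the cut), not the procedure used in the reading.

```python
def agglomerate(points):
    """Merge the two closest clusters until one is left; record each merge."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance of the closest pair of objects.
                d = min(abs(p - q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


points = [1.0, 1.5, 5.0, 5.4, 9.0]
merges = agglomerate(points)
heights = [height for height, _, _ in merges]

# Cut where the jump to the next merge level is largest.
gaps = [heights[t + 1] - heights[t] for t in range(len(heights) - 1)]
biggest = gaps.index(max(gaps))
n_clusters = len(points) - (biggest + 1)

for height, first, second in merges:
    print(f"merge at {height:.1f}: {first} + {second}")
print("suggested number of clusters:", n_clusters)
```

Here the first two merges happen at small distances and the next one only at a much larger distance, so cutting between them leaves three clusters.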

15
Q

(3.1) What is k-means clustering and how does it differ from hierarchical clustering?

A

It is a variant of partitioning clustering. In hierarchical clustering, the number of clusters is determined afterwards, based on the analysis. In partitioning clustering, the number of clusters is fixed in advance, and each cluster ideally has its own mean (cluster centre).

16
Q

(3.2) What are the four steps in k-means clustering?

A
  1. The objects are allocated randomly to the clusters.
  2. The cluster centres are calculated from the objects in each cluster.
  3. Each object is allocated to the cluster whose centre has the lowest squared Euclidean distance to it. This lowers the sum of squared deviations within the clusters and increases the sum of squared deviations between the clusters, so the clusters become more homogeneous internally and more heterogeneous relative to each other.
  4. The algorithm iterates between steps 2 and 3 until the within-cluster sum of squares reaches a minimum and the allocation of objects to clusters no longer changes. Even then, the solution is not necessarily the global minimum; it may be only a local one.
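The four steps above can be sketched in a few lines. This is a minimal illustration, assuming one-dimensional toy data, a fixed random seed, and a simple rule for empty clusters (re-seeding with a random object); none of these choices comes from the reading.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: allocate every object to a random cluster.
    labels = [rng.randrange(k) for _ in points]
    centres = [0.0] * k
    for _ in range(max_iter):
        # Step 2: each cluster centre is the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random object.
            centres[c] = sum(members) / len(members) if members else rng.choice(points)
        # Step 3: reallocate each object to the centre with the lowest
        # squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: (p - centres[c]) ** 2) for p in points
        ]
        # Step 4: repeat steps 2 and 3 until the allocation is stable.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centres


def within_ss(points, labels, centres):
    """Sum of squared deviations of the objects from their cluster centre."""
    return sum((p - centres[lab]) ** 2 for p, lab in zip(points, labels))


data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
labels, centres = k_means(data, k=2)
print(labels, centres, within_ss(data, labels, centres))
```

Because the result can depend on the random start, running the function with several values of `seed` and comparing `within_ss` is one simple way to check for local minima.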
17
Q

(3.3) What are the strengths and weaknesses of k-means clustering?

A

Weaknesses: First, the algorithm converges worse when the clusters overlap substantially. A much larger sample (ca. 1,000 objects) is then needed to reach a reasonably stable solution, whereas ca. 50 objects are enough for a cluster structure with almost no overlap. Second, k-means does not necessarily find the globally optimal solution and may end in different local optima. To check how sensitive the result is to the starting points, it is recommended to run the algorithm repeatedly from different starting points. If the results largely match, the global minimum has probably been found. The more random starting points are needed to find a global minimum, the harder the real clusters are to distinguish from each other.
Third, the number of clusters must be fixed in advance although the researcher does not know it beforehand. There are two remedies: run the analysis with different numbers of clusters and compare the results (differences in cluster homogeneity indicate the optimal number), or first use a different type of cluster analysis (e.g., hierarchical agglomerative clustering, on a random sample if the dataset is too large) and take its solution as the starting point for the k-means analysis. K-means is also highly susceptible to outliers in the data. And because the algorithm is based on the arithmetic mean, the variables need to be metrically scaled.

18
Q

(4) Read section 3.3. What is fuzzy-clustering, and what are its strengths and weaknesses?

A

To loosen the discrete allocation to clusters, fuzzy clustering can be used. For each observation it produces membership values for the k clusters on a scale from 0 to 1. Values above 0.5 are interpreted as belonging (zugehörig) to a cluster, values below 0.5 as not belonging. The values can also be read as probability scores. Starting from pre-selected or randomly selected cluster means, the membership values of the observations are calculated (they lie between 0 and 1), and these in turn are used to recalculate the cluster means.
Positives: The membership values (between 0 and 1) allow a weighted calculation of the cluster means. Fuzzy clustering is preferable to k-means when there are ambiguous cases and outliers.
Negatives: The method converges to a minimum of the weighted deviations of the observations from the cluster means. There is therefore a risk that only a local minimum is reached, although the probability is lower than with k-means. Outliers and the need to pre-determine the number of clusters can also be problematic.
Fuzzy-means clustering is less suited if several entirely different cluster solutions are empirically possible. Another weakness is that, like k-means, it tends to extract spherical clusters.
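A sketch of how membership values between 0 and 1 and weighted cluster means can be computed. It assumes the common fuzzy c-means formulas with fuzzifier m = 2, which the answer above does not spell out, so the exact formula, the function names, and the toy values are assumptions rather than the reading's own specification.

```python
def memberships(point, centres, m=2.0):
    """Membership of one observation in each cluster; values sum to 1."""
    dists = [abs(point - c) for c in centres]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # An observation lying exactly on a centre belongs fully to it.
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    # Closer centres get a larger weight; m controls the fuzziness.
    weights = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def weighted_centre(points, cluster_memberships, m=2.0):
    """Cluster mean in which every observation counts by its membership."""
    weights = [u ** m for u in cluster_memberships]
    return sum(w * p for w, p in zip(weights, points)) / sum(weights)


centres = [1.0, 8.0]
clear_case = memberships(1.5, centres)   # close to the first centre
borderline = memberships(4.5, centres)   # exactly between both centres
print(clear_case, borderline)
```

The borderline observation receives a membership of 0.5 in both clusters, which is the kind of ambiguous case that a hard allocation such as k-means would have to force into one cluster.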

19
Q

(5) Read section 3.4. What is two-step cluster analysis, and when is it usually used?

A

Because hierarchical-agglomerative analysis is suited only to rather small samples, two-step cluster analysis was developed to handle very large datasets (for example, millions of cases). It is also suited to exploratory data mining and can process both metric and categorical variables. The number of clusters does not have to be fixed in advance, nor does the optimal number have to be chosen afterwards by largely interpretative means; instead, it is determined in a consistent manner.

20
Q

How did Esping-Andersen impact the use of cluster analysis?

A

Before Esping-Andersen, welfare states were compared based on their welfare spending. But the same amount of welfare spending can go to structurally completely different types of programmes in two different countries (e.g., Germany vs. Sweden). Esping-Andersen therefore created classifications based on qualitative differences (e.g., Anglo-Saxon, conservative, Nordic).