13 Quantitative Methods - Quantitative Text Analysis Flashcards
Questions for preparation
Questions on Törnberg, Anton and Petter Törnberg (2016): “Muslims in social media discourse: Combining topic modeling and critical discourse analysis”. Discourse, Context and Media 13: 132-142.
(3.1) What methods do the authors use and why?
This article uses a combination of a corpus-linguistic (CL) approach and Critical Discourse Analysis (CDA) to study the representation of Muslims and Islam in social media. The CL approach helps to deal with the vast amount of unstructured data and counters methodological weaknesses of previous studies, such as a lack of academic rigor and small data sets. The study is placed within the field of Corpus-Assisted Discourse Studies (CADS), which combines Discourse Analysis with techniques for corpus enquiries from CL. The CL approach focuses on the analysis of words and their textual context using wordlists, keywords, collocations, and concordances, backed up with measures of statistical significance.
(3.2) Based on what you read, what do you find out about the method “topic modeling”?
Topic modeling is a methodology that uses the Latent Dirichlet Allocation (LDA) algorithm to uncover recurring clusters of co-occurring words in a document collection, quantifying and visualizing the themes that arise inductively from texts. The logic behind topic modeling is that a document about a certain topic is more likely to contain the words associated with that topic. LDA is based on Bayesian statistical theory: topics and per-document topic proportions are treated as latent variables in a hierarchical probabilistic model, and the conditional distribution of those variables is approximated given an observed collection of documents.
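To make the intuition concrete, here is a minimal LDA sketch in Python using scikit-learn; the toy corpus, the choice of two topics, and all names are illustrative assumptions, not taken from the article.

```python
# Minimal LDA sketch: the corpus and K=2 are invented for illustration.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "election vote party candidate campaign",
    "mosque prayer faith religion community",
    "party leader vote parliament coalition",
    "religion belief faith church mosque",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)      # document-term matrix (bag of words)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)       # per-document topic proportions

# Top words per topic: the recurring clusters of co-occurring words
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")
```

The per-document topic proportions returned by `fit_transform` correspond to the latent variables described above.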
(3.3) What material do the authors use?
The authors use posts from a large Swedish Internet forum (the forum study referenced again in the final answer below). On how the posts are prepared for LDA, they write:
For technical reasons, standard LDA generally works best for documents with a size of at least 1000 words. We therefore aggregate all posts from each individual user in a specific thread within the same time period into chunks of 1000-word documents. Posts that are significantly longer than 1000 words are split up into smaller chunks.
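A hedged sketch of what such chunking might look like in Python; the function name and post data are hypothetical, and only the 1000-word figure comes from the quoted passage.

```python
# Aggregate one user's posts in a thread, then split the combined
# text into ~1000-word documents, as the quoted passage describes.
CHUNK_SIZE = 1000  # from the paper; everything else is illustrative

def chunk_posts(posts, chunk_size=CHUNK_SIZE):
    """Concatenate posts and split into chunks of ~chunk_size words."""
    words = " ".join(posts).split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Example: all posts by one (hypothetical) user in one thread
docs = chunk_posts(["first post text ...", "second post text ..."])
```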
Questions on Benoit, Ken (2020): “Text as Data: An Overview”. In: Luigi Curini and Robert Franzese (eds): The SAGE Handbook of Research Methods in Political Science and International Relations. SAGE Publications Ltd. (pages to read: 461-468)
(1) Why is there demand for the kind of methodology, which is treating text as data?
Never before has so much text been so readily available on such a wide variety of topics that concern our discipline. Legislative debates, party manifestos, committee transcripts, candidate and other political speeches, lobbying documents, court opinions, laws – not only are all recorded and published today, but in many cases this is in a readily available form that is easily converted into structured data for systematic analysis. When processed into structured form, this textual record provides a rich source of data to fuel the study of politics. This revolution in the quantity and availability of textual data has vastly broadened the scope of questions that can be investigated empirically, as well as the range of political actors to which they can be applied. We will never have access to direct, physical measures of the content or intensity of such core concepts as ideology, commitment to democracy or differing preferences or priorities for competing policies. But what political actors say, more than the behaviour they exhibit, provides evidence of their true inner states.
Read “Text as Text versus Text as Data”. There is a lot of information in this section, and it takes time to digest it. Consider reading this section twice: Once before you have answered question 3, and the second time after you have answered question 3.
(2) What does it mean to treat text as data?
Treating texts as data means arranging them for the purpose of analysis, using a structure that probably was not part of the process that generated the data itself. This step starts with collecting the texts into a corpus, which involves defining a sample of the available texts out of all other possible texts that might have been selected. Just as with any other research, the principles of research design govern how to choose this sample, and the choice should be guided by the research question. Once the texts are selected, we then impose substantial selection and abstraction by converting them into a more structured form of data.
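As a concrete illustration of this abstraction step, the sketch below converts a tiny invented corpus into a document-feature matrix with scikit-learn; the preprocessing choices (lowercasing, English stop words) are assumptions, not Benoit's prescriptions.

```python
# Imposing structure on raw text: the corpus becomes a matrix of
# word counts, discarding word order. Example sentences are invented.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The committee debated the new law.",
    "The court issued its opinion on the law.",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dfm = vectorizer.fit_transform(corpus)   # rows = documents, cols = features

print(vectorizer.get_feature_names_out())
print(dfm.toarray())
```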
(1.1) What is CATA?
This article gives an overview of “computer-aided text analysis” (in essence, quantitative text analysis) in the context of conflict research. In case you are interested in quantitative text analysis, Table 3 is likely to be extremely valuable for you.
CATA:
Computer-aided text analysis (CATA) offers exciting new possibilities for conflict research that this contribution describes using a range of exemplary studies from a variety of disciplines including sociology, political science, communication studies, and computer science. This includes both conflict as it is verbalized in the news media, in political speeches, and other public documents and conflict as it occurs in online spaces (social media platforms, forums). Both (offline and online) work using inductive computational procedures, such as topic modeling, and supervised machine learning approaches are assessed, as are more traditional forms of content analysis, such as dictionaries. Finally, cross-validation is highlighted as a crucial step in CATA, in order to make the method as useful as possible to scholars interested in enlisting text mining for conflict research.
(1.2) What are the advantages and disadvantages of dictionary approaches? (see also “Conclusion”)
The article discusses the use of dictionaries and machine translation in studying conflict across languages. While compiling a multilingual dictionary can be burdensome, machine translation can be used to convert a monolingual dictionary into a multilingual one. Person and place names, specific events, and actions can be captured by such a dictionary with relative accuracy. The article also mentions the use of data mining and machine learning software to generate dictionaries or resources. Studies have also relied on dictionaries to study conflict in virtual environments, such as the analysis of hate speech on the Facebook pages of extreme-right political parties in Spain. Simple word frequencies were used to group posts into thematic categories, and terms most frequently used within each group were extracted to yield category descriptions of different groups of immigrants and other “enemies.”
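A minimal sketch of the general dictionary logic described above, in Python; the categories and terms are invented for illustration and are not from the studies discussed.

```python
# Dictionary approach: count how often terms from each hand-defined
# category occur in a post. Categories and terms are hypothetical.
dictionary = {
    "immigration": {"immigrant", "border", "asylum"},
    "economy": {"jobs", "taxes", "wages"},
}

def categorize(text):
    tokens = text.lower().split()
    return {cat: sum(tok in terms for tok in tokens)
            for cat, terms in dictionary.items()}

print(categorize("Asylum seekers and the border dominated the debate"))
# -> {'immigration': 2, 'economy': 0}
```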
(1.3) What is sentiment analysis?
The method of sentiment analysis involves analyzing text to determine whether it expresses positive or negative sentiment. This is often done using sentiment dictionaries, which classify words as positive or negative. Sentiment dictionaries exist for many languages and text types and can be quite comprehensive. In binary classification, the logarithm of the ratio of positive to negative words is often used to calculate a weighted composite score.
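A minimal sketch of such a dictionary-based score; the 0.5 smoothing constant is a common convention assumed here (it avoids division by zero), and the word lists are invented.

```python
# Log-ratio sentiment score: log of positive vs. negative word counts.
# Word lists and the 0.5 smoothing term are illustrative assumptions.
import math

positive = {"good", "peace", "agree"}
negative = {"bad", "war", "attack"}

def sentiment_score(text):
    tokens = text.lower().split()
    pos = sum(tok in positive for tok in tokens)
    neg = sum(tok in negative for tok in tokens)
    return math.log((pos + 0.5) / (neg + 0.5))

print(sentiment_score("the peace talks were good but war loomed"))
# log(2.5 / 1.5) ≈ 0.51, i.e. mildly positive
```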
(1.4) What are off-the-shelf dictionaries? How can dictionaries be created?
There are various off-the-shelf dictionaries available for different applications, such as policy areas, moral foundations, and standardized language use. These dictionaries have typically already been validated and can in some cases be used across languages. Dictionaries can also be created from scratch, for example by manually labeling a sample of texts and extracting the terms most distinctive of each category.
(1.5) How can dictionary analysis be validated?
Dictionary approaches, along with supervised and unsupervised learning, are effective at reducing complexity by turning words into category distributions. This rests on the bag-of-words philosophy of text analysis, which records only the frequency of each word within a text, not where it occurs. The approach is useful for distilling aggregate meaning from data but can lose structure and meaning during pre-processing and cleaning. Dictionaries should always be validated on the data to which they are applied, since a dictionary developed for one domain may produce inaccurate results in another. Systematically validating dictionary output, for example against a manual content analysis, helps overcome these issues.
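One way to validate systematically, sketched below: compare dictionary classifications against a manually coded “gold” sample and compute standard metrics. The labels are invented; scikit-learn's metric functions are standard.

```python
# Validate dictionary output against human annotations ("gold" labels).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["neg", "pos", "neg", "pos", "neg"]   # manual content analysis
pred = ["neg", "pos", "pos", "pos", "neg"]   # dictionary classifications

print(accuracy_score(gold, pred))            # share of matching labels
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary", pos_label="pos")
print(prec, rec, f1)
```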
(2) In section 3, the authors review “supervised methods”.
(2.1) What is the idea/intuition behind supervised methods?
Supervised machine learning (SML) is an advanced technique that can be used to analyze textual material by connecting feature distribution patterns with human judgment. This technique involves categorizing textual material according to specific criteria and then training an algorithm to make predictions based on these categories. Once the classifier is trained, it can be used to classify new material. SML involves splitting the data into training and test sets, and using metrics to evaluate the classifier’s performance. This technique can be used to validate traditional content analysis, annotate unknown material, and discover relationships between external variables that predict language use.
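A minimal sketch of this train-predict-evaluate cycle in Python; the texts, labels, and pipeline choices (TF-IDF features, logistic regression) are common defaults assumed here, not the chapter's own setup.

```python
# Supervised classification: a hand-labeled sample trains a classifier
# that then predicts labels for held-out texts. Data are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["troops crossed the border", "parliament passed the budget",
         "rebels attacked the city", "ministers debated tax reform"] * 10
labels = ["conflict", "politics", "conflict", "politics"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)                     # train on the coded sample
print(classification_report(y_test, clf.predict(X_test)))
```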
(2.2) What are the advantages and disadvantages of supervised methods? (see also “Conclusion”)
Supervised machine learning (SML) has various applications in conflict research and the social sciences. In traditional content analysis, SML can act as an “algorithmic coder” that aims to predict the consensus judgment of human coders treated as “ground truth”. However, the quality of the annotation and the relation between content and assigned codes are key, and “ground truth” should be treated with care. SML can reliably predict the topic or theme of a text but can yield poor results for other categories, such as controversy and crime. Even where classification results are less reliable, applying SML can still provide valuable insights into the quality of manual content analyses.
(3) In section 4, the authors review “unsupervised methods”.
(3.1) How are “unsupervised methods” different from “supervised methods”?
The main difference between supervised and unsupervised text as data methods is that unsupervised techniques do not require a pre-defined conceptual structure, while supervised techniques rely on a theoretically informed collection of key terms or a manually coded sample of documents to specify what is conceptually interesting about the material. Unsupervised methods work inductively and use algorithm-based techniques to help researchers discover latent features of texts.
(3.2) The authors give a somewhat lengthier explanation of two particular unsupervised methods, namely LDA and STM. Try to get a basic understanding of what these two methods are.
Unsupervised text as data techniques are useful for conflict research because they can uncover underlying clusters and structures in large amounts of text. Topic modeling is the most commonly used unsupervised method in conflict research, with latent Dirichlet allocation (LDA) and the Structural Topic Model (STM) being the most frequently applied algorithms.
Whereas the LDA algorithm assumes that topical prevalence (the frequency with which a topic is discussed) and topical content (the words used to discuss a topic) are constant across all documents, the STM allows researchers to incorporate document-level covariates into the model, which can capture variation in both.
(3.3) What are the advantages and disadvantages of unsupervised methods? (see also “Conclusion”)
Topic modeling is a popular unsupervised text analysis method with great potential for conflict research. It starts with cleaning the text corpus and making model specifications, such as determining the number of topics to be inferred from the corpus (a choice sketched below). The algorithm groups co-occurring terms into topics and estimates each topic's share in every document; researchers are responsible for labeling and interpreting these topics. Topic modeling has been used in various conflict-related studies, such as analyzing public statements on the use of force in Russia, the influence of external threats on US media themes, and patterns of speaking about Muslims and Islam in a Swedish Internet forum.
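A hedged sketch of one such model specification, comparing candidate topic numbers by held-out perplexity; this heuristic and the toy corpus are assumptions for illustration (coherence measures are often preferred in practice).

```python
# Compare candidate numbers of topics K with perplexity as a rough
# heuristic (lower is roughly better). Corpus is invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["war conflict troops border", "election vote party campaign",
        "peace treaty negotiation", "parliament law budget vote"] * 5

dtm = CountVectorizer().fit_transform(docs)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(dtm)
    print(k, lda.perplexity(dtm))
```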