13 Quantitative Methods - Quantitative Text Analysis Flashcards
Questions for preparation
Questions on Törnberg, Anton and Petter Törnberg (2016): “Muslims in social media discourse: Combining topic modeling and critical discourse analysis”. Discourse, Context and Media 13: 132-142.
(3.1) What methods do the authors use and why?
This article uses a combination of a corpus-linguistic (CL) approach and Critical Discourse Analysis (CDA) to study the representation of Muslims and Islam in social media. The CL approach helps to deal with the vast amount of unstructured data and counters methodological weaknesses of previous studies, such as a lack of academic rigor and small data sets. The study is placed within the field of Corpus-Assisted Discourse Studies (CADS), which combines Discourse Analysis with techniques for corpus enquiries from CL. The CL approach focuses on the analysis of words and their textual context using wordlists, keywords, collocations, and concordances, backed up with measures of statistical significance.
(3.2) Based on what you read, what do you find out about the method “topic modeling”?
Topic modeling is a methodology that uses the Latent Dirichlet Allocation (LDA) algorithm to uncover recurring clusters of co-occurring words in a document collection, quantifying and visualizing the themes that arise inductively from texts. The logic behind topic modeling is that a document about a certain topic is more likely to contain the words associated with that topic. LDA is based on Bayesian statistical theory: topics and per-document topic proportions are treated as latent variables in a hierarchical probabilistic model, and the conditional distribution of those variables is approximated given an observed collection of documents.
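To make the intuition concrete, here is a minimal LDA sketch in Python using scikit-learn; the toy corpus, the choice of two topics, and all names are illustrative assumptions, not taken from the article.

```python
# Minimal LDA sketch: the corpus and K=2 are invented for illustration.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "election vote party candidate campaign",
    "mosque prayer faith religion community",
    "party leader vote parliament coalition",
    "religion belief faith church mosque",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)      # document-term matrix (bag of words)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)       # per-document topic proportions

# Top words per topic: the recurring clusters of co-occurring words
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")
```

The per-document topic proportions returned by `fit_transform` correspond to the latent variables described above.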
(3.3) What material do the authors use?
The authors use posts from a large Swedish Internet forum (the forum study referenced again in the final answer below). On how the posts are prepared for LDA, they write:
For technical reasons, standard LDA generally works best for documents with a size of at least 1000 words. We therefore aggregate all posts from each individual user in a specific thread within the same time period into chunks of 1000-word documents. Posts that are significantly longer than 1000 words are split up into smaller chunks.
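A hedged sketch of what such chunking might look like in Python; the function name and post data are hypothetical, and only the 1000-word figure comes from the quoted passage.

```python
# Aggregate one user's posts in a thread, then split the combined
# text into ~1000-word documents, as the quoted passage describes.
CHUNK_SIZE = 1000  # from the paper; everything else is illustrative

def chunk_posts(posts, chunk_size=CHUNK_SIZE):
    """Concatenate posts and split into chunks of ~chunk_size words."""
    words = " ".join(posts).split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Example: all posts by one (hypothetical) user in one thread
docs = chunk_posts(["first post text ...", "second post text ..."])
```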
Questions on Benoit, Ken (2020): “Text as Data: An Overview”. In: Luigi Curini and Robert Franzese (eds): The SAGE Handbook of Research Methods in Political Science and International Relations. SAGE Publications Ltd. (pages to read: 461-468)
(1) Why is there demand for the kind of methodology, which is treating text as data?
Never before has so much text been so readily available on such a wide variety of topics that concern our discipline. Legislative debates, party manifestos, committee transcripts, candidate and other political speeches, lobbying documents, court opinions, laws – not only are all recorded and published today, but in many cases this is in a readily available form that is easily converted into structured data for systematic analysis. When processed into structured form, this textual record provides a rich source of data to fuel the study of politics. This revolution in the quantity and availability of textual data has vastly broadened the scope of questions that can be investigated empirically, as well as the range of political actors to which they can be applied. We will never have access to direct, physical measures of the content or intensity of such core concepts as ideology, commitment to democracy or differing preferences or priorities for competing policies. But what political actors say, more than the behaviour they exhibit, provides evidence of their true inner states.
Read “Text as Text versus Text as Data”. There is a lot of information in this section, and it takes time to digest it. Consider reading this section twice: Once before you have answered question 3, and the second time after you have answered question 3.
(2) What does it mean to treat text as data?
Treating texts as data means arranging them for the purpose of analysis, using a structure that probably was not part of the process that generated the data itself. This step starts with collecting the texts into a corpus, which involves defining a sample of the available texts out of all other possible texts that might have been selected. Just as with any other research, the principles of research design govern how to choose this sample, and the choice should be guided by the research question. Once the texts are selected, we then impose substantial selection and abstraction by converting them into a more structured form of data.
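As a concrete illustration of this abstraction step, the sketch below converts a tiny invented corpus into a document-feature matrix with scikit-learn; the preprocessing choices (lowercasing, English stop words) are assumptions, not Benoit's prescriptions.

```python
# Imposing structure on raw text: the corpus becomes a matrix of
# word counts, discarding word order. Example sentences are invented.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The committee debated the new law.",
    "The court issued its opinion on the law.",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
dfm = vectorizer.fit_transform(corpus)   # rows = documents, cols = features

print(vectorizer.get_feature_names_out())
print(dfm.toarray())
```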
(1.1) What is CATA?
This article gives an overview of “computer-aided text analysis” (in essence, quantitative text analysis) in the context of conflict research. In case you are interested in quantitative text analysis, Table 3 is likely to be extremely valuable for you.
CATA:
Computer-aided text analysis (CATA) offers exciting new possibilities for conflict research that this contribution describes using a range of exemplary studies from a variety of disciplines including sociology, political science, communication studies, and computer science. This includes both conflict as it is verbalized in the news media, in political speeches, and other public documents and conflict as it occurs in online spaces (social media platforms, forums). Both (offline and online) work using inductive computational procedures, such as topic modeling, and supervised machine learning approaches are assessed, as are more traditional forms of content analysis, such as dictionaries. Finally, cross-validation is highlighted as a crucial step in CATA, in order to make the method as useful as possible to scholars interested in enlisting text mining for conflict research.
(1.2) What are the advantages and disadvantages of dictionary approaches? (see also “Conclusion”)
The article discusses the use of dictionaries and machine translation in studying conflict across languages. While compiling a multilingual dictionary can be burdensome, machine translation can be used to convert a monolingual dictionary into a multilingual one. Person and place names, specific events, and actions can be captured by such a dictionary with relative accuracy. The article also mentions the use of data mining and machine learning software to generate dictionaries or resources. Studies have also relied on dictionaries to study conflict in virtual environments, such as the analysis of hate speech on the Facebook pages of extreme-right political parties in Spain. Simple word frequencies were used to group posts into thematic categories, and terms most frequently used within each group were extracted to yield category descriptions of different groups of immigrants and other “enemies.”
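A minimal sketch of the general dictionary logic described above, in Python; the categories and terms are invented for illustration and are not from the studies discussed.

```python
# Dictionary approach: count how often terms from each hand-defined
# category occur in a post. Categories and terms are hypothetical.
dictionary = {
    "immigration": {"immigrant", "border", "asylum"},
    "economy": {"jobs", "taxes", "wages"},
}

def categorize(text):
    tokens = text.lower().split()
    return {cat: sum(tok in terms for tok in tokens)
            for cat, terms in dictionary.items()}

print(categorize("Asylum seekers and the border dominated the debate"))
# -> {'immigration': 2, 'economy': 0}
```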
(1.3) What is sentiment analysis?
The method of sentiment analysis involves analyzing text to determine whether it expresses positive or negative sentiment. This is often done using sentiment dictionaries, which classify words as positive or negative. Sentiment dictionaries exist for many languages and text types and can be quite comprehensive. In binary classification, the logarithm of the ratio of positive to negative words is often used to calculate a weighted composite score.
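A minimal sketch of such a dictionary-based score; the 0.5 smoothing constant is a common convention assumed here (it avoids division by zero), and the word lists are invented.

```python
# Log-ratio sentiment score: log of positive vs. negative word counts.
# Word lists and the 0.5 smoothing term are illustrative assumptions.
import math

positive = {"good", "peace", "agree"}
negative = {"bad", "war", "attack"}

def sentiment_score(text):
    tokens = text.lower().split()
    pos = sum(tok in positive for tok in tokens)
    neg = sum(tok in negative for tok in tokens)
    return math.log((pos + 0.5) / (neg + 0.5))

print(sentiment_score("the peace talks were good but war loomed"))
# log(2.5 / 1.5) ≈ 0.51, i.e. mildly positive
```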
(1.4) What are off-the-shelf dictionaries? How can dictionaries be created?
There are various off-the-shelf dictionaries available for different applications, such as policy areas, moral foundations, and standardized language use. These dictionaries have typically already been validated and can in some cases be used across languages. Dictionaries can also be created from scratch, for example by manually labeling a sample of texts and extracting the terms most distinctive of each category.
(1.5) How can dictionary analysis be validated?
Dictionary approaches, along with supervised and unsupervised learning, are effective at reducing complexity by turning words into category distributions. This rests on the bag-of-words philosophy of text analysis, which records only the frequency of each word within a text, not where it occurs. The approach is useful for distilling aggregate meaning from data but can lose structure and meaning during pre-processing and cleaning. Dictionaries should always be validated on the data to which they are applied, since a dictionary developed for one domain may produce inaccurate results in another. Systematically validating dictionary output, for example against a manual content analysis, helps overcome these issues.
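One way to validate systematically, sketched below: compare dictionary classifications against a manually coded “gold” sample and compute standard metrics. The labels are invented; scikit-learn's metric functions are standard.

```python
# Validate dictionary output against human annotations ("gold" labels).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["neg", "pos", "neg", "pos", "neg"]   # manual content analysis
pred = ["neg", "pos", "pos", "pos", "neg"]   # dictionary classifications

print(accuracy_score(gold, pred))            # share of matching labels
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary", pos_label="pos")
print(prec, rec, f1)
```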
(2) In section 3, the authors review “supervised methods”.
(2.1) What is the idea/intuition behind supervised methods?
Supervised machine learning (SML) is an advanced technique that can be used to analyze textual material by connecting feature distribution patterns with human judgment. This technique involves categorizing textual material according to specific criteria and then training an algorithm to make predictions based on these categories. Once the classifier is trained, it can be used to classify new material. SML involves splitting the data into training and test sets, and using metrics to evaluate the classifier’s performance. This technique can be used to validate traditional content analysis, annotate unknown material, and discover relationships between external variables that predict language use.
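A minimal sketch of this train-predict-evaluate cycle in Python; the texts, labels, and pipeline choices (TF-IDF features, logistic regression) are common defaults assumed here, not the chapter's own setup.

```python
# Supervised classification: a hand-labeled sample trains a classifier
# that then predicts labels for held-out texts. Data are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["troops crossed the border", "parliament passed the budget",
         "rebels attacked the city", "ministers debated tax reform"] * 10
labels = ["conflict", "politics", "conflict", "politics"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)                     # train on the coded sample
print(classification_report(y_test, clf.predict(X_test)))
```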
(2.2) What are the advantages and disadvantages of supervised methods? (see also “Conclusion”)
Supervised machine learning (SML) has various applications in conflict research and the social sciences. In traditional content analysis, SML can act as an “algorithmic coder” that aims to predict the consensus judgment of human coders treated as “ground truth”. However, the quality of the annotation and the relation between content and assigned codes are key, and “ground truth” should be treated with care. SML can reliably predict the topic or theme of a text but can yield poor results for other categories, such as controversy and crime. Even where classification results are less reliable, applying SML can still provide valuable insights into the quality of manual content analyses.
(3) In section 4, the authors review “unsupervised methods”.
(3.1) How are “unsupervised methods” different from “supervised methods”?
The main difference between supervised and unsupervised text as data methods is that unsupervised techniques do not require a pre-defined conceptual structure, while supervised techniques rely on a theoretically informed collection of key terms or a manually coded sample of documents to specify what is conceptually interesting about the material. Unsupervised methods work inductively and use algorithm-based techniques to help researchers discover latent features of texts.
(3.2) The authors give a somewhat lengthier explanation of two particular unsupervised methods, namely LDA and STM. Try to get a basic understanding of what these two methods are.
Unsupervised text as data techniques are useful for conflict research because they can uncover underlying clusters and structures in large amounts of text. Topic modeling is the most commonly used unsupervised method in conflict research, with latent Dirichlet allocation (LDA) and the Structural Topic Model (STM) being the most frequently applied algorithms.
Whereas the LDA algorithm assumes that topical prevalence (the frequency with which a topic is discussed) and topical content (the words used to discuss a topic) are constant across all documents, the STM allows researchers to incorporate document-level covariates into the model, which can capture variation in both.
(3.3) What are the advantages and disadvantages of unsupervised methods? (see also “Conclusion”)
Topic modeling is a popular unsupervised text analysis method with great potential for conflict research. It starts with cleaning the text corpus and making model specifications, such as determining the number of topics to be inferred from the corpus (a choice sketched below). The algorithm groups co-occurring terms into topics and estimates each topic's share in every document; researchers are responsible for labeling and interpreting these topics. Topic modeling has been used in various conflict-related studies, such as analyzing public statements on the use of force in Russia, the influence of external threats on US media themes, and patterns of speaking about Muslims and Islam in a Swedish Internet forum.
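A hedged sketch of one such model specification, comparing candidate topic numbers by held-out perplexity; this heuristic and the toy corpus are assumptions for illustration (coherence measures are often preferred in practice).

```python
# Compare candidate numbers of topics K with perplexity as a rough
# heuristic (lower is roughly better). Corpus is invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["war conflict troops border", "election vote party campaign",
        "peace treaty negotiation", "parliament law budget vote"] * 5

dtm = CountVectorizer().fit_transform(docs)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(dtm)
    print(k, lda.perplexity(dtm))
```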