lecture 12 - textual/content analysis + big data Flashcards
textual documents
= major sources of info in IR/Polsci
- official/public documents and records = gov., parliament, courts, parties, etc.
- personal documents = letters, emails, diaries
- cultural documents = mass media (newspaper, tv), entertainment, film, literature, art, cartoons
- social media (big data) = twitter, facebook, instagram; !important to understand how people think about politics
- research data: questionnaires, interview transcripts etc.
stored in archives, sometimes online databases
need for textual analysis - different qualitative and quantitative approaches (e.g. counting how often a word is used)
- discourse analysis
- content analysis
discourse analysis vs content analysis
discourse =
- interpretative
- starts with assumptions about the real world (e.g. marxism, feminism, post-colonialism)
- puts the text in its broader context
- is very much about the interpretation of the researcher based on how he/she thinks the world works
content =
- systematic qualitative and/or quantitative analysis
discourse analysis
not necessarily steps; the text is used as an illustration of how the world works according to the researcher
intention of source, effects on audience and context all flow together
- interpretive and constructivist approach: don’t assume there is an objective reality that can be observed, it is the interpretation that matters
- idea that texts reflect underlying structures and that discourse analysis allows the researcher to uncover them
- meanings are socially and discursively constructed (uncover how discursive practices construct meanings through production, dissemination, consumption of texts)
- interaction of discourse and context is essential
e.g.
post-structuralism, speech act theory, critical discourse analysis
!is not an easy method
validity = seen as plausibility and credibility (like in ethnographic research)
textual analysis
= quantitative content analysis as starting point
- manifest content = what is in the text (e.g. how often a word is used, a name, a topic is mentioned)
- description and comparison
e.g. comparison over time
is driven by the content that is there
but also more interpretative -> qualitative content analysis
- intention of source
- social, political, economic context (to give the content meaning)
- effects on audience (can’t be told by only looking at the source) = hard for content analysis
content analysis
- systematic analysis of textual info
- unobtrusive method of data collection: the data is already there (no need to run an experiment or a survey)
-> eliminates some threats to validity, e.g. reactivity (text can’t respond to being observed)
-> easy access to study objects
-> not restricted in time
quantitative = focus on manifest content (what is written literally)
- frequency and valence of words
qualitative content analysis = focus on latent content: interpretation of the meaning of words, of the context
usually: research in between quantitative and qualitative, a combination
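As a rough illustration (not from the lecture), counting the frequency of a manifest item such as a specific word could look like this in Python; the example texts are invented:

```python
from collections import Counter
import re

# invented example texts; any list of plain-text documents would work
texts = [
    "The government proposes new migration rules.",
    "Opposition parties criticise the migration proposal and migration policy.",
]

def word_frequencies(text):
    """Count how often each word appears (manifest content)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

for i, text in enumerate(texts):
    freq = word_frequencies(text)
    print(f"document {i}: 'migration' mentioned {freq['migration']} times")
```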
quantitative content analysis: definitions
content analysis is a research technique for the objective, systematic, and quantitative description of the manifest content of communication
(Berelson 1952)
- objective as bias free and replicable
- systematic: explicit and consistent (coding) rules and procedures = how to summarize etc.
- quantitative = quantifiable, using numerical variables
= positivist approach for content analysis
content analysis: steps
1. unit of analysis & case selection: type of text, accessibility, unit (size, section, duration)
population & sampling
- time period
- if population is too large, selection procedure for representative sample
probability sampling is rarely done
nonprobability/systematic sampling, e.g. time interval or 'virtual month' (= taking a week from different months so that 4 weeks spread across the year are covered -> representative; e.g. take week 1 of february, week 2 of june, week 3 of august, week 4 of november)
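A minimal sketch of the 'virtual month' idea, assuming one week is drawn from each quarter of the year (the concrete month/week choices are just illustrative, not prescribed by the lecture):

```python
import random

def virtual_month(seed=None):
    """Draw 4 (month, week) pairs spread across the year: one week per quarter."""
    rng = random.Random(seed)
    quarters = [(1, 3), (4, 6), (7, 9), (10, 12)]
    sample = []
    for lo, hi in quarters:
        month = rng.randint(lo, hi)   # pick a month within the quarter
        week = rng.randint(1, 4)      # pick a week within that month
        sample.append((month, week))
    return sample

print(virtual_month(seed=1))  # four (month, week) pairs, one from each quarter
```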
categories :
- content categories
mutually exclusive & exhaustive: needs to cover all possible actors/categories + everyone needs to fit in only one category
manifest (e.g. names, roles) vs latent (e.g. words with positive vs negative meaning)
intensity (how often something is mentioned) vs valence (evaluations: positive, negative; needs some form of interpretation)
- how do we get categories? = category development
- starting with theory: Deductive -> a priori codes -> closed coding (apply classes you already have to classify the text)
- inductive -> grounded codes (dev. through observation) -> open coding (categorize whilst reading)
- unit of measurement
recording unit / unit of content = not just a newspaper, will you code each article, each paragraph, each sentence?
- physical unit = e.g. 1 code per square cm
- symbolic units:
syntactical (discrete units of language, e.g. words, sentence, paragraph, articles/stories) vs
referential (physical or temporal units, e.g. every time a certain event, person or object is mentioned) vs
thematic (topics within messages, e.g. migration)
- Coding:
a priori coding = defined in advance = typical approach of quantitative content analysis
- create a codebook (lists categories) with coding rules, coding categories and codes
coding by recording codes in a coding sheet
open/grounded coding = developed by reading and coding in the text = done primarily in qualitative analysis, but also in quantitative analysis when developing a codebook
- coding process: creating a coding protocol, creating codes to tag text, coding of content by assigning tags to text
coding and categories
content categories
- mutually exclusive: each 'content' needs to fit in only one category (something should not be both positive and negative, male and female, observer and participant, e.g.)
- exhaustive: needs to cover all possible actors/categories
- manifest = literal (e.g. names, roles)
- latent = upon interpretation (e.g. whether words have a positive or negative meaning); need to explain how you categorize
- can measure intensity (how often something is mentioned)
- can measure valence (evaluations: positive, negative, needs some form of interpretation)
how do you get categories?
quantitative = deductive - a priori codes - closed coding (apply existing codes to classify the text)
qualitative = inductive - grounded codes - open codes
(develop code through observation/reading the text)
coding:
qualitative = grounded codes = create a coding protocol (e.g. when I encounter a new political actor I will create a code/tag), tag whilst reading and then afterward classify and code/summarize the content
quantitative = a priori coding = codebook (lists categories and codes), apply the codes to the category/unit you want to analyse
- use a code sheet to enter the codes (a table with the categories, e.g. ID, newspaper, and then add the codes from the codebook)
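A minimal sketch of a codebook plus coding sheet, with invented categories and codes (the lecture does not prescribe these):

```python
import csv

# codebook: lists the categories, codes, and their meaning (defined a priori)
codebook = {
    "topic":   {1: "economy", 2: "migration", 3: "environment", 9: "other"},
    "valence": {1: "positive", 2: "negative", 3: "neutral"},
}

# coding sheet: one row per recording unit (here: one newspaper article)
rows = [
    {"ID": 1, "newspaper": "paper A", "topic": 2, "valence": 2},
    {"ID": 2, "newspaper": "paper B", "topic": 1, "valence": 3},
]

with open("coding_sheet.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ID", "newspaper", "topic", "valence"])
    writer.writeheader()
    writer.writerows(rows)
```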
discourse analysis fancy definition to memorize
interpretative and constructivist type of analysis that explores the ways in which discourses (language, ideas, concepts, categories) give legitimacy and meaning to social practices and institutions in a particular historical situation or context
example: manifesto project
analyzes/summarizes party platforms (56 countries)
- ideological positions, issues of the parties, policy positions (57 policy categories)
- manual coding of manifestos (quasi-sentences), later also using wordscores
it checks how much attention is given to certain issues and which sides are picked -> determines what parties find important, what their positions are, and what their ideological position is
freely available info. for researchers
NL parties: election scatter plot on the left-right and international peace dimensions
US: parties over time on left-right dimension
content analysis: humans vs computer
manual content analysis
- coders enter codes in coding sheets (hardcopy or digital)
- time-consuming activity
- intercoder reliability necessary (you can calculate this: check whether different coders code the same texts in the same way)
computer-assisted content analysis
- qualitative: helps & manages manual coding (e.g. annotating; saves what you enter)
- quantitative:
dictionary-based automatic computer coding (codes when certain words are mentioned; the researcher has to come up with a full dictionary in advance; see the sketch at the end of this card)
wordscores = coding positions using reference text
wordfish = uses statistical models to estimate the probability that words occur together to classify
AI?
reliability vs validity
- reliability = computer
- validity = manual/humans
(computer does nothing with latent meaning and can't add new codes; humans can)
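The dictionary-based coding mentioned above could, in heavily simplified form, look like this; the word lists are invented and a real dictionary would need to be far more complete:

```python
import re

# researcher-defined dictionary (must be built in full before coding starts)
dictionary = {
    "positive": {"support", "cooperation", "agreement", "peace"},
    "negative": {"attack", "conflict", "crisis", "protest"},
}

def dictionary_code(text):
    """Count how many words from each dictionary category occur in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {cat: sum(w in terms for w in words) for cat, terms in dictionary.items()}

print(dictionary_code("The peace agreement ended years of conflict and crisis."))
# {'positive': 2, 'negative': 2}
```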
example: CEDS
= computational event data system (now discontinued)
- machine coding system for generating event data using pattern recognition and simple linguistic parsing
- input: Reuters and Agence France Presse (AFP) news agency texts (Lexis-Nexis)
- processing: identifies source/subject, verb phrase and target/object
- Use of coding dictionaries
e.g. conflict Israel-Palestine: whether they cooperated or were in conflict (only in the 1990s, around the Oslo peace accords, was there cooperation)
- weekly weighted events
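Not the actual CEDS software, but a heavily simplified sketch of the general idea of dictionary-based event coding (actor and verb dictionaries applied to news sentences); all names, codes and word lists here are invented:

```python
# toy coding dictionaries: actor names -> actor codes, verb phrases -> event types
actors = {"israel": "ISR", "palestinian authority": "PSE", "hamas": "PSE"}
verbs = {"met with": "cooperation", "signed": "cooperation",
         "attacked": "conflict", "condemned": "conflict"}

def code_event(sentence):
    """Return (source, event type, target) if the sentence matches the dictionaries."""
    s = sentence.lower()
    found_actors = [code for name, code in actors.items() if name in s]
    found_verbs = [etype for phrase, etype in verbs.items() if phrase in s]
    if len(found_actors) >= 2 and found_verbs:
        return found_actors[0], found_verbs[0], found_actors[1]
    return None  # sentence could not be coded

print(code_event("Israel signed an agreement with the Palestinian Authority."))
# ('ISR', 'cooperation', 'PSE')
```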
content analysis: analysis
- quantitative content analysis = tables, figures, statistical analysis
- qualitative content analysis = quotation, concept maps, narrative
content analysis: reliability and validity
quantitative = intercoder reliability
- coder stability = let the same coder code the same text at different points in time; see if the coder is consistent
- reproducibility = different coders code the same text (using the same coding scheme) consistently
- objectivity: different coders code/interpret same data consistently
reliability for qualitative content analysis = plausibility, see if the interpretation/conclusions make sense
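A minimal sketch of how intercoder reliability could be quantified for two coders, using percent agreement and Cohen's kappa computed by hand (the codes below are hypothetical):

```python
from collections import Counter

# codes two coders assigned to the same 10 recording units (hypothetical data)
coder_a = [1, 2, 2, 3, 1, 1, 2, 3, 3, 1]
coder_b = [1, 2, 3, 3, 1, 2, 2, 3, 3, 1]

def percent_agreement(a, b):
    """Share of units on which the two coders assigned the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

print(percent_agreement(coder_a, coder_b))          # 0.8
print(round(cohens_kappa(coder_a, coder_b), 2))     # ~0.70
```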