Lecture 3 Flashcards
How the computer changed content analysis
Databases existed back then, but with different kinds of data
Modern society is largely online-based > a revolution of content analysis as a method
Manuals and syntax appear in the exam, e.g. the SPSS KALPHA macro: KALPHA judges= v3.1 v3.2 v3.3/level=4/detail=0/boot=5000. Be able to recognise the intercoder reliability (Krippendorff's alpha) in the output; a Python equivalent is sketched below.
CHECK ONLINE
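A minimal sketch of the same reliability check in Python, assuming the third-party krippendorff package (pip install krippendorff); the ratings below are invented for illustration, and unlike the SPSS macro this package does not bootstrap confidence intervals (/boot=5000).

```python
# Krippendorff's alpha for 3 judges, ratio level (mirrors /level=4 in KALPHA).
import numpy as np
import krippendorff

# Rows = the three judges (v3.1, v3.2, v3.3), columns = coded units;
# np.nan marks a missing coding. Values are invented for illustration.
ratings = np.array([
    [1, 2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 3, 4, 2, 2, 5],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ratio")
print(f"Krippendorff's alpha: {alpha:.3f}")
```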
What do we mean with digital media?
Online news
Websites
Blogs
Apps
Online fora
Social media content
Digital media have other characteristics than traditional media,
so the research methods differ as well
Difference
Traditional CA assumes that content actually exists as a finished, static object. After remixes, mashups, etc., this is a problematic assumption.
Do you study the first publication? Extreme content may disappear quickly, so be quick in collecting it.
Aiming at a moving target
Digital content is ‘chaotic’
Data is moving (position on the site, relation to other articles)
Data is changing: it has different shapes and sizes, and can grow extensively
Data has varying content (comments, hyperlinks)
What will you take into account? Posts, links or likes?
Sampling digital data
More complex than sampling traditional data
The unit of analysis and the registration unit are often diverse
Dynamic character of data: extra challenge
More garbage and unrelated content (spam, broken links, etc.)
Hard to recognise and exclude unrelated content from the sample beforehand (irrelevant units)
Limits to accessibility of digital data
Commercial data is often protected by terms of service (ToS); ask for permission, e.g. from Meta
Research partnerships lack independence and are not accessible to all
Facebook is now cooperating with researchers.
Proprietary (bought) data: replicability not possible!
Forums are often not public: consent required
How digitalisation changed CA: crowd coding
> Traditionally, the coding team consists of researchers and students
Why not outsource to the internet crowd?
Cheaper! 2 cents per coded headline. 1500 headlines: 50 dollars.
Faster! Team of 200 coders? Coding at the same time.
More reliable! Coding in a small team can be biased; the crowd is more systematic.
Signals ambiguity in the data! Disagreement among crowd coders = new insights into the material.
Use two coders, plus a third coder for cases of disagreement (see the sketch below).
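A minimal sketch of that rule: two coders code every unit and a third coder only breaks ties; the labels are invented for illustration.

```python
# Two coders per unit; a third coder decides only on disagreement.
def resolve(code1: str, code2: str, code3: str) -> str:
    """Return the agreed code, or let the third coder break the tie."""
    if code1 == code2:
        return code1
    return code3

print(resolve("positive", "positive", "negative"))  # coders agree -> positive
print(resolve("positive", "negative", "negative"))  # tie broken   -> negative
```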
What is CCA?
Three categories
Dictionary approach
Supervised machine learning
Unsupervised machine learning
Computational content analysis, also known as ATA, CCA or CATA
What do we mean by CCA?
CCA stands for all the content analysis approaches that are aided by the computer when collecting, coding or interpreting data
The role of the computer can be modest, or substantial
Advantages of using CCA
Enables coping with data growth
Try to automate the analysis to keep track of all the information
More efficient: ACA can save time and money (though developing the software is itself time-consuming)
Computers are 100% reliable, whereas getting a reliable human coding instrument can be difficult.
Why reliable? A computer does what it is told: it treats every text according to the instructions you have given it.
Discover unknown patterns: ACA can recognise patterns not visible to the human eye
Three types of CCA approaches
Deductive (rules set by the researcher in a codebook) → inductive (rules determined by the computer, no codebook)
Counting and dictionary (deductive) → supervised machine learning → unsupervised machine learning (inductive)
I: Counting and dictionary (We did it at school)
Rules set by the researcher
Simple tasks that involve the counting of things
Examples: The number of references to a person or issue
All you need is a searchable database (e.g. Lexis Uni) and a keyword or combination of keywords (see the sketch below)
The dictionary can be short (e.g. visibility of the US president) or long
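A minimal sketch of the counting/dictionary approach: count keyword hits per document; the dictionary terms and texts are invented for illustration.

```python
# Count how often any dictionary term occurs in each text.
import re

keywords = ["president", "white house", "biden"]  # hypothetical dictionary

texts = [
    "The White House said the president will visit Brussels.",
    "Stock markets fell sharply after the interest rate decision.",
]

for text in texts:
    lowered = text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
               for term in keywords)
    print(f"{hits} hit(s): {text}")
```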
Limitations of dictionary approach:
Not suited to measure latent concepts
Dictionaries are handmade: very labour intensive!
For big data with unknown characteristics it is not suitable: you can't draw a representative sample
Dictionaries are topic-specific and don't work well in other domains: a sentiment dictionary for sports news is not good for financial news.
Not so popular anymore.
II: Supervised Machine Learning (in between deductive and inductive)
We do apply rules, but we also let the computer learn on its own
Basic idea: the algorithm tries to replicate human coding decisions
There is a training set, which has been manually coded
The computer studies the training set and the decisions made by the researchers, and tries to find patterns
Can be used to code genres, frames, sentiment, subjectivity and topics
Objects are categorised by the researcher (which category each belongs to) to train the classifier
Then ask the computer: how would you code a new object? E.g. distinguishing it from a pear.
If something looks similar, the classifier can still make mistakes: it needs practice (more training data)
It looks for characteristics of those categories, and uses them to locate new examples
The three/four steps of SML
1. Create a training set: a sample with correct, manually coded labels
2. Train the classifier on this training set
3. Use the trained classifier to classify the other documents (see the sketch below)
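A minimal sketch of these steps with scikit-learn (assumed installed); the tiny training set is invented for illustration.

```python
# Supervised ML: train on manually coded texts, then classify new ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: training set = manually coded sample (texts + correct labels).
train_texts = ["Great win for the home team", "Parliament passed the budget",
               "Striker scores twice in the derby", "Coalition talks collapsed"]
train_labels = ["sports", "politics", "sports", "politics"]

# Step 2: the classifier studies the training set and looks for patterns.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

# Step 3: apply the trained classifier to the remaining (uncoded) documents.
print(classifier.predict(["Minister resigns after budget row"]))
```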
Advantages of Supervised ML
Performance is transparent (it can be validated against the manual training set)
The classifier can code as well as a human coder and is 100% reliable when trained properly
The classifier can be applied to an infinite number of texts
Note on SML
The classification decisions are unknown: you only know the output!
III: Unsupervised machine learning - entirely inductive
No instructions about the outcome: you rely on the computer
Example: online shopping recommendations ('maybe you'll like these too'); the algorithm finds the patterns itself
Inductive approach: the algorithm seeks patterns in the data without advance input from the researcher
Analysis is not limited to key terms (like the dictionary approach) but takes the entire text into account
Unsupervised ML:
Co-occurring terms indicate a text's meaning
Relies on clustering methods to find patterns (see the topic-model sketch after this list)
There is no gold standard to compare the output to, which makes it harder to evaluate
Rely on computer
Seemingly unimportant choices can have large consequences
Does not take the nature of language into account!
It may contain hidden biases, which can also surface in the output
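A minimal sketch of unsupervised ML: an LDA topic model clusters texts by their co-occurring terms (scikit-learn assumed installed; the texts are invented for illustration).

```python
# Unsupervised ML: LDA discovers word clusters (topics) without a codebook.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["the match ended with a late goal",
         "voters head to the polls today",
         "the coach praised the goal keeper",
         "the election results are coming in"]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# The researcher must interpret (and label) the discovered clusters
# afterwards: there is no gold standard to evaluate them against.
vocab = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [vocab[j] for j in topic.argsort()[-3:]]
    print(f"topic {i}:", top_terms)
```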
(EXAM question) Which method performs best?
Van Atteveldt: comparison of sentiment analysis approaches in politics
Manual coders
Crowd coders
Dictionaries
Supervised ML (classic)
Supervised ML (deep)
Results: The superior method was manual coders! Crowd coding came second; dictionaries performed worst.
(English) dictionaries had very poor accuracy.
Highlights of the race
Trained student coders were the best
Only trained students and crowd coding reached levels of agreement acceptable for valid measurement
Article 2: Baden's three concerns with CTA
1: Technology before validity
(Operationalisation is replaced by the algorithm, trading validity for predictive performance)
For UML it's even worse: validity is checked only by the interpretability of the findings
No attention to social-science insights about measuring language
No operational justification for algorithmic models
2: Specialisation before integration
a) Most CTA tools are one-trick ponies: they work well at only one task
b) CTA tools are often standalone tools that cannot be combined
c) Use of CTA erodes important conceptual distinctions
Consequence: not suitable for more complicated textual measurements
Tools claim something they don't live up to
3: English before everything
a) Tools are built for English or other Germanic languages
b) Little attention to language-specific differences
Consequence: WEIRD research (Western, Educated, Industrialised, Rich and Democratic)
Baden key takeaways
– Why not everyone is using CTAM (computational text analysis methods)
Plea: social scientists need to remain in control of the research agenda
Way forward:
Tool developers and social scientists must learn from each other and share knowledge
Focus on measurement validity: show how it works
Create CTAM platforms, combining tools
Stimulate non-English CTA research and cross-lingual cooperation
So, humans are still supreme?
Until 2 years ago we thought so. Generative AI (ChatGPT) is reshaping our field drastically.
Crowd workers now use ChatGPT to do their tasks.
ChatGPT achieves higher accuracy and higher reliability (lower bias), and can take surrounding information into account. A new paradigm of text-as-data research: intercoder reliability was sky high! (A minimal prompt sketch follows below.)
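A minimal sketch of how a codebook instruction becomes a zero-shot prompt for a generative LLM; the actual API call is omitted (substitute the chat client of your provider), and the headline is invented for illustration.

```python
# Turn a coding instruction into a zero-shot annotation prompt.
headline = "Parliament passed the budget after a heated debate"

prompt = (
    "You are a content-analysis coder following a codebook.\n"
    "Task: classify the sentiment of the news headline below as positive, "
    "neutral, or negative. Answer with one word only.\n\n"
    f"Headline: {headline}"
)
print(prompt)  # send this prompt to the chat model of your choice
```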
EXAM questions:
What can be said of the dictionary approach?
It is manually constructed