Lecture 3 Flashcards
How the computer changed content analysis
Databases existed back then, but with different kinds of data
Modern society is largely online-based > a revolution of content analysis as a method
Manuals and syntax appear in the exam, e.g. the SPSS KALPHA macro: KALPHA judges= v3.1 v3.2 v3.3/level=4/detail=0/boot=5000. Be able to recognise the intercoder reliability (Krippendorff's alpha) in the output; a Python equivalent is sketched below.
CHECK ONLINE
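A minimal sketch of the same reliability check in Python, assuming the third-party krippendorff package (pip install krippendorff); the ratings below are invented for illustration, and unlike the SPSS macro this package does not bootstrap confidence intervals (/boot=5000).

```python
# Krippendorff's alpha for 3 judges, ratio level (mirrors /level=4 in KALPHA).
import numpy as np
import krippendorff

# Rows = the three judges (v3.1, v3.2, v3.3), columns = coded units;
# np.nan marks a missing coding. Values are invented for illustration.
ratings = np.array([
    [1, 2, 3, 3, 2, 1, 4, 1, 2, np.nan],
    [1, 2, 3, 3, 2, 2, 4, 1, 2, 5],
    [np.nan, 3, 3, 3, 2, 3, 4, 2, 2, 5],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ratio")
print(f"Krippendorff's alpha: {alpha:.3f}")
```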
What do we mean with digital media?
Online news
Websites
Blogs
Apps
Online fora
Social media content
Digital media have other characteristics than traditional media,
so the research methods differ as well
Difference
Traditional CA assumes that content actually exists as a finished, static object. After remixes, mashups, etc., this is a problematic assumption.
Do you study the first publication? Extreme content may disappear quickly, so be quick in collecting it.
Aiming at a moving target
Digital content is ‘chaotic’
Data is moving (position on the site, relation to other articles)
Data is changing: it has different shapes and sizes, and can grow extensively
Data has varying content (comments, hyperlinks)
What will you take into account? Posts, links or likes?
Sampling digital data
More complex than sampling traditional data
The unit of analysis and the registration unit are often diverse
Dynamic character of data: extra challenge
More garbage and unrelated content (spam, broken links, etc.)
Hard to recognise and exclude unrelated content from the sample beforehand (irrelevant units)
Limits to accessibility of digital data
Commercial data is often protected by terms of service (ToS); ask for permission, e.g. from Meta
Research partnerships lack independence and are not accessible to all
Facebook is now cooperating with researchers.
Proprietary (bought) data: replicability not possible!
Forums are often not public: consent required
How digitalisation changed CA: crowd coding
> Traditionally, the coding team consists of researchers and students
Why not outsource to the internet crowd?
Cheaper! 2 cents per coded headline. 1500 headlines: 50 dollars.
Faster! Team of 200 coders? Coding at the same time.
More reliable! Coding in a small team can be biased; the crowd is more systematic.
Signals ambiguity in the data! Disagreement among crowd coders = new insights into the material.
Use two coders, plus a third coder for cases of disagreement (see the sketch below).
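A minimal sketch of that rule: two coders code every unit and a third coder only breaks ties; the labels are invented for illustration.

```python
# Two coders per unit; a third coder decides only on disagreement.
def resolve(code1: str, code2: str, code3: str) -> str:
    """Return the agreed code, or let the third coder break the tie."""
    if code1 == code2:
        return code1
    return code3

print(resolve("positive", "positive", "negative"))  # coders agree -> positive
print(resolve("positive", "negative", "negative"))  # tie broken   -> negative
```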
What is CCA?
Three categories
Dictionary approach
Supervised machine learning
Unsupervised machine learning
Computational content analysis, also known as ATA, CCA or CATA
What do we mean by CCA?
CCA stands for all the content analysis approaches that are aided by the computer when collecting, coding or interpreting data
The role of the computer can be modest, or substantial
Advantages of using CCA
Enables coping with data growth
Try to automate the analysis to keep track of all the information
More efficient: ACA can save time and money (though developing the software is itself time-consuming)
Computers are 100% reliable, whereas getting a reliable human coding instrument can be difficult.
Why reliable? A computer does what it is told: it treats every text according to the instructions you have given it.
Discover unknown patterns: ACA can recognise patterns not visible to the human eye
Three types of CCA approaches
Deductive (rules set by the researcher in a codebook) → inductive (rules determined by the computer, no codebook)
Counting and dictionary (deductive) → supervised machine learning → unsupervised machine learning (inductive)
I: Counting and dictionary (We did it at school)
Rules set by the researcher
Simple tasks that involve the counting of things
Examples: The number of references to a person or issue
All you need is a searchable database (e.g. Lexis Uni) and a keyword or combination of keywords (see the sketch below)
The dictionary can be short (e.g. visibility of the US president) or long
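A minimal sketch of the counting/dictionary approach: count keyword hits per document; the dictionary terms and texts are invented for illustration.

```python
# Count how often any dictionary term occurs in each text.
import re

keywords = ["president", "white house", "biden"]  # hypothetical dictionary

texts = [
    "The White House said the president will visit Brussels.",
    "Stock markets fell sharply after the interest rate decision.",
]

for text in texts:
    lowered = text.lower()
    hits = sum(len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
               for term in keywords)
    print(f"{hits} hit(s): {text}")
```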
Limitations of dictionary approach:
Not suited to measure latent concepts
Dictionaries are handmade: very labour intensive!
For big data with unknown characteristics it is not suitable: you can't draw a representative sample
Dictionaries are topic-specific and don't work well in other domains: a sentiment dictionary for sports news is not good for financial news.
Not so popular anymore.
II: Supervised Machine Learning (in between deductive and inductive)
We do apply rules, but we also let the computer learn on its own
Basic idea: the algorithm tries to replicate human coding decisions
There is a training set, which has been manually coded
The computer studies the training set and the decisions made by the researchers, and tries to find patterns
Can be used to code genres, frames, sentiment, subjectivity and topics
Objects are categorised by the researcher (which category each belongs to) to train the classifier
Then ask the computer: how would you code a new object? E.g. distinguishing it from a pear.
If something looks similar, the classifier can still make mistakes: it needs practice (more training data)
It looks for characteristics of those categories, and uses them to locate new examples
The three/four steps of SML
1. Create a training set: a sample with correct, manually coded labels
2. Train the classifier on this training set
3. Use the trained classifier to classify the other documents (see the sketch below)
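A minimal sketch of these steps with scikit-learn (assumed installed); the tiny training set is invented for illustration.

```python
# Supervised ML: train on manually coded texts, then classify new ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1: training set = manually coded sample (texts + correct labels).
train_texts = ["Great win for the home team", "Parliament passed the budget",
               "Striker scores twice in the derby", "Coalition talks collapsed"]
train_labels = ["sports", "politics", "sports", "politics"]

# Step 2: the classifier studies the training set and looks for patterns.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

# Step 3: apply the trained classifier to the remaining (uncoded) documents.
print(classifier.predict(["Minister resigns after budget row"]))
```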
Advantages of Supervised ML
Performance is transparent (it can be validated against the manual training set)
The classifier can code as well as a human coder and is 100% reliable when trained properly
The classifier can be applied to an infinite number of texts
Note on SML
The classification decisions are unknown: you only know the output!
III: Unsupervised machine learning - entirely inductive
No instructions about the outcome: you rely on the computer
Example: online shopping recommendations ('maybe you'll like these too'); the algorithm finds the patterns itself
Inductive approach: the algorithm seeks patterns in the data without advance input from the researcher
Analysis is not limited to key terms (like the dictionary approach) but takes the entire text into account
Unsupervised ML:
Co-occurring terms indicate a text's meaning
Relies on clustering methods to find patterns (see the topic-model sketch after this list)
There is no gold standard to compare the output to, which makes it harder to evaluate
Rely on computer
Seemingly unimportant choices can have large consequences
Does not take the nature of language into account!
It may contain hidden biases, which can also surface in the output
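A minimal sketch of unsupervised ML: an LDA topic model clusters texts by their co-occurring terms (scikit-learn assumed installed; the texts are invented for illustration).

```python
# Unsupervised ML: LDA discovers word clusters (topics) without a codebook.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["the match ended with a late goal",
         "voters head to the polls today",
         "the coach praised the goal keeper",
         "the election results are coming in"]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# The researcher must interpret (and label) the discovered clusters
# afterwards: there is no gold standard to evaluate them against.
vocab = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [vocab[j] for j in topic.argsort()[-3:]]
    print(f"topic {i}:", top_terms)
```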
(EXAM question) Which method performs best?
Van Atteveldt: comparison of sentiment analysis approaches in politics
Manual coders
Crowd coders
Dictionaries
Supervised ML (classic)
Supervised ML (deep)
Results: The superior method was manual coders! Crowd coding came second; dictionaries performed worst.
(English) dictionaries had very poor accuracy.
Highlights of the race
Trained student coders were the best
Only trained students and crowd coding reached levels of agreement acceptable for valid measurement
Article 2: Baden's three concerns with CTA
1: Technology before validity
(Operationalisation is replaced by the algorithm, trading validity for predictive performance)
For UML it's even worse: validity is checked only by the interpretability of the findings
No attention to social-science insights about measuring language
No operational justification for algorithmic models
2: Specialisation before integration
a) Most CTA tools are one-trick ponies: they work well at only one task
b) CTA tools are often standalone tools that cannot be combined
c) Use of CTA erodes important conceptual distinctions
Consequence: not suitable for more complicated textual measurements
Tools claim something they don't live up to
3: English before everything
a) Tools are built for English or other Germanic languages
b) Little attention to language-specific differences
Consequence: WEIRD research (Western, Educated, Industrialised, Rich and Democratic)
Baden key takeaways
– Why not everyone is using CTAM (computational text analysis methods)
Plea: social scientists need to remain in control of the research agenda
Way forward:
Tool developers and social scientists must learn from each other and share knowledge
Focus on measurement validity: show how it works
Create CTAM platforms, combining tools
Stimulate non-English CTA research and cross-lingual cooperation
So, humans are still supreme?
Until 2 years ago we thought so. Generative AI (ChatGPT) is reshaping our field drastically.
Crowd workers now use ChatGPT to do their tasks.
ChatGPT achieves higher accuracy and higher reliability (lower bias), and can take surrounding information into account. A new paradigm of text-as-data research: intercoder reliability was sky high! (A minimal prompt sketch follows below.)
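A minimal sketch of how a codebook instruction becomes a zero-shot prompt for a generative LLM; the actual API call is omitted (substitute the chat client of your provider), and the headline is invented for illustration.

```python
# Turn a coding instruction into a zero-shot annotation prompt.
headline = "Parliament passed the budget after a heated debate"

prompt = (
    "You are a content-analysis coder following a codebook.\n"
    "Task: classify the sentiment of the news headline below as positive, "
    "neutral, or negative. Answer with one word only.\n\n"
    f"Headline: {headline}"
)
print(prompt)  # send this prompt to the chat model of your choice
```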
EXAM questions:
What can be said of the dictionary approach?
It is manually constructed