Lecture 3 Flashcards

1
Q

How the computer changed content analysis

A

Databases existed back then, but with different data
Modern society is very much online-based > a revolution of content analysis as a method

2
Q

Manuals and syntax in the exam: KALPHA judges= v3.1 v3.2 v3.3 /level=4 /detail=0 /boot=5000. Recognise that this syntax computes intercoder reliability.

A

CHECK ONLINE
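For context: KALPHA is Hayes & Krippendorff's SPSS macro for Krippendorff's alpha; judges= lists the coder variables and /boot= the number of bootstrap samples. Below is a rough equivalent in Python, a minimal sketch assuming the third-party krippendorff package and made-up toy ratings (the nominal level here is just an illustration; level=4 in the syntax would be ratio data):

```python
# Minimal sketch: Krippendorff's alpha for three coders, assuming the
# third-party `krippendorff` package (pip install krippendorff).
import numpy as np
import krippendorff

# Rows = coders (like v3.1, v3.2, v3.3 in the KALPHA syntax),
# columns = coded units; np.nan marks a missing judgement.
ratings = np.array([
    [1, 2, 2, 1, 3, 2, 1],
    [1, 2, 2, 1, 3, 1, 1],
    [1, 2, 3, 1, 3, 2, np.nan],
], dtype=float)

# level_of_measurement mirrors KALPHA's /level option; 'nominal' is an
# assumption for this toy example.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```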

3
Q

What do we mean by digital media?

A

Online news
Websites
Blogs
Apps
Online fora
Social media content
Digital media have other characteristics than traditional media
and therefore require different research methods

4
Q

Difference between digital and traditional content

A

Traditional content analysis assumes that the content actually exists as a finished, static object. After remixes, mashups, etc., this is a problematic assumption.

Do you study the first publication? Extreme content? You have to be quick in collecting it.

5
Q

Aiming at a moving target

A

Digital content is ‘chaotic’
Data is moving (its position on the site, its relation to other articles)
Data is changing: it comes in different shapes and sizes and can grow extensively
Data has varying content (comments, hyperlinks)

What will you take into account? Posts, links or likes?

6
Q

Sampling digital data

A

More complex than sampling traditional data
The unit of analysis and the registration unit are often diverse
The dynamic character of the data is an extra challenge
More garbage/unrelated content (spam, broken links, etc.)
It is hard to recognise and exclude unrelated content (irrelevant units) from the sample beforehand

7
Q

Limits to accessibility of digital data

A

Commercial data is often protected by terms of service (ToS); you have to ask for permission, e.g. from Meta

Research partnerships lack independence and are not accessible to all researchers

Facebook is now cooperating with researchers.

Proprietary (bought) data: replicability not possible!
Forums are often not public: consent required

8
Q

How digitalisation changed CA: crowd coding

A

Traditionally, the coding team consists of researchers and students.

Why not outsource to the internet crowd?

Cheaper! 2 cents per coded headline. 1500 headlines: 50 dollars.
Faster! Team of 200 coders? Coding at the same time.

More reliable! Coding in a small team can be biased, and that bias tends to be more systematic.

Signals ambiguity in the data, which gives new insights into the material.

Use two coders and a third coder for cases of disagreement.
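A minimal sketch of that "two coders plus a tie-breaker" rule; the resolve_label helper and the toy labels are made up for illustration, not from the lecture:

```python
# Sketch of a simple adjudication rule: keep the label when two coders
# agree, and fall back to a third coder's judgement when they disagree.
def resolve_label(coder_a: str, coder_b: str, coder_c: str) -> str:
    if coder_a == coder_b:
        return coder_a          # agreement: no adjudication needed
    return coder_c              # disagreement: the third coder decides

# Hypothetical crowd-coded headlines: (coder A, coder B, tie-breaker C)
codings = [
    ("positive", "positive", "positive"),
    ("negative", "neutral", "negative"),
    ("neutral", "neutral", "positive"),
]
final_labels = [resolve_label(a, b, c) for a, b, c in codings]
print(final_labels)  # ['positive', 'negative', 'neutral']
```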

9
Q

What is CCA?

A

Three categories

Dictionary approach
Supervised machine learning
Unsupervised machine learning

Computational content analysis (CCA) is also known as ATA or CATA

10
Q

What do we mean by CCA?

A

CCA stands for all the content analysis approaches that are aided by the computer when collecting, coding or interpreting data
The role of the computer can be modest, or substantial

11
Q

Advantages of using CCA

A

Enables coping with data growth
We try to automate in order to keep track of the information

More efficient: ACA can save time and money (though developing the software is time-consuming)

Computers are 100% reliable, but getting a reliable coding instrument can be difficult.

Why reliable? A computer does what it is told and treats the text strictly according to the instructions you have given it.

Discover unknown patterns: ACA can recognise patterns that are not visible to the human eye

12
Q

Three types of CCA approaches

A

Deductive (rules set by the researcher in a codebook) → inductive (rules are determined by the computer, no codebook)

Counting and dictionary (deductive) → supervised machine learning → unsupervised machine learning (inductive)

13
Q

I: Counting and dictionary (We did it at school)

A

Rule-based: the rules are set by the researcher
Simple tasks that involve the counting of things
Example: the number of references to a person or issue
All you need is a searchable database (e.g. Lexis Uni) and a keyword or combination of keywords
The keyword list can be short (e.g. visibility of the US president) or long
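A minimal sketch of the counting idea, with a made-up keyword list and headlines (in practice the search would run against a database like the one mentioned above):

```python
# Sketch of a dictionary/counting approach: count keyword hits per text.
import re

keywords = ["president", "white house"]  # a tiny illustrative dictionary
headlines = [
    "President announces new climate plan",
    "Stock markets rally after earnings reports",
    "White House responds to criticism of the president",
]

pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
for text in headlines:
    hits = pattern.findall(text)
    print(len(hits), text)
# Output: 1, 0, 2 hits, i.e. the 'visibility' of the president per headline.
```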

14
Q

Limitations of dictionary approach:

A

Not suited to measuring latent concepts
Dictionaries are handmade: very labour-intensive!

With big data whose characteristics are unknown it is not suitable: you cannot draw a representative sample

Dictionaries are topic-specific and do not work well in other domains: a sentiment dictionary for sports news is not good for financial news.
Not so popular anymore.

15
Q

II: Supervised Machine Learning (in between deductive and inductive)

A

We do apply rules, but we also let the computer work on its own

Basic idea: the algorithm tries to replicate human coding decisions

There is a training set, which has been manually coded

Computer studies the training set, and decisions made by researchers and tries to find patterns

Can be used to code genres, frames, sentiment, subjectivity and topics

16
Q

Objects are categorised by the researcher (which category they belong to), and the classifier is then trained on this data

A

Ask the computer where it would code a new object and how, e.g. distinguishing it from a pear.
If something looks similar the classifier can still make mistakes: it needs practice.
It looks for characteristics of those categories and uses them to locate new examples.

17
Q

The three/four steps of SML

A

1. A training set: a sample with the correct labels
2. Train the classifier on the training set
3. Use the classifier to classify the other documents
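A minimal sketch of these steps with scikit-learn; the texts, labels and classifier choice (Naive Bayes) are illustrative assumptions, not the lecture's setup:

```python
# Sketch of supervised ML on text: train on a hand-coded sample,
# then classify the remaining documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1: training set, a sample of texts with manually coded labels.
train_texts = [
    "Parliament debates the new budget proposal",
    "Local team wins the championship final",
    "Minister resigns after corruption scandal",
    "Star striker injured ahead of derby match",
]
train_labels = ["politics", "sports", "politics", "sports"]

# Step 2: train the classifier on the coded sample.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

# Step 3: apply the trained classifier to the remaining, uncoded documents.
new_texts = ["Parliament approves the budget",
             "Striker scores twice in the final match"]
print(classifier.predict(new_texts))  # likely ['politics' 'sports']
```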

18
Q

Advantages of Supervised ML

A

Performance is transparent (manual training set)
The classifier can code as well as a human coder and is 100% reliable when trained properly

The classifier can be applied to an infinite number of texts

19
Q

Note on SML

A

The classification decisions are unknown: you only know the output!

20
Q

III: Unsupervised machine learning - entirely inductive

A

No instructions about the outcome: you rely on the computer

When shopping online: 'maybe you like these too'. The algorithm provides suggestions.

Inductive approach: the algorithm seeks patterns in the data without advance input from the researcher
The analysis is not limited to key terms (as in the dictionary approach) but takes the entire text into account
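One common unsupervised technique is topic modelling; whether the lecture used it specifically is an assumption. A minimal LDA sketch with scikit-learn and toy documents:

```python
# Sketch of unsupervised ML on text: an LDA topic model finds word
# patterns ("topics") without any labels or codebook from the researcher.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "election campaign vote parliament coalition",
    "match goal striker league championship",
    "minister vote debate parliament budget",
    "coach transfer goal club season",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)           # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Inspect the top words per topic; interpreting them is up to the researcher.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```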

21
Q

Unsupervised ML:

A

Co-occurring terms indicate a text's meaning
Relies on clustering methods to find patterns
There is no gold standard to compare the output to, which makes it harder to evaluate
You rely on the computer

Seemingly unimportant choices can have large consequences
Does not take the nature of language into account!
It may contain hidden bias; does this also show up in the output?
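A minimal clustering sketch matching the points above: texts are represented by their terms and k-means groups them. The toy texts and the choice of k = 2 are assumptions, and k is exactly the kind of seemingly small choice that can change the result:

```python
# Sketch of clustering texts by their term (co-occurrence) profiles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "budget vote parliament minister",
    "goal striker match league",
    "parliament debate coalition budget",
    "striker transfer goal club",
]

X = TfidfVectorizer().fit_transform(texts)

# k (here 2) is chosen by the researcher; a seemingly small choice
# like this can change the clustering substantially.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignment per text, e.g. [0 1 0 1]
```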

22
Q

(EXAM question) Which method performs best?
Van Atteveldt: comparison of sentiment analysis approaches in politics

A

Manual coders
Crowd coders
Dictionaries
Supervised ML (classic)
Supervised ML (deep)

Results: the superior method was manual coding! Crowd coding came second; the (English) dictionaries performed worst, with very poor accuracy.

23
Q

Highlight of the race

A

Student coders performed best
Only trained students and crowd coding scored levels of agreement that qualify as valid measurement

24
Q

Article 2: Baden's three concerns with CTA
Concern 1:

A

Technology before validity (operationalisation is replaced by the algorithm, trading validity for predictive performance)
For UML it is even worse: validity is only checked via the interpretability of the findings
No attention to social-science insights about measuring language
No operational justification for algorithmic models

25
Q
Concern 2: Specialisation before integration
A

a) Most CTA tools work well but are one-trick ponies, only capable of one task
b) CTA tools are often standalone tools that cannot be combined
c) The use of CTA erodes important conceptual distinctions

Consequence: not suitable for more complicated textual measurements;
the tools claim something they don't live up to

26
Q
Concern 3: English before everything
A

a) Tools are in English or Germanic languages
b) Little attention to language-specific differences
Consequence: WEIRD research (Western, Educated, Industrialised, Rich and Democratic)

27
Q

Baden key takeaways

A

– Why not everyone is using CTAM
Plea: social scientists need to remain in control of the research agenda

28
Q

Way forward:

A

Tool developers and social scientists must learn from each other and share knowledge
Focus on measurement validity: show how it works
Create CTAM platforms that combine tools
Stimulate non-English CTA research and cross-lingual cooperation

29
Q

So, humans are still supreme?

A

Until about two years ago we thought so. Generative AI (ChatGPT) is reshaping our field drastically.
Crowd workers now use ChatGPT to do their tasks.

ChatGPT has higher accuracy and higher reliability (lower bias) and can take surrounding information into account. A new paradigm of text-as-data research: intercoder reliability was sky high!
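A minimal sketch of using an LLM as a coder in this spirit; it assumes the OpenAI Python client, a placeholder model name and a made-up prompt, none of which come from the lecture:

```python
# Sketch: zero-shot sentiment coding of a headline with an LLM.
# Assumes the `openai` package (>=1.0) and an API key in OPENAI_API_KEY;
# the model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()

def code_headline(headline: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,        # deterministic output helps reliability
        messages=[
            {"role": "system",
             "content": "You are a content-analysis coder. Answer with "
                        "exactly one word: positive, negative, or neutral."},
            {"role": "user", "content": f"Code the sentiment of: {headline}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()

print(code_headline("Economy shrinks for the third quarter in a row"))
```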
30
Q

EXAM questions:

A

What can be said of the dictionary approach?
It is manually constructed