Session 1 Flashcards
Social media problems
- not much training data
- more variety in language (e.g. slang, typos)
- quickly evolving
- identity work
- UNK tokens, bad data quality, domain shift
Social media solutions
- Adapt training data
- Overcome UNKs / make the model robust against UNKs
- Adapt test data (normalization)
- Adapt training data
• Option 1 = Manual annotation: expensive, never ends
• Option 2 = Use a model (uptraining / self-training): quality questionable
• Option 3 = Adapt through language modeling (see lecture 3) (van der Goot (2017) ->
train word2vec on Twitter data, then train a POS tagger on annotated news data,
initialized with those word embeddings; see the sketch after this list)
• Option 4 = Inject "noise" into the annotated training data: less explored
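A minimal sketch of the Option 3 idea, assuming gensim: the toy corpus, toy tagger vocabulary and hyperparameters are placeholders and not the setup of van der Goot (2017); the point is only that word2vec vectors trained on unlabeled Twitter text can initialize the embedding layer of a tagger trained on annotated news data.

```python
# Sketch: (1) train word2vec on unlabeled Twitter text,
# (2) build an embedding matrix to initialize a POS tagger's embedding layer.
import numpy as np
from gensim.models import Word2Vec

tweets = [                                     # stand-in for a large tokenized Twitter dump
    ["lol", "u", "goin", "2", "the", "party", "2nite"],
    ["omg", "i", "luv", "this", "song"],
    ["new", "phone", "who", "dis"],
]
w2v = Word2Vec(sentences=tweets, vector_size=50, window=5, min_count=1, epochs=20)

tagger_vocab = ["the", "party", "lol", "u", "<UNK>"]   # vocabulary of the (toy) tagger

emb = np.random.normal(scale=0.1, size=(len(tagger_vocab), w2v.vector_size))
for idx, word in enumerate(tagger_vocab):
    if word in w2v.wv:                          # copy the pretrained vector when available
        emb[idx] = w2v.wv[word]

# `emb` would then initialize the tagger's embedding layer, e.g.
# torch.nn.Embedding.from_pretrained(torch.tensor(emb), freeze=False),
# before supervised training on the annotated news data.
print(emb.shape)                                # (5, 50)
```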
- Overcome UNKs / make the model robust against UNKs
• Using an n-gram-based model = limited to "known" n-grams
• Solution: use character n-grams
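A toy sketch of why character n-grams help with UNKs: an unseen token still shares most of its character n-grams with known words, so the model is not restricted to word-level n-grams it has seen before.

```python
# Character n-grams as features: the unseen token "gooood" shares most of its
# trigrams with the known word "good". (Illustrative sketch only.)
def char_ngrams(word, n=3):
    padded = f"<{word}>"                  # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("good"))     # ['<go', 'goo', 'ood', 'od>']
print(char_ngrams("gooood"))   # ['<go', 'goo', 'ooo', 'ooo', 'ood', 'od>']
print(set(char_ngrams("good")) & set(char_ngrams("gooood")))   # shared trigrams
```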
- Adapt test data (normalization)
• Make the text more like standard data by normalizing it
• Issue: often we don't want to normalize, because information is lost (e.g. a capital
letter can carry meaning), and it is highly subjective what counts as "standard"
• Lexical normalization (= word-level normalization)
= the task of transforming an utterance into its standard form, word by word, including
both 1-to-many and many-to-1 replacements (a word may be split into two, or two
words merged, but words may not be swapped or reordered; toy example below)
o Benchmark: LexNorm (519 annotated)
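A toy illustration of the word-level annotation idea (made-up sentence and an assumed encoding, not the exact LexNorm file format): each raw token gets a normalized form; splits are 1-to-many, merges are many-to-1.

```python
# Word-by-word normalization: one normalized form per raw token.
raw        = ["im",  "gonna",    "luv",  "it", "any",    "way"]
normalized = ["i'm", "going to", "love", "it", "anyway", ""]

for r, n in zip(raw, normalized):
    print(f"{r:10} -> {n}")
# "gonna" -> "going to"  : one raw token, two standard words (1-to-many)
# "any way" -> "anyway"  : two raw tokens merged into one word (many-to-1,
#                          here encoded by mapping the second token to "")
# Reordering of words is not allowed by the task definition.
```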
How to solve normalisation?
- Models generate candidates & rank them (based on a distance metric; simplified sketch after this list)
   ▪ MoNoise (state-of-the-art until 2021; van der Goot & van Noord, 2017)
      • A) Generate candidates
         o Spell checker (Aspell)
         o Word embeddings trained on Twitter data
         o Lookup list (with normalized words)
      • B) Rank candidates
- Transform to sequence labeling (rough sketch below)
   • Convert tokens to character edits: label space goes from ~10,000 -> ~500 (English)
- Machine Translation (MT)
   • Data hungry -> Statistical MT
   • UFAL (state-of-the-art): uses ByT5 = transformer-based character-level MT,
     translates one word at a time to make maximal use of context, and generates
     a lot of synthetic training data
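A much-simplified generate-and-rank sketch in the spirit of MoNoise. This is not the actual system: MoNoise generates candidates with Aspell, Twitter-trained embeddings and a lookup list, and ranks them with a learned classifier; here a toy vocabulary, a toy lookup list and plain string similarity stand in for those components.

```python
# Toy generate-and-rank normalization (illustrative only).
from difflib import SequenceMatcher

VOCAB = ["you", "your", "to", "too", "tomorrow", "the", "see", "going"]
LOOKUP = {"u": ["you"], "2moro": ["tomorrow"], "gonna": ["going"]}   # toy lookup list

def similarity(a, b):
    # Character-level similarity in [0, 1] (stand-in for an edit-distance metric).
    return SequenceMatcher(None, a, b).ratio()

def generate_candidates(word):
    cands = {word}                                   # keeping the word is always an option
    cands.update(LOOKUP.get(word, []))               # candidates from the lookup list
    cands.update(v for v in VOCAB if similarity(word, v) >= 0.5)   # near neighbours
    return cands

def normalize(word):
    # Rank: prefer lookup-list candidates, break ties by string similarity.
    return max(generate_candidates(word),
               key=lambda c: (c in LOOKUP.get(word, []), similarity(word, c)))

print([normalize(w) for w in ["u", "2moro", "see", "gonna"]])
# -> ['you', 'tomorrow', 'see', 'going']
```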
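A rough sketch of the "convert tokens to character edits" idea behind the sequence-labeling approach: instead of predicting whole target words (a label space of ~10,000), predict a small set of reusable character edit operations (~500). This uses difflib for illustration only; real systems derive per-character labels from an alignment.

```python
# Derive character-level edit operations between a raw token and its gold form.
from difflib import SequenceMatcher

def char_edits(raw, norm):
    edits = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, raw, norm).get_opcodes():
        if op != "equal":
            edits.append((op, raw[i1:i2], norm[j1:j2]))
    return edits

print(char_edits("2moro", "tomorrow"))
# [('replace', '2', 'to'), ('insert', '', 'r'), ('insert', '', 'w')]
print(char_edits("gooood", "good"))
# [('delete', 'oo', '')]   -- a few small edits recur across many word pairs
```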
Normalisation evaluation
- Issue: many different metrics are used -> not possible to compare across
normalization corpora
- Accuracy: measures how many words are correct; the same score can be good for one
language/corpus and bad for another
- Solution: Error Reduction Rate (ERR)
ERR (Error Reduction Rate)
- "% of the problem that was solved" = word-level accuracy normalized for the number of
replacements in the dataset
- ERR = (accuracy - % of words needing no normalization) / (100 - % of words needing no normalization)
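A toy computation of ERR following the formula above; the baseline is the "leave-as-is" accuracy, i.e. the percentage of words that need no normalization.

```python
# Error Reduction Rate on a made-up example.
raw  = ["u",   "r",   "so", "gud",  "at", "this"]
gold = ["you", "are", "so", "good", "at", "this"]
pred = ["you", "r",   "so", "good", "at", "this"]   # system output

accuracy = 100 * sum(p == g for p, g in zip(pred, gold)) / len(gold)   # 83.3
baseline = 100 * sum(r == g for r, g in zip(raw, gold)) / len(gold)    # 50.0 (leave-as-is)

err = (accuracy - baseline) / (100 - baseline)
print(f"accuracy={accuracy:.1f} baseline={baseline:.1f} ERR={err:.2f}")  # ERR=0.67
```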
Eisenstein: Solutions Used & Critique
• I argue that the two main computational approaches to dealing with bad language —
normalization and domain adaptation — are based on theories of social media
language that are not descriptively accurate.
1. Normalization: adapting text to fit the tools
o The logic of normalization presupposes that the “norm” can be identified
unambiguously, and that there is a direct mapping from non-standard words
to the elements in this normal set.
o Normalization is often impossible without changing the meaning of the text.
2. Domain adaptation: adapting tools to fit the text
o E.g. preprocessing, new annotation schemes
o By adopting a model of “domain adaptation,” we confuse a medium with a
coherent domain. Adapting language technology towards the median Tweet
can improve accuracy on average, but it is certain to leave many forms of
language out.
o Social media is not a coherent domain; Twitter itself is not a unified genre; it is
composed of many different styles and registers
Eisenstein: Lexical coherence of social media
• The internal coherence of social media — and its relationship to other types of text —
can be quantified in terms of the similarity of distributions over bigrams. The
relationship between OOV rate and domain adaptation has been explored by
McClosky et al. (2010), who use it as a feature to predict how well a parser will
perform when applied across domains.
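A rough sketch of this kind of measurement: the rate of bigrams in a newer "test" sample that never occur in an older "training" sample (toy sentences here; Eisenstein's analysis uses large Twitter samples from different months and times of day).

```python
# Out-of-vocabulary bigram rate of one corpus relative to another.
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

train = "i am going to the party tonight".split()   # older sample
test  = "i am goin 2 da party 2nite".split()        # newer sample

train_bigrams = set(bigrams(train))
test_bigrams = bigrams(test)

oov_rate = sum(bg not in train_bigrams for bg in test_bigrams) / len(test_bigrams)
print(f"OOV bigram rate: {oov_rate:.2f}")   # higher rate = lower lexical coherence
```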
Eisenstein: OOV
- The rate of out-of-vocabulary bigrams increases steadily over time -> we cannot annotate our
way out of the bad language problem. An NLP system trained on data gathered in
January 2010 will be increasingly outdated as time passes
- These rates rise monotonically as the time gap between training and test data increases; variation
across the day may reflect the diverse language of the different types of authors who post at
different times of day
- Social media = a moving target