Session 1 Flashcards
Social media problems
- not much training data
- more variety in language (e.g. slang, typos)
- quickly evolving
- identity work
- UNK tokens, bad data quality, domain shift
Social media solutions
- Adapt training data
- Overcome UNKs / make the model robust against UNKs
- Adapt test data (normalization)
- Adapt training data
• Option 1 = Manual annotation: expensive, never ends
• Option 2 = Use a model (uptraining / self-training): quality questionable
• Option 3 = Adapt through language modeling (see lecture 3) (van der Goot (2017) ->
train word2vec on Twitter data, then train a POS tagger on annotated news data,
initialized with those word embeddings; see the sketch after this list)
• Option 4 = Inject "noise" into the annotated training data: less explored
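A minimal sketch of the Option 3 idea, assuming gensim: the toy corpus, toy tagger vocabulary and hyperparameters are placeholders and not the setup of van der Goot (2017); the point is only that word2vec vectors trained on unlabeled Twitter text can initialize the embedding layer of a tagger trained on annotated news data.

```python
# Sketch: (1) train word2vec on unlabeled Twitter text,
# (2) build an embedding matrix to initialize a POS tagger's embedding layer.
import numpy as np
from gensim.models import Word2Vec

tweets = [                                     # stand-in for a large tokenized Twitter dump
    ["lol", "u", "goin", "2", "the", "party", "2nite"],
    ["omg", "i", "luv", "this", "song"],
    ["new", "phone", "who", "dis"],
]
w2v = Word2Vec(sentences=tweets, vector_size=50, window=5, min_count=1, epochs=20)

tagger_vocab = ["the", "party", "lol", "u", "<UNK>"]   # vocabulary of the (toy) tagger

emb = np.random.normal(scale=0.1, size=(len(tagger_vocab), w2v.vector_size))
for idx, word in enumerate(tagger_vocab):
    if word in w2v.wv:                          # copy the pretrained vector when available
        emb[idx] = w2v.wv[word]

# `emb` would then initialize the tagger's embedding layer, e.g.
# torch.nn.Embedding.from_pretrained(torch.tensor(emb), freeze=False),
# before supervised training on the annotated news data.
print(emb.shape)                                # (5, 50)
```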
- Overcome UNKs / make the model robust against UNKs
• Using an n-gram-based model = limited to "known" n-grams
• Solution: use character n-grams
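A toy sketch of why character n-grams help with UNKs: an unseen token still shares most of its character n-grams with known words, so the model is not restricted to word-level n-grams it has seen before.

```python
# Character n-grams as features: the unseen token "gooood" shares most of its
# trigrams with the known word "good". (Illustrative sketch only.)
def char_ngrams(word, n=3):
    padded = f"<{word}>"                  # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("good"))     # ['<go', 'goo', 'ood', 'od>']
print(char_ngrams("gooood"))   # ['<go', 'goo', 'ooo', 'ooo', 'ood', 'od>']
print(set(char_ngrams("good")) & set(char_ngrams("gooood")))   # shared trigrams
```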
- Adapt test data (normalization)
• Make the text more like standard data by normalizing it
• Issue: often we don't want to normalize, because information is lost (e.g. a capital
letter can carry meaning), and it is highly subjective what counts as "standard"
• Lexical normalization (= word-level normalization)
= the task of transforming an utterance into its standard form, word by word, including
both 1-to-many and many-to-1 replacements (a word may be split into two, or two
words merged, but words may not be swapped or reordered; toy example below)
o Benchmark: LexNorm (519 annotated)
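A toy illustration of the word-level annotation idea (made-up sentence and an assumed encoding, not the exact LexNorm file format): each raw token gets a normalized form; splits are 1-to-many, merges are many-to-1.

```python
# Word-by-word normalization: one normalized form per raw token.
raw        = ["im",  "gonna",    "luv",  "it", "any",    "way"]
normalized = ["i'm", "going to", "love", "it", "anyway", ""]

for r, n in zip(raw, normalized):
    print(f"{r:10} -> {n}")
# "gonna" -> "going to"  : one raw token, two standard words (1-to-many)
# "any way" -> "anyway"  : two raw tokens merged into one word (many-to-1,
#                          here encoded by mapping the second token to "")
# Reordering of words is not allowed by the task definition.
```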
How to solve normalisation?
- Models generate candidates & rank them (based on a distance metric; simplified sketch after this list)
   ▪ MoNoise (state-of-the-art until 2021; van der Goot & van Noord, 2017)
      • A) Generate candidates
         o Spell checker (Aspell)
         o Word embeddings trained on Twitter data
         o Lookup list (with normalized words)
      • B) Rank candidates
- Transform to sequence labeling (rough sketch below)
   • Convert tokens to character edits: label space goes from ~10,000 -> ~500 (English)
- Machine Translation (MT)
   • Data hungry -> Statistical MT
   • UFAL (state-of-the-art): uses ByT5 = transformer-based character-level MT,
     translates one word at a time to make maximal use of context, and generates
     a lot of synthetic training data
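A much-simplified generate-and-rank sketch in the spirit of MoNoise. This is not the actual system: MoNoise generates candidates with Aspell, Twitter-trained embeddings and a lookup list, and ranks them with a learned classifier; here a toy vocabulary, a toy lookup list and plain string similarity stand in for those components.

```python
# Toy generate-and-rank normalization (illustrative only).
from difflib import SequenceMatcher

VOCAB = ["you", "your", "to", "too", "tomorrow", "the", "see", "going"]
LOOKUP = {"u": ["you"], "2moro": ["tomorrow"], "gonna": ["going"]}   # toy lookup list

def similarity(a, b):
    # Character-level similarity in [0, 1] (stand-in for an edit-distance metric).
    return SequenceMatcher(None, a, b).ratio()

def generate_candidates(word):
    cands = {word}                                   # keeping the word is always an option
    cands.update(LOOKUP.get(word, []))               # candidates from the lookup list
    cands.update(v for v in VOCAB if similarity(word, v) >= 0.5)   # near neighbours
    return cands

def normalize(word):
    # Rank: prefer lookup-list candidates, break ties by string similarity.
    return max(generate_candidates(word),
               key=lambda c: (c in LOOKUP.get(word, []), similarity(word, c)))

print([normalize(w) for w in ["u", "2moro", "see", "gonna"]])
# -> ['you', 'tomorrow', 'see', 'going']
```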
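A rough sketch of the "convert tokens to character edits" idea behind the sequence-labeling approach: instead of predicting whole target words (a label space of ~10,000), predict a small set of reusable character edit operations (~500). This uses difflib for illustration only; real systems derive per-character labels from an alignment.

```python
# Derive character-level edit operations between a raw token and its gold form.
from difflib import SequenceMatcher

def char_edits(raw, norm):
    edits = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, raw, norm).get_opcodes():
        if op != "equal":
            edits.append((op, raw[i1:i2], norm[j1:j2]))
    return edits

print(char_edits("2moro", "tomorrow"))
# [('replace', '2', 'to'), ('insert', '', 'r'), ('insert', '', 'w')]
print(char_edits("gooood", "good"))
# [('delete', 'oo', '')]   -- a few small edits recur across many word pairs
```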
Normalisation evaluation
- Issue: many different metrics are used -> not possible to compare across
normalization corpora
- Accuracy: measures how many words are correct; the same score can be good for one
language/corpus and bad for another
- Solution: Error Reduction Rate (ERR)
ERR (Error Reduction Rate)
- "% of the problem that was solved" = word-level accuracy normalized for the number of
replacements in the dataset
- ERR = (accuracy - % of words needing no normalization) / (100 - % of words needing no normalization)
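A toy computation of ERR following the formula above; the baseline is the "leave-as-is" accuracy, i.e. the percentage of words that need no normalization.

```python
# Error Reduction Rate on a made-up example.
raw  = ["u",   "r",   "so", "gud",  "at", "this"]
gold = ["you", "are", "so", "good", "at", "this"]
pred = ["you", "r",   "so", "good", "at", "this"]   # system output

accuracy = 100 * sum(p == g for p, g in zip(pred, gold)) / len(gold)   # 83.3
baseline = 100 * sum(r == g for r, g in zip(raw, gold)) / len(gold)    # 50.0 (leave-as-is)

err = (accuracy - baseline) / (100 - baseline)
print(f"accuracy={accuracy:.1f} baseline={baseline:.1f} ERR={err:.2f}")  # ERR=0.67
```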
Eisenstein: Solutions Used & Critique
• I argue that the two main computational approaches to dealing with bad language —
normalization and domain adaptation — are based on theories of social media
language that are not descriptively accurate.
1. Normalization: adapting text to fit the tools
o The logic of normalization presupposes that the “norm” can be identified
unambiguously, and that there is a direct mapping from non-standard words
to the elements in this normal set.
o Normalization is often impossible without changing the meaning of the text.
2. Domain adaptation: adapting tools to fit the text
o E.g. preprocessing, new annotation schemes
o By adopting a model of “domain adaptation,” we confuse a medium with a
coherent domain. Adapting language technology towards the median Tweet
can improve accuracy on average, but it is certain to leave many forms of
language out.
o Social media is not a coherent domain; Twitter itself is not a unified genre; it is
composed of many different styles and registers
Eisenstein: Lexical coherence of social media
• The internal coherence of social media — and its relationship to other types of text —
can be quantified in terms of the similarity of distributions over bigrams. The
relationship between OOV rate and domain adaptation has been explored by
McClosky et al. (2010), who use it as a feature to predict how well a parser will
perform when applied across domains.
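A rough sketch of this kind of measurement: the rate of bigrams in a newer "test" sample that never occur in an older "training" sample (toy sentences here; Eisenstein's analysis uses large Twitter samples from different months and times of day).

```python
# Out-of-vocabulary bigram rate of one corpus relative to another.
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

train = "i am going to the party tonight".split()   # older sample
test  = "i am goin 2 da party 2nite".split()        # newer sample

train_bigrams = set(bigrams(train))
test_bigrams = bigrams(test)

oov_rate = sum(bg not in train_bigrams for bg in test_bigrams) / len(test_bigrams)
print(f"OOV bigram rate: {oov_rate:.2f}")   # higher rate = lower lexical coherence
```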
Eisenstein: OOV
- The rate of out-of-vocabulary bigrams increases steadily over time -> we cannot annotate our
way out of the bad language problem. An NLP system trained on data gathered in
January 2010 will be increasingly outdated as time passes
- These rates rise monotonically as the time gap between training and test data increases; variation
across the day may reflect the diverse language of the different types of authors who post at
different times of day
- Social media = a moving target