Session 1 Flashcards

1
Q

Social media problem

A
  • little annotated training data
  • more variety in language (e.g. slang, typos)
  • quickly evolving
  • language is used for identity work
  • UNK tokens, bad data quality, domain shift
2
Q

Social media solutions

A
    1. Adapt training data
    2. Overcome UNKs / make the model robust against UNKs
    3. Adapt test data (normalization)
3
Q
  1. Adapt training data
A

• Option 1 = manual annotation: expensive, never ends
• Option 2 = use a model (uptraining/self-training): quality questionable
• Option 3 = adapt through language modeling (see lecture 3): van der Goot (2017) ->
train word2vec on Twitter data, then train a POS tagger initialized with those word
embeddings on annotated news data (a minimal sketch follows below)
• Option 4 = inject "noise" into the annotated training data: less explored
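A minimal sketch of the Option 3 recipe, assuming gensim 4.x; the file names and the tagger class are hypothetical placeholders, not part of the cited work.

```python
from gensim.models import Word2Vec
import numpy as np

# 1) Train word embeddings on unannotated Twitter data.
tweets = [line.split() for line in open("tweets.txt", encoding="utf-8")]
w2v = Word2Vec(sentences=tweets, vector_size=100, window=5, min_count=2)

# 2) Build an embedding matrix and use it to initialize the POS tagger's
#    embedding layer; then train the tagger on annotated news data as usual.
vocab = w2v.wv.index_to_key
emb_matrix = np.stack([w2v.wv[w] for w in vocab])  # shape: (|V|, 100)
# tagger = POSTagger(init_embeddings=emb_matrix, vocab=vocab)  # hypothetical class
# tagger.train("news_annotated.conll")
```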

4
Q
  2. Overcome UNKs / make the model robust against UNKs
A

• A word n-gram based model is limited to "known" n-grams
• Solution: use character n-grams, which can still represent unseen words (see the sketch below)
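A minimal sketch of why character n-grams help with UNKs: even an unseen word shares character n-grams with known words, so the model is never left with a completely unknown token.

```python
def char_ngrams(word, n=3):
    padded = f"<{word}>"  # boundary markers, fastText-style
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("runnin"))   # ['<ru', 'run', 'unn', 'nni', 'nin', 'in>']
print(char_ngrams("running"))  # shares '<ru', 'run', 'unn', 'nni', 'nin'
```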

5
Q
  3. Adapt test data (normalization)
A

• Make the test data more like standard data by normalizing it
• Issue: often we don't want to normalize, because information is lost (e.g. a capital
letter can carry meaning), and it is highly subjective what counts as "standard"
• Lexical normalization (= word-level normalization)
= the task of transforming an utterance into its standard form, word by word, including
both 1-to-many and many-to-1 replacements (a word may be split in two, or two
words merged, but words may not be swapped or reordered); see the toy examples below
o Benchmark: LexNorm (519 annotated tweets)
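Toy (made-up) examples of the replacement types this definition allows:

```python
# (raw, normalized) pairs in the spirit of LexNorm-style annotation
pairs = [
    ("u",        "you"),       # 1-to-1 replacement
    ("gonna",    "going to"),  # 1-to-many: one token split into two words
    ("any more", "anymore"),   # many-to-1: two tokens merged into one word
    ("lol",      "lol"),       # left as is: no standard replacement needed
]
```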

6
Q

How to solve normalization?

A
  1. Models generate candidates & rank them (based on a distance metric);
     see the generate-and-rank sketch below
    ▪ MoNoise (state of the art until 2021; van der Goot & van Noord, 2017)
    • A) Generate candidates
    o Spell checker (Aspell)
    o Word embeddings trained on Twitter data
    o Lookup list (with normalized words)
    • B) Rank candidates
  2. Transform to sequence labeling
    • Convert tokens to character edits: shrinks the label space from ~10,000
      full words to ~500 edit labels for English
  3. Machine Translation (MT)
    • MT is data-hungry -> statistical MT
    • UFAL (state of the art): uses ByT5, a transformer-based character-level MT
      model; translates one word at a time to maximize use of context, and
      generates a lot of synthetic training data
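A toy generate-and-rank sketch in the spirit of MoNoise; the candidate sources and the scorer below are simplified stand-ins for the real modules (Aspell, Twitter-trained embeddings, a lookup list, and a feature-based ranker).

```python
LOOKUP = {"u": ["you"], "2moro": ["tomorrow"]}   # normalization lexicon
EMBEDDING = {"tmrw": ["tomorrow", "tonight"]}    # stand-in for embedding neighbours
SPELL = {"tmrw": ["trim", "tomorrow"]}           # stand-in for Aspell suggestions

def generate_candidates(word):
    cands = {word}                               # A) always keep "leave as is"
    for source in (LOOKUP, EMBEDDING, SPELL):
        cands.update(source.get(word, []))
    return cands

def rank(word, candidates):
    # B) toy ranker: prefer the candidate proposed by the most sources; MoNoise
    # instead scores features (edit distance, n-gram probability, ...) with a
    # random forest.
    votes = lambda c: sum(c in s.get(word, []) for s in (LOOKUP, EMBEDDING, SPELL))
    return max(candidates, key=votes)

print(rank("tmrw", generate_candidates("tmrw")))  # -> 'tomorrow'
```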
7
Q

Normalization evaluation

A
  • Issue: many different metrics are used -> not possible to compare results
    across normalization corpora
  • Accuracy measures how many words are correct, but because corpora differ in
    how many words need normalization, the same score can be good for one
    language and bad for another
  • Solution: Error Reduction Rate (ERR)
8
Q

ERR

A
  • Error Reduction Rate: the % of the problem that was solved = word-level
    accuracy normalized for the number of replacements in the dataset
  • ERR = (accuracy - baseline) / (100 - baseline), where the baseline is the
    "leave as is" accuracy, i.e. the % of words that need no normalization
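A small sketch of the formula above, with a worked example showing why raw accuracy can be misleading:

```python
def err(accuracy, baseline):
    """baseline = "leave as is" accuracy, i.e. % of words needing no change."""
    return (accuracy - baseline) / (100 - baseline)

# 95% accuracy sounds high, but if 90% of words need no change,
# only half of the actual normalization problem was solved.
print(err(accuracy=95.0, baseline=90.0))  # -> 0.5
```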
9
Q

Eisenstein: Used Solutions & Critique

A

• I argue that the two main computational approaches to dealing with bad language —
normalization and domain adaptation — are based on theories of social media
language that are not descriptively accurate.
1. Normalization: adapting text to fit the tools
o The logic of normalization presupposes that the "norm" can be identified
unambiguously, and that there is a direct mapping from non-standard words
to the elements in this normal set.
o Normalization is often impossible without changing the meaning of the text.
2. Domain adaptation: adapting tools to fit the text
o E.g. preprocessing, new annotation schemes
o By adopting a model of "domain adaptation," we confuse a medium with a
coherent domain. Adapting language technology towards the median Tweet
can improve accuracy on average, but it is certain to leave many forms of
language out.
o Social media is not a coherent domain; Twitter itself is not a unified genre, it is
composed of many different styles and registers.

10
Q

Eisenstein: Lexical coherence of social media

A

• The internal coherence of social media — and its relationship to other types of text —
can be quantified in terms of the similarity of distributions over bigrams. The
relationship between OOV rate and domain adaptation has been explored by
McClosky et al. (2010), who use it as a feature to predict how well a parser will
perform when applied across domains.
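A toy sketch of that kind of measurement: the share of a test corpus's bigrams that are out-of-vocabulary with respect to a training corpus (the corpora here are made-up token lists).

```python
def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def oov_bigram_rate(train_tokens, test_tokens):
    seen = set(bigrams(train_tokens))
    test = bigrams(test_tokens)
    return sum(bg not in seen for bg in test) / len(test)

train = "i am going to the store".split()
test = "i am gonna hit the store".split()
print(oov_bigram_rate(train, test))  # 3 of 5 test bigrams unseen -> 0.6
```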

11
Q

Eisenstein: OOV

A
  • The rate of out-of-vocabulary bigrams increases steadily over time -> we cannot
    annotate our way out of the bad language problem: an NLP system trained on data
    gathered in January 2010 will be increasingly outdated as time passes
  • These rates rise monotonically as the time gap increases; they may also reflect the
    diverse language of the different types of authors who post throughout the day
  • Social media = a moving target