TEXT PREPROCESSING Flashcards

1
Q

Before we start text preprocessing, what is it important to think about

A

The genre and domain the text is in
Genres: social media/emails/literature
Domains: chemistry/politics/entertainment

Because we may need specific resources
Things like format and punctuation will vary depending on these

2
Q

What is tokenisation

A

Breaking input into individual units of text (tokens)
The easiest way is to split on whitespace
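The whitespace approach above is a one-liner in most languages; a minimal Python sketch:

```python
# Whitespace tokenisation: split the input on any run of whitespace.
text = "The quick brown fox jumps over the lazy dog"
tokens = text.split()  # str.split() with no argument splits on whitespace runs
print(tokens)  # → ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```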

3
Q

What is an issue with whitespace tokenisation

A

Some languages do not use whitespace between words, so whitespace splitting will not tokenise them correctly

4
Q

What are the typical tokenisation steps

A
  • Initial segmentation (whitespace)
    - handling abbreviations and apostrophes
    - handling hyphenation
    - dealing with other special expressions (URLs)
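The steps above can be sketched with a single regular expression that keeps URLs, abbreviations, and apostrophe/hyphen words as whole tokens before falling back to plain word splitting. The patterns here are illustrative, not a production tokeniser:

```python
import re

# Order matters: earlier alternatives win, so URLs and abbreviations
# are matched before the generic word pattern.
TOKEN_RE = re.compile(
    r"https?://\S+"                 # URLs stay whole
    r"|[A-Za-z]\.(?:[A-Za-z]\.)+"   # abbreviations like U.S.A.
    r"|\w+(?:['-]\w+)*"             # words, incl. don't / state-of-the-art
)

def tokenise(text):
    return TOKEN_RE.findall(text)

print(tokenise("Visit https://example.com - it's state-of-the-art in the U.S.A."))
```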
5
Q

Traditional NLP tokenisation

A

Uses simple, word-like tokens

6
Q

Modern NLP tokenisation

A

Subword tokenisation

7
Q

What is normalisation

A

The process of standardising and transforming text data into a common, consistent format
Consists of:
Lowercasing
Removing punctuation
Removing stop words: removing common words (e.g., “the,” “and,” “is”)
Stemming
Lemmatisation
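The first three steps listed are easy to chain together; a minimal sketch, where the tiny stop-word list is illustrative only:

```python
import string

STOP_WORDS = {"the", "and", "is", "a", "of"}  # toy list for illustration

def normalise(text):
    text = text.lower()                                               # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]                 # stop-word removal

print(normalise("The cat, and the dog!"))  # → ['cat', 'dog']
```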

8
Q

What is lemmatisation

A

Reduction to the “dictionary headword” form (lemma)
{am, are, is} -> be
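A real lemmatiser uses a lexicon and part-of-speech information; this toy lookup table (my own illustrative data) just shows the headword mapping from the card:

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemma.
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "better": "good"}

def lemmatise(token):
    return LEMMAS.get(token, token)  # fall back to the token itself

print([lemmatise(t) for t in ["am", "are", "is", "happy"]])  # → ['be', 'be', 'be', 'happy']
```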

9
Q

What is morphological analysis

A

Words are not the only units of ‘meaning’
Subwords = morphemes, which carry some meaning
(un) (happy) (ness)
(prefix) (stem) (suffix)

So morphological analysis is taking a word and identifying its stem and affixes
This can depend on the context

10
Q

What is a ‘derivation’

A

The formation of a word from its stem and affixes (prefixes and suffixes)
e.g. un-happi-ness
friend-ly

11
Q

What is inflection

A

The modification of a word to express different grammatical roles
e.g. come -> came
waiter -> waitress

12
Q

What is regular inflectional morphology

A

Changes that occur to express grammatical features like tense, number, gender, case
Easy and predictable

13
Q

What is derivational morphology

A

Creation of new words by adding affixes or other modifications
Often changes the word class
teach -> teacher

14
Q

What is stemming

A

Chops “ends of words”
removes suffixes, sometimes prefixes
quite quick
Can largely neutralise inflection and some derivation
Can yield non-words
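A crude suffix-stripping stemmer in the spirit described above (not the Porter algorithm; the suffix list and length check are my own illustrative choices):

```python
# Try longer suffixes first; only strip when enough of the word remains.
SUFFIXES = ["ation", "ness", "ing", "es", "ed", "ly", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["running", "happiness", "divided"]])  # → ['runn', 'happi', 'divid']
```

Note the outputs are non-words, illustrating the card's last point.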

15
Q

What is under-stemming

A

fails to conflate related terms
divide -> divid
division -> divis

16
Q

What is over-stemming

A

conflates unrelated terms
neutron, neutral -> neutr

17
Q

What is the Porter stemmer

A

One of the most common stemmers for English
Rule-based
Focuses on suffix stripping

18
Q

What is character n-gram tokenisation

A

Breaks down text into “n” length substring tokens
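A minimal sketch of extracting all length-n substrings:

```python
# Character n-grams: every contiguous substring of length n.
def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 3))  # → ['hel', 'ell', 'llo']
```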

19
Q

What is Byte-pair encoding

A

Add ‘_’ at the end of every word so we know where words are separated

Let the initial vocabulary {A, B, C, … a, b, c, …} be the set of individual characters in the corpus
Repeat:
Choose the two symbols that are most frequently adjacent in the corpus, e.g. ‘e’ ‘r’
Add a new symbol ‘er’ to the vocabulary
{… x, y, z, er}
Replace every adjacent ‘e’ ‘r’ with ‘er’

Until k merges have been done
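The loop above can be sketched as a minimal BPE token learner (a toy implementation, not an optimised one; the sample corpus is my own):

```python
from collections import Counter

def learn_bpe(words, k):
    # Each word becomes a list of character symbols ending in the '_' marker.
    seqs = [list(w) + ["_"] for w in words]
    merges = []
    for _ in range(k):
        # Count every adjacent symbol pair in the corpus.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append((a, b))
        # Replace every adjacent a, b with the merged symbol a+b.
        for i, seq in enumerate(seqs):
            out, j = [], 0
            while j < len(seq):
                if j < len(seq) - 1 and seq[j] == a and seq[j + 1] == b:
                    out.append(a + b)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[i] = out
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest"], 3))
```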

20
Q

What is a token learner

A

takes a raw training corpus and induces a vocabulary
(inventory of tokens)

21
Q

What is a token segmenter

A

Takes a raw test sentence (i.e. the input) and tokenises it according to that vocabulary

22
Q

What is the result of a BPE token learner

A

Most words and subwords (e.g. affixes) will be represented as full symbols
{ing, ed, er}
Very rare tokens (including unknown words) will be represented by their parts (subwords)
We can control k (the number of merges) as a parameter, depending on how many symbols we want in the vocabulary

23
Q

What is a BPE token segmenter

A

On new data, run each merge learned from the training data greedily, in the order we learned them
– Merge every ‘e’ ‘r’ to ‘er’ first, then merge ‘er’ ‘_’ to ‘er_’

Words that have not been seen before will be represented by subtokens, e.g. ‘low’ ‘er’

BPE reduces the number of unseen (OOV) tokens
BPE reduces the number of unseen tokens (OOV)