TEXT PREPROCESSING Flashcards
Before we start text preprocessing, what is it important to think about
What genres and domains the text is in
Genres: social media/emails/literature
Domain: Chemistry/politics/entertainment
Because we may need domain-specific resources
Things like format and punctuation will change depending on these
What is tokenisation
Breaking input into individual units of text (tokens)
Easiest way is to use whitespace
Issue with whitespace
Some languages (e.g. Chinese, Japanese) do not separate words with whitespace, so whitespace splitting will not tokenise them correctly
What are the typical tokenisation steps
- Initial segmentation (whitespace)
- Handling abbreviations and apostrophes
- Handling hyphenation
- Dealing with other special expressions (e.g. URLs)
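The steps above can be sketched in Python; the regex patterns and the URL/abbreviation handling below are illustrative assumptions, not a standard algorithm:

```python
import re

def tokenise(text):
    """Rough word tokeniser: whitespace split, then a few special cases.
    The patterns are illustrative only, not a standard algorithm."""
    tokens = []
    for chunk in text.split():
        # Keep URLs intact (peeling any trailing punctuation off first)
        if chunk.startswith(("http://", "https://")):
            url = chunk.rstrip(".,;:!?")
            tokens.append(url)
            tokens.extend(chunk[len(url):])
            continue
        # Keep abbreviations like "e.g." or "U.K." intact
        if re.fullmatch(r"(\w\.){2,}", chunk):
            tokens.append(chunk)
            continue
        # Otherwise split leading/trailing punctuation into their own tokens
        pre, core, post = re.fullmatch(r"(\W*)(.*?)(\W*)", chunk).groups()
        tokens.extend(list(pre) + ([core] if core else []) + list(post))
    return tokens

print(tokenise("See https://example.com, e.g. don't stop!"))
# ['See', 'https://example.com', ',', 'e.g.', "don't", 'stop', '!']
```

Note that the apostrophe in "don't" survives because the core pattern allows internal punctuation; only leading and trailing punctuation is split off.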
Traditional NLP tokenisation
Uses simple word-like tokens
Modern NLP tokenisation
Subword tokenisation
What is normalisation
Process of standardizing and transforming text data to a common, consistent format
Consists of:
Lowercasing
Removing Punctuation
Removing Stop Words: removing common words (e.g., “the,” “and,” “is”)
Stemming
Lemmatization
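A minimal sketch of the lowercasing, punctuation-removal, and stop-word steps (the stop-word list is a tiny illustrative sample, not a standard list):

```python
import string

STOP_WORDS = {"the", "and", "is", "a", "an", "to"}  # tiny illustrative list

def normalise(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stop words
    return [t for t in text.split() if t not in STOP_WORDS]

print(normalise("The cat AND the hat!"))  # ['cat', 'hat']
```

Stemming or lemmatisation would then be applied to the surviving tokens.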
What is lemmatisation
reduction to “dictionary headword” form
(lemma)
{am, are, is} -> be
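A toy lemmatiser sketch; a real lemmatiser (e.g. a WordNet-based one) uses a full dictionary plus part-of-speech information, so the lookup table here is purely illustrative:

```python
# Illustrative lemma table only -- real lemmatisers use a full dictionary
# and part-of-speech tags to pick the right headword.
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "were": "be",
          "better": "good", "mice": "mouse"}

def lemmatise(token):
    # Fall back to the lowercased token if no lemma is known
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatise(w) for w in ["Am", "are", "is"]])  # ['be', 'be', 'be']
```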
What is morphological analysis
Words are not the only units of ‘meaning’
subwords, called morphemes, carry some meaning
(un) (happy) (ness)
(prefix) (stem) (suffix)
So morphological analysis is taking a word and identifying its morphemes: the stem and any affixes
This can depend on the context
What is a ‘derivation’
formation of a word from its stem plus affixes (prefixes and suffixes)
eg Un-Happi-Ness
Friend-ly
What is inflection
The modification of a word to express different grammatical roles
eg come -> came
waiter -> waitress
What is regular inflectional morphology
Changes that occur to express grammatical features like tense, number, gender, case
Easy and predictable
What is derivational morphology
Creation of new words by adding affixes or other modifications
often changes the word class
teach -> teacher
What is stemming
Chops “ends of words”
removes suffixes, sometimes prefixes
quite quick
Can largely neutralise inflection and some derivation
Can yield non-words
What is under-stemming
fails to conflate related terms
divide -> divid
division -> divis
What is over-stemming
conflates unrelated terms
neutron, neutral -> neutr
What is the Porter stemmer
One of the most common stemmers for English
rules based
focuses on suffix stripping
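A crude suffix-stripper in the spirit of Porter's approach; the real Porter stemmer applies staged rules with measure conditions on the stem, so this toy version only illustrates the idea (the suffix list is an illustrative assumption):

```python
def crude_stem(word):
    """Toy suffix-stripper. Strips the first matching suffix if at
    least 3 characters of stem would remain. The real Porter stemmer
    uses staged, measure-conditioned rules instead."""
    for suffix in ("ational", "iveness", "fulness", "ization",
                   "ness", "ment", "ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(crude_stem("teaching"))   # 'teach'
print(crude_stem("divided"))    # 'divid'  (a non-word, as noted above)
print(crude_stem("happiness"))  # 'happi'
```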
What is character n-gram tokenisation
Breaks down text into “n” length substring tokens
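A minimal sketch:

```python
def char_ngrams(text, n):
    """All length-n character substrings of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```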
What is Byte-pair encoding
We add '_' at the end of every word to mark where words are separated
Let the initial vocabulary {A,B,C…a,b,c…} be the set of individual characters in the corpus
Choose the two symbols which are the most frequently adjacent in the corpus eg ‘e’‘r’
Add a new symbol ‘er’
{…x,y,z,er}
replace all ‘e’‘r’ with ‘er’
Repeat until k merges have been done
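The learner loop above can be sketched as follows (the toy corpus is illustrative, and tie-breaking between equally frequent pairs is arbitrary here):

```python
from collections import Counter

def bpe_learn(corpus_words, k):
    """BPE token learner sketch: start from single characters (plus a
    '_' end-of-word marker) and perform k merges of the most frequent
    adjacent symbol pair."""
    words = [list(w) + ["_"] for w in corpus_words]
    merges = []
    for _ in range(k):
        # Count every adjacent symbol pair across the corpus
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every adjacent occurrence of (a, b) with the symbol a+b
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

merges, segmented = bpe_learn(["er", "er", "era"], k=2)
print(merges)     # [('e', 'r'), ('er', '_')]
print(segmented)  # [['er_'], ['er_'], ['er', 'a', '_']]
```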
What is a token learner
takes a raw training corpus and induces a vocabulary
(inventory of tokens)
What is a token segmenter
takes raw test data (i.e. an input sentence) and tokenises it according to that vocabulary
What is the result of a BPE token learner
Most words and frequent subwords (e.g. affixes) will be represented as full symbols
{ing, ed, er}
Very rare tokens (including unknown words) will be represented by their parts (subwords)
Can ‘control’ k (the number of merges) as a parameter depending on how many symbols we want in the vocabulary
What is a BPE token segmenter
On new data, run each merge learned from the training data greedily, in the order the merges were learned
– Merge every 'e' 'r' to 'er' first, then merge 'er' '_' to 'er_'
Words that have not been seen before will be represented by subtokens, e.g. 'low' 'er'
BPE reduces the number of unseen, out-of-vocabulary (OOV) tokens
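A matching segmenter sketch, applying the learned merges greedily and in order; the merge list below is a hand-picked example rather than one learned from a real corpus:

```python
def bpe_segment(word, merges):
    """BPE token segmenter sketch: split a new word into characters
    (plus the '_' marker), then apply each learned merge greedily,
    in the order the merges were learned."""
    symbols = list(word) + ["_"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Example merges, as if 'e r' and then 'er _' were the most frequent pairs
merges = [("e", "r"), ("er", "_")]
print(bpe_segment("lower", merges))  # ['l', 'o', 'w', 'er_']
```

Note how the unseen word "lower" falls back to single characters plus the learned subtoken 'er_'.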