Tokenization - Week 2 Flashcards
Tokenization
Break input into basic units
Words, numbers, punctuation, emoji
No firm rules; choices should be consistent with the rest of the NLP system
It’s about knowing when to split, not when to combine
- Avoid over-segmentation
Tokenization is only the first step and should be simple, but it affects all later steps
Does whitespace always work for tokenization?
No - not all languages use spaces between tokens
Other complications include right-to-left / mixed-direction languages like Arabic
Japanese uses several writing systems (kanji, hiragana, katakana) rather than a single alphabet, and writes without spaces
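A quick toy example (Python, my own illustration) of where whitespace splitting breaks down:
```python
# Whitespace splitting falls short even for English:
# punctuation and contractions stick to the words.
text = "I can't believe it's already 5pm in San Francisco!"
print(text.split())
# ['I', "can't", 'believe', "it's", 'already', '5pm', 'in', 'San', 'Francisco!']

# And it produces nothing useful for Japanese ("I am a cat"), which has no spaces:
print("私は猫です".split())  # ['私は猫です'] - the whole sentence as one "token"
```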
Tokenization things to deal with
- Hyphens
- Should places be one token, e.g. “San Francisco”?
- “In spite of” - 1, 2, or 3 tokens?
- Abbreviated forms: can’t, what’re, I’m
- Expanding apostrophes is problematic because of the genitive: “King’s speech” shouldn’t become “King is speech” (see the sketch after this card)
- Decisions depend on the context and what you want to do
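A minimal regex tokenizer sketch, assuming Penn-Treebank-style handling of enclitics; the pattern and function name are my own illustration, not a standard API. Keeping enclitics as separate tokens (rather than expanding them) sidesteps the “King’s = King is?” ambiguity at this stage:
```python
import re

PATTERN = re.compile(r"""
      \w+(?=n't)              # stem before "n't", e.g. "ca" in "can't"
    | n't                     # the negation enclitic itself
    | '(?:s|re|m|ll|ve|d)\b   # other enclitics: 's 're 'm 'll 've 'd
    | \w+                     # ordinary word characters
    | [^\w\s]                 # any other punctuation mark, one token each
""", re.VERBOSE)

def tokenize(text):
    return PATTERN.findall(text)

print(tokenize("I can't read the King's speech."))
# ['I', 'ca', "n't", 'read', 'the', 'King', "'s", 'speech', '.']
```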
Typical tokenisation steps
- Initial Segmentation
- Handling abbreviations and apostrophes
- Handling hyphenation
- Dealing with (other) special expressions
All punctuation that is not part of an acronym should be a separate token?
True
Types of apostrophe
Quotative Markers - ‘All Quiet on the Western Front’
Genitive markers - e.g. Oliver’s book
Enclitics - e.g. she’s → “she has” or “she is”
Enclitics
Abbreviated forms, typically of auxiliary verbs (be, will), that are pronounced with so little emphasis that they are shortened and form part of the preceding word
e.g. I’m, you’re she’s, can’t
Lexical hyphens
A true hyphen used in compound words which have made their way into the standard vocabulary (and should be kept)
meta-analysis
multi-disciplinary
self-assessment
Some hyphens are not lexical and may be split, e.g. “UK-based” (see the sketch below)
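One possible way to operationalise this, as a sketch - the lexicon below is a hypothetical stand-in for a real dictionary of established compounds:
```python
LEXICAL_HYPHENS = {"meta-analysis", "multi-disciplinary", "self-assessment"}

def split_hyphenated(token):
    if token.lower() in LEXICAL_HYPHENS:
        return [token]        # lexical hyphen: keep the compound whole
    return token.split("-")   # non-lexical hyphen: split it

print(split_hyphenated("meta-analysis"))  # ['meta-analysis']
print(split_hyphenated("UK-based"))       # ['UK', 'based']
```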
Tokenization special expression examples
- emails
- dates
- numbers (written differently in different languages)
- measures
- vehicle license numbers
- …
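A sketch of recognising such expressions before general tokenization, so they survive as single tokens; the patterns below are simplified illustrations, not production-grade validators:
```python
import re

SPECIAL = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),              # rough email shape
    re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),              # dates like 03/05/2025
    re.compile(r"\d+(?:[.,]\d+)*\s?(?:km|kg|mph|GB)\b"), # simple measures
]

def find_special(text):
    """Collect substrings that should stay as single tokens."""
    return [m.group() for p in SPECIAL for m in p.finditer(text)]

print(find_special("Email jo@example.com about the 10 km run on 03/05/2025."))
# ['jo@example.com', '03/05/2025', '10 km']
```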
Truecasing
Lowercase words at the beginning of sentences; leave mid-sentence capitalised words as they are
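A minimal sketch of that heuristic, assuming sentences can be split on terminal punctuation (real systems use a proper sentence segmenter):
```python
import re

def truecase(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Lowercase only the first character of each sentence.
    return " ".join(s[0].lower() + s[1:] for s in sentences if s)

print(truecase("The cat sat. He saw the US flag."))
# 'the cat sat. he saw the US flag.'  - "US" keeps its informative capitals
```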
Case folding
Should we truecase, or just lowercase everything? Some proper nouns are only identifiable by their casing, e.g. “US” (the country) would become “us” (the pronoun)
Search engines: users usually lowercase everything regardless of the correct case of words, so lowercasing everything is a practical solution
Identification of entity names (e.g. organisation or people names): preserving capitals would make sense
Also consider special characters, like accents, emojis, etc…
Users might not use an accent when googling something, even if there should technically be one
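A sketch of aggressive, search-style normalisation combining case folding with accent stripping - one plausible implementation, not the prescribed one:
```python
import unicodedata

def fold(text):
    text = text.casefold()  # like lower(), but more aggressive (e.g. 'ß' -> 'ss')
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining accent marks left over after decomposition,
    # so a query typed as "cafe" still matches "Café".
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold("Café in the US"))  # 'cafe in the us' - note "US" becomes "us"
```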
Language model
A function (often probabilistic) that assigns a probability to a piece of text, such that ‘natural’ pieces get larger probability
P(“the the the the the”) - pretty small
P(“the cat in the hat”) - pretty big
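A toy bigram language model sketch that reproduces this intuition; the corpus and the smoothing floor are made up purely for illustration:
```python
from collections import Counter

corpus = "<s> the cat in the hat sat on the mat </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(text, floor=0.001):
    words = ["<s>"] + text.split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        # Maximum-likelihood estimate, with a tiny floor for unseen bigrams.
        p *= bigrams[(prev, w)] / unigrams[prev] if bigrams[(prev, w)] else floor
    return p

print(prob("the cat in the hat"))   # ~0.11 - a 'natural' word sequence
print(prob("the the the the the"))  # 1e-12 - "the the" never occurs
```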
Model
A “simplified”, abstract representation of something, often in a computational form
e.g. tossing a coin
e.g. probabilistic model of weather forecast
BOW
A Bag Of Words representation reduces each document to a bag (unordered multiset) of its words
Variants: bag of terms, bag of tokens, bag of stems
Problems:
- Meaning is lost without order, e.g. negations
- Not all words are equally important
- Meaning is lost without context, introduces ambiguities
- Doesn’t work for all languages
It is, however, efficient
Could skip stop words, or rank words by a metric (see the sketch after this card)
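A minimal bag-of-words sketch, showing both the representation and the order-loss problem:
```python
from collections import Counter

# Order is thrown away; only counts remain.
def bow(text):
    return Counter(text.lower().split())

print(bow("the cat in the hat"))
# Counter({'the': 2, 'cat': 1, 'in': 1, 'hat': 1})

# The order-loss problem in action: opposite meanings, identical bags.
print(bow("good not bad") == bow("bad not good"))  # True
```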
Zipf’s Law
The frequency of any word in a given collection is inversely proportional to its rank in the frequency table - i.e. the second most frequent word occurs about half as often as the most frequent one
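A quick numeric illustration with made-up counts:
```python
# Made-up Zipfian counts: frequency x rank should stay roughly constant.
counts = {"the": 1000, "of": 500, "and": 333, "to": 250, "a": 200}
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"{word:>4}  rank={rank}  freq={freq}  freq*rank={freq * rank}")
# freq*rank hovers around 1000 for every word, as Zipf's law predicts.
```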