Tokenization - Week 2 Flashcards
Tokenization
Break input into basic units
Words, numbers, punctuation, emoji
No firm rules; choices should be consistent with the rest of the NLP system
It’s about knowing when to split, not when to combine
- Avoid over-segmentation
Tokenization is only the first step and should be simple, but it affects all later steps
Does whitespace always work for tokenization?
No - not all languages use spaces between tokens
Other complications include right-to-left / mixed-direction languages like Arabic
Japanese uses several writing systems (kanji, hiragana, katakana) rather than a single alphabet, and writes without spaces
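A quick toy example (Python, my own illustration) of where whitespace splitting breaks down:
```python
# Whitespace splitting falls short even for English:
# punctuation and contractions stick to the words.
text = "I can't believe it's already 5pm in San Francisco!"
print(text.split())
# ['I', "can't", 'believe', "it's", 'already', '5pm', 'in', 'San', 'Francisco!']

# And it produces nothing useful for Japanese ("I am a cat"), which has no spaces:
print("私は猫です".split())  # ['私は猫です'] - the whole sentence as one "token"
```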
Tokenization things to deal with
- Hyphens
- Should places be one token, e.g. “San Francisco”?
- “In spite of” - 1, 2, or 3 tokens?
- Abbreviated forms: can’t, what’re, I’m
- Expanding apostrophes is problematic because of the genitive: “King’s speech” shouldn’t become “King is speech” (see the sketch after this card)
- Decisions depend on the context and what you want to do
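A minimal regex tokenizer sketch, assuming Penn-Treebank-style handling of enclitics; the pattern and function name are my own illustration, not a standard API. Keeping enclitics as separate tokens (rather than expanding them) sidesteps the “King’s = King is?” ambiguity at this stage:
```python
import re

PATTERN = re.compile(r"""
      \w+(?=n't)              # stem before "n't", e.g. "ca" in "can't"
    | n't                     # the negation enclitic itself
    | '(?:s|re|m|ll|ve|d)\b   # other enclitics: 's 're 'm 'll 've 'd
    | \w+                     # ordinary word characters
    | [^\w\s]                 # any other punctuation mark, one token each
""", re.VERBOSE)

def tokenize(text):
    return PATTERN.findall(text)

print(tokenize("I can't read the King's speech."))
# ['I', 'ca', "n't", 'read', 'the', 'King', "'s", 'speech', '.']
```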
Typical tokenisation steps
- Initial Segmentation
- Handling abbreviations and apostrophes
- Handling hyphenation
- Dealing with (other) special expressions
All punctuation that is not part of an acronym should be a separate token?
True
Types of apostrophe
Quotative Markers - ‘All Quiet on the Western Front’
Genitive markers - e.g. Oliver’s book
Enclitics - e.g. she’s → “she has” or “she is”
Enclitics
Abbreviated forms, typically of auxiliary verbs (be, will), that are pronounced with so little emphasis that they are shortened and form part of the preceding word
e.g. I’m, you’re she’s, can’t
Lexical hyphens
A true hyphen used in compound words which have made their way into the standard vocabulary (and should be kept)
meta-analysis
multi-disciplinary
self-assessment
Some hyphens are not lexical and may be split, e.g. “UK-based” (see the sketch below)
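One possible way to operationalise this, as a sketch - the lexicon below is a hypothetical stand-in for a real dictionary of established compounds:
```python
LEXICAL_HYPHENS = {"meta-analysis", "multi-disciplinary", "self-assessment"}

def split_hyphenated(token):
    if token.lower() in LEXICAL_HYPHENS:
        return [token]        # lexical hyphen: keep the compound whole
    return token.split("-")   # non-lexical hyphen: split it

print(split_hyphenated("meta-analysis"))  # ['meta-analysis']
print(split_hyphenated("UK-based"))       # ['UK', 'based']
```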
Tokenization special expression examples
- emails
- dates
- numbers (written differently in different languages)
- measures
- vehicle license numbers
- …
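A sketch of recognising such expressions before general tokenization, so they survive as single tokens; the patterns below are simplified illustrations, not production-grade validators:
```python
import re

SPECIAL = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),              # rough email shape
    re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),              # dates like 03/05/2025
    re.compile(r"\d+(?:[.,]\d+)*\s?(?:km|kg|mph|GB)\b"), # simple measures
]

def find_special(text):
    """Collect substrings that should stay as single tokens."""
    return [m.group() for p in SPECIAL for m in p.finditer(text)]

print(find_special("Email jo@example.com about the 10 km run on 03/05/2025."))
# ['jo@example.com', '03/05/2025', '10 km']
```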
Truecasing
Lowercase words at the beginning of sentences; leave mid-sentence capitalised words as they are
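A minimal sketch of that heuristic, assuming sentences can be split on terminal punctuation (real systems use a proper sentence segmenter):
```python
import re

def truecase(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Lowercase only the first character of each sentence.
    return " ".join(s[0].lower() + s[1:] for s in sentences if s)

print(truecase("The cat sat. He saw the US flag."))
# 'the cat sat. he saw the US flag.'  - "US" keeps its informative capitals
```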
Case folding
Should we truecase, or just lowercase everything? Some proper nouns are only identifiable by their casing, e.g. “US” (the country) would become “us” (the pronoun)
Search engines: users usually lowercase everything regardless of the correct case of words, so lowercasing everything is a practical solution
Identification of entity names (e.g. organisation or people names): preserving capitals would make sense
Also consider special characters, like accents, emojis, etc…
Users might not use an accent when googling something, even if there should technically be one
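A sketch of aggressive, search-style normalisation combining case folding with accent stripping - one plausible implementation, not the prescribed one:
```python
import unicodedata

def fold(text):
    text = text.casefold()  # like lower(), but more aggressive (e.g. 'ß' -> 'ss')
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining accent marks left over after decomposition,
    # so a query typed as "cafe" still matches "Café".
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold("Café in the US"))  # 'cafe in the us' - note "US" becomes "us"
```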
Language model
A function (often probabilistic) that assigns a probability to a piece of text, such that ‘natural’ pieces get larger probability
P(“the the the the the”) - pretty small
P(“the cat in the hat”) - pretty big
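A toy bigram language model sketch that reproduces this intuition; the corpus and the smoothing floor are made up purely for illustration:
```python
from collections import Counter

corpus = "<s> the cat in the hat sat on the mat </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(text, floor=0.001):
    words = ["<s>"] + text.split()
    p = 1.0
    for prev, w in zip(words, words[1:]):
        # Maximum-likelihood estimate, with a tiny floor for unseen bigrams.
        p *= bigrams[(prev, w)] / unigrams[prev] if bigrams[(prev, w)] else floor
    return p

print(prob("the cat in the hat"))   # ~0.11 - a 'natural' word sequence
print(prob("the the the the the"))  # 1e-12 - "the the" never occurs
```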
Model
A “simplified”, abstract representation of something, often in a computational form
e.g. tossing a coin
e.g. probabilistic model of weather forecast
BOW
A Bag Of Words representation reduces each document to a bag (unordered multiset) of its words
Variants: bag of terms, bag of tokens, bag of stems
Problems:
- Meaning is lost without order, e.g. negations
- Not all words are equally important
- Meaning is lost without context, introduces ambiguities
- Doesn’t work for all languages
It is, however, efficient
Could skip stop words, or rank words by a metric (see the sketch after this card)
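A minimal bag-of-words sketch, showing both the representation and the order-loss problem:
```python
from collections import Counter

# Order is thrown away; only counts remain.
def bow(text):
    return Counter(text.lower().split())

print(bow("the cat in the hat"))
# Counter({'the': 2, 'cat': 1, 'in': 1, 'hat': 1})

# The order-loss problem in action: opposite meanings, identical bags.
print(bow("good not bad") == bow("bad not good"))  # True
```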
Zipf’s Law
The frequency of any word in a given collection is inversely proportional to its rank in the frequency table - i.e. the second most frequent word occurs about half as often as the most frequent one
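A quick numeric illustration with made-up counts:
```python
# Made-up Zipfian counts: frequency x rank should stay roughly constant.
counts = {"the": 1000, "of": 500, "and": 333, "to": 250, "a": 200}
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"{word:>4}  rank={rank}  freq={freq}  freq*rank={freq * rank}")
# freq*rank hovers around 1000 for every word, as Zipf's law predicts.
```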