Tokenisation Flashcards
1
Q
Describe the task of tokenisation
A
Break a piece of text into individual tokens: words, punctuation or symbols, and numbers
2
Q
What is the initial approach to tokenisation?
A
Split the text on whitespace
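A minimal sketch of the whitespace approach in Python. The function name `whitespace_tokenise` and the sample sentence are made up for illustration.

```python
def whitespace_tokenise(text: str) -> list[str]:
    # str.split() with no arguments splits on any run of whitespace
    # and drops empty strings, so repeated spaces are harmless.
    return text.split()

tokens = whitespace_tokenise("Dr. Smith said they're  here.")
print(tokens)
```

Punctuation stays attached to the neighbouring word ("Dr.", "here."), which is why this is only a starting point.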
3
Q
Where does ambiguity come from in tokenisation?
A
Sentence-final full stops (“mary.”), abbreviations (“dr.”), and word-internal punctuation such as apostrophes (“they’re”)
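A small Python sketch of why the full stop is ambiguous. The helper `split_final_stop` is a hypothetical naive rule, not a recommended method: it treats every trailing full stop as a separate token.

```python
def split_final_stop(token: str) -> list[str]:
    # Naive rule: a trailing full stop always becomes its own token.
    if len(token) > 1 and token.endswith("."):
        return [token[:-1], "."]
    return [token]

sentence_end = split_final_stop("mary.")  # the stop ends the sentence
abbreviation = split_final_stop("dr.")    # the stop belongs to the abbreviation
print(sentence_end, abbreviation)
```

Both inputs get the same treatment, although only the first full stop is really a sentence boundary. Telling the two apart needs extra knowledge, for example a list of known abbreviations.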
4
Q
What are the three classes of token?
A
morphosyntactic word, punctuation or symbol, number
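One way to picture the three classes is a regular-expression tokeniser that labels each token. This is a rough sketch: the pattern, the label names and the function `classify` are assumptions for illustration, and "word" here is only a crude stand-in for a morphosyntactic word.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    (?P<number>\d+(?:[.,]\d+)*)               # digits, optional . or , groups
  | (?P<word>[^\W\d_]+(?:['’][^\W\d_]+)*)     # letters, optional inner apostrophes
  | (?P<symbol>[^\w\s]|_)                     # any other non-space character
""", re.VERBOSE)

def classify(text: str) -> list[tuple[str, str]]:
    # m.lastgroup is the name of the alternative that matched.
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

labelled = classify("they're 3.5 km away!")
print(labelled)
```

The number alternative is tried first so that "3.5" is kept together instead of being split at the decimal point.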
5
Q
What is the challenge of character encoding?
A
Deciding whether to support ASCII only or Unicode, which is needed to include emojis
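A short Python illustration of the trade-off, using a made-up sample string: an emoji cannot be encoded as ASCII at all, while UTF-8 can represent it at the cost of several bytes for one character.

```python
text = "good morning 🙂"

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # The emoji has no ASCII representation.
    ascii_ok = False

utf8_bytes = text.encode("utf-8")
print(ascii_ok)                    # False
print(len(text), len(utf8_bytes))  # 14 17
```

The string is 14 characters long but 17 bytes in UTF-8, because the emoji takes four bytes. A tokeniser has to decide whether it counts and splits characters or bytes.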
6
Q
What are the challenges of tokenisation? (7)
A
- character encoding
- transliterations
- poor OCR results
- writing systems
- hyphenation
- telephone numbers
- dates, decimals, money
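Several of the challenges above (hyphenation, telephone numbers, dates, decimals, money) come down to keeping a multi-part string together as one token. A hedged sketch in Python: the patterns below are simplified assumptions that cover only a few formats, and `SPECIAL` and `tokenise` are illustrative names.

```python
import re

SPECIAL = re.compile(r"""
    \d{1,2}/\d{1,2}/\d{2,4}      # dates such as 12/05/2021
  | [$£€]\d+(?:\.\d+)?           # money such as $4.50
  | \+?\d[\d\s-]{6,}\d           # telephone numbers with spaces or hyphens
  | \d+\.\d+                     # decimals
  | \w+(?:-\w+)*                 # words, including hyphenated ones
  | [^\w\s]                      # any remaining punctuation or symbol
""", re.VERBOSE)

def tokenise(text: str) -> list[str]:
    # Alternatives are tried in order, so the specific patterns
    # take priority over the generic word and punctuation ones.
    return SPECIAL.findall(text)

tokens = tokenise("Paid $4.50 on 12/05/2021, call 020 7946 0958.")
print(tokens)
```

Without the specific patterns, the date would break at each slash and the telephone number at each space. Real text has many more formats than these, so the ordering and coverage here should be read as an example only.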
7
Q
What challenges arise from domain dependence? (3)
A
- organism/species names/authorities, families, orders
- coordinates
- protein sequence, protein names
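Domain-specific tokens usually need domain-specific patterns placed ahead of the general ones. The sketch below handles just two of the cases above (coordinates and protein sequences). The coordinate format, the minimum sequence length of 10 and the names `DOMAIN` and `tokenise_domain` are all assumptions for illustration.

```python
import re

DOMAIN = re.compile(r"""
    \d+(?:\.\d+)?°\s?[NSEW]           # coordinate such as 51.5°N
  | [ACDEFGHIKLMNPQRSTVWY]{10,}       # long run of amino-acid letters
  | \w+                               # ordinary word or number
  | [^\w\s]                           # punctuation or symbol
""", re.VERBOSE)

def tokenise_domain(text: str) -> list[str]:
    return DOMAIN.findall(text)

tokens = tokenise_domain("Sampled at 51.5°N: MKTAYIAKQRQISFVK")
print(tokens)
```

A general-purpose tokeniser would split the coordinate at the decimal point and the degree sign. Species names and their authorities are harder still, since they look like ordinary words and typically need a lexicon.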