Tokenisation Flashcards

1
Q

Describe the task of tokenisation.

A

Break a piece of text into individual tokens: words, punctuation marks or symbols, and numbers.

2
Q

What is the initial approach to tokenisation?

A

Split the text on whitespace.
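
A minimal sketch of this baseline in Python, using the built-in `str.split`. The sample sentence is an illustrative assumption, not taken from the deck:

```python
# Whitespace tokenisation: split on runs of spaces, tabs, and newlines.
text = "Dr. Smith said they're late."  # illustrative sample sentence
tokens = text.split()
print(tokens)  # ['Dr.', 'Smith', 'said', "they're", 'late.']
```

Note that punctuation stays attached to the neighbouring word ('Dr.', 'late.'), which is why whitespace splitting is only a starting point.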

3
Q

Where does ambiguity come from in tokenisation?

A

Sentence-final full stops (“Mary.”), abbreviations (“Dr.”), and word-internal punctuation such as apostrophes (“they’re”).
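
The full-stop ambiguity can be sketched in Python. The helper `split_trailing_stop` and the sample sentence are hypothetical, written only to show that one simple rule cannot be right for both cases:

```python
def split_trailing_stop(token):
    """Naive rule (an assumption for illustration): any trailing
    full stop is a separate sentence-ending token."""
    if len(token) > 1 and token.endswith("."):
        return [token[:-1], "."]
    return [token]

tokens = []
for raw in "Dr. Jones met Mary.".split():
    tokens.extend(split_trailing_stop(raw))

print(tokens)  # ['Dr', '.', 'Jones', 'met', 'Mary', '.']
```

The rule is correct for "Mary." but wrongly detaches the full stop that belongs to the abbreviation "Dr.".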

4
Q

What are the three classes of token?

A

Morphosyntactic words, punctuation marks or symbols, and numbers.
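
A rough sketch of sorting tokens into these three classes with Python's `re` module. The pattern `TOKEN_RE` and the function `classify` are hypothetical, and the "word" branch only approximates the notion of a morphosyntactic word (it keeps "They're" as a single token):

```python
import re

# Illustrative pattern: try numbers first, then words (letters with an
# optional internal apostrophe), then any other non-space character.
TOKEN_RE = re.compile(
    r"(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<word>[^\W\d_]+(?:['’][^\W\d_]+)*)"
    r"|(?P<symbol>[^\w\s]|_)"
)

def classify(text):
    """Return (token, class) pairs; whitespace is skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]

pairs = classify("They're paying 3.50!")
print(pairs)
# [("They're", 'word'), ('paying', 'word'), ('3.50', 'number'), ('!', 'symbol')]
```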

5
Q

What is the challenge of character encoding?

A

Deciding whether to support ASCII only or Unicode, which is needed to handle emojis.
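
A small Python sketch of the trade-off. The sample string is an illustrative assumption; the emoji is written as an escape so the snippet does not depend on the file's own encoding:

```python
text = "great \U0001F44D"  # "great" followed by a thumbs-up emoji

# An ASCII-only pipeline cannot represent the emoji at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With Unicode (here UTF-8), the emoji is one code point but four bytes.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)                    # False
print(len(text), len(utf8_bytes))  # 7 10
```

A tokeniser that works on bytes rather than code points would therefore see the emoji as four separate units.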

6
Q

What are the challenges of tokenisation? (7)

A
  • character encoding
  • transliterations
  • poor OCR results
  • writing systems
  • hyphenation
  • telephone numbers
  • dates, decimals, money
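
Several of the items above (hyphenation, telephone numbers, dates) come down to punctuation that belongs inside a token. A minimal Python sketch, assuming a naive "words or single punctuation marks" pattern and an invented sample sentence:

```python
import re

# Naive rule (an assumption for illustration): a token is either a run
# of word characters or one non-space punctuation character.
sample = "Call 555-0100 by 12/05/2021."
naive = re.findall(r"\w+|[^\w\s]", sample)

print(naive)
# ['Call', '555', '-', '0100', 'by', '12', '/', '05', '/', '2021', '.']
```

The telephone number and the date are each broken into several fragments, so a practical tokeniser needs dedicated patterns for such cases.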
7
Q

What challenges arise from domain dependence? (3)

A
  • organism/species names/authorities, families, orders
  • coordinates
  • protein sequence, protein names