Character Encoding Flashcards
1
Q
Unicode
A
- attempt to represent all text from all languages in a single standard
- to support electronic rendering of all texts and symbols
- each grapheme is assigned a unique number, called a code point
- allow different orthographies to co-exist in a single document
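Example (a minimal sketch, assuming Python 3, where strings are sequences of Unicode code points; the sample characters are chosen only for illustration):

```python
# ord() gives the code point of a character, chr() goes the other way.
text = "aé日"
code_points = [ord(ch) for ch in text]

# Code points are conventionally written as U+ followed by hex digits.
labels = [f"U+{cp:04X}" for cp in code_points]

# Different orthographies can sit side by side in one string:
# Latin letters, Hiragana (U+3042) and Thai (U+0E01).
mixed = "abc" + chr(0x3042) + chr(0x0E01)

print(code_points)
print(labels)
print(mixed, len(mixed))
```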
2
Q
Text documents
A
- represented as series of numbers
- simplest form of encoding is fixed precision, i.e. a fixed number of digits (bits) represents the code point of each character
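Example (a rough sketch of fixed-precision encoding, assuming 8 binary digits per character; the width of 8 is an illustrative choice and only works for code points below 256):

```python
text = "Hi!"

# The text as a series of numbers (code points).
numbers = [ord(ch) for ch in text]

# Fixed precision: every number is written with exactly 8 binary digits.
fixed = [format(n, "08b") for n in numbers]
bits = "".join(fixed)

# Because every character has the same width, decoding is just
# cutting the bit string into equal slices.
decoded = "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

print(numbers)
print(fixed)
print(decoded)
```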
3
Q
ASCII
A
- 128 characters
- 7-bit
- compact, but can only encode a small number of characters
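Example (a minimal sketch using Python's built-in "ascii" codec; the sample strings are illustrative):

```python
# ASCII text encodes to one byte per character, each value below 128 (7 bits).
data = "Hello".encode("ascii")
all_7bit = all(b < 128 for b in data)

# Anything outside the 128 ASCII characters cannot be encoded.
try:
    "café".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(list(data), all_7bit, ascii_ok)
```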
4
Q
UTF-32
A
- 32-bit, fixed-width encoding
- can encode all Unicode characters, but bloated (4 bytes per character, even for plain ASCII text)
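Example (a small sketch comparing sizes; it uses Python's "utf-32-be" codec, since plain "utf-32" would also prepend a 4-byte byte order mark):

```python
text = "hello"
utf8_size = len(text.encode("utf-8"))
utf32_size = len(text.encode("utf-32-be"))

# Every character takes exactly 4 bytes, whatever its code point.
sizes = [len(ch.encode("utf-32-be")) for ch in "Aé日😀"]

# The 4 bytes are simply the code point written as a 32-bit number.
same_number = int.from_bytes("日".encode("utf-32-be"), "big") == ord("日")

print(utf8_size, utf32_size, sizes, same_number)
```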
5
Q
ISO-8859
A
- family of single-byte encodings built on top of ASCII, each part adding up to 128 extra characters
- can represent small orthographies such as Thai, but cannot support large orthographies, e.g. Japanese
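Example (a sketch using the ISO-8859 codecs that ship with Python; the parts picked here, -1 for Western European and -11 for Thai, are illustrative):

```python
# The lower 128 values are plain ASCII in every part of the family.
ascii_same = "abc".encode("iso-8859-1") == "abc".encode("ascii")

# The upper 128 values hold the extra characters, still one byte each.
latin = "é".encode("iso-8859-1")
thai = "ก".encode("iso-8859-11")

# 256 slots are far too few for an orthography like Japanese.
try:
    "日本語".encode("iso-8859-1")
    japanese_ok = True
except UnicodeEncodeError:
    japanese_ok = False

print(ascii_same, latin, thai, japanese_ok)
```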
6
Q
Variable-width encoding
A
- number of bytes per character varies
- encode code points using a variable number of code units of fixed size
- e.g. UTF-8, UTF-16
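Example (a minimal sketch counting code units per code point; `code_units` is a hypothetical helper, and the code unit sizes of 1 byte for UTF-8 and 2 bytes for UTF-16 are passed in explicitly):

```python
def code_units(ch, encoding, unit_size):
    """Number of fixed-size code units needed to encode one character."""
    return len(ch.encode(encoding)) // unit_size

# (UTF-8 units, UTF-16 units) for each sample character.
counts = {
    ch: (code_units(ch, "utf-8", 1), code_units(ch, "utf-16-be", 2))
    for ch in ["A", "é", "日", "😀"]
}

for ch, (utf8, utf16) in counts.items():
    print(ch, "UTF-8 units:", utf8, "UTF-16 units:", utf16)
```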
7
Q
UTF-8
A
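- variable-width encoding with 8-bit code units (1 to 4 bytes per code point)
- compatible with ASCII, superset of ASCII (ASCII text is already valid UTF-8)
- character boundaries easily locatable: continuation bytes always start with the bits 10
- widely used to store and transmit Unicode strings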
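Example (a small sketch of locating character boundaries in UTF-8 bytes; `is_continuation` is a hypothetical helper that checks whether the top two bits of a byte are 10):

```python
def is_continuation(byte):
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000

data = "aé日".encode("utf-8")

# Any byte that is not a continuation byte starts a new character.
starts = [i for i, b in enumerate(data) if not is_continuation(b)]

# ASCII text has identical bytes in ASCII and UTF-8.
ascii_compatible = "abc".encode("utf-8") == "abc".encode("ascii")

print(data.hex(" "), starts, ascii_compatible)
```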
8
Q
Declaring character encoding
A
- manually: specify the character encoding in the document, e.g. charset=ISO-8859-8
- automatically: software detects the character encoding based on compatibility, user preferences, or a statistical model
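Example (a rough sketch, not a real detector: `declared_charset` and `decode` are hypothetical helpers, the header string is made up for illustration, and the fallback simply tries UTF-8 before ISO-8859-1 as a crude stand-in for compatibility-based detection; no statistical model is involved):

```python
import re

def declared_charset(header):
    """Pull the charset name out of a declaration such as 'charset=ISO-8859-8'."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", header, re.IGNORECASE)
    return match.group(1) if match else None

def decode(raw, header=""):
    """Use the declared encoding if there is one, otherwise guess."""
    charset = declared_charset(header)
    if charset:
        return raw.decode(charset)
    try:
        return raw.decode("utf-8")       # strict, so invalid input fails fast
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")  # accepts any byte sequence

hebrew = "שלום".encode("iso-8859-8")
print(decode(hebrew, "text/html; charset=ISO-8859-8"))
print(decode("héllo".encode("utf-8")))
```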