Unicode Flashcards
writing system/script
a system of more or less permanent marks used to represent an utterance in such a way that it can be recovered without the intervention of the utterer
character
the smallest component of a writing system that has a semantic value
grapheme
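the smallest contrastive unit in a writing system (the written counterpart of the phoneme, which is the smallest sound unit in the spoken language)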
glyphs
the visual representation of a character as it is displayed (a font is a collection of glyphs)
Unicode
a universal character encoding designed to embrace all the world’s languages; it is emerging as the gold standard for text encoding
design principles of Unicode
universality, efficiency, characters not glyphs, semantics, plain text, logical order, unification, dynamic composition, stability, convertibility
surrogate pairs
an extension mechanism in UTF-16 that represents a code point above U+FFFF as two 16-bit values.
the first value = high surrogate (U+D800 to U+DBFF)
the second value = low surrogate (U+DC00 to U+DFFF)
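The surrogate calculation can be sketched in a few lines of Python. This is an illustrative sketch, not part of the flashcards; the function name to_surrogate_pair is hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00
```

The result matches what Python's built-in UTF-16 encoder produces for the same character.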
what are the advantages and disadvantages of UTF-8
advantages:
- existing ASCII files are already valid UTF-8
- most broadly supported encoding form today
disadvantages:
- ideographic characters require 3 bytes each, so UTF-8 text in those languages is larger than in most existing encodings
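Both points can be checked with Python's built-in codecs. This is a small sketch; the sample characters and the choice of GB2312 as the "existing encoding" are assumptions made for illustration.

```python
# ASCII text has the same bytes in ASCII and in UTF-8.
ascii_text = "hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Bytes per character in UTF-8: Latin "a", accented "e", a CJK ideograph, an emoji.
samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The same two-character Chinese string in UTF-8 and in the legacy GB2312 encoding.
cjk_text = "\u4e2d\u6587"
utf8_size = len(cjk_text.encode("utf-8"))
gb2312_size = len(cjk_text.encode("gb2312"))

print(same_bytes, utf8_lengths, utf8_size, gb2312_size)  # True [1, 2, 3, 4] 6 4
```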
what are the advantages and disadvantages of UTF-16
advantages:
- allows every Unicode code point to be mapped into one or two 16-bit code units (characters in the Basic Multilingual Plane need only one)
disadvantages:
- Latin text is twice as large as in single-byte encodings
- not backward/forward compatible with ASCII, so programs that expect single-byte character sets won’t work with UTF-16
what are the advantages and disadvantages of UTF-32
advantages:
- simple: every code point maps to a single fixed-length (32-bit) code unit
disadvantages:
- Latin text is four times as large as in single-byte encodings
- not backward/forward compatible with ASCII, so programs that expect single-byte character sets won’t work with UTF-32
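The size trade-offs for Latin text can be compared directly. This is a rough sketch; the helper name encoded_sizes is hypothetical, and the "-le" codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of Latin text under several encodings."""
    encodings = ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


sizes = encoded_sizes("Unicode")  # 7 Latin characters
print(sizes)
# {'latin-1': 7, 'utf-8': 7, 'utf-16-le': 14, 'utf-32-le': 28}
```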
encoding model
3-level model:
- abstract character repertoire
- code space
- encoding forms
code space
a mapping from the abstract character repertoire to a set of integers, where each integer in the set is known as a code point
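In Python the character-to-integer mapping is exposed through the built-ins ord and chr. The three sample characters below are chosen only for illustration.

```python
# ord() maps a character to its code point; chr() maps a code point back.
samples = ["A", "\u00e9", "\u4e2d"]
code_points = [f"U+{ord(ch):04X}" for ch in samples]
print(code_points)  # ['U+0041', 'U+00E9', 'U+4E2D']

# The mapping round-trips: the integer identifies the same abstract character.
round_trip = [chr(ord(ch)) for ch in samples]
```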
encoding forms
once the mapping from the abstract character set to a set of integers is defined, further mappings are required:
the character encoding form & the character encoding scheme
character encoding form
a mapping from a set of integers to a set of sequences of code units of specified width
character encoding scheme
a mapping from a set of sequences of code units to a serialised sequence of bytes
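The difference between the two mappings shows up when UTF-16 text is serialised in different byte orders. This is a minimal sketch using Python's standard codecs; the sample string is an assumption for illustration.

```python
import struct

text = "A\U0001F600"  # one BMP character and one supplementary character

# Encoding form: code points -> 16-bit code units (here UTF-16).
utf16_be = text.encode("utf-16-be")
code_units = list(struct.unpack(f">{len(utf16_be) // 2}H", utf16_be))
print([f"{unit:04X}" for unit in code_units])  # ['0041', 'D83D', 'DE00']

# Encoding scheme: code units -> bytes, where the byte order has to be fixed.
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")
print(big_endian.hex())     # 0041d83dde00
print(little_endian.hex())  # 41003dd800de
```

The code units are identical in both cases; only their serialisation into bytes differs.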
challenges of character encoding
generality, character set specification, hardware issues, variable/fixed width, interoperability