slides26 Flashcards
not much here
very sad if russell asks a presentation layer question at the exam
UCS
UCS (ISO 10646) is a character encoding that uses 31 bits instead of ASCII's 7
This gives ample room for all the characters in all the written languages in the world
It is a big table that says “this value represents this character”
Unicode takes UCS and adds details such as the direction of writing (left-to-right, right-to-left or bidirectional), alphabetic ordering, which characters are capital letters, and so on
how many code points does Unicode use
Unicode only uses UCS values from 0 to 10FFFF: 17 × 2^16 = 1,114,112 code points
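A quick check of that arithmetic, as a minimal sketch: 17 blocks of 2^16 values each, counted from 0, should end exactly at 10FFFF. The variable names are only for illustration.

```python
# 17 blocks of 2**16 values each, numbered from 0
BLOCKS = 17
VALUES_PER_BLOCK = 2 ** 16

total = BLOCKS * VALUES_PER_BLOCK
last_code_point = total - 1

print(total)                 # 1114112
print(hex(last_code_point))  # 0x10ffff
```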
what is a glyph
A glyph is the visible rendering of a grapheme in some font
Unicode Transformation Format 32 (UTF-32)
simply uses four bytes per character and embeds ASCII in UCS by merely adding three 0 bytes before every ASCII byte
cat in ASCII is three bytes: 99 97 116
cat in UTF-32 is 12 bytes: 0 0 0 99 0 0 0 97 0 0 0 116
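The "cat" example can be reproduced with Python's built-in codecs. This sketch assumes big-endian byte order (the `utf-32-be` codec, which writes no byte order mark), since that is the order in which the three 0 bytes come before each ASCII byte.

```python
text = "cat"

# ASCII: one byte per character
ascii_bytes = list(text.encode("ascii"))

# UTF-32, big-endian, no byte order mark: four bytes per character
utf32_bytes = list(text.encode("utf-32-be"))

print(ascii_bytes)  # [99, 97, 116]
print(utf32_bytes)  # [0, 0, 0, 99, 0, 0, 0, 97, 0, 0, 0, 116]
```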
UCS-2
Less inflationary is UCS-2, which uses two bytes per character and prepends a single 0 byte before each ASCII character
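Python has no codec named UCS-2, so the following is a hand-rolled sketch rather than a library call. `ucs2_encode` is a hypothetical helper that assumes big-endian order and simply refuses anything that does not fit in two bytes.

```python
import struct

def ucs2_encode(text):
    """Encode text as big-endian UCS-2: exactly two bytes per character."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Two bytes cannot hold values above FFFF
            raise ValueError(f"U+{cp:X} does not fit in UCS-2")
        out += struct.pack(">H", cp)  # ">H" = big-endian unsigned 16-bit
    return bytes(out)

print(list(ucs2_encode("cat")))  # [0, 99, 0, 97, 0, 116]
```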
UTF-16
UTF-16 can represent all Unicode values, but at the cost of some complexity
It uses pairs of 16 bit values in the range D800 to DFFF (surrogate pairs) to encode the extended values
The surrogate values (and which is high and low) can easily be identified in a byte stream: important if you are dipping into the middle of a string
It does punch a hole in Unicode from D800 to DFFF that can’t be used as characters
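A minimal sketch of how a value above FFFF is split into a surrogate pair. The notes only give the overall range D800 to DFFF; the split into high surrogates (D800 to DBFF) and low surrogates (DC00 to DFFF) comes from the UTF-16 definition. The function names are made up for illustration.

```python
def to_surrogates(cp):
    """Split a code point in 10000..10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only values above FFFF need a surrogate pair")
    offset = cp - 0x10000             # 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Because the two ranges do not overlap, a single 16 bit value tells you whether you are at the first or second half of a pair, which is what makes dipping into the middle of a string workable.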
UTF-8
most popular
An ASCII file is already a UTF-8 file and there is no expansion of data when regarding it as UCS
00000000-0000007F 0xxxxxxx
00000080-000007FF 110xxxxx 10xxxxxx
00000800-0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
00010000-0010FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
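The table above translates fairly directly into code. This is a sketch for a single code point, not a full encoder: `utf8_encode` is a hypothetical helper, and it does not reject the surrogate range D800 to DFFF the way a strict encoder would.

```python
def utf8_encode(cp):
    """Encode one code point following the four rows of the UTF-8 table."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond 10FFFF")

print(utf8_encode(0x20AC).hex())  # e282ac
```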
how to find the start of characters
• When dipping at random into a UTF-8 encoded file it is easy to find the start of the next character: just search until you find a byte starting with bits 0 or 11
how to find the length of a non-ASCII character
• The length of each non-ASCII character is given by the number of leading 1 bits in its first byte
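Both rules can be sketched with a few bit masks. The helper names are made up for illustration, and the sample string is just an example mixing 1, 2 and 3 byte characters.

```python
def is_char_start(byte):
    """True for bytes starting with bit 0 (ASCII) or bits 11 (lead byte)."""
    return (byte & 0x80) == 0x00 or (byte & 0xC0) == 0xC0

def next_char_start(data, pos):
    """Skip continuation bytes (10xxxxxx) from pos to the next character start."""
    while pos < len(data) and not is_char_start(data[pos]):
        pos += 1
    return pos

def char_length(lead):
    """Length in bytes of the character whose first byte is lead."""
    if (lead & 0x80) == 0:
        return 1                      # ASCII
    ones = 0
    while lead & (0x80 >> ones):      # count leading 1 bits
        ones += 1
    if ones < 2 or ones > 4:
        raise ValueError("not a valid first byte")
    return ones

data = "a\u00e9\u20ac".encode("utf-8")    # bytes: 61 | c3 a9 | e2 82 ac
start = next_char_start(data, 2)          # index 2 is in the middle of the 2 byte character
print(start, char_length(data[start]))    # 3 3
```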
Endianness
Endianness refers to the sequential order in which bytes are arranged into larger numerical values when stored in memory or when transmitted over digital links
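A small illustration using Python's built-in `int.to_bytes` and the endian-specific UTF-16 codecs. The value 0x0A0B0C0D is an arbitrary example.

```python
value = 0x0A0B0C0D

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a

# The same choice appears in multi-byte text encodings
c_be = "c".encode("utf-16-be")  # 00 63
c_le = "c".encode("utf-16-le")  # 63 00
```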
Punycode
In computing, Punycode is an instance of a general encoding syntax by which a string of Unicode characters is transformed uniquely and reversibly into a smaller, restricted character set. Punycode is intended for the encoding of labels in the Internationalized Domain Names in Applications framework, such that these domain names may be represented in the ASCII character set allowed in the Domain Name System of the Internet. The encoding syntax is defined in IETF document RFC 3492. The IDNA methodology encodes only select label components of domain names with a procedure called ToASCII. The procedure ToUnicode decodes the DNS label into Unicode representation.
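Python ships a `punycode` codec and an `idna` codec, which is enough for a quick illustration. The label below is only an example, and note the assumption that the built-in `idna` codec follows the older IDNA 2003 rules, so it is a sketch of the idea rather than a current IDNA implementation.

```python
label = "m\u00fcnchen"   # "münchen"

# Raw Punycode of a single label: ASCII letters first, then the encoded rest
puny = label.encode("punycode")
print(puny)              # b'mnchen-3ya'

# ToASCII-style conversion of a whole domain name: adds the xn-- prefix
ascii_name = (label + ".de").encode("idna")
print(ascii_name)        # b'xn--mnchen-3ya.de'

# ToUnicode-style conversion back again
unicode_name = ascii_name.decode("idna")
```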
Unicode is split into 17 planes
In the Unicode standard, a plane is a continuous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00 to 10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Plane 0 is the Basic Multilingual Plane (BMP), which contains the most commonly used characters. The higher planes 1 through 16 are called “supplementary planes”. The very last code point in Unicode is the last code point in plane 16, U+10FFFF. As of Unicode version 11.0, six of the planes have assigned code points (characters), and four are named.
Plane 0, Basic Multilingual Plane, U+0000 to U+FFFF: modern languages and special characters; includes a large number of Chinese, Japanese and Korean (CJK) characters
Plane 1, Supplementary Multilingual Plane, U+10000 to U+1FFFF: historic scripts and musical and mathematical symbols
Plane 2, Supplementary Ideographic Plane, U+20000 to U+2FFFF: rare Chinese characters
Plane 14, Supplementary Special-purpose Plane, U+E0000 to U+EFFFF: non-recommended language tag and variation selection characters
Plane 15, Supplementary Private Use Area-A, U+F0000 to U+FFFFF: private use (no character is specified)
Plane 16, Supplementary Private Use Area-B, U+100000 to U+10FFFF: private use (no character is specified)
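Since each plane holds 2^16 code points, the plane number is just the code point with its low 16 bits shifted away. A minimal sketch, with `plane_of` as a hypothetical helper name:

```python
def plane_of(cp):
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    return cp >> 16   # drop the 16 bits that index within the plane

print(plane_of(ord("A")))   # 0  (Basic Multilingual Plane)
print(plane_of(0x1F600))    # 1  (Supplementary Multilingual Plane)
print(plane_of(0x10FFFF))   # 16 (Supplementary Private Use Area-B)
```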