slides26 Flashcards

Question 1

Q

not much here

Answer

A

very sad if russell asks a presentation layer question at the exam

Question 2

Q

UCS

Answer

A

UCS (ISO 10646) is a character encoding that uses 31 bits per character instead of the 7 bits of ASCII
This gives ample room for all the characters in all the written languages in the world
It is a big table that says “this value represents this character”
Unicode takes UCS and adds details like direction of writing (left-to-right or right-to-left or bidirectional), defining alphabetic orders, which are capital letters, and so on
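A small sketch of the "extra details" Unicode attaches to each UCS value, using Python's standard unicodedata module. The helper name describe is made up for this card; the category and bidirectional codes come from the Unicode character database that ships with Python.

```python
import unicodedata

def describe(ch):
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # e.g. Lu = uppercase letter, Ll = lowercase letter
        unicodedata.bidirectional(ch),  # e.g. L = left-to-right, R = right-to-left
    )

for ch in ["A", "a", "\u05D0"]:  # U+05D0 is HEBREW LETTER ALEF
    print(describe(ch))
```

The table lookup ("this value represents this character") is the UCS part; the category and direction fields are the kind of information Unicode adds on top.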

Question 3

Q

how many graphemes does Unicode use

Answer

A

Unicode only uses UCS values from 0 to 10FFFF: 17 × 2^16 = 1,114,112 code points
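The arithmetic can be checked in a couple of lines. This is just a sanity check of the count; the constant names are made up here.

```python
PLANES = 17                 # planes 0 to 16
POINTS_PER_PLANE = 2 ** 16  # 65,536 code points per plane

total = PLANES * POINTS_PER_PLANE
last_code_point = total - 1

print(total)                 # 1114112
print(hex(last_code_point))  # 0x10ffff
```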

Question 4

Q

what is a glyph

Answer

A

And then there is the glyph, the visible rendering of the grapheme in some font

Question 5

Q

Unicode Transformation Format 32 UTF-32

Answer

A

simply uses four bytes per character and embeds ASCII in UCS by merely adding three 0 bytes before every ASCII byte

cat in ASCII is three bytes: 99 97 116
cat in UTF-32 is 12 bytes: 0 0 0 99 0 0 0 97 0 0 0 116
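The same example can be reproduced with Python's built-in codecs. The "utf-32-be" codec is used here on the assumption that the card means big-endian order with no byte order mark.

```python
text = "cat"

ascii_bytes = list(text.encode("ascii"))
utf32_bytes = list(text.encode("utf-32-be"))  # big-endian, no byte order mark

print(ascii_bytes)  # [99, 97, 116]
print(utf32_bytes)  # [0, 0, 0, 99, 0, 0, 0, 97, 0, 0, 0, 116]
```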

Question 6

Q

UCS-2

Answer

A

Less inflationary is UCS-2, which uses two bytes per character and prepends a single 0 byte before each ASCII character
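Python has no codec named UCS-2, so this sketch uses "utf-16-be" as a stand-in. That is an assumption that only holds for characters up to FFFF, where the two encodings produce the same bytes.

```python
text = "cat"

# Stand-in for UCS-2: identical to big-endian UTF-16 for values up to FFFF.
ucs2_bytes = list(text.encode("utf-16-be"))

print(ucs2_bytes)  # [0, 99, 0, 97, 0, 116]
```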

Question 7

Q

UTF-16

Answer

A

UTF-16 can represent all Unicode values, but at the cost of some complexity

It uses pairs of 16 bit values in the range D800 to DFFF (surrogate pairs) to encode the extended values

The surrogate values (and which is high and low) can easily be identified in a byte stream: important if you are dipping into the middle of a string
It does punch a hole in Unicode from D800 to DFFF that can’t be used as characters
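A minimal sketch of how a value above FFFF is split into a surrogate pair. The function names are made up for this card; the result is checked against Python's own UTF-16 encoder.

```python
def to_surrogates(cp):
    """Split a code point in 10000..10FFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only values above FFFF need a surrogate pair")
    offset = cp - 0x10000               # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> DC00..DFFF
    return high, low

def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Because high and low surrogates sit in separate ranges, a reader landing on any 16-bit unit can tell whether it is a whole character, the first half of a pair, or the second half.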

Question 8

Q

UTF-8

Answer

A

most popular

An ASCII file is already a UTF-8 file and there is no expansion of data when regarding it as UCS

00000000-0000007F 0xxxxxxx
00000080-000007FF 110xxxxx 10xxxxxx
00000800-0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
00010000-0010FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
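The table can be turned into a hand-rolled encoder. This is a sketch for study purposes (the function name utf8_encode is made up and the surrogate range is not rejected); in practice str.encode("utf-8") does this.

```python
def utf8_encode(cp):
    """Encode one code point following the four rows of the table above."""
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond 10FFFF")

print(utf8_encode(0x20AC).hex())  # e282ac
```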

Question 9

Q

how to find the start and end of characters

Answer

A

When dipping at random into a UTF-8 encoded file it is easy to find the start of the next character: just search until you find a byte starting with bit 0 or bits 11 (continuation bytes always start with bits 10)
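A minimal sketch of that search, assuming the data is valid UTF-8. The function name next_char_start is made up for this card.

```python
def next_char_start(data, i):
    """Return the index of the first byte at or after i that starts a character."""
    # Continuation bytes look like 10xxxxxx, so skip them.
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

data = "a\u00E9\u20AC".encode("utf-8")  # bytes: 61 | C3 A9 | E2 82 AC
print(next_char_start(data, 2))         # 3 (index 2 is a continuation byte)
```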

Question 10

Q

how to find the length of non ascii character

Answer

A

The length in bytes of each non-ASCII character is given by the number of leading 1 bits in its first byte
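Counting the leading 1 bits can be sketched like this. The function name is made up, and treating a lone continuation byte as an error is a choice made for this example.

```python
def utf8_char_length(lead):
    """Length in bytes of the character whose first byte is `lead`."""
    if lead < 0x80:
        return 1                 # ASCII: 0xxxxxxx
    ones = 0
    mask = 0x80
    while lead & mask:           # count leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:    # 10xxxxxx is a continuation byte, not a start
        raise ValueError("not a valid first byte")
    return ones

print(utf8_char_length(0xE2))  # 3
```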

Question 11

Q

Endianness

Answer

A

Endianness refers to the sequential order in which bytes are arranged into larger numerical values when stored in memory or when transmitted over digital links
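A short illustration of the two byte orders, using int.to_bytes and the UTF-16 codecs from the standard library. The value 0x12345678 is just an arbitrary example.

```python
value = 0x12345678

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# The same issue shows up in multi-byte text encodings:
print("c".encode("utf-16-be").hex())  # 0063
print("c".encode("utf-16-le").hex())  # 6300
```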

Question 12

Q

Punycode

Answer

A

In computing, Punycode is an instance of a general encoding syntax by which a string of Unicode characters is transformed uniquely and reversibly into a smaller, restricted character set. Punycode is intended for the encoding of labels in the Internationalized Domain Names in Applications framework, such that these domain names may be represented in the ASCII character set allowed in the Domain Name System of the Internet. The encoding syntax is defined in IETF document RFC 3492. The IDNA methodology encodes only select label components of domain names with a procedure called ToASCII. The procedure ToUnicode decodes the DNS label into Unicode representation.
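Python ships a "punycode" codec and an "idna" codec, so the round trip can be tried directly. Note that the built-in "idna" codec follows the older IDNA 2003 rules, so treat this as an illustration rather than a reference for current registries. The label is an arbitrary example.

```python
label = "b\u00FCcher"  # "bücher"

puny = label.encode("punycode")  # raw Punycode of the label
idna = label.encode("idna")      # ToASCII: adds the "xn--" prefix

print(puny)  # b'bcher-kva'
print(idna)  # b'xn--bcher-kva'

# ToUnicode direction
print(idna.decode("idna") == label)  # True
```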

Question 13

Q

Unicode is split into 17 planes

Answer

A

In the Unicode standard, a plane is a continuous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00 to 10 (hexadecimal) of the first two positions in six-position hexadecimal format (U+hhhhhh). Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes 1 through 16 are called “supplementary planes”. The very last code point in Unicode is the last code point in plane 16, U+10FFFF. As of Unicode version 11.0, six of the planes have assigned code points (characters), and four are named.

Plane 0: Basic Multilingual Plane, U+0000 to U+FFFF. Modern languages and special characters; includes a large number of Chinese, Japanese and Korean (CJK) characters.
Plane 1: Supplementary Multilingual Plane, U+10000 to U+1FFFF. Historic scripts and musical and mathematical symbols.
Plane 2: Supplementary Ideographic Plane, U+20000 to U+2FFFF. Rare Chinese characters.
Plane 14: Supplementary Special-purpose Plane, U+E0000 to U+EFFFF. Non-recommended language tag and variation selection characters.
Plane 15: Supplementary Private Use Area-A, U+F0000 to U+FFFFF. Private use (no character is specified).
Plane 16: Supplementary Private Use Area-B, U+100000 to U+10FFFF. Private use (no character is specified).
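Since each plane holds 2^16 code points, the plane number is just the code point shifted right by 16 bits. A minimal sketch (the function name plane is made up here):

```python
def plane(cp):
    """Return the plane number (0 to 16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp >> 16  # the first two hex digits of U+hhhhhh

print(plane(0x0041))    # 0
print(plane(0x1F600))   # 1
print(plane(0x10FFFF))  # 16
```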

(13 cards)