slides26 Flashcards

1
Q

not much here

A

very sad if russell asks a presentation layer question at the exam

2
Q

UCS

A

UCS (ISO 10646) is a coded character set that uses 31 bits per character instead of ASCII's 7
This gives ample room for all the characters in all the written languages in the world
It is a big table that says “this value represents this character”
Unicode takes UCS and adds details such as the direction of writing (left-to-right, right-to-left or bidirectional), alphabetic ordering, which letters are capitals, and so on

3
Q

how many code points does Unicode use

A

Unicode only uses UCS values from 0 to 10FFFF: 17 × 2^16 = 1,114,112 code points
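
The count can be checked with a little arithmetic. A minimal sketch in Python; the variable names are illustrative and not from the slides:

```python
# 17 planes, each holding 2**16 code points
PLANES = 17
POINTS_PER_PLANE = 2 ** 16

total = PLANES * POINTS_PER_PLANE

# The highest code point (0x10FFFF) is one less than the count,
# because numbering starts at 0.
assert total == 0x10FFFF + 1

print(total)  # 1114112
```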

4
Q

what is a glyph

A

A glyph is the visible rendering of a grapheme in some font

5
Q

Unicode Transformation Format 32 (UTF-32)

A

UTF-32 simply uses four bytes per character and embeds ASCII in UCS by adding three 0 bytes before every ASCII byte

cat in ASCII is three bytes: 99 97 116
cat in UTF-32 is 12 bytes: 0 0 0 99 0 0 0 97 0 0 0 116
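
The same comparison can be reproduced in Python. This sketch assumes big-endian byte order (the `utf-32-be` codec), which matches the "three 0 bytes before every ASCII byte" layout and writes no byte order mark:

```python
text = "cat"

# ASCII: one byte per character
ascii_bytes = list(text.encode("ascii"))

# UTF-32, big-endian, no byte order mark: four bytes per character
utf32_bytes = list(text.encode("utf-32-be"))

print(ascii_bytes)  # [99, 97, 116]
print(utf32_bytes)  # [0, 0, 0, 99, 0, 0, 0, 97, 0, 0, 0, 116]
```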

6
Q

UCS-2

A

Less inflationary is UCS-2, which uses two bytes per character and prepends a single 0 byte to each ASCII character. With only 16 bits it can represent values from 0 to FFFF and nothing above
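
A minimal sketch of the idea in Python. Python has no codec named UCS-2, so `ucs2_encode` below is a hypothetical helper written for illustration; it assumes big-endian byte order and rejects anything that does not fit in 16 bits:

```python
def ucs2_encode(text: str) -> bytes:
    """Encode text as two big-endian bytes per character (illustrative helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no way to express values beyond 16 bits
            raise ValueError(f"U+{cp:04X} does not fit in UCS-2")
        out += cp.to_bytes(2, "big")
    return bytes(out)


print(list(ucs2_encode("cat")))  # [0, 99, 0, 97, 0, 116]
```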

7
Q

UTF-16

A

UTF-16 can represent all Unicode values, but at the cost of some complexity

It uses pairs of 16-bit values in the range D800 to DFFF (surrogate pairs) to encode the extended values, meaning those above FFFF

The surrogate values (and which is high and low) can easily be identified in a byte stream: important if you are dipping into the middle of a string
It does punch a hole in Unicode from D800 to DFFF that can’t be used as characters
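
The surrogate mechanism can be sketched in Python as follows. The split of the range into high surrogates (D800 to DBFF) and low surrogates (DC00 to DFFF) is not stated on the card but is how UTF-16 defines it; the function names are illustrative:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only values above FFFF need a surrogate pair")
    offset = cp - 0x10000               # fits in 20 bits
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Because the two halves come from disjoint ranges, a reader landing on any 16-bit unit can tell whether it is a whole character, the first half of a pair, or the second half.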

8
Q

UTF-8

A

UTF-8 is the most popular encoding

An ASCII file is already a UTF-8 file and there is no expansion of data when regarding it as UCS

00000000-0000007F 0xxxxxxx
00000080-000007FF 110xxxxx 10xxxxxx
00000800-0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
00010000-0010FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
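
The table above can be turned into an encoder almost row by row. This is a sketch for study purposes, not a replacement for Python's built-in `str.encode`; `utf8_encode` is an illustrative name, and the sketch does not reject the surrogate range D800 to DFFF:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the four rows of the UTF-8 table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(ord("A")).hex())   # 41
print(utf8_encode(0x20AC).hex())     # e282ac
```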

9
Q

how to find the start of a character in UTF-8

A

• When dipping at random into a UTF-8 encoded file it is easy to find the start of the next character: just search until you find a byte starting with bits 0 or 11
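
A minimal sketch of that search in Python. A byte starting with bits 0 or 11 is the same thing as a byte that does not start with 10, so the loop skips continuation bytes; `next_char_start` is an illustrative name:

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    # Continuation bytes look like 10xxxxxx; skip over them
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")   # bytes: 61 e2 82 ac 62
print(next_char_start(data, 2))     # 4 (index 2 is in the middle of the euro sign)
```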

10
Q

how to find the length of non ascii character

A

• The length of each non-ASCII character, in bytes, is given by the number of leading 1 bits in its first byte
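
Counting the leading 1 bits can be sketched like this in Python. `utf8_char_length` is an illustrative name; it treats a single leading 1 (a continuation byte) or more than four leading 1s as an error, since neither can start a character:

```python
def utf8_char_length(first_byte: int) -> int:
    """Length in bytes of the UTF-8 character that starts with first_byte."""
    if (first_byte & 0x80) == 0:
        return 1                    # ASCII: 0xxxxxxx
    count = 0
    mask = 0x80
    while first_byte & mask:        # count leading 1 bits
        count += 1
        mask >>= 1
    if count == 1 or count > 4:
        raise ValueError("not a valid first byte")
    return count


print(utf8_char_length(0xE2))  # 3 (0xE2 is 11100010)
```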

11
Q

Endianness

A

Endianness refers to the sequential order in which bytes are arranged into larger numerical values when stored in memory or when transmitted over digital links
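
As a small illustration, Python's `int.to_bytes` can lay out the same 32-bit value in both orders. The value 0x12345678 is an arbitrary example:

```python
value = 0x12345678

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Reading the bytes back with the matching order recovers the value
assert int.from_bytes(big, "big") == value
assert int.from_bytes(little, "little") == value
```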

12
Q

Punycode

A

In computing, Punycode is an instance of a general encoding syntax by which a string of Unicode characters is transformed uniquely and reversibly into a smaller, restricted character set. Punycode is intended for the encoding of labels in the Internationalized Domain Names in Applications framework, such that these domain names may be represented in the ASCII character set allowed in the Domain Name System of the Internet. The encoding syntax is defined in IETF document RFC 3492. The IDNA methodology encodes only select label components of domain names with a procedure called ToASCII. The procedure ToUnicode decodes the DNS label into Unicode representation.
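
Python ships codecs that make this easy to try. In the sketch below, the `punycode` codec applies the raw RFC 3492 encoding, and the `idna` codec applies a ToASCII-style conversion that adds the `xn--` prefix to labels that need it. Note that the built-in `idna` codec follows the older IDNA 2003 rules, so treat it as a demonstration rather than a current-standard implementation. The domain `bücher.example` is just an example name:

```python
label = "bücher"

raw = label.encode("punycode")   # raw Punycode, no prefix
ace = label.encode("idna")       # ASCII-compatible form of the label

print(raw)  # b'bcher-kva'
print(ace)  # b'xn--bcher-kva'

# Only labels containing non-ASCII characters are changed
domain = "bücher.example".encode("idna")
print(domain)  # b'xn--bcher-kva.example'

# Decoding reverses the transformation
print(ace.decode("idna"))  # bücher
```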

13
Q

Unicode is split into 17 planes

A

In the Unicode standard, a plane is a contiguous group of 65,536 (2^16) code points. There are 17 planes, identified by the numbers 0 to 16, which correspond to the possible values 00 to 10 (hexadecimal) of the first two positions in the six-position hexadecimal format (U+hhhhhh). Plane 0 is the Basic Multilingual Plane (BMP), which contains the most commonly used characters. The higher planes 1 through 16 are called “supplementary planes”. The very last code point in Unicode is the last code point in plane 16, U+10FFFF. As of Unicode version 11.0, six of the planes have assigned code points (characters), and four are named.

Plane 0: Basic Multilingual Plane, U+0000 to U+FFFF. Modern languages and special characters, including a large number of Chinese, Japanese and Korean (CJK) characters.
Plane 1: Supplementary Multilingual Plane, U+10000 to U+1FFFF. Historic scripts and musical and mathematical symbols.
Plane 2: Supplementary Ideographic Plane, U+20000 to U+2FFFF. Rare Chinese characters.
Plane 14: Supplementary Special-purpose Plane, U+E0000 to U+EFFFF. Non-recommended language tag and variation selection characters.
Plane 15: Supplementary Private Use Area-A, U+F0000 to U+FFFFF. Private use (no character is specified).
Plane 16: Supplementary Private Use Area-B, U+100000 to U+10FFFF. Private use (no character is specified).
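
Since each plane holds 2^16 code points, the plane number is simply the code point with its low 16 bits dropped. A minimal sketch in Python; `plane_of` is an illustrative name:

```python
def plane_of(cp: int) -> int:
    """Return the Unicode plane (0 to 16) that contains code point cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    return cp >> 16   # discard the 16 bits that index within the plane


print(plane_of(ord("A")))   # 0  (Basic Multilingual Plane)
print(plane_of(0x1F600))    # 1  (Supplementary Multilingual Plane)
print(plane_of(0x10FFFF))   # 16 (last code point in Unicode)
```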
