Lecture 3 - Character Encoding Flashcards
1
Q
Encoding Process
A
Converting Text to Binary:
- Assign unique numeric values to characters.
- Convert these numbers to binary.
- Example: The text “HELLO” is represented in ASCII as the hex values 48 45 4C 4C 4F.
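The two steps on this card can be sketched in Python as follows. This is a minimal illustration using the built-in `str.encode`; the variable names are chosen here for the example only.

```python
# Step 1: map each character to its numeric ASCII value.
text = "HELLO"
data = text.encode("ascii")          # bytes object: one number per character

# Step 2: show those numbers in hexadecimal and in binary.
hex_values = " ".join(f"{b:02X}" for b in data)
bit_strings = " ".join(f"{b:08b}" for b in data)

print(hex_values)    # 48 45 4C 4C 4F
print(bit_strings)   # 01001000 01000101 01001100 01001100 01001111
```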
2
Q
ASCII Encoding
A
- Character Set: Encodes English characters, numbers, and symbols commonly used in US-based digital systems.
- Total Characters: 128 unique characters, each represented by a 7-bit binary number (values 0-127).
Storage:
- In computer memory, characters are stored as 8 bits (1 byte), with the first bit (MSB) set to 0.
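As a small check of the storage rule above, the sketch below encodes a short string as ASCII and confirms that the most significant bit of every stored byte is 0. The sample string is an arbitrary choice for illustration.

```python
# Encode a sample string as ASCII; each character becomes one byte.
data = "Hi!".encode("ascii")

for b in data:
    # Print the character, its decimal value, and its 8-bit pattern.
    print(f"{chr(b)!r}: {b:3d} -> {b:08b}")

# 0x80 masks the most significant bit; it is 0 for every ASCII byte.
all_msb_zero = all((b & 0x80) == 0 for b in data)
print(all_msb_zero)
```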
3
Q
Pros and Cons - ASCII
A
Advantages:
- Simplicity: Fixed-width encoding (8 bits per character) makes it straightforward to read and write.
- Efficiency: At one byte per character, it typically results in smaller file sizes and network payloads than wider encodings.
Disadvantages:
- Limited Scope: Only encodes English characters, making it unsuitable for non-English or extended character sets.
4
Q
Extended ASCII
A
- Extended ASCII: Utilizes the 8th bit to add 128 new characters (values 128-255), extending the character set without increasing file size.
Popular Extensions:
- ISO 8859-1 (Latin-1): Adds 128 characters, including Latin alphabet symbols used in various European languages. Contains 96 printable characters and 32 control characters.
- Code page 437: Used in IBM PCs, includes additional characters beyond the basic ASCII set, such as box-drawing characters, accented letters, and some Greek letters.
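Because each extension assigns its own meaning to values 128-255, the same byte can decode to different characters. The sketch below illustrates this with Python's built-in `latin-1` and `cp437` codecs; the byte 0xE9 is simply a convenient example value.

```python
# One byte with the 8th bit set (value 233, outside the ASCII range).
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")   # ISO 8859-1 interpretation
as_cp437 = raw.decode("cp437")      # IBM PC code page 437 interpretation

print(as_latin1, as_cp437)          # two different characters from the same byte

# Bytes below 128 are plain ASCII and decode identically under both.
same_for_ascii = b"HELLO".decode("latin-1") == b"HELLO".decode("cp437")
print(same_for_ascii)
```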
5
Q
ISO 8859-1 (Latin-1)
A
- Developed by ISO and IEC to support Latin alphabets used in European languages.
- Character Set: Includes all ASCII characters and additional Latin characters.
6
Q
Pros and Cons - ISO 8859-1
A
Advantages:
- Fixed-width: Similar to ASCII, each character is 8 bits.
- Extended Coverage: Supports most European languages with additional Latin characters.
Disadvantages:
- Incomplete Coverage: Does not support all languages; even some European languages have incomplete support.
- Non-Latin Scripts: Cannot be used for languages written in non-Latin scripts.
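The coverage limits above can be probed with Python's built-in `latin-1` codec. In this sketch, `fits_latin1` is a hypothetical helper written for this example, and the sample words are illustrative choices.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of text can be encoded in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print(fits_latin1("café"))    # accented letter é is in Latin-1
print(fits_latin1("œuvre"))   # French ligature œ (U+0153) is not in Latin-1
print(fits_latin1("漢"))      # characters from non-Latin scripts are not covered
```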
7
Q
UTF-8
A
- UTF-8 is an 8-bit variable-length encoding scheme compatible with ASCII.
- It uses 1 to 4 bytes to encode a character, with each character’s code point taken from the Unicode character set.
- As a superset of ASCII, UTF-8 ensures that ASCII characters remain unchanged, making UTF-8 backwards compatible with ASCII.
Key Points:
- Code Units: UTF-8 uses 8-bit (1 byte) code units. ASCII characters (code points 0-127) use a single byte, while non-ASCII characters use multiple bytes.
- Variable-Length Encoding: Characters may take up to 4 bytes, with the number of bytes determined by the leading bits of the first byte.
- Compatibility: ASCII characters have identical representations in UTF-8, facilitating easy reading of ASCII documents by UTF-8 decoders.
- Encoding Non-ASCII Characters: For code points >127, additional bytes are used. Leading bits in the first byte indicate the total number of bytes for that character.
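A quick way to see the key points above is to encode a few characters with Python's built-in UTF-8 codec and count the bytes. The sample characters are illustrative picks, one for each length.

```python
# One character per UTF-8 length: 1, 2, 3, and 4 bytes.
samples = ["A", "ñ", "漢", "𤭢"]

byte_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths.append(len(encoded))
    # Show the code point, the byte count, and the bytes in hex.
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {encoded.hex(' ').upper()}")

# ASCII text has the same bytes whether encoded as ASCII or as UTF-8.
ascii_compatible = "HELLO".encode("ascii") == "HELLO".encode("utf-8")
print(ascii_compatible)
```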
8
Q
Multi Byte Encoding - UTF-8
A
Handling Non-ASCII Characters:
- If a character has a code point greater than 127, it requires two or more bytes.
- The number of leading 1 bits in the first byte indicates the total number of bytes for that character. Each subsequent byte in the sequence starts with 10.
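The leading-bit rule can be sketched with bit masks. The function names `sequence_length` and `is_continuation` are hypothetical helpers invented for this example, not part of any library.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:
        return 1                      # 0xxxxxxx: ASCII
    if (first_byte & 0b1110_0000) == 0b1100_0000:
        return 2                      # 110xxxxx
    if (first_byte & 0b1111_0000) == 0b1110_0000:
        return 3                      # 1110xxxx
    if (first_byte & 0b1111_1000) == 0b1111_0000:
        return 4                      # 11110xxx
    raise ValueError(f"{first_byte:08b} cannot start a UTF-8 sequence")

def is_continuation(byte: int) -> bool:
    """Return True if the byte has the 10xxxxxx continuation pattern."""
    return (byte & 0b1100_0000) == 0b1000_0000

encoded = "ñ".encode("utf-8")         # two bytes: C3 B1
print(sequence_length(encoded[0]), is_continuation(encoded[1]))
```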
9
Q
Encoding Scheme - UTF-8
A
1-byte (ASCII):
- Range: 0 to 127
- Format: 0xxxxxxx
- Example: ‘A’ (U+0041): 01000001
2-byte:
- Range: 128 to 2047
- Format: 110xxxxx 10xxxxxx
- Example: ‘ñ’ (U+00F1): 11000011 10110001
3-byte:
- Range: 2048 to 65535
- Format: 1110xxxx 10xxxxxx 10xxxxxx
- Example: ‘漢’ (U+6F22): 11100110 10111100 10100010
4-byte:
- Range: 65536 to 1114111
- Format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- Example: ‘𤭢’ (U+24B62): 11110000 10100100 10101101 10100010
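The four formats above can be turned into a small hand-written encoder. This is a teaching sketch only; `utf8_encode` is a hypothetical name, and real programs should rely on the built-in `str.encode("utf-8")`, which the last lines use as a cross-check.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by filling the x bits of each format."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xF1, 0x6F22, 0x24B62):
    encoded = utf8_encode(cp)
    bits = " ".join(f"{b:08b}" for b in encoded)
    # Compare against Python's own encoder for the same code point.
    print(f"U+{cp:04X}: {bits}  matches built-in: {encoded == chr(cp).encode('utf-8')}")
```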
10
Q
Decoding Process - UTF-8
A
Single Byte:
- If the first bit is 0, it’s an ASCII character, taking one byte.
- Example: 01000001 → ‘A’ (simply the binary representation of its code point, 65)
Multi-byte:
- If the first bit is 1, the decoder checks the number of leading 1 bits to determine the number of bytes.
- Continuation bytes always start with 10.
- The remaining payload bits of all bytes in the sequence are concatenated to recover the code point.
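The decoding rules on this card can be sketched as a simple loop. `utf8_decode` is a hypothetical teaching function; it assumes well-formed input apart from the basic checks shown, and it does not reject overlong encodings or surrogates the way a production decoder such as `bytes.decode("utf-8")` does.

```python
def utf8_decode(data: bytes) -> str:
    """Decode UTF-8 bytes by inspecting the leading bits of each first byte."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:                 # 0xxxxxxx: single-byte ASCII
            length, cp = 1, first
        elif first >> 5 == 0b110:        # 110xxxxx: 2-byte sequence
            length, cp = 2, first & 0x1F
        elif first >> 4 == 0b1110:       # 1110xxxx: 3-byte sequence
            length, cp = 3, first & 0x0F
        elif first >> 3 == 0b11110:      # 11110xxx: 4-byte sequence
            length, cp = 4, first & 0x07
        else:
            raise ValueError(f"invalid first byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated sequence")
        for b in tail:
            if b >> 6 != 0b10:           # continuation bytes start with 10
                raise ValueError("expected a continuation byte")
            cp = (cp << 6) | (b & 0x3F)  # append the 6 payload bits
        chars.append(chr(cp))
        i += length
    return "".join(chars)

print(utf8_decode(bytes([0b01000001])))                 # A
print(utf8_decode(bytes([0b11000011, 0b10110001])))     # ñ
```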