Lecture 3 - Character Encoding Flashcards

Question 1

Q

Encoding Process

Answer

A

Converting Text to Binary:
- Assign unique numeric values to characters.
- Convert these numbers to binary.
- Example: The text “HELLO” is represented in ASCII as hex values 48 45 4C 4C 4F.
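
The two steps above can be sketched in Python. This is only an illustration that assumes ASCII-only input; the helper name text_to_binary is hypothetical and not part of the lecture.

```python
def text_to_binary(text):
    """Map each character to its numeric value, then to an 8-bit binary string."""
    codes = [ord(ch) for ch in text]                 # step 1: character -> number
    return [format(code, "08b") for code in codes]   # step 2: number -> binary

print(text_to_binary("HELLO"))
# The same numbers written in hex, matching the example above: 48 45 4C 4C 4F
print("HELLO".encode("ascii").hex(" ").upper())
```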

Question 2

Q

ASCII Encoding

Answer

A

Character Set: Encodes English characters, numbers, and symbols commonly used in US-based digital systems.
Total Characters: 128 unique characters, each represented by a 7-bit binary number (values 0-127).

Storage:

In computer memory, each ASCII character is typically stored in 8 bits (1 byte), with the most significant bit (MSB) set to 0.
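
A small Python check of the storage rule, offered as a sketch: since every ASCII value fits in 7 bits, the top bit of the stored byte is always 0. The helper name msb is made up for this example.

```python
def msb(byte):
    """Return the most significant bit (bit 7) of an 8-bit value."""
    return (byte >> 7) & 1

for byte in "Hello".encode("ascii"):
    # Each line shows the 8 stored bits followed by the value of the MSB.
    print(format(byte, "08b"), msb(byte))

# The largest ASCII value, 127, still leaves the top bit clear.
print(format(127, "08b"))
```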

Question 3

Q

Pros and Cons - ASCII

Answer

A

Advantages:
- Simplicity: Fixed-width encoding (8 bits per character) makes it straightforward to read and write.
- Efficiency: One byte per character keeps file sizes and network payloads small for English text.
Disadvantages:
- Limited Scope: Covers only unaccented English letters, digits, and basic symbols, so it cannot represent text in other languages or scripts.

Question 4

Q

Extended ASCII

Answer

A

Extended ASCII: Utilizes the 8th bit to add 128 new characters, extending the character set without increasing file size.
- Popular Extensions:
  - ISO 8859-1 (Latin-1): Adds 128 code values (128-255): 96 printable characters, including accented Latin letters used in various European languages, and 32 control characters.
  - Code page 437: Used in IBM PCs, includes additional characters beyond the basic ASCII set.
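
Because each extension assigns its own meaning to values 128-255, the same byte can decode to different characters. The Python sketch below uses the standard "latin-1" and "cp437" codecs; the sample bytes are chosen only for illustration.

```python
raw = bytes([0x41, 0xF1])      # 0x41 is plain ASCII, 0xF1 uses the 8th bit

print(raw.decode("latin-1"))   # 'Añ' : 0xF1 is n with tilde in ISO 8859-1
print(raw.decode("cp437"))     # 'A±' : 0xF1 is the plus-minus sign in code page 437

# Either way the extended character still occupies a single byte.
print(len("ñ".encode("latin-1")))
```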

Question 5

Q

ISO 8859-1 (Latin-1)

Answer

A

Developed by ISO and IEC to support Latin alphabets used in European languages.
Character Set: Includes all ASCII characters and additional Latin characters.

Question 6

Q

Pros and Cons - ISO 8859-1

Answer

A

Advantages:
- Fixed-width: Similar to ASCII, each character is 8 bits.
- Extended Coverage: Supports most European languages with additional Latin characters.
Disadvantages:
- Incomplete Coverage: Does not support all languages; even some European languages are only partially covered.
- Non-Latin Languages: Cannot represent languages written in non-Latin scripts.
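
These limits can be observed with Python's "latin-1" codec. The sketch below is illustrative; the helper name fits_latin1 and the sample strings are assumptions made for this example.

```python
def fits_latin1(text):
    """Return True if every character in text has an ISO 8859-1 code value."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print(fits_latin1("señor"))   # Spanish n with tilde is covered
print(fits_latin1("cœur"))    # the French ligature œ is missing from Latin-1
print(fits_latin1("€"))       # the euro sign is missing as well
print(fits_latin1("漢"))      # non-Latin scripts cannot be encoded at all
```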

Question 7

Q

UTF-8

Answer

A

UTF-8 is an 8-bit variable-length encoding scheme compatible with ASCII.
It uses 1 to 4 bytes to encode a character, with each character’s code point taken from the Unicode character set.
As a superset of ASCII, UTF-8 ensures that ASCII characters remain unchanged, making UTF-8 backwards compatible with ASCII.

Key Points:

Code Units: UTF-8 uses 8-bit (1 byte) code units. ASCII characters (code points 0-127) use a single byte, while non-ASCII characters use multiple bytes.
Variable-Length Encoding: Characters may take up to 4 bytes, with the number of bytes determined by the leading bits of the first byte.
Compatibility: ASCII characters have identical representations in UTF-8, facilitating easy reading of ASCII documents by UTF-8 decoders.
Encoding Non-ASCII Characters: For code points >127, additional bytes are used. Leading bits in the first byte indicate the total number of bytes for that character.
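
A quick way to see the variable length and the ASCII compatibility is to ask Python's built-in UTF-8 codec. This is a minimal sketch; the helper name utf8_length is hypothetical.

```python
def utf8_length(ch):
    """Number of bytes the UTF-8 codec uses for a single character."""
    return len(ch.encode("utf-8"))

for ch in ["A", "ñ", "漢", "𤭢"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex(' ')}")

# ASCII text produces identical bytes under both encodings.
print("HELLO".encode("ascii") == "HELLO".encode("utf-8"))
```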

Question 8

Q

Multi Byte Encoding - UTF-8

Answer

A

Handling Non-ASCII Characters:
- If a character has a code point greater than 127, it requires two or more bytes.
- The number of leading 1 bits in the first byte indicates the total number of bytes for that character. Each subsequent byte in the sequence starts with 10.
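
The leading-bit rule can be sketched as a small Python classifier. The function names leading_ones and classify are invented for this example, and the check is simplified: it looks only at the bit pattern and does not reject every byte value that strict UTF-8 forbids.

```python
def leading_ones(byte):
    """Count consecutive 1 bits starting from the most significant bit."""
    count = 0
    mask = 0b10000000
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count

def classify(byte):
    """Describe the role of one byte based on its leading 1 bits."""
    ones = leading_ones(byte)
    if ones == 0:
        return "single-byte (ASCII)"
    if ones == 1:
        return "continuation"
    if 2 <= ones <= 4:
        return f"lead byte of a {ones}-byte sequence"
    return "invalid in UTF-8"

for byte in "ñ".encode("utf-8"):
    print(format(byte, "08b"), classify(byte))
```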

Question 9

Q

Encoding Scheme - UTF-8

Answer

A

1-byte (ASCII):
- Range: 0 to 127
- Format: 0xxxxxxx
- Example: ‘A’ (U+0041): 01000001
2-byte:
- Range: 128 to 2047
- Format: 110xxxxx 10xxxxxx
- Example: ‘ñ’ (U+00F1): 11000011 10110001
3-byte:
- Range: 2048 to 65535 (excluding the surrogate values 55296 to 57343, which are not valid characters)
- Format: 1110xxxx 10xxxxxx 10xxxxxx
- Example: ‘漢’ (U+6F22): 11100110 10111100 10100010
4-byte:
- Range: 65536 to 1114111
- Format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- Example: ‘𤭢’ (U+24B62): 11110000 10100100 10101101 10100010
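
The four formats above can be turned into a hand-written encoder. The Python sketch below is for study only (in practice str.encode("utf-8") does this); the function name utf8_encode is hypothetical.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes using the bit formats above."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    return bytes([0b11110000 | (code_point >> 18),   # 11110xxx + three continuation bytes
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])

for cp in (0x41, 0xF1, 0x6F22, 0x24B62):
    bits = " ".join(format(b, "08b") for b in utf8_encode(cp))
    print(f"U+{cp:04X}: {bits}")
```

Each printed line should match the corresponding example in the card, and the result agrees with Python's built-in codec for these code points.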

Question 10

Q

Decoding Process - UTF-8

Answer

A

Single Byte:
- If the first bit is 0, it’s an ASCII character, taking one byte.
- Example: 01000001 → ‘A’ (simply the binary representation of its code point)
Multi-byte:
- If the first bit is 1, the decoder counts the leading 1 bits of the first byte to determine how many bytes the character occupies (110 → 2 bytes, 1110 → 3 bytes, 11110 → 4 bytes).
- Continuation bytes always start with 10.
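
The decoding rules can be sketched as a simple Python loop. This is a teaching sketch rather than a complete decoder: the name utf8_decode is hypothetical, and checks that a strict decoder performs (such as rejecting overlong sequences) are left out.

```python
def utf8_decode(data):
    """Decode UTF-8 bytes by inspecting the leading bits of each first byte."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first & 0b10000000 == 0:                  # 0xxxxxxx: one byte
            length, value = 1, first
        elif first & 0b11100000 == 0b11000000:       # 110xxxxx: two bytes
            length, value = 2, first & 0b00011111
        elif first & 0b11110000 == 0b11100000:       # 1110xxxx: three bytes
            length, value = 3, first & 0b00001111
        elif first & 0b11111000 == 0b11110000:       # 11110xxx: four bytes
            length, value = 4, first & 0b00000111
        else:
            raise ValueError(f"unexpected byte at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError("truncated sequence")
        for byte in tail:
            if byte & 0b11000000 != 0b10000000:      # continuation bytes start with 10
                raise ValueError("expected continuation byte")
            value = (value << 6) | (byte & 0b00111111)
        chars.append(chr(value))
        i += length
    return "".join(chars)

print(utf8_decode(bytes([0b01000001])))               # the single-byte example: A
print(utf8_decode(bytes([0b11000011, 0b10110001])))   # two bytes: ñ
```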

(10 cards)