Lecture 3 - Character Encoding Flashcards

1
Q

Encoding Process

A
  • Converting Text to Binary:
    • Assign unique numeric values to characters.
    • Convert these numbers to binary.
    • Example: The text “HELLO” is represented in ASCII as hex values 48 45 4C 4C 4F.
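The two steps above can be sketched in Python using the built-in `ord` and `str.encode`. This is only an illustration; the variable names are not from the lecture.

```python
text = "HELLO"

# Step 1: map each character to its numeric value (its ASCII code).
codes = [ord(ch) for ch in text]

# Step 2: write each value in binary (padded to 8 bits) and in hex.
binary = [format(code, "08b") for code in codes]
hex_values = text.encode("ascii").hex(" ").upper()

print(codes)       # [72, 69, 76, 76, 79]
print(binary)
print(hex_values)  # 48 45 4C 4C 4F
```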
2
Q

ASCII Encoding

A
  • Character Set: Encodes English characters, numbers, and symbols commonly used in US-based digital systems.
  • Total Characters: 128 unique characters, each represented by a 7-bit binary number (values 0-127).

Storage:

  • In computer memory, characters are stored as 8 bits (1 byte), with the first bit (MSB) set to 0.
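A small sketch of the storage rule, assuming an arbitrary sample string: each ASCII character occupies one byte, and the most significant bit of that byte is 0.

```python
data = "Hi!".encode("ascii")

for byte in data:
    bits = format(byte, "08b")
    # The most significant bit is the leftmost digit of the 8-bit string.
    print(chr(byte), bits, "MSB =", bits[0])

# Every ASCII value is below 128, so the top bit (0x80) is never set.
msb_all_zero = all((byte & 0x80) == 0 for byte in data)
print(msb_all_zero)
```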
3
Q

Pros and Cons - ASCII

A
  • Advantages:
    • Simplicity: Fixed-width encoding (each 7-bit code is stored in one 8-bit byte) makes it straightforward to read and write.
    • Efficiency: One byte per character keeps file sizes and network payloads smaller than multi-byte encodings.
  • Disadvantages:
    • Limited Scope: Only encodes English characters, making it unsuitable for non-English or extended character sets.
4
Q

Extended ASCII

A
  • Extended ASCII: Utilizes the 8th bit to add 128 new characters, extending the character set without increasing file size.
    • Popular Extensions:
      • ISO 8859-1 (Latin-1): Adds 128 characters, including Latin alphabet symbols used in various European languages. The added range (values 128-255) contains 96 printable characters and 32 control characters.
      • Code page 437: Used in IBM PCs, includes additional characters beyond the basic ASCII set.
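To illustrate why the choice of extension matters, the sketch below decodes the same two bytes with Python's `latin-1` and `cp437` codecs. The sample bytes are chosen only for illustration: the ASCII half agrees, but the byte that uses the 8th bit maps to a different character in each extension.

```python
raw = bytes([0x41, 0xE9])  # 0x41 is ASCII 'A'; 0xE9 has the 8th bit set

as_latin1 = raw.decode("latin-1")  # ISO 8859-1
as_cp437 = raw.decode("cp437")     # IBM PC code page 437

print(as_latin1)  # Aé
print(as_cp437)   # AΘ
print(len(raw))   # still one byte per character
```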
5
Q

ISO 8859-1 (Latin-1)

A
  • Developed by ISO and IEC to support Latin alphabets used in European languages.
  • Character Set: Includes all ASCII characters and additional Latin characters.
6
Q

Pros and Cons - ISO 8859-1

A
  • Advantages:
    • Fixed-width: Similar to ASCII, each character is 8 bits.
    • Extended Coverage: Supports most European languages with additional Latin characters.
  • Disadvantages:
    • Incomplete Coverage: Does not support all languages; even some European languages have incomplete support.
    • Non-Latin Scripts: Cannot be used for languages written in non-Latin scripts.
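A rough way to see these limits in practice, assuming a hypothetical helper `fits_latin1` and sample words picked for illustration: Python raises `UnicodeEncodeError` when a string contains a character outside Latin-1.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("señor"))  # True: ñ is part of Latin-1
print(fits_latin1("cœur"))   # False: œ (French) is missing from Latin-1
print(fits_latin1("漢字"))    # False: non-Latin script
```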
7
Q

UTF-8

A
  • UTF-8 is an 8-bit variable-length encoding scheme compatible with ASCII.
  • It uses 1 to 4 bytes to encode a character, with each character’s code point taken from the Unicode character set.
  • As a superset of ASCII, UTF-8 ensures that ASCII characters remain unchanged, making UTF-8 backwards compatible with ASCII.

Key Points:

  • Code Units: UTF-8 uses 8-bit (1 byte) code units. ASCII characters (code points 0-127) use a single byte, while non-ASCII characters use multiple bytes.
  • Variable-Length Encoding: Characters may take up to 4 bytes, with the number of bytes determined by the leading bits of the first byte.
  • Compatibility: ASCII characters have identical representations in UTF-8, facilitating easy reading of ASCII documents by UTF-8 decoders.
  • Encoding Non-ASCII Characters: For code points >127, additional bytes are used. Leading bits in the first byte indicate the total number of bytes for that character.
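The key points can be checked with Python's built-in UTF-8 codec. The sample characters below are illustrative choices, one for each possible length.

```python
samples = ["A", "ñ", "漢", "𤭢"]

# Number of 8-bit code units (bytes) each character needs in UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(lengths)  # {'A': 1, 'ñ': 2, '漢': 3, '𤭢': 4}

# ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "HELLO"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True
```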
8
Q

Multi Byte Encoding - UTF-8

A
  • Handling Non-ASCII Characters:
    • If a character has a code point greater than 127, it requires two or more bytes.
    • The number of leading 1 bits in the first byte indicates the total number of bytes for that character. Each subsequent byte in the sequence starts with 10.
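The leading-bit rule can be sketched as a small classifier. The helper names `leading_ones` and `classify` are hypothetical and exist only for this example.

```python
def leading_ones(byte: int) -> int:
    """Count consecutive 1 bits starting from the most significant bit."""
    count = 0
    mask = 0b1000_0000
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count


def classify(byte: int) -> str:
    """Describe the role of a single byte inside a UTF-8 stream."""
    ones = leading_ones(byte)
    if ones == 0:
        return "single-byte (ASCII)"
    if ones == 1:
        return "continuation byte"
    if ones <= 4:
        return f"lead byte of a {ones}-byte sequence"
    return "invalid in UTF-8"


for byte in "ñ".encode("utf-8"):
    print(format(byte, "08b"), classify(byte))
```

For ‘ñ’ this prints one lead byte announcing a 2-byte sequence, followed by one continuation byte.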
9
Q

Encoding Scheme - UTF-8

A
  1. 1-byte (ASCII):
    • Range: 0 to 127
    • Format: 0xxxxxxx
    • Example: ‘A’ (U+0041): 01000001
  2. 2-byte:
    • Range: 128 to 2047
    • Format: 110xxxxx 10xxxxxx
    • Example: ‘ñ’ (U+00F1): 11000011 10110001
  3. 3-byte:
    • Range: 2048 to 65535
    • Format: 1110xxxx 10xxxxxx 10xxxxxx
    • Example: ‘漢’ (U+6F22): 11100110 10111100 10100010
  4. 4-byte:
    • Range: 65536 to 1114111
    • Format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    • Example: ‘𤭢’ (U+24B62): 11110000 10100100 10101101 10100010
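The four formats above can be turned into a hand-written encoder. This is a minimal sketch, not a production implementation: the function name `utf8_encode` is hypothetical, and it does not reject surrogate code points (U+D800 to U+DFFF) as a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the matching format."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:            # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([
            0b1100_0000 | (code_point >> 6),
            0b1000_0000 | (code_point & 0x3F),
        ])
    if code_point < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b1110_0000 | (code_point >> 12),
            0b1000_0000 | ((code_point >> 6) & 0x3F),
            0b1000_0000 | (code_point & 0x3F),
        ])
    if code_point <= 0x10FFFF:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b1111_0000 | (code_point >> 18),
            0b1000_0000 | ((code_point >> 12) & 0x3F),
            0b1000_0000 | ((code_point >> 6) & 0x3F),
            0b1000_0000 | (code_point & 0x3F),
        ])
    raise ValueError("code point is outside the Unicode range")


for ch in "Añ漢𤭢":
    encoded = utf8_encode(ord(ch))
    print(ch, " ".join(format(b, "08b") for b in encoded))
```

The printed bit patterns should match the four examples listed in the card.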
10
Q

Decoding Process - UTF-8

A
  1. Single Byte:
    • If the first bit is 0, it’s an ASCII character, taking one byte.
    • Example: 01000001 → ‘A’ (the byte is simply the binary representation of the code point)
  2. Multi-byte:
    • If the first bit is 1, the decoder checks the number of leading 1 bits to determine the number of bytes.
    • Continuation bytes always start with 10.
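The decoding steps can be sketched as follows. The function `utf8_decode_first` is a hypothetical helper that decodes only the first character of a byte string; it skips the overlong-sequence and surrogate checks that a strict decoder performs.

```python
def utf8_decode_first(data: bytes) -> tuple[str, int]:
    """Decode the first character; return it with the bytes consumed."""
    first = data[0]
    if (first & 0b1000_0000) == 0:              # 0xxxxxxx: ASCII
        return chr(first), 1
    if (first & 0b1110_0000) == 0b1100_0000:    # 110xxxxx
        length, value = 2, first & 0b0001_1111
    elif (first & 0b1111_0000) == 0b1110_0000:  # 1110xxxx
        length, value = 3, first & 0b0000_1111
    elif (first & 0b1111_1000) == 0b1111_0000:  # 11110xxx
        length, value = 4, first & 0b0000_0111
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        # Each continuation byte must look like 10xxxxxx.
        if (byte & 0b1100_0000) != 0b1000_0000:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0b0011_1111)
    return chr(value), length


print(utf8_decode_first(bytes([0b01000001])))              # ('A', 1)
print(utf8_decode_first(bytes([0b11000011, 0b10110001])))  # ('ñ', 2)
```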