Character sets Flashcards
What is ASCII
It’s an agreed-upon mapping between the numbers 0 to 127 and the English alphabet, digits, punctuation such as ! or &, and some control characters.
Each number can be written as a pattern of 0s and 1s (bits). ASCII itself only needs 7 bits, but in practice each character is stored in 8 bits with a leading 0. Remember 8 bits is a byte, meaning each character is 1 byte.
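A quick sketch of the mapping in JavaScript, using the built-in charCodeAt and String.fromCharCode. The variable names are only for illustration.

```javascript
// Character -> ASCII number, and back again.
const code = "A".charCodeAt(0);        // 65
const char = String.fromCharCode(65);  // "A"

// The same number as one byte: 7 bits of ASCII plus a leading 0.
const bits = code.toString(2).padStart(8, "0"); // "01000001"

// Punctuation has ASCII numbers too.
const bang = "!".charCodeAt(0); // 33
const amp = "&".charCodeAt(0);  // 38
```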
What is Unicode
It’s a standard meant to cover the characters of all the world’s writing systems, well over 100,000 of them, including accented letters and symbols such as emojis.
What a reader sees as a single character is called a grapheme.
Graphemes are represented by code points. Each grapheme can be one or more code points. Each code point has a numeric value, written like U+00E9, and a name, like LATIN SMALL LETTER E WITH ACUTE.
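A small sketch of "one grapheme, one or more code points". The helper toCodePoints is a made-up name for this card, built on the standard codePointAt; spreading a string with [...s] walks it one code point at a time.

```javascript
// "é" as a single code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const precomposed = "\u00E9";
// The same grapheme as two code points: "e" (U+0065) + COMBINING ACUTE ACCENT (U+0301).
const combined = "e\u0301";

// Hypothetical helper: list the code points of a string in U+XXXX form.
const toCodePoints = (s) =>
  [...s].map((c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

const a = toCodePoints(precomposed); // ["U+00E9"]
const b = toCodePoints(combined);    // ["U+0065", "U+0301"]
const c = toCodePoints("\u{1F600}"); // ["U+1F600"], the grinning face emoji
```

Both strings look the same on screen, but they are different sequences of code points.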
What is a grapheme
A single unit of a human writing system, like an emoji, a letter, or a Chinese character.
Explain the problem with many programming language string manipulation libraries - including JavaScript
Let’s use string.length. The language does not count graphemes. JavaScript counts UTF-16 code units, and many other languages count bytes. That matches the number of characters for ASCII, that is basic English letters and symbols, but not for Unicode in general. In ASCII one character is one byte, but in Unicode (stored as UTF-8) each code point takes 1 to 4 bytes, and a grapheme can have multiple code points.
Functions like this are called Unicode-unaware functions.
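A sketch comparing the different "lengths" of the same string. The helper name measure is made up for this card. It assumes a runtime that provides Intl.Segmenter and TextEncoder (recent Node.js and modern browsers do).

```javascript
// Hypothetical helper: report four different "lengths" of one string.
function measure(s) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    codeUnits: s.length,                            // what string.length reports (UTF-16 code units)
    codePoints: [...s].length,                      // spreading iterates by code point
    utf8Bytes: new TextEncoder().encode(s).length,  // size when stored as UTF-8
    graphemes: [...segmenter.segment(s)].length,    // what a reader would count
  };
}

const ascii = measure("hi");         // codeUnits 2, codePoints 2, utf8Bytes 2, graphemes 2
const accented = measure("e\u0301"); // codeUnits 2, codePoints 2, utf8Bytes 3, graphemes 1
const emoji = measure("\u{1F600}");  // codeUnits 2, codePoints 1, utf8Bytes 4, graphemes 1
```

For ASCII all four numbers agree, which is why the problem is easy to miss. Only the grapheme count matches what a person sees for the accented letter and the emoji.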