String concepts Flashcards
UTF-16 characters, Unicode code points, grapheme clusters, and lone surrogates
What is a UTF-16 code unit?
Strings are represented fundamentally as sequences of UTF-16 code units. In UTF-16 encoding, every code unit is exact 16 bits long. This means there are a maximum of Math.pow(2,16)
, or 65536
possible characters representable as single UTF-16 code units. This character set is called the basic multilingual plane (BMP), and includes the most common characters like the Latin, Greek, Cyrillic alphabets, as well as many East Asian characters. Each code unit can be written in a string with \u
followed by exactly four hex digits.
“UTF-16 characters, Unicode code points, and grapheme clusters” (MDN Web Docs). Retrieved February 6, 2024.
What is a Unicode code point?
Unicode is a standard character set that numbers and defines characters from the world’s different languages, writing systems, and symbols.
Each Unicode character, comprised of one or two UTF-16 code units, is also called a Unicode code point. Each Unicode code point can be written in a string with \u{xxxxxx}
where xxxxxx
represents 1–6 hex digits.
“UTF-16 characters, Unicode code points, and grapheme clusters” (MDN Web Docs). Retrieved February 6, 2024.
What is a surrogate pair?
The entire Unicode character set is much, much bigger than 65536
(UTF-16 code units. possible characters). The extra characters are stored in UTF-16 as surrogate pairs, which are pairs of 16-bit code units that represent a single character.
“UTF-16 characters, Unicode code points, and grapheme clusters” (MDN Web Docs). Retrieved February 6, 2024.
Explain the parts of a surrogate pair
The two parts of the pair must be between 0xD800
and 0xDFFF
, and these code units are not used to encode single-code-unit characters. (More precisely, leading surrogates, also called high-surrogate code units, have values between 0xD800
and 0xDBFF
, inclusive, while trailing surrogates, also called low-surrogate code units, have values between 0xDC00
and 0xDFFF
, inclusive.)
“UTF-16 characters, Unicode code points, and grapheme clusters” (MDN Web Docs). Retrieved February 6, 2024.
What is a “lone surrogate”?
A “lone surrogate” is a 16-bit code unit satisfying one of the descriptions below:
- It is in the range
0xD800–0xDBFF
, inclusive (i.e. is a leading surrogate), but it is the last code unit in the string, or the next code unit is not a trailing surrogate. - It is in the range
0xDC00–0xDFFF
, inclusive (i.e. is a trailing surrogate), but it is the first code unit in the string, or the previous code unit is not a leading surrogate.
Lone surrogates do not represent any Unicode character. Although most JavaScript built-in methods handle them correctly because they all work based on UTF-16 code units, lone surrogates are often not valid values when interacting with other systems — for example, encodeURI()
will throw a URIError
for lone surrogates, because URI encoding uses UTF-8 encoding, which does not have any encoding for lone surrogates. Strings not containing any lone surrogates are called well-formed strings, and are safe to be used with functions that do not deal with UTF-16 (such as encodeURI()
or TextEncoder
). You can check if a string is well-formed with the isWellFormed()
method, or sanitize lone surrogates with the toWellFormed()
method.
“String - JavaScript | MDN” (MDN Web Docs). Retrieved February 7, 2024.
What is a “grapheme cluster”?
On top of Unicode characters, there are certain sequences of Unicode characters that should be treated as one visual unit, known as a grapheme cluster. The most common case is emojis: many emojis that have a range of variations are actually formed by multiple emojis, usually joined by the <ZWJ>
(U+200D
) character.
“String - JavaScript | MDN” (MDN Web Docs). Retrieved February 7, 2024.
What do we need to pay attention when we iterate over strings?
You must be careful which level of characters you are iterating on. For example, split("")
will split by UTF-16 code units and will separate surrogate pairs. String indexes also refer to the index of each UTF-16 code unit. On the other hand, @@iterator()
iterates by Unicode code points. Iterating through grapheme clusters will require some custom code.
"😄".split(""); // ['\ud83d', '\ude04']; splits into two lone surrogates // "Backhand Index Pointing Right: Dark Skin Tone" [..."👉🏿"]; // ['👉', '🏿'] // splits into the basic "Backhand Index Pointing Right" emoji and // the "Dark skin tone" emoji // "Family: Man, Boy" [..."👨👦"]; // [ '👨', '', '👦' ] // splits into the "Man" and "Boy" emoji, joined by a ZWJ // The United Nations flag [..."🇺🇳"]; // [ '🇺', '🇳' ] // splits into two "region indicator" letters "U" and "N". // All flag emojis are formed by joining two region indicator letters
“String - JavaScript | MDN” (MDN Web Docs). Retrieved February 7, 2024.
List the 4 levels of character representation at which String
methods and static methods operate.
- UTF-16 Code Unit Level: This level deals with individual units of encoding in UTF-16, which are 16 bits each. Common operations include retrieving a specific character or its code at a given index.
- Unicode Code Point Level: At this level, operations involve working with complete Unicode code points, which can span one or two UTF-16 code units. Methods here often focus on extracting or manipulating characters based on their Unicode code points.
- Grapheme Cluster Level: Grapheme clusters represent a user-perceived character and may consist of one or more Unicode code points. The method helps in handling variations of grapheme clusters, ensuring consistency in character representations.
- String Level: These methods operate on the entire string as a whole and are not necessarily concerned with the internal details of individual characters or code points. They handle common string operations such as concatenation, searching, and manipulation of the entire string.
It’s important to note that JavaScript primarily uses UTF-16 encoding for strings, so many operations are naturally performed at the UTF-16 code unit level. Unicode code points and grapheme clusters become more relevant when dealing with certain types of characters, especially those outside the Basic Multilingual Plane (BMP) or when considering complex characters like emojis and accented characters.