Regular Expressions Flashcards

1
Q

What is a regular expression?

A

Regular expressions are patterns used to match character combinations in strings.

In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String.
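
As a small illustration of the methods named above, the following sketch runs one pattern through a few of them. The sample string and all variable names are made up for this card.

```javascript
// Illustrative only: the sample text and names are assumptions, not from the card.
const datePattern = /(\d{4})-(\d{2})/;
const sampleText = 'Released 2024-05, updated 2025-01';

// RegExp methods
const hasDate = datePattern.test(sampleText);             // true
const firstMatch = datePattern.exec(sampleText);          // match object for '2024-05'

// String methods
const firstIndex = sampleText.search(datePattern);        // 9
const swapped = sampleText.replace(datePattern, '$2/$1'); // only the first match is replaced
const parts = 'a, b,c'.split(/\s*,\s*/);                  // [ 'a', 'b', 'c' ]
```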

2
Q

How can you create a regular expression?

A

The two main ways of creating regular expressions are:

  • Literal: compiled statically (at load time).
/abc/iv
  • Constructor: compiled dynamically (at runtime).
new RegExp('abc', 'iv')

Both regular expressions have the same two parts:

  • The body abc – the actual regular expression.
  • The flags i and v. Flags configure how the pattern is interpreted. For example, i enables case-insensitive matching and v enables Unicode sets mode.
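
A minimal sketch of when each form might be chosen. The variable `unit` is hypothetical; it stands for any value that is only known at runtime.

```javascript
// A literal is convenient when the pattern is known in advance.
const literalRe = /\d+px/i;

// The constructor helps when part of the pattern is only known at runtime.
// `unit` is a made-up variable for this sketch.
const unit = 'px';
const dynamicRe = new RegExp('\\d+' + unit, 'i'); // note the doubled backslash in the string

const sameBody = literalRe.source === dynamicRe.source;  // true
const sameFlags = literalRe.flags === dynamicRe.flags;   // true
const matchesUpper = dynamicRe.test('width: 12PX');      // true
```

Inside a string literal, each backslash of the pattern has to be written twice, which is one reason to prefer literals for fixed patterns.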
3
Q

How can you clone a regular expression?

A

There are two variants of the constructor RegExp():

  • new RegExp(pattern : string, flags = '') [ES3] - A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
  • new RegExp(regExp : RegExp, flags = regExp.flags) [ES6] - regExp is cloned. If flags is provided, then it determines the flags of the clone.

The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them – for example:

function copyAndAddFlags(regExp, flagsToAdd='') {
  // The constructor doesn’t allow duplicate flags;
  // make sure there aren’t any:
  const newFlags = Array.from(
    new Set(regExp.flags + flagsToAdd)
  ).join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');
4
Q

What are the syntax characters of a regular expression?

A

At the top level of a regular expression, the following characters are special and are known as syntax characters. They are escaped by prefixing a backslash (\).

\ ^ $ . * + ? ( ) [ ] { } |

In regular expression literals, we must escape slashes:

> /\//.test('/')
true

In the argument of new RegExp(), we don’t have to escape slashes:

> new RegExp('/').test('/')
true
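
One common use of this list is matching arbitrary text literally. The helper below is a hypothetical sketch (its name is an assumption, not an API from the source) that prefixes each syntax character with a backslash.

```javascript
// Hypothetical helper (the name is an assumption): prefixes every syntax
// character with a backslash so that the text is matched literally.
function escapeSyntaxCharacters(text) {
  return text.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

const escapedPrice = escapeSyntaxCharacters('$5.00 (approx.)');
// escapedPrice is the string \$5\.00 \(approx\.\)

const priceRe = new RegExp(escapedPrice);
const literalHit = priceRe.test('Total: $5.00 (approx.)'); // true
const noWildcard = priceRe.test('Total: $5x00 (approx.)'); // false: the dot is literal
```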

“Syntax characters” (exploringjs.com). Retrieved February 18, 2025.

5
Q

When is it illegal to escape a non-syntax character?

A

Without the flags /u and /v, an escaped non-syntax character at the top level matches itself:

> /^\a$/.test('a')
true

With flag /u or /v, escaping a non-syntax character at the top level is a syntax error:

assert.throws(
  () => eval(String.raw`/\a/v`),
  {
    name: 'SyntaxError',
    message: 'Invalid regular expression: /\\a/v: Invalid escape',
  }
);
assert.throws(
  () => eval(String.raw`/\-/v`),
  {
    name: 'SyntaxError',
    message: 'Invalid regular expression: /\\-/v: Invalid escape',
  }
);
6
Q

What are character classes [...]?

A

A character class wraps class ranges in square brackets. The class ranges specify a set of characters:

  • [«class ranges»] matches any character in the set.
  • [^«class ranges»] matches any character not in the set.

Rules for class ranges:

  • Non-syntax characters stand for themselves: [abc]
  • Only the following four characters are special and must be escaped via backslashes: ^ \ - ]
    • ^ only has to be escaped if it comes first.
    • - need not be escaped if it comes first or last.
  • Character escapes (\n, \u{1F44D}, etc.) have the usual meaning.
  • Character class escapes (\d, \P{White_Space}, \p{RGI_Emoji}, etc.) have the usual meanings.
  • Ranges of characters are specified via dashes: [a-z]

Watch out: inside a character class, \b stands for backspace. Elsewhere in a regular expression, it matches word boundaries.
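
A few of these rules in action, as a rough sketch. The sample strings are made up for illustration.

```javascript
// Set membership and negation
const vowels = 'regular'.match(/[aeiou]/g);        // [ 'e', 'u', 'a' ]
const nonVowels = 'regular'.match(/[^aeiou]/g);    // [ 'r', 'g', 'l', 'r' ]

// A range, plus a dash that comes last and therefore needs no escape
const dashedWord = 'x-ray_7'.match(/[a-z-]+/g);    // [ 'x-ray' ]

// Inside a character class, \b is a backspace character (U+0008)
const backspaceInClass = /[\b]/.test('\u0008');    // true
// Outside a class, \b is a word boundary and consumes no characters
const boundaryOutside = /\bray\b/.test('x-ray');   // true
```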

7
Q

What are the escaping rules inside character classes [...]?

A

Rules for escaping inside character classes without flag /v:

We always must escape: \ ]

Some characters only have to be escaped in some locations:

  • - only has to be escaped if it doesn’t come first or last.
  • ^ only has to be escaped if it comes first.
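
The rules without flag /v can be checked with a short sketch like the following (the test strings are arbitrary):

```javascript
// Always escaped: backslash and closing bracket
const bracketOrBackslash = /^[\]\\]+$/.test(']\\');   // true

// A dash in first or last position is literal
const dashFirst = /^[-a]+$/.test('a-a');              // true
// A dash in the middle forms a range unless it is escaped
const dashEscaped = /^[a\-z]+$/.test('a-z');          // true
const dashRange = /^[a-z]+$/.test('a-z');             // false: '-' is not in a..z

// A caret negates only in first position
const caretLater = /^[a^]+$/.test('^a');              // true
const caretEscaped = /^[\^a]+$/.test('^a');           // true
```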

Rules with flag /v:

A single ^ only has to be escaped if it comes first.

Class set syntax characters have to be escaped:

( ) [ ] { } / - \ |

Class set reserved double punctuators have to be escaped:

&& !! ## $$ %% ** ++ ,, .. :: ;; << == >> ?? @@ ^^ `` ~~
8
Q

What are syntax atoms of regular expressions?

A

Atoms are the basic building blocks of regular expressions.

  • Pattern characters are all characters except syntax characters (^, $, etc.). Pattern characters match themselves. Examples: A b %
  • . matches any character. We can use the flag /s (dotAll) to control if the dot matches line terminators or not.
  • Character escapes (each escape matches a single fixed character). For example:
    • \f: form feed (FF)
    • \n: line feed (LF)
    • \r: carriage return (CR)
    • \t: character tabulation
    • \v: line tabulation
    • Arbitrary control characters: \cA (Ctrl-A), …, \cZ (Ctrl-Z)
    • Unicode code units: \u00E4
    • Unicode code points (require flag /u or /v): \u{1F44D}
  • Character class escapes define sets of characters (or character sequences) that match:
    • Basic character class escapes define sets of characters: \d \D \s \S \w \W
    • Unicode character property escapes [ES2018] define sets of code points: \p{White_Space}, \P{White_Space}, etc.
      • Require flag /u or /v.
    • Unicode string property escapes [ES2024] define sets of code point sequences: \p{RGI_Emoji}, etc.
      • Require flag /v.
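
A short sketch with one example per kind of atom. The inputs are arbitrary, and the thumbs-up character is built with String.fromCodePoint() only to keep the snippet ASCII.

```javascript
// Pattern characters and the dot
const anyChar = /^a.c$/.test('a-c');                 // true
const dotVsNewline = /^a.c$/.test('a\nc');           // false without /s

// Character escapes
const tab = /^\t$/.test('\u0009');                   // true
const ctrlA = /^\cA$/.test('\u0001');                // true
const codeUnit = /^\u0041$/.test('A');               // true
const thumbsUp = String.fromCodePoint(0x1F44D);
const codePoint = /^\u{1F44D}$/u.test(thumbsUp);     // true (needs /u or /v)

// Character class escapes
const digitsOnly = /^\d+$/.test('2024');             // true
const whiteSpace = /^\p{White_Space}$/u.test(' ');   // true (needs /u or /v)
```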
9
Q

What do the following character class escapes (sets of code units) do: \d \D \s \S \w \W ?

A

\d → Matches any digit (equivalent to [0-9]).
\D → Matches any non-digit (equivalent to [^0-9]).
\s → Matches any whitespace character (spaces, tabs, line terminators, etc.).
\S → Matches any non-whitespace character.
\w → Matches any “word” character (equivalent to [a-zA-Z0-9_]).
\W → Matches any non-word character (equivalent to [^a-zA-Z0-9_]).

Examples:

> 'a7x4'.match(/\d/g)
[ '7', '4' ]
> 'a7x4'.match(/\D/g)
[ 'a', 'x' ]
> 'high - low'.match(/\w+/g)
[ 'high', 'low' ]
> 'hello\t\n everyone'.replaceAll(/\s/g, '-')
'hello---everyone'
10
Q

What are Unicode character properties?

A

In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character.

These are a few examples of properties:

  • Name: a unique name, composed of uppercase letters, digits, hyphens, and spaces – for example:
    • A: Name = LATIN CAPITAL LETTER A
    • 🙂: Name = SLIGHTLY SMILING FACE
  • General_Category: categorizes characters – for example:
    • x: General_Category = Lowercase_Letter
    • $: General_Category = Currency_Symbol
  • White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines – for example:
    • \t: White_Space = True
    • π: White_Space = False
  • Age: version of the Unicode Standard in which a character was introduced – for example: The Euro sign was added in version 2.1 of the Unicode standard.
    • €: Age = 2.1
  • Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
    • S: Block = Basic_Latin (range 0x0000..0x007F)
    • 🙂: Block = Emoticons (range 0x1F600..0x1F64F)
  • Script: a collection of characters used by one or more writing systems.
    • Some scripts support several writing systems. For example, the Latin script supports the writing systems English, French, German, Latin, etc.
    • Some languages can be written in multiple alternate writing systems that are supported by multiple scripts. For example, Turkish used the Arabic script before it transitioned to the Latin script in the early 20th century.
    • Examples:
      • α: Script = Greek
      • Д: Script = Cyrillic
11
Q

What are Unicode character property escapes?

A

With flag /u or flag /v, we can use \p{} and \P{} to specify sets of code points via Unicode character properties. That looks like this:

  1. \p{prop=value}: matches all characters whose Unicode character property prop has the value value.
  2. \P{prop=value}: matches all characters that do not have a Unicode character property prop whose value is value.
  3. \p{bin_prop}: matches all characters whose binary Unicode character property bin_prop is True.
  4. \P{bin_prop}: matches all characters whose binary Unicode character property bin_prop is False.

Comments:

Without the flags /u and /v, \p is the same as p.

Forms (3) and (4) can be used as abbreviations if the property is General_Category. For example, the following two escapes are equivalent:

\p{Uppercase_Letter}
\p{General_Category=Uppercase_Letter}

Examples:

Checking for whitespace:

> /^\p{White_Space}+$/u.test('\t \n\r')
true

Checking for Greek letters:

> /^\p{Script=Greek}+$/u.test('μετά')
true

Deleting any letters:

> '1π2ü3é4'.replace(/\p{Letter}/ug, '')
'1234'

Deleting lowercase letters:

> 'AbCdEf'.replace(/\p{Lowercase_Letter}/ug, '')
'ACE'
12
Q

What are Unicode string property escapes?

A

With /u, we can use Unicode property escapes (\p{} and \P{}) to specify sets of code points via Unicode character properties.

With /v, we can additionally use \p{} to specify sets of code point sequences via Unicode string properties (negation via \P{} is not supported):

> /^\p{RGI_Emoji}$/v.test('⛔') // 1 code point (1 code unit)
true
> /^\p{RGI_Emoji}$/v.test('🙂') // 1 code point (2 code units)
true
> /^\p{RGI_Emoji}$/v.test('😵‍💫') // 3 code points
true

Let’s see how the character property Emoji would do with these inputs:

> /^\p{Emoji}$/u.test('⛔') // 1 code point (1 code unit)
true
> /^\p{Emoji}$/u.test('🙂') // 1 code point (2 code units)
true
> /^\p{Emoji}$/u.test('😵‍💫') // 3 code points
false
13
Q

Regexp syntax quantifiers

A

By default, all of the following quantifiers are greedy (they match as many characters as possible):

  • ?: match never or once
  • *: match zero or more times
  • +: match one or more times
  • {n}: match n times
  • {n,}: match n or more times
  • {n,m}: match at least n times, at most m times.

To make them reluctant (so that they match as few characters as possible), put question marks (?) after them:

> /".*"/.exec('"abc"def"')[0]  // greedy
'"abc"def"'
> /".*?"/.exec('"abc"def"')[0] // reluctant
'"abc"'
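
The counted quantifiers follow the same rules. A small sketch with arbitrary inputs:

```javascript
const exact = /^\d{4}$/.test('2025');                   // true
const atLeast = /^\d{2,}$/.test('7');                   // false: needs two or more digits
const bounded = 'a aa aaa aaaa'.match(/\ba{2,3}\b/g);   // [ 'aa', 'aaa' ]

// Greedy vs. reluctant with a counted quantifier
const greedyDigits = /\d{2,4}/.exec('123456')[0];       // '1234'
const reluctantDigits = /\d{2,4}?/.exec('123456')[0];   // '12'
```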
14
Q

Regexp syntax assertions

A
  • ^ matches only at the beginning of the input
  • $ matches only at the end of the input
  • \b matches only at a word boundary
    • \B matches only when not at a word boundary
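
A brief sketch of the four assertions listed above, using made-up inputs:

```javascript
const startsWithAbc = /^abc/.test('abcdef');               // true
const endsWithDef = /def$/.test('abcdef');                 // true

// \b: 'cat' must be a whole word
const wholeWord = 'cat concat'.match(/\bcat\b/g);          // [ 'cat' ]

// \B: 'cat' must not start at a word boundary
const insideWord = 'cat concat'.replace(/\Bcat/g, 'CAT');  // 'cat conCAT'
```

Assertions consume no characters; they only check the position at which the rest of the pattern is tried.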

“Syntax: assertions” (exploringjs.com): https://exploringjs.com/js/book/ch_regexps.html#syntax-assertions. Retrieved February 28, 2025.

15
Q

What are lookaround assertions?

A

Lookaround assertions are special types of assertions in regular expressions that allow you to match a pattern based on what comes before (lookbehind) or after (lookahead) it, without including those parts in the match.

Positive lookahead: (?=«pattern») matches if pattern matches what comes next.

Example: sequences of lowercase letters that are followed by an X.

> 'abcX def'.match(/[a-z]+(?=X)/g)
[ 'abc' ]

Note that the X itself is not part of the matched substring.

Negative lookahead: (?!«pattern») matches if pattern does not match what comes next.

Example: sequences of lowercase letters that are not followed by an X.

> 'abcX def'.match(/[a-z]+(?!X)/g)
[ 'ab', 'def' ]

Positive lookbehind: (?<=«pattern») matches if pattern matches what came before.

Example: sequences of lowercase letters that are preceded by an X.

> 'Xabc def'.match(/(?<=X)[a-z]+/g)
[ 'abc' ]

Negative lookbehind: (?<!«pattern») matches if pattern does not match what came before.

Example: sequences of lowercase letters that are not preceded by an X.

> 'Xabc def'.match(/(?<!X)[a-z]+/g)
[ 'bc', 'def' ]

Example: replace “.js” with “.html”, but not in “Node.js”.

> 'Node.js: index.js and main.js'.replace(/(?<!Node)\.js/g, '.html')
'Node.js: index.html and main.html'
16
Q

Explain regexp syntax disjunction (|)

A
  • ^aa|zz$ - matches all strings that start with 'aa' and/or end with 'zz'.
    • Note that | has a lower precedence than ^ and $.
  • ^(aa|zz)$ - matches the two strings 'aa' and 'zz'.
  • ^a(a|z)z$ - matches the two strings 'aaz' and 'azz'.

Caveat: this operator has low precedence. Use groups where necessary, as in the last two patterns above.
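
The effect of precedence can be seen by running the three patterns against a few inputs (the inputs are arbitrary):

```javascript
// Without a group, ^ binds to 'aa' and $ binds to 'zz'
const ungroupedStart = /^aa|zz$/.test('aaxx');   // true: starts with 'aa'
const ungroupedEnd = /^aa|zz$/.test('xxzz');     // true: ends with 'zz'

// With a group, both anchors apply to each alternative
const groupedLoose = /^(aa|zz)$/.test('aaxx');   // false
const groupedExact = /^(aa|zz)$/.test('zz');     // true

// Alternation limited to the middle character
const middleOnly = /^a(a|z)z$/.test('azz');      // true
```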

17
Q

Explain regexp /i (.ignoreCase) flag

A

/i (.ignoreCase) flag switches on case-insensitive matching:

> /a/.test('A')
false
> /a/i.test('A')
true
18
Q

Explain regexp /g (.global) flag

A

/g (.global) flag fundamentally changes how the following methods work.

RegExp.prototype.test()
RegExp.prototype.exec()
String.prototype.match()

In a nutshell, without /g, the methods only consider the first match for a regular expression in an input string. With /g, they consider all matches.
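
A minimal sketch of that difference for .match() and .exec(). The sample text is made up.

```javascript
const text = 'a1 a2 a3';

// .match(): first match object vs. array of all matched substrings
const first = text.match(/a\d/);      // match object; first[0] is 'a1', first.index is 0
const all = text.match(/a\d/g);       // [ 'a1', 'a2', 'a3' ]

// .exec(): with /g, each call continues at .lastIndex
const globalRe = /a\d/g;
const found = [];
let m;
while ((m = globalRe.exec(text)) !== null) {
  found.push(`${m[0]}@${m.index}`);
}
// found is [ 'a1@0', 'a2@3', 'a3@6' ]

// Without /g, .exec() keeps returning the first match
const plainRe = /a\d/;
const again = [plainRe.exec(text)[0], plainRe.exec(text)[0]]; // [ 'a1', 'a1' ]
```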

19
Q

Explain regexp /d (.hasIndices) flag

A

Some RegExp-related methods return match objects that describe where the regular expression matched in an input string. If the /d (.hasIndices) flag is on, each match object includes match indices which tell us where each group capture starts and ends.

Match indices for numbered groups
This is how we access the captures of numbered groups:

const matchObj = /(a+)(b+)/d.exec('aaaabb');
assert.equal(
  matchObj[1], 'aaaa'
);
assert.equal(
  matchObj[2], 'bb'
);

Due to the regular expression flag /d, matchObj also has a property .indices that records for each numbered group where it was captured in the input string:

assert.deepEqual(
  matchObj.indices[1], [0, 4]
);
assert.deepEqual(
  matchObj.indices[2], [4, 6]
);

Match indices for named groups
The captures of named groups are accessed like this:

const matchObj = /(?<as>a+)(?<bs>b+)/d.exec('aaaabb');
assert.equal(
  matchObj.groups.as, 'aaaa');
assert.equal(
  matchObj.groups.bs, 'bb');

Their indices are stored in matchObj.indices.groups:

assert.deepEqual(
  matchObj.indices.groups.as, [0, 4]);
assert.deepEqual(
  matchObj.indices.groups.bs, [4, 6]);
20
Q

Explain regexp /m (.multiline) flag

A

If the /m (.multiline) flag is on, ^ matches the beginning of each line and $ matches the end of each line. If it is off, ^ matches the beginning of the whole input string and $ matches the end of the whole input string.

> 'a1\na2\na3'.match(/^a./gm)
[ 'a1', 'a2', 'a3' ]
> 'a1\na2\na3'.match(/^a./g)
[ 'a1' ]
21
Q

Explain regexp /s (.dotAll) flag

A

By default, the dot does not match line terminators. With the /s (.dotAll) flag, it does:

> /./.test('\n')
false
> /./s.test('\n')
true

Workaround: If /s isn’t supported, we can use [^] instead of a dot.

> /[^]/.test('\n')
true
22
Q

Explain regexp /y (.sticky) flag

A

/y (.sticky): This flag mainly makes sense in conjunction with /g. When both are switched on, any match must directly follow the previous one (that is, it must start at index .lastIndex of the regular expression object). Therefore, the first match must be at index 0.

> 'a1a2 a3'.match(/a./gy)
[ 'a1', 'a2' ]
> '_a1a2 a3'.match(/a./gy) // first match must be at index 0
null
> 'a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]
> '_a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]

The main use case for /y is tokenization (during parsing).
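
To make the tokenization use case concrete, here is a rough sketch of a tokenizer for simple arithmetic. The function name, the token shapes, and the grammar are assumptions made for illustration. It uses /y without /g and sets .lastIndex by hand, so each token must start exactly where the previous one ended.

```javascript
// Hypothetical tokenizer: names, token shapes, and grammar are illustrative only.
function tokenize(input) {
  // Skip optional whitespace, then capture a number or a single operator.
  const tokenRe = /\s*(?:(?<number>\d+)|(?<operator>[-+*\/()]))/y;
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    tokenRe.lastIndex = pos; // /y: the match must start exactly here
    const match = tokenRe.exec(input);
    if (match === null) {
      // Trailing whitespace is fine; anything else is an error.
      if (input.slice(pos).trim() === '') break;
      throw new SyntaxError(`Unexpected input at index ${pos}`);
    }
    if (match.groups.number !== undefined) {
      tokens.push({ type: 'number', value: match.groups.number });
    } else {
      tokens.push({ type: 'operator', value: match.groups.operator });
    }
    pos = tokenRe.lastIndex;
  }
  return tokens;
}

const tokenValues = tokenize('12 + (3*4)').map((t) => t.value);
// tokenValues is [ '12', '+', '(', '3', '*', '4', ')' ]
```

Without /y, .exec() would silently skip over unexpected characters to find the next token; with /y, they cause a failed match that the loop can report.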

23
Q

Explain regexp /u (.unicode) flag

A

The /u (.unicode) flag provides better support for Unicode code points.
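
A small sketch of what that means in practice, using 🙂 (one code point, two UTF-16 code units):

```javascript
const smiley = '\u{1F642}'; // 🙂

// The dot matches code units without /u and code points with /u
const unitsWithoutU = smiley.match(/./g).length;    // 2
const pointsWithU = smiley.match(/./gu).length;     // 1

// Character classes and quantifiers work on whole code points with /u
const classWithU = /^[🙂]$/u.test(smiley);           // true
const classWithoutU = /^[🙂]$/.test(smiley);         // false
const repeatWithU = /^🙂{2}$/u.test(smiley + smiley); // true
```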

24
Q

Explain regexp /v (.unicodeSets) flag

A

The /v (.unicodeSets) flag improves on flag /u and provides limited support for multi-code-point grapheme clusters. It also supports set operations in character classes.

25
Q

In which order should we list regular expression flags?

A

The order doesn’t matter, but if you need a convention, list them in alphabetical order.

JavaScript also uses alphabetical order for the RegExp property .flags:

> /-/gymdivs.flags
'dgimsvy'
26
Q

Why do we need /u and /v flags?

A

Without the flags /u and /v, most constructs work with single UTF-16 code units – which is a problem whenever there is a code point with two code units – such as 🙂:

> '🙂'.length
2

We can use code unit escapes – \u followed by four hexadecimal digits:

> /^\uD83D\uDE42$/.test('🙂')
true

The dot operator (.) matches code units:

> '🙂'.match(/./g)
[ '\uD83D', '\uDE42' ]

Quantifiers apply to code units:

> /^🙂{2}$/.test('\uD83D\uDE42\uDE42')
true

> /^\uD83D\uDE42{2}$/.test('\uD83D\uDE42\uDE42') // equivalent
true

Character class escapes define sets of code units:

> '🙂'.match(/\D/g)
[ '\uD83D', '\uDE42' ]

Character classes define sets of code units:

> /^[🙂]$/.test('🙂')
false
> /^[\uD83D\uDE42]$/.test('\uD83D\uDE42') // equivalent
false
> /^[🙂]$/.test('\uD83D')
true
27
Q

What is the problem with this Regexp /-/uv?

A

Flag /v and flag /u are mutually exclusive – we can’t use them both at the same time:

assert.throws(
  () => eval('/-/uv'),
  SyntaxError
);
28
Q

When should you use the /v flag (a.k.a. unicodeSets)?

A

Whenever you can!!

This flag improves many aspects of JavaScript’s regular expressions and should be used by default. If you can’t use it yet because it’s still too new, you can use /u, instead.

29
Q

What are the limitations of the /u flag?

A

It doesn’t work with grapheme clusters.

Some font glyphs are represented by grapheme clusters (code point sequences) with more than one code point – e.g. 😵‍💫:

> Array.from('😵‍💫').length // count code points
3

Flag /u does not help us with those kinds of grapheme clusters:

// Grapheme cluster is not matched by single dot
assert.equal(
  '😵‍💫'.match(/./gu).length, 3
);

// Quantifiers only repeat last code point of grapheme cluster
assert.equal(
  /^😵‍💫{2}$/u.test('😵‍💫'), false
);

// Character class escapes only match single code points
assert.equal(
  /^\p{Emoji}$/u.test('😵‍💫'), false
);

// Character classes only match single code points
assert.equal(
  /^[😵‍💫]$/u.test('😵‍💫'), false
);
30
Q

What are match objects?

A

Several regular expression-related methods return so-called match objects to provide detailed information for the locations where a regular expression matches an input string. These methods are:

* `RegExp.prototype.exec()` - returns `null` or a single match object.
* `String.prototype.match()` - returns `null` or a single match object (if flag `/g` is not set).
* `String.prototype.matchAll()` - returns an iterable of match objects (flag `/g` must be set; otherwise, an exception is thrown).

This is an example:

assert.deepEqual(
  /(a+)b/d.exec('ab aaab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aaab',
    groups: undefined,
    indices: {
      0: [0, 2],
      1: [0, 1],
      groups: undefined
    },
  }
);

The result of .exec() is a match object for the first match with the following properties:

  • [0]: the complete substring matched by the regular expression
  • [1]: capture of numbered group 1 (etc.)
  • .index: where did the match occur?
  • .input: the string that was matched against
  • .groups: captures of named groups
  • .indices: the index ranges of captured groups
    • This property is only created if flag /d is switched on.
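
To see several match objects at once, .matchAll() can be iterated. A minimal sketch that reuses the pattern and input from the example above:

```javascript
const input = 'ab aaab';
const summary = [];
for (const match of input.matchAll(/(a+)b/g)) {
  // Each `match` has the same shape as the result of .exec()
  summary.push([match[0], match[1], match.index]);
}
// summary is [ [ 'ab', 'a', 0 ], [ 'aaab', 'aaa', 3 ] ]

// Without /g, .matchAll() throws a TypeError
let threwTypeError = false;
try {
  input.matchAll(/(a+)b/);
} catch (err) {
  threwTypeError = err instanceof TypeError;
}
```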
31
Q

What are match indices?

A

Match indices are a feature of match objects: If we turn them on via the regular expression flag /d (property .hasIndices), they record the start and end indices of where groups were captured.

32
Q

How do you access the match indices of numbered capture groups?

A

This is how we access the captures of numbered groups:

const matchObj = /(a+)(b+)/d.exec('aaaabb');
assert.equal(
  matchObj[1], 'aaaa'
);
assert.equal(
  matchObj[2], 'bb'
);

Due to the regular expression flag /d, matchObj also has a property .indices that records for each numbered group where it was captured in the input string:

assert.deepEqual(
  matchObj.indices[1], [0, 4]
);
assert.deepEqual(
  matchObj.indices[2], [4, 6]
);
33
Q

How do you access the match indices of named capture groups?

A

The captures of named groups are accessed like this:

const matchObj = /(?<as>a+)(?<bs>b+)/d.exec('aaaabb');
assert.equal(
  matchObj.groups.as, 'aaaa');
assert.equal(
  matchObj.groups.bs, 'bb');

Their indices are stored in matchObj.indices.groups:

assert.deepEqual(
  matchObj.indices.groups.as, [0, 4]);
assert.deepEqual(
  matchObj.indices.groups.bs, [4, 6]);
34
Q

What does regExp.test(str) method do?

A

The regular expression method .test() returns true if regExp matches str:

> /bc/.test('ABCD')
false
> /bc/i.test('ABCD')
true
> /\.mjs$/.test('main.mjs')
true

With .test() we should normally avoid the /g flag. If we use it, we generally don’t get the same result every time we call the method:

> const r = /a/g;
> r.test('aab')
true
> r.test('aab')
true
> r.test('aab')
false

The results are due to /a/ having two matches in the string. After all of those were found, .test() returns false.
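
The mechanism behind this is the property .lastIndex, which .test() updates when /g is set. A short sketch that records it after each call:

```javascript
const globalA = /a/g;
const results = [];
for (let i = 0; i < 3; i++) {
  const ok = globalA.test('aab');
  results.push([ok, globalA.lastIndex]);
}
// results is [ [ true, 1 ], [ true, 2 ], [ false, 0 ] ]

// Without /g, .lastIndex is ignored and the answer is stable
const plainA = /a/;
const stable = [plainA.test('aab'), plainA.test('aab'), plainA.test('aab')];
// stable is [ true, true, true ]
```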

35
Q

What does str.search(regExp) method do?

A

The string method .search() returns the first index of str at which there is a match for regExp:

> '_abc_'.search(/abc/)
1
> 'main.mjs'.search(/\.mjs$/)
4
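
Two further details, sketched with made-up inputs: .search() returns -1 if there is no match, and it always searches from the start of the string, leaving .lastIndex as it was.

```javascript
const noMatch = 'main.mjs'.search(/\.css$/);   // -1

const globalX = /x/g;
globalX.lastIndex = 3;
const fromStart = 'x__x'.search(globalX);      // 0: the search starts at index 0
const untouched = globalX.lastIndex;           // 3: restored after the search
```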