Regular Expressions Flashcards
What is a regular expression?
Regular expressions are patterns used to match character combinations in strings.
In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String.
“Regular expressions - JavaScript | MDN” (MDN Web Docs). Retrieved February 5, 2025.
How can you create a regular expression?
The two main ways of creating regular expressions are:
- Literal: compiled statically (at load time).
/abc/iv
- Constructor: compiled dynamically (at runtime).
new RegExp('abc', 'iv')
Both regular expressions have the same two parts:
- The body abc – the actual regular expression.
- The flags i and v. Flags configure how the pattern is interpreted. For example, i enables case-insensitive matching and v enables Unicode sets mode.
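The constructor form is mostly useful when the pattern is only known at runtime. A minimal sketch, assuming a hypothetical helper `makeWordMatcher()` (not part of any library) that builds a case-insensitive regular expression from a list of words:

```javascript
// Hypothetical helper: builds a regular expression at runtime.
// Assumes the words contain no syntax characters (no escaping is done here).
function makeWordMatcher(words) {
  const body = '\\b(?:' + words.join('|') + ')\\b';
  return new RegExp(body, 'i');
}

const literal = /\b(?:red|green)\b/i;               // literal: compiled at load time
const dynamic = makeWordMatcher(['red', 'green']);  // constructor: compiled at runtime

console.log(literal.source === dynamic.source); // true
console.log(dynamic.flags);                     // 'i'
console.log(dynamic.test('A GREEN door'));      // true
```

Both objects have the same body and the same flags; only the moment of compilation differs.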
“Literal vs. constructor” (exploringjs.com). Retrieved February 18, 2025.
How can you clone a regular expression?
There are two variants of the constructor RegExp():
- new RegExp(pattern : string, flags = '') [ES3]
  A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
- new RegExp(regExp : RegExp, flags = regExp.flags) [ES6]
  regExp is cloned. If flags is provided, then it determines the flags of the clone.
The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them – for example:
function copyAndAddFlags(regExp, flagsToAdd='') {
  // The constructor doesn’t allow duplicate flags;
  // make sure there aren’t any:
  const newFlags = Array.from(
    new Set(regExp.flags + flagsToAdd)
  ).join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');
“Cloning and non-destructively modifying regular expressions” (exploringjs.com). Retrieved February 18, 2025.
What are the syntax characters of a regular expression?
At the top level of a regular expression, the following characters are special and are known as syntax characters. They are escaped by prefixing a backslash (\).
\ ^ $ . * + ? ( ) [ ] { } |
In regular expression literals, we must escape slashes:
> /\//.test('/')
true
In the argument of new RegExp(), we don’t have to escape slashes:
> new RegExp('/').test('/')
true
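When a string from outside (user input, a file name) should be matched literally, each syntax character in it has to be escaped first. A minimal sketch, assuming a hypothetical helper named `escapeSyntaxChars()` whose result is used at the top level of a pattern (not inside a character class):

```javascript
// Hypothetical helper: prefixes each syntax character with a backslash.
function escapeSyntaxChars(text) {
  return text.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

const fileName = 'report(1).txt';
const unescaped = new RegExp('^' + fileName + '$');
const escaped = new RegExp('^' + escapeSyntaxChars(fileName) + '$');

// Unescaped, the parentheses form a group and the dot matches any character.
console.log(unescaped.test('report(1).txt')); // false
console.log(escaped.test('report(1).txt'));   // true
console.log(escaped.test('report(1)Xtxt'));   // false
```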
“Syntax characters” (exploringjs.com). Retrieved February 18, 2025.
When is it illegal to escape a non-syntax character?
Without the flags /u and /v, an escaped non-syntax character at the top level matches itself:
> /^\a$/.test('a')
true
With flag /u or /v, escaping a non-syntax character at the top level is a syntax error:
assert.throws(
  () => eval(String.raw`/\a/v`),
  {
    name: 'SyntaxError',
    message: 'Invalid regular expression: /\\a/v: Invalid escape',
  }
);
assert.throws(
  () => eval(String.raw`/\-/v`),
  {
    name: 'SyntaxError',
    message: 'Invalid regular expression: /\\-/v: Invalid escape',
  }
);
“Illegal top-level escaping” (exploringjs.com). Retrieved February 19, 2025.
What are character classes [...]?
A character class wraps class ranges in square brackets. The class ranges specify a set of characters:
- [«class ranges»] matches any character in the set.
- [^«class ranges»] matches any character not in the set.
Rules for class ranges:
- Non-syntax characters stand for themselves: [abc]
- Only the following four characters are special and must be escaped via backslashes: ^ \ - ]
  - ^ only has to be escaped if it comes first.
  - - need not be escaped if it comes first or last.
- Character escapes (\n, \u{1F44D}, etc.) have the usual meanings.
- Character class escapes (\d, \P{White_Space}, \p{RGI_Emoji}, etc.) have the usual meanings.
- Ranges of characters are specified via dashes: [a-z]
Watch out: inside a character class, \b stands for backspace. Elsewhere in a regular expression, it matches word boundaries.
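The rules above can be tried out with a few short checks. This is only an illustrative sketch; the sample strings are made up:

```javascript
// A class matches one character from the set; ^ at the start negates it.
console.log(/^[abc]$/.test('b'));  // true
console.log(/^[^abc]$/.test('b')); // false
console.log(/^[^abc]$/.test('x')); // true

// A dash between two characters defines a range.
console.log('x7y42'.match(/[0-9]+/g)); // [ '7', '42' ]

// A dash that comes first (or last) stands for itself.
console.log('a-b_c'.split(/[-_]/)); // [ 'a', 'b', 'c' ]

// Inside a class, \b is backspace (U+0008), not a word boundary.
console.log(/[\b]/.test('\b'));  // true
console.log(/[\b]/.test('a b')); // false
```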
“Syntax: character classes” (exploringjs.com). Retrieved February 20, 2025.
What are the escaping rules inside character classes [...]?
Rules for escaping inside character classes without flag /v:
- We always must escape: \ ]
- Some characters only have to be escaped in some locations:
  - - only has to be escaped if it doesn’t come first or last.
  - ^ only has to be escaped if it comes first.
Rules with flag /v:
- A single ^ only has to be escaped if it comes first.
- Class set syntax characters have to be escaped: ( ) [ ] { } / - \ |
- Class set reserved double punctuators have to be escaped: && !! ## $$ %% ** ++ ,, .. :: ;; << == >> ?? @@ ^^ `` ~~
“Escaping inside character classes ([···])” (exploringjs.com). Retrieved February 19, 2025.
What are syntax atoms of regular expressions?
Atoms are the basic building blocks of regular expressions.
- Pattern characters are all characters except syntax characters (^, $, etc.). Pattern characters match themselves. Examples: A b %
- . matches any character. We can use the flag /s (dotAll) to control if the dot matches line terminators or not.
- Character escapes (each escape matches a single fixed character). For example:
  - \f: form feed (FF)
  - \n: line feed (LF)
  - \r: carriage return (CR)
  - \t: character tabulation
  - \v: line tabulation
  - Arbitrary control characters: \cA (Ctrl-A), …, \cZ (Ctrl-Z)
  - Unicode code units: \u00E4
  - Unicode code points (require flag /u or /v): \u{1F44D}
- Character class escapes define sets of characters (or character sequences) that match:
  - Basic character class escapes define sets of characters: \d \D \s \S \w \W
  - Unicode character property escapes [ES2018] define sets of code points: \p{White_Space}, \P{White_Space}, etc.
    - Require flag /u or /v.
  - Unicode string property escapes [ES2024] define sets of code point sequences: \p{RGI_Emoji}, etc.
    - Require flag /v.
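A few of the character escapes listed above, as a small sketch. The test strings are built with `String.fromCodePoint()` so that the code points being matched are unambiguous:

```javascript
// Control characters
console.log(/^\t$/.test('\t'));                       // true
console.log(/^\cA$/.test(String.fromCodePoint(0x1))); // true (Ctrl-A is U+0001)

// Code unit escape: \u followed by four hexadecimal digits
const aUmlaut = String.fromCodePoint(0xE4); // ä
console.log(/^\u00E4$/.test(aUmlaut)); // true

// Code point escape: only recognized with flag /u (or /v)
const thumbsUp = String.fromCodePoint(0x1F44D); // 👍
console.log(/^\u{1F44D}$/u.test(thumbsUp)); // true
console.log(/^\u{1F44D}$/.test(thumbsUp));  // false
```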
“Syntax: atoms of regular expressions” (exploringjs.com). Retrieved February 20, 2025.
What do the following character class escapes (sets of code units) do: \d \D \s \S \w \W?
\d → Matches any digit (equivalent to [0-9]).
\D → Matches any non-digit (equivalent to [^0-9]).
\s → Matches any whitespace character (spaces, tabs, line terminators, etc.).
\S → Matches any non-whitespace character.
\w → Matches any “word” character (equivalent to [a-zA-Z0-9_]).
\W → Matches any non-word character (equivalent to [^a-zA-Z0-9_]).
Examples:
> 'a7x4'.match(/\d/g)
[ '7', '4' ]
> 'a7x4'.match(/\D/g)
[ 'a', 'x' ]
> 'high - low'.match(/\w+/g)
[ 'high', 'low' ]
> 'hello\t\n everyone'.replaceAll(/\s/g, '-')
'hello---everyone'
“Basic character class escapes (sets of code units): \d \D \s \S \w \W” (exploringjs.com). Retrieved February 24, 2025.
What are Unicode character properties?
In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character.
These are a few examples of properties:
- Name: a unique name, composed of uppercase letters, digits, hyphens, and spaces – for example:
  - A: Name = LATIN CAPITAL LETTER A
  - 🙂: Name = SLIGHTLY SMILING FACE
- General_Category: categorizes characters – for example:
  - x: General_Category = Lowercase_Letter
  - $: General_Category = Currency_Symbol
- White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines – for example:
  - \t: White_Space = True
  - π: White_Space = False
- Age: version of the Unicode Standard in which a character was introduced – for example: The Euro sign € was added in version 2.1 of the Unicode standard.
  - €: Age = 2.1
- Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
  - S: Block = Basic_Latin (range 0x0000..0x007F)
  - 🙂: Block = Emoticons (range 0x1F600..0x1F64F)
- Script: a collection of characters used by one or more writing systems.
  - Some scripts support several writing systems. For example, the Latin script supports the writing systems English, French, German, Latin, etc.
  - Some languages can be written in multiple alternate writing systems that are supported by multiple scripts. For example, Turkish used the Arabic script before it transitioned to the Latin script in the early 20th century.
  - Examples:
    - α: Script = Greek
    - Д: Script = Cyrillic
“Unicode character properties” (exploringjs.com). Retrieved February 25, 2025.
What are Unicode character property escapes?
With flag /u and flag /v, we can use \p{} and \P{} to specify sets of code points via Unicode character properties. That looks like this:
1. \p{prop=value}: matches all characters whose Unicode character property prop has the value value.
2. \P{prop=value}: matches all characters that do not have a Unicode character property prop whose value is value.
3. \p{bin_prop}: matches all characters whose binary Unicode character property bin_prop is True.
4. \P{bin_prop}: matches all characters whose binary Unicode character property bin_prop is False.
Comments:
- Without the flags /u and /v, \p is the same as p.
- Forms (3) and (4) can be used as abbreviations if the property is General_Category. For example, the following two escapes are equivalent:
  \p{Uppercase_Letter}
  \p{General_Category=Uppercase_Letter}
Examples:
Checking for whitespace:
> /^\p{White_Space}+$/u.test('\t \n\r')
true
Checking for Greek letters:
> /^\p{Script=Greek}+$/u.test('μετά')
true
Deleting any letters:
> '1π2ü3é4'.replace(/\p{Letter}/ug, '')
'1234'
Deleting lowercase letters:
> 'AbCdEf'.replace(/\p{Lowercase_Letter}/ug, '')
'ACE'
“Unicode character property escapes [ES2018]” (exploringjs.com). Retrieved February 25, 2025.
What are Unicode string property escapes?
With /u, we can use Unicode property escapes (\p{} and \P{}) to specify sets of code points via Unicode character properties.
With /v, we can additionally use \p{} to specify sets of code point sequences via Unicode string properties (negation via \P{} is not supported):
> /^\p{RGI_Emoji}$/v.test('⛔') // 1 code point (1 code unit)
true
> /^\p{RGI_Emoji}$/v.test('🙂') // 1 code point (2 code units)
true
> /^\p{RGI_Emoji}$/v.test('😵‍💫') // 3 code points
true
Let’s see how the character property Emoji would do with these inputs:
> /^\p{Emoji}$/u.test('⛔') // 1 code point (1 code unit)
true
> /^\p{Emoji}$/u.test('🙂') // 1 code point (2 code units)
true
> /^\p{Emoji}$/u.test('😵‍💫') // 3 code points
false
“Unicode string property escapes [ES2024]” (exploringjs.com). Retrieved February 26, 2025.
Regexp syntax quantifiers
By default, all of the following quantifiers are greedy (they match as many characters as possible):
- ?: match never or once
- *: match zero or more times
- +: match one or more times
- {n}: match n times
- {n,}: match n or more times
- {n,m}: match at least n times, at most m times.
To make them reluctant (so that they match as few characters as possible), put question marks (?) after them:
> /".*"/.exec('"abc"def"')[0] // greedy
'"abc"def"'
> /".*?"/.exec('"abc"def"')[0] // reluctant
'"abc"'
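The counted quantifiers are easiest to see on small inputs. An illustrative sketch; the variable names and sample strings are invented for this example:

```javascript
// {n}: exactly n repetitions
const year = /^\d{4}$/;
console.log(year.test('2025')); // true
console.log(year.test('25'));   // false

// {n,m}: at least n, at most m repetitions
const shortNumber = /^\d{2,4}$/;
console.log(shortNumber.test('7'));     // false
console.log(shortNumber.test('1234'));  // true
console.log(shortNumber.test('12345')); // false

// ?: the preceding atom is optional
const colour = /^colou?r$/;
console.log(colour.test('color') && colour.test('colour')); // true

// Greedy vs. reluctant
const html = '<b>bold</b>';
console.log(html.match(/<.+>/)[0]);  // '<b>bold</b>'
console.log(html.match(/<.+?>/)[0]); // '<b>'
```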
“Syntax: quantifiers” (exploringjs.com). Retrieved February 27, 2025.
Regexp syntax assertions
- ^ matches only at the beginning of the input
- $ matches only at the end of the input
- \b matches only at a word boundary
  - \B matches only when not at a word boundary
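Assertions match positions, not characters. A short sketch with made-up sample strings:

```javascript
// ^ and $ anchor the pattern to the start and end of the input.
console.log(/^abc$/.test('abc'));  // true
console.log(/^abc$/.test('xabc')); // false

// \b: 'cat' only as a whole word
console.log('cat concat cat'.match(/\bcat\b/g)); // [ 'cat', 'cat' ]

// \B: 'cat' only when it does not start at a word boundary
console.log('cat concat'.match(/\Bcat/g)); // [ 'cat' ] (the one inside 'concat')

console.log('cat concat'.replace(/\bcat\b/g, 'dog')); // 'dog concat'
```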
“Syntax: assertions” (exploringjs.com, https://exploringjs.com/js/book/ch_regexps.html#syntax-assertions). Retrieved February 28, 2025.
What are lookaround assertions?
Lookaround assertions are special types of assertions in regular expressions that allow you to match a pattern based on what comes before (lookbehind) or after (lookahead) it, without including those parts in the match.
Positive lookahead: (?=«pattern») matches if pattern matches what comes next.
Example: sequences of lowercase letters that are followed by an X.
> 'abcX def'.match(/[a-z]+(?=X)/g)
[ 'abc' ]
Note that the X itself is not part of the matched substring.
Negative lookahead: (?!«pattern») matches if pattern does not match what comes next.
Example: sequences of lowercase letters that are not followed by an X.
> 'abcX def'.match(/[a-z]+(?!X)/g)
[ 'ab', 'def' ]
Positive lookbehind: (?<=«pattern») matches if pattern matches what came before.
Example: sequences of lowercase letters that are preceded by an X.
> 'Xabc def'.match(/(?<=X)[a-z]+/g)
[ 'abc' ]
Negative lookbehind: (?<!«pattern») matches if pattern does not match what came before.
Example: sequences of lowercase letters that are not preceded by an X.
> 'Xabc def'.match(/(?<!X)[a-z]+/g)
[ 'bc', 'def' ]
Example: replace “.js” with “.html”, but not in “Node.js”.
> 'Node.js: index.js and main.js'.replace(/(?<!Node)\.js/g, '.html')
'Node.js: index.html and main.html'
“Lookahead assertions” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp syntax disjunction (|)
Caveat: this operator has low precedence. Use groups if necessary:
- ^aa|zz$ matches all strings that start with 'aa' and/or end with 'zz'.
  - Note that | has a lower precedence than ^ and $.
- ^(aa|zz)$ matches the two strings 'aa' and 'zz'.
- ^a(a|z)z$ matches the two strings 'aaz' and 'azz'.
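The effect of the low precedence shows up as soon as the input has extra characters. A small sketch; `loose` and `strict` are illustrative names, and the non-capturing group `(?:...)` is used only to avoid creating a capture:

```javascript
const loose = /^aa|zz$/;       // (^aa) or (zz$)
const strict = /^(?:aa|zz)$/;  // whole input must be 'aa' or 'zz'

console.log(loose.test('aaX'));  // true  (starts with 'aa')
console.log(loose.test('Xzz'));  // true  (ends with 'zz')
console.log(strict.test('aaX')); // false
console.log(strict.test('zz'));  // true
```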
“Syntax: disjunction (|)” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /i (.ignoreCase) flag
The /i (.ignoreCase) flag switches on case-insensitive matching:
> /a/.test('A')
false
> /a/i.test('A')
true
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /g (.global) flag
The /g (.global) flag fundamentally changes how the following methods work:
- RegExp.prototype.test()
- RegExp.prototype.exec()
- String.prototype.match()
In a nutshell, without /g, the methods only consider the first match for a regular expression in an input string. With /g, they consider all matches.
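A minimal sketch of the difference, using a made-up input string:

```javascript
const input = 'a1 a2 a3';

// .match() without /g: a match object for the first match only
const first = input.match(/a(\d)/);
console.log(first[0], first[1], first.index); // a1 1 0

// .match() with /g: an array with all matched substrings (no captures)
console.log(input.match(/a(\d)/g)); // [ 'a1', 'a2', 'a3' ]

// .exec() with /g: each call continues at .lastIndex
const re = /a(\d)/g;
const digits = [];
let m;
while ((m = re.exec(input)) !== null) {
  digits.push(m[1]);
}
console.log(digits); // [ '1', '2', '3' ]
```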
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /d (.hasIndices) flag
Some RegExp-related methods return match objects that describe where the regular expression matched in an input string. If the /d (.hasIndices) flag is on, each match object includes match indices which tell us where each group capture starts and ends.
Match indices for numbered groups
This is how we access the captures of numbered groups:
const matchObj = /(a+)(b+)/d.exec('aaaabb');
assert.equal( matchObj[1], 'aaaa' );
assert.equal( matchObj[2], 'bb' );
Due to the regular expression flag /d, matchObj also has a property .indices that records for each numbered group where it was captured in the input string:
assert.deepEqual( matchObj.indices[1], [0, 4] );
assert.deepEqual( matchObj.indices[2], [4, 6] );
Match indices for named groups
The captures of named groups are accessed like this:
const matchObj = /(?<as>a+)(?<bs>b+)/d.exec('aaaabb');
assert.equal( matchObj.groups.as, 'aaaa');
assert.equal( matchObj.groups.bs, 'bb');
Their indices are stored in matchObj.indices.groups:
assert.deepEqual( matchObj.indices.groups.as, [0, 4]);
assert.deepEqual( matchObj.indices.groups.bs, [4, 6]);
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /m (.multiline) flag
If the /m (.multiline) flag is on, ^ matches the beginning of each line and $ matches the end of each line. If it is off, ^ matches the beginning of the whole input string and $ matches the end of the whole input string.
> 'a1\na2\na3'.match(/^a./gm)
[ 'a1', 'a2', 'a3' ]
> 'a1\na2\na3'.match(/^a./g)
[ 'a1' ]
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /s (.dotAll) flag
By default, the dot does not match line terminators. With the /s (.dotAll) flag, it does:
> /./.test('\n')
false
> /./s.test('\n')
true
Workaround: If /s isn’t supported, we can use [^] instead of a dot.
> /[^]/.test('\n')
true
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /y (.sticky) flag
The /y (.sticky) flag mainly makes sense in conjunction with /g. When both are switched on, any match must directly follow the previous one (that is, it must start at index .lastIndex of the regular expression object). Therefore, the first match must be at index 0.
> 'a1a2 a3'.match(/a./gy)
[ 'a1', 'a2' ]
> '_a1a2 a3'.match(/a./gy) // first match must be at index 0
null
> 'a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]
> '_a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]
The main use case for /y is tokenization (during parsing).
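To make the tokenization use case concrete, here is a minimal sketch of a tokenizer for simple arithmetic expressions. The function `tokenize()` and its token grammar are assumptions made for this example; they do not come from the cited source. It uses /y with .exec(), so each token must start exactly where the previous one ended:

```javascript
// Hypothetical tokenizer: integers, the operators + - * / and parentheses.
function tokenize(input) {
  // Optional whitespace, one token (captured), optional whitespace.
  const tokenRe = /\s*(\d+|[-+*\/()])\s*/y;
  const tokens = [];
  while (tokenRe.lastIndex < input.length) {
    const start = tokenRe.lastIndex; // a failed .exec() resets .lastIndex to 0
    const match = tokenRe.exec(input);
    if (match === null) {
      throw new SyntaxError(`Unexpected input at index ${start}`);
    }
    tokens.push(match[1]);
  }
  return tokens;
}

console.log(tokenize('12 + 3*(4)'));
// [ '12', '+', '3', '*', '(', '4', ')' ]
```

Without /y, .exec() would silently skip over characters that are not tokens; with /y, such input is detected and reported.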
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /u (.unicode) flag
The /u (.unicode) flag provides better support for Unicode code points.
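A short sketch of what "better support for code points" means in practice, using 🙂 (U+1F642), which consists of two UTF-16 code units:

```javascript
const smiley = String.fromCodePoint(0x1F642); // 🙂

// The dot matches code units without /u, code points with /u.
console.log(smiley.match(/./g).length);  // 2
console.log(smiley.match(/./gu).length); // 1

// Code point escapes are available.
console.log(/^\u{1F642}$/u.test(smiley)); // true

// Character classes and quantifiers work with whole code points.
console.log(/^[🙂]$/u.test(smiley));            // true
console.log(/^🙂{2}$/u.test(smiley + smiley));  // true
```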
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /v (.unicodeSets) flag
The /v (.unicodeSets) flag improves on flag /u and provides limited support for multi-code-point grapheme clusters. It also supports set operations in character classes.
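The set operations are subtraction (--) and intersection (&&) inside character classes. A hedged sketch: because older engines reject flag /v, the patterns are created via the constructor and guarded by a hypothetical feature check, `supportsFlagV()`:

```javascript
// Hypothetical feature check: does this engine accept flag /v?
function supportsFlagV() {
  try {
    new RegExp('', 'v');
    return true;
  } catch {
    return false;
  }
}

if (supportsFlagV()) {
  // Subtraction: lowercase ASCII letters except vowels
  const consonants = new RegExp('^[[a-z]--[aeiou]]+$', 'v');
  console.log(consonants.test('rhythm')); // true
  console.log(consonants.test('rhyme'));  // false

  // Intersection: characters that are both Greek and lowercase
  const greekLower = new RegExp('^[\\p{Script=Greek}&&\\p{Lowercase_Letter}]+$', 'v');
  console.log(greekLower.test('\u03B1\u03B2\u03B3')); // true  (αβγ)
  console.log(greekLower.test('\u0391\u03B2\u03B3')); // false (Αβγ, capital alpha)
} else {
  console.log('Flag /v is not supported by this engine');
}
```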
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
In which order should we list regular expression flags?
It doesn’t really matter, but if you need to choose an order, list them alphabetically. JavaScript also uses alphabetical order for the RegExp property .flags:
> /-/gymdivs.flags
'dgimsvy'
“How to order regular expression flags?” (exploringjs.com). Retrieved March 7, 2025.
Why do we need the /u and /v flags?
Without the flags /u and /v, most constructs work with single UTF-16 code units – which is a problem whenever there is a code point with two code units – such as 🙂:
> '🙂'.length
2
We can use code unit escapes – \u followed by four hexadecimal digits:
> /^\uD83D\uDE42$/.test('🙂')
true
The dot operator (.) matches code units:
> '🙂'.match(/./g)
[ '\uD83D', '\uDE42' ]
Quantifiers apply to code units:
> /^🙂{2}$/.test('\uD83D\uDE42\uDE42')
true
> /^\uD83D\uDE42{2}$/.test('\uD83D\uDE42\uDE42') // equivalent
true
Character class escapes define sets of code units:
> '🙂'.match(/\D/g)
[ '\uD83D', '\uDE42' ]
Character classes define sets of code units:
> /^[🙂]$/.test('🙂')
false
> /^[\uD83D\uDE42]$/.test('\uD83D\uDE42') // equivalent
false
> /^[🙂]$/.test('\uD83D')
true
“Without /u and /v: matching UTF-16 code units” (exploringjs.com). Retrieved March 7, 2025.
What is the problem with this RegExp: /-/uv?
Flag /v and flag /u are mutually exclusive – we can’t use them both at the same time:
assert.throws(
  () => eval('/-/uv'),
  SyntaxError
);
“Flag /v: limited support for multi-code-point grapheme clusters [ES2024]” (exploringjs.com). Retrieved March 11, 2025.
When should you use the /v flag (aka unicodeSets)?
Whenever you can!
This flag improves many aspects of JavaScript’s regular expressions and should be used by default. If you can’t use it yet because it’s still too new, you can use /u instead.
“Flag /v: limited support for multi-code-point grapheme clusters [ES2024]” (exploringjs.com). Retrieved March 11, 2025.
What are the limitations of the /u flag?
It doesn’t work with grapheme clusters that consist of more than one code point.
Some font glyphs are represented by grapheme clusters (code point sequences) with more than one code point – e.g. 😵‍💫:
> Array.from('😵‍💫').length // count code points
3
Flag /u does not help us with those kinds of grapheme clusters:
// Grapheme cluster is not matched by single dot
assert.equal( '😵‍💫'.match(/./gu).length, 3 );
// Quantifiers only repeat last code point of grapheme cluster
assert.equal( /^😵‍💫{2}$/u.test('😵‍💫'), false );
// Character class escapes only match single code points
assert.equal( /^\p{Emoji}$/u.test('😵‍💫'), false );
// Character classes only match single code points
assert.equal( /^[😵‍💫]$/u.test('😵‍💫'), false );
“Limitation of flag /u: handling grapheme clusters with more than one code point” (exploringjs.com). Retrieved March 11, 2025.
What are match objects?
Several regular expression-related methods return so-called match objects to provide detailed information for the locations where a regular expression matches an input string. These methods are:
* `RegExp.prototype.exec()` - returns `null` or a single match object.
* `String.prototype.match()` - returns `null` or a single match object (if flag `/g` is not set).
* `String.prototype.matchAll()` - returns an iterable of match objects (flag `/g` must be set; otherwise, an exception is thrown).
This is an example:
assert.deepEqual(
  /(a+)b/d.exec('ab aaab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aaab',
    groups: undefined,
    indices: {
      0: [0, 2],
      1: [0, 1],
      groups: undefined
    },
  }
);
The result of .exec() is a match object for the first match with the following properties:
- [0]: the complete substring matched by the regular expression
- [1]: capture of numbered group 1 (etc.)
- .index: where did the match occur?
- .input: the string that was matched against
- .groups: captures of named groups
- .indices: the index ranges of captured groups
  - This property is only created if flag /d is switched on.
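To see several match objects at once, .matchAll() can be used on the same input. A minimal sketch:

```javascript
// Each element is a match object with the properties described above.
const matches = Array.from('ab aaab'.matchAll(/(a+)b/g));
for (const match of matches) {
  console.log(match[0], match[1], match.index);
}
// ab a 0
// aaab aaa 3

// Without /g, .matchAll() throws a TypeError.
let error = null;
try {
  'ab'.matchAll(/a/);
} catch (e) {
  error = e;
}
console.log(error instanceof TypeError); // true
```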
“Match objects” (exploringjs.com). Retrieved April 1, 2025.
What are match indices?
Match indices are a feature of match objects: If we turn them on via the regular expression flag /d (property .hasIndices), they record the start and end indices of where groups were captured.
“Match indices in match objects [ES2022]” (exploringjs.com). Retrieved April 2, 2025.
How do you access the match indices of numbered capture groups?
This is how we access the captures of numbered groups:
const matchObj = /(a+)(b+)/d.exec('aaaabb');
assert.equal( matchObj[1], 'aaaa' );
assert.equal( matchObj[2], 'bb' );
Due to the regular expression flag /d, matchObj also has a property .indices that records for each numbered group where it was captured in the input string:
assert.deepEqual( matchObj.indices[1], [0, 4] );
assert.deepEqual( matchObj.indices[2], [4, 6] );
“Match indices for numbered groups” (exploringjs.com). Retrieved April 2, 2025.
How do you access the match indices of named capture groups?
The captures of named groups are accessed like this:
const matchObj = /(?<as>a+)(?<bs>b+)/d.exec('aaaabb');
assert.equal( matchObj.groups.as, 'aaaa');
assert.equal( matchObj.groups.bs, 'bb');
Their indices are stored in matchObj.indices.groups:
assert.deepEqual( matchObj.indices.groups.as, [0, 4]);
assert.deepEqual( matchObj.indices.groups.bs, [4, 6]);
“Match indices for named groups” (exploringjs.com). Retrieved April 2, 2025.
What does the regExp.test(str) method do?
The regular expression method .test() returns true if regExp matches str:
> /bc/.test('ABCD')
false
> /bc/i.test('ABCD')
true
> /\.mjs$/.test('main.mjs')
true
With .test() we should normally avoid the /g flag. If we use it, we generally don’t get the same result every time we call the method:
> const r = /a/g;
> r.test('aab')
true
> r.test('aab')
true
> r.test('aab')
false
The results are due to /a/ having two matches in the string. After all of those were found, .test() returns false.
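The changing results come from the property .lastIndex, which .test() updates when /g is set. A small sketch that records it after each call:

```javascript
const r = /a/g;
const log = [];
for (let i = 0; i < 3; i++) {
  const result = r.test('aab');
  log.push([result, r.lastIndex]);
}
console.log(log); // [ [ true, 1 ], [ true, 2 ], [ false, 0 ] ]

// Without /g, .lastIndex is ignored and the result is repeatable.
const plain = /a/;
console.log(plain.test('aab'), plain.test('aab'), plain.test('aab')); // true true true
```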
“regExp.test(str): is there a match? [ES3]” (exploringjs.com). Retrieved April 4, 2025.
What does the str.search(regExp) method do?
The string method .search() returns the first index of str at which there is a match for regExp:
> '_abc_'.search(/abc/)
1
> 'main.mjs'.search(/\.mjs$/)
4
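Two details that are easy to miss, shown as a short sketch: .search() returns -1 when there is no match, and it always searches from the start of the string, regardless of /g and .lastIndex:

```javascript
console.log('_abc_'.search(/abc/)); // 1
console.log('_abc_'.search(/xyz/)); // -1

// .search() ignores .lastIndex and leaves it unchanged.
const re = /b/g;
re.lastIndex = 3;
console.log('abcabc'.search(re)); // 1
console.log(re.lastIndex);        // 3
```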
“str.search(regExp): at what index is the match? [ES3]” (exploringjs.com). Retrieved April 4, 2025.