Regular Expressions Flashcards
What is a regular expression?
Regular expressions are patterns used to match character combinations in strings.
In JavaScript, regular expressions are also objects. These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String.
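As a quick illustration of the methods listed above, here is a minimal sketch. The pattern and the input strings are made up for this example.

```javascript
const re = /b+/;
const input = 'aabbbcc';

// RegExp methods
const found = re.test(input);    // true if the pattern matches anywhere
const matchObj = re.exec(input); // match object, or null if there is no match

// String methods
const firstMatch = input.match(re);      // same result as exec() when /g is off
const position = input.search(re);       // index of the first match, or -1
const replaced = input.replace(re, '-'); // replaces the first match only
const parts = input.split(re);           // splits the string around matches

// matchAll() and replaceAll() require /g when given a regular expression
const allNumbers = Array.from('a1b22'.matchAll(/\d+/g), (m) => m[0]);
const masked = 'a1b22'.replaceAll(/\d+/g, '#');

console.log(found, matchObj[0], matchObj.index, position, replaced, parts, allNumbers, masked);
```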
“Regular expressions - JavaScript | MDN” (MDN Web Docs). Retrieved February 5, 2025.
How can you create a regular expression?
The two main ways of creating regular expressions are:
- Literal: compiled statically (at load time).
/abc/iv
- Constructor: compiled dynamically (at runtime).
new RegExp('abc', 'iv')
Both regular expressions have the same two parts:
- The body abc – the actual regular expression.
- The flags i and v. Flags configure how the pattern is interpreted. For example, i enables case-insensitive matching and v enables Unicode sets mode.
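The following sketch compares the two ways of creating a regular expression. It uses only the flag i so that it also runs on engines without /v support. The variable names are made up for this example.

```javascript
// Literal and constructor produce equivalent objects
const literal = /abc/i;
const constructed = new RegExp('abc', 'i');
const sameBody = literal.source === constructed.source;
const sameFlags = literal.flags === constructed.flags;

// The constructor is useful when the body is only known at runtime
const word = 'abc';
const dynamic = new RegExp(`^${word}$`, 'i');
const dynamicMatches = dynamic.test('ABC');

// In a string argument, backslashes must be doubled
const digitsLiteral = /\d+/;
const digitsConstructed = new RegExp('\\d+');
const sameDigitsBody = digitsLiteral.source === digitsConstructed.source;

console.log(sameBody, sameFlags, dynamicMatches, sameDigitsBody);
```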
“Literal vs. constructor” (exploringjs.com). Retrieved February 18, 2025.
How can you clone a regular expression?
There are two variants of the constructor RegExp():
- new RegExp(pattern : string, flags = '') [ES3]
  A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
- new RegExp(regExp : RegExp, flags = regExp.flags) [ES6]
  regExp is cloned. If flags is provided, then it determines the flags of the clone.
The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them – for example:
import assert from 'node:assert/strict';

function copyAndAddFlags(regExp, flagsToAdd='') {
  // The constructor doesn’t allow duplicate flags;
  // make sure there aren’t any:
  const newFlags = Array.from(
    new Set(regExp.flags + flagsToAdd)
  ).join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');
“Cloning and non-destructively modifying regular expressions” (exploringjs.com). Retrieved February 18, 2025.
What are the syntax characters of a regular expression?
At the top level of a regular expression, the following characters are special and are known as syntax characters. They are escaped by prefixing a backslash (\).
\ ^ $ . * + ? ( ) [ ] { } |
In regular expression literals, we must escape slashes:
> /\//.test('/')
true
In the argument of new RegExp(), we don’t have to escape slashes:
> new RegExp('/').test('/')
true
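One practical use of the list of syntax characters is escaping arbitrary text before passing it to new RegExp(). The helper below is a hypothetical sketch (escapeForRegExp is not a built-in). It only covers the top level of a pattern, not the inside of character classes.

```javascript
// Hypothetical helper: prefixes every top-level syntax character with a backslash
function escapeForRegExp(text) {
  // '$&' in the replacement string stands for the matched character
  return text.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

const escaped = escapeForRegExp('1+1=2?');
const matchesLiterally = new RegExp(escaped).test('1+1=2?');

// Without escaping, the dot would match any character
const dotIsLiteral = new RegExp('^' + escapeForRegExp('a.b') + '$');
const rejectsOtherChar = !dotIsLiteral.test('axb');

console.log(escaped, matchesLiterally, rejectsOtherChar);
```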
“Syntax characters” (exploringjs.com). Retrieved February 18, 2025.
When is it illegal to escape a non-syntax character?
Without the flags /u and /v, an escaped non-syntax character at the top level matches itself:
> /^\a$/.test('a')
true
With flag /u or /v, escaping a non-syntax character at the top level is a syntax error:
import assert from 'node:assert/strict';

assert.throws(
  () => eval(String.raw`/\a/v`),
  {
    name: 'SyntaxError',
    message: 'Invalid regular expression: /\\a/v: Invalid escape',
  }
);
assert.throws(
  () => eval(String.raw`/\-/v`),
  {
    name: 'SyntaxError',
    message: 'Invalid regular expression: /\\-/v: Invalid escape',
  }
);
“Illegal top-level escaping” (exploringjs.com). Retrieved February 19, 2025.
What are character classes [···]?
A character class wraps class ranges in square brackets. The class ranges specify a set of characters:
- [«class ranges»] matches any character in the set.
- [^«class ranges»] matches any character not in the set.
Rules for class ranges:
- Non-syntax characters stand for themselves: [abc]
- Only the following four characters are special and must be escaped via backslashes:
  ^ \ - ]
  - ^ only has to be escaped if it comes first.
  - - need not be escaped if it comes first or last.
- Character escapes (\n, \u{1F44D}, etc.) have the usual meanings.
- Character class escapes (\d, \P{White_Space}, \p{RGI_Emoji}, etc.) have the usual meanings.
- Ranges of characters are specified via dashes: [a-z]
Watch out: inside a character class, \b stands for backspace. Elsewhere in a regular expression, it matches word boundaries.
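The rules for class ranges can be tried out with a few small patterns. This is a minimal sketch with made-up inputs.

```javascript
// Sets and negated sets
const vowels = 'regular'.match(/[aeiou]/g);
const nonVowels = 'regular'.match(/[^aeiou]/g);

// Ranges via dashes
const hexDigits = 'x1F-g9'.match(/[0-9a-fA-F]/g);

// A dash that comes first and a caret that does not come first need no escaping
const dashFirst = /^[-+]$/.test('-');
const caretNotFirst = /^[a^]$/.test('^');

// Inside a class, \b is backspace; outside, it is a word boundary
const backspaceInClass = /[\b]/.test('\b');
const boundaryOutside = /\bcat\b/.test('a cat!');

console.log(vowels, nonVowels, hexDigits, dashFirst, caretNotFirst, backspaceInClass, boundaryOutside);
```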
“Syntax: character classes” (exploringjs.com). Retrieved February 20, 2025.
What are the escaping rules inside character classes [···]?
Rules for escaping inside character classes without flag /v:
We always must escape: \ ]
Some characters only have to be escaped in some locations:
- - only has to be escaped if it doesn’t come first or last.
- ^ only has to be escaped if it comes first.
Rules with flag /v:
A single ^ only has to be escaped if it comes first.
Class set syntax characters have to be escaped:
( ) [ ] { } / - \ |
Class set reserved double punctuators have to be escaped:
&& !! ## $$ %% ** ++ ,, .. :: ;; << == >> ?? @@ ^^ `` ~~
“Escaping inside character classes ([···])” (exploringjs.com). Retrieved February 19, 2025.
What are syntax atoms of regular expressions?
Atoms are the basic building blocks of regular expressions.
- Pattern characters are all characters except syntax characters (^, $, etc.). Pattern characters match themselves. Examples: A b %
- . matches any character. We can use the flag /s (dotAll) to control if the dot matches line terminators or not.
- Character escapes (each escape matches a single fixed character). For example:
  - \f: form feed (FF)
  - \n: line feed (LF)
  - \r: carriage return (CR)
  - \t: character tabulation
  - \v: line tabulation
  - Arbitrary control characters: \cA (Ctrl-A), …, \cZ (Ctrl-Z)
  - Unicode code units: \u00E4
  - Unicode code points (require flag /u or /v): \u{1F44D}
- Character class escapes define sets of characters (or character sequences) that match:
  - Basic character class escapes define sets of characters: \d \D \s \S \w \W
  - Unicode character property escapes [ES2018] define sets of code points: \p{White_Space}, \P{White_Space}, etc.
    - Require flag /u or /v.
  - Unicode string property escapes [ES2024] define sets of code point sequences: \p{RGI_Emoji}, etc.
    - Require flag /v.
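A few of these atoms in action, as a minimal sketch. The inputs are made up, and non-ASCII characters are built via String.fromCharCode() and String.fromCodePoint() so that the example does not depend on how the source file is encoded.

```javascript
// Pattern characters match themselves
const patternChar = /%/.test('100%');

// The dot only matches line terminators if /s is on
const dotSkipsNewline = /^.$/.test('\n');
const dotAllMatchesNewline = /^.$/s.test('\n');

// Character escapes
const tab = /\t/.test('a\tb');
const ctrlA = /\cA/.test('\u0001');
const codeUnit = /\u00E4/.test(String.fromCharCode(0xE4));
const codePoint = /^\u{1F44D}$/u.test(String.fromCodePoint(0x1F44D));

// Character class escapes
const digit = /^\d$/.test('7');
const greekLetter = /^\p{Script=Greek}$/u.test('\u03B1');

console.log(patternChar, dotSkipsNewline, dotAllMatchesNewline, tab, ctrlA, codeUnit, codePoint, digit, greekLetter);
```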
“Syntax: atoms of regular expressions” (exploringjs.com). Retrieved February 20, 2025.
What do the following character class escapes (sets of code units) do: \d \D \s \S \w \W?
- \d → Matches any digit (equivalent to [0-9]).
- \D → Matches any non-digit (equivalent to [^0-9]).
- \s → Matches any whitespace character (spaces, tabs, line terminators, etc.).
- \S → Matches any non-whitespace character.
- \w → Matches any “word” character (equivalent to [a-zA-Z0-9_]).
- \W → Matches any non-word character (equivalent to [^a-zA-Z0-9_]).
Examples:
> 'a7x4'.match(/\d/g)
[ '7', '4' ]
> 'a7x4'.match(/\D/g)
[ 'a', 'x' ]
> 'high - low'.match(/\w+/g)
[ 'high', 'low' ]
> 'hello\t\n everyone'.replaceAll(/\s/g, '-')
'hello---everyone'
“Basic character class escapes (sets of code units): \d \D \s \S \w \W” (exploringjs.com). Retrieved February 24, 2025.
What are Unicode character properties?
In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character.
These are a few examples of properties:
- Name: a unique name, composed of uppercase letters, digits, hyphens, and spaces – for example:
  - A: Name = LATIN CAPITAL LETTER A
  - 🙂: Name = SLIGHTLY SMILING FACE
- General_Category: categorizes characters – for example:
  - x: General_Category = Lowercase_Letter
  - $: General_Category = Currency_Symbol
- White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines – for example:
  - \t: White_Space = True
  - π: White_Space = False
- Age: version of the Unicode Standard in which a character was introduced – for example: The Euro sign € was added in version 2.1 of the Unicode standard.
  - €: Age = 2.1
- Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
  - S: Block = Basic_Latin (range 0x0000..0x007F)
  - 🙂: Block = Emoticons (range 0x1F600..0x1F64F)
- Script: a collection of characters used by one or more writing systems.
  - Some scripts support several writing systems. For example, the Latin script supports the writing systems English, French, German, Latin, etc.
  - Some languages can be written in multiple alternate writing systems that are supported by multiple scripts. For example, Turkish used the Arabic script before it transitioned to the Latin script in the early 20th century.
  - Examples:
    - α: Script = Greek
    - Д: Script = Cyrillic
“Unicode character properties” (exploringjs.com). Retrieved February 25, 2025.
What are Unicode character property escapes?
With flag /u and flag /v, we can use \p{} and \P{} to specify sets of code points via Unicode character properties. That looks like this:
1. \p{prop=value}: matches all characters whose Unicode character property prop has the value value.
2. \P{prop=value}: matches all characters that do not have a Unicode character property prop whose value is value.
3. \p{bin_prop}: matches all characters whose binary Unicode character property bin_prop is True.
4. \P{bin_prop}: matches all characters whose binary Unicode character property bin_prop is False.
Comments:
Without the flags /u and /v, \p is the same as p.
Forms (3) and (4) can be used as abbreviations if the property is General_Category. For example, the following two escapes are equivalent:
\p{Uppercase_Letter}
\p{General_Category=Uppercase_Letter}
Examples:
Checking for whitespace:
> /^\p{White_Space}+$/u.test('\t \n\r')
true
Checking for Greek letters:
> /^\p{Script=Greek}+$/u.test('μετά')
true
Deleting any letters:
> '1π2ü3é4'.replace(/\p{Letter}/ug, '')
'1234'
Deleting lowercase letters:
> 'AbCdEf'.replace(/\p{Lowercase_Letter}/ug, '')
'ACE'
“Unicode character property escapes [ES2018]” (exploringjs.com). Retrieved February 25, 2025.
What are Unicode string property escapes?
With /u, we can use Unicode property escapes (\p{} and \P{}) to specify sets of code points via Unicode character properties.
With /v, we can additionally use \p{} to specify sets of code point sequences via Unicode string properties (negation via \P{} is not supported):
> /^\p{RGI_Emoji}$/v.test('⛔') // 1 code point (1 code unit)
true
> /^\p{RGI_Emoji}$/v.test('🙂') // 1 code point (2 code units)
true
> /^\p{RGI_Emoji}$/v.test('😵\u200D💫') // 3 code points (joined by U+200D)
true
Let’s see how the character property Emoji would do with these inputs:
> /^\p{Emoji}$/u.test('⛔') // 1 code point (1 code unit)
true
> /^\p{Emoji}$/u.test('🙂') // 1 code point (2 code units)
true
> /^\p{Emoji}$/u.test('😵\u200D💫') // 3 code points (joined by U+200D)
false
“Unicode string property escapes [ES2024]” (exploringjs.com). Retrieved February 26, 2025.
Regexp syntax quantifiers
By default, all of the following quantifiers are greedy (they match as many characters as possible):
- ?: match never or once
- *: match zero or more times
- +: match one or more times
- {n}: match n times
- {n,}: match n or more times
- {n,m}: match at least n times, at most m times.
To make them reluctant (so that they match as few characters as possible), put question marks (?) after them:
> /".*"/.exec('"abc"def"')[0] // greedy
'"abc"def"'
> /".*?"/.exec('"abc"def"')[0] // reluctant
'"abc"'
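The counted quantifiers behave the same way. The sketch below uses made-up patterns to show ?, {n}, {n,}, and {n,m}, plus the greedy and reluctant variants of {n,m}.

```javascript
const optional = /^colou?r$/;  // the "u" may appear never or once
const exactlyFour = /^\d{4}$/; // exactly four digits
const twoOrMore = /^a{2,}$/;   // at least two "a"s

const optionalResults = [optional.test('color'), optional.test('colour')];
const exactResults = [exactlyFour.test('2025'), exactlyFour.test('25')];
const atLeastResults = [twoOrMore.test('a'), twoOrMore.test('aaaaa')];

// {n,m} is greedy by default; a trailing ? makes it reluctant
const greedy = 'aaaa'.match(/a{2,3}/)[0];
const reluctant = 'aaaa'.match(/a{2,3}?/)[0];

console.log(optionalResults, exactResults, atLeastResults, greedy, reluctant);
```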
“Syntax: quantifiers” (exploringjs.com). Retrieved February 27, 2025.
Regexp syntax assertions
- ^ matches only at the beginning of the input
- $ matches only at the end of the input
- \b matches only at a word boundary
  - \B matches only when not at a word boundary
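A minimal sketch of these four assertions, using made-up inputs and no flags:

```javascript
// ^ and $ anchor a match to the beginning and the end of the input
const startsWithAbc = [/^abc/.test('abcdef'), /^abc/.test('xabc')];
const endsWithDef = [/def$/.test('abcdef'), /def$/.test('defx')];

const text = 'cat concat cats';

// \b: "cat" must be a whole word
const wholeWords = text.match(/\bcat\b/g);

// \B: "cat" must not start at a word boundary (only true inside "concat")
const insideWordIndex = /\Bcat/.exec(text).index;

console.log(startsWithAbc, endsWithDef, wholeWords, insideWordIndex);
```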
“Syntax: assertions” (exploringjs.com). Retrieved February 28, 2025.
What are lookaround assertions?
Lookaround assertions are special types of assertions in regular expressions that allow you to match a pattern based on what comes before (lookbehind) or after (lookahead) it, without including those parts in the match.
Positive lookahead: (?=«pattern») matches if pattern matches what comes next.
Example: sequences of lowercase letters that are followed by an X.
> 'abcX def'.match(/[a-z]+(?=X)/g)
[ 'abc' ]
Note that the X itself is not part of the matched substring.
Negative lookahead: (?!«pattern») matches if pattern does not match what comes next.
Example: sequences of lowercase letters that are not followed by an X.
> 'abcX def'.match(/[a-z]+(?!X)/g)
[ 'ab', 'def' ]
Positive lookbehind: (?<=«pattern») matches if pattern matches what came before.
Example: sequences of lowercase letters that are preceded by an X.
> 'Xabc def'.match(/(?<=X)[a-z]+/g)
[ 'abc' ]
Negative lookbehind: (?<!«pattern») matches if pattern does not match what came before.
Example: sequences of lowercase letters that are not preceded by an X.
> 'Xabc def'.match(/(?<!X)[a-z]+/g)
[ 'bc', 'def' ]
Example: replace “.js” with “.html”, but not in “Node.js”.
> 'Node.js: index.js and main.js'.replace(/(?<!Node)\.js/g, '.html')
'Node.js: index.html and main.html'
“Lookahead assertions” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp syntax disjunction (|)
Caveat: this operator has low precedence. Use groups if necessary:
- ^aa|zz$ matches all strings that start with 'aa' and/or end with 'zz'.
  - Note that | has a lower precedence than ^ and $.
- ^(aa|zz)$ matches the two strings 'aa' and 'zz'.
- ^a(a|z)z$ matches the two strings 'aaz' and 'azz'.
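The effect of the low precedence can be checked directly. This is a minimal sketch with made-up test strings.

```javascript
// Without a group, ^ only applies to "aa" and $ only applies to "zz"
const loose = /^aa|zz$/;
const looseResults = [loose.test('aaXYZ'), loose.test('XYZzz'), loose.test('XaaX')];

// With a group, both anchors apply to both alternatives
const strict = /^(aa|zz)$/;
const strictResults = [strict.test('aa'), strict.test('zz'), strict.test('aaXYZ')];

// A group can also limit the disjunction to a part of the pattern
const inner = /^a(a|z)z$/;
const innerResults = [inner.test('aaz'), inner.test('azz'), inner.test('az')];

console.log(looseResults, strictResults, innerResults);
```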
“Syntax: disjunction (|)” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /i (.ignoreCase) flag
The /i (.ignoreCase) flag switches on case-insensitive matching:
> /a/.test('A')
false
> /a/i.test('A')
true
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /g (.global) flag
The /g (.global) flag fundamentally changes how the following methods work:
- RegExp.prototype.test()
- RegExp.prototype.exec()
- String.prototype.match()
In a nutshell, without /g, the methods only consider the first match for a regular expression in an input string. With /g, they consider all matches.
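The sketch below illustrates the difference for the three methods. The input string and the variable names are made up for this example.

```javascript
const input = 'a1 b2 c3';

// match(): one match object without /g, an array of all matches with /g
const firstOnly = input.match(/\d/);
const allMatches = input.match(/\d/g);

// exec() with /g continues at .lastIndex, so a loop visits every match
const re = /\d/g;
const indices = [];
let matchObj;
while ((matchObj = re.exec(input)) !== null) {
  indices.push(matchObj.index);
}

// test() with /g also advances .lastIndex, which makes the object stateful
const stateful = /a/g;
const testResults = [stateful.test('a'), stateful.test('a')];

console.log(firstOnly[0], firstOnly.index, allMatches, indices, testResults);
```

The last line shows a common pitfall: calling test() twice on the same /g regular expression can give different results, because the second call starts searching at the end of the first match.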
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /d (.hasIndices) flag
Some RegExp-related methods return match objects that describe where the regular expression matched in an input string. If the /d (.hasIndices) flag is on, each match object includes match indices which tell us where each group capture starts and ends.
Match indices for numbered groups
This is how we access the captures of numbered groups:
import assert from 'node:assert/strict';

const matchObj = /(a+)(b+)/d.exec('aaaabb');
assert.equal(matchObj[1], 'aaaa');
assert.equal(matchObj[2], 'bb');
Due to the regular expression flag /d, matchObj also has a property .indices that records for each numbered group where it was captured in the input string:
assert.deepEqual(matchObj.indices[1], [0, 4]);
assert.deepEqual(matchObj.indices[2], [4, 6]);
Match indices for named groups
The captures of named groups are accessed like this:
const matchObj = /(?<as>a+)(?<bs>b+)/d.exec('aaaabb');
assert.equal(matchObj.groups.as, 'aaaa');
assert.equal(matchObj.groups.bs, 'bb');
Their indices are stored in matchObj.indices.groups:
assert.deepEqual(matchObj.indices.groups.as, [0, 4]);
assert.deepEqual(matchObj.indices.groups.bs, [4, 6]);
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /m (.multiline) flag
If the /m (.multiline) flag is on, ^ matches the beginning of each line and $ matches the end of each line. If it is off, ^ matches the beginning of the whole input string and $ matches the end of the whole input string.
> 'a1\na2\na3'.match(/^a./gm)
[ 'a1', 'a2', 'a3' ]
> 'a1\na2\na3'.match(/^a./g)
[ 'a1' ]
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /s (.dotAll) flag
By default, the dot does not match line terminators. With the /s (.dotAll) flag, it does:
> /./.test('\n')
false
> /./s.test('\n')
true
Workaround: If /s isn’t supported, we can use [^] instead of a dot.
> /[^]/.test('\n')
true
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /y (.sticky) flag
/y (.sticky): This flag mainly makes sense in conjunction with /g. When both are switched on, any match must directly follow the previous one (that is, it must start at index .lastIndex of the regular expression object). Therefore, the first match must be at index 0.
> 'a1a2 a3'.match(/a./gy)
[ 'a1', 'a2' ]
> '_a1a2 a3'.match(/a./gy) // first match must be at index 0
null
> 'a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]
> '_a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]
The main use case for /y is tokenization (during parsing).
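To make the tokenization use case concrete, here is a minimal sketch of a tokenizer for simple arithmetic input. The function tokenize() and its token grammar (integers and the operators + - * /) are assumptions made for this example, not something taken from the cited source.

```javascript
// Hypothetical tokenizer: every token must start exactly where the previous one ended
function tokenize(input) {
  // Optional whitespace, one token (captured in group 1), optional whitespace
  const tokenRe = /\s*(\d+|[-+*\/])\s*/y;
  const tokens = [];
  while (tokenRe.lastIndex < input.length) {
    const start = tokenRe.lastIndex;
    const matchObj = tokenRe.exec(input); // only matches at .lastIndex because of /y
    if (matchObj === null) {
      throw new SyntaxError(`Unexpected character at index ${start}`);
    }
    tokens.push(matchObj[1]);
  }
  return tokens;
}

console.log(tokenize('12 + 345*6'));
```

Without /y, exec() would silently skip over characters that are not part of any token. With /y, an unexpected character makes the match fail at that position, so the tokenizer can report an error.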
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /u (.unicode) flag
The /u (.unicode) flag provides better support for Unicode code points.
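One visible effect: with /u, the dot matches a whole code point instead of a single code unit. The following sketch uses the code point U+1F642 (🙂), written as an escape so that the example does not depend on file encoding.

```javascript
const smiley = '\u{1F642}'; // one code point, stored as two UTF-16 code units
const codeUnitCount = smiley.length;

// Without /u, the dot matches a single code unit, so the pattern fails
const withoutU = /^.$/.test(smiley);

// With /u, the dot matches a whole code point
const withU = /^.$/u.test(smiley);

// Code point escapes in patterns also require /u (or /v)
const viaEscape = /^\u{1F642}$/u.test(smiley);

console.log(codeUnitCount, withoutU, withU, viaEscape);
```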
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.
Explain regexp /v (.unicodeSets) flag
The /v (.unicodeSets) flag improves on flag /u and provides limited support for multi-code-point grapheme clusters. It also supports set operations in character classes.
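A minimal sketch of the set operations. Because /v is a recent flag, the example creates its patterns via the constructor and first checks whether the engine accepts the flag. The helper supportsFlagV() and the chosen sets are assumptions made for this example.

```javascript
// Hypothetical feature check: older engines throw a SyntaxError for the flag v
function supportsFlagV() {
  try {
    new RegExp('', 'v');
    return true;
  } catch {
    return false;
  }
}

const setResults = {};
if (supportsFlagV()) {
  // Difference (--): any letter except the lowercase ASCII letters a-z
  const letterButNotAtoZ = new RegExp('^[\\p{Letter}--[a-z]]$', 'v');
  setResults.difference = [
    letterButNotAtoZ.test('a'),
    letterButNotAtoZ.test('A'),
    letterButNotAtoZ.test('\u03C0'), // π
  ];

  // Intersection (&&): characters that are both letters and ASCII
  const asciiLetter = new RegExp('^[\\p{Letter}&&\\p{ASCII}]$', 'v');
  setResults.intersection = [
    asciiLetter.test('a'),
    asciiLetter.test('\u03C0'), // π
    asciiLetter.test('1'),
  ];
}

console.log(setResults);
```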
“Regular expression flags” (exploringjs.com). Retrieved March 3, 2025.