Lecture 8 - RegEx Flashcards
What is RegEx
- Regular expressions (regex) define patterns using a set of special characters.
- They provide a concise way to ensure input data follows a specific format, eliminating the need for complex conditional logic.
RegEx Syntax
- Literals: Characters that you wish to match in the target.
- Metacharacters: Special symbols that act as commands to the regex parser, including `( ) ^ $ | * ? { } + .`
Anchors
- `^`: Asserts the position at the start of the string.
  - Example: `^abc` matches “abc” only if it is at the beginning of the string.
- `$`: Asserts the position at the end of the string.
  - Example: `abc$` matches “abc” only if it is at the end of the string.
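A minimal sketch of the two anchors using Python's `re` module (the choice of Python and the variable names are assumptions for illustration, not part of the lecture):

```python
import re

# re.search scans the whole string, so only the anchors restrict where a match may sit.
starts_with_abc = re.compile(r"^abc")
ends_with_abc = re.compile(r"abc$")

print(bool(starts_with_abc.search("abcdef")))  # True
print(bool(starts_with_abc.search("xyzabc")))  # False
print(bool(ends_with_abc.search("xyzabc")))    # True
print(bool(ends_with_abc.search("abcdef")))    # False
```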
Common Patterns
- `^qwerty$`: Ensures that the entire string (not just a substring) matches “qwerty”.
  - Example: Matches “qwerty”, but not “123qwerty” or “qwerty123”.
- `\t`: Matches a tab character.
  - Example: Matches the tab between “hello” and “world” in “hello\tworld”.
- `\n`: Matches a new-line character.
  - Example: Matches the new-line between “hello” and “world” in “hello\nworld”.
- `.`: Matches any single character except, by default in most engines, the new-line character `\n`.
  - Example: Matches “a”, “1”, or “@”.
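The patterns above can be tried out as follows. This is a sketch in Python's `re` module; the sample strings and variable names are made up for illustration.

```python
import re

# ^qwerty$ only succeeds when the pattern covers the whole string.
whole_string = re.compile(r"^qwerty$")
print(bool(whole_string.search("qwerty")))     # True
print(bool(whole_string.search("123qwerty")))  # False
print(bool(whole_string.search("qwerty123")))  # False

# \t and \n match the control characters themselves (index 5 in both samples).
tab_position = re.search(r"\t", "hello\tworld").start()
newline_position = re.search(r"\n", "hello\nworld").start()
print(tab_position, newline_position)  # 5 5

# By default "." matches every character except the new-line.
dot_matches = re.findall(r".", "a1@\n")
print(dot_matches)  # ['a', '1', '@']
```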
Character Classes
- `[qwerty]`: Matches any single character from the set within the brackets.
  - Example: Matches “q”, “w”, “e”, “r”, “t”, or “y” in the string “qwerty”.
- `[^qwerty]`: Matches any single character not contained within the brackets.
  - Example: Matches “a”, “b”, “c” in the string “abc” but not “q”, “w”, “e” in “qwerty”.
- `[a-z]`: Matches any single character within the specified range.
  - Example: Matches “a”, “b”, “c”, …, “z”.
- `\w`: Matches any word character (equivalent to `[a-zA-Z0-9_]`).
  - Example: Matches “a”, “Z”, “5”, or “_” in strings like “apple”, “Zebra”, “123”, and “_underscore”.
- `\W`: Matches any non-word character.
  - Example: Matches “!”, “@”, “ ” in strings like “hello!”, “good@morning”, “hello world”.
- `\s`: Matches any white-space character.
  - Example: Matches “ ” (space), “\t” (tab), “\n” (new-line) in strings like “hello world”, “hello\tworld”, “hello\nworld”.
- `\S`: Matches any non-white-space character.
  - Example: Matches “h”, “e”, “l”, “o” in the string “hello”.
- `\d`: Matches any digit.
  - Example: Matches “0”, “1”, …, “9” in strings like “12345”.
- `\D`: Matches any non-digit.
  - Example: Matches “a”, “!”, “ ” in strings like “abc”, “hello!”, “hello world”.
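The character classes above can be compared side by side with a small helper. This is a sketch in Python; `chars` is a hypothetical helper name, and the sample strings are invented. One caveat: for text patterns, Python's `\w`, `\d`, and `\s` are Unicode-aware by default, so the `[a-zA-Z0-9_]` equivalence holds for ASCII input or when the `re.ASCII` flag is used.

```python
import re

def chars(pattern: str, text: str) -> list[str]:
    """Return every single-character match of pattern in text."""
    return re.findall(pattern, text)

print(chars(r"[qwerty]", "query"))   # ['q', 'e', 'r', 'y']
print(chars(r"[^qwerty]", "query"))  # ['u']
print(chars(r"[a-z]", "aB3c"))       # ['a', 'c']
print(chars(r"\w", "a_1!"))          # ['a', '_', '1']
print(chars(r"\W", "hi!"))           # ['!']
print(chars(r"\s", "a b\tc"))        # [' ', '\t']
print(chars(r"\S", "a b"))           # ['a', 'b']
print(chars(r"\d", "room 42"))       # ['4', '2']
print(chars(r"\D", "a1!"))           # ['a', '!']
```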
Quantifiers
- `*`: Indicates zero or more matches of the preceding element.
  - Example: `a*` matches “”, “a”, “aa”, “aaa”.
- `+`: Indicates one or more matches of the preceding element.
  - Example: `a+` matches “a”, “aa”, “aaa” but not “”.
- `?`: Indicates zero or one match of the preceding element.
  - Example: `a?` matches “” or “a”.
- `{n}`: Matches exactly `n` occurrences of the preceding element.
  - Example: `a{3}` matches exactly “aaa”.
- `{n,}`: Matches `n` or more occurrences of the preceding element.
  - Example: `a{3,}` matches “aaa”, “aaaa”, “aaaaa”, etc.
- `{n,m}`: Matches between `n` and `m` occurrences of the preceding element.
  - Example: `a{2,4}` matches “aa”, “aaa”, or “aaaa”.
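A quick way to check the quantifier examples is to test whole strings with `re.fullmatch`. This is a sketch in Python; `accepts` is a hypothetical helper introduced only for this illustration.

```python
import re

def accepts(pattern: str, text: str) -> bool:
    """True when the pattern matches the entire text, not just part of it."""
    return re.fullmatch(pattern, text) is not None

print(accepts(r"a*", ""))           # True: zero occurrences are allowed
print(accepts(r"a+", ""))           # False: at least one "a" is required
print(accepts(r"a?", "aa"))         # False: at most one "a" is allowed
print(accepts(r"a{3}", "aaa"))      # True
print(accepts(r"a{3}", "aaaa"))     # False
print(accepts(r"a{3,}", "aaaaa"))   # True
print(accepts(r"a{2,4}", "a"))      # False
print(accepts(r"a{2,4}", "aaaa"))   # True
print(accepts(r"a{2,4}", "aaaaa"))  # False
```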
Alternation
- `|`: Acts as a logical OR, matching either the pattern before or the pattern after it.
  - Example: `cat|dog` matches “cat” or “dog”.
Grouping
- `()`: Groups multiple elements together as a single unit and captures the matched subexpression.
  - Example: `(abc)+` matches “abc”, “abcabc”, “abcabcabc”, etc. It also captures the matched group “abc”.
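Alternation and grouping can be illustrated together. This is a sketch in Python with invented sample strings. Note that when a capturing group is repeated, Python keeps only the text from its last repetition.

```python
import re

# Alternation: either branch may match at each position.
animals = re.findall(r"cat|dog", "cat, dog, bird")
print(animals)  # ['cat', 'dog']

# Grouping: (abc)+ repeats the whole unit "abc", and the group captures it.
repeated = re.fullmatch(r"(abc)+", "abcabc")
print(repeated.group(0))  # abcabc  (the entire match)
print(repeated.group(1))  # abc     (the captured group)
```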
Lookahead component
A lookahead is a zero-width assertion: it states a condition that must hold immediately to the right of the current position for a match to be valid, but the condition is not included in the match and no characters are consumed.
- Positive lookahead syntax: `(?=pattern)`
  - Example: `\d(?=abc)` matches a digit only if it is followed by the string “abc”.
    - In “1abc”, “1” is matched.
    - In “2xyz”, nothing is matched.
- Negative lookahead syntax: `(?!pattern)`
  - Example: `\d(?!abc)` matches a digit only if it is not followed by the string “abc”.
    - In “1abc”, nothing is matched.
    - In “2xyz”, “2” is matched.
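The four cases above can be reproduced as follows (a sketch in Python; the variable names are illustrative). The last line shows the zero-width property: the match ends right after the digit, so “abc” is checked but not consumed.

```python
import re

positive = re.compile(r"\d(?=abc)")
negative = re.compile(r"\d(?!abc)")

print(positive.search("1abc").group())  # 1
print(positive.search("2xyz"))          # None
print(negative.search("1abc"))          # None
print(negative.search("2xyz").group())  # 2

# The lookahead itself consumes nothing: the match covers only the digit.
print(positive.search("1abc").span())   # (0, 1)
```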
Catastrophic Backtracking
- Occurs when a regular expression engine tries to explore an exponential number of possible matches due to the complexity of the regex and the nature of the input.
- This can lead to severe performance issues and excessive CPU usage.
How Catastrophic Backtracking Occurs
- The regex engine attempts to match the entire string against the pattern.
- For the input `aaaaaaaaaaaaaaaaaab`, the engine starts by matching the longest possible run of “a”s with the `(a|a)*` pattern.
- Because both alternatives in `(a|a)` are identical, the pattern accepts the same strings as `a*`. A backtracking engine does not know this, however, and can match each “a” through either alternative.
- The engine first consumes as many “a”s as possible before the “b”. If the overall match then fails at the “b”, it backtracks and retries the other ways of matching those same “a”s.
- Each “a” doubles the number of ways to consume the run, so the number of attempts grows exponentially with the input length.
- For the given input, this results in catastrophic backtracking: the engine works through a huge number of equivalent attempts, leading to excessive computation time and potential performance problems.
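The growth described above can be made concrete with a toy model. The sketch below is not how any real regex engine is implemented; it is a hand-written backtracker for the anchored pattern `^(a|a)*$` that counts how many alternatives it tries. The anchoring is an assumption, since the notes do not give the full pattern, and `count_attempts` is a hypothetical name.

```python
def count_attempts(text: str) -> int:
    """Count the alternatives a naive backtracker tries for ^(a|a)*$ on text."""
    attempts = 0

    def match_from(pos: int) -> bool:
        nonlocal attempts
        # The pattern can only finish successfully at the end of the input.
        if pos == len(text):
            return True
        # Try each of the two identical alternatives in (a|a).
        for _alternative in ("a", "a"):
            attempts += 1
            if text[pos] == "a" and match_from(pos + 1):
                return True
        # Both alternatives failed here, so the caller must backtrack.
        return False

    match_from(0)
    return attempts

print(count_attempts("aaa"))             # 3: a matching input succeeds immediately
print(count_attempts("a" * 3 + "b"))     # 30
print(count_attempts("a" * 10 + "b"))    # 4094
```

In this model, a failing input with n leading “a”s costs 2^(n+2) - 2 attempts, so every extra “a” roughly doubles the work.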
How to avoid Catastrophic Backtracking
- Simplify Regular Expressions:
  - Avoid redundant or unnecessary alternatives in patterns. In this case, the pattern `a*` would suffice.
- Use Possessive Quantifiers or Atomic Groups:
  - Some regex engines support possessive quantifiers (e.g., `a++`), which prevent backtracking.
  - Atomic groups can also be used to prevent backtracking.
- Optimize Regex Patterns:
  - Ensure patterns are designed to avoid ambiguity and excessive backtracking.
- Limit Input Length:
  - While not always possible, restricting input length can help reduce the potential for catastrophic backtracking.
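Two of these measures, simplifying the pattern and limiting input length, can be combined in a small validator. This is a sketch in Python; `MAX_INPUT_LENGTH`, its value, and `is_all_a` are hypothetical and chosen only for illustration. Possessive quantifiers and atomic groups are left out because support for them varies between engines and versions.

```python
import re

MAX_INPUT_LENGTH = 1_000            # hypothetical cap; choose one that fits the use case
SIMPLE_PATTERN = re.compile(r"a*")  # simplified form of (a|a)*, with no ambiguity

def is_all_a(text: str) -> bool:
    """Validate text against a*, rejecting over-long input before matching."""
    if len(text) > MAX_INPUT_LENGTH:
        return False
    return SIMPLE_PATTERN.fullmatch(text) is not None

print(is_all_a("aaa"))       # True
print(is_all_a("aaab"))      # False
print(is_all_a("a" * 1001))  # False: rejected by the length limit
```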
Reluctant (Lazy) Operators
- `^.*?`: Matches any character (`.`), any number of times (`*`), but as few times as possible (lazy `?`), from the start of the string (`^`).
- `<(.+?)>`: Matches the `<` character, then captures one or more characters (`.+`), but as few as possible (lazy `+?`), followed by the `>` character. This captures the name of the opening HTML tag.
- `(.*?)`: Captures any character (`.`), any number of times (`*`), but as few times as possible (lazy `?`), inside parentheses. This captures the content between the tags.
- `</\1>`: Matches the closing HTML tag, where `\1` refers to the first captured group (the tag name).
- `(.*)`: Captures any character (`.`), any number of times (`*`), as many as possible (greedy), from the end of the closing tag until the end of the string.
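To see the pieces working together, the sketch below concatenates them in the order listed. The assembled pattern is an assumption, since the notes only show the individual components, and the sample text and the name `TAG_PATTERN` are invented.

```python
import re

# Assumed assembly of the components above, in the order they are listed.
TAG_PATTERN = re.compile(r"^.*?<(.+?)>(.*?)</\1>(.*)")

match = TAG_PATTERN.match("intro <b>bold</b> rest")
tag_name, content, remainder = match.groups()

print(tag_name)         # b
print(content)          # bold
print(repr(remainder))  # ' rest'
```

The backreference `\1` is what ties the closing tag to the opening one: with `<b>` captured, only `</b>` can complete the match.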
Greedy Operator
- Greedy Operator: Tries to match as much as possible.
  - Examples: `*`, `+`, `{n,}`
- Lazy (Reluctant) Operator: Tries to match as little as possible.
  - Examples: `*?`, `+?`, `{n,}?`
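The difference shows up when the same input is matched with both variants. This is a sketch in Python, and the sample strings are made up.

```python
import re

text = "<a><b>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group()
# Lazy: .+? stops at the first ">" that lets the match succeed.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>

# The same contrast with a counted quantifier.
print(re.search(r"a{2,}", "aaaa").group())   # aaaa
print(re.search(r"a{2,}?", "aaaa").group())  # aa
```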
Use in the Regex:
- `.*?` and `.+?` are lazy operators, meaning they will match as little as possible.
- `.*` is a greedy operator, meaning it will match as much as possible.
- Taken together, the regex matches HTML-like tags and their content. The lazy operators capture the smallest possible matching content, while the greedy operator at the end extends to the end of the string.
Word boundaries
- `\b`: This is a word boundary anchor. It matches the position where a word starts or ends. A word boundary occurs wherever there is a transition between a word character (`\w`) and a non-word character (like spaces, punctuation, or the start/end of the string).
- `\w+`: This matches one or more word characters. In regex, `\w` matches any word character, which includes letters (both uppercase and lowercase), digits, and underscores (`[A-Za-z0-9_]`).
- `\b`: Another word boundary anchor, marking the end of the word.
How It Works:
- The regex `\b\w+\b` will match any sequence of word characters that are surrounded by word boundaries. This means it will match full words, but not partial words or characters that are part of larger words.
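A short sketch in Python of `\b\w+\b` in action (the sample sentence is invented). The apostrophe is a non-word character, so “It's” is split into two words.

```python
import re

words = re.findall(r"\b\w+\b", "Hello, world! It's 2024.")
print(words)  # ['Hello', 'world', 'It', 's', '2024']

# Word boundaries also keep a pattern from matching inside a larger word.
print(bool(re.search(r"\bcat\b", "a cat sat")))    # True
print(bool(re.search(r"\bcat\b", "concatenate")))  # False
```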