Python - Regular Expressions Flashcards
Which of the following best describes what “\w” matches in a Python regular expression?
— 1 —
It matches any one character that is a space, tab, return, or newline.
— 2 —
It matches any one character that is a letter, digit, or underscore.
— 3 —
It matches any one character that is not the newline character.
— 4 —
It matches any one character that is not a lowercase w.
\w
— 2 —
It matches any one character that is a letter, digit, or underscore.
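A quick check of this answer, as a minimal sketch (the sample string is made up for illustration):

```python
import re

# \w matches one "word" character: a letter, a digit, or an underscore.
sample = "a_1 -!"
word_chars = re.findall(r"\w", sample)
print(word_chars)  # ['a', '_', '1']
```

The space, the hyphen, and the exclamation mark are skipped because none of them is a word character.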
Which of the following regular expressions would not match the entire string “123”?
--- 1 ---
[123]
--- 2 ---
\d\d\d
--- 3 ---
[0-9][0-9][0-9]
--- 4 ---
123
string “123”
— 1 —
[123]
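The answer depends on matching the whole string: [123] is a set that matches exactly one character, so it can find a match inside “123” but cannot cover all three characters. A small sketch, assuming re.fullmatch is the intended notion of "match":

```python
import re

patterns = [r"[123]", r"\d\d\d", r"[0-9][0-9][0-9]", r"123"]

# True when the pattern covers the entire string "123".
results = {p: re.fullmatch(p, "123") is not None for p in patterns}
print(results)

# re.search still finds a one-character match for the set.
first_char = re.search(r"[123]", "123").group()
print(first_char)  # 1
```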
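In Python, how many backslashes do you need in an ordinary (non-raw) string literal to express the 2-backslash regex that matches one literal backslash?
You need 4. The first backslash escapes the second, and the third backslash escapes the fourth, so the string literal produces the two backslashes that the regex engine needs.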
\\\\
Regular expressions use the backslash character (‘\’) to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals.
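The double escaping can be seen by comparing an ordinary string literal with a raw one. This is a minimal sketch; the sample path is invented for illustration:

```python
import re

plain = "\\\\"  # four backslashes in the source, two characters in the string
raw = r"\\"     # two backslashes in the source, two characters in the string
same = plain == raw
print(len(plain), len(raw), same)  # 2 2 True

text = "C:\\temp"  # the actual string contains a single backslash
found = re.search(raw, text).group()
print(len(found))  # 1
```

Both spellings give the regex engine the same two-character pattern, which matches one literal backslash.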
What does the following regex mean?
. (Dot)
This matches any character EXCEPT a newline.
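A short sketch of the dot's default behaviour. The re.DOTALL flag is not covered on this card; it is shown only for contrast:

```python
import re

text = "a\nb"

# By default the dot skips the newline.
default_matches = re.findall(r".", text)
print(default_matches)  # ['a', 'b']

# With re.DOTALL the dot matches the newline too.
dotall_matches = re.findall(r".", text, re.DOTALL)
print(dotall_matches)  # ['a', '\n', 'b']
```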
What does the following regex mean?
^ (Caret)
Matches the start of the string
What does the following regex mean?
$
Matches the end of the string or just before the newline at the end of the string
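The two anchors can be tried together. A minimal sketch with a made-up string that ends in a newline:

```python
import re

text = "hello world\n"

# ^ only matches at the start of the string.
starts = re.search(r"^hello", text) is not None
not_start = re.search(r"^world", text) is None
print(starts, not_start)  # True True

# $ matches just before the trailing newline, so "world" still counts as the end.
end_span = re.search(r"world$", text).span()
print(end_span)  # (6, 11)
```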
What does the following regex mean?
*
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s
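The ab* example can be checked like this (a sketch; the candidate strings are chosen for illustration):

```python
import re

candidates = ["a", "ab", "abbb", "b"]

# ab* needs an 'a' followed by zero or more 'b's.
matched = [s for s in candidates if re.fullmatch(r"ab*", s)]
print(matched)  # ['a', 'ab', 'abbb']

# Greedy: it takes every available 'b'.
greedy = re.match(r"ab*", "abbbc").group()
print(greedy)  # abbb
```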
What does the following regex mean?
+
Causes the resulting RE to match 1 or more repetitions of the preceding RE, as many repetitions as are possible.
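A small sketch of +, using the pattern ab+ as an assumed example (it is not on the card itself):

```python
import re

# At least one 'b' is required after the 'a'.
one_or_more = re.fullmatch(r"ab+", "abbb") is not None
needs_one_b = re.fullmatch(r"ab+", "a") is None
print(one_or_more, needs_one_b)  # True True

# Greedy: the whole run of 'b's is consumed.
greedy_run = re.search(r"b+", "abbbc").group()
print(greedy_run)  # bbb
```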
What does the following regex mean?
?
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
ab? will match either ‘a’ or ‘ab’.
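The ab? example, plus one illustrative spelling pattern (colou?r is an assumption added here, not part of the card):

```python
import re

candidates = ["a", "ab", "abb"]

# ab? allows at most one 'b'.
optional_b = [s for s in candidates if re.fullmatch(r"ab?", s)]
print(optional_b)  # ['a', 'ab']

# The optional 'u' lets one pattern cover both spellings.
both = re.findall(r"colou?r", "color colour")
print(both)  # ['color', 'colour']
```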
What does the following regex mean?
*?, +?, ??
The ‘*’, ‘+’, and ‘?’ qualifiers are all greedy; they match as much text as possible.
Sometimes this behaviour isn’t desired; if the RE <.*> is matched against ‘<a> b <c>’, it will match the entire string, and not just ‘<a>’.
Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
Using the RE <.*?> will match only ‘<a>’.
?? is lazy while ? is greedy.
Given (pattern)??, it will first test for empty string, then if the rest of the pattern can’t match, it will test for pattern.
In contrast, (pattern)? will test for pattern first, then it will test for empty string on backtrack.
The difference is in the order of searching:
“toys?2” searches for toys2, then toy2
“toys??2” searches for toy2, then toys2
But for these 2 patterns, the result will be the same regardless of the input string, since the 2 that follows (after s? or s??) must be matched.
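Greedy and lazy matching side by side, as a sketch. It assumes the pattern <.*> and the string ‘<a> b <c>’ from the Python re documentation's version of this example:

```python
import re

text = "<a> b <c>"

greedy = re.match(r"<.*>", text).group()
lazy = re.match(r"<.*?>", text).group()
print(greedy)  # <a> b <c>
print(lazy)    # <a>

# ? tries the optional 's' first; ?? tries to skip it first.
greedy_opt = re.match(r"toys?", "toys2").group()
lazy_opt = re.match(r"toys??", "toys2").group()
print(greedy_opt, lazy_opt)  # toys toy

# Once the trailing 2 is required, both forms give the same result.
with_tail = (re.match(r"toys?2", "toys2").group(),
             re.match(r"toys??2", "toys2").group())
print(with_tail)  # ('toys2', 'toys2')
```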
What does the following regex mean?
{some_number}
{m}
Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match.
For example, a{6} will match exactly six ‘a’ characters, but not five.
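The a{6} example can be verified with a short sketch:

```python
import re

six = re.fullmatch(r"a{6}", "aaaaaa") is not None
five = re.fullmatch(r"a{6}", "aaaaa") is None
print(six, five)  # True True

# Inside a longer run, exactly six characters are consumed.
in_longer = re.search(r"a{6}", "aaaaaaaa").group()
print(len(in_longer))  # 6
```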
What does the following regex mean?
{m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.
For example, a{3,5} will match from 3 to 5 ‘a’ characters.
Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound.
As an example, a{4,}b will match ‘aaaab’ or a thousand ‘a’ characters followed by a ‘b’, but not ‘aaab’.
The comma may not be omitted or the modifier would be confused with the previously described form.
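A sketch of the bounded and open-ended forms. The a{,2} pattern is an extra illustration of the omitted lower bound:

```python
import re

# a{3,5} accepts runs of 3, 4, or 5 'a' characters.
lengths = [n for n in range(1, 8) if re.fullmatch(r"a{3,5}", "a" * n)]
print(lengths)  # [3, 4, 5]

# a{4,}b has no upper bound on the number of 'a's.
open_upper = re.fullmatch(r"a{4,}b", "a" * 1000 + "b") is not None
too_few = re.fullmatch(r"a{4,}b", "aaab") is None
print(open_upper, too_few)  # True True

# Omitting m gives a lower bound of zero, so the empty string matches.
open_lower = re.fullmatch(r"a{,2}", "") is not None
print(open_lower)  # True
```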
What does the following regex mean?
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as FEW repetitions as possible.
This is the non-greedy version of the previous qualifier.
For example, on the 6-character string ‘aaaaaa’, a{3,5} will match 5 ‘a’ characters, while a{3,5}? will only match 3 characters.
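The ‘aaaaaa’ example as runnable code, with one added case (a{3,5}?b, an assumption for illustration) that shows the lazy form still grows when the rest of the pattern requires it:

```python
import re

greedy = re.match(r"a{3,5}", "aaaaaa").group()
lazy = re.match(r"a{3,5}?", "aaaaaa").group()
print(len(greedy), len(lazy))  # 5 3

# The lazy quantifier expands to four 'a's so that the 'b' can match.
forced = re.match(r"a{3,5}?b", "aaaab").group()
print(forced)  # aaaab
```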
What does the following regex mean?
\
Either escapes special characters (permitting you to match characters like ‘*’, ‘?’, and so forth), or signals a special sequence; special sequences are discussed below.
If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.
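A sketch of both roles of the backslash, and of why raw strings help. The \b word-boundary sequence and the sample text are assumptions chosen for illustration:

```python
import re

# Escaping: \* matches a literal asterisk.
escaped_star = re.findall(r"\*", "2*3*4")
print(escaped_star)  # ['*', '*']

# In an ordinary string "\b" is one backspace character;
# in a raw string r"\b" is a backslash plus 'b', which the regex reads as a word boundary.
backspace_len = len("\b")
raw_len = len(r"\b")
print(backspace_len, raw_len)  # 1 2

sample = "cat concat cat."
with_raw = re.findall(r"\bcat\b", sample)
without_raw = re.findall("\bcat\b", sample)
print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```

The non-raw pattern looks for backspace characters around "cat", so it finds nothing.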
What does the following regex mean?
[ ]
Used to indicate a set of characters. In a set:
- Characters can be listed individually, e.g. [amk] will match ‘a’, ‘m’, or ‘k’.
- Ranges of characters can be indicated by giving two characters and separating them by a ‘-’, for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal ‘-’.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters ‘(’, ‘+’, ‘*’, or ‘)’.
- Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
- Characters that are not within a range can be matched by complementing the set. If the first character of the set is ‘^’, all the characters that are not in the set will be matched. For example, [^5] will match any character except ‘5’, and [^^] will match any character except ‘^’. ^ has no special meaning if it’s not the first character in the set.
- To match a literal ‘]’ inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a literal ‘]’, as well as ‘[’, braces, and parentheses.
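The set rules above can be sketched as follows. The patterns follow the Python re documentation's versions of these examples; the test strings are made up:

```python
import re

individual = re.findall(r"[amk]", "mask")
print(individual)  # ['m', 'a', 'k']

# Ranges: two-digit values from 00 to 59.
in_range = re.fullmatch(r"[0-5][0-9]", "59") is not None
out_of_range = re.fullmatch(r"[0-5][0-9]", "60") is None
print(in_range, out_of_range)  # True True

hex_digits = re.findall(r"[0-9A-Fa-f]", "0xBEEF")
print(hex_digits)  # ['0', 'B', 'E', 'E', 'F']

# An escaped hyphen is a literal hyphen, not a range.
hyphen = re.findall(r"[a\-z]", "a-z b")
print(hyphen)  # ['a', '-', 'z']

# Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")
print(specials)  # ['(', '+', ')', '*']

# A leading ^ complements the set.
negated = re.findall(r"[^5]", "1525")
print(negated)  # ['1', '2']

# Two ways to include a literal ']'.
escaped_bracket = re.findall(r"[()[\]{}]", "a]b")
leading_bracket = re.findall(r"[]()[{}]", "a]b")
print(escaped_bracket, leading_bracket)  # [']'] [']']
```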
What does the following regex mean?
|
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
An arbitrary number of REs can be separated by the ‘|’ in this way. This can be used inside groups (see below) as well.
As the target string is scanned, REs separated by ‘|’ are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the ‘|’ operator is never greedy.
To match a literal ‘|’, use \|, or enclose it inside a character class, as in [|].
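A sketch of alternation, its left-to-right ordering, and the two ways to match a literal ‘|’. All patterns and strings here are illustrative assumptions:

```python
import re

either = re.findall(r"cat|dog", "cat dog bird")
print(either)  # ['cat', 'dog']

# Branches are tried left to right, so the first one that matches wins,
# even if a later branch would match more text.
first_wins = re.match(r"a|ab", "ab").group()
reordered = re.match(r"ab|a", "ab").group()
print(first_wins, reordered)  # a ab

# Alternation inside a group.
grouped = re.fullmatch(r"gr(a|e)y", "grey") is not None
print(grouped)  # True

# A literal pipe: escaped, or inside a character class.
literal_pipe = re.findall(r"\|", "a|b|c")
in_class = re.findall(r"[|]", "a|b|c")
print(literal_pipe, in_class)  # ['|', '|'] ['|', '|']
```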