Chapter 11: Regex Flashcards
Code that works when the input data is in a particular format but breaks as soon as the input deviates from that format. In other words, easily broken.
brittle code
regex + and * characters expand outward to match the LARGEST possible string
greedy matching
A command available in most Unix systems that searches through text files looking for lines that match regular expressions.
grep
Global Regular Expression Print (from the ed editor command g/re/p)
A language for expressing more complex search strings. It may contain special characters that indicate, for example, that a search matches only at the beginning or end of a line, along with many other similar capabilities.
regular expression
(regex)
A special character that matches any character. In regular expressions it’s the period.
wild card
regular expression module
re
import re
regex function that searches text for the first match of a regular expression; returns a match object, or None if nothing matches
re.search(regex, search string)
regex that matches beginning of line
'^'
re.search('^From:', line)
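A minimal sketch of the '^' anchor with re.search(). The sample lines are made up for illustration; the point is that only a line that starts with "From:" matches.

```python
import re

# Hypothetical sample lines, invented for this example.
lines = [
    "From: alice@example.com",
    "Subject: From: is not at the start here",
    "X-Note: nothing to see",
]

# re.search returns a match object on success and None otherwise,
# so its result can be used directly as a condition.
from_lines = [line for line in lines if re.search(r"^From:", line)]
print(from_lines)
```

The second line contains "From:" but not at the beginning, so the anchor rejects it.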
regex that matches any character (a wildcard)
. (period/full stop)
re.search('F..m', line) matches From, Flam, F#om, etc.
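A small sketch of the period as a wildcard. The candidate strings are made up; each period must consume exactly one character, so "Fm" fails while "Farm!" succeeds because re.search() looks for the pattern anywhere in the string.

```python
import re

# Hypothetical candidate strings.
candidates = ["From", "Flam", "F#om", "Fm", "Farm!"]

# 'F..m' means: F, then any two characters, then m.
wildcard_hits = [word for word in candidates if re.search(r"F..m", word)]
print(wildcard_hits)
```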
regex that applies to the immediately preceding character(s) and indicates to match zero or more times.
*
regex that applies to the immediately preceding character(s) and indicates to match one or more times.
+
regex method that returns a list of substring(s) that matches a regular expression
re.findall(regex, search string)
Returns a list you can loop over: ['substring1', 'substring2']
regex that matches non-whitespace character
\S
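A short sketch combining re.findall() with \S, assuming a made-up sentence that contains two email-like strings. The pattern here is only a rough illustration, not a real email validator.

```python
import re

# Hypothetical input text.
text = "Write to alice@example.com or bob@example.org for details"

# \S+ is one or more non-whitespace characters, so \S+@\S+ grabs
# a run of non-blank characters on each side of an @ sign.
addresses = re.findall(r"\S+@\S+", text)

# findall returns a plain list of strings, so it works in a for loop.
for address in addresses:
    print(address)
```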
regex format to accept specific characters
'[]'
Set notation
re.findall('[a-zA-Z0-9]', line)
regex format to match an actual period
[.] or \.
When added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().
()
re.findall('substring(part I want)', string)
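A minimal sketch of how parentheses change what findall() returns. The sample line is invented; with a group, only the text inside the parentheses comes back, even though the whole pattern still has to match.

```python
import re

# Hypothetical mail-log style line.
line = "From alice@example.com Sat Jan  5 09:14:16 2008"

# Without parentheses, findall returns the whole matched string.
whole = re.findall(r"^From \S+@\S+", line)

# With parentheses, findall returns only the captured part.
captured = re.findall(r"^From (\S+@\S+)", line)

print(whole)
print(captured)
```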
technique to match a regex special character as a literal character
backslash
\$ matches a literal '$'
regex that anchors to end of line
$
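A brief sketch contrasting the escaped dollar sign with the end-of-line anchor, using a made-up string.

```python
import re

text = "The total is $10"  # hypothetical sample string

# \$ matches a literal dollar sign; [0-9]+ matches the digits after it.
prices = re.findall(r"\$[0-9]+", text)
print(prices)

# An unescaped $ is an anchor: the pattern must end where the line ends.
ends_with_10 = re.search(r"10$", text) is not None
ends_with_total = re.search(r"total$", text) is not None
print(ends_with_10, ends_with_total)
```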
regex that matches a whitespace character
\s
regex that applies to the immediately preceding character(s) and indicates to match zero or more times in “non-greedy mode”.
*?
regex that applies to the immediately preceding character(s) and indicates to match one or more times in “non-greedy mode”.
+?
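A minimal sketch of greedy versus non-greedy matching on a made-up snippet of markup. The only difference between the two patterns is the trailing question mark.

```python
import re

# Hypothetical input string.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ expands as far as it can, so the match runs from the
# first '<' to the last '>'.
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the first '>' that lets the pattern succeed.
non_greedy = re.findall(r"<.+?>", text)

print(greedy)
print(non_greedy)
```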
regex that applies to the immediately preceding character(s) and indicates to match zero or one time.
?
regex that applies to the immediately preceding regular expression and indicates to match zero or one time in “non-greedy mode”.
??
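A small sketch of the optional quantifier and its non-greedy form. The words are made up; note that ?? prefers to match zero times when the rest of the pattern allows it.

```python
import re

# u? makes the letter u optional (zero or one time).
spellings = re.findall(r"colou?r", "color colour colouur")
print(spellings)

# Greedy ? takes the optional b; non-greedy ?? leaves it out.
greedy_optional = re.search(r"ab?", "ab").group()
lazy_optional = re.search(r"ab??", "ab").group()
print(greedy_optional, lazy_optional)
```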
regex that matches a single character as long as that character is in the specified set. In this example, it would match "a", "e", "i", "o", "u", or "-" but no other characters.
[aeiou-] or [-aeiou]
You can specify ranges of characters using the minus sign. This example is a single character that must be a lowercase letter or a digit.
[a-z0-9]
When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.
[^A-Za-z]
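A short sketch of the three set forms above (a plain set, ranges, and a negated set), run against made-up strings.

```python
import re

# Plain set: vowels or a hyphen, one character at a time.
vowels_or_dash = re.findall(r"[aeiou-]", "up-to")

# Ranges: runs of lowercase letters or digits (uppercase is excluded).
lower_or_digit = re.findall(r"[a-z0-9]+", "Room 42b, Floor 3")

# Leading caret: anything that is NOT a letter.
non_letters = re.findall(r"[^A-Za-z]", "a1 b2!")

print(vowels_or_dash)
print(lower_or_digit)
print(non_letters)
```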
regex that asserts where a word begins or ends. This means that r'\bat\b' matches 'at', 'at.', '(at)', and 'as at ay' but not 'attempt' or 'atlas'.
\b
regex that asserts where a word does NOT begin or end. This means that r'at\B' matches 'athens', 'atom', 'attorney', but not 'at', 'at.', or 'at!'.
\B
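A minimal sketch of \b and \B, assuming a made-up sentence and word list. Raw strings (the r prefix) matter here, because in a normal Python string '\b' means a backspace character.

```python
import re

# Hypothetical sentence with 'at' both alone and inside longer words.
sentence = "at the atlas, look at (at) attempt"

# \bat\b only matches 'at' when it stands alone as a word.
standalone = re.findall(r"\bat\b", sentence)

# at\B only matches when 'at' is followed by another word character.
words = ["athens", "atom", "at", "at."]
inside_word = [w for w in words if re.search(r"at\B", w)]

print(standalone)
print(inside_word)
```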
regex that matches any decimal digit;
equivalent to the set [0-9].
\d
regex that matches any non-digit character;
equivalent to the set [^0-9].
\D
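A quick sketch of \d and \D on made-up strings: one pulls the numbers out of a sentence, the other finds the separators in a phone-like string.

```python
import re

# \d+ matches runs of digits.
numbers = re.findall(r"\d+", "Order 66 shipped in 2 boxes")

# \D matches every character that is not a digit.
phone = "(555) 010-9999"  # hypothetical value
separators = re.findall(r"\D", phone)

# Keeping only the \d matches strips the punctuation and spaces.
digits_only = "".join(re.findall(r"\d", phone))

print(numbers)
print(separators)
print(digits_only)
```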
In Unix/Linux, command-line program similar to the search() function
Global Regular Expression Print
(grep)
$ grep '^From:' mbox.short.txt
Unix/Linux regex for a non-blank character (any single character other than a space)
[^ ]
regex + and * characters expand outward to match the SMALLEST possible string
non-greedy matching
Specifies that exactly m copies of the previous regular expression should be matched; fewer matches cause the entire regular expression not to match.
{m}
Causes the resulting regular expression to greedily match from m to n repetitions of the preceding regular expression.
{m,n}
Causes the resulting regular expression to non-greedily match from m to n repetitions of the preceding regular expression
{m,n}?
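A short sketch of the three brace quantifiers, using made-up strings. The \b anchors in the first pattern keep {4} from matching part of a longer number.

```python
import re

# {4}: exactly four digits, bounded on both sides by word boundaries.
years = re.findall(r"\b\d{4}\b", "In 2008 we saw 123 and 12345 and 1999")

# {2,4}: greedy, takes as many repetitions as allowed (up to 4).
greedy_run = re.search(r"a{2,4}", "aaaaaa").group()

# {2,4}?: non-greedy, takes as few repetitions as allowed (down to 2).
lazy_run = re.search(r"a{2,4}?", "aaaaaa").group()

print(years)
print(greedy_run, lazy_run)
```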
Creates a regular expression that will match either A or B.
This operation is never greedy: alternatives are tried left to right, and the first one that matches is used.
|
cat|dog = cat or dog
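A minimal sketch of alternation on made-up strings. The second search shows what "never greedy" means in practice: the leftmost alternative that matches wins, even if a later one would match more text.

```python
import re

# Either alternative may match, anywhere in the string.
animals = re.findall(r"cat|dog", "cat, dog, catalog, bird")

# 'cat' is tried first and succeeds, so 'catalog' is never attempted.
first_branch = re.search(r"cat|catalog", "catalog").group()

print(animals)
print(first_branch)
```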
Matches the contents of the group of the same number. Groups are numbered starting from 1.
For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group).
(abc)\1 matches abcabc. Here, (abc) is a capturing group and \1 is a backreference that matches the same text as captured by that group, so \1 matches abc too.
\number
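A small sketch of backreferences, reusing the example strings from the card. Raw strings are needed so that Python passes \1 through to the regex engine unchanged.

```python
import re

# (.+) \1: some text, a space, then the same text again.
doubled = re.search(r"(.+) \1", "the the")
no_space = re.search(r"(.+) \1", "thethe")

# (abc)\1: the group captures 'abc', and \1 must repeat it exactly.
repeated = re.search(r"(abc)\1", "abcabc")

print(doubled.group(), "| captured:", doubled.group(1))
print(no_space)
print(repeated.group())
```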