Chapter 11: Regex Flashcards
Code that works when the input data is in a particular format but is prone to breakage if the input deviates from that format; in other words, easily broken.
brittle code
regex + and * characters expand outward to match the LARGEST possible string
greedy matching
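A minimal sketch of greedy matching, assuming Python's re module; the sample line is made up for illustration. The + keeps consuming characters up to the last colon that still lets the pattern match.

```python
import re

line = "From: Using the : character"

# '.+' is greedy: it extends to the LAST colon that still allows a match.
greedy = re.findall("^F.+:", line)
print(greedy)  # ['From: Using the :']
```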
A command available in most Unix systems that searches through text files looking for lines that match regular expressions.
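grep (often expanded as "Generalized Regular Expression Parser"; the name derives from the ed editor command g/re/p, "globally search for a regular expression and print")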
A language for expressing more complex search strings. It may contain special characters that indicate, for example, that a search matches only at the beginning or end of a line, along with many similar capabilities.
regular expression
A special character that matches any character. In regular expressions it’s the period.
wild card
regular expression module
import re
regex method that finds a specified regular expression in text and returns a match object (or None if nothing matches)
re.search(substring, search string)
regex that matches beginning of line
^, e.g. re.search('^From:', line)
regex that matches any character (a wildcard)
. (period/full stop), e.g. re.search('F..m', line) matches From, Flam, F#om, etc.
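As a rough illustration of re.search() with the ^ anchor and the . wildcard, assuming a few made-up sample lines:

```python
import re

lines = ["From: alice@example.com", "Subject: From here", "Flam: test"]

# '^' anchors the match to the beginning of the line.
starts_with_from = [ln for ln in lines if re.search("^From:", ln)]
print(starts_with_from)  # ['From: alice@example.com']

# Each '.' matches exactly one character of any kind.
wildcard = [ln for ln in lines if re.search("^F..m:", ln)]
print(wildcard)  # ['From: alice@example.com', 'Flam: test']
```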
regex that applies to the immediately preceding character(s) and matches them zero or more times.
*
regex that applies to the immediately preceding character(s) and matches them one or more times.
+
regex method that returns a list of substring(s) that matches a regular expression
re.findall(substring, search string)
In a for loop over lines, each call returns its own list: ['substring1'], then ['substring2']
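A short sketch of re.findall() called once per line inside a loop; the sample lines are invented. Each call returns a list, which is empty when nothing matches.

```python
import re

lines = ["Received 2 files", "no numbers here", "Sizes: 19 and 42"]

results = []
for line in lines:
    found = re.findall("[0-9]+", line)  # list of every run of digits
    results.append(found)
    if found:                           # skip lines with an empty list
        print(found)

# results == [['2'], [], ['19', '42']]
```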
regex that matches a non-whitespace character
\S
regex format to accept specific characters
Set notation
regex format to match an actual period
[.] or \.
When added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().
Parentheses, e.g. re.findall('substring(part I want)', string)
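To see what parentheses change, here is a small comparison on a made-up mail header line. With a group, findall() returns only the text inside the parentheses.

```python
import re

line = "From alice@example.com Sat Jan  5 09:14:16 2008"

# Without parentheses, findall() returns the whole matched string.
whole = re.findall(r"^From \S+@\S+", line)
print(whole)  # ['From alice@example.com']

# With parentheses, only the parenthesized part is returned.
address = re.findall(r"^From (\S+@\S+)", line)
print(address)  # ['alice@example.com']
```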
technique to match a regex special character as a literal character
Prefix it with a backslash: \$ matches a literal '$'
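A brief sketch of escaping in Python, using an invented sentence. re.escape() is a standard-library helper that is not mentioned in the cards; it is shown here only as one convenient way to escape literal text.

```python
import re

text = "We just received $10.00 for cookies."

# '\$' matches a literal dollar sign instead of anchoring to end of line.
amounts = re.findall(r"\$[0-9.]+", text)
print(amounts)  # ['$10.00']

# re.escape() backslash-escapes the special characters in literal text.
literal = re.search(re.escape("$10.00"), text)
print(literal.group())  # $10.00
```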
regex that anchors to end of line
$
regex that matches a whitespace character
\s
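The end-of-line anchor and the whitespace class can be combined, as in this sketch with made-up strings:

```python
import re

lines = ["done", "done  ", "not done yet"]

# '$' requires the match to end where the line ends.
strict = [ln for ln in lines if re.search(r"done$", ln)]
print(strict)  # ['done']

# '\s*' before '$' tolerates trailing whitespace.
lenient = [ln for ln in lines if re.search(r"done\s*$", ln)]
print(lenient)  # ['done', 'done  ']
```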
regex that applies to the immediately preceding character(s) and matches them zero or more times in “non-greedy mode”.
*?
regex that applies to the immediately preceding character(s) and matches them one or more times in “non-greedy mode”.
+?
regex that applies to the immediately preceding character(s) and matches them zero or one time.
?
regex that applies to the immediately preceding regular expression and matches it zero or one time in “non-greedy mode”.
??
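A small sketch of the zero-or-one quantifier in its greedy and non-greedy forms; the test strings are invented.

```python
import re

# '?' makes the preceding character optional.
optional = re.findall(r"colou?r", "color and colour")
print(optional)  # ['color', 'colour']

# Greedy '?' takes the optional character when it can...
greedy_opt = re.search(r"ab?", "ab").group()
print(greedy_opt)  # ab

# ...while non-greedy '??' prefers to match zero times.
lazy_opt = re.search(r"ab??", "ab").group()
print(lazy_opt)  # a
```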
regex that matches a single character as long as that character is in the specified set. In this example, it would match “a”, “e”, “i”, “o”, “u”, or “-” but no other characters.
[aeiou-] or [-aeiou]
You can specify ranges of characters using the hyphen (minus sign). This example is a single character that must be a lowercase letter or a digit.
[a-z0-9]
When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.
[^A-Za-z]
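The three forms of set notation can be compared side by side. This is a sketch on a short made-up string:

```python
import re

text = "a-Z 9"

# Explicit set: vowels or a literal hyphen.
vowels_or_hyphen = re.findall(r"[aeiou-]", text)
print(vowels_or_hyphen)  # ['a', '-']

# Ranges: any lowercase letter or digit.
lower_or_digit = re.findall(r"[a-z0-9]", text)
print(lower_or_digit)  # ['a', '9']

# Leading caret inverts the set: anything that is not a letter.
not_letters = re.findall(r"[^A-Za-z]", text)
print(not_letters)  # ['-', ' ', '9']
```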
regex that asserts where a word begins or ends. This means that r'\bat\b' matches 'at', 'at.', '(at)', and 'as at ay' but not 'attempt' or 'atlas'.
\b
regex that asserts where a word does NOT begin or end. This means that r'at\B' matches 'athens', 'atom', 'attorney', but not 'at', 'at.', or 'at!'.
\B
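The word-boundary claims on these cards can be checked directly. This sketch reuses the cards' own sample strings:

```python
import re

samples = ["at", "at.", "(at)", "as at ay", "attempt", "atlas"]
# Keep only the strings where 'at' stands alone as a word.
whole_word = [s for s in samples if re.search(r"\bat\b", s)]
print(whole_word)  # ['at', 'at.', '(at)', 'as at ay']

others = ["athens", "atom", "attorney", "at", "at.", "at!"]
# Keep only the strings where 'at' is followed by more word characters.
inside_word = [s for s in others if re.search(r"at\B", s)]
print(inside_word)  # ['athens', 'atom', 'attorney']
```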
regex that matches any decimal digit; equivalent to the set [0-9] for ASCII text.
\d
regex that matches any non-digit character; equivalent to the set [^0-9] for ASCII text.
\D
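A quick sketch of the digit and non-digit classes on an invented ASCII string:

```python
import re

text = "Call 555-0199 now"

# '\d+' collects runs of digits.
digits = re.findall(r"\d+", text)
print(digits)  # ['555', '0199']

# '\D+' collects the runs of everything else.
non_digits = re.findall(r"\D+", text)
print(non_digits)  # ['Call ', '-', ' now']
```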
In Unix/Linux, a command-line program similar to the search() function
grep (often expanded as "Generalized Regular Expression Parser"; the name derives from the ed command g/re/p)
$ grep '^From:' mbox.short.txt
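For comparison, a rough Python counterpart to that grep command might look like the following. The grep() helper and the sample lines are hypothetical, written only for this illustration, and the sketch filters an in-memory list rather than reading mbox.short.txt.

```python
import re

def grep(pattern, lines):
    """Return the lines that contain a match for pattern (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]

sample = ["From: alice@example.com", "To: bob@example.com", "X-From: relay"]
matches = grep("^From:", sample)
print(matches)  # ['From: alice@example.com']
```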
Unix/Linux regex for a non-blank character
[^ ]
regex +? and *? expand outward only as far as needed, matching the SMALLEST possible string
non-greedy matching
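A minimal sketch of non-greedy matching on a made-up line: adding ? after + makes the match stop at the first colon instead of the last.

```python
import re

line = "From: Using the : character"

# '.+?' is non-greedy: it stops at the FIRST colon that completes the match.
lazy = re.findall("^F.+?:", line)
print(lazy)  # ['From:']
```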
Specifies that exactly m copies of the previous regular expression should be matched; fewer matches cause the entire regular expression not to match.
{m}
Causes the resulting regular expression to greedily match from m to n repetitions of the preceding regular expression.
{m,n}
Causes the resulting regular expression to non-greedily match from m to n repetitions of the preceding regular expression.
{m,n}?
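The counted-repetition forms can be compared on one short string. This is a sketch with an invented input:

```python
import re

text = "aaaaa"

exactly_three = re.search(r"a{3}", text).group()
print(exactly_three)  # aaa

# Greedy range takes as many repetitions as allowed (up to 4).
greedy_range = re.search(r"a{2,4}", text).group()
print(greedy_range)  # aaaa

# Non-greedy range takes as few as allowed (2).
lazy_range = re.search(r"a{2,4}?", text).group()
print(lazy_range)  # aa

# Fewer than m copies available: no match at all.
too_few = re.search(r"a{6}", text)
print(too_few)  # None
```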
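Creates a regular expression that will match either A or B. Alternatives are tried from left to right and the first one that matches is accepted, so this operation is never greedy.
A|B, e.g. cat|dog matches cat or dog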
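A short sketch of alternation, including what "never greedy" means in practice; the input strings are made up.

```python
import re

pets = re.findall(r"cat|dog", "hotdog, cat, catalog")
print(pets)  # ['dog', 'cat', 'cat']

# The left alternative wins as soon as it matches, even though
# the right one would match a longer string.
first_wins = re.search(r"a|ab", "ab").group()
print(first_wins)  # a
```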
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group).
\number, e.g. (abc)\1 matches abcabc: (abc) is a capturing group, and \1 is a backreference that matches the same text the group captured, so \1 matches abc too.
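The backreference examples on this card can be verified with a few lines of Python. Raw strings are used so that \1 reaches the regex engine unchanged.

```python
import re

# Group 1 captures 'the'; '\1' must then repeat it after a space.
repeated = re.search(r"(.+) \1", "the the").group()
print(repeated)  # the the

# No space between the copies, so the pattern cannot match.
no_space = re.search(r"(.+) \1", "thethe")
print(no_space)  # None

doubled = re.search(r"(abc)\1", "abcabc").group()
print(doubled)  # abcabc
```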