7 Regex Flashcards
How can you assign a regex object to a variable?
variable = re.compile(r'regexPattern')
What is a match object?
A match object is what the Regex object's search() method returns when the pattern is found in the string you pass it. If nothing matches, search() returns None.
Name the 4 steps of Regular Expression Matching
- Import re
- Create Regex object
>>> regObj = re.compile(pattern)
- Pass message to Regex object search method
>>> mo = regObj.search('message')
- Print the matched text with group()
>>> print(mo.group())
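The four steps above can be sketched end to end as follows. The phone-number pattern, the sample sentence and the names `phone_regex` and `matched_text` are illustrative assumptions, not part of the card.

```python
import re  # Step 1: import the module

# Step 2: create a Regex object (this phone-number pattern is only an example)
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# Step 3: pass the string to search(); it returns a match object or None
mo = phone_regex.search('My number is 415-555-4242.')

# Step 4: read the matched text, guarding against the no-match case
matched_text = mo.group() if mo is not None else None
print(matched_text)  # 415-555-4242
```

The `None` check matters because calling `group()` on a failed search raises an AttributeError.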
What do the numbers in the group() method's parentheses indicate?
mo.group()
mo.group(0)
mo.group(1)
mo.group(2)
…
If the regex pattern has several groups, e.g.:
re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
the number indicates which group's matched text is returned.
mo.group() and mo.group(0) return the entire matched text
mo.group(1) returns the text matched by (\d\d\d)
mo.group(2) returns the text matched by (\d\d\d-\d\d\d\d)
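A small sketch of the group numbering, assuming an example sentence that contains a phone number. The variable names are made up for illustration.

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242.')

whole_match = mo.group()   # identical to mo.group(0)
area_code = mo.group(1)    # text matched by the first parenthesized group
main_number = mo.group(2)  # text matched by the second parenthesized group

print(whole_match, area_code, main_number)  # 415-555-4242 415 555-4242
```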
What does mo.groups() do?
What can you do with it because of its return type?
Returns a tuple of all groups:
>>> mo.groups()
('415', '555-4242')
You can do the multiple assignment trick:
>>> areaCode, mainNumber = mo.groups()
Special characters:
| (pipe)
What should you keep in mind about it?
pipe:
to match one of many expressions, e.g. 'Batman' or 'Batmobile' with
>>> r'Bat(man|mobile)'
When both 'Batman' and 'Batmobile' occur in the searched string, the one that occurs first is returned as the match
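To see the "first occurrence wins" rule, here is a minimal sketch. The sample sentence and the names `bat_regex`, `full_text` and `alternative` are assumptions for illustration.

```python
import re

bat_regex = re.compile(r'Bat(man|mobile)')

# Both alternatives occur; 'Batmobile' comes first in the string, so it wins.
mo = bat_regex.search('The Batmobile lost a wheel while Batman watched.')

full_text = mo.group()     # 'Batmobile'
alternative = mo.group(1)  # 'mobile', the part matched inside the parentheses
print(full_text, alternative)
```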
Special characters:
?
*
+
They allow optional matching:
? == 0 or 1 time
'Batman' or 'Batwoman'
>>> r'Bat(wo)?man'
* == 0 or more times
'Batman' or 'Batwowowoman'
>>> r'Bat(wo)*man'
No longer optional:
+ == 1 or more times
'Batwoman' or 'Batwowowoman'
>>> r'Bat(wo)+man'
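The three quantifiers can be compared side by side. This is only a sketch: the dictionary layout and the candidate word list are assumptions chosen to mirror the examples on the card.

```python
import re

patterns = {
    '?': re.compile(r'Bat(wo)?man'),
    '*': re.compile(r'Bat(wo)*man'),
    '+': re.compile(r'Bat(wo)+man'),
}
candidates = ['Batman', 'Batwoman', 'Batwowowoman']

# For each quantifier, collect the candidate words the pattern matches.
matches = {
    symbol: [word for word in candidates if regex.search(word)]
    for symbol, regex in patterns.items()
}

print(matches['?'])  # ['Batman', 'Batwoman']
print(matches['*'])  # ['Batman', 'Batwoman', 'Batwowowoman']
print(matches['+'])  # ['Batwoman', 'Batwowowoman']
```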
Special characters:
{}
Match specific repetition patterns:
r'(Ha){3}' -> HaHaHa
r'(Ha){3,5}' -> HaHaHa | HaHaHaHa | HaHaHaHaHa (write the range without a space after the comma)
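A quick check of both forms, with made-up variable names and test strings:

```python
import re

exactly_three = re.compile(r'(Ha){3}')
three_to_five = re.compile(r'(Ha){3,5}')

three = exactly_three.search('HaHaHa').group()   # 'HaHaHa'
too_short = exactly_three.search('HaHa')         # None: only two repetitions
four = three_to_five.search('HaHaHaHa').group()  # 'HaHaHaHa'

print(three, too_short, four)
```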
What is greedy and non-greedy (lazy) matching?
How can you switch to lazy matching?
By default, Python does greedy matching: it looks for the longest possible match
r'(Ha){3,5}' -> matches HaHaHaHaHa in 'HaHaHaHaHa'
Non-greedy (lazy) matching takes the shortest possible match
r'(Ha){3,5}?' -> matches HaHaHa in 'HaHaHaHaHa'
Here the ? means non-greedy (not optional, as it does when it directly follows a group)
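The difference shows up when the same string is searched with both variants. The variable names here are only illustrative.

```python
import re

text = 'HaHaHaHaHa'

# Greedy: take as many repetitions as the range allows.
greedy = re.compile(r'(Ha){3,5}').search(text).group()
# Lazy: the trailing ? makes the quantifier stop at the minimum.
lazy = re.compile(r'(Ha){3,5}?').search(text).group()

print(greedy)  # HaHaHaHaHa
print(lazy)    # HaHaHa
```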
findall() method
- Difference from search()
- return values
- Finds all matches, not only the first
- Does not return a match object, but a list of strings
>>> ['415-555-9999', '212-555-0000']
- Except if there are multiple groups in the Regex object, e.g.:
>>> re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
Then the return value is a list of tuples, e.g.:
>>> [('415', '555', '9999'), ('212', '555', '0000')]
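Both return shapes can be reproduced with one sample string. The string and the names `no_groups` and `with_groups` are assumptions made for this sketch.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'
no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
with_groups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')

first_only = no_groups.search(text).group()  # search() stops at the first match
all_strings = no_groups.findall(text)        # list of strings
all_tuples = with_groups.findall(text)       # list of tuples, one item per group

print(first_only)   # 415-555-9999
print(all_strings)  # ['415-555-9999', '212-555-0000']
print(all_tuples)   # [('415', '555', '9999'), ('212', '555', '0000')]
```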
Shorthand character class
\d
\D
\w
\W
\s
\S
\d any digit 0-9
\D any character NOT a digit
\w any letter, numeric digit or underscore character
\W any character NOT…
\s any space, tab or newline
\S any character NOT…
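A few of these classes in action, as a rough sketch. The sample strings are invented for illustration.

```python
import re

# One or more digits, one whitespace character, one or more word characters.
counted_items = re.findall(r'\d+\s\w+', '12 drummers, 11 pipers')
# Runs of characters that are NOT digits.
non_digits = re.findall(r'\D+', 'a1b22c')
# Runs of characters that are NOT whitespace.
non_space = re.findall(r'\S+', 'two words')

print(counted_items)  # ['12 drummers', '11 pipers']
print(non_digits)     # ['a', 'b', 'c']
print(non_space)      # ['two', 'words']
```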
How can you define your own character class? E.g.
for vowels?
for letters?
for '.+*'
By using [square brackets]
>>> vowelRegex = re.compile(r'[aeiouAEIOU]')
>>> letterRegex = re.compile(r'[a-zA-Z]')
>>> specCharRegex = re.compile(r'[.+*]')
How can you make a negative character class?
What does it do?
By adding ^ directly after the opening square bracket
[^aeiouAEIOU]
It matches every character except what's in the brackets
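A positive and a negative class can be compared with findall(). The test strings and the name `consonant_regex` are assumptions; note that a negative class also matches spaces and punctuation, not only consonants.

```python
import re

vowel_regex = re.compile(r'[aeiouAEIOU]')
consonant_regex = re.compile(r'[^aeiouAEIOU]')  # everything that is NOT a vowel

vowels = vowel_regex.findall('RoboCop eats baby food.')
non_vowels = consonant_regex.findall('Regex')

print(vowels)      # ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']
print(non_vowels)  # ['R', 'g', 'x']
```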
What do the caret and dollar sign do?
What's the mnemonic to remember which comes first?
Caret:
>>> r'^Hello!'
the searched string must start with 'Hello!'
Dollar:
>>> r'Hello!$'
the searched string must end with …
Carrots cost Dollars
What does it search for?
>>> re.compile(r'^\d+$')
For a string that starts and ends with one or more digits
-> and that has nothing else in between, i.e. the whole string consists of digits
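The anchors can be tried out like this. It is a sketch with invented test strings and variable names, and each result is stored as a plain boolean.

```python
import re

starts_with_hello = re.compile(r'^Hello!')
ends_with_hello = re.compile(r'Hello!$')
only_digits = re.compile(r'^\d+$')

start_ok = bool(starts_with_hello.search('Hello! Anyone there?'))  # True
start_bad = bool(starts_with_hello.search('He said Hello!'))       # False
end_ok = bool(ends_with_hello.search('He said Hello!'))            # True
digits_ok = bool(only_digits.search('1234567890'))                 # True
digits_bad = bool(only_digits.search('12345xyz67890'))             # False

print(start_ok, start_bad, end_ok, digits_ok, digits_bad)
```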
What's the wildcard character?
It's the '.', which matches any single character except \n, e.g.:
>>> atRegex = re.compile(r'.at')
Could be: 'Hat', 'Sat', 'fat'…
When could that be useful?
1.
>>> everythRegex = re.compile(r'.*')
2.
>>> everythRegex = re.compile(r'.*?')
If you want any combination of 0 or more characters that is:
- as long as possible (greedy)
- as short as possible (lazy)
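The effect is easiest to see when the pattern has a closing delimiter that occurs twice. The angle-bracket string below is an assumed example, not taken from the card.

```python
import re

text = '<To serve man> for dinner.>'

# Greedy .* runs to the LAST closing bracket.
greedy = re.compile(r'<.*>').search(text).group()
# Lazy .*? stops at the FIRST closing bracket.
lazy = re.compile(r'<.*?>').search(text).group()

print(greedy)  # <To serve man> for dinner.>
print(lazy)    # <To serve man>
```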
How can case-insensitive matching be done?
re.compile(r'pAtTeRn', re.I)
re.I is short for re.IGNORECASE
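A short sketch of case-insensitive matching. The `robocop` pattern and the three sample strings are assumptions for illustration.

```python
import re

robocop = re.compile(r'robocop', re.I)  # re.I is the short alias of re.IGNORECASE

samples = ('RoboCop is part man', 'ROBOCOP protects', 'robocop')
# group() returns the text as it appears in the string, in its original case.
found = [robocop.search(s).group() for s in samples]

print(found)  # ['RoboCop', 'ROBOCOP', 'robocop']
```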
How can you substitute?
Agent example
The sub() method for Regex objects is passed two arguments. The first argument is the string that replaces every match. The second is the string to search, in which the matches are replaced.
>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'
What is re.VERBOSE useful for?
What code is required?
It improves readability by letting you spread the pattern over several lines and add comments:
re.compile(r'''(
…
…
…)''', re.VERBOSE)
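As one possible way to fill in such a template, here is a sketch with a hypothetical phone-number pattern. The pattern, its comments and the sample sentence are assumptions, not the content elided above.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3})   # area code
    -         # separator
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
)''', re.VERBOSE)

mo = phone_regex.search('Call 415-555-4242 today')
whole = mo.group()   # '415-555-4242'
parts = mo.groups()  # the outer parentheses form group 1, so the whole number comes first

print(whole)  # 415-555-4242
print(parts)  # ('415-555-4242', '415', '555', '4242')
```

In verbose mode, whitespace in the pattern is ignored and everything from # to the end of the line is treated as a comment.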
How can you combine IGNORECASE, DOTALL and VERBOSE?
By joining the flags with the bitwise OR operator (|):
>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
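To see all three flags contribute at once, consider the sketch below. The pattern and the two-line sample text are invented for illustration.

```python
import re

pattern = r'''
    first   # literal word; any case, thanks to IGNORECASE
    .*      # anything, including newlines, thanks to DOTALL
    last    # literal word
'''
text = 'FIRST line\nLAST line'

combined = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.VERBOSE)
matched = combined.search(text).group()

# Without DOTALL the dot cannot cross the newline, so nothing matches.
without_dotall = re.compile(pattern, re.IGNORECASE | re.VERBOSE).search(text)

print(repr(matched))   # 'FIRST line\nLAST'
print(without_dotall)  # None
```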