Chapter 11: Regular Expressions Flashcards
import regular expressions
import re
print lines with ‘From:’ after searching through file
import re
hand = open(...txt)
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
special character in regular expressions that signifies 'string starts with':
'string ends with':
how to tell these special meanings apart from searching for a literal ^ or $ char:
^, e.g. if re.search('^From:', line):
$
prefix the character with the escape character, \
so '^blah' means the string starts with blah, while '\^blah' looks for the literal text ^blah. The \ cancels the special meaning. To require the literal ^blah at the start of the string, write '^\^blah'.
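A minimal sketch of the anchors and the escape character in use. The sample strings in `lines` are made up for illustration and are not from the original cards.

```python
import re

# Hypothetical sample strings, invented for this example
lines = [
    "From: alice@example.com",
    "Reply-From: bob",
    "^blah starts with a caret",
    "cost in $",
]

# ^ anchors the pattern to the start of the string
starts_with_from = [s for s in lines if re.search(r'^From:', s)]

# \$ is a literal dollar sign; the final $ anchors it to the end of the string
ends_with_dollar = [s for s in lines if re.search(r'\$$', s)]

# \^ is a literal caret, so ^\^blah means "starts with the text ^blah"
starts_with_caret = [s for s in lines if re.search(r'^\^blah', s)]

print(starts_with_from)
print(ends_with_dollar)
print(starts_with_caret)
```

Raw strings (the `r'...'` prefix) are used so that Python passes the backslashes through to `re` unchanged.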
special char in REs to signify ‘any character’:
. e.g. 'F..m:' will find 'F@Pm:'
matches any character except a newline
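A quick check of the dot, as a sketch with invented test strings. Each `.` consumes exactly one character, and by default it does not match a newline.

```python
import re

# Each . stands for exactly one character of any kind
two_any = re.search(r'F..m:', 'F@Pm: hi') is not None    # '@' and 'P' fill the two dots
plain = re.search(r'F..m:', 'From: hi') is not None      # 'r' and 'o' fill the two dots
too_short = re.search(r'F..m:', 'Fm: hi') is not None    # nothing between F and m

# By default . does not match a newline character
across_newline = re.search(r'a.b', 'a\nb') is not None

print(two_any, plain, too_short, across_newline)
```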
special chars in REs to signify zero-or-more and one-or-more repetitions
* and + respectively, e.g. re.search('^From:.+@', line) means From: followed by one or more characters, then @. Each one modifies the item immediately to its left, which can be a literal character or a character class such as \S below. They are most often used after . or \S.
what if there are multiple @s?
* and + are greedy, so the match extends to the last @. Adding ? after the quantifier, e.g. *? or +?, makes it non-greedy, so the match stops at the first @ instead. The same ? can be added to the other RE quantifiers below.
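To see the greedy and non-greedy behaviour side by side, here is a small sketch. The string in `text` is a made-up example with two @ signs.

```python
import re

# Hypothetical line containing two @ characters
text = 'From: alice@example.com, bob@example.org'

# Greedy: .+ grabs as much as it can, so the match runs to the LAST @
greedy = re.search(r'^From:.+@', text).group()

# Non-greedy: .+? grabs as little as it can, so the match stops at the FIRST @
lazy = re.search(r'^From:.+?@', text).group()

print(greedy)
print(lazy)
```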
special char in REs to signify 'any (single) non-whitespace character':
zero or more non-whitespace chars:
\S
\S*
\S+ is one or more
(* and + apply to the special char directly to their left)
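A short sketch of the difference between \S* and \S+, using invented strings. `re.match` only looks at the start of the string, which makes the zero-or-more case easy to see.

```python
import re

# \S+ picks out each run of non-whitespace characters
words = re.findall(r'\S+', 'a  bc   def')

# The string starts with a space, so \S* succeeds by matching zero characters...
zero_ok = re.match(r'\S*', ' abc').group()

# ...while \S+ needs at least one non-whitespace character and fails
one_needed = re.match(r'\S+', ' abc')

print(words)
print(repr(zero_ok))
print(one_needed)
```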
use re.findall() to make a list of all email addresses in a str
import re
str=…..
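lst=re.findall('\S+@\S+', str)
(\S+ on each side requires at least one non-whitespace character before and after the @, so the whole username and the whole domain are captured)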
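A minimal sketch of pulling addresses out of a string with `re.findall()`. The sample text and addresses are invented. The variable is called `text` here rather than `str`, only to avoid hiding Python's built-in `str` type.

```python
import re

# Hypothetical sample text
text = 'Contact alice@example.com or b@example.org for details'

# One or more non-whitespace chars on each side of the @
addresses = re.findall(r'\S+@\S+', text)

# With a single \S before the @, only the last character of the username is kept
truncated = re.findall(r'\S@\S+', text)

print(addresses)
print(truncated)
```

Note that \S also matches punctuation, so an address followed directly by a comma or full stop would come back with that character attached.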
a single lowercase letter, uppercase letter or number, followed by zero or more non-whitespace characters
[a-zA-Z0-9]\S*
the square brackets don't modify \S; the two parts are simply matched one after the other. If \S* were in front, i.e. \S*[a-zA-Z0-9], it would mean zero or more non-whitespace chars followed by one of a-z/A-Z/0-9. Putting the bracketed class at the start (or end) of the pattern forces the match to start (or end) with one of those characters, so non-matching chars at that edge are left out.
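The effect of where the bracketed class sits can be checked with a small sketch. The string in `text` is made up so that some words have punctuation at their edges.

```python
import re

# Hypothetical text with punctuation stuck to the words
text = '<hello> ##x1 ok!'

# Class first: each match must START with a letter or digit
class_first = re.findall(r'[a-zA-Z0-9]\S*', text)

# Class last: each match must END with a letter or digit
class_last = re.findall(r'\S*[a-zA-Z0-9]', text)

print(class_first)
print(class_last)
```

Leading punctuation is dropped in the first result and trailing punctuation in the second.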
other RE quantifiers (than * or +); these behave the same way as + and *:
0 or 1: ?
exactly a given number: {given number}
between 3 and 7: {3,7} (no space after the comma)
4 or more: {4,}
up to 6: {,6}
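A sketch that tries each quantifier against strings of different lengths. The helper `full()` is a hypothetical convenience defined here, not part of `re`; it wraps `re.fullmatch` so that the whole string has to fit the pattern.

```python
import re

def full(pattern, s):
    """Hypothetical helper: True if the WHOLE string fits the pattern."""
    return re.fullmatch(pattern, s) is not None

# ? means 0 or 1 of the item to its left
optional_u = (full(r'colou?r', 'color'), full(r'colou?r', 'colour'))

# {4} means exactly 4
exactly_four = (full(r'\d{4}', '2024'), full(r'\d{4}', '202'))

# {3,7} means between 3 and 7
three_to_seven = (full(r'a{3,7}', 'aa'), full(r'a{3,7}', 'aaa'), full(r'a{3,7}', 'a' * 8))

# {4,} means 4 or more
four_or_more = (full(r'a{4,}', 'aaa'), full(r'a{4,}', 'aaaa'))

# {,6} means up to 6, including none at all
up_to_six = (full(r'a{,6}', ''), full(r'a{,6}', 'a' * 6), full(r'a{,6}', 'a' * 7))

print(optional_u, exactly_four, three_to_seven, four_or_more, up_to_six)
```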
other RE character classes (than \S):
one digit: \d
one non-digit: \D
whitespace char: \s
non-whitespace char: \S
word character: \w
non-word char: \W
one char which is a 4, 5, 6 or a period: [4-6.] (inside square brackets . is a literal period, not 'any char')
one char except a, b or c: [^a-c]
one char from those listed in the square brackets: [a9g&£ja]
a or c: a|c (| is typed with shift+\; it is not a lowercase L)
backspace char: [\b]
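Each class above can be tried out with `re.findall()`, which returns every single character that fits. This is only a sketch with invented input strings.

```python
import re

digits = re.findall(r'\d', 'a1 b22')           # digits only
non_digits = re.findall(r'\D', 'a1 ')          # everything except digits
spaces = re.findall(r'\s', 'a b\tc')           # space and tab are both whitespace
word_chars = re.findall(r'\w', 'a_1!')         # letters, digits and underscore
non_word = re.findall(r'\W', 'a_1!')           # anything that is not a word char
four_to_six = re.findall(r'[4-6.]', '3.456 7') # 4, 5, 6 or a literal period
not_abc = re.findall(r'[^a-c]', 'abcde')       # ^ inside brackets means "not"
a_or_c = re.findall(r'a|c', 'abcabc')          # | means "or"
backspace = re.findall(r'[\b]', 'x\by')        # '\b' in a normal string is a backspace

print(digits, non_digits, spaces, word_chars, non_word)
print(four_to_six, not_abc, a_or_c, backspace)
```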
other groups of special chars not mentioned but can be found: https://www.debuggex.com/cheatsheet/regex/python
RE groups, RE assertions, RE flags, RE replacement
parentheses in re.findall()
e.g. re.findall('X-.*: ([0-9.]+)', line)
the parentheses specify which part you want returned. They don't change what the pattern matches, but findall returns only the text captured inside the ()
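A small sketch comparing the same pattern with and without parentheses. The header in `line` is an invented example of the kind of text the pattern is aimed at.

```python
import re

# Hypothetical header line
line = 'X-DSPAM-Confidence: 0.8475'

# With parentheses: only the captured number is returned
with_group = re.findall(r'X-.*: ([0-9.]+)', line)

# Without parentheses: the whole matched text is returned
without_group = re.findall(r'X-.*: [0-9.]+', line)

print(with_group)
print(without_group)
```

The results are strings, so the number needs `float()` before doing arithmetic on it.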
matches empty string, but only at start or end of word
same, but not at start or end of word
\b
\B
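A sketch of both boundary markers on a made-up sentence. The patterns are written as raw strings because in a normal Python string '\b' is the backspace character rather than a word boundary.

```python
import re

# Hypothetical sample sentence
text = 'cat concat cats bobcat cat.'

# \b on both sides: 'cat' only when it is a whole word
whole_word = re.findall(r'\bcat\b', text)

# \B in front: 'cat' only when it is NOT at the start of a word
inside_word = re.findall(r'\Bcat', text)

print(len(whole_word))
print(len(inside_word))
```

The first pattern finds the standalone 'cat' at the start and the one before the full stop. The second finds the 'cat' inside 'concat' and 'bobcat'.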