Regex (Python re module) Flashcards
What’s the difference between re.match(pattern, string) and re.search(pattern, string)?
- re.match checks if the pattern matches at the start of the string.
- re.search looks anywhere in the string for the first match.
Example:
Imagine we want to check if a lead’s message starts with a keyword like “Interested”:
import re
message = "Interested in your product, call me at 123-456-7890."
print(re.match(r'Interested', message))  # Match object -> matches 'Interested'
print(re.search(r'\d{3}-\d{3}-\d{4}', message))  # Match object -> matches '123-456-7890'
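The difference is easiest to see when the keyword is not at the start. The following is a minimal sketch using a made-up message (not from the cards above): re.match returns None, while re.search still finds the keyword.

```python
import re

# Hypothetical message where the keyword appears mid-string
message = "Hi, I am Interested in your product."

at_start = re.match(r"Interested", message)   # None: the string starts with "Hi"
anywhere = re.search(r"Interested", message)  # Match object for "Interested"

print(at_start)                            # None
print(anywhere.group(), anywhere.start())  # Interested 9
```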
Which function returns all non-overlapping matches of a pattern in a string?
re.findall(pattern, string) returns a list of all matches.
Example:
Extract all the phone numbers from a lead’s notes field:
import re
notes = "Call me at 123-456-7890 or at 987-654-3210."
phone_numbers = re.findall(r'\d{3}-\d{3}-\d{4}', notes)
print(phone_numbers)  # ['123-456-7890', '987-654-3210']
Which re function do you use to replace all occurrences of a pattern in a string?
re.sub(pattern, repl, string).
Example:
Normalize phone numbers by removing dashes for storage:
import re
phone = "123-456-7890"
normalized_phone = re.sub(r'-', '', phone)
print(normalized_phone)  # '1234567890'
How can you split a string by a given regex pattern?
Use re.split(pattern, string) to split the string into a list at every match of the pattern.
Example:
Split a lead’s full name into first and last names:
import re
full_name = "John Doe"
name_parts = re.split(r'\s+', full_name)
print(name_parts)  # ['John', 'Doe']
What’s the benefit of re.compile(pattern)?
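It compiles the regex into a reusable re.Pattern object. This keeps the pattern in one named place and skips the repeated cache lookup when you use the same pattern many times. The speed gain is usually small, because the re module already caches recently compiled patterns.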
Example:
Check multiple emails for validity during batch lead validation:
import re
# The dot before the top-level domain is escaped (\.) so it matches a literal "."
pattern = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')
emails = ["valid@example.com", "invalid@domain", "user@website.org"]
valid_emails = [email for email in emails if pattern.match(email)]
print(valid_emails)  # ['valid@example.com', 'user@website.org']
Name a few regex metacharacters, such as ^ or $, and their meanings.
Examples:
^ start of string
$ end of string
. any character (except newline)
[] character class
| OR
() grouping
+ one or more
* zero or more
? zero or one
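As a quick illustration of the list above, this sketch runs one small pattern per metacharacter. The sample strings and variable names are made up for the example.

```python
import re

starts = bool(re.search(r"^Hi", "Hi there"))        # ^  anchors at the start
ends = bool(re.search(r"there$", "Hi there"))       # $  anchors at the end
any_char = re.findall(r"c.t", "cat cot cut")        # .  any single character
char_class = re.findall(r"[bc]at", "bat cat rat")   # [] one character from the set
either = re.findall(r"cat|dog", "cat dog bird")     # |  either alternative
grouped = re.search(r"(ha)+", "hahaha!").group()    # () group repeated by +
zero_or_more = re.findall(r"go*d", "gd god good")   # *  zero or more "o"
optional = re.findall(r"colou?r", "color colour")   # ?  the "u" is optional

print(starts, ends)   # True True
print(any_char)       # ['cat', 'cot', 'cut']
print(char_class)     # ['bat', 'cat']
print(either)         # ['cat', 'dog']
print(grouped)        # hahaha
print(zero_or_more)   # ['gd', 'god', 'good']
print(optional)       # ['color', 'colour']
```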
What do \d, \w, and \s represent in regex?
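\d matches any digit (0-9, plus other Unicode digits unless re.ASCII is used)
\w matches any word character (A-Za-z0-9_, plus other Unicode letters and digits unless re.ASCII is used)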
\s matches whitespace (spaces, tabs, newlines)
Example:
Check if a lead’s username contains only valid characters:
import re
username = "user_123"
print(re.match(r'^\w+$', username))  # Match object -> matches 'user_123'
How do the following quantifiers differ: +, *, ?, {m,n}?
+ = 1 or more
* = 0 or more
? = 0 or 1 (optional)
{m,n} = between m and n times
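One way to compare the quantifiers is to test each against strings of increasing length. This is a small sketch; the helper name accepted and the candidate list are hypothetical, chosen only for illustration.

```python
import re

candidates = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

one_or_more = accepted(r"a+")        # at least one "a"
zero_or_more = accepted(r"a*")       # any number of "a", including none
zero_or_one = accepted(r"a?")        # empty string or a single "a"
two_to_three = accepted(r"a{2,3}")   # exactly two or three "a"

print(one_or_more)    # ['a', 'aa', 'aaa', 'aaaa']
print(zero_or_more)   # ['', 'a', 'aa', 'aaa', 'aaaa']
print(zero_or_one)    # ['', 'a']
print(two_to_three)   # ['aa', 'aaa']
```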
Which anchors assert the position at the start or end of the string?
^ = start of string
$ = end of string
Example:
Check if a lead’s message starts with “Hello” and ends with “Thanks”:
import re
message = "Hello, I'm interested. Thanks"
print(re.match(r'^Hello.*Thanks$', message))  # Match object
What does \b (word boundary) do in regex?
\b asserts the position where one side is a word character and the other side is a non-word character (or start/end of the string).
Example:
Check if a lead used the word “free” (not part of another word like “freedom”):
import re
message = "Get your free trial today!"
print(re.search(r'\bfree\b', message))  # Match object
How do you make a group non-capturing in a regex?
Use (?: … ) instead of ( … ).
Example:
Match the phrases “buy now” or “shop now” in a lead’s response:
import re
message = "I would like to shop now."
print(re.search(r'(?:buy|shop) now', message))  # Match object
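The practical effect shows up in what the match reports back. The sketch below compares a capturing and a non-capturing version of the same pattern (variable names are illustrative): both match the same text, but only the capturing group is stored, which also changes what re.findall returns.

```python
import re

message = "I would like to shop now."

capturing = re.search(r"(buy|shop) now", message)
non_capturing = re.search(r"(?:buy|shop) now", message)

print(capturing.group(0), capturing.groups())          # shop now ('shop',)
print(non_capturing.group(0), non_capturing.groups())  # shop now ()

# With a capturing group, findall returns the group; without one, the whole match
found_with_group = re.findall(r"(buy|shop) now", message)
found_without_group = re.findall(r"(?:buy|shop) now", message)
print(found_with_group)      # ['shop']
print(found_without_group)   # ['shop now']
```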
Validating Email (Simple)
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
Provide a basic regex pattern that captures US phone numbers (like +1 123 456-7890) with optional +1, parentheses, dashes, etc.
^(?:\+?1\s*)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}$
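A short check of the phone pattern against a few inputs follows. It is a sketch under one assumption: the literal +, ( and ) are backslash-escaped and the hyphen sits last in the character class, since Python's re needs that for the pattern to compile and behave as described. The sample numbers and the name US_PHONE are made up for the example.

```python
import re

# Assumed form of the pattern: literal "+", "(" and ")" are escaped,
# and "-" sits last in the character class so it is not read as a range.
US_PHONE = re.compile(r"^(?:\+?1\s*)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}$")

samples = [
    "+1 123 456-7890",   # country code, spaces and a dash
    "(123) 456-7890",    # parentheses around the area code
    "123.456.7890",      # dots as separators
    "1234567890",        # no separators at all
    "123-45-6789",       # wrong grouping, should be rejected
    "12-3456-7890",      # wrong grouping, should be rejected
]

results = {s: bool(US_PHONE.match(s)) for s in samples}
for number, ok in results.items():
    print(f"{number!r}: {ok}")
```

The first four samples are accepted and the last two are rejected. The pattern only checks the shape of the number, not whether the area code exists.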
Why do we often use a raw string (e.g., r’…’) for regex in Python?
It prevents Python’s backslash escapes from interfering, so you don’t need to escape them twice.
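A small sketch of what goes wrong without the raw string, reusing the word-boundary pattern from an earlier card: in a normal string literal, \b is the backspace character, so the regex engine never sees a word boundary.

```python
import re

message = "Get your free trial today!"

plain = "\bfree\b"    # "\b" here is the backspace character (ASCII 8)
raw = r"\bfree\b"     # backslash and "b" reach the regex engine unchanged
doubled = "\\bfree\\b"  # same as the raw string, with each backslash escaped

print(len(plain), len(raw))          # 6 8
print(re.search(plain, message))     # None
print(re.search(raw, message))       # Match object for 'free'
print(doubled == raw)                # True
```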
How would you remove all digits from a string using re.sub?
re.sub(r'\d+', '', your_string)
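For completeness, here is the call applied to a made-up string. Removing digits can leave doubled spaces behind, so the sketch adds an optional second pass that collapses whitespace; that step is an extra suggestion, not part of the card's answer.

```python
import re

your_string = "Order 66 shipped to suite 12B"

without_digits = re.sub(r"\d+", "", your_string)
print(repr(without_digits))   # 'Order  shipped to suite B'

# Optional cleanup: collapse the whitespace runs left where digits were removed
tidy = re.sub(r"\s+", " ", without_digits).strip()
print(repr(tidy))             # 'Order shipped to suite B'
```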