Lecture 12 Revision: Regular Expressions (regex) Flashcards
What is terse compared to verbose code?
In programming, “terse” and “verbose” are terms used to describe the style or expressiveness of code.
Terse code is concise and uses fewer lines or characters to accomplish a task. It can be efficient and elegant but might be harder to read and understand, especially for those not familiar with the language or the code.
Verbose code is more detailed and explicit, often using more lines of code to achieve the same result. While it can be easier to read and understand, it might be less efficient or elegant.
What are regular expressions?
A regular expression is a pattern of text that allows us to match information in text documents.
Regular expressions are often regarded as difficult, but it is probably fairer to say that the notation is terse.
What is the re.search() function?
The re.search() function in Python is used to search for a PATTERN in a string. Unlike re.match(), which only checks the beginning of the string, re.search() scans the ENTIRE string to find the FIRST occurrence of the pattern
If a match is found it returns a match object. If not, it returns None.
It can be used to search for text within a string.
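A minimal sketch of the behaviour described above. The sample sentence and the variable names are made up for illustration.

```python
import re

text = "The cat sat on the mat"  # hypothetical sample text

# re.search() scans the whole string and stops at the first occurrence.
found = re.search(r"at", text)
print(found.group())  # the text that matched
print(found.start())  # index where the first match begins

# When the pattern does not occur anywhere, the result is None.
missing = re.search(r"dog", text)
print(missing)
```

The first "at" belongs to "cat", so the reported start index points inside that word rather than at "sat" or "mat".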
What is match?
In Python’s re module, functions like re.match() or re.search() return a match object if the specified pattern is found in the text.
This match object contains information about the match, such as:
The specific text (substring) that matched the pattern.
The location (start and end) of the match in the original string.
What does group() do? e.g. match.group()
The group() method of the match object is used to extract the text (substring) that was matched by the pattern.
e.g. re.search(r'\d+', text) finds the first substring that matches the pattern \d+ (one or more digits).
match.group() then returns the matched text, which is the first number found in the text.
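The match object and group() can be tried out with a short sketch like the one below. The sample text is invented; start(), end() and span() are the standard match-object methods for the location of the match.

```python
import re

text = "Order 66 shipped 3 days ago"  # hypothetical sample text
match = re.search(r"\d+", text)

if match:
    print(match.group())  # the matched substring (the first number)
    print(match.start())  # index of the first matched character
    print(match.end())    # index just past the last matched character
    print(match.span())   # (start, end) as a tuple
```

Always check that the result is not None before calling group(), otherwise a failed search raises an AttributeError.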
What would this basic example do?
import re
string = 'Hello World!'
match = re.search(r'Wo', string)
if match:
    print('Match!')
else:
    print('No match!')
This imports the re module.
It creates a string containing Hello World!
It creates a variable called match that holds the result of a search for the text 'Wo' in that string.
If there is a match it prints Match! to the screen. Otherwise it prints No match! to the screen. Here 'Wo' occurs in 'World', so it prints Match!
Describe / explain the syntax of re.search()
match = re.search(pattern, string, flags=0)
Takes in 3 parameters:
- pattern: The regular expression (regex), i.e. the thing you are searching for.
- string: The text to search for the pattern in.
- flags: Optional flags that modify how matching works (defaults to 0, meaning no flags).
e.g. match = re.search(r'python', 'Monty python')
or match = re.search(r'Wo', string)
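To see what the optional flags parameter changes, here is a small sketch using re.IGNORECASE on the 'Monty python' example. The variable names strict and relaxed are only for illustration.

```python
import re

# Without flags the search is case-sensitive, so 'Python' does not match 'python'.
strict = re.search(r"Python", "Monty python")

# re.IGNORECASE makes the same pattern match regardless of case.
relaxed = re.search(r"Python", "Monty python", re.IGNORECASE)

print(strict)           # no match object is returned
print(relaxed.group())  # the text as it appears in the string
```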
How do you deal with searching for special characters in re.search()?
Simple text is matched exactly except for these special characters:
+ ? . * ^ $ ( ) [ ] { } | \
These must be escaped with a backslash to be matched literally. So to search for a single ? use \?
e.g. string = 'Hello World!'
match = re.search(r'\?', string)
Here match is None, because the string does not contain a ?
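A short sketch of why escaping matters, using made-up strings. re.escape() is a standard helper in the re module that adds the backslashes for you; it is not mentioned in the lecture notes, so treat it as an optional extra.

```python
import re

question = "Is this a question?"  # hypothetical sample text

# Escaped: \? matches a literal question mark.
literal = re.search(r"\?", question)
print(literal.start())

# Unescaped, the dot is a wildcard, so r"3.14" also matches "3x14".
loose = re.search(r"3.14", "3x14")
# Escaped, the dot only matches a real dot.
exact = re.search(r"3\.14", "3x14")
print(loose is not None, exact is not None)

# re.escape() returns the text with special characters escaped.
escaped = re.escape("a+b?")
print(escaped)
```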
Complex searching: What else can we do?
Regular expressions are much more powerful than just matching simple text.
Rather than specifying the exact text that we wish to match, we can specify its properties or pattern.
e.g. match six lowercase letters, or find IPv4 addresses in a log file.
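The two examples above could be sketched roughly as follows. The sample strings are invented, and the {6} and {1,3} repetition counts are standard regex syntax that these cards do not cover in detail. The IPv4 pattern is deliberately simplified: it checks the shape of the address only, not that each number is in the range 0-255.

```python
import re

# Exactly six lowercase letters, as a whole word.
six_lower = re.search(r"\b[a-z]{6}\b", "I ate one banana today")
print(six_lower.group())

# Simplified IPv4 shape: four groups of 1-3 digits separated by literal dots.
log_line = "Failed login from 192.168.0.17 at 10:42"  # hypothetical log line
ip = re.search(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", log_line)
print(ip.group())
```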
So how do we make our regex more powerful?
We can use:
- Anchors
- Character classes and wildcards
- Repetitions and alternative characters
What are anchors?
Anchors are special characters that specify where in the string a match should occur:
^ start of string
$ end of string
\b word boundary (the position between a word character and a non-word character, such as a space, tab, newline or punctuation mark; the start or end of the string also counts when a word character sits there)
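The three anchors can be compared with a small sketch like this one. The sample text is made up, and re.findall() is used here only to list every match.

```python
import re

text = "cat concatenate cat"  # hypothetical sample text

starts_with_cat = re.search(r"^cat", text)        # 'cat' at the start of the string
ends_with_cat = re.search(r"cat$", text)          # 'cat' at the end of the string
starts_with_concat = re.search(r"^concat", text)  # not at the start, so no match

whole_words = re.findall(r"\bcat\b", text)  # only 'cat' as a separate word
anywhere = re.findall(r"cat", text)         # also counts the 'cat' inside 'concatenate'

print(bool(starts_with_cat), bool(ends_with_cat), bool(starts_with_concat))
print(len(whole_words), len(anywhere))
```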
What is a Character Class?
A character class is a set of characters that can match at a specific point in a regular expression
* Denoted by square brackets […]
* Character classes allow us to give some choice as to what is matched
Examples:
- Match any lowercase letter: r'[a-z]'
- Match any uppercase letter: r'[A-Z]'
- Match any letter: r'[a-zA-Z]'
- Match a hexadecimal digit: r'[a-fA-F0-9]'
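A quick sketch of these classes in use, with invented sample strings. Each class matches exactly one character at a time.

```python
import re

first_lower = re.search(r"[a-z]", "HELLO world")   # first lowercase letter
first_upper = re.search(r"[A-Z]", "hello World")   # first uppercase letter
hex_digits = re.findall(r"[a-fA-F0-9]", "0x1F, g") # every hexadecimal digit

print(first_lower.group())
print(first_upper.group())
print(hex_digits)
```

Note that 'x' and 'g' are skipped by the hexadecimal class because they fall outside a-f.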
How can we negate a character class? e.g. search for something that is NOT a digit.
A ^ symbol at the beginning of a character class negates the class.
E.g. to negate digits (to match any character that is NOT a digit):
r'[^0-9]'
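A small sketch of the negated class on two made-up strings, one mixed and one containing digits only.

```python
import re

non_digits = re.findall(r"[^0-9]", "a1b22c")  # every character that is not a digit
all_digits = re.search(r"[^0-9]", "12345")    # nothing but digits, so no match

print(non_digits)
print(all_digits)
```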
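What are the two special characters we have to be aware of in a character class?
- The caret symbol (^) negates the class when it comes first, but it can also appear in the text we want to match.
- The hyphen (-) defines a range in a character class, i.e. a-z or 0-9, but it can also appear in the text we want to match.
To match either character literally, place it at the end of the character class (a hyphen also works at the very start), or escape it with a backslash.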
Special Characters: What is this asking to match?
r'[0-9^]'
Match any single character that is a digit (0-9) or a caret symbol, so a search succeeds on any line containing a digit or a caret.
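Special Characters: What is this asking to match?
r'[^0-9]'
Match any single character that is NOT a digit, so a search succeeds on any line containing at least one non-digit character (not only on lines with no digits at all).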
Special Characters: What is this asking to match?
r'[0-9]'
Match any line containing a digit
Special Characters: What is this asking to match?
r'[-09]'
Match any line containing a hyphen/minus sign, a 0 OR a 9
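The four patterns from these cards can be compared side by side on one invented sample string. re.findall() lists each single character that the class matches.

```python
import re

sample = "x^2 - 9"  # hypothetical sample text

digit_or_caret = re.findall(r"[0-9^]", sample)  # caret is literal when not first
not_digit = re.findall(r"[^0-9]", sample)       # caret first negates the class
digit = re.findall(r"[0-9]", sample)            # hyphen between characters is a range
hyphen_0_9 = re.findall(r"[-09]", sample)       # hyphen first is literal

print(digit_or_caret)
print(not_digit)
print(digit)
print(hyphen_0_9)
```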
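What are some common predefined character classes that are already built in?
\d means digits, same as [0-9]
\s means whitespace (space, tab, newline and similar)
\w means word characters, same as [A-Za-z0-9_]
(These equivalents hold for ASCII text. By default Python 3 also matches Unicode digits and letters in str patterns.)
Each can be negated by using the uppercase letter, e.g.
\D
\S
\W
This is the same as using ^ at the start of the equivalent class, e.g. \D is the same as [^0-9]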
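A brief sketch of the predefined classes on a made-up ASCII string. The + after \w and \W means "one or more", so runs of characters are returned together.

```python
import re

text = "Room 42, floor_3"  # hypothetical sample text

digits = re.findall(r"\d", text)      # individual digits
spaces = re.findall(r"\s", text)      # individual whitespace characters
words = re.findall(r"\w+", text)      # runs of word characters (underscore included)
non_words = re.findall(r"\W+", text)  # runs of everything else

print(digits)
print(spaces)
print(words)
print(non_words)
```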
What does re.match() do?
re.match() is a function from the re module that checks if a regular expression pattern matches at the BEGINNING of a string. If the pattern matches the start of the string, it returns a match object; otherwise, it returns None.
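The difference between re.match() and re.search() shows up clearly on the 'Hello World!' string used earlier. The variable names are only for illustration.

```python
import re

text = "Hello World!"

at_start = re.match(r"Hello", text)      # pattern is at the beginning: match object
not_at_start = re.match(r"World", text)  # pattern occurs later: None
anywhere = re.search(r"World", text)     # search scans the whole string: match object

print(at_start is not None)
print(not_at_start)
print(anywhere.group())
```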
What does re.findall() do?
re.findall() is a function from the re module used to find ALL non-overlapping OCCURRENCES of a pattern in a given string. It returns a list of all the matching substrings, and if no matches are found, it returns an empty list.
re.findall(pattern, string, flags=0)
Parameters:
pattern: The regular expression pattern to match.
string: The string to search through.
flags (optional): Modify the matching behavior, e.g., re.IGNORECASE
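A minimal sketch of re.findall() with and without a flag, using an invented sentence.

```python
import re

text = "Cats chase cats; a cat naps."  # hypothetical sample text

case_sensitive = re.findall(r"cat", text)                   # lowercase 'cat' only
case_insensitive = re.findall(r"cat", text, re.IGNORECASE)  # also finds 'Cat'
no_matches = re.findall(r"dog", text)                       # empty list, not None

print(case_sensitive)
print(case_insensitive)
print(no_matches)
```

Unlike re.search(), the result is always a list, so it can be looped over or passed to len() without checking for None first.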