Lecture 13 Revision: Regular Expressions (Regex) Cont. Flashcards
In Regex what is the wildcard character?
To match ANY character use a full stop (period) .
* The period character matches ONE character,
which can be any character of any type (by default, anything except a newline).
- Example:
– Find any three characters:
r'...'
– Find any line that contains only one character:
r'^.$'
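A minimal sketch of the two wildcard patterns above, assuming Python's built-in re module. The sample strings are made up for illustration.

```python
import re

# '.' matches exactly one character of any kind (except a newline by default).
# findall returns non-overlapping matches, so the trailing 'g' is left over.
three_chars = re.findall(r'...', 'abcdefg')
print(three_chars)  # ['abc', 'def']

# '^.$' only matches when the whole line is a single character.
lines = ['a', 'ab', '', '7']  # hypothetical sample lines
single_char_lines = [line for line in lines if re.search(r'^.$', line)]
print(single_char_lines)  # ['a', '7']
```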
So how do you match a . (full stop / period character)?
Remember that the period (full stop) is a special character, so you need to escape it with a backslash: \.
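To see the difference between the wildcard and a literal full stop, here is a small sketch. The sample text is an assumption chosen so that both cases appear.

```python
import re

text = 'v1.2 or v1x2'  # hypothetical sample text

# Unescaped '.' is the wildcard, so it matches both '.' and 'x'.
wildcard_matches = re.findall(r'v1.2', text)
print(wildcard_matches)  # ['v1.2', 'v1x2']

# Escaped '\.' matches only a literal full stop.
literal_matches = re.findall(r'v1\.2', text)
print(literal_matches)  # ['v1.2']
```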
What is REPETITION?
Can use REPETITION to reduce the size of regex patterns.
Example:
– Match any word made up of exactly three lowercase letters:
r'\b[a-z][a-z][a-z]\b'
Does not scale well: What if I had said 50 characters?
* Using repetition:
r'\b[a-z]{3}\b'
The REPETITION count goes in the curly brackets.
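A quick check that the long form and the repetition form behave the same way. This is only a sketch, and the sample sentence is made up.

```python
import re

text = 'the cat sat on a mat today'  # hypothetical sample sentence

# Writing the character class out three times...
long_form = re.findall(r'\b[a-z][a-z][a-z]\b', text)

# ...is equivalent to one character class with a {3} repetition.
short_form = re.findall(r'\b[a-z]{3}\b', text)

print(long_form)   # ['the', 'cat', 'sat', 'mat']
print(short_form)  # ['the', 'cat', 'sat', 'mat']
```

Scaling to 50 characters then only means changing {3} to {50}.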
What is the syntax for using REPETITION?
- There are multiple ways to specify repetition
– 5 ‘a’ characters
a{5}
– 5 or more ‘a’ characters.
a{5,}
– between 5 and 7 ‘a’ characters
a{5,7}
– a word of between 3 and 5 lowercase characters
r'\b[a-z]{3,5}\b'
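The repetition forms above can be tried out as follows. This is a sketch: the \b word boundaries around the 'a' patterns are an addition (not in the flashcard) so that each run of letters is tested as a whole word, and the sample text is made up.

```python
import re

text = 'aaa aaaaa aaaaaaa aaaaaaaaa'  # runs of 3, 5, 7 and 9 'a' characters

exactly_5 = re.findall(r'\ba{5}\b', text)      # exactly 5
at_least_5 = re.findall(r'\ba{5,}\b', text)    # 5 or more
between_5_7 = re.findall(r'\ba{5,7}\b', text)  # between 5 and 7

print(exactly_5)    # ['aaaaa']
print(at_least_5)   # ['aaaaa', 'aaaaaaa', 'aaaaaaaaa']
print(between_5_7)  # ['aaaaa', 'aaaaaaa']

# Words of between 3 and 5 lowercase letters.
short_words = re.findall(r'\b[a-z]{3,5}\b', 'a to cat house banana')
print(short_words)  # ['cat', 'house']
```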
What is the special shorthand for repetitions that are used a lot?
* means zero or more occurrences. Same as using {0,}
+ means one or more occurrences. Same as using {1,}
? means zero or one occurrence. Same as {0,1}
To match any of these characters literally, escape it with a backslash in front: \* \+ \?
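A small sketch of the three shorthand quantifiers, plus one escaped case. The word list and sample string are assumptions for illustration, and re.fullmatch is used here so that each word has to match the pattern in full.

```python
import re

words = ['ct', 'cat', 'caat']  # zero, one and two 'a' characters

# '*' allows zero or more 'a' characters (same as {0,}).
star = [w for w in words if re.fullmatch(r'ca*t', w)]
print(star)  # ['ct', 'cat', 'caat']

# '+' needs at least one 'a' (same as {1,}).
plus = [w for w in words if re.fullmatch(r'ca+t', w)]
print(plus)  # ['cat', 'caat']

# '?' allows zero or one 'a' (same as {0,1}).
optional = [w for w in words if re.fullmatch(r'ca?t', w)]
print(optional)  # ['ct', 'cat']

# Escaped '\+' matches a literal plus sign instead of acting as a quantifier.
literal_plus = re.findall(r'[0-9]\+[0-9]', '1+2 and 3*4')
print(literal_plus)  # ['1+2']
```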
What are ALTERNATIVES in regex?
ALTERNATIVES provide a way to match one of several patterns. In Python, you can use the pipe symbol (|) to specify alternatives. The | acts as an “OR” operator between patterns.
Alternatives cont.: How could we write a pattern to match cat OR bat?
- r'cat|bat' # using alternation
- r'[cb]at' # using a character class
- r'(c|b)at' # Best one
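The three patterns can be compared side by side, as sketched below with a made-up sample string. One thing to be aware of: when a pattern contains brackets, re.findall() returns the bracketed part rather than the whole match, so this sketch uses re.finditer() and group(0) to get the full matched text each time.

```python
import re

text = 'cat bat rat'  # hypothetical sample text
patterns = [r'cat|bat', r'[cb]at', r'(c|b)at']

results = {}
for pattern in patterns:
    # group(0) is always the whole match, whether or not the pattern has brackets.
    results[pattern] = [m.group(0) for m in re.finditer(pattern, text)]

for pattern, found in results.items():
    print(pattern, found)  # each pattern finds ['cat', 'bat']

# With brackets in the pattern, findall gives back only the bracketed part.
print(re.findall(r'(c|b)at', text))  # ['c', 'b']
```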
Explain this regex pattern:
r'\b(hack|crack)(ing|ed)?\b'
\b matches a word boundary at the start.
(hack|crack) matches either hack or crack.
Then (ing|ed)? matches an optional ing OR ed at the end. Because ing and ed are grouped in brackets, the ? applies to the whole group and makes it optional.
Then \b matches a word boundary at the end.
So this will match either hack, crack, hacking, cracking, cracked or hacked.
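The explanation above can be checked with a short sketch. The sample text is made up, and 'hacker' is included as an assumption-free counterexample: the final \b stops it from matching.

```python
import re

pattern = r'\b(hack|crack)(ing|ed)?\b'
text = 'hack crack hacking cracking hacked cracked hacker'

# group(0) is the whole match, not just the bracketed parts.
matches = [m.group(0) for m in re.finditer(pattern, text)]
print(matches)
# ['hack', 'crack', 'hacking', 'cracking', 'hacked', 'cracked']
```

'hacker' is skipped because after 'hack' the next character is a letter, so there is no word boundary for the final \b.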
So how can we EXTRACT the information we find from regex? I.e. how do we extract the match.
Use match.group()
How do we use match.group()?
Use () to capture parts of the match. Each captured part is stored separately and can be read with match.group()
match = re.search(r'([A-Za-z]+), ([0-9]+)', csv)
The bit in the first brackets is match.group(1), the bit in the second brackets is match.group(2).
name = match.group(1)
number = match.group(2)
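Put together as a runnable sketch. The value of csv is a hypothetical sample line, and the check for a failed search is an addition, since re.search() returns None when nothing matches.

```python
import re

csv = 'Alice, 42'  # hypothetical sample line

match = re.search(r'([A-Za-z]+), ([0-9]+)', csv)

# re.search returns None when there is no match, so check before using it.
if match:
    whole = match.group(0)   # the entire matched text
    name = match.group(1)    # first pair of brackets
    number = match.group(2)  # second pair of brackets
    print(whole)   # Alice, 42
    print(name)    # Alice
    print(number)  # 42
```

Note that number is still a string here; it would need int(number) to be used as a number.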
What is SUBSTITUTION?
- Regular expressions can be used to perform substitution (search & replace), like find and replace in Word.
- Example: Replace occurrences of ‘H’ with ‘h’
text = re.sub(r'H', r'h', text)
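A minimal sketch of re.sub() in use. The sample strings are made up, and the second call is an extra illustration that combines substitution with the + repetition shorthand.

```python
import re

text = 'Hello Hannah'  # hypothetical sample text

# Every occurrence of 'H' is replaced, not just the first one.
text = re.sub(r'H', r'h', text)
print(text)  # hello hannah

# The pattern can be any regex: here, one or more spaces become a single space.
tidy = re.sub(r' +', ' ', 'too    many   spaces')
print(tidy)  # too many spaces
```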
What is the difference between re.search() and re.findall()?
re.search() finds the first occurrence of the pattern and returns a match object (or None if there is no match).
re.findall() finds all non-overlapping occurrences and returns them as a list!
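The difference shows up clearly when both are run on the same text. This is a sketch with a made-up sample string.

```python
import re

text = 'cat bat cat'  # hypothetical sample text

# re.search stops at the first occurrence and returns a match object.
first = re.search(r'[cb]at', text)
print(first.group())  # cat
print(first.start())  # 0

# re.findall returns every non-overlapping occurrence as a list of strings.
all_matches = re.findall(r'[cb]at', text)
print(all_matches)  # ['cat', 'bat', 'cat']

# When nothing matches, search gives None and findall gives an empty list.
no_search = re.search(r'dog', text)
no_findall = re.findall(r'dog', text)
print(no_search, no_findall)  # None []
```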