Python - Regular Expressions Flashcards

1
Q

Which of the following best describes what “\w” matches in a Python regular expression?

— 1 —
It matches any one character that is a space, tab, return, or newline.
— 2 —
It matches any one character that is a letter, digit, or underscore.
— 3 —
It matches any one character that is not the newline character.
— 4 —
It matches any one character that is not a lowercase w.

A

\w

— 2 —
It matches any one character that is a letter, digit, or underscore.
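
A quick check of this answer, as a minimal sketch (the sample string is made up for illustration):

```python
import re

# \w matches one "word" character: a letter, a digit, or an underscore.
sample = "a_1 -!"
word_chars = re.findall(r"\w", sample)
print(word_chars)  # ['a', '_', '1']
```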

2
Q

Which of the following regular expressions would not match the entire string “123”?

--- 1 ---
[123]
--- 2 ---
\d\d\d
--- 3 ---
[0-9][0-9][0-9]
--- 4 ---
123
A

string “123”

— 1 —
[123]
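
The answer depends on whether the whole string has to match. A small sketch of the difference, assuming re.search() for "matches somewhere" and re.fullmatch() for "matches the entire string":

```python
import re

text = "123"

# search() succeeds as soon as any part of the string matches,
# so [123] is satisfied by the single character '1'.
partial = re.search(r"[123]", text)

# fullmatch() requires the pattern to account for the whole string.
whole_set = re.fullmatch(r"[123]", text)
whole_digits = re.fullmatch(r"\d\d\d", text)

print(partial.group())       # 1
print(whole_set)             # None
print(whole_digits.group())  # 123
```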

3
Q

In an ordinary (non-raw) Python string literal, how many backslashes do you need to type so that the regex receives 2 backslashes (the pattern that matches one literal backslash)?

A

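You need 4. In an ordinary string literal each pair of backslashes becomes a single backslash: the first escapes the second, and the third escapes the fourth. The regex engine therefore receives two backslashes, which is the pattern for one literal backslash. With a raw string you only type two: r'\\'.

\\\\

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals.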
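
A minimal sketch of the backslash collision (the path string is a made-up example):

```python
import re

plain = "\\\\"  # four typed backslashes: the string holds two
raw = r"\\"     # raw string, two typed backslashes: the string holds two
print(plain == raw, len(plain))  # True 2

# The two-character pattern \\ matches one literal backslash.
path = "C:\\temp"  # the 7-character string C:\temp
found = re.findall(raw, path)
print(len(found))  # 1
```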

4
Q

What does the following regex mean?

. (Dot)

A

Matches any character EXCEPT a newline (unless the DOTALL flag is set).

5
Q

What does the following regex mean?

^ (Caret)

A

Matches the start of the string and, in MULTILINE mode, also immediately after each newline.

6
Q

What does the following regex mean?

$

A

Matches the end of the string or just before the newline at the end of the string
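
The "just before the newline" case matters when reading lines from a file. A short sketch, using \Z (which matches only at the very end of the string) for contrast:

```python
import re

line = "words\n"

# $ matches at the very end and also just before a trailing newline.
m = re.search(r"words$", line)

# \Z matches only at the very end of the string.
z = re.search(r"words\Z", line)

print(m.group())  # words
print(z)          # None
```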

7
Q

What does the following regex mean?

*

A

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.

ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s

8
Q

What does the following regex mean?

+

A

Causes the resulting RE to match 1 or more repetitions of the preceding RE, as many repetitions as are possible.

9
Q

What does the following regex mean?

?

A

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.

ab? will match either ‘a’ or ‘ab’.
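
The three quantifiers *, +, and ? can be compared side by side. A small sketch (the candidate strings are illustrative):

```python
import re

candidates = ["a", "ab", "abbb"]

# fullmatch() is used so the whole candidate must fit the pattern.
star = [s for s in candidates if re.fullmatch(r"ab*", s)]
plus = [s for s in candidates if re.fullmatch(r"ab+", s)]
optional = [s for s in candidates if re.fullmatch(r"ab?", s)]

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab']
```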

10
Q

What does the following regex mean?

*?, +?, ??

A

The ‘*’, ‘+’, and ‘?’ qualifiers are all greedy; they match as much text as possible.

Sometimes this behaviour isn’t desired; if the RE <.*> is matched against ‘<a> b <c>’, it will match the entire string, and not just ‘<a>’.

Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Using the RE <.*?> will match only ‘<a>’.

?? is lazy while ? is greedy.

Given (pattern)??, it will first test for empty string, then if the rest of the pattern can’t match, it will test for pattern.

In contrast, (pattern)? will test for pattern first, then it will test for empty string on backtrack.

The difference is in the order of searching:

“toys?2” searches for toys2, then toy2
“toys??2” searches for toy2, then toys2

For these two patterns, though, the result is the same regardless of the input string, since the 2 that follows (after s? or s??) must be matched either way.
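
Greedy versus non-greedy matching, as a minimal sketch using the tag-like string from the Python documentation's example:

```python
import re

text = "<a> b <c>"

greedy = re.match(r"<.*>", text).group()   # takes as much as possible
lazy = re.match(r"<.*?>", text).group()    # takes as little as possible
print(greedy)  # <a> b <c>
print(lazy)    # <a>

# With a required character after the optional one, ? and ?? agree.
same_a = re.search(r"toys?2", "toys2").group()
same_b = re.search(r"toys??2", "toys2").group()
print(same_a, same_b)  # toys2 toys2
```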

11
Q

What does the following regex mean?

{some_number}

{m}

A

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match.

For example, a{6} will match exactly six ‘a’ characters, but not five.

12
Q

What does the following regex mean?

{m,n}

A

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.

For example, a{3,5} will match from 3 to 5 ‘a’ characters.

Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound.

As an example, a{4,}b will match ‘aaaab’ or a thousand ‘a’ characters followed by a ‘b’, but not ‘aaab’.

The comma may not be omitted or the modifier would be confused with the previously described form.

13
Q

What does the following regex mean?

{m,n}?

A

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as FEW repetitions as possible.

This is the non-greedy version of the previous qualifier.

For example, on the 6-character string ‘aaaaaa’, a{3,5} will match 5 ‘a’ characters, while a{3,5}? will only match 3 characters.
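
The same example, run as a short sketch:

```python
import re

text = "aaaaaa"  # six 'a' characters

greedy = re.match(r"a{3,5}", text).group()
lazy = re.match(r"a{3,5}?", text).group()

print(greedy)  # aaaaa
print(lazy)    # aaa
```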

14
Q

What does the following regex mean?

\

A

Either escapes special characters (permitting you to match characters like ‘*’, ‘?’, and so forth), or signals a special sequence; special sequences are discussed below.

If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.

15
Q

What does the following regex mean?

[ ]

A

Used to indicate a set of characters. In a set:

  • Characters can be listed individually, e.g. [amk] will match ‘a’, ‘m’, or ‘k’.
  • Ranges of characters can be indicated by giving two characters and separating them by a ‘-’, for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal ‘-’.
  • Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters ‘(’, ‘+’, ‘*’, or ‘)’.
  • Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
  • Characters that are not within a range can be matched by complementing the set. If the first character of the set is ‘^’, all the characters that are not in the set will be matched. For example, [^5] will match any character except ‘5’, and [^^] will match any character except ‘^’. ^ has no special meaning if it’s not the first character in the set.
  • To match a literal ‘]’ inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a literal ‘]’ as well as parentheses, ‘[’, and braces.
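
A few of the set rules above, as a small sketch (the input strings are invented for illustration):

```python
import re

# An escaped '-' inside a set is a literal hyphen, not a range.
hyphen = re.findall(r"[a\-z]", "a-z b")
print(hyphen)  # ['a', '-', 'z']

# Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")
print(specials)  # ['(', '+', ')', '*']

# A leading ^ complements the set.
not_five = re.findall(r"[^5]", "1525")
print(not_five)  # ['1', '2']

# Two ways to include a literal ']' in a set.
escaped = re.findall(r"[()[\]{}]", "f(x)[0]{}")
leading = re.findall(r"[]()[{}]", "f(x)[0]{}")
print(escaped == leading)  # True
```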
16
Q

What does the following regex mean?

|

A

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.

An arbitrary number of REs can be separated by the ‘|’ in this way. This can be used inside groups (see below) as well.

As the target string is scanned, REs separated by ‘|’ are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the ‘|’ operator is never greedy.

To match a literal ‘|’, use \|, or enclose it inside a character class, as in [|].
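
The left-to-right rule is easy to see with two branches that share a prefix. A minimal sketch:

```python
import re

# Branches are tried left to right; the first one that works wins,
# even when a later branch would match more text.
first = re.match(r"a|ab", "ab").group()
second = re.match(r"ab|a", "ab").group()
print(first)   # a
print(second)  # ab

# Two ways to match a literal '|'.
escaped = re.findall(r"\|", "x|y")
in_class = re.findall(r"[|]", "x|y")
print(escaped, in_class)  # ['|'] ['|']
```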

17
Q

What does the following regex mean?

(...)

A

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.

The contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.

To match the literals ‘(’ or ‘)’, use \( or \), or enclose them inside a character class: [(], [)].
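
Groups and the \number backreference, as a short sketch (the sentence is an invented example):

```python
import re

# (\w+) captures a word; \1 requires the same text to appear again.
m = re.search(r"(\w+) \1", "it was the the best")
print(m.group())   # the the
print(m.group(1))  # the

# Escaped parentheses match literal '(' and ')'.
literal = re.findall(r"\(\d+\)", "f(12) g(x)")
print(literal)  # ['(12)']
```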

18
Q

What does the following regex mean?

(?=…)

A

Matches if … matches next, but doesn’t consume any of the string.

This is called a lookahead assertion.

For example, Isaac (?=Asimov) will match ‘Isaac ‘ only if it’s followed by ‘Asimov’.

19
Q

What does the following regex mean?

(?!…)

A

Matches if … doesn’t match next.

This is a negative lookahead assertion.

For example, Isaac (?!Asimov) will match ‘Isaac ‘ only if it’s not followed by ‘Asimov’.
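
Both lookahead forms, as a minimal sketch. Note that the text inside the lookahead is checked but not included in the match:

```python
import re

positive = re.compile(r"Isaac (?=Asimov)")
negative = re.compile(r"Isaac (?!Asimov)")

hit = positive.search("Isaac Asimov")
print(repr(hit.group()), hit.end())  # 'Isaac ' 6

miss = positive.search("Isaac Newton")
print(miss)  # None

other = negative.search("Isaac Newton")
print(repr(other.group()))  # 'Isaac '

blocked = negative.search("Isaac Asimov")
print(blocked)  # None
```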

20
Q
# This program tries to use a regular expression
# to identify all words in the sequence below that
# contain the character sequence 'an'. See if you
# can fix it!
word_sequence = ('apple', 'banana', 'orange')
for word in word_sequence:
    match = re.search(r'an', word)
    if match is not None:
        print(word)
A

import re

word_sequence = ('apple', 'banana', 'orange')
for word in word_sequence:
    match = re.search(r'an', word)
    if match is not None:
        print(word)
21
Q
# This program tries to use a regular expression
# to identify all words that start with the letter
# 'a'. Instead, it identifies all three words in
# the sequence below as words that contain the letter
# 'a' somewhere. See if you can change it so it only
# finds words that start with 'a'!

import re

word_sequence = ('apple', 'banana', 'orange')
for word in word_sequence:
    match = re.search(r'a.*', word)
    if match is not None:
        print(word)
A

import re

word_sequence = ('apple', 'banana', 'orange')
for word in word_sequence:
    match = re.search(r'^a.*', word)
    if match is not None:
        print(word)
22
Q

Write a program that outputs only the words in /usr/share/dict/words that start with the letters “ply”. It should output the words in order, each on their own line.

A

import re

with open('/usr/share/dict/words') as f:
    for word in f:
        match = re.search(r'^ply', word)
        if match is not None:
            # Each line already ends with a newline, so don't add another.
            print(word, end='')

23
Q

Write a program that outputs only the words in /usr/share/dict/words that have exactly five characters and start and end with a lowercase “o”. It should output the words in order, each on their own line.

A

import re

with open('/usr/share/dict/words', 'r') as f:
    for word in f:
        match = re.search(r'^o.{3}o$', word)
        if match is not None:
            # Each line already ends with a newline, so don't add another.
            print(word, end='')

24
Q

What does the following regex mean?

\w

A

matches upper and lower case letters, digits, and underscore

25
Q

What does the following regex mean?

\s

A

matches one whitespace character (space, tab, newline, carriage return, form feed, or vertical tab)

26
Q

What does the following regex mean?

\d

A

matches any digit 0-9

27
Q

Suppose you have the following regular expression:

^ap.+l.$
Which of the words below would that expression match?

apple
aplenty
pal
pals
plenty
A

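apple

Only “apple” matches: the word must start with “ap”, then have at least one character, then an “l”, then exactly one final character. “aplenty” has no “l” after that extra character, and “pal”, “pals”, and “plenty” do not start with “ap”.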

28
Q

Which of the following regular expressions matches only the first four of these words?

bowls
books
bogus
bossy
bistro
bison
bakery
obligates
__________

^bo+s

^bo+.*s

^b.[os]+.

^bos+.

A

^bo+.*s

29
Q

What does the following regex mean?

\W

A

matches NOT a word char (NOT a letter, digit, or underscore)

30
Q
# This program tries to find only six-letter words in
# the list of words declared below. See if you can fix 
# it!

import re

regex = re.compile(r'.{6}$')
words = ('apple', 'banana', 'orange', 'grapefruit')

for word in words:
    if regex.search(word):
        print(word)

A

import re

regex = re.compile(r'^.{6}$')
words = ('apple', 'banana', 'orange', 'grapefruit')

for word in words:
    if regex.search(word):
        print(word)

31
Q
# This program tries to identify which of the strings
# in a sequence has multiple instances of the letter
# 'e' in a row. It then tries to print just those 
# repetitions of the letter 'e'. See if you can fix
# it!

import re

strings = ('knee', 'meet', 'eeeeeeee', 'set')

# Hint: How do you make the set of instances of the
# letter 'e' a group?
regex = re.compile(r'e{2,}')
for string in strings:
    result = regex.search(string)
    if result is not None:
        print(result.group(1))
A

import re

strings = ('knee', 'meet', 'eeeeeeee', 'set')

# Hint: How do you make the set of instances of the
# letter 'e' a group?
regex = re.compile(r'(e{2,})')
for string in strings:
    result = regex.search(string)
    if result is not None:
        print(result.group(1))
32
Q

Write a program that outputs only the words in /usr/share/dict/words that start with the letters “ply” and have at most six characters. It should output the words in order, each on their own line.

A

import re

with open('/usr/share/dict/words', 'r') as f:
    for word in f:
        result = re.search(r'^ply.{0,3}$', word)
        if result is not None:
            # Each line already ends with a newline, so don't add another.
            print(word, end='')

33
Q

Write a program that outputs only the words in /usr/share/dict/words that start with a lowercase u, have at least five instances of a lowercase u, and end in a lowercase s. It should output the words in order, each on their own line.

A

import re

with open('/usr/share/dict/words', 'r') as f:
    for word in f:
        # .* between the u's lets them appear anywhere, not only two apart.
        result = re.search(r'^u.*u.*u.*u.*u.*s$', word)
        if result is not None:
            print(word, end='')

34
Q

What does the search() method return?

A

It returns a match object for the first location where the RE matches, or None if no position in the string matches.
(It finds only the FIRST occurrence of the RE in the string.)

You can either do:

1)
pattern = re.compile(r'my_reg_expression')
result = pattern.search(some_string)
if result is not None:
    print(some_string)

2)
result = re.search(r'my_reg_expression', some_string)
if result is not None:
    print(some_string)

35
Q

What does the compile() function return?

A

It returns a compiled regular expression (pattern) object.
(Compiling once saves processing time when the same pattern is reused.)

You can do:

pattern = re.compile(r'my_reg_expression')
result = pattern.search(some_string)
if result is not None:
    print(some_string)

36
Q

What does the match() method return?

A

It determines whether the RE matches at the beginning of the string.

It returns None if no match can be found. If the match succeeds, a match object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.
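
The difference between match() and search(), as a minimal sketch:

```python
import re

word = "banana"

# match() only tries at position 0; search() scans the whole string.
at_start = re.match(r"an", word)
anywhere = re.search(r"an", word)
prefix = re.match(r"ba", word)

print(at_start)         # None
print(anywhere.span())  # (1, 3)
print(prefix.group())   # ba
```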

37
Q

What does the findall() method return?

A

It finds all non-overlapping substrings where the RE matches and returns them as a list of strings (or a list of tuples if the RE contains more than one group).

38
Q

What does the finditer() method return?

A

It finds all non-overlapping substrings where the RE matches and returns them as an iterator of match objects.
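
A side-by-side sketch of findall() and finditer() (the sample text is made up):

```python
import re

text = "cat bat rat"

# findall() returns the matched strings directly.
words = re.findall(r"\wat", text)
print(words)  # ['cat', 'bat', 'rat']

# With more than one group, findall() returns tuples of the groups.
pairs = re.findall(r"(\w)(at)", text)
print(pairs)  # [('c', 'at'), ('b', 'at'), ('r', 'at')]

# finditer() yields match objects, so positions are available too.
spans = [(m.group(), m.start()) for m in re.finditer(r"\wat", text)]
print(spans)  # [('cat', 0), ('bat', 4), ('rat', 8)]
```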

39
Q

Write a simple program to display the number of times the regex has matched:

import re

regex = re.compile(r"your pattern here…*")

A

import re

regex = re.compile(r"your pattern here…*")

matches = regex.findall("the contents…")

# Display the count rather than just computing it.
print(len(matches))

40
Q

What does the following regex do?

\b

A

Matches the word boundary

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters.

This means that r'\bfoo\b' matches ‘foo’, ‘foo.’, ‘(foo)’, ‘bar foo baz’ but not ‘foobar’ or ‘foo3’.
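
The same six strings, checked in a short sketch:

```python
import re

pattern = re.compile(r"\bfoo\b")
samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]

hits = [s for s in samples if pattern.search(s)]
print(hits)  # ['foo', 'foo.', '(foo)', 'bar foo baz']
```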

41
Q

Write the regular expression that matches the user input below:

user_word = input('Enter word: ')
pattern = re.compile( ...your answer here... )

Your search must be case-insensitive. (This means if you are looking up the word “apple”, then “apple” and “Apple” both count.)

Your search must not count words that contain your word. (This means if you are looking up the word “apple”, then “applesauce” does not count.)

Note that this also includes possessives! So if you are looking for the word “Jonathan”, then “Jonathan’s” does not count. If you are looking for “Jonathan’s”, on the other hand, “Jonathan’s” should count, but not “Jonathan” or “Jonathans”.

Your search must count instances of a word that have trailing punctuation marks. (This means if you are looking up the word “apple” and it occurs at the end of a sentence as “apple.”, that occurrence should still count.)

A

r'\b['
+ user_word[0].lower()
+ user_word[0].upper()
+ r']'
+ user_word[1:]
+ r"([^A-Za-z0-9\-']|\s|$)"
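
An alternative sketch, not the card's own answer: it assumes re.IGNORECASE for full case-insensitivity, re.escape() so the user's word is treated literally, and lookarounds so the surrounding characters are checked without being consumed. The helper name count_word is hypothetical, and only the straight apostrophe (') is handled.

```python
import re

def count_word(user_word, text):
    """Hypothetical helper: count standalone occurrences of user_word."""
    pattern = re.compile(
        # Not preceded or followed by a word character, apostrophe, or hyphen.
        r"(?<![\w'-])" + re.escape(user_word) + r"(?![\w'-])",
        re.IGNORECASE,
    )
    return len(pattern.findall(text))

print(count_word("apple", "Apple pie, applesauce, and an apple."))  # 2
print(count_word("Jonathan", "Jonathan's book, Jonathan."))         # 1
print(count_word("Jonathan's", "Jonathan's Jonathan Jonathans"))    # 1
```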

42
Q

What does the following regex do?

r'(a|b)'

A

Matches the character a or the character b.

Note that ‘|’ works without parentheses (a|b is valid on its own); the parentheses limit how far the alternation extends, so x(a|b)y means xay or xby, while xa|by means xa or by.