Data cleaning Flashcards

1
Q

Use list slicing to remove the column names (the first row) from the moma list of lists.

A

moma = moma[1:]
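
A minimal sketch of the slicing step. The rows in `sample` are a made-up stand-in for moma, not values from the real data set:

```python
# Hypothetical stand-in for the moma list of lists;
# the first row holds the column names.
sample = [
    ["Title", "Artist", "Nationality"],
    ["Work A", "Artist A", "(American)"],
    ["Work B", "Artist B", "(French)"],
]

# Slicing from index 1 onward returns a new list without the header row.
header = sample[0]
sample = sample[1:]

print(header)
print(len(sample))  # 2
```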

2
Q

We have created a variable, age1, containing the string “I am thirty-one years old”.

Use the str.replace() method to create a new string, age2:
The new string should have the value “I am thirty-two years old”.

A

age2 = age1.replace("one", "two")
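
str.replace() returns a new string and leaves the original untouched, which is why the result has to be assigned to age2. A short sketch of that behavior:

```python
age1 = "I am thirty-one years old"

# str.replace() returns a new string; age1 itself is not modified.
age2 = age1.replace("one", "two")

print(age1)  # I am thirty-one years old
print(age2)  # I am thirty-two years old
```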

3
Q

Use a for loop to loop over the moma list of lists. In each iteration of the loop:

Clean the Nationality column of the data set by:
Assigning the nationality for each row (found at list index 2 of the row) to a variable.
Using the str.replace() method to remove the opening parenthesis character: (
Using the str.replace() method to remove the closing parenthesis character: )
Assigning the cleaned value back to list index 2 of the row.
Clean the Gender column of the data set (found at index 5 of the row) by repeating the same technique you used for the Nationality column.

A
for row in moma:
    # remove parentheses from the nationality column
    nationality = row[2]
    nationality = nationality.replace("(","")
    nationality = nationality.replace(")","")
    row[2] = nationality
    # remove parentheses from the gender column
    gender = row[5]
    gender = gender.replace("(","")
    gender = gender.replace(")","")
    row[5] = gender
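
Since both columns get the same treatment, the repeated replace calls could be moved into a small helper. This is only a sketch: `remove_parentheses` is a hypothetical name that does not appear in the original exercise, and `rows` is made-up sample data laid out like moma.

```python
def remove_parentheses(value):
    """Return value with every "(" and ")" removed."""
    return value.replace("(", "").replace(")", "")

# Hypothetical rows: Nationality at index 2, Gender at index 5.
rows = [
    ["Title A", "Artist A", "(American)", "(1900)", "(1980)", "(Male)"],
    ["Title B", "Artist B", "(French)", "(1910)", "(1990)", "(Female)"],
]

for row in rows:
    row[2] = remove_parentheses(row[2])
    row[5] = remove_parentheses(row[5])

print(rows[0][2], rows[0][5])  # American Male
```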
4
Q

Use a loop to iterate over all rows in the moma list of lists. For each row:

Clean the Gender column.
Assign the value from the Gender column, at index 5, to a variable.
Make the changes to the value of that variable.
Use the str.title() method to make the capitalization uniform.
Use an if statement to check if the value is an empty string. If the value is an empty string, give it the value “Gender Unknown/Other”.

A
for row in moma:
    # fix the capitalization and missing
    # values for the gender column
    gender = row[5]
    gender = gender.title()
    if not gender:
        gender = "Gender Unknown/Other"
    row[5] = gender
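
To see what the two steps do, here is a small sketch run on made-up Gender values (the `raw_values` list is hypothetical, not taken from the data set):

```python
# Hypothetical raw Gender values with inconsistent case and one missing value.
raw_values = ["male", "Male", "female", ""]

cleaned = []
for gender in raw_values:
    gender = gender.title()
    # An empty string is falsy, so "not gender" is True only for missing values.
    if not gender:
        gender = "Gender Unknown/Other"
    cleaned.append(gender)

print(cleaned)
# ['Male', 'Male', 'Female', 'Gender Unknown/Other']
```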
5
Q

Clean the Nationality column of the data set (found at index 2) by repeating the same technique you used for the Gender column.
For missing values in the Nationality column, use the string “Nationality Unknown”.

A
for row in moma:
    # fix the capitalization and missing
    # values for the nationality column
    nationality = row[2]
    nationality = nationality.title()
    if not nationality:
        nationality = "Nationality Unknown"
    row[2] = nationality
6
Q

We have provided the clean_and_convert() function, which uses an if statement to skip empty strings.

Use a for loop to iterate over each row in the moma list of lists. In each iteration:

Assign the BeginDate and EndDate values (at indexes 3 and 4) to variables.
Use the clean_and_convert() function to clean and convert each value.
Assign the converted values back to indexes 3 and 4 so the cleaned values are used in the moma list of lists.

def clean_and_convert(date):
    # check that we don't have an empty string
    if date != "":
        # move the rest of the function inside
        # the if statement
        date = date.replace("(", "")
        date = date.replace(")", "")
        date = int(date)
    return date
A

for row in moma:
    birth_date = row[3]
    death_date = row[4]

    birth_date = clean_and_convert(birth_date)
    death_date = clean_and_convert(death_date)

    row[3] = birth_date
    row[4] = death_date
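
A quick way to check the loop is to run it on a couple of made-up rows. The sketch below repeats clean_and_convert() from the question so it runs on its own; the `rows` values are hypothetical.

```python
def clean_and_convert(date):
    # Same logic as in the question: empty strings are returned unchanged.
    if date != "":
        date = date.replace("(", "")
        date = date.replace(")", "")
        date = int(date)
    return date

# Hypothetical rows: BeginDate at index 3, EndDate at index 4.
rows = [
    ["Title A", "Artist A", "American", "(1900)", "(1980)"],
    ["Title B", "Artist B", "French", "(1950)", ""],
]

for row in rows:
    row[3] = clean_and_convert(row[3])
    row[4] = clean_and_convert(row[4])

print(rows[0][3], rows[0][4])  # 1900 1980
print(repr(rows[1][4]))        # ''
```

Note that a missing date stays an empty string, so the column ends up holding a mix of integers and empty strings.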
7
Q

Create a function called strip_characters(), which accepts a string argument and:
Iterates over the bad_chars list, using str.replace() to remove each character.
Returns the cleaned string.

test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

A
bad_chars = ["(",")","c","C",".","s","'", " "]
def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
    return string
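
The function can be checked against the test data from the question. This sketch assumes the apostrophe in "c. 1970's" is a plain straight apostrophe, matching the one in bad_chars:

```python
test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(string):
    # Remove every occurrence of each unwanted character in turn.
    for char in bad_chars:
        string = string.replace(char, "")
    return string

stripped_test_data = [strip_characters(d) for d in test_data]
print(stripped_test_data)
```

Dashes are deliberately not in bad_chars, so ranges such as "1913-1923" survive for a later step to handle.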
8
Q

The stripped_test_data list, strip_characters() function and bad_chars list are provided for you so you don’t have to toggle between screens to remember what they look like.

Create a function called process_date(), which accepts a string and follows this logic:
Checks if the dash character (-) is in the string so we know if it’s a range or not.
If it is a range:
Splits the string into two strings, before and after the dash character.
Converts the two numbers to the integer type and then averages them by adding them together and dividing by two.
Uses the round() function to round the average, so values like 1964.5 become 1964.
If it isn’t a range:
Converts the value to an integer type.
Finally, returns the value.
Create an empty list processed_test_data.
Loop over the stripped_test_data list using your process_date() function. Process the dates and append each processed date to the processed_test_data list.
Once your code works with the test data, you can then iterate over the moma list of lists. In each iteration:
Assign the value from the Date column (index 6) to a variable.
Use the strip_characters() function to remove any bad characters.
Use the process_date() function to convert the date.
Assign the stripped and processed value back to the row.

test_data = ["1912", "1929", "1913-1923",
             "(1951)", "1994", "1934",
             "c. 1915", "1995", "c. 1912",
             "(1988)", "2002", "1957-1959",
             "c. 1955.", "c. 1970's",
             "C. 1990-1999"]

bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]

def strip_characters(string):
    for char in bad_chars:
        string = string.replace(char,"")
    return string

stripped_test_data = ['1912', '1929', '1913-1923',
                      '1951', '1994', '1934',
                      '1915', '1995', '1912',
                      '1988', '2002', '1957-1959',
                      '1955', '1970', '1990-1999']

A
def process_date(date):
    if "-" in date:
        split_date = date.split("-")
        date_one = split_date[0]
        date_two = split_date[1]
        date = (int(date_one) + int(date_two)) / 2
        date = round(date)
    else:
        date = int(date)
    return date

processed_test_data = []

for d in stripped_test_data:
    date = process_date(d)
    processed_test_data.append(date)

for row in moma:
    date = row[6]
    date = strip_characters(date)
    date = process_date(date)
    row[6] = date
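
As a check, the sketch below runs a slightly condensed version of process_date() on the stripped test data. The tuple unpacking of split("-") is a rewrite for brevity and assumes each range contains exactly one dash. It also shows why 1964.5 becomes 1964: Python 3's round() rounds exact halves to the nearest even integer.

```python
def process_date(date):
    if "-" in date:
        # Assumes exactly one dash, as in the test data.
        start, end = date.split("-")
        date = round((int(start) + int(end)) / 2)
    else:
        date = int(date)
    return date

stripped_test_data = ["1912", "1929", "1913-1923",
                      "1951", "1994", "1934",
                      "1915", "1995", "1912",
                      "1988", "2002", "1957-1959",
                      "1955", "1970", "1990-1999"]

processed_test_data = [process_date(d) for d in stripped_test_data]
print(processed_test_data)

# round() sends exact halves to the nearest even integer.
print(round(1964.5))  # 1964
print(round(1965.5))  # 1966
```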