Cleaning Data in Python Flashcards

Question 1

Q

Print first 5 rows of DataFrame

Print last 5 rows of DataFrame

Answer

A

print(dataframename.head())

print(dataframename.tail())

Question 2

Q

Get a list of columns of the DataFrame

Answer

A

dataframename.columns

Question 3

Q

Check dimensions of a DataFrame

Answer

A

dataframename.shape

Question 4

Q

Get information about DataFrame

Answer

A

dataframename.info()

#Shows number of rows and columns, column names, number of non-missing values in column, and type of data in each column

Question 5

Q

Count frequency in a DataFrame column (2 equivalent ways)

Answer

A

dataframename. column1.value_counts(dropna=False)
* #Can use this one if column1 name doesn’t contain any special characters or spaces and it’s not the name of a python function*

dataframename[‘column1’].value_counts(dropna=False)

#dropna=False will make it count number of missing values as well

Question 6

Q

chain together value_counts() and head() methods

Answer

A

dataframename.column1.value_counts(dropna=False).head()

Question 7

Q

Summary statistics for a DataFrame

Answer

A

dataframename.describe()

# Will produce summary statistics of numeric data including count (number of non-missing values), mean, std, min, 25%, 50% (median), 75%, max

Question 8

Q

Make a histogram from a DataFrame column

Answer

A

import matplotlib.pyplot as plt

dataframename. columnname.plot(kind=’hist’)
plt. show()

Question 9

Q

Make a boxplot from a DataFrame column

Answer

A

dataframename. boxplot(column=’columnname’, by=’groupingvariable’)
plt. show()

Question 10

Q

Make a scatter plot from two DataFrame columns

Answer

A

dataframe.plot(kind=’scatter’, x=column1, y=column2)

Question 11

Q

What are the 3 principles of Tidy Data

Answer

A

Columns represent separate variables
Rows represent individual observations
Observational units form tables

Question 12

Q

Melt together two columns of a DataFrame

Answer

A

pd.melt(frame=datatframename, id_vars=’column1’, value_vars=[‘column2, ‘column3’], var_name=’columnname’, value_name=’valuesname’)

# id_vars are the columns that you want to keep the same. Note that for multiple id_vars need to make a list[]*
# value_vars are the columns that you want to melt. If none specified, all columns (except ones in id_vars) will melt*
# var_name is the name of the melted column#value_name is the name for the column of values*

Question 13

Q

Pivot one column into two columns

Answer

A

new_df = dataframename.pivot(index=’column1’, columns=’column2’, values=’column3’)

# index – columns that want to keep the same. For multiple entries, make a list[]*
# columns – column that want to pivot into separate columns.*

Question 14

Q

Pivot one column into two columns when there are duplicate values for a variable you’re trying to pivot. Aggregate the duplicate values by taking their average

Answer

A

new_df = dataframename.pivot_table(index=’column1’, columns=’column2’, values=’column3’, aggfunc=np.mean)

# index –columns that want to keep the same. For multiple entries, make a list[]*
# columns - column we want to pivot into separate columns*
# values – values that will be used to fill the columns after pivoting*
# aggfunc – tells python how to handle duplicate values (np.mean is the default)*

Question 15

Q

Flatten the columns of a pivoted DataFrame

Answer

A

dataframename.reset_index()

Question 16

Q

What string method can you use to split a string at a delimiter?

Answer

A

.split(‘delimiter’)

Default delimiter is space

Question 17

Q

Which method can you use to retrieve a a value from a list?

Answer

A

listname.str.get(index)

Question 18

Q

How do you concatenate rows of data?

Answer

A

concatenated = pd.concat([dataframe1, dataframe2])

Question 19

Q

How do you concatenate columns of data?

Answer

A

concatenated = pd.concat([dataframe1, dataframe2], axis=1)

#axis = 1 defines column-wise concatenation. Default is axis=0

Question 20

Q

How do you merge data in cases where concatenating won’t work?

Answer

A

left_on and right_on define the common columns (keys) between the two dataframes

pd.merge(left=dataframe1, right=dataframe2, on=None, left_on=’keydf1’, right_on=’keydf2’)

# on= if the common columns have same name, can use on= to specify. If names are different, can omit this

Question 21

Q

How do you view the datatypes of the columns in a DataFrame?

Answer

A

print(dataframename.dtypes)

Question 22

Q

How do you convert a DataFrame column into string (object dtype)

Answer

A

dataframename[‘column1’] = dataframename[‘column1’].astype(str)

Question 23

Q

How to you convert a DataFrame column into category dtype?

Why might it be beneficial to use this data type?

Answer

A

dataframename[‘column1’] = dataframename[‘column1].astype(‘category’)

Saves memory

Question 24

Q

How do you convert a column containing missing values described by non-numeric characters into numeric data?

Answer

A

dataframename[‘column1’] = pd.to_numeric(dataframename[‘column1’], errors = ‘coerce’)

#errors = ‘coerce’ tells python to turn invalid values (e.g. a dash) to missing (NaN)

Question 25

Q

How do you use regular expressions to match the pattern of a string?

Answer

A

import re

pattern = re.compile(‘pattern’)

result = pattern.match(‘string’)

bool(result) outputs True

or

re.match(pattern=’pattern’, string=’string’)

Question 26

Q

How do you extract multiple numbers from a string using regular expressions?

Answer

A

matches = re.findall(‘pattern’, ‘string’)

Question 27

Q

How do you apply a function to all columns of a DataFrame?

Answer

A

dataframename.apply(functionname,axis=0)

Question 28

Q

How do you apply a function to all rows of a DataFrame?

Answer

A

dataframename.apply(functionname,axis=1)