Cleaning Data in Python Flashcards
Print first 5 rows of DataFrame
Print last 5 rows of DataFrame
print(dataframename.head())
print(dataframename.tail())
Get a list of columns of the DataFrame
dataframename.columns
Check dimensions of a DataFrame
dataframename.shape
Get information about DataFrame
dataframename.info()
#Shows number of rows and columns, column names, number of non-missing values in column, and type of data in each column
Count frequency in a DataFrame column (2 equivalent ways)
dataframename. column1.value_counts(dropna=False)
* #Can use this one if column1 name doesn’t contain any special characters or spaces and it’s not the name of a python function*
dataframename[‘column1’].value_counts(dropna=False)
#dropna=False will make it count number of missing values as well
chain together value_counts() and head() methods
dataframename.column1.value_counts(dropna=False).head()
Summary statistics for a DataFrame
dataframename.describe()
# Will produce summary statistics of numeric data including count (number of non-missing values), mean, std, min, 25%, 50% (median), 75%, max
Make a histogram from a DataFrame column
import matplotlib.pyplot as plt
dataframename. columnname.plot(kind=’hist’)
plt. show()
Make a boxplot from a DataFrame column
dataframename. boxplot(column=’columnname’, by=’groupingvariable’)
plt. show()
Make a scatter plot from two DataFrame columns
dataframe.plot(kind=’scatter’, x=column1, y=column2)
What are the 3 principles of Tidy Data
- Columns represent separate variables
- Rows represent individual observations
- Observational units form tables
Melt together two columns of a DataFrame
pd.melt(frame=datatframename, id_vars=’column1’, value_vars=[‘column2, ‘column3’], var_name=’columnname’, value_name=’valuesname’)
- # id_vars are the columns that you want to keep the same. Note that for multiple id_vars need to make a list[]*
- # value_vars are the columns that you want to melt. If none specified, all columns (except ones in id_vars) will melt*
- # var_name is the name of the melted column#value_name is the name for the column of values*
Pivot one column into two columns
new_df = dataframename.pivot(index=’column1’, columns=’column2’, values=’column3’)
- # index – columns that want to keep the same. For multiple entries, make a list[]*
- # columns – column that want to pivot into separate columns.*
Pivot one column into two columns when there are duplicate values for a variable you’re trying to pivot. Aggregate the duplicate values by taking their average
new_df = dataframename.pivot_table(index=’column1’, columns=’column2’, values=’column3’, aggfunc=np.mean)
- # index –columns that want to keep the same. For multiple entries, make a list[]*
- # columns - column we want to pivot into separate columns*
- # values – values that will be used to fill the columns after pivoting*
- # aggfunc – tells python how to handle duplicate values (np.mean is the default)*
Flatten the columns of a pivoted DataFrame
dataframename.reset_index()
What string method can you use to split a string at a delimiter?
.split(‘delimiter’)
Default delimiter is space
Which method can you use to retrieve a a value from a list?
listname.str.get(index)
How do you concatenate rows of data?
concatenated = pd.concat([dataframe1, dataframe2])
How do you concatenate columns of data?
concatenated = pd.concat([dataframe1, dataframe2], axis=1)
#axis = 1 defines column-wise concatenation. Default is axis=0
How do you merge data in cases where concatenating won’t work?
left_on and right_on define the common columns (keys) between the two dataframes
pd.merge(left=dataframe1, right=dataframe2, on=None, left_on=’keydf1’, right_on=’keydf2’)
# on= if the common columns have same name, can use on= to specify. If names are different, can omit this
How do you view the datatypes of the columns in a DataFrame?
print(dataframename.dtypes)
How do you convert a DataFrame column into string (object dtype)
dataframename[‘column1’] = dataframename[‘column1’].astype(str)
How to you convert a DataFrame column into category dtype?
Why might it be beneficial to use this data type?
dataframename[‘column1’] = dataframename[‘column1].astype(‘category’)
Saves memory
How do you convert a column containing missing values described by non-numeric characters into numeric data?
dataframename[‘column1’] = pd.to_numeric(dataframename[‘column1’], errors = ‘coerce’)
#errors = ‘coerce’ tells python to turn invalid values (e.g. a dash) to missing (NaN)
How do you use regular expressions to match the pattern of a string?
import re
pattern = re.compile(‘pattern’)
result = pattern.match(‘string’)
bool(result) outputs True
or
re.match(pattern=’pattern’, string=’string’)
How do you extract multiple numbers from a string using regular expressions?
matches = re.findall(‘pattern’, ‘string’)
How do you apply a function to all columns of a DataFrame?
dataframename.apply(functionname,axis=0)
How do you apply a function to all rows of a DataFrame?
dataframename.apply(functionname,axis=1)