Cleaning Data in Python Flashcards
Print first 5 rows of DataFrame
print(dataframename.head())
Print last 5 rows of DataFrame
print(dataframename.tail())
Get a list of columns of the DataFrame
dataframename.columns
Check dimensions of a DataFrame
dataframename.shape
Get information about DataFrame
dataframename.info()
# Shows the number of rows and columns, the column names, the number of non-missing values in each column, and the data type of each column
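A minimal sketch of the inspection cards above, run on a small made-up DataFrame. The name `df` and its columns are illustrative assumptions, not part of the flashcards.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee", "Eve", "Fay"],
    "age": [34, 29, None, 41, 25, 38],
})

first_rows = df.head()           # first 5 rows
last_rows = df.tail()            # last 5 rows
column_names = list(df.columns)  # column labels as a plain list
n_rows, n_cols = df.shape        # shape is a (rows, columns) tuple

print(first_rows)
print(last_rows)
print(column_names, n_rows, n_cols)
df.info()                        # prints its summary directly and returns None
```

Note that `shape` and `columns` are attributes (no parentheses), while `head()`, `tail()`, and `info()` are methods.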
Count frequency in a DataFrame column (2 equivalent ways)
dataframename.column1.value_counts(dropna=False)
# Attribute access works only if the column name has no spaces or special characters and is not the name of an existing DataFrame attribute or method
dataframename['column1'].value_counts(dropna=False)
# dropna=False counts the missing values (NaN) as well
Chain together value_counts() and head() methods
dataframename.column1.value_counts(dropna=False).head()
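As a rough illustration of the frequency-count cards, the sketch below compares `dropna=False` with the default behaviour. The `city` column and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
df = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo", np.nan, "Oslo", "Rome"]})

counts = df["city"].value_counts(dropna=False)  # NaN gets its own row
counts_no_nan = df["city"].value_counts()       # default: NaN is ignored
top = counts.head(1)                            # most frequent value only

print(counts)
print(counts_no_nan)
print(top)
```

Because `value_counts()` sorts by frequency in descending order, chaining `head()` gives the most common values.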
Summary statistics for a DataFrame
dataframename.describe()
# Will produce summary statistics of numeric data including count (number of non-missing values), mean, std, min, 25%, 50% (median), 75%, max
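A short sketch of `describe()` on made-up data; the column names here are assumptions chosen for illustration. It shows that, by default, the non-numeric column is left out of the summary.

```python
import pandas as pd

# Hypothetical sample data: one numeric column, one text column
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "label": ["a", "b", "c", "d"],
})

summary = df.describe()  # only the numeric "height" column is summarised
print(summary)

mean_height = summary.loc["mean", "height"]
median_height = summary.loc["50%", "height"]
```

The result is itself a DataFrame, so individual statistics can be looked up with `.loc`.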
Make a histogram from a DataFrame column
import matplotlib.pyplot as plt
dataframename.columnname.plot(kind='hist')
plt.show()
Make a boxplot from a DataFrame column
dataframename.boxplot(column='columnname', by='groupingvariable')
plt.show()
Make a scatter plot from two DataFrame columns
dataframename.plot(kind='scatter', x='column1', y='column2')
# x and y are column names, passed as strings
plt.show()
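The three plotting cards can be sketched together on one figure. The data, the column names, and the choice to pass explicit axes via `ax=` are assumptions made for this example, not something the cards prescribe.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

# One row of three axes so each plot gets its own panel
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

df["x"].plot(kind="hist", ax=axes[0])                 # histogram of one column
df.boxplot(column="y", by="group", ax=axes[1])        # one box per group
df.plot(kind="scatter", x="x", y="y", ax=axes[2])     # x and y are column names

fig.tight_layout()
# plt.show()  # uncomment to display the figure in an interactive session
```

Passing `ax=` keeps each plot on its own panel; without it, pandas draws on the current matplotlib axes.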
What are the 3 principles of Tidy Data?
- Columns represent separate variables
- Rows represent individual observations
- Observational units form tables
Melt together two columns of a DataFrame
pd.melt(frame=dataframename, id_vars='column1', value_vars=['column2', 'column3'], var_name='columnname', value_name='valuesname')
- # id_vars are the columns you want to keep as they are. For multiple id_vars, pass a list []
- # value_vars are the columns you want to melt. If none are specified, all columns except those in id_vars are melted
- # var_name is the name of the new column that holds the melted column names
- # value_name is the name of the new column that holds the values
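A small worked example of melting may help here. The DataFrame `wide` and the names `treatment` and `result` are hypothetical, chosen only to show how the parameters map onto the output.

```python
import pandas as pd

# Hypothetical wide-format data: one column per treatment
wide = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "treatment_a": [5, 7],
    "treatment_b": [9, 3],
})

long = pd.melt(
    frame=wide,
    id_vars="name",                              # kept as-is, repeated per melted column
    value_vars=["treatment_a", "treatment_b"],   # columns to melt into rows
    var_name="treatment",                        # holds the old column names
    value_name="result",                         # holds the old cell values
)
print(long)
```

Two rows and two melted columns produce four rows: each original cell becomes one observation.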
Pivot one column into two columns
new_df = dataframename.pivot(index='column1', columns='column2', values='column3')
- # index – columns that you want to keep the same. For multiple entries, pass a list []
- # columns – column whose unique values you want to pivot into separate columns
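As a sketch of pivoting, the example below turns a long table back into a wide one. The column names `date`, `element`, and `value` are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, element) pair
long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [30, 18, 28, 17],
})

# Each unique value in "element" becomes its own column
wide = long.pivot(index="date", columns="element", values="value")
print(wide)
```

Here `pivot()` works because every (date, element) pair appears exactly once.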
Pivot one column into two columns when the variable you're pivoting has duplicate values, aggregating the duplicates by taking their average
import numpy as np
new_df = dataframename.pivot_table(index='column1', columns='column2', values='column3', aggfunc=np.mean)
- # index – columns that you want to keep the same. For multiple entries, pass a list []
- # columns – column whose unique values you want to pivot into separate columns
- # values – values that will be used to fill the columns after pivoting
- # aggfunc – tells pandas how to aggregate duplicate values (the mean is the default, so aggfunc='mean' is equivalent)
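To illustrate why `pivot_table()` is needed, the sketch below uses made-up data with a duplicated (date, element) pair. It passes the string `'mean'` rather than `np.mean`, which is an assumption of this example and aggregates the same way.

```python
import pandas as pd

# Hypothetical data: ("d1", "tmax") appears twice
long = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d2"],
    "element": ["tmax", "tmax", "tmin", "tmax"],
    "value": [30.0, 32.0, 18.0, 28.0],
})

# Plain pivot() cannot decide which duplicate to keep, so it raises ValueError
try:
    long.pivot(index="date", columns="element", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead
wide = long.pivot_table(index="date", columns="element", values="value", aggfunc="mean")
print(pivot_failed)
print(wide)
```

The two `tmax` readings for `d1` are averaged, and the missing (d2, tmin) combination is filled with NaN.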
Flatten the columns of a pivoted DataFrame
dataframename = dataframename.reset_index()
# reset_index() returns a new DataFrame, so assign the result
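A brief sketch of flattening a pivoted result, again on hypothetical data. Clearing `columns.name` is an optional extra step added here for tidiness, not something the card requires.

```python
import pandas as pd

# Hypothetical long-format data, pivoted first so there is an index to reset
long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [30.0, 18.0, 28.0, 17.0],
})
wide = long.pivot_table(index="date", columns="element", values="value")

flat = wide.reset_index()   # "date" moves from the index back to a regular column
flat.columns.name = None    # optional: drop the leftover "element" label
print(flat)
```

After the reset, the rows are numbered 0, 1, ... and `date` can be used like any other column.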