Cleaning Data Flashcards
Learn the basics of Data Analysis
5 steps of data analysis
S1. Ask questions
S2. Wrangle data - gather, assess, clean
S3. Perform EDA (exploratory data analysis)
S4. Draw conclusions
S5. Communicate your results
parts of .read_csv()
filepath = path to the CSV file
header = row number to use as the column names (header=None if the file has none)
index_col = which column(s) to use as the row index
sep = separator/delimiter between values
ex.
labels = ['id', 'name', 'scores']
df = pd.read_csv('scores.csv', sep=';', header=None, names=labels, index_col='name')
.shape
returns the dimensions of the DataFrame as a (rows, columns) tuple, e.g. (462, 32)
.dtypes
returns the data type of each column
.info()
returns each column's name, non-null count, and dtype
.nunique()
returns number of unique values for each column
.describe()
returns summary statistics for numeric columns: count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), and max
.head()
returns the first few rows of the DataFrame
df.head(5)
.tail()
returns the last few rows of the DataFrame
df.tail(2)
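A minimal runnable sketch tying the inspection methods above together; the DataFrame and its columns are made up for illustration:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c'], 'score': [9.5, 7.0, 9.5]})

print(df.shape)      # (3, 3) - (rows, columns)
print(df.dtypes)     # data type of each column
df.info()            # column names, non-null counts, and dtypes
print(df.nunique())  # number of unique values per column
print(df.describe()) # summary statistics for numeric columns
print(df.head(2))    # first two rows
print(df.tail(2))    # last two rows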
A. .iloc
B. .iloc[0]
C. .iloc[0:5]
D. .iloc[:,1]
E. .iloc[0:5,1:3]
F. radi_means = df.iloc[0:3, 2]
A. A method in pandas used for integer-location based indexing
B. Selects the first row
C. Selects the first five rows
D. Selects the second column (index 1)
E. Selects the first five rows and the second and third columns
F. Selects the first three rows of the third column (index 2)
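A small runnable sketch of the .iloc patterns above; the DataFrame and column names are hypothetical:
import pandas as pd

df = pd.DataFrame({'id': range(10), 'radius_mean': range(10, 20), 'area_mean': range(20, 30)})

first_row = df.iloc[0]         # first row as a Series
first_five = df.iloc[0:5]      # first five rows
second_col = df.iloc[:, 1]     # second column (index 1)
block = df.iloc[0:5, 1:3]      # first five rows, second and third columns
radi_means = df.iloc[0:3, 2]   # first three rows of the third column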
A. .loc
B. .loc['row_label']
C. .loc[['row_label1', 'row_label2']]
D. .loc[:, 'column_label']
E. .loc['row_label1':'row_label3', ['col_label1', 'col_label2']]
A. A method in pandas used for label-based indexing.
B. Selects the row with the specified label
C. Selects multiple rows
D. Selects all rows for the specified column
E. Selects a range of rows and specific columns
A. Get the diagnosis of the patient with id 842302
B. Get the radius_mean for patients with id 842302 and 842517
A. diagnosis_patient = df.loc[df['id'] == 842302, 'diagnosis']
B. radius_means = df.loc[df['id'].isin([842302, 842517]), 'radius_mean']
Visualize how to create a subset of a dataframe
Make a copy of the DataFrame with the relevant columns:
copy_df = df[['col_1', 'col_2', 'col_3', 'col_4']]
Filter rows matching a condition:
subset = df[df['col_1'] == 'data']
Visualize how to use subsets when dropping duplicates, filling NaNs, and using the .isin() method.
A. Dropping duplicates
unique_df = df.drop_duplicates(subset=['col1', 'col2'])
B. Fill missing values
df.fillna({'col3': 0}, inplace=True)  # fill only col3; fillna takes a dict, not a subset parameter
C. Using pd.DataFrame.isin()
filtered_df = df[df['col1'].isin(['data1', 'data2'])]
Fill null values in your DataFrame with specified values
.fillna()
ex.
1. df['col1'] = df['col1'].fillna(pd.to_datetime('1900-01-01'))
2. mean_date = df['col1'].mean()
   df['col1'] = df['col1'].fillna(mean_date)
Visualize how to query for rows with null values in a column
- df_null = df[df['col1'].isnull()]
- df_null = df.query('col1.isnull()')
Visualize creating a query for all rows with null values
df_null = df[df.isnull().any(axis=1)]
Clean data
A. .duplicated() - returns True/False for each row that is a duplicate: df.duplicated(); then drop them with .drop_duplicates(): df = df.drop_duplicates()
B. Converting to the right dtypes with .to_datetime(): df['time'] = pd.to_datetime(df['time'])
C. .isnull(): df['col'].isnull()
D. .fillna(): df['col'] = df['col'].fillna(mean)
E. Correcting column names
F. Merging dataframes
G. Saving the cleaned version of the dataframes.
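A compact sketch stringing steps A-G together; the columns, values, and file name are all hypothetical:
import pandas as pd

# Hypothetical raw data for illustration
df = pd.DataFrame({'Time ': ['2024-01-01', '2024-01-01', '2024-01-02'],
                   'Score ': [1.0, 1.0, None]})

df = df.drop_duplicates()                                     # A. remove duplicate rows
df.rename(columns=lambda x: x.strip().lower(), inplace=True)  # E. correct column names
df['time'] = pd.to_datetime(df['time'])                       # B. convert to the right dtype
df['score'] = df['score'].fillna(df['score'].mean())          # C/D. find and fill nulls
# F. merging with another DataFrame would happen here, e.g. pd.merge(df, other_df, on='key')
df.to_csv('clean_data.csv', index=False)                      # G. save the cleaned version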
Drop extraneous columns
confirm changes
df_copy = df.drop(['col1', 'col2', 'col3'], axis=1, errors='ignore')
df_copy.head(2)
axis=0: refers to rows, axis=1: refers to columns
A. Renaming columns with a dictionary
B. Renaming columns with a lambda function
A. df.rename(columns={'Col 1': 'col_1', 'Col 2': 'col_2', 'Col 3': 'col_3'})
B. df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
Visualize a lambda function that renames columns by removing leading/trailing spaces, converting characters to lowercase, and replacing spaces with underscores
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
Visualize pandas query examples (A-D)
With brackets
A. df = df[df['col1'] == 'data']
C. df = df[df.col1 < 3]
With .query()
B. df = df.query('col1 == "data"')
D. df = df.query('col1 < 3')
A. Visualize how to count total null values across all columns
B. Visualize how to get null values in each column
A. total_nulls = df.isnull().sum().sum()
print(total_nulls)
B. null_col_counts = df.isnull().sum()
print(null_col_counts)
Visualize an if statement that drops rows with null values
if df.isnull().any().any():
    df_cleaned = df[~df.isnull().any(axis=1)]
    print("Rows with null values have been dropped.")
else:
    df_cleaned = df
    print("No null values found.")
Visualize how to count duplicates and then drop them.
A. df.duplicated().sum()
B. df = df.drop_duplicates()
How do you save a dataframe?
df.to_csv('filepath/filename.csv', index=False)
Name the pandas data types
object, int64, float64, bool, datetime64, and category
Visualize the two main ways to convert dtypes
A. df['col1'] = df['col1'].astype('int32')
B. df = df.astype({'col1': 'int32', 'col2': 'int32', 'col3': 'object'})
What is Regex or Regular Expressions?
A library (re) for pattern matching in strings that allows you to find, extract, or replace parts of strings based on defined rules. Some basic components are: ., *, +, ?, [], (), \d, \w, \s
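A brief sketch of pattern matching with re and with pandas' regex-aware string methods; the pattern and columns are made up for illustration:
import re
import pandas as pd

# \d+ matches one or more digits
match = re.search(r'\d+', 'Car A 2008')
print(match.group())  # '2008'

# pandas string methods accept regex too (hypothetical column)
df = pd.DataFrame({'model': ['Car A 2008', 'Car B 2011']})
df['year'] = df['model'].str.extract(r'(\d{4})', expand=False)       # pull out the year
df['model'] = df['model'].str.replace(r'\s\d{4}$', '', regex=True)   # strip it from the name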
Visualize the example of using a lambda function to split column data when it appears like 18/25
Split the values into separate rows
# Original DataFrame
data = {'city_mpg': ['13/18', '19/25'], 'hwy_mpg': ['18/22', '25/30']}
df = pd.DataFrame(data)
df1, df2 = df.copy(), df.copy()  # df1 keeps the first value, df2 the second
split_columns = ['city_mpg', 'hwy_mpg']
for col in split_columns:
    df1[col] = df1[col].apply(lambda x: x.split('/')[0] if isinstance(x, str) else x)
    df2[col] = df2[col].apply(lambda x: x.split('/')[1] if isinstance(x, str) and len(x.split('/')) > 1 else x)
# Concatenate df1 and df2 into a single DataFrame
expanded_df = pd.concat([df1, df2], ignore_index=True)
Visualize how to use a for loop to change dtypes in multiple columns.
columns = ['col1', 'col2', 'col3']
for x in columns:
    df[x] = df[x].astype('dtype')  # replace 'dtype' with the target type, e.g. 'int32'
Visualize how to use a function to change dtypes in multiple dataframes and columns
def change_column_dtype(dataframes, columns, dtype):
    for df in dataframes:
        for col in columns:
            df[col] = df[col].astype(dtype)
Usage
columns = ['col1', 'col2', 'col3']
change_column_dtype([df1, df2], columns, 'float')
Visualize how to optimize data types
df['col1'] = df['col1'].astype('int8')
When should you use categorical data types for data optimization?
When you have object columns with a low number of unique values.
ex.
df['col1'] = df['col1'].astype('category')
What is the categorical data type?
Data that is limited to a fixed number of possible values. Categorical data is particularly useful when you have variables that represent groups, such as gender, color, or type of vehicle.
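A small sketch, assuming a hypothetical color column with few unique values, comparing memory use before and after conversion:
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'] * 1000})
print(df['color'].memory_usage(deep=True))  # object column: relatively large

df['color'] = df['color'].astype('category')
print(df['color'].memory_usage(deep=True))  # category column: much smaller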
What is pd.concat([df1,df2])?
Primarily used to stack DataFrames either vertically (rows) or horizontally (columns), combining them by axis.
result = pd.concat([df1, df2], axis=0, ignore_index=True)
What is pd.merge()
Primarily used for database-style joins (similar to SQL joins), which means merging DataFrames based on one or more common columns or indices. This is useful for combining related data with specific relationships, such as inner, outer, left, and right joins.
result = pd.merge(df1, df2, on='key', how='inner')
How does inner_merge = pd.merge(df1, df2, on='key', how='inner') work?
Keeps only rows where keys are present in both DataFrames.
What will this expression result in:
outer_merge = pd.merge(df1, df2, on='key', how='outer')
Keeps all rows from both DataFrames, filling with NaN for missing values where no match is found.
Visualize left_merge = pd.merge(df1, df2, on='key', how='left')
Keeps all rows from the left DataFrame, with matches from the right DataFrame where available. Non-matching entries from the right (df2) are NaN.
Visualize right_merge = pd.merge(df1, df2, on='key', how='right')
Keeps all rows from the right DataFrame, with matches from the left DataFrame where available. Non-matching entries from the left (df1) are NaN.
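A toy example, with made-up keys and values, showing how the four join types above differ on the same inputs:
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'right_val': [4, 5, 6]})

print(pd.merge(df1, df2, on='key', how='inner'))  # keys B, C only
print(pd.merge(df1, df2, on='key', how='outer'))  # keys A, B, C, D with NaNs
print(pd.merge(df1, df2, on='key', how='left'))   # keys A, B, C; NaN right_val for A
print(pd.merge(df1, df2, on='key', how='right'))  # keys B, C, D; NaN left_val for D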
Visualize how to use pd.merge() on index
pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
Visualize how to calculate the mean mpg for the below DataFrame
data = {
    'model': ['Car A', 'Car B', 'Car A', 'Car B', 'Car C'],
    'cmb_mpg_2008': [25, 30, 25, 30, 20],
    'cmb_mpg': [30, 35, 32, 36, 25]
}
df_combined = pd.DataFrame(data)
model_mpg = df_combined.groupby('model').agg(
    mean_cmb_mpg_2008=('cmb_mpg_2008', 'mean'),
    mean_cmb_mpg=('cmb_mpg', 'mean')
).reset_index()
groupby('model'): groups the DataFrame by the model column.
agg(): calculates the mean of the cmb_mpg_2008 column, creating a new column titled mean_cmb_mpg_2008; the same is done for cmb_mpg.
Using agg, you can apply different aggregation functions (like sum, median, etc.) to different columns if needed, adding flexibility for additional metrics.
reset_index(): resets the index of the resulting DataFrame to make it easier to work with.
A. Visualize how to aggregate totals on multiple columns in a dataframe
B. Visualize how to aggregate totals using groupby.
C. Visualize how to sort values after aggregating with groupby.
A. df[['col1', 'col2']].sum()
B. df.groupby('col').sum(numeric_only=True)
C. df.groupby('col').sum(numeric_only=True).sort_values(by='col', ascending=False)
What is .reset_index()?
After a groupby or merge, this converts index levels back into regular columns so that you can use standard DataFrame column access methods and maintain consistency.
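A minimal sketch, assuming a hypothetical grouped result:
import pandas as pd

df = pd.DataFrame({'model': ['Car A', 'Car B', 'Car A'], 'mpg': [25, 30, 27]})
grouped = df.groupby('model').mean()  # 'model' becomes the index
flat = grouped.reset_index()          # 'model' is a regular column again
print(flat['model'])                  # standard column access now works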