Cleaning Data Flashcards
Learn the basics of Data Analysis
5 steps of data analysis
S1. Ask questions
S2. Wrangle data - gather, assess, clean
S3. Perform EDA (exploratory data analysis)
S4. Draw conclusions
S5. Communicate your results
parts of .read_csv()
filepath = path to the CSV file
header = row number to use as the column names (header=None if the file has none)
index_col = which column(s) to use as the row index
sep = separator/delimiter between values
ex.
labels = ['id', 'name', 'scores']
df = pd.read_csv('scores.csv', sep=';', header=None, names=labels, index_col='name')
.shape
returns the dimensions of the DataFrame as a (rows, columns) tuple, e.g. (462, 32)
.dtypes
returns the data type of each column
.info()
returns each column's name, non-null count, and dtype
.nunique()
returns number of unique values for each column
.describe()
returns summary statistics for numeric columns: count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), and max
.head()
returns the first few rows of the DataFrame
df.head(5)
.tail()
returns the last few rows of the DataFrame
df.tail(2)
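A minimal runnable sketch tying the inspection methods above together; the DataFrame and its columns are made up for illustration:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c'], 'score': [9.5, 7.0, 9.5]})

print(df.shape)      # (3, 3) - (rows, columns)
print(df.dtypes)     # data type of each column
df.info()            # column names, non-null counts, and dtypes
print(df.nunique())  # number of unique values per column
print(df.describe()) # summary statistics for numeric columns
print(df.head(2))    # first two rows
print(df.tail(2))    # last two rows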
A. .iloc
B. .iloc[0]
C. .iloc[0:5]
D. .iloc[:,1]
E. .iloc[0:5,1:3]
F. radi_means = df.iloc[0:3, 2]
A. A method in pandas used for integer-location based indexing
B. Selects the first row
C. Selects the first five rows
D. Selects the second column (index 1)
E. Selects the first five rows and the second and third columns
F. Selects the first three rows of the third column (index 2)
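A small runnable sketch of the .iloc patterns above; the DataFrame and column names are hypothetical:
import pandas as pd

df = pd.DataFrame({'id': range(10), 'radius_mean': range(10, 20), 'area_mean': range(20, 30)})

first_row = df.iloc[0]         # first row as a Series
first_five = df.iloc[0:5]      # first five rows
second_col = df.iloc[:, 1]     # second column (index 1)
block = df.iloc[0:5, 1:3]      # first five rows, second and third columns
radi_means = df.iloc[0:3, 2]   # first three rows of the third column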
A. .loc
B. .loc['row_label']
C. .loc[['row_label1', 'row_label2']]
D. .loc[:, 'column_label']
E. .loc['row_label1':'row_label3', ['col_label1', 'col_label2']]
A. A method in pandas used for label-based indexing.
B. Selects the row with the specified label
C. Selects multiple rows
D. Selects all rows for the specified column
E. Selects a range of rows and specific columns
A. Get the diagnosis of the patient with id 842302
B. Get the radius_mean for patients with id 842302 and 842517
A. diagnosis_patient = df.loc[df['id'] == 842302, 'diagnosis']
B. radius_means = df.loc[df['id'].isin([842302, 842517]), 'radius_mean']
Visualize how to create a subset of a dataframe
Make a copy of the DataFrame with the relevant columns:
copy_df = df[['col_1', 'col_2', 'col_3', 'col_4']]
Filter rows matching a condition:
subset = df[df['col_1'] == 'data']
Visualize how to use subsets when dropping duplicates, filling NaNs, and using the .isin() method.
A. Dropping duplicates
unique_df = df.drop_duplicates(subset=['col1', 'col2'])
B. Fill missing values
df.fillna({'col3': 0}, inplace=True)  # fill only col3; fillna takes a dict, not a subset parameter
C. Using pd.DataFrame.isin()
filtered_df = df[df['col1'].isin(['data1', 'data2'])]
Fill null values in your DataFrame with specified values
.fillna()
ex.
1. df['col1'] = df['col1'].fillna(pd.to_datetime('1900-01-01'))
2. mean_date = df['col1'].mean()
   df['col1'] = df['col1'].fillna(mean_date)
Visualize how to query for rows with null values in a column
- df_null = df[df['col1'].isnull()]
- df_null = df.query('col1.isnull()')
Visualize creating a query for all rows with null values
df_null = df[df.isnull().any(axis=1)]
Clean data
A. .duplicated() - returns True/False for each row that is a duplicate: df.duplicated(); then drop them with .drop_duplicates(): df = df.drop_duplicates()
B. Converting to the right dtypes with .to_datetime(): df['time'] = pd.to_datetime(df['time'])
C. .isnull(): df['col'].isnull()
D. .fillna(): df['col'] = df['col'].fillna(mean)
E. Correcting column names
F. Merging dataframes
G. Saving the cleaned version of the dataframes.
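A compact sketch stringing steps A-G together; the columns, values, and file name are all hypothetical:
import pandas as pd

# Hypothetical raw data for illustration
df = pd.DataFrame({'Time ': ['2024-01-01', '2024-01-01', '2024-01-02'],
                   'Score ': [1.0, 1.0, None]})

df = df.drop_duplicates()                                     # A. remove duplicate rows
df.rename(columns=lambda x: x.strip().lower(), inplace=True)  # E. correct column names
df['time'] = pd.to_datetime(df['time'])                       # B. convert to the right dtype
df['score'] = df['score'].fillna(df['score'].mean())          # C/D. find and fill nulls
# F. merging with another DataFrame would happen here, e.g. pd.merge(df, other_df, on='key')
df.to_csv('clean_data.csv', index=False)                      # G. save the cleaned version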
Drop extraneous columns
confirm changes
df_copy = df.drop(['col1', 'col2', 'col3'], axis=1, errors='ignore')
df_copy.head(2)
axis=0: refers to rows, axis=1: refers to columns
A. Renaming columns with a dictionary
B. Renaming columns with a lambda function
A. df.rename(columns={'Col 1': 'col_1', 'Col 2': 'col_2', 'Col 3': 'col_3'})
B. df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
Visualize a lambda function that renames columns by removing leading/trailing spaces, converting characters to lowercase, and replacing spaces with underscores
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
Visualize pandas query examples (A-D)
With brackets
A. df = df[df['col1'] == 'data']
C. df = df[df.col1 < 3]
With .query()
B. df = df.query('col1 == "data"')
D. df = df.query('col1 < 3')
A. Visualize how to count total null values across all columns
B. Visualize how to get null values in each column
A. total_nulls = df.isnull().sum().sum()
print(total_nulls)
B. null_col_counts = df.isnull().sum()
print(null_col_counts)
Visualize an if statement that drops rows with null values
if df.isnull().any().any():
    df_cleaned = df[~df.isnull().any(axis=1)]
    print("Rows with null values have been dropped.")
else:
    df_cleaned = df
    print("No null values found.")
Visualize how to count duplicates and then drop them.
A. df.duplicated().sum()
B. df = df.drop_duplicates()
How do you save a dataframe?
df.to_csv('filepath/filename.csv', index=False)
Name the pandas data types
object, int64, float64, bool, datetime64, and category
Visualize the two main ways to convert dtypes
A. df['col1'] = df['col1'].astype('int32')
B. df = df.astype({'col1': 'int32', 'col2': 'int32', 'col3': 'object'})
What is Regex or Regular Expressions?
A library (re) for pattern matching in strings that allows you to find, extract, or replace parts of strings based on defined rules. Some basic components are: ., *, +, ?, [], (), \d, \w, \s
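A brief sketch of pattern matching with re and with pandas' regex-aware string methods; the pattern and columns are made up for illustration:
import re
import pandas as pd

# \d+ matches one or more digits
match = re.search(r'\d+', 'Car A 2008')
print(match.group())  # '2008'

# pandas string methods accept regex too (hypothetical column)
df = pd.DataFrame({'model': ['Car A 2008', 'Car B 2011']})
df['year'] = df['model'].str.extract(r'(\d{4})', expand=False)       # pull out the year
df['model'] = df['model'].str.replace(r'\s\d{4}$', '', regex=True)   # strip it from the name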
Visualize the example of using a lambda function to split column data when it appears like 18/25
Split the values into separate rows
# Original DataFrame
data = {'city_mpg': ['13/18', '19/25'], 'hwy_mpg': ['18/22', '25/30']}
df = pd.DataFrame(data)
df1, df2 = df.copy(), df.copy()  # df1 keeps the first value, df2 the second
split_columns = ['city_mpg', 'hwy_mpg']
for col in split_columns:
    df1[col] = df1[col].apply(lambda x: x.split('/')[0] if isinstance(x, str) else x)
    df2[col] = df2[col].apply(lambda x: x.split('/')[1] if isinstance(x, str) and len(x.split('/')) > 1 else x)
# Concatenate df1 and df2 into a single DataFrame
expanded_df = pd.concat([df1, df2], ignore_index=True)
Visualize how to use a for loop to change dtypes in multiple columns.
columns = ['col1', 'col2', 'col3']
for x in columns:
    df[x] = df[x].astype('dtype')  # replace 'dtype' with the target type, e.g. 'int32'
Visualize how to use a function to change dtypes in multiple dataframes and columns
def change_column_dtype(dataframes, columns, dtype):
    for df in dataframes:
        for col in columns:
            df[col] = df[col].astype(dtype)
Usage
columns = ['col1', 'col2', 'col3']
change_column_dtype([df1, df2], columns, 'float')
Visualize how to optimize data types
df['col1'] = df['col1'].astype('int8')
When should you use categorical data types for data optimization?
When you have object columns with a low number of unique values.
ex.
df['col1'] = df['col1'].astype('category')
What is the categorical data type?
Data that is limited to a fixed number of possible values. Categorical data is particularly useful when you have variables that represent groups, such as gender, color, or type of vehicle.
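A small sketch, assuming a hypothetical color column with few unique values, comparing memory use before and after conversion:
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'] * 1000})
print(df['color'].memory_usage(deep=True))  # object column: relatively large

df['color'] = df['color'].astype('category')
print(df['color'].memory_usage(deep=True))  # category column: much smaller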
What is pd.concat([df1,df2])?
Primarily used to stack DataFrames either vertically (rows) or horizontally (columns), combining them by axis.
result = pd.concat([df1, df2], axis=0, ignore_index=True)
What is pd.merge()
Primarily used for database-style joins (similar to SQL joins), which means merging DataFrames based on one or more common columns or indices. This is useful for combining related data with specific relationships, such as inner, outer, left, and right joins.
result = pd.merge(df1, df2, on='key', how='inner')
How does inner_merge = pd.merge(df1, df2, on='key', how='inner') work?
Keeps only rows where keys are present in both DataFrames.
What will this expression result in:
outer_merge = pd.merge(df1, df2, on='key', how='outer')
Keeps all rows from both DataFrames, filling with NaN for missing values where no match is found.
Visualize left_merge = pd.merge(df1, df2, on='key', how='left')
Keeps all rows from the left DataFrame, with matches from the right DataFrame where available. Non-matching entries from the right (df2) are NaN.
Visualize right_merge = pd.merge(df1, df2, on='key', how='right')
Keeps all rows from the right DataFrame, with matches from the left DataFrame where available. Non-matching entries from the left (df1) are NaN.
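A toy example, with made-up keys and values, showing how the four join types above differ on the same inputs:
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'right_val': [4, 5, 6]})

print(pd.merge(df1, df2, on='key', how='inner'))  # keys B, C only
print(pd.merge(df1, df2, on='key', how='outer'))  # keys A, B, C, D with NaNs
print(pd.merge(df1, df2, on='key', how='left'))   # keys A, B, C; NaN right_val for A
print(pd.merge(df1, df2, on='key', how='right'))  # keys B, C, D; NaN left_val for D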
Visualize how to use pd.merge() on index
pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
Visualize how to calculate the mean mpg for the below DataFrame
data = {
    'model': ['Car A', 'Car B', 'Car A', 'Car B', 'Car C'],
    'cmb_mpg_2008': [25, 30, 25, 30, 20],
    'cmb_mpg': [30, 35, 32, 36, 25]
}
df_combined = pd.DataFrame(data)
model_mpg = df_combined.groupby('model').agg(
    mean_cmb_mpg_2008=('cmb_mpg_2008', 'mean'),
    mean_cmb_mpg=('cmb_mpg', 'mean')
).reset_index()
groupby('model'): groups the DataFrame by the model column.
agg(): calculates the mean of the cmb_mpg_2008 column, creating a new column titled mean_cmb_mpg_2008; the same is done for cmb_mpg.
Using agg, you can apply different aggregation functions (like sum, median, etc.) to different columns if needed, adding flexibility for additional metrics.
reset_index(): resets the index of the resulting DataFrame to make it easier to work with.
A. Visualize how to aggregate totals on multiple columns in a dataframe
B. Visualize how to aggregate totals using groupby.
C. Visualize how to sort values after aggregating with groupby.
A. df[['col1', 'col2']].sum()
B. df.groupby('col').sum(numeric_only=True)
C. df.groupby('col').sum(numeric_only=True).sort_values(by='col', ascending=False)
What is .reset_index()?
After a groupby or merge, this converts index levels back into regular columns so that you can use standard DataFrame column access methods and maintain consistency.
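A minimal sketch, assuming a hypothetical grouped result:
import pandas as pd

df = pd.DataFrame({'model': ['Car A', 'Car B', 'Car A'], 'mpg': [25, 30, 27]})
grouped = df.groupby('model').mean()  # 'model' becomes the index
flat = grouped.reset_index()          # 'model' is a regular column again
print(flat['model'])                  # standard column access now works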