Cleaning Data Flashcards

Learn the basics of Data Analysis

1
Q

5 steps of data analysis

A

S1. Ask questions
S2. Wrangle data - gather, assess, clean
S3. Perform EDA (exploratory data analysis)
S4. Draw conclusions
S5. Communicate your results

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

parts of .read_csv()

A

filepath = filepath of the csv
header = header of the CSV file
index_col = which columns should be index
sep = separator

ex.
labels = [‘id’,’name’,’scores’]
df = pd.read_csv(‘scores.csv’, sep=’;’, header=None, names=labels, index_col=’Name’)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

.shape

A

returns the overall size/shape of the dataframe: (462, 32)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

.dtypes

A

return each datatype of the columns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

.info()

A

returns
column, non-null count, and Dtype

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

.nunique()

A

returns number of unique values for each column

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

.describe()

A

returns summary statistics: count, mean, std, min, max, Q1 and Q3

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

.head()

A

returns first few lines of Dataframe
df.head(5)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

.tail()

A

returns last few lines of Dataframe
df.tail(2)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

A. .iloc
B. .iloc[0]
C. .iloc[0:5]
D. .iloc[:,1]
E. .iloc[0:5,1:3]
F. radi_means = .iloc[0:3,2]

A

A. a method in pandas used for integer-location based on indexing
B. Selects the first row
C. Selects the first five rows
D. Selects the second column (index 1)
E. Selects the first five rows and the second and third columns
F. Outputs first three rows of column three.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

A. .loc
B. .loc[‘row_index’]
C. .loc[[‘row_label’,’row_label2’]]
D. .loc[:, ‘column_label’]
E. .loc[‘row_label1’:’row_label3’,[‘col_lab1’,’column_lab2’]]

A

A. A method in pandas used for label-based indexing.
B. Selects the row with the specified label
C. Selects multiple rows
D. Selects all rows for the specified column
E. Selects a range of rows and specific columns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

A. Get the diagnosis of the patient with id 842302
B. Get the radius_mean for patients with id 842302 and 842517

A

A. diagnosis_patient = df.loc[df[‘id’]==842302,’diagnosis’]
B. radius_means = df.loc[df[‘id’].isin([842302,842517]), ‘radius_mean’] #Output

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Visualize how to create a subset of a dataframe

A

make a copy of the data frame with relevant columns

copy_df = df[[‘col_1’,’col_2’,’col_3’,col_4’]]

subset = df[df[‘col_1’] == ‘data’]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Visualize how to use subsets in dropping duplicates, filling NaNs, and .isin() method.

A

A. Dropping duplicates
unique_df = df.drop_duplicates(subset=[‘col1’, ‘col2’])
B. Fill missing values
df.fillna(value=0, subset=[‘col3’], inplace=True)
C. Using pd.DataFrame.isin()
filtered_df = df[df[‘col1’].isin([‘data1’,’data2’])]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Fill null values in your DataFrame with values

A

.fillna()

ex.
1. df[‘col1’] = df[‘col1’].fillna(pd.to_datetime(‘1900-01-01’)
2. mean_date = df[‘col1’].mean()
df[‘col1’] = df[‘col1’].fillna(mean_date)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

A. Using .query() to discover null values in rows of a column
B. Using .query() for all rows with null values

A

A.
1. df_null = df[df[‘col1’].isnull()]
2. df_null = df.query(‘col1.isnull()’)
B.
df_null = df[df.isnull().any(axis=1)]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Clean data

A

A. .duplicated() - True or false for rows that are duplicates = df.duplicated() and then .drop_duplicates(): df = df.drop_duplicates()
B. Converting to right dtypes: .to_datetime(): df[‘time’] = pd.to_datetime(df[‘time’])
C. .isnull(): df[‘col’].isnull()
D. .fillna() : df[‘col’] = df[‘col’].fillna(mean)
E. Correcting column names
F. Merging dataframes
G. Saving the cleaned version of the dataframes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Drop extraneous columns

A

confirm changes

df_copy = df.drop([‘col1’,’col2’,’col3’],axis=1, errors=’ignore’)

df_copy.head(2)

axis=0: refers to rows, axis=1: refers to columns

18
Q

A. Renaming columns with a library
B. renaming columns with a lambda function

A

A. df.rename(columns={‘Col 1’: ‘col_1’, ‘Col 2’: ‘col_2’, ‘Col 3’:’col_3’ })
B. df.rename(columns=lambda x: x.strip().lower().replace(“ “, “_”), inplace=True)

19
Q

Visualize a lambda function that renames columns by removing trailing spaces, converting characters to lowercase, and replacing spaces with underscores

A

df.rename(columns=lambda x: x.strip().lower().replace(“ “,”_”), inplace=True)

20
Q

Visualize panda query examples
Example A
Example B
Example C
Example D

A

With brackets
A. df = df[df[“col1”] == “data”]
C. df = df[df.col1 < 3]
With .query()
B. df = df.query(“col1” == “data”’)
D. df = df.query(“col1 < 3”)

21
Q

A. Visualize how to count total null values across all columns
B. Visualize how to get null values in each column

A

A. total_nulls = df.isnull().sum().sum()
print(total_nulls)
B. null_col_counts = df.isnull().sum()
print(null_col_counts)

22
Q

Visualize an if statement that drops rows with null values

A

if df.isnull().any().any():
df_cleaned = df[~df.isnull().any(axis=1)]
print(“Rows with null values have been dropped”)
else:
df_cleaned = df
print(“No null values found.”)

23
Q

Visualize how to count duplicates and then drop them.

A

A. df.duplicated().sum()
B. df = df.drop_duplicates()

24
Q

How do you save a dataframe?

A

df.to_csv(‘filepath/filename.csv’, index=False)

25
Q

Name the pandas data types

A

Objects, integers, floats, booleans, datetimes, and categorical

26
Q

Visualize the two main ways to convert dtypes

A

A.df[‘col1’] = df[‘col1’].astype(‘int32’)
B. df = df.astype({‘col1’: ‘int32’, ‘col2’: ‘int32’, ‘col3’: ‘object’})

27
Q

What is Regex or Regular Expressions?

A

A tool for pattern matching in strings that allows you to or replace parts of strings based on defined rules. Some basic components are: ., *, +, ?, [], (), \d, \w, \s

28
Q

Visualize the example of using a lambda function to split column data when it appears like 18/25

A

Split the values into separate rows

Original DataFrame
data = {‘city_mpg’: [‘13/18’, ‘19/25’], ‘hwy_mpg’: [‘18/22’, ‘25/30’]}
split_columns = [‘city_mpg’, ‘hwy_mpg’]
for col in split_columns:
df1[col] = df1[col].apply(lambda x: x.split(“/”)[0] if isinstance(x, str) else x)
df2[col] = df2[col].apply(lambda x: x.split(“/”)[1] if isinstance(x, str) and len(x.split(“/”)) > 1 else x)
#Concatenate df1 and df2 into a single DataFrame
expanded_df = pd.concat([df1, df2], ignore_index=True)

29
Q

Visualize how to use a for loop to change dtypes in multiple columns.

A

columns = [‘col1’, ‘col2’, ‘col3’]
for x in columns:
df[x] = df[x].astype(‘dtype’)

30
Q

Visualize how to use a function to change dtypes in multiple dataframes and columns

A

def change_column_dtype(dataframes, columns, dtype):
for df in dataframes:
for col in columns:
df[col] = df[col].astype(dtype)

Usage
columns = [‘col1’, ‘col2, ‘col3’]
change_column_dtype([df1, df2], columns, ‘float’)

31
Q

Visualize how to optimize data types

A

df[‘col1’].astype(“int8”)

32
Q

When should you use categorical data types for data optimization?

A

When you have object columns with a low number of unique values.

ex.
df[‘col1’].astype(“category”)

33
Q

What is the categorical data type?

A

Data that is limited with a fixed number of possible values. Categorical data is particularly useful when you have variables that represent groups such as gender, color, or type of vehicle.

34
Q

What is pd.concat([df1,df2])?

A

Primarily used to stack DataFrames either vertically (rows) or horizontally (columns), combining them by axis. It’s often used when you want to extend a DataFrame by adding rows or columns from another DataFrame without complex merging logic.

result = pd.concat([df1, df2], axis=0, ignore_index=True)

35
Q

What is pd.merge()

A

Primarily used for database-style joins (similar to SQL joins), which means merging DataFrames based on one or more common columns or indices. This is useful for combining related data with specific relationships, such as inner, outer, left, and right joins.

result = pd.merge(df1, df2, on=’key’, how=’inner’)

36
Q

How does inner_merge = pd.merge(df1, df2, on=’key’, how=’inner’) work?

A

Keeps only rows where keys are present in both DataFrames.

37
Q

Visualize outer_merge = pd.merge(df1, df2, on=’key’, how=’outer’)

A

Keeps all rows from both DataFrames, filling with NaN for missing values where no match is found.

38
Q

Visualize left_merge = pd.merge(df1, df2, on=’key’, how=’left’)

A

Keeps all rows from the left DataFrame, with matches from the right DataFrame where available. Non-matching entries from the right df2 are NAN

39
Q

Visualize right_merge = pd.merge(df1, df2, on=’key’, how=’right’)

A

Keeps all rows from the right DataFrame, with matches from the left DataFrame where available. Non-matching entries from the left df1 are NAN

40
Q

Visualize how to use pd.merge() on index

A

pd.merge(df1, df2, left_index=True, right_index=True, how=”inner”)

41
Q

Visualize how to calculate the mean mpg for the below DataFrame

data = {
‘model’: [‘Car A’, ‘Car B’, ‘Car A’, ‘Car B’, ‘Car C’],
‘cmb_mpg_2008’: [25, 30, 25, 30, 20],
‘cmb_mpg’: [30, 35, 32, 36, 25]
}

A

model_mpg = df_combined.groupby(‘model’).agg(
mean_cmb_mpg_2008=(‘cmb_mpg_2008’, ‘mean’),
mean_cmb_mpg=(‘cmb_mpg’, ‘mean’).reset_index()
groupby(‘model’): groups the dataframe by the model column
agg(): Calculates the mean for cmb_mpg_2008 and cmb_mpg.
Using agg, you can apply different aggregation functions (like sum, median, etc.) to different columns if needed, adding flexibility for additional metrics
reset_index(): Resets the index of the resulting DF to make it easier to work with.

42
Q

A. Visualize how to aggregate totals on multiple columns in a dataframe
B. Visualize how to aggregate totals using groupby.
C. Visualize how to sort values after aggregating with groupby.

A

A. df[[‘col1’,’col2’]].sum()
B. df.groupby(‘col’).sum(numeric_only=True)
C. df.groupby(‘col’).sum(numeric_only=True).sort_values(by=”col”, ascending=False)

43
Q

What .reset_index()

A

After merging a dataframe, this is used to convert the index levels back so that you can use standard DataFrame column access methods and maintain consistency.