Data Cleaning Flashcards

Question 1

Q

Name four of the most common data cleaning techniques, ordered from most to least important.

Answer

A

-Checking for null values
-Checking for duplicates
-Checking for datatypes
-Checking for outliers
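
The four checks above can be sketched in pandas as follows. The toy DataFrame and its column names are hypothetical, and the 1.5 * IQR rule used for outliers is just one common convention, not the only way to define an outlier.

```python
import pandas as pd

# Hypothetical toy data: one missing age, one repeated row, one extreme age.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age": [31.0, 28.0, 28.0, None, 240.0],
})

null_counts = df.isnull().sum()          # 1. missing values per column
duplicate_count = df.duplicated().sum()  # 2. rows that repeat an earlier row
column_types = df.dtypes                 # 3. datatype assigned to each column

# 4. outliers, flagged with the 1.5 * IQR rule (an assumed convention)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(null_counts, duplicate_count, column_types, outliers, sep="\n\n")
```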

Question 2

Q

Name five of the most commonly used pandas datatypes.

Answer

A

-int64 (numeric, integer)
-object (character/string)
-float64 (numeric, decimal)
-bool (Boolean, True/False)
-datetime64 (date and time values)
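
As a quick illustration, a small DataFrame with one column of each kind shows how pandas reports these datatypes. The column names are made up, and the exact labels printed can vary by pandas version (for example, newer releases may report text columns with a dedicated string dtype instead of object).

```python
import pandas as pd

# Hypothetical columns, one per datatype listed above.
df = pd.DataFrame({
    "count": [1, 2, 3],                      # integer
    "label": ["a", "b", "c"],                # text
    "price": [1.5, 2.0, 3.25],               # decimal
    "in_stock": [True, False, True],         # Boolean
    "added": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# One datatype per column.
print(df.dtypes)
```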

Question 3

Q

What does it mean for us to check our datatypes in data cleaning?

Answer

A

It means verifying that each column's assigned datatype matches the values it actually stores. For example, if a column is typed as object but contains only integer values, the object datatype is incorrect and the column should be converted to an integer type.
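
A minimal sketch of that situation, assuming a hypothetical "quantity" column whose whole numbers were read in as text. pd.to_numeric is one way to correct the type; it raises an error if a value cannot be parsed, which doubles as a check.

```python
import pandas as pd

# Hypothetical column: integers stored as strings.
df = pd.DataFrame({"quantity": ["10", "25", "7"]})
print(df["quantity"].dtype)   # a text dtype, not a numeric one

# Convert the column to a numeric dtype.
df["quantity"] = pd.to_numeric(df["quantity"])
print(df["quantity"].dtype)   # now an integer dtype

total = df["quantity"].sum()  # arithmetic works once the type is right
```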

Question 4

Q

What does it mean to check for inconsistencies in a dataset?

Answer

A

It means verifying that the unique values within a column are represented consistently. For example, a country column that records the United States of America as both "US" and "United States" is using two labels for the same country.
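
One way to spot and fix this kind of inconsistency is to list the unique values and then map the variants onto a single label. The mapping below is an assumed example, not a complete list of variants.

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "United States", "Canada", "US"]})

# Inspect the distinct labels to spot variants of the same country.
print(df["country"].unique())

# Hypothetical mapping from variant labels to one canonical label.
canonical = {"US": "United States"}
df["country"] = df["country"].replace(canonical)

print(df["country"].value_counts())
```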

Question 5

Q

When do we impute by the median?

Answer

A

We impute with the median when the variable's distribution is skewed, because the median is less affected by extreme values than the mean.
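
A small worked example, using made-up income figures with one extreme value, shows why the median is the safer fill value here:

```python
import pandas as pd

# Hypothetical right-skewed incomes with one missing entry.
income = pd.Series([30_000, 32_000, 35_000, 38_000, None, 1_000_000])

mean_value = income.mean()      # pulled upward by the single extreme value
median_value = income.median()  # stays near the typical values

# Fill the missing entry with the median.
filled = income.fillna(median_value)
print(mean_value, median_value)
```

Filling with the mean would insert 227000, far above every typical value, while the median inserts 35000.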

Question 6

Q

What are all the techniques we can use to impute missing values in a dataset?

Answer

A

-Univariate imputation (imputing with the mean, median, or mode)
-Backward fill and forward fill imputation (usually for time series data indexed by date or time)
-Imputing with moving average or sliding window values (usually for time series data indexed by date or time)
-Imputing with a specific value, if we're confident in what the value should be
-Imputation with supervised learning models (KNN, linear regression, random forest, decision trees, etc.)
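
The pandas-only techniques from this list can be sketched on a short, hypothetical daily series. Model-based imputation is left out because it needs a separate machine learning library, and the window size of 3 and the constant 0.0 are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical daily readings with two gaps.
s = pd.Series(
    [10.0, None, 14.0, None, 18.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

mean_filled = s.fillna(s.mean())   # univariate: fill with the column mean
forward_filled = s.ffill()         # carry the last observed value forward
backward_filled = s.bfill()        # pull the next observed value backward
constant_filled = s.fillna(0.0)    # a specific value we are confident in

# Moving average: mean of the observed values in a centered 3-point window.
rolling_mean = s.rolling(window=3, min_periods=1, center=True).mean()
rolling_filled = s.fillna(rolling_mean)
```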

Question 7

Q

When is it appropriate to delete rows in a dataset?

Answer

A

-If, for a given column, the number of missing values is less than 5% of the number of rows in the dataset, a common rule of thumb is that the rows with those missing values can be deleted
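
A minimal sketch of applying the 5% guideline, assuming a hypothetical "score" column. The threshold is a rule of thumb rather than a fixed standard.

```python
import pandas as pd

# Hypothetical data: 100 rows, 3 of them missing "score" (3%).
df = pd.DataFrame({"id": range(100), "score": [1.0] * 97 + [None] * 3})

threshold = 0.05  # the 5% guideline from this card
missing_fraction = df["score"].isnull().mean()

# Drop the affected rows only when few enough are missing.
if missing_fraction < threshold:
    df = df.dropna(subset=["score"])

print(len(df))
```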

Question 8

Q

When is it appropriate to delete columns in a dataset?

Answer

A

-If 50% or more of the values in a given column are missing, a common rule of thumb is to delete the column
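
The 50% guideline can be applied to every column at once. This is a sketch with made-up column names; the cutoff itself is a judgment call that depends on how important the column is.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "notes": [None, None, "ok", None],  # 75% missing
    "score": [0.5, None, 0.7, 0.9],     # 25% missing
})

# Fraction of missing values in each column.
missing_fraction = df.isnull().mean()

# Keep only the columns below the 50% cutoff.
columns_to_drop = missing_fraction[missing_fraction >= 0.5].index
df = df.drop(columns=columns_to_drop)

print(df.columns.tolist())
```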

Question 9

Q

What is the code/function to determine the proportion of rows missing in a column?

Answer

A

variablepercentage = round(df.variable.isnull().sum() / len(df) * 100, 2)
print(f"{variablepercentage}%")
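
The same calculation can be run for every column at once, since the mean of a Boolean mask is the fraction of True values. The DataFrame here is a hypothetical example.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
})

# Percentage of missing values per column, rounded to two decimals.
missing_percent = (df.isnull().mean() * 100).round(2)
print(missing_percent)
```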

Question 10

Q

What function can be used to remove duplicates in a dataset?

Answer

A

df.drop_duplicates()
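
Note that the call returns a new DataFrame and leaves the original unchanged unless the result is assigned back. A short sketch with made-up data, which also shows the subset and keep parameters:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "a@x.com"],
    "plan": ["free", "free", "pro", "pro"],
})

# Drop rows that are exact copies of an earlier row (row 1 repeats row 0).
deduped = df.drop_duplicates()

# Compare on one column only and keep the last occurrence of each email.
by_email = df.drop_duplicates(subset="email", keep="last")

print(len(df), len(deduped), len(by_email))
```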

Question 11

Q

What function can be used to check how many null values are in a dataset?

Answer

A

df.isnull().sum()
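
This returns one count per column. Summing that result again gives a single total for the whole dataset, as in this small hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [None, None, "z"],
})

per_column = df.isnull().sum()  # null count for each column
total = int(per_column.sum())   # null count for the whole DataFrame

print(per_column)
print(total)
```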

Question 12

Q

What is the function to fill null values for a specific variable while imputing using the median?

Answer

A

df['Variable'] = df['Variable'].fillna(df['Variable'].median())
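
To fill several numeric columns with their own medians in one step, one option is to pass the per-column medians to fillna. The column names below are hypothetical; non-numeric columns are left untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, None, 170.0, 180.0],
    "weight": [50.0, 60.0, None, 90.0],
    "name": ["a", "b", "c", None],
})

# Median of each numeric column (missing values are skipped).
medians = df.median(numeric_only=True)

# Each numeric column is filled with its own median.
df = df.fillna(medians)
print(df)
```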

Question 13

Q

What function is used to check for duplicated values?

Answer

A

df.duplicated()

The result is a Boolean Series with one value per row: True if the row repeats an earlier row, False otherwise.

Question 14

Q

What expression displays the duplicated rows themselves, comparing across all variables?

Answer

A

df[df.duplicated()]
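
A short sketch on made-up data shows the Boolean mask, the filtered rows, and two optional parameters (keep and subset) that change what counts as a duplicate:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Oslo", "Rome", "Rome", "Oslo"],
})

mask = df.duplicated()     # True only for rows repeating an earlier row
repeats = df[mask]         # the repeated rows themselves

# keep=False marks every copy, including the first occurrence.
all_copies = df[df.duplicated(keep=False)]

# subset compares only the listed columns.
city_repeats = df[df.duplicated(subset=["city"])]

print(mask.tolist())
```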

Question 15

Q

How do we remove leading and trailing whitespace from strings?

Answer

A

df['Variable'].str.strip()
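
The method returns a new Series, so the result has to be assigned back to keep it. A brief example with a hypothetical "city" column; whitespace inside a value, as in "New York", is not affected.

```python
import pandas as pd

df = pd.DataFrame({"city": ["  Oslo", "Rome  ", " New York "]})

# Strip whitespace from both ends and store the cleaned values.
df["city"] = df["city"].str.strip()

print(df["city"].tolist())
```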

Question 16

Q

If you want to convert string values to all lowercase, all uppercase, or title case, what function do you use?

Answer

A

df['Variable'].str.lower()

OR

df['Variable'].str.upper()

OR

df['Variable'].str.title()
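
For comparison, here are the three methods applied to the same made-up names. Each returns a new Series and leaves the original unchanged.

```python
import pandas as pd

names = pd.Series(["aLiCe smith", "BOB JONES"])

lower = names.str.lower()  # every letter lowercase
upper = names.str.upper()  # every letter uppercase
title = names.str.title()  # first letter of each word capitalized

print(lower.tolist(), upper.tolist(), title.tolist())
```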