Data Cleaning Flashcards

1
Q

Name four of the most common data cleaning checks, ordered from most to least important

A

-Checking for null values
-Checking for duplicates
-Checking for datatypes
-Checking for outliers

2
Q

Name five of the most commonly used pandas datatypes

A

-int64 (Numeric/Integer)
-object (Character/String)
-float64 (Numeric/Decimal)
-bool (Binary item/Boolean)
-datetime64 (Date/Time values)

3
Q

What does it mean for us to check our datatypes in data cleaning?

A

It means verifying that each column's assigned datatype matches the values it actually stores. For example, if a column is typed as object but holds only integer values, the object datatype is incorrect for that column.
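
A minimal sketch of this check, assuming pandas. The DataFrame and the column name age are made up for illustration:

```python
import pandas as pd

# Hypothetical data: integers that were loaded as strings
df = pd.DataFrame({"age": ["25", "31", "47"]})

# Inspect the assigned datatypes; age shows up as object
# (or as a string dtype in newer pandas versions)
print(df.dtypes)

# Convert the column once we know it holds only integer values
df["age"] = pd.to_numeric(df["age"])

print(df.dtypes)  # age is now an integer dtype
```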

4
Q

What does it mean to check for inconsistencies in a dataset?

A

It means verifying that the unique values within a column are consistent. For instance, a country column might record the United States of America as both "US" and "United States", using two labels for the same country.
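
One way to spot and fix this kind of inconsistency, sketched with pandas. The data and the choice of "United States" as the canonical label are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: one country recorded under two labels
df = pd.DataFrame({"country": ["US", "United States", "Canada", "US"]})

# Inspect the distinct labels to spot inconsistencies
print(df["country"].unique())

# Map the variant label onto a single canonical label
df["country"] = df["country"].replace({"US": "United States"})

print(df["country"].value_counts())
```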

5
Q

When do we impute by the median?

A

We impute by the median when the distribution is skewed, because the median is less affected by extreme values than the mean
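
A small sketch of why the median suits skewed data, assuming pandas. The income values are invented for illustration:

```python
import pandas as pd

# Hypothetical right-skewed data with one extreme value and one missing value
income = pd.Series([30, 32, 35, 38, 40, 500, None])

mean_value = income.mean()      # pulled upward by the extreme value
median_value = income.median()  # stays near the bulk of the data

print(mean_value, median_value)

# Fill the missing entry with the median
income_filled = income.fillna(median_value)
```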

6
Q

What are all the techniques we can use to impute missing values in a dataset?

A

-Univariate imputation (imputing by the mean, median, or mode)
-Backward fill and forward fill imputation (usually for time-sequence data associated with a date or time)
-Imputing with a moving average or sliding-window values (usually for time-sequence data associated with a date or time)
-Imputing a specific value if we’re confident in what the value should be
-Imputation by supervised learning models (KNN, linear regression, random forest, decision trees, etc.)
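
The first four techniques above can be sketched with pandas as follows. The series, its dates, the window size of 3, and the constant 0.0 are all assumptions for illustration; model-based imputation is left out because it needs an additional library:

```python
import pandas as pd

# Hypothetical daily readings with gaps
s = pd.Series(
    [10.0, None, 14.0, None, 18.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

mean_filled = s.fillna(s.mean())  # univariate: fill with the column mean
forward_filled = s.ffill()        # carry the last known value forward
backward_filled = s.bfill()       # pull the next known value backward
constant_filled = s.fillna(0.0)   # a specific value we are confident in

# Moving average: mean of the available values in a 3-row window
# ending at each row, used only where the original value is missing
rolling_mean = s.rolling(window=3, min_periods=1).mean()
rolling_filled = s.fillna(rolling_mean)
```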

7
Q

When is it appropriate to delete rows in a dataset?

A

-If, for a given column, the number of missing values is less than 5% of the number of rows in the dataset
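
A sketch of applying the 5% rule with pandas. The DataFrame and the column names score and id are hypothetical:

```python
import pandas as pd

# Hypothetical data: 1 missing value in "score" out of 25 rows (4%)
df = pd.DataFrame({"score": [None] + list(range(24)), "id": range(25)})

# Fraction of rows missing in this column
missing_share = df["score"].isnull().mean()

# Drop the affected rows only when the share is below the 5% threshold
if missing_share < 0.05:
    df = df.dropna(subset=["score"])
```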

8
Q

When is it appropriate to delete columns in a dataset?

A

-If 50% or more of a given column's values are missing, best practice suggests deleting the column
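
A sketch of applying the 50% rule with pandas. The DataFrame and the column names price and notes are hypothetical:

```python
import pandas as pd

# Hypothetical data: "notes" is 75% missing, "price" is 25% missing
df = pd.DataFrame({
    "price": [9.5, None, 12.0, 8.0],
    "notes": [None, None, "ok", None],
})

# Fraction of missing values per column
missing_share = df.isnull().mean()

# Drop the columns at or above the 50% threshold
cols_to_drop = missing_share[missing_share >= 0.5].index
df = df.drop(columns=cols_to_drop)
```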

9
Q

What is the code/function to determine the proportion of rows missing in a column?

A

variablepercentage = round(df.variable.isnull().sum() / len(df) * 100, 2)
print(f"{variablepercentage}%")
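
The same percentage can be computed for every column at once. This is a sketch assuming pandas, with made-up data and column names:

```python
import pandas as pd

# Hypothetical data: "age" has 1 of 4 values missing, "city" has 2 of 4
df = pd.DataFrame({
    "age": [22.0, None, 35.0, 41.0],
    "city": ["Paris", None, None, "Rome"],
})

# isnull().mean() is the fraction of missing rows per column
missing_pct = (df.isnull().mean() * 100).round(2)
print(missing_pct)
```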

10
Q

What function can be used to remove duplicates in a dataset?

A

df.drop_duplicates()

11
Q

What function can be used to check how many null values are in a dataset?

A

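df.isnull().sum()

This returns the count of null values per column. df.isnull().sum().sum() gives the total for the whole dataset.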

12
Q

What is the function to fill null values for a specific variable while imputing using the median?

A

df['Variable'] = df['Variable'].fillna(df['Variable'].median())

13
Q

What function is used to check for duplicated values?

A

df.duplicated()

The result shows True or False for each row. True marks a row that duplicates an earlier row.

14
Q

What code displays the rows that are duplicated across all variables?

A

df[df.duplicated()]
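
The duplicate checks can be sketched together on a small example, assuming pandas. The data is made up for illustration:

```python
import pandas as pd

# Hypothetical data: the third row repeats the first
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [30, 25, 30]})

flags = df.duplicated()              # one True/False flag per row
duplicate_rows = df[flags]           # only the rows flagged as duplicates
deduplicated = df.drop_duplicates()  # keeps the first occurrence of each row
```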

15
Q

How do we remove whitespace in strings?

A

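df['Variable'].str.strip()

This removes leading and trailing whitespace only. Assign the result back to the column to keep the change.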

16
Q

If you want to change string values to all lowercase, all uppercase, or title case, what functions do you use?

A

df['Variable'].str.lower()

OR

df['Variable'].str.upper()

OR

df['Variable'].str.title()
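
A short sketch combining whitespace stripping with a case change, assuming pandas. The city column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical data with stray whitespace and mixed case
df = pd.DataFrame({"city": ["  new york", "BOSTON ", " chicago "]})

# .str methods return a new Series, so assign the result back to the column
df["city"] = df["city"].str.strip().str.title()

print(df["city"].tolist())
```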