Data Cleaning Flashcards
Name 4 of the most common data cleaning techniques, ordered from most to least important
-Checking for null values
-Checking for duplicates
-Checking for datatypes
-Checking for outliers
Name 5 of the most used datatypes
-int64 (Numeric/Integer)
-object (Character/String)
-float64 (Numeric/Decimal)
-bool (Binary item/Boolean)
-datetime64 (Date and time values)
What does it mean for us to check our datatypes in data cleaning?
It means verifying that each column's assigned datatype matches the values it actually stores. For example, if a column is typed as object but contains only integer values, the object datatype is incorrect and the column should be converted to int64.
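Example (a minimal sketch, assuming a hypothetical DataFrame whose age column was read in as strings; the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical data: 'age' holds whole numbers but was stored as strings
df = pd.DataFrame({"age": ["25", "31", "47"], "name": ["Ann", "Bob", "Cy"]})

# Inspect the datatype currently assigned to each column
print(df.dtypes)

# Convert the column to a numeric datatype; with the default errors="raise",
# any value that is not a valid number stops the conversion with an error
df["age"] = pd.to_numeric(df["age"])
print(df["age"].dtype)
```

If some values might not be valid numbers, errors="coerce" turns them into NaN instead of raising, so they can then be handled as missing values.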
What does it mean to check for inconsistencies in a dataset?
It means verifying that the unique values within a column are consistent. For example, a country column that records the United States of America as both "US" and "United States" uses two different labels for the same country.
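Example (a minimal sketch, assuming a hypothetical country column; the mapping of variants to one label is an assumption the analyst has to decide on):

```python
import pandas as pd

# Hypothetical data: the same country is written three different ways
df = pd.DataFrame({"country": ["US", "United States", "Canada", "USA"]})

# List the distinct values to spot inconsistent labels
print(df["country"].unique())

# Map every variant to one agreed label (this mapping is an assumption)
country_map = {"US": "United States", "USA": "United States"}
df["country"] = df["country"].replace(country_map)

# Confirm that only consistent labels remain
print(df["country"].value_counts())
```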
When do we impute by the median?
We impute by the median when the distribution is skewed, because the median is less affected by extreme values than the mean
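Example (a minimal sketch with made-up income values, showing how a single extreme value pulls the mean but not the median):

```python
import pandas as pd

# Hypothetical right-skewed data: one very large income and one missing value
income = pd.Series([30_000, 32_000, 35_000, 38_000, 1_000_000, None])

print(income.mean())    # pulled upward by the extreme value
print(income.median())  # stays near the typical values
print(income.skew())    # a positive value indicates right skew

# Fill the missing value with the median rather than the mean
income_filled = income.fillna(income.median())
```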
What are all the techniques we can use to impute missing values in a dataset?
-Univariate Imputation (Imputing by the mean, median, mode)
-Backward fill and forward fill imputation (Usually for time series data ordered by date or time)
-Imputing using moving average or sliding window values (Usually for time series data ordered by date or time)
-Imputing by specific value if we’re confident in what the value will be
-Imputation by supervised learning models (KNN, linear regression, random forest, decision trees, etc.)
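Example (a minimal sketch of the fill-based techniques above, assuming a hypothetical daily series with gaps; the window size of 3 is an arbitrary choice, and model-based imputation is left out because it needs an additional library):

```python
import pandas as pd

# Hypothetical daily readings with two gaps
s = pd.Series(
    [10.0, None, 14.0, None, 18.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = s.ffill()        # carry the last known value forward
backward = s.bfill()       # pull the next known value backward
constant = s.fillna(0.0)   # a specific value chosen by the analyst

# Moving average: mean of the available values in a centered 3-day window,
# used only at the positions where the original value is missing
window_mean = s.rolling(window=3, center=True, min_periods=1).mean()
moving_avg = s.fillna(window_mean)
```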
When is it appropriate to delete rows in a dataset?
-If, for a given column, the number of missing values is less than 5% of the number of rows in the dataset
When is it appropriate to delete columns in a dataset?
-If 50% or more of the values in a given column are missing, best practice suggests deleting the column
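Example (a minimal sketch that applies the 5% and 50% rules of thumb to a hypothetical 100-row dataset; the column names and the amounts of missing data are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with 100 rows and three numeric columns
df = pd.DataFrame({
    "a": np.arange(100, dtype=float),
    "b": np.arange(100, dtype=float),
    "c": np.arange(100, dtype=float),
})
df.loc[:1, "a"] = np.nan    # 2 of 100 rows missing (2%)
df.loc[:59, "b"] = np.nan   # 60 of 100 rows missing (60%)

# Fraction of missing values per column
missing_share = df.isnull().mean()

# Drop columns where 50% or more of the values are missing
cols_to_drop = missing_share[missing_share >= 0.50].index
df = df.drop(columns=cols_to_drop)

# Drop rows with missing values in columns below the 5% threshold
low_missing = missing_share[(missing_share > 0) & (missing_share < 0.05)].index
df = df.dropna(subset=list(low_missing))

print(df.shape)
```

Columns that fall between the two thresholds are left alone here; those are the candidates for imputation.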
What is the code/function to determine the proportion of rows missing in a column?
variablepercentage = round(df.variable.isnull().sum() / len(df) * 100, 2)
print(f"{variablepercentage}%")
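Example (a minimal sketch that computes the same percentage for every column at once, assuming a hypothetical DataFrame; it relies on the mean of a Boolean column being the fraction of True values):

```python
import pandas as pd

# Hypothetical data with missing values in two of three columns
df = pd.DataFrame({
    "x": [1, None, 3, 4],
    "y": [None, None, "c", "d"],
    "z": [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing rows per column
missing_pct = (df.isnull().mean() * 100).round(2)

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```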
What function can be used to remove duplicates in a dataset?
df.drop_duplicates()
What function can be used to check how many null values are in a dataset?
df.isnull().sum()
What is the function to fill null values for a specific variable while imputing using the median?
df['Variable'] = df['Variable'].fillna(df['Variable'].median())
What function is used to check for duplicated values?
df.duplicated()
The result is a Boolean Series that shows True for each row that repeats an earlier row and False for every other row.
What code displays the rows that are duplicated across all variables?
df[df.duplicated()]
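Example (a minimal sketch of the duplicate checks on a hypothetical DataFrame; keep and subset are optional parameters of duplicated, and the data is made up):

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Rome", "Oslo", "Oslo", "Rome"],
})

# Count the rows that repeat an earlier row
n_duplicates = df.duplicated().sum()

# keep=False marks every copy, including the first occurrence
all_copies = df[df.duplicated(keep=False)]

# subset limits the comparison to the listed columns
by_city = df[df.duplicated(subset=["city"])]

# Remove the repeated rows, keeping the first occurrence of each
deduped = df.drop_duplicates()
```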
How do we remove leading and trailing whitespace in strings?
df['Variable'].str.strip()
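Example (a minimal sketch on a hypothetical column; str.strip returns a new Series, so the result is assigned back to keep the change):

```python
import pandas as pd

# Hypothetical data with stray spaces before and after the text
df = pd.DataFrame({"Variable": ["  apple", "banana  ", " cherry "]})

# Strip whitespace from both ends and store the cleaned values
df["Variable"] = df["Variable"].str.strip()

print(df["Variable"].tolist())
```

str.lstrip and str.rstrip work the same way but strip only the left or right side.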