Cleaning Data Flashcards
What is data cleansing?
Removing/Fixing problem data
What are some example data problems?
Duplicates
Empty rows
Abbreviations
Scale inconsistency
Typos
Missing values
Trailing spaces
Incomplete cells
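A minimal sketch of fixing three of the problems above (trailing spaces, empty rows, duplicates). The function name `clean_rows` and the sample rows are made up for illustration; real data usually needs more care.

```python
def clean_rows(rows):
    """Strip whitespace, drop empty rows, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Remove leading/trailing spaces from every cell.
        stripped = tuple(cell.strip() for cell in row)
        # Skip rows where every cell is empty.
        if not any(stripped):
            continue
        # Skip rows already seen (exact duplicates after stripping).
        if stripped in seen:
            continue
        seen.add(stripped)
        cleaned.append(stripped)
    return cleaned


# Hypothetical sample data: one trailing-space duplicate and one empty row.
raw = [
    ("Alice ", "NY"),
    ("Alice", "NY"),
    ("", ""),
    ("Bob", "CA  "),
]
cleaned = clean_rows(raw)
print(cleaned)
```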
If there is ______ data in the data set, we can fix it, delete it, etc., as appropriate
invalid
An ______ is a data point that’s far away from others
outlier
Outliers can occur in ______ data and should not immediately be thrown away
valid
How can we identify outliers?
Look at how far a point is from the mean, often measured as how many standard deviations away it is
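A rough sketch of the standard-deviation check. The threshold of 2 standard deviations and the sample numbers are assumptions chosen for illustration, not a fixed rule.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical sample: one value sits far from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = find_outliers(data)
print(outliers)
```

A flagged point may still be valid data, so this only tells us where to look.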
What’s the first thing to do when trying to find outliers?
Plot the data and look for problems by eye
What are 4 things we can do with outliers?
Leave it
Remove it
Remove the value and treat it as a missing value
Impute it
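The four options above, sketched on a plain list. The function `handle_outlier`, its strategy names, and the choice to impute with the mean of the other values are all assumptions for illustration.

```python
def handle_outlier(values, index, strategy):
    """Apply one of four strategies to the outlier at position `index`."""
    result = list(values)  # work on a copy
    if strategy == "leave":
        return result
    if strategy == "remove":
        # Drop the whole data point.
        del result[index]
        return result
    if strategy == "missing":
        # Keep the position but mark the value as missing.
        result[index] = None
        return result
    if strategy == "impute":
        # Replace with the mean of the other values (one possible choice).
        others = [v for i, v in enumerate(values) if i != index]
        result[index] = sum(others) / len(others)
        return result
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical sample with the outlier at position 3.
data = [10, 12, 11, 95]
for strategy in ("leave", "remove", "missing", "impute"):
    print(strategy, handle_outlier(data, 3, strategy))
```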
What is imputation?
Replacing values with a calculated plausible value
What are some drawbacks of imputation?
Can introduce bias
Can introduce fake trends
Can wash out real trends
What are common ways to impute?
Average nearby values
Average of known data
Linear regression of nearby values
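A sketch of the three approaches above, with missing values marked as `None`. The function names, the `window` parameter, and the sample series are assumptions for illustration.

```python
def impute_mean(values):
    """Replace missing values with the average of all known data."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def impute_neighbors(values):
    """Replace missing values with the average of the adjacent known values."""
    result = list(values)
    for i, v in enumerate(values):
        if v is None:
            near = [values[j] for j in (i - 1, i + 1)
                    if 0 <= j < len(values) and values[j] is not None]
            if near:
                result[i] = sum(near) / len(near)
    return result


def impute_linear(values, window=2):
    """Fit a least-squares line to up to `window` known points on each side."""
    result = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = [(j, values[j]) for j in range(i - 1, -1, -1)
                if values[j] is not None][:window]
        right = [(j, values[j]) for j in range(i + 1, len(values))
                 if values[j] is not None][:window]
        points = left + right
        if len(points) < 2:
            continue  # not enough nearby data to fit a line
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in points)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
        slope = sxy / sxx
        # Evaluate the fitted line at the missing position.
        result[i] = mean_y + slope * (i - mean_x)
    return result


# Hypothetical series with one gap and one large value at the end.
series = [1.0, 2.0, None, 4.0, 5.0, 12.0]
by_mean = impute_mean(series)
by_neighbors = impute_neighbors(series)
by_linear = impute_linear(series)
print(by_mean[2], by_neighbors[2], by_linear[2])
```

In this sample the average of all known data is pulled upward by the large last value, while the two nearby-value methods follow the local trend. That difference is one way imputation can introduce bias.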
What is entity resolution or record linkage?
The process of finding multiple values that actually refer to the same entity
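A very simple sketch of record linkage: normalize each name and group the ones that end up identical. The normalization rules and the sample company names are assumptions for illustration; real entity resolution usually also needs fuzzy matching.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())


def group_entities(names):
    """Group raw names that normalize to the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


# Hypothetical records: three spellings of one company plus another company.
records = ["Acme Corp.", "acme corp", "ACME  Corp", "Widget Co"]
groups = group_entities(records)
print(groups)
```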