Data Cleaning Flashcards
A dataset must have these 5 characteristics
Validity
Accuracy
Completeness
Consistency
Uniformity
Four common issues in Data Cleaning
Unnecessary Data
Missing Data
Irregular Data
Inconsistent Data
Three types of Unnecessary Data
Uninformative/Repetitive
Irrelevant
Duplicates
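A minimal sketch of spotting the three types above, using only the Python standard library. The records, the column names, and the 95% threshold for "uninformative" are assumptions made for illustration; which columns count as irrelevant is a judgment call that depends on the question being asked.

```python
from collections import Counter

# Hypothetical records; the column names are made up for illustration.
rows = [
    {"id": 1, "country": "US", "source": "web", "age": 34},
    {"id": 2, "country": "US", "source": "web", "age": 51},
    {"id": 2, "country": "US", "source": "web", "age": 51},  # exact duplicate
    {"id": 3, "country": "US", "source": "app", "age": 27},
]

def drop_duplicates(records):
    """Keep the first occurrence of each identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def uninformative_columns(records, threshold=0.95):
    """Columns where a single value covers at least `threshold` of the rows."""
    flagged = []
    for col in records[0]:
        top_count = Counter(r[col] for r in records).most_common(1)[0][1]
        if top_count / len(records) >= threshold:
            flagged.append(col)
    return flagged

def drop_columns(records, columns):
    """Return copies of the records without the given columns."""
    return [{k: v for k, v in r.items() if k not in columns} for r in records]

deduped = drop_duplicates(rows)              # Duplicates
repetitive = uninformative_columns(deduped)  # Uninformative/Repetitive
irrelevant = ["source"]                      # Irrelevant: assumed here, chosen by the analyst
cleaned = drop_columns(deduped, set(repetitive) | set(irrelevant))
```

Here `country` is flagged because every row holds the same value, so it carries no information for the analysis.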
PII (Personally Identifiable Information)
Any information that can identify an individual user; should be removed
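One simple way to remove PII is to drop the identifying fields before the data is used. This is only a sketch: the field names in `PII_FIELDS` are hypothetical, and a real list has to come from reviewing the actual dataset.

```python
# Hypothetical set of identifying fields; adjust to the dataset at hand.
PII_FIELDS = {"name", "email", "phone"}

def strip_pii(record, pii_fields=PII_FIELDS):
    """Return a copy of the record without the identifying fields."""
    return {k: v for k, v in record.items() if k not in pii_fields}

user = {"name": "Ada Example", "email": "ada@example.com", "plan": "pro", "visits": 12}
clean_user = strip_pii(user)
```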
3 ways to classify missing data
Missing Completely at Random
Missing at Random
Missing not at Random
Missing Completely at Random
Missingness has no relationship to any of the data, observed or unobserved
Missing at Random
Missingness is related to the observed data but not to the unobserved (missing) values
Missing not at Random
Missingness is related to the unobserved data (the missing values themselves), even after accounting for the observed data
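The three mechanisms can be made concrete with a toy simulation in which `income` goes missing with a probability that depends on nothing (MCAR), on the observed `age` (MAR), or on `income` itself (MNAR). The records, probabilities, and cutoffs below are all made up for illustration.

```python
import random

def missing_probability(row, mechanism):
    """Chance that `income` goes missing for this row (illustrative numbers)."""
    if mechanism == "MCAR":
        return 0.3                                    # same for every row
    if mechanism == "MAR":
        return 0.6 if row["age"] < 30 else 0.1        # depends on observed age
    if mechanism == "MNAR":
        return 0.6 if row["income"] > 60000 else 0.1  # depends on the hidden value
    raise ValueError(f"unknown mechanism: {mechanism}")

def mask_income(rows, mechanism, seed=0):
    """Return copies of the rows with `income` randomly set to None."""
    rng = random.Random(seed)
    masked = []
    for row in rows:
        hide = rng.random() < missing_probability(row, mechanism)
        masked.append({**row, "income": None if hide else row["income"]})
    return masked

# Hypothetical people; `age` is always observed, `income` may be hidden.
people = [
    {"age": 24, "income": 38000},
    {"age": 29, "income": 72000},
    {"age": 41, "income": 55000},
    {"age": 58, "income": 91000},
]
mnar_sample = mask_income(people, "MNAR")
```

Under MNAR the pattern cannot be recovered from the observed columns alone, which is why it is the hardest case to handle.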
How can you handle missing data?
Drop feature
Impute the value
Flag the missing info
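The three options above can be sketched as follows. The records, the 70% drop threshold, the `_missing` suffix, and the choice of the median as the imputed value are assumptions; other thresholds and imputation strategies are equally valid.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "fax": None},
    {"age": None, "income": 61000, "fax": None},
    {"age": 27, "income": None, "fax": None},
    {"age": 45, "income": 58000, "fax": "555-0100"},
]

def missing_share(records, column):
    """Fraction of records where the column is missing."""
    return sum(r[column] is None for r in records) / len(records)

# 1. Drop feature: remove columns that are mostly empty (assumed cutoff: 70%).
to_drop = [c for c in rows[0] if missing_share(rows, c) > 0.7]
keep = [c for c in rows[0] if c not in to_drop]

# Medians are computed from the observed values only.
medians = {c: median(r[c] for r in rows if r[c] is not None) for c in keep}

cleaned = []
for r in rows:
    new_row = {}
    for c in keep:
        # 2. Flag the missing info before it is overwritten.
        new_row[c + "_missing"] = r[c] is None
        # 3. Impute the value with the column median.
        new_row[c] = medians[c] if r[c] is None else r[c]
    cleaned.append(new_row)
```

Flagging before imputing keeps a record of which values were filled in, so a later analysis can still tell real observations from imputed ones.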
Irregular Data
Outliers: can be found with the IQR rule (values more than 1.5 × IQR below Q1 or above Q3)
Data contradicting business or domain knowledge, findings and data from reliable sources, or intuition
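A minimal sketch of the IQR rule with the standard library. The sample values are made up, and the multiplier of 1.5 is the conventional choice rather than a fixed requirement. Note that `statistics.quantiles` is one of several ways to compute quartiles, so the exact fences can differ slightly between tools.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one obviously extreme value.
ages = [12, 13, 14, 15, 15, 16, 17, 18, 95]
outliers = iqr_outliers(ages)
```

A flagged value is a candidate for review, not automatically an error; domain knowledge decides whether it is removed, corrected, or kept.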
Inconsistent Data
Data not in a consistent form or syntax, for example:
Capitalization
Formats
Misspelled words
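Each of the three inconsistencies above can be handled with a small normalizing function. This is a sketch under assumptions: the `CITY_FIXES` lookup table and the accepted `DATE_FORMATS` are hypothetical and would need to be built from the values actually seen in the data.

```python
from datetime import datetime

# Hypothetical corrections for known misspellings (keys are lowercase).
CITY_FIXES = {"new yrok": "new york", "chcago": "chicago"}
# Assumed set of date formats present in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

def clean_city(value):
    """Fix capitalization, stray whitespace, and known misspellings."""
    normalized = " ".join(value.strip().lower().split())
    normalized = CITY_FIXES.get(normalized, normalized)
    return normalized.title()

def clean_date(value):
    """Convert a date in any accepted format to ISO 8601, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

city = clean_city("  NEW   yrok ")
date = clean_date("03/07/2021")
```

Returning `None` for unparseable dates, instead of guessing, leaves those rows to be treated as missing data.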