T3: Data cleaning Flashcards
WHY DOES YOUR DATA HAVE ERRORS?
1) Data entry errors
2) Measurement errors
3) Distillation errors
4) Data integration errors
WHAT TO DO ABOUT THEM
* Data entry interface design (before)
* Organizational management (before)
* Automated data auditing and cleaning (after)
* Exploratory data analysis and cleaning (after)
* Not the focus of today, but read about it.
EXPLORATORY DATA ANALYSIS
- Quantitative data – e.g., outlier detection
- Categorical data – e.g., managing misspellings and typos, mapping different categories to a uniform namespace
- Identifiers (IDs) – e.g., inconsistency detection
HOW TO START: 1) Explore your data (see the sketch below):
- Check out your data structure, types of variables, max/min, categories
- Look at how many missing values (NULLs, NaNs, 0s) you have in each column
- Look at your distributions (for continuous variables)
- Look at your frequencies (for discrete variables)
- Do they make sense?
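A minimal first-pass sketch in base R. The data frame df and its columns age and status are hypothetical stand-ins for your own data:

df <- data.frame(
  age    = c(23, 31, 36, 150, 67, NA),
  status = c("student", "student", "employed", "student", NA, "student")
)

str(df)       # data structure and variable types
summary(df)   # descriptive stats: min/max, mean, quartiles, NA counts

colSums(is.na(df))   # missing values per column (0s coded as "missing" need a separate check)

hist(df$age)       # distribution of a continuous variable
table(df$status)   # frequencies of a discrete variable -- do they make sense?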
OUTLIERS
2) For continuous variables – are there any outliers, i.e., values that are far from the center? For example, your average age is 36, but you also have an observation with age 150, while the second highest observation is 67. You can get a sense of this from the descriptive statistics table (i.e., by looking at the min/max and average values).
In R you can check this in two ways: a boxplot, or a formal outlier test (see the sketch below).
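Both checks in a short sketch. The data are hypothetical, and grubbs.test() from the outliers package is just one possible choice of formal test:

library(outliers)   # assumed installed; provides grubbs.test()

age <- c(23, 31, 36, 36, 42, 67, 150)   # hypothetical ages

boxplot(age, main = "Age")   # visual check: points beyond the whiskers are candidates
boxplot.stats(age)$out       # the same candidates, as values

grubbs.test(age)             # formal test: is the most extreme value (150) an outlier?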
WHAT TO DO WITH MISSING DATA
1) Test whether it is missing at random
2) Consider:
* Do you want to impute it, e.g., using the average or the median?
* Should you delete the observation entirely?
* Should you simply not use that column?
Concretely, you have to identify (read more in Chapter 5, Section 5.4, of Business Analytics: Data Science for Business Problems; see the sketch after this list):
1. Which features have missing values;
2. The extent of the missingness;
3. The reasons for the missingness; and
4. What to do about them.
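A sketch of steps 1, 2, and 4 in base R, on a hypothetical data frame with a partially missing income column. For step 3, a formal check such as Little's MCAR test (available, e.g., via the naniar package) can complement domain knowledge:

df <- data.frame(
  age    = c(23, 31, 36, 42, 67),
  income = c(42000, NA, 58000, NA, 61000)
)

# 1) Which features have missing values, and 2) to what extent?
n_missing <- colSums(is.na(df))
n_missing[n_missing > 0]              # affected columns only
round(colMeans(is.na(df)) * 100, 1)   # percent missing per column

# 4) Option A: impute with the median (or the mean)
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
# Option B: delete the incomplete observations instead
# df <- df[!is.na(df$income), ]
# Option C: do not use the column at all
# df$income <- NULL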
NON-CONTINUOUS DATA
IDs – integrity check: you should not have any duplicate IDs (see the sketch below).
Categorical variables – check for unusually high frequencies. It could be a data error, but even if it is not, it might affect subsequent analysis: e.g., if 99% of your respondents are students vs. full-time employees, this might not be a good target variable/feature to include later.
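Both checks in R, on hypothetical id and occupation columns:

df <- data.frame(
  id         = c(101, 102, 103, 103, 104),
  occupation = c("student", "student", "student", "student", "employed")
)

any(duplicated(df$id))                      # integrity check: TRUE means duplicate IDs exist
df[df$id %in% df$id[duplicated(df$id)], ]   # inspect all rows that share an ID

sort(table(df$occupation), decreasing = TRUE)   # frequency check
prop.table(table(df$occupation))                # as shares -- a 99% category is a red flag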
CLOSING REMARKS
- Document, document, document.
- NEVER save on top of the original data. That being said, you might not always want to repeat the entire data cleaning process, so good practice does say you should save your cleaned-up dataset in your workspace once you are happy with the final product (see the sketch below).
- Data cleaning is tedious. The only thing more tedious than data cleaning is doing data cleaning after you have dedicated 2 days to fitting a really cool model, only to realize your results are gibberish.
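One way to follow that practice, with hypothetical file paths: read the raw file, clean a copy, and write the result to a new file so the original is never overwritten:

raw   <- read.csv("data/survey_raw.csv")   # the raw file stays untouched
clean <- raw                               # ...data cleaning steps go here...
write.csv(clean, "data/survey_clean.csv", row.names = FALSE)
saveRDS(clean, "data/survey_clean.rds")    # .rds additionally preserves R types exactly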
%>%
It’s called a pipe. Read about it here:
https://www.datacamp.com/tutorial/pipe-r-tutorial#intro
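In short, x %>% f(y) is read as f(x, y): the left-hand side becomes the first argument of the function on the right, so nested calls turn into a left-to-right chain. A small illustration using the magrittr pipe:

library(magrittr)

round(mean(sqrt(c(1, 4, 9, 16))), 1)                  # nested calls read inside-out

c(1, 4, 9, 16) %>% sqrt() %>% mean() %>% round(1)     # the same computation, left to right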
TIDYR
1 Each variable is saved in its own column
2 Each observation is saved in its own row
3 Each “type” of observation is stored in a single table (here, storms)
gather()
Collapses multiple columns into two columns
1 A key column that contains the former column names
2 A value column that contains the former column cells
gather(cases, "year", "n", 2:4)
cases = the data frame to reshape
"year" = name of the new key column (a character string)
"n" = name of the new value column (a character string)
2:4 = names or numeric indexes of the columns to collapse
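A runnable version of that call, on a small hypothetical cases data frame in wide format (one column per year). Note that in current tidyr, gather() is superseded by pivot_longer(), but it still works:

library(tidyr)

cases <- data.frame(
  country = c("FR", "DE", "US"),
  "2011"  = c(7000, 5800, 15000),
  "2012"  = c(6900, 6000, 14000),
  "2013"  = c(7000, 6200, 13000),
  check.names = FALSE
)

gather(cases, "year", "n", 2:4)   # columns 2:4 collapse into key "year" and value "n"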
spread() vs gather()
Make observations from variables with gather()
Make variables from observations with spread()
TIDYR
A package that reshapes the layout of data sets
Make observations from variables with gather()
Make variables from observations with spread()
Split and merge columns with unite() and separate() (see the sketch below)
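Sketches of the remaining verbs, on a small hypothetical long-format data frame (the shape gather() produces):

library(tidyr)

long <- data.frame(
  country = rep(c("FR", "DE"), each = 2),
  year    = rep(c("2011", "2012"), times = 2),
  n       = c(7000, 6900, 5800, 6000)
)

spread(long, year, n)   # make variables from observations: long back to wide

united <- unite(long, "country_year", country, year, sep = "_")          # merge two columns
separate(united, country_year, into = c("country", "year"), sep = "_")   # and split them again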