T3: Data cleaning Flashcards

1
Q

WHY DOES YOUR DATA HAVE
ERRORS

A

1) Data entry errors
2) Measurement errors
3) Distillation errors
4) Data integration errors

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

WHAT TO DO ABOUT THEM

A

Data entry interface design (before)
* Organizational management (before)
* Automated data auditing and cleaning (after)*
* Exploratory data analysis and cleaning (after)
* Not the focus of today, but read about it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

EXPLORATORY DATA ANALYSIS

A
  • Quantitative data – e.g., outlier detection
  • Categorical data – e.g., managing misspellings and typos,
    mapping different categories to a uniform namespace
  • Identifiers (IDs) – e.g., inconsistency detection
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

HOW TO START: 1) Explore your data:

A
  • Check out your data structure, type of variables, max/min,
    categories
  • Look at how many missing values (NULLs, NaNs, 0s) you
    have on each column
  • Look at your distributions (for cont. variables)
  • Look at your frequencies (for discrete variables)
  • Do they make sense?
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

OUTLIERS

A

2) For continuous variables – are there any outliers? i.e. Some
values that are far from the center? i.e., your average age is 36,
but you also have observations with age 150, while the second
highest observation is 67. You can get a sense of this from the
descriptive statistics table (i.e., if you look at min/max and
average values).

In R you can check this two ways > boxplot and outlier test.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

WHAT TO DO WITH MISSING DATA

A

1) Test whether it is missing at random

2) Consider:
* Do you want to impute it, either using an average or median
* Should you delete the observation entirely
* Should you simply not use that column

Concretely, you have to identify (read more in Chapter 5, 5.4, in Business Analytics: Data Science for Business Problems)
1. Which features have missing values;
2. The extent of the missingness;
3. The reasons for the missingness; and
4. What to do about them.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

NON-CONTINUOUS DATA

A

IDs- integrity check – you should not have any duplicate IDs…

Categorical variables – check for unusually high frequencies – it could be a data
error, but even if it is not, this might impact subsequent data analysis, e.g., if
99% of your respondents are students, vs full time employees, this might not
be a good target variable/feature to include later

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

CLOSING REMARKS

A
  • Document/document/document
  • NEVER save on top of the original data. That being said, you might not
    always want to repeat the entire data cleaning process, so good practice
    does say you should save your cleaned up dataset in your workspace once
    you are happy with the final product.
  • Data cleaning is tedious. The only more tedious than data cleaning is doing
    data cleaning after you have dedicated 2 days to fitting a really cool model
    only to realize your results are gibberish.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

%>%

A

It’s called a pipe.. Read about it here:

https://www.datacamp.com/tutorial/pipe-r-tutorial#intro

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

TIDYR

A

1 Each variable is saved in its own column

2 Each observation is saved in its own row

3 Each “type” of observation stored in a single table (here, storms)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Gather ()

A

Collapses multiple columns into two columns

1 A key column that contains the former column names

2 A value column that contains the former column cells

gather (cases, “year”, “n”, 2:4)

cases = data frame to reshape
year = name of the key column (a character string)
n = name of the new value column (a character string)
2:4 = names or numeric indexes of columns to collapse

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Spread () vs gather ()

A

Make observations from variables with gather()

Make variables from observations with spread()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

TIDYR

A

A package that reshapes the layout of data sets

Make observations from variables with gather()

Make variables from observations with spread()

Split and merge column with unite() and separate()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly