T3: Data cleaning Flashcards
WHY DOES YOUR DATA HAVE ERRORS?
1) Data entry errors
2) Measurement errors
3) Distillation errors
4) Data integration errors
WHAT TO DO ABOUT THEM
* Data entry interface design (before)
* Organizational management (before)
* Automated data auditing and cleaning (after)
* Exploratory data analysis and cleaning (after)
* Not the focus of today, but read about it.
EXPLORATORY DATA ANALYSIS
- Quantitative data – e.g., outlier detection
- Categorical data – e.g., managing misspellings and typos, mapping different categories to a uniform namespace
- Identifiers (IDs) – e.g., inconsistency detection
HOW TO START: 1) Explore your data (see the sketch below):
- Check out your data structure, types of variables, max/min, categories
- Look at how many missing values (NULLs, NaNs, 0s) you have in each column
- Look at your distributions (for continuous variables)
- Look at your frequencies (for discrete variables)
- Do they make sense?
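A minimal first-pass sketch in base R. The data frame df and its columns age and status are hypothetical stand-ins for your own data:

df <- data.frame(
  age    = c(23, 31, 36, 150, 67, NA),
  status = c("student", "student", "employed", "student", NA, "student")
)

str(df)       # data structure and variable types
summary(df)   # descriptive stats: min/max, mean, quartiles, NA counts

colSums(is.na(df))   # missing values per column (0s coded as "missing" need a separate check)

hist(df$age)       # distribution of a continuous variable
table(df$status)   # frequencies of a discrete variable -- do they make sense?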
OUTLIERS
2) For continuous variables – are there any outliers, i.e., values that are far from the center? For example, your average age is 36, but you also have an observation with age 150, while the second highest observation is 67. You can get a sense of this from the descriptive statistics table (i.e., by looking at the min/max and average values).
In R you can check this in two ways: a boxplot, or a formal outlier test (see the sketch below).
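Both checks in a short sketch. The data are hypothetical, and grubbs.test() from the outliers package is just one possible choice of formal test:

library(outliers)   # assumed installed; provides grubbs.test()

age <- c(23, 31, 36, 36, 42, 67, 150)   # hypothetical ages

boxplot(age, main = "Age")   # visual check: points beyond the whiskers are candidates
boxplot.stats(age)$out       # the same candidates, as values

grubbs.test(age)             # formal test: is the most extreme value (150) an outlier?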
WHAT TO DO WITH MISSING DATA
1) Test whether it is missing at random
2) Consider:
* Do you want to impute it, e.g., using the average or the median?
* Should you delete the observation entirely?
* Should you simply not use that column?
Concretely, you have to identify (read more in Chapter 5, Section 5.4, of Business Analytics: Data Science for Business Problems; see the sketch after this list):
1. Which features have missing values;
2. The extent of the missingness;
3. The reasons for the missingness; and
4. What to do about them.
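A sketch of steps 1, 2, and 4 in base R, on a hypothetical data frame with a partially missing income column. For step 3, a formal check such as Little's MCAR test (available, e.g., via the naniar package) can complement domain knowledge:

df <- data.frame(
  age    = c(23, 31, 36, 42, 67),
  income = c(42000, NA, 58000, NA, 61000)
)

# 1) Which features have missing values, and 2) to what extent?
n_missing <- colSums(is.na(df))
n_missing[n_missing > 0]              # affected columns only
round(colMeans(is.na(df)) * 100, 1)   # percent missing per column

# 4) Option A: impute with the median (or the mean)
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
# Option B: delete the incomplete observations instead
# df <- df[!is.na(df$income), ]
# Option C: do not use the column at all
# df$income <- NULL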
NON-CONTINUOUS DATA
IDs – integrity check: you should not have any duplicate IDs (see the sketch below).
Categorical variables – check for unusually high frequencies. It could be a data error, but even if it is not, it might affect subsequent analysis: e.g., if 99% of your respondents are students vs. full-time employees, this might not be a good target variable/feature to include later.
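Both checks in R, on hypothetical id and occupation columns:

df <- data.frame(
  id         = c(101, 102, 103, 103, 104),
  occupation = c("student", "student", "student", "student", "employed")
)

any(duplicated(df$id))                      # integrity check: TRUE means duplicate IDs exist
df[df$id %in% df$id[duplicated(df$id)], ]   # inspect all rows that share an ID

sort(table(df$occupation), decreasing = TRUE)   # frequency check
prop.table(table(df$occupation))                # as shares -- a 99% category is a red flag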
CLOSING REMARKS
- Document, document, document.
- NEVER save on top of the original data. That being said, you might not always want to repeat the entire data cleaning process, so good practice does say you should save your cleaned-up dataset in your workspace once you are happy with the final product (see the sketch below).
- Data cleaning is tedious. The only thing more tedious than data cleaning is doing data cleaning after you have dedicated 2 days to fitting a really cool model, only to realize your results are gibberish.
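One way to follow that practice, with hypothetical file paths: read the raw file, clean a copy, and write the result to a new file so the original is never overwritten:

raw   <- read.csv("data/survey_raw.csv")   # the raw file stays untouched
clean <- raw                               # ...data cleaning steps go here...
write.csv(clean, "data/survey_clean.csv", row.names = FALSE)
saveRDS(clean, "data/survey_clean.rds")    # .rds additionally preserves R types exactly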
%>%
It’s called a pipe. Read about it here:
https://www.datacamp.com/tutorial/pipe-r-tutorial#intro
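In short, x %>% f(y) is read as f(x, y): the left-hand side becomes the first argument of the function on the right, so nested calls turn into a left-to-right chain. A small illustration using the magrittr pipe:

library(magrittr)

round(mean(sqrt(c(1, 4, 9, 16))), 1)                  # nested calls read inside-out

c(1, 4, 9, 16) %>% sqrt() %>% mean() %>% round(1)     # the same computation, left to right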
TIDYR
1 Each variable is saved in its own column
2 Each observation is saved in its own row
3 Each “type” of observation is stored in a single table (here, storms)
gather()
Collapses multiple columns into two columns
1 A key column that contains the former column names
2 A value column that contains the former column cells
gather(cases, "year", "n", 2:4)
cases = the data frame to reshape
"year" = name of the new key column (a character string)
"n" = name of the new value column (a character string)
2:4 = names or numeric indexes of the columns to collapse
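A runnable version of that call, on a small hypothetical cases data frame in wide format (one column per year). Note that in current tidyr, gather() is superseded by pivot_longer(), but it still works:

library(tidyr)

cases <- data.frame(
  country = c("FR", "DE", "US"),
  "2011"  = c(7000, 5800, 15000),
  "2012"  = c(6900, 6000, 14000),
  "2013"  = c(7000, 6200, 13000),
  check.names = FALSE
)

gather(cases, "year", "n", 2:4)   # columns 2:4 collapse into key "year" and value "n"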
spread() vs gather()
Make observations from variables with gather()
Make variables from observations with spread()
TIDYR
A package that reshapes the layout of data sets
Make observations from variables with gather()
Make variables from observations with spread()
Split and merge columns with unite() and separate() (see the sketch below)
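Sketches of the remaining verbs, on a small hypothetical long-format data frame (the shape gather() produces):

library(tidyr)

long <- data.frame(
  country = rep(c("FR", "DE"), each = 2),
  year    = rep(c("2011", "2012"), times = 2),
  n       = c(7000, 6900, 5800, 6000)
)

spread(long, year, n)   # make variables from observations: long back to wide

united <- unite(long, "country_year", country, year, sep = "_")          # merge two columns
separate(united, country_year, into = c("country", "year"), sep = "_")   # and split them again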