Cleaning Data Flashcards

1
Q

What is data cleansing?

A

Removing or fixing problem data

2
Q

What are some example data problems?

A
Duplicates
Empty rows
Abbreviations
Scale inconsistency
Typos
Missing values
Trailing spaces
Incomplete cells
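
A few of the problems listed above (trailing spaces, empty rows, duplicates) can be handled mechanically. The following is a minimal sketch using only the Python standard library; the function name `clean_rows` and the sample rows are illustrative assumptions, not part of the original cards.

```python
def clean_rows(rows):
    """Strip whitespace, drop empty rows, and drop exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for row in rows:
        # Remove leading/trailing spaces from every cell.
        stripped = tuple(cell.strip() for cell in row)
        # Skip rows where every cell is empty after stripping.
        if not any(stripped):
            continue
        # Skip rows already seen (duplicates that differ only by spacing count too).
        if stripped in seen:
            continue
        seen.add(stripped)
        cleaned.append(stripped)
    return cleaned


# Hypothetical sample data for illustration only.
raw = [
    ("Alice ", "London"),
    ("Alice", "London"),
    ("", "  "),
    ("Bob", "Paris  "),
]
cleaned = clean_rows(raw)
```

Typos, abbreviations, and scale inconsistencies usually need domain knowledge and are not covered by this sketch.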
3
Q

If there is ______ data in the data set, we can fix it, delete it, etc., as appropriate

A

invalid

4
Q

An ______ is a data point that’s far away from others

A

outlier

5
Q

Outliers can occur in ______ data and should not immediately be thrown away

A

valid

6
Q

How can we identify outliers?

A

Look at how far a point is from the mean, often measured as how many standard deviations away from the mean it is
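
The standard-deviation check can be sketched as follows. This is an illustration under assumptions: the function name `find_outliers`, the threshold of 2 standard deviations, and the sample readings are all hypothetical choices, not values from the cards.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs at least two values
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical readings for illustration only.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = find_outliers(readings)
```

Note that an extreme point inflates the standard deviation itself, so a strict threshold can fail to flag it in a small sample; the threshold is a judgment call.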

7
Q

What’s the first thing to do when trying to find outliers?

A

Plot the data and look for problems by eye

8
Q

What are 4 things we can do with outliers?

A

Leave it
Remove it
Remove the value and treat it as a missing value
Impute it
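
The four options above can be sketched side by side. This is a minimal illustration, assuming the outlier's position is already known; the function name `handle_outlier`, the strategy labels, and the choice of the mean of the other values as the imputed value are assumptions made for the example.

```python
from statistics import mean


def handle_outlier(values, index, strategy):
    """Return a new list with the outlier at `index` handled by the named strategy."""
    before, after = values[:index], values[index + 1:]
    if strategy == "leave":
        return list(values)             # keep the point unchanged
    if strategy == "remove":
        return before + after           # drop the whole data point
    if strategy == "missing":
        return before + [None] + after  # keep the slot, mark the value as missing
    if strategy == "impute":
        # Assumed imputation rule: mean of all the other values.
        return before + [mean(before + after)] + after
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical data; the value at index 3 is the suspected outlier.
data = [10, 12, 11, 95, 13]
left = handle_outlier(data, 3, "leave")
removed = handle_outlier(data, 3, "remove")
missing = handle_outlier(data, 3, "missing")
imputed = handle_outlier(data, 3, "impute")
```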

9
Q

What is imputation?

A

Replacing values with a calculated plausible value

10
Q

What are some drawbacks of imputation?

A

Can introduce bias
Introduce fake trends
Wash out real trends

11
Q

What are common ways to impute?

A

Average nearby values
Average of known data
Linear regression of nearby values
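
The three approaches above might look like this for a one-dimensional series with a gap marked as `None`. This is a sketch under assumptions: the function names, the window size of 2, and the sample series are hypothetical, and the regression is an ordinary least-squares line fitted by hand.

```python
from statistics import mean


def impute_neighbor_mean(values, i):
    """Average the immediate neighbours (assumes both exist and are known)."""
    return (values[i - 1] + values[i + 1]) / 2


def impute_global_mean(values):
    """Average of all known (non-missing) values."""
    return mean(v for v in values if v is not None)


def impute_linear_fit(values, i, window=2):
    """Fit a least-squares line to known points within `window` positions of i,
    then evaluate it at i. Needs at least two known points in the window."""
    lo, hi = max(0, i - window), min(len(values), i + window + 1)
    pts = [(x, values[x]) for x in range(lo, hi) if values[x] is not None]
    x_bar = mean(x for x, _ in pts)
    y_bar = mean(y for _, y in pts)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pts)
             / sum((x - x_bar) ** 2 for x, _ in pts))
    return y_bar + slope * (i - x_bar)


# Hypothetical series with one missing value at index 2.
series = [2.0, 4.0, None, 8.0, 10.0, 30.0]
by_neighbors = impute_neighbor_mean(series, 2)
by_global = impute_global_mean(series)
by_regression = impute_linear_fit(series, 2)
```

In this sample the global average is pulled upward by the large final value, while the two local methods follow the nearby trend, which illustrates how the choice of method can introduce bias.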

12
Q

What is entity resolution or record linkage?

A

The process of finding multiple values that actually refer to the same entity
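
One simple way to illustrate this is to normalize each value and group the values that share a normalized form. This is only a sketch: the abbreviation table, the function names, and the sample addresses are hypothetical, and real record linkage often relies on fuzzy matching rather than exact keys.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {"st": "street", "rd": "road"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace, expand abbreviations."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def group_entities(records):
    """Group raw values whose normalized forms are identical."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record)].append(record)
    return dict(groups)


# Hypothetical records for illustration only.
records = ["12 Main St.", "12 main street", " 12  MAIN ST ", "40 Oak Rd"]
groups = group_entities(records)
```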
