Collect, Cleanse, and Optimize Your Data Flashcards

1
Q

Considerations for Preparing Data

A
  • Optimize data for analysis (accuracy, completeness, quantity, variety, relevance)
  • Correct issues at the source
  • Consolidate multiple sources
  • Ensure observations are independent
  • Calculate durations for date values
  • Maximize interpretability for insights
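
As a rough illustration of the "calculate durations for date values" point, the sketch below turns two raw dates into a single numeric duration, which is usually easier to analyze than the dates themselves. The field names and dates are hypothetical, not taken from the deck.

```python
from datetime import date


def duration_days(start: date, end: date) -> int:
    """Return the number of whole days from start to end."""
    return (end - start).days


# Hypothetical record: the dates an account was opened and closed.
opened = date(2023, 1, 15)
closed = date(2023, 3, 1)

# One numeric field replaces two date fields.
tenure_days = duration_days(opened, closed)
print(tenure_days)
```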
2
Q

Address Common Data Issues

A
  • Extreme values and outliers
  • Missing values
  • Incorrect values
  • Standardize categorical values
  • Skewed data
  • High-cardinality fields
  • Binary outcomes and boolean fields
  • Ordinal variables
  • Duplicate, redundant, or highly correlated variables
3
Q

Extreme values and outliers

A

Confirm whether they are relevant and real
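
One possible way to find candidates for that review is the 1.5 x IQR rule, sketched below. The rule, the `flag_outliers` helper, and the sample values are assumptions for illustration. The flagged values are only suspects: someone still has to confirm whether they are relevant and real.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order totals with one suspicious entry.
order_totals = [42, 45, 47, 50, 51, 53, 55, 58, 60, 950]

# Candidates for manual review, not automatic deletion.
suspects = flag_outliers(order_totals)
print(suspects)
```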

4
Q

Missing values

A

Impute a likely value using a mean or distribution

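Imputing the mean can shrink your standard deviation, because every filled-in value sits exactly at the center

Drawing values from the observed distribution preserves the spread and is more reliable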

Remove records with missing values only if they don’t impact the analysis
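
The difference between the two imputation approaches can be sketched as follows. The data is made up, and drawing random values from the observed records is only one assumed way to "impute from a distribution".

```python
import random
import statistics

# Hypothetical field with 8 observed values and 4 missing ones.
observed = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 18.0, 13.0]
n_missing = 4

# Option 1: fill every gap with the mean of the observed values.
mean_value = statistics.mean(observed)
mean_imputed = observed + [mean_value] * n_missing

# Option 2: fill each gap with a random draw from the observed values,
# which keeps the filled-in values spread out like the real ones.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled_imputed = observed + [rng.choice(observed) for _ in range(n_missing)]

std_observed = statistics.pstdev(observed)
std_mean_imputed = statistics.pstdev(mean_imputed)

# Mean imputation always pulls the standard deviation below the original.
print(std_observed, std_mean_imputed)
```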

5
Q

Incorrect values

A

Predictive algorithms assume the input data is correct

Remove incorrect values, or replace them with corrected or average values

6
Q

Standardize categorical values

A

Ensure consistent category names

Remove spelling variations and fix typos

Use labels that are meaningful, recognizable, and easy to interpret
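
A minimal sketch of this cleanup, assuming a hand-maintained lookup table: the `CANONICAL` mapping and the payment-method labels are hypothetical, and a real dataset would need its own list of known variants.

```python
# Hypothetical lookup from known variants (lowercased) to one clean label.
CANONICAL = {
    "credit card": "Credit Card",
    "creditcard": "Credit Card",
    "cc": "Credit Card",
    "paypal": "PayPal",
    "pay pal": "PayPal",
    "cash": "Cash",
}


def standardize(label: str) -> str:
    """Trim, lowercase, and collapse whitespace, then look up the clean label."""
    key = " ".join(label.strip().lower().split())
    # Unknown labels are returned trimmed but otherwise unchanged.
    return CANONICAL.get(key, label.strip())


raw = ["Credit Card", " credit card", "CC", "PayPal", "pay  pal", "cash"]
cleaned = [standardize(v) for v in raw]
print(cleaned)
```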

7
Q

Skewed data

A

Continuous Variables:

 - review the distributions, central tendency, and spread
 - check whether they are approximately normally distributed
 - use the Box-Cox transformation to reduce skew in positive values

Categorical Variables:

 - use a frequency table and bar chart to understand the distribution across categories
8
Q

Continuous variables

A

Have an infinite number of values between any two values

Can be meaningfully divided into smaller increments, including fractions and decimals

Often measured on a scale

Can be numeric or date/time

Ex: height, weight, temperature

9
Q

Categorical variables

A

Contain a finite number of categories or distinct groups

Might not have a logical order

Ex: gender, material type, payment method

10
Q

Box-Cox Transformation

A

Transforms data so it closely resembles a normal distribution

Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox transformation lets you run a broader range of tests. The transformation requires strictly positive values
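
The transformation itself is y = (x^λ - 1) / λ for λ ≠ 0 and y = ln(x) for λ = 0. The sketch below applies it with a fixed λ = 0 to made-up, right-skewed data and compares skewness before and after. The `box_cox` and `skewness` helpers are illustrative. In practice λ is usually estimated from the data (for example, `scipy.stats.boxcox` can do this) rather than fixed by hand.

```python
import math
import statistics


def box_cox(values, lam):
    """Apply the Box-Cox transformation for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed data with a long upper tail.
incomes = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# lambda = 0 is the log transform, chosen by hand for this sketch.
transformed = box_cox(incomes, lam=0)

print(skewness(incomes), skewness(transformed))
```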

11
Q

High-cardinality fields

A

Categorical attributes that contain many distinct values such as names, zip codes, or account numbers

Rarely used in predictive modeling

Including them vastly increases the dimensionality of the dataset and makes it difficult for most algorithms to build accurate prediction models
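
One rough way to spot such fields is to compare the number of distinct values with the number of rows. The sample rows and the 0.9 cutoff below are assumptions for illustration, not a standard threshold.

```python
def cardinality_ratio(values):
    """Share of values that are distinct: 1.0 means every row is unique."""
    return len(set(values)) / len(values)


# Hypothetical rows: account_id is unique per row, plan repeats.
rows = [
    {"account_id": "A-1001", "plan": "basic"},
    {"account_id": "A-1002", "plan": "pro"},
    {"account_id": "A-1003", "plan": "basic"},
    {"account_id": "A-1004", "plan": "basic"},
    {"account_id": "A-1005", "plan": "pro"},
]

THRESHOLD = 0.9  # assumed cutoff for this sketch

ratios = {
    field: cardinality_ratio([row[field] for row in rows])
    for field in rows[0]
}
high_cardinality = [f for f, r in ratios.items() if r >= THRESHOLD]
print(ratios, high_cardinality)
```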

12
Q

Binary outcomes and boolean fields

A

If the binary values are numeric, convert them to text values

This will make it easier to interpret charts and explanations in the resulting insights
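
As a small sketch of that conversion, assuming a hypothetical churn flag where 1 means the customer left:

```python
# Hypothetical labels for a numeric churn flag.
LABELS = {1: "Churned", 0: "Retained"}

churn_flags = [1, 0, 0, 1, 0]

# Text labels read more clearly in charts and explanations than 0 and 1.
churn_labels = [LABELS[flag] for flag in churn_flags]
print(churn_labels)
```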

13
Q

Ordinal variables

A

Numerical scores on an arbitrary scale to show ranking in a set of data points (such as low, medium, high)

These are problematic for predictive models

Predictive algorithms assume that the variable is measured on an interval or ratio scale, so the arbitrary spacing between ranks can mislead them

Transform them into continuous or categorical variables
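
One assumed way to treat an ordinal field as categorical is one-hot encoding, sketched below with a hypothetical `priority` field. Each level becomes its own 0/1 indicator, so no spacing between ranks is implied.

```python
# Hypothetical ordinal levels for a priority field.
LEVELS = ["low", "medium", "high"]


def one_hot(level):
    """Return one 0/1 indicator per level; exactly one is set to 1."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    return {f"priority_{name}": int(level == name) for name in LEVELS}


priorities = ["low", "high", "medium"]
encoded = [one_hot(p) for p in priorities]
print(encoded)
```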

14
Q

Duplicate, redundant, or highly correlated variables

A

Minimize variables that carry the same information

Collinearity occurs when two or more predictor variables are highly correlated

Exclude variables that are highly correlated or from the same reporting hierarchy
- Ex: city and state, since everyone who lives in the city of Tampa also lives in the state of Florida
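
A quick check for numeric fields that carry the same information is the Pearson correlation coefficient. The sketch below uses made-up measurements in which height is stored twice, in centimeters and in inches. The `pearson` helper is written out by hand for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical fields: the same height in two units, plus weight.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]
weight_kg = [52.0, 70.0, 61.0, 90.0, 77.0]

r_duplicate = pearson(height_cm, height_in)  # redundant pair
r_related = pearson(height_cm, weight_kg)    # related, but not redundant
print(r_duplicate, r_related)
```

A coefficient at or very near 1 (or -1) suggests that one of the two fields can be dropped without losing information.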
