Collect, Cleanse, and Optimize Your Data Flashcards
Considerations for Preparing Data
- Optimize data for analysis (accuracy, completeness, quantity, variety, relevance)
- Correct issues at the source
- Consolidate multiple sources
- Ensure observations are independent
- Calculate durations for date values
- Maximize interpretability for insights
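As a rough illustration of the "calculate durations for date values" item above, two raw dates can be collapsed into a single number of days. The record layout and the field names (`opened`, `closed`, `days_open`) are assumptions made up for this sketch.

```python
from datetime import date

# Hypothetical records; the field names are illustrative only.
accounts = [
    {"opened": date(2023, 1, 15), "closed": date(2023, 3, 1)},
    {"opened": date(2022, 11, 30), "closed": date(2023, 1, 4)},
]

for row in accounts:
    # Subtracting two dates gives a timedelta; .days is the whole-day duration.
    row["days_open"] = (row["closed"] - row["opened"]).days

durations = [row["days_open"] for row in accounts]
print(durations)
```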
Address Common Data Issues
- Extreme values and outliers
- Missing values
- Incorrect values
- Standardize categorical values
- Skewed data
- High-cardinality fields
- Binary outcomes and boolean fields
- Ordinal variables
- Duplicate, redundant, or highly correlated variables
Extreme values and outliers
Confirm whether they are relevant and real
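One possible way to find candidates for review is the 1.5 x IQR rule, sketched below. The rule, the sample values, and the variable names are assumptions for illustration and are not taken from the source. The sketch only flags values, since deciding whether a value is relevant and real still needs a human check.

```python
import statistics

# Illustrative sample; 95 is the suspicious value.
order_totals = [10, 12, 11, 13, 12, 11, 10, 95]

# Quartiles from the standard library (default "exclusive" method).
q1, _median, q3 = statistics.quantiles(order_totals, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged for review, not deleted.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [v for v in order_totals if v < lower_fence or v > upper_fence]
print(flagged)
```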
Missing values
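Impute a likely value using the mean or a value drawn from the observed distribution
Imputing the mean can artificially reduce the standard deviation
Imputing from the distribution is more reliable because it preserves the spread of the data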
Remove records with missing values only if they don’t impact the analysis
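The difference between the two imputation approaches can be sketched as follows. The sample values and the fixed random seed are assumptions chosen for the example. Filling gaps with the mean adds points with zero deviation, so the standard deviation shrinks. Sampling from the observed values is one simple way to impute from the distribution.

```python
import random
import statistics

# None marks a missing value in this illustrative sample.
values = [20.0, None, 22.0, 25.0, None, 30.0, 43.0, None]
observed = [v for v in values if v is not None]

# Option 1: replace every gap with the mean of the observed values.
mean_value = statistics.mean(observed)
mean_imputed = [mean_value if v is None else v for v in values]

# Option 2: replace each gap with a random draw from the observed values.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled_imputed = [rng.choice(observed) if v is None else v for v in values]

sd_before = statistics.stdev(observed)
sd_after_mean = statistics.stdev(mean_imputed)
print(sd_before, sd_after_mean)
```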
Incorrect values
Predictive algorithms assume the input data is correct
Remove incorrect values or replace them with corrected or average values
Standardize categorical values
Ensure consistent category names
Remove spelling variations and fix typos
Use labels that are meaningful, recognizable, and easy to interpret
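A minimal sketch of standardizing category names, assuming a hand-maintained lookup table. The `CANONICAL` mapping, the `standardize` helper, and the payment-method values are all hypothetical.

```python
# Hypothetical lookup from normalized spellings to one canonical label.
CANONICAL = {
    "credit card": "Credit Card",
    "creditcard": "Credit Card",
    "cc": "Credit Card",
    "paypal": "PayPal",
    "pay pal": "PayPal",
}

def standardize(raw):
    # Normalize case and whitespace before looking up the canonical label.
    key = " ".join(raw.strip().lower().split())
    # Unknown values pass through trimmed so they can be reviewed later.
    return CANONICAL.get(key, raw.strip())

raw_values = ["Credit Card", " credit card", "CC", "PayPal", "pay  pal", "Wire"]
cleaned = [standardize(v) for v in raw_values]
print(cleaned)
```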
Skewed data
Continuous Variables:
- review the distributions, central tendency, and spread
- check whether they are normally distributed
- use the Box-Cox transformation to fix skewed values
Categorical Variables:
- use a frequency table and bar chart to understand distributions of each category
Continuous variables
Have an infinite number of values between any two values
Can be meaningfully divided into smaller increments, including fractions and decimals
Often measured on a scale
Can be numeric or date/time
Ex: height, weight, temperature
Categorical variables
Contain a finite number of categories or distinct groups
Might not have a logical order
Ex: gender, material type, payment method
Box-Cox Transformation
Transforms data so it closely resembles a normal distribution
Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox transformation lets you run a broader range of tests
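A minimal sketch of the Box-Cox formula, assuming a fixed lambda chosen by hand: y = (x^lambda - 1) / lambda, or ln(x) when lambda is 0. The transformation is only defined for strictly positive values. The `box_cox` and `skewness` helpers and the sample data are illustrative. In practice lambda is usually estimated from the data, for example with `scipy.stats.boxcox`.

```python
import math
import statistics

def box_cox(values, lam):
    # Box-Cox is only defined for strictly positive inputs.
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def skewness(values):
    # Population skewness: the mean of the cubed z-scores.
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
transformed = box_cox(right_skewed, 0)  # lambda = 0 is the log transform

print(skewness(right_skewed), skewness(transformed))
```

On this sample the skewness moves closer to zero after the transform, which is the effect the card describes.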
High-cardinality fields
Categorical attributes that contain many distinct values such as names, zip codes, or account numbers
Rarely used in predictive modeling
Including them vastly increases the dimensionality of the dataset and makes it difficult for most algorithms to build accurate prediction models
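One rough way to spot high-cardinality fields is to count distinct values per field and compare the count to the number of rows. The rows, field names, and the 0.5 cutoff below are assumptions for illustration. The source does not give a threshold.

```python
# Illustrative rows; field names are hypothetical.
rows = [
    {"account_id": "A-001", "zip": "33601", "plan": "basic"},
    {"account_id": "A-002", "zip": "33602", "plan": "basic"},
    {"account_id": "A-003", "zip": "33601", "plan": "pro"},
    {"account_id": "A-004", "zip": "33603", "plan": "basic"},
    {"account_id": "A-005", "zip": "33604", "plan": "pro"},
    {"account_id": "A-006", "zip": "33601", "plan": "basic"},
]

def distinct_counts(records):
    # Number of distinct values observed in each field.
    return {f: len({r[f] for r in records}) for f in records[0]}

MAX_DISTINCT_RATIO = 0.5  # assumed cutoff, tune for your data

counts = distinct_counts(rows)
high_cardinality = [
    field for field, n in counts.items() if n / len(rows) > MAX_DISTINCT_RATIO
]
print(counts, high_cardinality)
```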
Binary outcomes and boolean fields
If the binary values are numeric, convert them to text values
This will make it easier to interpret charts and explanations in the resulting insights
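A small sketch of converting numeric binary values to text. The `LABELS` mapping and the label wording are hypothetical. Pick names that match what the outcome means in your data.

```python
# Hypothetical labels for a 0/1 outcome field.
LABELS = {1: "Churned", 0: "Retained"}

raw_outcomes = [1, 0, 0, 1]

# Text labels read more clearly than 0 and 1 in charts and explanations.
labeled_outcomes = [LABELS[v] for v in raw_outcomes]
print(labeled_outcomes)
```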
Ordinal variables
Numeric scores on an arbitrary scale that show ranking within a set of data points (such as 1, 2, 3 for low, medium, high)
These are problematic for predictive models
Predictive algorithms assume that the variable is measured on an interval or ratio scale and can be misled by the arbitrary spacing between scores
Transform them into continuous or categorical variables
Duplicate, redundant, or highly correlated variables
Minimize variables that carry the same information
Collinearity occurs when two or more predictor variables are highly correlated
Exclude variables that are highly correlated or from the same reporting hierarchy
- Ex: people who live in the city of Tampa also live in the state of Florida, so city and state carry overlapping information
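A rough sketch of screening numeric predictors for collinearity with the Pearson correlation coefficient. The column names, the sample values, and the 0.9 cutoff are assumptions. The source does not say what counts as highly correlated.

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length numeric lists.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Illustrative columns; price_cents is price_usd in different units.
columns = {
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_cents": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "units": [3.0, 9.0, 4.0, 8.0, 5.0],
}

THRESHOLD = 0.9  # assumed cutoff for "highly correlated"

names = list(columns)
correlated_pairs = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(columns[a], columns[b])) > THRESHOLD
]
print(correlated_pairs)
```

For each flagged pair, keeping only one of the two variables removes the duplicated information.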