Collect, Cleanse, and Optimize Your Data Flashcards

Question 1

Q

Considerations for Preparing Data

Answer

A

Optimize data for analysis (accuracy, completeness, quantity, variety, relevance)
Correct issues at the source
Consolidate multiple sources
Ensure observations are independent
Calculate durations for date values
Maximize interpretability for insights
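
As a minimal sketch of the "calculate durations for date values" item, assuming hypothetical case records with `opened` and `closed` dates (these field names are illustrative and not from the deck), a pair of raw dates can be turned into one numeric duration that is easier to analyze:

```python
from datetime import date

# Hypothetical records: each case has an opened and a closed date.
cases = [
    {"id": "A", "opened": date(2023, 1, 10), "closed": date(2023, 1, 25)},
    {"id": "B", "opened": date(2023, 2, 1), "closed": date(2023, 2, 4)},
]

def add_duration_days(records):
    """Return copies of the records with a numeric days_open field."""
    return [{**r, "days_open": (r["closed"] - r["opened"]).days} for r in records]

prepared = add_duration_days(cases)
for row in prepared:
    print(row["id"], row["days_open"])
```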

Question 2

Q

Address Common Data Issues

Answer

A

Extreme values and outliers
Missing values
Incorrect values
Standardize categorical values
Skewed data
High-cardinality fields
Binary outcomes and boolean fields
Ordinal variables
Duplicate, redundant, or highly correlated variables

Question 3

Q

Extreme values and outliers

Answer

A

Confirm whether they are relevant and real
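
One possible way to find candidates for that review is the interquartile range (IQR) rule. This is a sketch under assumptions: the 1.5 multiplier is a common convention rather than something the deck specifies, and `order_sizes` is made-up data. The function only flags values; deciding whether each one is relevant and real still takes a human check.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] for manual review."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical order sizes with one suspicious entry.
order_sizes = [12, 14, 15, 15, 16, 17, 18, 19, 95]
suspects = flag_outliers(order_sizes)
print(suspects)
```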

Question 4

Q

Missing values

Answer

A

Impute a likely value using the mean or the observed distribution

Imputing the mean can understate the variable’s standard deviation

Sampling from the distribution preserves the spread and is more reliable

Remove records with missing values only if removing them doesn’t impact the analysis
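
The difference between the two imputation approaches can be sketched as follows. The `ages` column is made-up data, `None` stands in for a missing value, and drawing at random from the observed values is only one simple way to "use the distribution".

```python
import random
import statistics

# Hypothetical column with gaps; None marks a missing value.
ages = [23, 31, None, 45, 52, None, 38, 29, None, 61]
observed = [a for a in ages if a is not None]

def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    mean = statistics.mean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def impute_from_distribution(values, seed=0):
    """Replace each missing value with a random draw from the observed values."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pool = [v for v in values if v is not None]
    return [rng.choice(pool) if v is None else v for v in values]

mean_filled = impute_mean(ages)
sampled_filled = impute_from_distribution(ages)

print(statistics.stdev(observed))        # spread of the observed values
print(statistics.stdev(mean_filled))     # smaller: the mean adds no spread
print(statistics.stdev(sampled_filled))  # depends on which values were drawn
```

Mean imputation always pulls the standard deviation down because every filled value sits exactly at the center of the data.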

Question 5

Q

Incorrect values

Answer

A

Predictive algorithms assume the input data is correct

Remove incorrect values, or replace them with corrected or average values

Question 6

Q

Standardize categorical values

Answer

A

Ensure consistent category names

Remove spelling variations and fix typos

Use labels that are meaningful, recognizable, and easy to interpret
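
A small sketch of one way to do this: normalize case and whitespace, then map known variants to a single label. The `CANONICAL` table and the payment-method values are hypothetical examples, and a real mapping would be built from the variants actually found in the data.

```python
# Hypothetical mapping of known variants to one canonical label.
CANONICAL = {
    "credit card": "Credit Card",
    "creditcard": "Credit Card",
    "cc": "Credit Card",
    "paypal": "PayPal",
    "pay pal": "PayPal",
}

def standardize(value):
    """Collapse case and whitespace, then look up the canonical label."""
    key = " ".join(value.strip().lower().split())
    # Unknown values are returned trimmed so they can be reviewed later.
    return CANONICAL.get(key, value.strip())

raw = [" Credit  card ", "CC", "Pay Pal", "Wire"]
print([standardize(v) for v in raw])
```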

Question 7

Q

Skewed data

Answer

A

Continuous Variables:

 - review the distributions, central tendency, and spread
 - confirm they are approximately normally distributed
 - use the Box-Cox transformation to fix skewed values

Categorical Variables:

 - use a frequency table and bar chart to understand distributions of each category
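
Both checks can be sketched with the standard library. The skewness formula below is the simple moment-based version (other definitions apply small-sample corrections), and the function names are illustrative rather than taken from any particular tool.

```python
from collections import Counter
import statistics

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def frequency_table(values):
    """Return (category, count, share) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, n / total) for cat, n in counts.most_common()]

print(skewness([1, 1, 2, 2, 3, 20]))  # positive: one large value stretches the tail
print(frequency_table(["card", "card", "cash", "card"]))
```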

Question 8

Q

Continuous variables

Answer

A

Have an infinite number of values between any two values

Can be meaningfully divided into smaller increments, including fractions and decimals

Often measured on a scale

Can be numeric or date/time

Ex: height, weight, temperature

Question 9

Q

Categorical variables

Answer

A

Contain a finite number of categories or distinct groups

Might not have a logical order

Ex: gender, material type, payment method

Question 10

Q

Box-Cox Transformation

Answer

A

Transforms data so it closely resembles a normal distribution

Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox transformation lets you run a broader range of tests
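
For reference, the transformation itself is y = (x^lambda - 1) / lambda, with y = ln(x) when lambda is 0, and it is defined only for strictly positive values. The sketch below applies it with a lambda chosen by hand; in practice statistical libraries usually estimate lambda from the data, which this example does not attempt. The `revenue` values are made up.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox transform with a caller-supplied lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]  # limiting case of the formula
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical right-skewed values: each one is ten times the previous one.
revenue = [1.0, 10.0, 100.0, 1000.0]
logged = box_cox(revenue, 0)
print(logged)  # the transformed values are evenly spaced
```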

Question 11

Q

High-cardinality fields

Answer

A

Categorical attributes that contain many distinct values such as names, zip codes, or account numbers

Rarely used in predictive modeling

Including them vastly increases the dimensionality of the dataset and makes it difficult for most algorithms to build accurate prediction models
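
A quick way to spot such fields is to count distinct values per column. In this sketch the `customers` rows and the 0.5 ratio threshold are assumptions chosen for illustration; a sensible cutoff depends on the dataset.

```python
def cardinality_report(rows, threshold=0.5):
    """Map each field to (distinct_count, is_high_cardinality)."""
    report = {}
    for field in rows[0]:
        distinct = len({row[field] for row in rows})
        # Flag fields where most rows carry their own distinct value.
        report[field] = (distinct, distinct / len(rows) > threshold)
    return report

# Hypothetical customer rows.
customers = [
    {"account_id": "A-001", "zip": "33601", "plan": "basic"},
    {"account_id": "A-002", "zip": "33602", "plan": "basic"},
    {"account_id": "A-003", "zip": "33601", "plan": "pro"},
    {"account_id": "A-004", "zip": "33603", "plan": "basic"},
    {"account_id": "A-005", "zip": "33604", "plan": "pro"},
    {"account_id": "A-006", "zip": "33605", "plan": "basic"},
]
report = cardinality_report(customers)
print(report)
```

If each distinct value became its own indicator column, `account_id` alone would add as many columns as there are rows, which is the dimensionality problem the card describes.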

Question 12

Q

Binary outcomes and boolean fields

Answer

A

If the binary values are numeric, convert them to text values

This will make it easier to interpret charts and explanations in the resulting insights
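
A minimal sketch of the conversion, assuming a hypothetical opportunity outcome coded as 1 and 0 (the "Won" and "Lost" labels are illustrative):

```python
# Hypothetical labels; pick wording that matches the business outcome.
OUTCOME_LABELS = {1: "Won", 0: "Lost"}

def label_binary(values, labels=OUTCOME_LABELS):
    """Replace numeric or boolean flags with readable text labels."""
    return [labels[v] for v in values]

print(label_binary([1, 0, True]))
```

Python treats `True` and `1` as the same dictionary key, so boolean flags go through the same mapping.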

Question 13

Q

Ordinal variables

Answer

A

Numerical scores on an arbitrary scale to show ranking in a set of data points (such as low, medium, high)

These are problematic for predictive models

Predictive algorithms assume that the variable is on an interval or ratio scale and can be misled by the arbitrary spacing between scores

Transform them into continuous or categorical variables
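
One way to do the categorical version of that transformation is sketched below. The 1 to 3 priority score, its labels, and the `priority_` column prefix are all hypothetical. Turning each score into its own indicator column means a model no longer assumes that "High" is exactly three times "Low".

```python
# Hypothetical 1-3 priority score and the labels it stands for.
SCORE_LABELS = {1: "Low", 2: "Medium", 3: "High"}

def to_categorical(scores):
    """Replace each numeric score with its category label."""
    return [SCORE_LABELS[s] for s in scores]

def one_hot(labels):
    """Expand each label into 0/1 indicator columns, one per category."""
    categories = list(SCORE_LABELS.values())
    return [{f"priority_{c}": int(label == c) for c in categories} for label in labels]

labels = to_categorical([3, 1, 2])
print(labels)
print(one_hot(labels))
```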

Question 14

Q

Duplicate, redundant, or highly correlated variables

Answer

A

Minimize variables that carry the same information

Collinearity occurs when two or more predictor variables are highly correlated

Exclude variables that are highly correlated or from the same reporting hierarchy

Ex: city and state (people who live in the city of Tampa also live in the state of Florida)
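
For numeric predictors, a pairwise Pearson correlation check is one simple way to find candidates to exclude. This is a sketch with made-up columns, and the 0.9 threshold is an assumption, not a rule from the deck. It does not catch hierarchy relationships between categorical fields such as city and state, which need to be identified from knowledge of the data.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.9):
    """Return column-name pairs whose absolute correlation meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# Hypothetical predictors: the two price columns carry the same information.
columns = {
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_eur": [9.0, 18.0, 27.0, 36.0, 45.0],
    "units": [5.0, 3.0, 6.0, 2.0, 4.0],
}
pairs = correlated_pairs(columns)
print(pairs)
```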

(14 cards)