Predictive Data Analytics Flashcards
CRISP-DM
- Business Issue Understanding
a. What decision needs to be made? (What question needs to be answered)
b. What information is needed to inform that decision?
c. What type of analysis will provide the information to inform that decision? (If the information is unavailable, it will be a predictive analysis.)
- Data Understanding
a. What data is needed?
b. What data is available?
- Data Preparation
a. Gather
b. Cleanse
c. Format
d. Blend
e. Sample
- Analysis/Modeling
a. Choose the appropriate method
- Validation
a. Observe the key results of the model
b. Ensure the results make sense within the context of the business problem
c. Determine whether to proceed to the next step or return to a previous phase
d. Repeat as many times as necessary
- Presentation/Visualization
a. Determine the best method of presenting insights based on the analysis
b. Determine the best method of presenting insights based on the audience
c. Make sure the amount of information shared is not overwhelming
d. Use the results to tell a story to the audience
e. For more complex analyses, you may want to walk the audience through the analytical problem solving process
f. Always reference the data sources used
g. Make sure your analysis supports the decisions that need to be made
Types of Non-predictive Data Analysis
- Geospatial
- Segmentation
- Aggregation
- Descriptive
Common methods of Descriptive Statistics
Mean, Median, Mode, Standard Deviation, Interquartile Range
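As a quick illustration, the five measures can be computed with Python's standard `statistics` module. This is only a sketch: the `tickets` list is made-up sample data, and the interquartile range uses the module's default ("exclusive") quantile method, which other tools may calculate slightly differently.

```python
import statistics

# Hypothetical sample: daily support tickets over nine days (illustrative values only).
tickets = [12, 15, 15, 18, 20, 22, 22, 22, 30]

mean = statistics.mean(tickets)       # arithmetic average
median = statistics.median(tickets)   # middle value of the sorted data
mode = statistics.mode(tickets)       # most frequent value
std_dev = statistics.stdev(tickets)   # sample standard deviation

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(tickets, n=4)
iqr = q3 - q1                         # interquartile range

print(mean, median, mode, std_dev, iqr)
```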
Which method should we use when we are trying to solve a predictive problem but don’t have any data (data poor)?
An experiment (A/B testing)
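One possible way to read the results of an A/B test is sketched below. The flashcard does not say how to evaluate the experiment, so the two-proportion z-test here is an assumed choice, and the function name and the conversion counts are hypothetical.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (rate_a, rate_b, z) for a two-proportion z-test.

    conv_* are the number of successes and n_* the group sizes.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled success rate under the assumption that A and B do not differ.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return rate_a, rate_b, z

# Hypothetical experiment: 1000 customers see version A, 1000 see version B.
rate_a, rate_b, z = two_proportion_z(conv_a=100, n_a=1000, conv_b=130, n_b=1000)

# |z| > 1.96 corresponds roughly to a 5% two-sided significance level.
significant = abs(z) > 1.96
print(rate_a, rate_b, z, significant)
```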
Numeric vs Non-numeric Predictive Analysis
Numeric: An outcome is a number (Regression Models)
Non-numeric: Determining the category something falls into (Classification models)
Target Variable
Target variables represent the outcome we are trying to predict. To select the right predictive model, we first determine whether the target variable is numeric or non-numeric. The specific type of numeric or non-numeric target variable then helps us select the appropriate model.
3 Types of Numeric variables
Continuous
A continuous variable is one that can take on all values in a range. For instance, your height can be measured down to many decimal places. We do not grow in even inch intervals.
Time-Based
A time-based numeric variable is one where you are trying to predict what will happen over time. This is often related to forecasting.
Count
Count variables are discrete, non-negative integers. They’re called count numbers because they’re used to analyze variables that you can count. As modeling these types of variables is not common in business, we won’t be covering this topic in this course.
Which method can be employed when the Target variable is Continuous?
Continuous models
Which method can be employed when the Target variable is Time-Based?
Time series analysis
Non-Numeric Variable
A non-numeric variable is often called categorical, because the variable takes on one of a discrete number of possible values or categories. Examples include whether an electronic device will fail before 1000 hours or not; whether a customer will pay on time, pay late, or default on a payment; or whether a store is classified as large, medium, or small.
Classification Models
Classification Models: Binary and Non-Binary
When modeling categorical variables, the number of possible outcomes is an important factor. If there are only two possible categorical outcomes, such as Yes or No, or True or False, then the variable can be described as Binary.
If there are more than two possible categorical outcomes, such as small, medium, or large, or pay on-time, pay late, or default on a payment, then the variable can be described as non-binary. The important take-away from this lesson is the ability to determine if you should use a classification model, and whether it should be a binary model or a non-binary model. Ben Burkholder will lead a course focused on classification models and will go into detail about these types of models.
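The selection logic from these cards can be summarized in a small helper. This is a sketch only: the function name `suggest_model_family`, its parameters, and the returned labels are hypothetical and simply restate the cards above.

```python
def suggest_model_family(target_kind, n_categories=None):
    """Map a target variable type to the model family named in the cards.

    target_kind: "continuous", "time-based", "count", or "categorical".
    n_categories: number of possible outcomes (categorical targets only).
    """
    if target_kind == "continuous":
        return "continuous model (e.g. linear regression)"
    if target_kind == "time-based":
        return "time series analysis"
    if target_kind == "count":
        return "count model (not covered in this course)"
    if target_kind == "categorical":
        if n_categories is None or n_categories < 2:
            raise ValueError("a categorical target needs at least two categories")
        if n_categories == 2:
            return "binary classification"
        return "non-binary classification"
    raise ValueError(f"unknown target kind: {target_kind!r}")

# Examples taken from the cards:
print(suggest_model_family("categorical", 2))  # device fails before 1000 hours or not
print(suggest_model_family("categorical", 3))  # pay on time, pay late, or default
```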
Linear Regression
Y = mX + b
Y = Target Variable
X = Predictor Variable
m = Slope of the line
b = Y-intercept
Target Variable
The target variable is the variable we are trying to understand and predict. It is also referred to as the dependent variable. In our example, we are trying to predict Y, or the average number of tickets.
Predictor Variable
Predictor variables are used to try to predict the target variable and are also known as independent variables. In the example there is just one predictor variable, X, or the number of employees. It is used to predict the average number of tickets.
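To make the slope and intercept concrete, here is a minimal least-squares sketch in plain Python. The `employees` and `tickets` values are invented (they lie exactly on a line so the result is easy to check by hand) and are not the data from the course example; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of Y = mX + b for one predictor variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data: X = number of employees, Y = average number of tickets.
employees = [10, 20, 30, 40, 50]
tickets = [25, 45, 65, 85, 105]

m, b = fit_line(employees, tickets)
predicted = m * 35 + b  # predicted average tickets for a company with 35 employees
print(m, b, predicted)
```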