Business Analysis Flashcards
What is the problem solving framework?
- Business issues understanding
- Data understanding
- Data preparation
- Analysis/Modeling
- Validation
- Presentation/visualisation
Pretty Faces Cosmetics wants to market to their top 3 customer segments. They need to create the customer segments, calculate how many customers fall into each one, and then determine which segment has the most customers. What kind of data analysis is needed?
This is a Segmentation and Aggregation analysis problem. First, the company wants to create groups of their customers, which is a segmentation problem. Next, they want to aggregate the data by group (count the customers in each segment) and select the top three segments for marketing purposes.
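The two steps above can be sketched in a few lines of Python. This is only an illustration: the customer records, the spend thresholds, and the names `assign_segment` and `segment_counts` are assumptions made up for the example, not part of the original card.

```python
from collections import Counter

# Hypothetical customer records (customer id, annual spend), invented for illustration.
customers = [
    ("c1", 40.0), ("c2", 250.0), ("c3", 90.0),
    ("c4", 600.0), ("c5", 120.0), ("c6", 75.0),
]

def assign_segment(annual_spend):
    """Segmentation step: place a customer in a group (assumed spend thresholds)."""
    if annual_spend < 100:
        return "low"
    if annual_spend < 500:
        return "medium"
    return "high"

# Aggregation step: count how many customers fall into each segment.
segment_counts = Counter(assign_segment(spend) for _, spend in customers)

# Rank segments by customer count, largest first, and keep the top three.
top_segments = segment_counts.most_common(3)
largest_segment = top_segments[0][0]
```

In practice the segments might come from a clustering method rather than fixed thresholds; the aggregation step stays the same either way.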
When we have a data-poor problem, what kind of testing do we use?
We use A/B testing if we have a data-poor problem.
When we have a data-rich problem, what kind of model do we use?
We use a numeric or a classification model if we have a data-rich problem.
To decide between the two, ask whether the outcome is numeric or non-numeric (a non-numeric outcome calls for classification). Classification outcomes are categories, eg will a customer pay on time, or default? Will an electronic device fail before 1000 hours of use?
Numeric = regression models
Classification = classification models
Name the three types of numeric variables
The three types of numeric variables are Continuous, Time Based, and Count.
Continuous variables = numbers that can take any value, including decimals, like height (1.84141 m), ie we don’t grow in even inch intervals. Uses Continuous analysis for research.
Time Based variables = predict what will happen over time (eg forecasting). Uses Time Series analysis for research.
Count variables = discrete non-negative integers, ie number of people: you can’t have 1.78 people.
For categorical outcomes (non-numeric), what are outcomes with two possible categories called?
In categorical analysis, these are called binary outcomes, eg yes or no.
For categorical outcomes (non-numeric), what are outcomes with more than two possible categories called?
In categorical analysis, these are called non-binary outcomes, eg:
Small, medium, large
Pay on time, pay late, not pay at all
Using historical production data, a manufacturer wants to know how many tricycles they will need to produce per month over the next six months.
Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?
Time based numeric
Step 1: The target outcome is the number of tricycles. Therefore, we should use a numeric model for this problem.
Step 2: Since we are looking to forecast the number of tricycles over time, we should use a time-based model.
A business wants to predict how many pizzas it will need, where the number of pizzas is not related to time.
Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?
Continuous numeric
Step 1: The target outcome is the number of pizzas. Therefore, we should use a numeric model for this problem.
Step 2: Since the number of pizzas is a continuous variable, and not related to time, we should use a continuous model.
A bank wants to predict whether a new customer will default on a loan, always pay on time, or sometimes pay using historical data of their clients.
Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?
Non-binary classification
Step 1: There are three possible target outcomes - pay on time, sometimes pay, or default. Therefore, this is a categorical outcome and we should use a Classification Model.
Step 2: Since there are three possible categorical outcomes, we should use a Non-Binary model to predict the outcome.
A marketing organization wants to predict whether someone is likely to redeem a coupon as they would like to minimize costs and only send coupons to people who are likely to use them.
Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?
Binary classification
Step 1: There are two possible target outcomes - the person will redeem the coupon, or the person will not redeem the coupon. Therefore, this is a categorical outcome and we should use a Classification Model.
Step 2: Since there are only two possible categorical outcomes, we should use a Binary model to predict the outcome.
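The two-step decision used in the worked examples above can be written as a small helper. This is a sketch only: the function name `choose_model` and its parameters are hypothetical, chosen for this illustration.

```python
def choose_model(outcome_is_numeric, forecast_over_time=False, n_categories=None):
    """Return the model family suggested by the two-step decision.

    Step 1: numeric outcome -> numeric model, otherwise classification model.
    Step 2: numeric -> time-based or continuous; classification -> binary or non-binary.
    """
    if outcome_is_numeric:
        return "time-based numeric" if forecast_over_time else "continuous numeric"
    if n_categories is None or n_categories < 2:
        raise ValueError("a classification outcome needs at least two categories")
    return "binary classification" if n_categories == 2 else "non-binary classification"

# The scenarios from the cards, expressed as calls to the helper.
tricycles = choose_model(outcome_is_numeric=True, forecast_over_time=True)
pizzas = choose_model(outcome_is_numeric=True, forecast_over_time=False)
loan = choose_model(outcome_is_numeric=False, n_categories=3)
coupon = choose_model(outcome_is_numeric=False, n_categories=2)
```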
What is the slope-intercept form?
The slope-intercept form is y = mx + b
- Where m is the slope and b is the y-intercept
- Identify the slope, m. This can be done by calculating the slope between two known points (x1, y1) and (x2, y2) on the line using the slope formula.
Slope formula: m = (y2 - y1) / (x2 - x1)
- Find the y-intercept, b. This can be done by substituting the slope and the coordinates of a point (x, y) on the line into the slope-intercept formula and then solving for b.
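As a minimal sketch of these two steps, the function below finds m and b for the line through two known points. The name `line_through` and the sample points are assumptions made for the example.

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = m*x + b through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    m = (y2 - y1) / (x2 - x1)  # slope formula
    b = y1 - m * x1            # substitute (x1, y1) into y = m*x + b and solve for b
    return m, b

# Example points chosen for illustration: (1, 3) and (3, 7).
m, b = line_through((1, 3), (3, 7))
```

For these points the slope is (7 - 3) / (3 - 1) = 2 and the intercept is 3 - 2 * 1 = 1, so the line is y = 2x + 1.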
y = mx + b
x = Predictor variable (aka independent variable, eg number of employees)
y = Target variable (aka dependent variable, eg number of tickets)
m = Slope of the line
b = y-intercept
eg
x = Number of employees (eg 51)
y = Average number of tickets submitted per week (eg 5)
Using the Excel SLOPE function, ‘=SLOPE(all of y column, all of x column)’,
you then get slope = 0.1833
This means that for every extra employee, the average number of tickets for that company goes up by 0.1833. Note that the size of the company allows us to predict the number of tickets that may be submitted each week. The bigger the company, the greater the number of tickets.
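For reference, the least-squares slope that Excel's SLOPE function returns can be computed by hand as below. The employee and ticket numbers are hypothetical, invented for this sketch; they are not the data behind the 0.1833 figure above, so the result differs.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: company size and average tickets per week.
employees = [10, 20, 30, 40, 50]
tickets = [2, 4, 5, 8, 9]

slope, intercept = least_squares_line(employees, tickets)

# Predicted weekly tickets for a company with 51 employees.
predicted = intercept + slope * 51
```

With this made-up data, each extra employee adds about 0.18 tickets per week.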
Explain correlation
Correlation (r) is between -1 and 1. A value close to -1 or 1 means there is a strong correlation (-1 negative, 1 positive). A value close to 0 means there is little or no linear relationship.
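A minimal sketch of how the (Pearson) correlation coefficient is calculated, assuming the usual definition. The function name `correlation` and the sample data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: company size and average tickets per week.
employees = [10, 20, 30, 40, 50]
tickets = [2, 4, 5, 8, 9]
r = correlation(employees, tickets)

# Points that lie exactly on a rising or falling line give r = 1 or r = -1.
r_up = correlation([1, 2, 3], [2, 4, 6])
r_down = correlation([1, 2, 3], [6, 4, 2])
```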
Explain the coefficient of determination (R²)
This tells us how well the data fits our line (the line in the linear model). R² is the proportion of the variance in the target variable that is explained by the model. An R² close to 1 means that nearly all of the variance in the target variable (ie the y axis, or the number of tickets) is explained by the model, ie the number of employees is a good predictor of the number of tickets. As a rule of thumb, an R² greater than 0.7 is considered a strong model, an R² of 0.5 or greater is usually pretty good, and an R² below 0.3 is not useful.
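The calculation can be sketched as one minus the ratio of unexplained variation to total variation. The data and the name `r_squared` are assumptions for this example.

```python
def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line fitted to the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: variation the line does not explain.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: all variation in the target variable.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: company size and average tickets per week.
employees = [10, 20, 30, 40, 50]
tickets = [2, 4, 5, 8, 9]
r2 = r_squared(employees, tickets)
```

For simple linear regression, this value equals the square of the correlation coefficient.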
What is Multiple Linear Regression?
Multiple linear regression is linear regression with more than one predictor variable (aka independent variable), ie more than one x value, where simple linear regression has only one.
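As a rough sketch, a model with two predictors has the form y = b + m1*x1 + m2*x2, and its coefficients can be fitted by least squares. The pure-Python solver below uses the normal equations; the function name `fit_multiple_linear` and the sample data are hypothetical, and in practice a statistics package would normally do this step.

```python
def fit_multiple_linear(rows, ys):
    """Least-squares fit of y = b + m1*x1 + ... + mk*xk; returns [b, m1, ..., mk]."""
    # Add a leading 1 to each row so the first coefficient is the intercept b.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Build the augmented normal-equation system (X^T X | X^T y).
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(X, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Made-up data generated from y = 1 + 2*x1 + 3*x2, so the fit should recover 1, 2, 3.
rows = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3)]
ys = [6, 8, 9, 13, 14]
coefficients = fit_multiple_linear(rows, ys)
```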