Business Analysis Flashcards

1
Q

What is the problem solving framework?

A
  1. Business issues understanding
  2. Data understanding
  3. Data preparation
  4. Analysis/Modeling
  5. Validation
  6. Presentation/visualisation
2
Q

Pretty Faces Cosmetics wants to market to their top 3 customer segments. They need to create the customer segments, calculate how many customers fall into each one, and then determine which segment has the most customers. What kind of data analysis is needed?

A

This is a Segmentation and Aggregation analysis problem. First, the company wants to create groups of their customers, which is a segmentation problem. Next, they want to aggregate data by each group (count the customers in each segment) and select the top three for marketing purposes.
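
The two steps can be sketched in a few lines of Python. This is only an illustration: the customer records, the spend thresholds, and the segment names are all made-up assumptions, not data from Pretty Faces Cosmetics.

```python
from collections import Counter

# Hypothetical customer records: (customer_id, annual_spend). Illustrative only.
customers = [
    ("c1", 120.0), ("c2", 950.0), ("c3", 40.0), ("c4", 610.0),
    ("c5", 75.0), ("c6", 300.0), ("c7", 20.0), ("c8", 1500.0),
]

def assign_segment(annual_spend):
    """Segmentation step: map a customer to a segment (assumed thresholds)."""
    if annual_spend < 100:
        return "low"
    if annual_spend < 500:
        return "medium"
    if annual_spend < 1000:
        return "high"
    return "premium"

# Aggregation step: count how many customers fall into each segment.
segment_counts = Counter(assign_segment(spend) for _, spend in customers)

# Rank the segments by size and keep the top three.
top_three = segment_counts.most_common(3)
largest_segment = top_three[0][0]

print(top_three)
print("Largest segment:", largest_segment)
```

In practice the segmentation rule would come from the business question or from a clustering method rather than from hand-picked thresholds.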

3
Q

When we have a data-poor problem, what kind of testing do we use?

A

We use A/B testing if we have a data-poor problem.

4
Q

When we have a data-rich problem, what kind of model do we use?

A

We use a numeric or a classification model if we have a data-rich problem.

To decide between the two, ask whether the outcome is numeric or non-numeric (the latter calls for classification). Non-numeric outcomes are categories - eg will a customer pay on time, or default? - eg will an electronic device fail before 1000 hours of use?

Numeric = regression models
Classification = classification models
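
The decision described in this card and the previous one can be written as a tiny helper. This is a sketch only: the function name and the returned labels are hypothetical and simply restate the cards.

```python
def recommend_approach(data_rich, outcome_is_numeric):
    """Restate the flashcard rule of thumb (labels are illustrative)."""
    if not data_rich:
        # Data-poor problem: run an experiment to generate data.
        return "A/B testing"
    if outcome_is_numeric:
        # Data-rich problem with a numeric outcome.
        return "regression model"
    # Data-rich problem with a categorical outcome.
    return "classification model"

print(recommend_approach(data_rich=True, outcome_is_numeric=False))
```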
5
Q

Name the three types of numeric variables

A

The three types of numeric variables are Continuous, Time Based, and Count.

Continuous variables = numbers that can take any value, including decimals, like height (1.84141 m) - ie we don’t grow in even inch intervals. Use Continuous analysis for these.

Time Based variables = values tracked over time, used to predict what will happen in the future (eg forecasting). Use Time Series analysis for these.

Count variables = discrete whole numbers, ie number of people - you can’t have 1.78 people.

6
Q

For categorical outcomes (non-numeric), what are outcomes with two possible categories called?

A

For categorical analysis, these are called binary outcomes - eg yes or no.

7
Q

For categorical outcomes (non-numeric), what are outcomes with more than two possible categories called?

A

For categorical analysis, these are called non-binary outcomes, eg:

small, medium, large

pay on time, pay late, not pay at all

8
Q

Using historical production data, a manufacturer wants to know how many tricycles they will need to produce per month over the next six months.

Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?

A

Time based numeric

Step 1: The target outcome is the number of tricycles. Therefore, we should use a numeric model for this problem.

Step 2: Since we are looking to forecast the number of tricycles over time, we should use a time-based model.

9
Q

Using historical production data, a manufacturer wants to know how many tricycles they will need to produce per month over the next six months.

Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?

A

Continuous numeric

Step 1: The target outcome is the number of pizzas. Therefore, we should use a numeric model for this problem.

Step 2: Since the number of pizzas is a continuous variable, and not related to time, we should use a continuous model.

10
Q

Using historical data on its clients, a bank wants to predict whether a new customer will default on a loan, always pay on time, or sometimes pay.

Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?

A

Non-binary classification

Step 1: There are three possible target outcomes - pay on time, sometimes pay, or default. Therefore, this is a categorical outcome and we should use a Classification Model.

Step 2: Since there are three possible categorical outcomes, we should use a Non-Binary model to predict the outcome.

11
Q

A marketing organization wants to predict whether someone is likely to redeem a coupon as they would like to minimize costs and only send coupons to people who are likely to use them.

Should a numeric or classification model be used to solve the problem?
If numeric, should a continuous or time-based model be used?
If classification, should a binary or non-binary model be used?

A

Binary classification

Step 1: There are two possible target outcomes - the person will redeem the coupon, or the person will not redeem the coupon. Therefore, this is a categorical outcome and we should use a Classification Model.

Step 2: Since there are only two possible categorical outcomes, we should use a Binary model to predict the outcome.

12
Q

What is the slope-intercept form?

A

The slope-intercept form is y = mx + b, where m is the slope and b is the y-intercept.

  1. Identify the slope, m. This can be done by calculating the slope between two known points on the line using the slope formula:

     m = (y2 - y1) / (x2 - x1)

  2. Find the y-intercept, b. This can be done by substituting the slope and the coordinates of a point (x, y) on the line into the slope-intercept formula and then solving for b.

In a linear model, y = mx + b reads as follows:

x = Predictor Variable (aka independent variable - eg number of employees)
y = Target Variable (aka dependent variable - eg number of tickets)
m = Slope of the line
b = y-intercept

eg
x = Number of employees (eg 51)
y = Average number of tickets submitted per week (eg 5)

Using the Excel slope formula ‘=SLOPE(all of y column, all of x column)’

you then get slope = 0.1833
This means that for every extra employee, the average number of tickets for that company goes up by 0.1833. Note that the size of the company allows us to predict the number of tickets that may be submitted each week. The bigger the company, the greater the number of tickets.
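
The slope and intercept calculations can be sketched in Python. The two-point functions follow the slope formula directly; the least-squares function is the standard best-fit slope for many points, which is the calculation a spreadsheet slope function performs. The employee and ticket figures below are made up for illustration and are not the data behind the 0.1833 result.

```python
def slope_from_points(p1, p2):
    """Slope between two known points: (y2 - y1) / (x2 - x1)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

def intercept_from_point(m, point):
    """Solve y = m*x + b for b using one point on the line."""
    x, y = point
    return y - m * x

def least_squares_slope(xs, ys):
    """Best-fit slope through many (x, y) points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Two-point example: the line through (1, 3) and (3, 7).
m = slope_from_points((1, 3), (3, 7))
b = intercept_from_point(m, (1, 3))
print(m, b)

# Hypothetical data: employees per client and average weekly tickets.
employees = [10, 20, 30, 40, 50]
tickets = [2, 4, 5, 8, 9]
fitted_slope = least_squares_slope(employees, tickets)
fitted_intercept = sum(tickets) / len(tickets) - fitted_slope * sum(employees) / len(employees)
print(fitted_slope, fitted_intercept)
```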

13
Q

Explain correlation

A

Correlation ranges between -1 and 1. A value close to -1 or 1 means there is a strong correlation (negative near -1, positive near 1). A value close to 0 means there is little or no linear relationship.
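
As a rough sketch, the (Pearson) correlation coefficient can be computed by hand as below. The function name and the sample numbers are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# y rises in step with x: strong positive correlation.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
# y falls as x rises: strong negative correlation.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```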

14
Q

Explain the coefficient of determination (r2)

A

This tells us how well the data fits our line (the line in the linear model). R2 is the proportion of variance in the observations that is explained by the model. An r2 close to 1 means that nearly all variance in the target variable (ie the y axis, or the number of tickets) is explained by the model, ie the number of employees is a good predictor of the number of tickets. As rules of thumb: an r2 value greater than 0.7 is considered to be a strong model, a value of 0.5 or greater is usually pretty good, and a value below 0.3 is not useful.
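
A minimal sketch of the calculation: r2 is 1 minus the ratio of unexplained variation (residuals) to total variation in the target. The actual and predicted ticket numbers here are made up for illustration.

```python
def r_squared(actual, predicted):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

# Hypothetical observed tickets and the values a fitted line predicts.
actual_tickets = [2, 4, 5, 8, 9]
predicted_tickets = [2.0, 3.8, 5.6, 7.4, 9.2]

print(round(r_squared(actual_tickets, predicted_tickets), 4))  # 0.9759
```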

15
Q

What is Multiple Linear Regression?

A

Multiple linear regression is linear regression with more than one predictor variable (aka independent variable) - ie more than one x value, where simple linear regression has only one.

16
Q

What is the equation for Multiple Linear Regression?

A

The equation for multiple linear regression is

y = b0 + b1x1 + b2x2 + b3x3

y = target variable 
b0 = intercept (or the baseline value)

b1, b2, b3 = the coefficients of the variables x1, x2, and x3 (ie your different variables)

A Linear Regression will find values for b0, b1, b2, b3, and these values represent the relationship we observe between the predictor variables (ie x1, x2, and x3) and the target variable (ie y)

The model is referred to as linear because the expression is a linear combination: an expression constructed from a set of terms by multiplying each term by a constant and then adding the results.
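
To show what "finding values for b0, b1, b2" means, here is a small self-contained sketch that fits the coefficients by solving the normal equations with Gaussian elimination. It is one standard way to do least squares, not necessarily what any particular tool does internally, and the data is fabricated so that the true answer is y = 1 + 2x1 + 3x2.

```python
def fit_linear_regression(rows, targets):
    """Least-squares fit: solve (X^T X) b = X^T y by Gauss-Jordan elimination."""
    # Prepend a constant 1 to each row so that b0 acts as the intercept.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[r] * x[c] for x in X) for c in range(k)]
        + [sum(x[r] * y for x, y in zip(X, targets))]
        for r in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefficients, row):
    """Evaluate y = b0 + b1*x1 + b2*x2 + ... for one row of predictors."""
    b0, *bs = coefficients
    return b0 + sum(b * x for b, x in zip(bs, row))

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
targets = [9, 8, 19, 18, 26]

coefficients = fit_linear_regression(rows, targets)
print([round(b, 6) for b in coefficients])
print(round(predict(coefficients, (6, 2)), 6))
```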

17
Q

Why should we use the adjusted R-squared value for multiple linear regression?

A

The adjusted r-squared value should be used with multiple linear regressions due to a phenomenon that occurs when adding additional variables to the model. In a nutshell, adding variables never lowers the r-squared value and usually raises it, even if there is no relationship between the additional variables and the target variable. The adjusted r-squared value penalises each extra variable, so we use it to compare models with different numbers of predictors.
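
The usual adjustment formula is 1 - (1 - r2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictor variables. A short sketch, with made-up values for r2, n and p:

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Penalise r2 for the number of predictors in the model."""
    n, p = n_observations, n_predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same r2 and sample size, but more predictors gives a lower adjusted value.
few_predictors = adjusted_r_squared(0.90, n_observations=20, n_predictors=3)
many_predictors = adjusted_r_squared(0.90, n_observations=20, n_predictors=10)
print(round(few_predictors, 5), round(many_predictors, 5))
```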

18
Q

What kind of variables do we use to represent categorical variables?

A

We use dummy variables to represent categorical variables

“A dummy variable can only take on two values, generally zero or one. You would add one dummy variable for one less than the number of unique values in the categorical variable. So if the variable is binary, you’d add one dummy. If there are four categories, you’d add three dummy variables”

ie - if the categorical variable has 5 categories, you add 4 dummy variables;
if it has 4 categories, you add 3, etc.
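
A minimal sketch of dummy encoding in plain Python. The industry names are invented, and picking the alphabetically first category as the baseline (the one with no dummy of its own) is just an assumption made for this example.

```python
def dummy_encode(values):
    """Create one 0/1 dummy per category, except for a baseline category."""
    categories = sorted(set(values))
    baseline, *others = categories  # baseline gets no dummy column
    return [{f"is_{c}": int(v == c) for c in others} for v in values]

# Hypothetical categorical variable with four unique values.
industries = ["retail", "services", "manufacturing", "retail", "wholesale"]
encoded = dummy_encode(industries)

for industry, row in zip(industries, encoded):
    print(industry, row)
```

Four categories produce three dummy columns. A row where every dummy is 0 belongs to the baseline category.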

19
Q

What is the target variable?

A

The target variable is the one we’re looking to predict

eg the number of IT tickets an IT services company deals with from different clients.

The predictor variables are the other variables that help you predict this eg no. of employees per client company, industry, value of contract etc.

20
Q

How would you use a weight variable for weighted least squares in Alteryx?

A

“Use a weight variable for weighted least squares” allows the user to assign a weight to each row of data. An example of when you might want to use this would be if you wanted to weight well-established clients more than relatively new clients, because you felt that the average number of tickets would be more accurate for the established clients.
Then you could add a column of data, setting the number 2 for each established client and a 1 for each new client. This would weight the established clients twice as much in determining the equation for the linear regression.
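
The effect of the weight column can be sketched with the textbook weighted least squares formulas for a single predictor. This shows the general technique only and makes no claim about how Alteryx implements it; the client data is made up.

```python
def weighted_least_squares(xs, ys, weights):
    """Fit y = slope*x + intercept, with each row's squared error scaled by its weight."""
    total_weight = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total_weight
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total_weight
    numerator = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    denominator = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = numerator / denominator
    return slope, mean_y - slope * mean_x

# Hypothetical clients: employees, average weekly tickets, established or new.
employees = [10, 20, 30]
tickets = [2, 4, 9]
established = [True, False, False]

# Weight column: 2 for established clients, 1 for new clients.
weights = [2 if flag else 1 for flag in established]

slope, intercept = weighted_least_squares(employees, tickets, weights)
print(slope, intercept)
```

With these formulas, a weight of 2 gives the same fitted line as listing that client's row twice in an unweighted regression.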

21
Q

What do Coefficient Estimates represent?

A

Remember our regression equation? Y = B0+B1X1+B2X2…? These coefficients are the estimates of the B’s. They represent the magnitude of the relationship between each predictor variable and the target variable. For example, the coefficient on the number.of.employees means that each additional employee will lead to roughly an additional 0.1 tickets, holding all other variables constant. A simpler way to think about this is that we can expect about 1 ticket for every 10 employees.

22
Q

What is the P value?

A

The p-value is the probability of seeing a coefficient estimate at least as large in magnitude as the one observed if there were no actual relationship between the predictor and target variable (that is, if the true coefficient were zero). The lower the p-value, the stronger the evidence that a relationship exists between the predictor and target variable. If the p-value is high, we should not rely on the coefficient estimate. When a predictor variable has a p-value below 0.05, the relationship between it and the target variable is considered to be statistically significant.

p-value < 0.05 = statistically significant

23
Q

What is R Squared?

A

R-squared ranges from 0 to 1 and represents the amount of variation in the target variable (ie number of tickets) explained by the variation in the predictor variables (ie number of employees, value of contract). The higher the r-squared, the higher the explanatory power of the model.

In our example, the R-squared value is 0.9651, and the adjusted R-squared value is 0.9558. Therefore, we’ve been able to improve the model with the addition of the category. In a real life problem, we might run the model with different predictor variables, or see if we had additional information to add to the model.

Now that we have a strong model, we can perform our analysis.

24
Q

What are the five most common data types?

A

Strings, numeric, date/time, Boolean/Logical, Special Objects

Special Objects can be images, maps, reports, sound files

25
Q

What are some data type examples in Alteryx?

A

Create new variable: Use the Formula tool to create a field based on one of the existing fields.

Change data type: Use the Select tool to change the data type of one or more variables.

Automate data type selection: Use the Auto Field tool to have the data type of each field be set automatically to best fit the data.

26
Q

What are histograms good for?

A

For continuous variables, you can see how they are distributed with a histogram. A histogram is similar to a bar plot, but the variable is binned into ranges and the values in each bin are counted. Histograms’ connected bars imply a continuous progression in values. They are simple to create with most data visualization software.

Histograms are great at showing outliers and displaying how the data is distributed. Also, not all variables you collect will be normally distributed! You can make incorrect conclusions by assuming normality.
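
The binning and counting behind a histogram can be sketched without any plotting library. The heights (in cm), the bin width, and the starting value are all made-up assumptions for this example.

```python
def histogram_counts(values, bin_width, start):
    """Bin values into ranges of equal width and count the values in each bin."""
    counts = {}
    for value in values:
        index = (value - start) // bin_width
        lower_edge = start + index * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical heights in cm, including one unusually tall person.
heights_cm = [162, 168, 171, 174, 175, 179, 181, 184, 210]
counts = histogram_counts(heights_cm, bin_width=10, start=160)

# Text version of the histogram: one '#' per observation in the bin.
for lower_edge, count in counts.items():
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")
```

The lone value in the 210-219 bin, separated from the rest by empty bins, is the kind of outlier a histogram makes easy to spot.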

27
Q

What are scatter plots for?

A

You can look at relationships between variables with scatter plots. This will help you identify variables that are correlated or have other interesting relationships.

For example, a scatter plot of height against weight for a group of men makes it pretty clear that weight increases as height increases, in general. This is what we should expect, of course: taller people are typically heavier too!