Six Sigma Correlation, Regression, and Hypothesis Testing Flashcards
Summary of correlation
Investigate the relationship between x factors (inputs) and y (outputs)
Does a relationship exist?
What is it?
What factor has the biggest impact?
Famous correlation maxim
Correlation does not equal causation
When to use?
- Relating x input to y output
- Look at their relationship over time
- Identify key x inputs
- Measure outputs
Design of Experiments
A rigorous methodology for identifying which x factors (inputs) drive the y factors (outputs)
Examples of correlation
- Hours of work experience correlated to number of incorrectly installed modules
- Or visual acuity test score to output
- Age to blood pressure
- Sales success to level of education/years of experience
Scatter plots reintroduction
- Plot
- Is there a relationship?
- What type of relationship? (Positive/Negative)
- How strong?
Example of nonlinear correlation
Nonlinear relationships are much more complex
EXAMPLE: Oil changes on engine life
- Manufacturer recommends every 4k miles
- We know changing every 20k miles has a negative effect
- But what happens if we change every 1k, 2.5k miles?
Correlation coefficient
AKA Pearson correlation coefficient
An expression of the linear relationship in our data
Values fall between -1 and +1
Helps us understand the strength or weakness of a relationship (helps distinguish between factors)
Interpreting correlation coefficient
- 1 = perfect positive line (fit of piston in an engine)
- 0.82 = closely related, but not super tight/close
- 0 = no correlation
- -0.82 = closely related in the negative direction, but not as tight as -1
- -1 = perfect negative line (noise in environment's effect on concentration)
Tips on the correlation coefficient
- Only works for linear relationships
- Highly sensitive to outliers
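Both tips can be checked with a small sketch. The helper name `r` and all data values below are made up for illustration; the first case is a perfect but curved relationship, the second is a perfect line with one bad point added.

```python
import math

def r(xs, ys):
    """Pearson correlation coefficient (illustrative helper)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Tip 1: y depends entirely on x, but the relationship is a curve (y = x^2).
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
curved_r = r(curve_x, curve_y)          # no linear trend, so r is 0

# Tip 2: a perfect line, then the same line with a single outlier at (6, 0).
line_x = [1, 2, 3, 4, 5]
line_y = [2, 4, 6, 8, 10]
clean_r = r(line_x, line_y)
outlier_r = r(line_x + [6], line_y + [0])

print(curved_r, clean_r, round(outlier_r, 3))
```

The curved data gives r = 0 even though y is fully determined by x, and one outlier pulls a perfect r = 1 down to about 0.14, which is why a scatter plot should be checked before trusting r.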
Calculating the correlation coefficient
r = cov(x, y) / (s_x * s_y)
Difficult to calculate by hand
GENERAL SUMMARY
The covariance of the two variables divided by the product of their standard deviations. We need:
- Xi - individual values of the first variable
- Yi - individual values of the second variable
- n - the number of pairs of data in the dataset
There is a Pearson coefficient lookup table
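A minimal sketch of the general summary above: covariance divided by the product of the standard deviations. The function name `pearson_r` and the sample data are assumptions for illustration only (the data loosely follows the hours-of-experience example and is not real).

```python
import math

def pearson_r(xs, ys):
    """Pearson r = cov(x, y) / (s_x * s_y), using sample (n - 1) statistics."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must contain the same number of values")
    if n < 2:
        raise ValueError("need at least two pairs of data")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance of the two variables.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Sample standard deviation of each variable.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when a variable never changes")
    return cov_xy / (sd_x * sd_y)

# Hypothetical data: hours of experience vs. incorrectly installed modules.
hours = [10, 20, 30, 40, 50]
errors = [9, 7, 6, 4, 2]
r_value = pearson_r(hours, errors)
print(round(r_value, 3))
```

The (n - 1) divisors cancel between numerator and denominator, so the population versions of covariance and standard deviation would give the same r.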
Causation
The act of causing/agency that produces the effect.
Understanding and determining which x variables produce which y outputs in our processes.
Key components between causation and the x/y factors
- Asymmetrical (correlation is symmetrical but DOES NOT indicate causation. Causation is asymmetrical, or one-directional)
- Causation is NOT reversible. (Hurricane causes the phone lines to go down, but not vice versa)
- Can be difficult to determine causation. (Is there a third, unknown variable?)
- Correlation CAN help POINT to causation. We rule out data that is unrelated.
Common mistakes when looking for causation
- Genuine causation - clear, uncomplicated data supports the proposed causation
- Common response - both x and y react the same way to an unseen third variable, so they appear related to each other
- Confounding - the effect of one variable, x, on y is mixed up with the effects of other explanatory variables on the y output that we’re looking for in the process.
The statistical significance of correlation
First Ask: Are we focusing in on the right variables?
Then: Which of our correlation coefficients could be due to chance?
Next: What’s the significance of the correlations?
P-Value
Used to determine and measure the statistical significance (not necessarily the importance) of a relationship between two variables.
It provides statistical evidence of the relationship.
We look for a p-value of less than 0.05. That is the threshold when alpha is set to 0.05, because we’re shooting for 95% confidence.
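One way to see what the p-value measures is a permutation test: shuffle the y values many times and count how often chance alone produces a correlation at least as strong as the observed one. This is a swapped-in illustration, not the lookup table or formula the course may use; the function names, the seed, and the data are all assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumes neither variable is constant)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Two-sided p-value: share of shuffles with |r| >= the observed |r|."""
    rng = random.Random(seed)           # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)           # break any real pairing between x and y
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical paired measurements with a strong linear trend.
x_input = list(range(1, 11))
y_output = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

alpha = 0.05
p_value = permutation_p_value(x_input, y_output)
print(p_value, "significant" if p_value < alpha else "could be chance")
```

A p-value below alpha says the correlation is unlikely to be a chance result; it says nothing about whether the relationship is important or causal.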
What effect illustrates the importance of asking ‘Is the correlation by chance?’?
Known as the Hawthorne Effect - paying attention to something will often increase performance.
Hawthorne Works (Western Electric): In the 1920s, had a hypothesis that increasing lighting increases productivity
- Got a baseline on productivity
- Then upped lighting by 10%
- Kept increasing it until they couldn’t go any higher
- Then asked, what happens if we turn the lights right back where we started?
- When they changed it back, productivity increased AGAIN
- This blew up the lighting = productivity hypothesis
The real relationship: paying attention to people and their productivity makes them more productive.
What other question is important to ask regarding correlation?
What are the chances of finding a correlation value OTHER than what we estimated in our example?
EX: Someone’s height vs self-esteem
Regression analysis
Forecast the change in the dependent variable in our process.
Describe the relationship between predictor variables ( x ) and output y (response variable).
Simple linear regression
Gets us a best fit line (red line going through center of the plot). Only one x (predictor) per y.
Vs. Multiple: where there are many x’s per y
EXAMPLE
If we’re only comparing height and weight, that’s simple linear.
If we want to do height, age and gender against weight, that’s three different x factors in one model, so it’s a multiple linear regression.
Simple linear regression formula
y = B0 + B1 x + e
Beta factors - B0 is the y-intercept and B1 is the slope (the effect of x on y); e is the error term. We run through the formula for various lines, looking for best fit.
The best fit is the line with the lowest sum of the squared residuals.
Considerations for simple linear least-squares
- A nonlinear relationship between x factors and y outputs will not be captured by a straight line
- Outlier data can have a large influence on the fitted line
- Consider whether the variance in the residuals is inconsistent across the range of x
How does the simple linear least-squares regression help?
It’s too cumbersome to use the simple linear regression formula for every possible line. There’s a simple way to find it:
Simple linear least-squares regression
- The fitted beta-zero and beta-one are estimates of the true values
- As opposed to beta-zero being the exact value of the true y intercept, or beta-one being the exact value of the true slope.
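A minimal sketch of the least-squares shortcut: instead of trying every possible line, the estimates of beta-zero and beta-one come straight from the data. The function names and the height/weight numbers are hypothetical, chosen only to echo the height and weight example.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                  # estimated slope
    b0 = mean_y - b1 * mean_x       # estimated y-intercept
    return b0, b1

def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared gaps between each observed y and the line's prediction."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: height in cm (x) vs. weight in kg (y).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 73, 80]

b0, b1 = fit_line(heights, weights)
best = sum_squared_residuals(heights, weights, b0, b1)

# Any other line, such as one with a slightly steeper slope, fits worse.
nudged = sum_squared_residuals(heights, weights, b0, b1 + 0.01)

predicted_weight = b0 + b1 * 175    # forecast y for a new x value
print(b0, b1, best, nudged, predicted_weight)
```

Because b0 and b1 are computed from one sample, they are estimates: a different sample of heights and weights would give slightly different values.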
Predicting Outcomes with Regression Analysis/Models
Regression calculation that allows us to isolate sources of variation
Ex: Sales Forecasting
- Identifying controllable factors and their effects on sales is a valid exercise
- What factors come into play for sales success?