Six Sigma Correlation, Regression, and Hypothesis Testing Flashcards
Summary of correlation
Investigates the relationship between x factors (inputs) and y (outputs)
Does a relationship exist?
What is it?
What factor has the biggest impact?
Famous correlation maxim
Correlation does not equal causation
When to use?
- Relating x input to y output
- Look at their relationship over time
- Identify key x inputs
- Measure outputs
Design of Experiments
A rigorous methodology for identifying x factors and their effects on y outputs
Examples of correlation
- Hours of worker experience correlated to incorrectly installed modules
- Visual acuity test scores correlated to output
- Age to blood pressure
- Sales success to level of education/years of experience
Scatter plots reintroduction
- Plot
- Is there a relationship?
- What type of relationship? (Positive/Negative)
- How strong?
Example of nonlinear correlation
Nonlinear relationships are much more complex
EXAMPLE: Oil changes on engine life
- Manufacturer recommends every 4k miles
- We know changing every 20k miles has a negative effect
- But what happens if we change every 1k, 2.5k miles?
Correlation coefficient
AKA Pearson correlation coefficient
An expression of the linear relationship in our data
Values fall between -1 and +1
Helps us gauge the strength or weakness of the relationship (useful for distinguishing between factors)
Interpreting correlation coefficient
1 = perfect positive line (e.g., fit of a piston in an engine)
.82 = closely related, but not super tight/close
0 = no correlation
-.82 = closely related, but not as tight
-1 = perfect negative line (e.g., noise in the environment's effect on concentration)
Tips on the correlation coefficient
- Only works for linear relationships
- Highly sensitive to outliers
Calculating the correlation coefficient
r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² × Σ(Yi − Ȳ)² ]
Difficult to calculate by hand
GENERAL SUMMARY
The covariance of the two variables divided by the product of their standard deviations. We need:
- Xi- individual values of first variable
- Yi - individual values of the second variable
- n - the number of pairs of data in the dataset
There is a Pearson coefficient lookup table
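Lookup table aside, the coefficient is straightforward to compute directly. A minimal Python sketch of the covariance-over-standard-deviations formula, with hypothetical experience/error data invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson r: the covariance of the two variables divided by the
    product of their standard deviations (the 1/n factors cancel,
    so they are omitted from both numerator and denominator)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: years of experience vs. incorrectly installed modules
experience = [1, 2, 3, 4, 5]
errors = [9, 7, 6, 4, 2]
print(round(pearson_r(experience, errors), 3))  # → -0.995, strong negative correlation
```

More experience pairs with fewer errors, so r comes out close to -1, matching the "closely related, negative" band of the interpretation scale.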
Causation
The act of causing/agency that produces the effect.
Understanding/determining which x variables cause which y outputs of our processes.
Key components between causation and the x/y factors
- Asymmetrical (correlation is symmetrical but DOES NOT indicate causation; causation is asymmetrical, or one-directional)
- Causation is NOT reversible. (Hurricane causes the phone lines to go down, but not vice versa)
- Can be difficult to determine causation. (Is there a third, unknown variable?)
- Correlation CAN help POINT to causation. We rule out data that is unrelated.
Common mistakes when looking for causation
- Genuine causation - clear, uncomplicated data to support proposal of causation
- Common response - the common response to the unknown variable occurs when both x and y react the same way to an unseen variable
- Confounding - the effect of one variable, x, on y is mixed up with the effects of other explanatory values on the y output that we’re looking for in the process.
The statistical significance of correlation
First Ask: Are we focusing in on the right variables?
Then: Which of our correlation coefficients are subject to chance?
Next: What’s the significance of the correlations?
P-Value
Lets us measure the significance (not necessarily the importance) of a relationship between two variables.
It does provide statistical evidence of the relationship
We look for a p-value of less than 0.05 when the alpha factor is 0.05, because we're shooting for 95% confidence.
What effect illustrates the importance of asking ‘Is the correlation by chance?’?
Known as the Hawthorne Effect - paying attention to something will often increase performance.
Western Electric's Hawthorne Works, early 1920s: hypothesis that increasing lighting increases productivity
- Got a baseline on productivity
- Then upped lighting by 10%
- Kept doing until they couldn’t go any higher
- Then asked, what happens if we turn the lights right back where we started?
- When they changed it back, productivity increased AGAIN
- This blew up the lighting = productivity hypothesis
The takeaway: paying attention to people and their productivity makes them more productive.
What other question is important to ask regarding correlation?
What are the chances of finding a correlation value OTHER than what we estimated in our example?
EX: Someone’s height vs self-esteem
Regression analysis
Forecast the change in the dependent variable in our process.
Describe the relationship between predictor variables ( x ) and output y (response variable).
Simple linear regression
Gets us a best-fit line (the line running through the center of the plot). Only one x predictor per y response.
Vs. multiple linear regression: many x predictors per y.
EXAMPLE
If we’re only comparing height and weight, that’s simple linear.
If we want to run height, age, and gender against weight, that's three different factors in one multiple linear regression.
Simple linear regression formula
y = B0 + B1*x + e
The beta factors capture the effect of x on y. We run the formula for various candidate lines, looking for the best fit.
Testing for the best fit, determined by the lowest sum of the squared residuals.
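The "run through the formula for various lines" idea can be sketched as a brute-force search over candidate lines, keeping the pair of coefficients with the lowest sum of squared residuals. The data points and the search grid here are hypothetical:

```python
def ssr(xs, ys, b0, b1):
    """Sum of squared residuals for the candidate line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data
xs = [1, 2, 3, 4]
ys = [2.1, 4.2, 5.9, 8.1]

# Try a grid of candidate intercepts (-2.0..2.0) and slopes (0..4.0)
# in steps of 0.1, and keep the pair with the lowest SSR
candidates = [(b0 / 10, b1 / 10)
              for b0 in range(-20, 21) for b1 in range(0, 41)]
best_b0, best_b1 = min(candidates, key=lambda c: ssr(xs, ys, *c))
print(best_b0, best_b1)
```

In practice nobody searches a grid like this; the least-squares formulas give the answer in one step (which is exactly why the next section matters). The sketch just makes the "lowest SSR wins" criterion concrete.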
Considerations for simple linear least-squares
- Nonlinear relationship between x factors and y outputs
- Importance of outlier data
- Watch for inconsistent variance in the residuals
How does the simple linear least-squares regression help?
It’s too cumbersome to use the simple linear regression formula for every possible line. There’s a simple way to find it:
Simple linear least-squares regression
- Gives us estimates of the true values of beta-zero and beta-one
- Beta-zero is the value of the y intercept; beta-one is the value of the slope
Predicting Outcomes with Regression Analysis/Models
Regression calculation that allows us to isolate sources of variation
Ex: Sales Forecasting
- Identifying controllable factors and their effects on sales is a valid exercise
- What factors come into play for sales success?
Key components
- Apply a linear equation to the data set (obtain a least-squares line)
- Helps us predict future values of y based on existing x factors
Regression coefficient
- There are various ways of getting b (the regression coefficient)
- Understand the y intercept value (expressed as little a)
Before we use the model, we have to know a and b
How to plot and develop data for regression model
1. Plot the scatter diagram (not the line, just the dots)
2. Get x-bar and y-bar (sum up totals, divide by the count)
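The steps above can be sketched in Python using the standard least-squares formulas, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The data set (ad spend vs. sales) is hypothetical:

```python
# Hypothetical data: ad spend (x) vs. sales (y)
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# Step 1 (after plotting the scatter): get x-bar and y-bar
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Step 2: regression coefficient b, then the y intercept a = y-bar - b * x-bar
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Step 3: the model y = a + b*x predicts future values of y
predicted = a + b * 6  # → 13.0 for this perfectly linear toy data
```

With real process data the fit will not be exact; the point is that once a and b are known, prediction is a single substitution into y = a + b*x.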
Real life examples for regression models
- Reducing handle/hold times in a contact center
  - x factor: time to get the CSR's computer turned on
- Understanding how processing temperature affects pipe wall material in production
  - Cost of material into the pipe is 50-60% of sales
  - Can we find an optimal process temperature that gives a better wall thickness?
Hypothesis Testing & Inferential Statistics revolve around what 4 things?
- Draw conclusions about population based on sample data
- Test a claim about population parameter
- Provide evidence to support opinion
- Check for statistical significance
What 6 Sigma phase does the hypothesis testing take place during?
Analyze (the A in DMAIC)
Example hypothesis: what would be the effect on customer satisfaction if we reduced the time to answer the phone, or the time it takes to provide a quality answer?
Another hypothesis: if we tightened our control over temperature in pipe production, could we minimize costs while maintaining customer satisfaction levels?
Describe the descriptive vs relational categories of hypothesis testing
Descriptive
What we can physically measure about something (size, form, distribution).
- Our ability to manipulate this
Relational
- What’s the relationship between the variables?
- Positive or negative?
- Greater or lesser than a given value?
- Ex: reducing handle times in customer contact center’s effect on satisfaction
Types of hypothesis tests
- 1-sample hypothesis test for means
- 2-sample hypothesis tests for the means
- Paired t-test
- Test for proportions
- Test for variances
- ANOVA - Analysis of Variances
Paired T-test
We use the means of two paired samples (the same units measured under two conditions) to prove/disprove a hypothesis about a shift between them.
- Do we see shifts based on the hypothesis
- Ex: is there a relationship between the handle time for a call and the CSR's experience?
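A minimal sketch of the paired t statistic, t = d̄ / (s_d / √n), where d̄ is the mean of the pairwise differences. The before/after handle times for the same five CSRs are hypothetical:

```python
import math

def paired_t(before, after):
    """Paired t statistic: mean of the pairwise differences divided
    by its standard error, t = d_bar / (s_d / sqrt(n))."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = sum(diffs) / n
    s_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
    return d_bar / (s_d / math.sqrt(n))

# Hypothetical handle times (minutes) for the same five CSRs,
# before and after gaining six months of experience
before = [8.2, 7.9, 9.1, 8.5, 8.8]
after  = [7.6, 7.7, 8.4, 8.1, 8.2]
t = paired_t(before, after)  # ≈ -5.59
# |t| exceeds the t-table critical value of about 2.776
# (df = 4, alpha = 0.05, two-tailed), so we would reject the null
```

Pairing removes the CSR-to-CSR variation from the comparison, which is why the same subjects are measured twice rather than sampling two independent groups.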
The 5 steps of hypothesis testing
- Establish our null and alternative hypotheses (Ho & Ha)
- Testing our considerations (what are the things we want to test for, and how will we manage the process)
- Calculate test statistics
- Apply the critical-value or p-value method, comparing our desired confidence level to the test results
- Interpret results
Null hypothesis
- “What they say” and expresses the status quo
- Assumes any observed differences are due to chance or random variation
- Often expressed as = , >= or <=
Alternative hypothesis
- “What we want to test/prove”
- Assumes the observed differences are real and NOT due to chance/random variation
- Often expressed as !=, > or <
The null hypothesis
Null hypothesis - assuming population parameters of interest are equal and there is no change or difference
Ex: Humidity will not have an effect on the weight of the parts we measure
Ex: The country you live in would not have an effect on your level of life satisfaction
The alternative hypothesis
Represented by H with subscript “a”.
Wants to look at parameters of interest that are not equal, assuming the difference is real.
Ex: Assume greater level of CSR experience directly correlates to quality of work output
Goals in hypothesis testing
- Reject the null in favor of the alternative hypothesis (Ha)
- Need to show the result is statistically significant
- We expect to find no effect, but the data may lead us to reject the null
- Fail to reject: we find insufficient evidence to reject the null hypothesis or to support the alternative
- Presenting the results - Even though it’s more natural sounding to state in view of the Ha, we actually express in terms of whether or not we’re rejecting the null hypothesis
- “We reject the null hypothesis.” OR
- “We fail to reject the null hypothesis.”
The types of error
Type I error (alpha risk) - rejecting a null hypothesis that is actually true
Type II error (beta risk) - failing to detect the effect we're looking for
Type I Error (alpha risk)
The risk we’re willing to take in rejecting the null hypothesis when it’s actually true (producer’s risk)
- A false alarm (a false positive)
- Tied to the alpha factor
Common alpha factor = 0.05
- Testing: what's the probability of making a Type I error at that confidence level?
Alpha significance level
Signifies the degree/risk of failure that’s acceptable to us in the study at hand.
Helps decide if null can be rejected
1-alpha Confidence level acceptance region
Signifies the level of assurance we expect in the results of the data being studied
- Describes the uncertainty of the sample method you’re using
Type II error (beta risk)
Most common beta risk value is 0.10
- Similar to failing to find the defective piece when producing a product
- AKA Consumer risk
- A false negative (failing to detect a real effect)
Are alpha and beta inversely proportional?
Yes
How do test tails work
If Ha: mu > mu0 (the hypothesized mean) -> one-tailed test to the right
If Ha: mu < mu0 -> one-tailed test to the left
If Ha: mu != mu0 -> two-tailed test (we look for deviations on both sides of the curve)
How do we use the concept of critical value?
Used to compute the margin of error
Margin of error = critical value × standard error (or standard deviation) of the statistic
How is the critical value test statistic derived?
Z = (x-bar − mu) / (sigma / √n)
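A quick sketch of the critical-value method using this Z statistic, with Python's standard-library normal distribution supplying the critical value. The fill-weight scenario and all its numbers are hypothetical:

```python
from statistics import NormalDist

def z_statistic(x_bar, mu, sigma, n):
    """One-sample z statistic: Z = (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / n ** 0.5)

# Hypothetical: claimed mean fill weight mu = 500 g, known sigma = 4 g,
# and a sample of n = 25 bottles averaging 502.2 g
z = z_statistic(502.2, 500, 4, 25)      # ≈ 2.75

# Two-tailed critical value at alpha = 0.05: the 97.5th percentile
critical = NormalDist().inv_cdf(0.975)  # ≈ 1.96
reject = abs(z) > critical              # z falls outside the acceptance region
```

Because |Z| exceeds the critical value, the sample mean lands in the rejection region rather than the 1 − alpha acceptance region, so the null would be rejected.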
What is the acceptance region?
The region corresponding to the confidence level of the test, 1 − alpha
If alpha factor is 5%, resulting confidence factor would be 95%
Two-tailed test
If Ha mu != mu-o
Testing on both sides of the mean
The power of a test
The probability that the test leads to the correct decision.
The power (sensitivity) of a statistical test is its probability of rejecting a null hypothesis that is actually false.
Higher power means a greater likelihood of correctly rejecting the null.
Four factors on the power of a test
- Sample size
- Population differences
- Variability
- Alpha level
Sample size
Most important part of power of a test.
Population differences
Particularly important when planning the study.
Sample must be large enough to avoid Type II errors.
Variability
Less variability = more power
Ex. If looking at equivalency exams in schools, we know that sample size and variance matter
Alpha level
Most common: 0.05
Used to determine critical value
Plenty of instances where alpha factor of 0.05 results in rejecting the null hypothesis, but an alpha factor of 0.01 would not.
P-value, what is it, what do we use it for
Used to determine statistical significance.
Use it to evaluate how well the data supports the null hypothesis.
Key things in p-value
Effect size
Sample size
Variability of data
What does a low p-value mean?
Indicates that the sample data contains enough evidence to reject the null for the population.
Rhyming maxim for p-value interpretation
“If the p is high, null will fly.”
“If p is low, null will go.”
Examples of p value
If p value is less than the alpha factor (0.05 in this case), then we reject.
If p value is greater than alpha factor, then we do not reject.
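The p-value decision rule above can be sketched for a two-tailed z-test with the standard library's normal distribution. The z statistic of 2.75 is hypothetical, chosen to give a p-value well below alpha:

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value: probability of a z statistic at least
    this extreme under the null hypothesis."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
p = two_tailed_p(2.75)  # hypothetical test statistic; p ≈ 0.006

# "If p is low, the null will go"
decision = "reject the null" if p < alpha else "fail to reject the null"
print(decision)
```

Compare with a small statistic such as z = 1.0, where p ≈ 0.32 > 0.05 and the null "flies" (we fail to reject).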