Six Sigma Statistics and Graphical Presentation Flashcards
By Ron Crabtree
Sample
Subset of the overall population.
Make sure samples are representative of the population.
Three most standard descriptive/characteristic statistics
Mean (arithmetic average)
Standard deviation
Variance
What are the symbols for the three most common data characteristics for both population parameters and sample statistics?
MEAN
Pop Par: mu
Sample stat: x-bar
STANDARD DEVIATION
PP: sigma
SS: s
VARIANCE
PP: sigma squared
SS: s squared
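A minimal sketch of the difference between the population parameters and the sample statistics, using Python's standard statistics module. The data values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical measurements (illustrative values, not from the course).
data = [4.0, 5.0, 6.0, 7.0, 8.0]

# Treating the data as the WHOLE population: mu, sigma, sigma squared.
mu = statistics.mean(data)
sigma = statistics.pstdev(data)        # divides by n
sigma_sq = statistics.pvariance(data)  # divides by n

# Treating the data as a SAMPLE: x-bar, s, s squared.
x_bar = statistics.mean(data)
s = statistics.stdev(data)             # divides by n - 1
s_sq = statistics.variance(data)       # divides by n - 1

print(f"population: mu={mu}, sigma={sigma:.3f}, sigma^2={sigma_sq}")
print(f"sample:     x-bar={x_bar}, s={s:.3f}, s^2={s_sq}")
```

The mean is the same either way; only the divisor for the spread changes, so s comes out slightly larger than sigma for the same numbers.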
Descriptive statistics
Used to describe the process itself
One of the most common tools: the histogram, which shows variation and centering
Inferential statistics
Making inferences about the population from your sample
It’s possible to learn meaningful information with as few as 30 measurements.
Compare descriptive vs. inferential statistics
DESCRIPTIVE
Approach: More inductive (induce information)
Goal: Summarize the data to make decisions
Tools/Techniques: Histograms, interrelationship diagrams, process maps, fishbone diagrams
Interpretations: Fairly straightforward, Not as difficult to create
INFERENTIAL
Approach: Deductive (deduce information)
Goal: Infer population characteristics to predict future outcomes
Tools/Techniques: More advanced/complex: chi-squared, binomial, and Poisson distributions, hypothesis testing, confidence intervals, correlation, regression analysis.
Interpretations: Complex
Normal distribution
Most of the values in the data set are close to the average for the data. Standard deviation is small. Also allows for easy inference.
AKA The bell-shaped curve.
The 68-95-99.7 Percent Rule
± One St Dev: 68.27%
± Two St Dev: 95.45%
± Three St Dev: 99.73%
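The percentages in the rule can be checked directly from the normal curve. A short sketch using the standard library's NormalDist (the helper name share_within is made up for this example); printed tables sometimes differ in the last digit because of rounding.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of a normal population within +/- k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"+/- {k} standard deviations: {share_within(k):.2%}")
```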
What are the basic tenets of the central limit theorem?
Basic tenets:
The sampling distribution of the mean approaches a normal distribution as the sample size increases.
n = 100, get a curve
n = 500, a peak appears
n = 1000, a normal distribution appears
As you increase samples, you get closer to a perfect bell curve
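Central limit theorem: sample size rules of thumb
n = sample size used for each sample mean
n = 4 already gives a near-normal sampling distribution when the population is not far from normal
n = 30 is usually enough for the sampling distribution of the mean to be approximately normal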
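A small simulation sketch of the idea. The population here is an assumption: an exponential distribution, picked because it is clearly skewed. The function names, the 2000 trials, and the fixed seed are also choices made for the example. For a normal curve about 68% of values fall within one standard deviation of the center, so the last column should drift toward 0.68 as n grows.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(n, trials=2000):
    """Return the means of `trials` random samples, each of size n."""
    means = []
    for _ in range(trials):
        # Exponential population with mean 1.0: skewed, not bell-shaped.
        sample = [random.expovariate(1.0) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means

def share_within_one_sd(values):
    """Fraction of values within one standard deviation of their own mean."""
    center = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(1 for v in values if abs(v - center) <= sd) / len(values)

spread_by_n = {}
share_by_n = {}
for n in (1, 4, 30):
    means = sample_means(n)
    spread_by_n[n] = statistics.stdev(means)
    share_by_n[n] = share_within_one_sd(means)
    print(f"n = {n:>2}: spread of sample means = {spread_by_n[n]:.3f}, "
          f"share within 1 st dev = {share_by_n[n]:.3f}")
```

The spread of the sample means also shrinks as n grows, which is why larger samples pin down the process mean more tightly.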
Basic tenets of confidence interval
Used to state some level of confidence that the mean of your population falls within a certain range
- Collect data for sample
- Calculate mean and standard deviation of sample
- Then make inference
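The three steps above can be sketched as follows. This is one possible approach, not the course's own method: it assumes a normal (z) critical value, which is reasonable at around 30 or more measurements; for small samples a t-distribution value would be more appropriate. The function name and the data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (z-based)."""
    n = len(sample)
    x_bar = statistics.mean(sample)   # step 2: sample mean
    s = statistics.stdev(sample)      # step 2: sample standard deviation
    # Assumption: normal critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin  # step 3: the inference

# Step 1: collect data (hypothetical 30 measurements).
sample = [10 + (i % 5) for i in range(30)]
low, high = mean_confidence_interval(sample)
print(f"95% confident the population mean is between {low:.2f} and {high:.2f}")
```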
Hypothesis testing
Test a null hypothesis, or a state of nature of which you do not know the true outcome.
H-naught (H0) typically set to test if two values are equal, or if one is greater/lesser than or equal to the other
H-sub-a: alternative of the null hypothesis.
Use data to infer the true state of the population
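As an illustration only, here is a sketch of one simple test: a two-sided z-test of H0 "the population mean equals the target". The choice of test, the function name, the 0.05 significance level, and the data are all assumptions for the example; the right test depends on the data and the question.

```python
import math
import statistics
from statistics import NormalDist

def z_test_mean(sample, target, alpha=0.05):
    """Two-sided test of H0: population mean == target.

    Returns (z, p_value, reject_h0). Assumes a sample large enough
    (roughly 30 or more) for the normal approximation to be reasonable.
    """
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - target) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

sample = [10 + (i % 5) for i in range(30)]  # hypothetical measurements
z, p_value, reject_h0 = z_test_mean(sample, target=12.5)
print(f"z = {z:.2f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

A small p-value means the data would be unlikely if H0 were true, so H0 is rejected in favor of H-sub-a.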
Control chart
Typically plots data pulled at a consistent rate, using pooled samples (subgroups)
X axis is the sample sequence (ex. 20 values), e.g. pulling 5 parts every hour and plotting the mean of those; Y axis is the measured value
Center line (mean of process)
UCL (upper control limit)
LCL (lower control limit)
Used to infer information about the entire population
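A rough sketch of how the center line, UCL, and LCL might be calculated for an X-bar chart with subgroups of 5 parts. It assumes the common X-bar/R method, where A2 = 0.577 is the standard chart constant for a subgroup size of 5; the function name and the measurements are hypothetical.

```python
import statistics

A2 = 0.577  # X-bar chart constant for subgroups of size 5

def xbar_chart_limits(subgroups):
    """Return (LCL, center line, UCL) for an X-bar chart."""
    means = [statistics.mean(group) for group in subgroups]
    ranges = [max(group) - min(group) for group in subgroups]
    center = statistics.mean(means)   # grand mean of the subgroup means
    r_bar = statistics.mean(ranges)   # average subgroup range
    return center - A2 * r_bar, center, center + A2 * r_bar

# Hypothetical data: 5 parts pulled each hour for 3 hours.
subgroups = [
    [9, 10, 10, 10, 11],
    [10, 10, 11, 11, 12],
    [8, 9, 10, 10, 11],
]
lcl, center, ucl = xbar_chart_limits(subgroups)
print(f"LCL = {lcl:.2f}, center line = {center:.2f}, UCL = {ucl:.2f}")
```

Each hourly mean is then plotted in sequence; points outside the limits suggest something changed in the process.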
Measures of central tendency
Whether or not the center of your process falls close to your target. Looking at the centering of your process
Measures of dispersion
How much variation within your process
What performance does six sigma aim for?
On target performance
As little variation as possible
Do you need to look at both central tendency & dispersion?
Yes. Need to look at BOTH measures to understand your data
The three measures of central tendency
Mean - arithmetic average
x-bar = (sum of all sample values) / n
Median - middle data-point based on ordering
Data point to use = (n+1)/2
Ex. n = 7, use 4th data point
Mode - most frequently occurring value
You can have more than one mode; a distribution with two modes is bimodal.
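A quick sketch of the three measures on a hypothetical set of n = 7 values, using the standard statistics module. The median line also shows the (n+1)/2 rule by hand.

```python
import statistics

data = [3, 5, 5, 6, 7, 8, 8]  # hypothetical values, n = 7

mean = statistics.mean(data)  # sum of values / n

# Median by the (n + 1) / 2 rule: with n = 7, use the 4th ordered data point.
ordered = sorted(data)
position = (len(ordered) + 1) // 2   # 4th point (1-based), valid for odd n
median_by_rule = ordered[position - 1]
median = statistics.median(data)     # library version, also handles even n

modes = statistics.multimode(data)   # every value tied for most frequent

print(f"mean = {mean}, median = {median}, modes = {modes}")
```

Here both 5 and 8 appear twice, so the data set is bimodal.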
The three measures of dispersion
RANGE
= Maximum value - minimum value
Lets us ask: Which process is more tightly grouped around the mean?
STANDARD DEVIATION
Gives a little more information about how much each data point varies from the mean
AKA Sigma values
Calculation: s = square root of ((Sum of all values (Xi - X-bar)^2) / (n - 1))
Xi = a score in the distribution
The smaller the number, the less variation.
VARIANCE
The average of the squared differences from the mean.
Central to projects - goal is to reduce variation around the mean.
The square root is not taken, so the result is expressed in squared units rather than the units of the data.
Calculation: same as standard deviation but WITHOUT square root
(Sum of all values (Xi - X-bar)^2) / (n - 1)
Xi = a score in the distribution
The smaller the number, the less variation.
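The three calculations written out step by step, as a sketch. The two "processes" are hypothetical data sets with the same mean, chosen to show which one is more tightly grouped.

```python
import math

def dispersion(values):
    """Range, sample variance, and sample standard deviation of the values."""
    n = len(values)
    x_bar = sum(values) / n
    # Sum of squared differences from the mean: sum of (Xi - X-bar)^2.
    squared_diffs = sum((x - x_bar) ** 2 for x in values)
    variance = squared_diffs / (n - 1)
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "std_dev": math.sqrt(variance),
    }

process_a = [8, 9, 10, 11, 12]   # hypothetical, mean 10
process_b = [4, 7, 10, 13, 16]   # hypothetical, mean 10

result_a = dispersion(process_a)
result_b = dispersion(process_b)
print("A:", result_a)
print("B:", result_b)
```

Both processes are centered on 10, but process A has the smaller range, variance, and standard deviation, so it has less variation.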
Frequency distribution table
TWO COLUMNS
1st - Classes
2nd - Frequency of classes in data
(Optional 3rd) - Frequency as percentage
Usually collected with a check sheet
ex Determining how often a park is visited over time by certain classes of visitor.
Histograms are the most common illustration of frequency distributions. Pie charts are also used.
How to make a frequency distribution table
- Organize the data into class intervals (ex 0-9, 10-19, etc)
- Remember, intervals must be mutually exclusive (it’s impossible to fall into 2 categories)
- Record the data in the tally column
- Calculate frequency (percentage)
Tips for class intervals
Class intervals should be based on the number of data points.
- Fewer than 100 data points: 5-10 classes
- More than 100 data points: 10-25 classes
- OR classes = square root of number of data points
To determine class interval
- Range/number of classes, i.e. class interval = (maximum value - minimum value)/number of classes
- Make sure classes are mutually exclusive
- Include all data points
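A sketch of building a frequency distribution table with equal-width classes. It uses the square-root rule for the number of classes when none is given. The function name, the returned row layout, and the visit counts are assumptions made for the example.

```python
import math

def frequency_table(values, num_classes=None):
    """Return rows of (class_low, class_high, frequency, percent)."""
    if num_classes is None:
        # Rule of thumb: classes = square root of the number of data points.
        num_classes = max(1, round(math.sqrt(len(values))))
    low, high = min(values), max(values)
    # Class interval = range / number of classes (1.0 if all values are equal).
    width = (high - low) / num_classes or 1.0
    counts = [0] * num_classes
    for v in values:
        # Mutually exclusive: each value lands in exactly one class.
        # The maximum value is placed in the last class so nothing is left out.
        index = min(int((v - low) / width), num_classes - 1)
        counts[index] += 1
    return [
        (low + k * width, low + (k + 1) * width, count, 100 * count / len(values))
        for k, count in enumerate(counts)
    ]

# Hypothetical data: daily park visits over 16 days.
visits = [3, 7, 12, 15, 18, 21, 22, 25, 31, 38, 40, 44, 47, 52, 58, 63]
for class_low, class_high, frequency, percent in frequency_table(visits):
    print(f"{class_low:5.1f} to {class_high:5.1f}: {frequency:2d} ({percent:.1f}%)")
```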
Cumulative frequency
Builds off of the frequency distribution table, but it provides information on the cumulative data.
Used to determine the number of observations above or below a particular value of the data set.
Helpful in understanding the behavior of the data.
Adds additional columns for
- Cum frequency
- Cum percentage
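The two extra columns are running totals of the frequency column. A minimal sketch with hypothetical class frequencies:

```python
from itertools import accumulate

counts = [2, 5, 8, 4, 1]  # hypothetical frequency of each class, in order

cum_frequency = list(accumulate(counts))   # running total of the counts
total = cum_frequency[-1]                  # last running total = all observations
cum_percentage = [100 * c / total for c in cum_frequency]

for count, cum_f, cum_p in zip(counts, cum_frequency, cum_percentage):
    print(f"frequency {count:2d} | cum frequency {cum_f:2d} | cum percentage {cum_p:5.1f}%")
```

Reading down the cumulative columns shows how many observations fall at or below each class.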
Scatter diagrams
A way to graphically understand if there is a relationship between two variables.
X - Independent variable (Causal variable)
Y - Dependent variable (Result)
Straight line through the dots
- Calculated by determining the “best fit” line
- The slope shows the direction of the relationship (positive or negative); how tightly the points hug the line shows how strongly they’re correlated
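A sketch of how the "best fit" line could be calculated with ordinary least squares, together with the correlation coefficient r. Using least squares and Pearson's r is an assumption here (they are the usual choices); the function name and the data are hypothetical.

```python
import math

def best_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # +1 or -1: perfect line, 0: no linear relationship
    return slope, intercept, r

# Hypothetical process parameter (X) and output characteristic (Y).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r = best_fit(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

The sign of the slope gives the direction of the relationship, while r close to +1 or -1 indicates closely grouped points.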
What questions do scatter plots answer?
- Is there a relationship?
- Is there a common pattern?
What are the types of correlation?
High positive: closely grouped points
- Ascending from left to right
- Means when you increase this process parameter, you’re increasing the output characteristic
High negative: still closely grouped
- Descends from left to right
- As you increase process parameter, you’re decreasing output characteristic
Low Correlations
- Still have a best fit line, but the data points are widely scattered
- The distance between points and the best fit line is large
- This is a weak relationship and should not be used for estimation
No Correlation
- Uniformly scattered data points with no discernible line
Non-Linear Correlation
- Points are still closely grouped, and move together
- But, prediction equation is non-linear (squiggly line)
Outliers
Points very far from the cluster.
Require more information to understand (inaccurate measurements? process failure?)
Normal probability plot
A graphical way of comparing a data set against the normal distribution, based on empirical data
Usually built from scatter plots
Creating a probability plot
EXAMPLE: COMPLAINTS COUNTED OVER 8 DAYS
Start with cumulative frequency table.
We end up with cumulative probability
Then we graph on the probability plot (day on x axis, cumulative probability on y-axis). Then we draw the best fit line
Did we end up with a (roughly) straight line? If so, the data are approximately normally distributed.
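A sketch of the straight-line check in code. Two assumptions are labeled here: the cumulative probability for the i-th ordered value is taken as (i - 0.5) / n, a common plotting convention, and each probability is converted to a standard normal z-score, which is what the scale of normal probability paper does. The function names and data are hypothetical.

```python
import math
from statistics import NormalDist

def normal_plot_points(values):
    """Pair each sorted value with the z-score of its cumulative probability."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(ordered, start=1):
        # Assumed plotting position; keeps probabilities strictly inside (0, 1).
        cumulative_probability = (i - 0.5) / n
        points.append((standard_normal.inv_cdf(cumulative_probability), value))
    return points

def straightness(points):
    """Correlation of the plotted points; close to 1.0 means a straight line."""
    n = len(points)
    mean_z = sum(z for z, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    szv = sum((z - mean_z) * (v - mean_v) for z, v in points)
    szz = sum((z - mean_z) ** 2 for z, _ in points)
    svv = sum((v - mean_v) ** 2 for _, v in points)
    return szv / math.sqrt(szz * svv)

# Hypothetical data: evenly spaced quantiles of a bell-shaped and a skewed distribution.
probabilities = [(i - 0.5) / 20 for i in range(1, 21)]
bell_shaped = [NormalDist(50, 5).inv_cdf(p) for p in probabilities]
skewed = [-math.log(1 - p) for p in probabilities]

r_bell = straightness(normal_plot_points(bell_shaped))
r_skewed = straightness(normal_plot_points(skewed))
print(f"bell-shaped data: r = {r_bell:.3f}")
print(f"skewed data:      r = {r_skewed:.3f}")
```

The skewed data bends away from the line, which shows up as a lower r.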
What question does a probability plot help answer?
Trying to understand whether or not you have a normal distribution
- If not, we either transform data or try different techniques
Why are probability plots beneficial?
Probability plot is beneficial over other graphical techniques because it can use small samples AND is easy to implement.
Histograms
A way to graphically display information from a frequency distribution
Answers:
- Is the distribution normal?
- If something happened in your process, you’ll see it in the data
- Also compare outputs from 2 different processes
How to construct a histogram
X-axis (depicts your mutually exclusive variable) can be either intervals from the frequency distribution OR a specified value.
Y-axis is what your frequency values are (number of calls dropped per day, frequency of defective parts per day)
Can be constructed from a check sheet
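A tiny text-only sketch of a histogram built from class intervals and their frequencies. For simplicity the bars run sideways, so the classes appear down the page instead of along the X-axis. The function name and the data are hypothetical.

```python
def text_histogram(labels, frequencies, symbol="#"):
    """Return one line per class: the class label, a bar, and the frequency."""
    lines = []
    for label, frequency in zip(labels, frequencies):
        lines.append(f"{label:>7} | {symbol * frequency} ({frequency})")
    return lines

# Hypothetical check-sheet totals: dropped calls per day, grouped into classes.
labels = ["0-9", "10-19", "20-29", "30-39"]
frequencies = [3, 8, 5, 1]
for line in text_histogram(labels, frequencies):
    print(line)
```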
How to interpret a histogram
If the distribution is normal:
- What’s the dispersion/spread?
- Were there changes in the data?
- You can also see central tendency (mean, mode) & dispersion (range).
- Multiple peaks/bimodal/multimodal?
Mode = tallest peak of the distribution
Outliers - either on the far left or far right
Stem-and-leaf plots
Similar to a histogram
Break your data into groups
Stem (1st column) - the initial part of the data
Leaf (second column) - the last/final digit of each data point
EXAMPLE
We have a group of numbers between 453 and 527.
Stem - Leading two digits (hundreds and tens)
Leaf - final digit (ones place)
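A sketch of the example in code, for whole numbers. The specific values between 453 and 527 are made up, since the card only gives the range.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (all but the last digit) and leaf (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 453 -> stem 45, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical numbers between 453 and 527.
numbers = [453, 458, 461, 467, 467, 472, 485, 490, 503, 519, 527]
plot = stem_and_leaf(numbers)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Longer rows of leaves mark the stems where the data is concentrated.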
What is the advantage of a stem and leaf plot?
By looking at the “leaf” column, you can see the relative frequency
Can also see the distribution of your data
Can also compare 2 data sets (grades on 2 quizzes a week apart)
Benefits of stem and leaf plot
Allow for a quick overview of the data
Helps highlight outliers
Used for variable AND categorical data
Cons of stem and leaf plot
Not good for small data sets
Not good for very large data sets
Need a mid-range data set.
Box and whiskers diagram
AKA The box plot
Useful in showing how much variation is in your data because it shows
- Lower 25%
- Middle 50%
- Upper 25%
How spread out it is and where the data lies
Making a box and whiskers diagram
- Calculate median of data (middle)
- Find the lowest point; the lower whisker covers the bottom 25%
- Calculate the median of the values between the lowest point and the overall median; that’s the first quartile, the bottom of the box.
- Do the same from the highest point to get the third quartile, the top of the box.
- Remaining middle is interquartile range (25%-75%)
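The values needed to draw the diagram can be sketched as below. It assumes the "median of each half" method for the quartiles; other quartile conventions exist and give slightly different numbers. The function name and data are hypothetical, and at least two data points are needed.

```python
import statistics

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum, and IQR."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For an odd count, the overall median itself is left out of both halves.
    upper_half = ordered[half + (n % 2):]
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    return {
        "min": ordered[0],                      # end of the lower whisker
        "q1": q1,                               # bottom of the box
        "median": statistics.median(ordered),   # line inside the box
        "q3": q3,                               # top of the box
        "max": ordered[-1],                     # end of the upper whisker
        "iqr": q3 - q1,                         # interquartile range (middle 50%)
    }

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical data
print(summary)
```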
How to interpret box and whiskers plots
Useful for comparing 2 different samples of data
- Compare 1st to second
Narrow interquartile range = process more in control
See outliers
Identify shape of interquartile range