Chapter 2 - Descriptive Statistics Flashcards
John Tukey
- 1915 - 2000
- pioneered exploratory data analysis (EDA), including boxplots and stem-and-leaf plots
- coined terms such as bit and software
Features of a good numeric or graphic form of data summarization
- self-contained
- understandable without reading the text
- clearly labeled, with well-defined terms
- indicate principal trends in data
Measures of location
- also known as measures of central tendency
- data summarization is important before any inferences can be made
- a measure of location summarizes the data by defining the center or middle of the sample
Arithmetic mean limitation
- oversensitive to extreme values
- when extreme values are present, it may not be representative of the location of the majority of sample points
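A minimal sketch of this limitation, using a made-up sample (the values are illustrative, not from the chapter): adding one extreme value moves the mean a long way but barely moves the median.

```python
from statistics import mean, median

# Hypothetical sample: five typical values, then the same values plus one extreme value.
typical = [10, 11, 12, 13, 14]
with_extreme = typical + [100]

mean_typical = mean(typical)
median_typical = median(typical)
mean_extreme = mean(with_extreme)
median_extreme = median(with_extreme)

print(f"without extreme value: mean={mean_typical}, median={median_typical}")
print(f"with extreme value:    mean={mean_extreme:.2f}, median={median_extreme}")
```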
Symmetric distribution
- arithmetic mean is approximately the same as the median
Positively skewed distribution
- tail end is on the right side
- arithmetic mean tends to be larger than the median
Negatively skewed distribution
- tail end is on the left side
- arithmetic mean tends to be smaller than the median
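The three distribution shapes can be compared with a small sketch. The samples and the helper name `mean_minus_median` are hypothetical, chosen only so that the tail direction is obvious; the sign of mean minus median is a rough hint, not a formal test of skewness.

```python
from statistics import mean, median

# Hypothetical samples chosen only to illustrate each shape.
symmetric = [1, 2, 3, 4, 5]
positively_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20]      # long tail on the right
negatively_skewed = [21 - x for x in positively_skewed]  # mirror image, tail on the left


def mean_minus_median(sample):
    """Return mean - median; the sign hints at the direction of the skew."""
    return mean(sample) - median(sample)


for name, sample in [
    ("symmetric", symmetric),
    ("positively skewed", positively_skewed),
    ("negatively skewed", negatively_skewed),
]:
    print(f"{name}: mean - median = {mean_minus_median(sample):+.1f}")
```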
Mode
- the most frequently occurring value among all the observations in a sample
- data distributions may have one or more modes (unimodal, bimodal, trimodal, etc.)
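As a quick illustration with made-up samples, Python's standard library can list every value tied for the highest frequency (`statistics.multimode`, available from Python 3.8):

```python
from statistics import multimode

# Hypothetical samples: one with a single mode, one with two.
unimodal = [1, 2, 2, 3]
bimodal = [1, 2, 2, 3, 3, 4]

# multimode returns all values tied for the highest frequency.
modes_unimodal = multimode(unimodal)
modes_bimodal = multimode(bimodal)

print("unimodal sample modes:", modes_unimodal)
print("bimodal sample modes: ", modes_bimodal)
```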
Range
- the difference between the largest and smallest observations in a sample
- range is very sensitive to extreme observations or outliers
- the larger the sample size n, the larger the range tends to be, which makes it difficult to compare ranges from data sets of different sizes
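A small sketch of the sample-size effect, assuming simulated observations (the normal distribution, its parameters, and the seed are arbitrary choices for illustration). Adding observations can only widen the range or leave it unchanged, never shrink it.

```python
import random


def sample_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
observations = [rng.gauss(50, 10) for _ in range(1000)]

# Range of the first n observations for growing n.
ranges = {n: sample_range(observations[:n]) for n in (10, 100, 1000)}
for n, r in ranges.items():
    print(f"n={n:4d}  range={r:.1f}")
```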
Quantiles or percentiles
- percentiles (quantiles) are a better way than the range to quantify the spread in a data set
- percentiles are less sensitive to outliers and are not greatly affected by the sample size
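The following sketch contrasts quartiles with the range on a made-up sample. Note that several conventions exist for computing percentiles; this example uses the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from the definition used in a given textbook. The helper name `quartiles` is hypothetical.

```python
from statistics import quantiles

# Hypothetical samples: identical except that the largest value becomes an outlier.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]


def quartiles(sample):
    """Return the 25th, 50th, and 75th percentiles ('inclusive' convention)."""
    return quantiles(sample, n=4, method="inclusive")


q_clean = quartiles(clean)
q_outlier = quartiles(with_outlier)
range_clean = max(clean) - min(clean)
range_outlier = max(with_outlier) - min(with_outlier)

print("quartiles:", q_clean, "vs", q_outlier)
print("range:    ", range_clean, "vs", range_outlier)
```

In this sample the quartiles are unchanged by the outlier, while the range changes dramatically.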
Standard deviation
- standard deviation is a reasonable measure of spread if the distribution is bell-shaped
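A minimal sketch of the calculation on a made-up sample. The card does not state which divisor is intended, so both are shown: `statistics.stdev` divides by n - 1 (the usual sample standard deviation) and `statistics.pstdev` divides by n.

```python
import math
from statistics import mean, pstdev, stdev

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

sample_sd = stdev(sample)        # sum of squared deviations divided by n - 1
population_sd = pstdev(sample)   # sum of squared deviations divided by n

# The n - 1 version computed by hand, for comparison.
center = mean(sample)
by_hand = math.sqrt(sum((x - center) ** 2 for x in sample) / (len(sample) - 1))

print(f"sample sd={sample_sd:.3f}, population sd={population_sd:.3f}, by hand={by_hand:.3f}")
```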
Grouped data
- when sample size is too large to display all the raw data, data are frequently collected in grouped form
- the simplest way to display the data is to generate a frequency distribution using a statistical package
Frequency distribution
- frequency distribution = ordered display of each value in a data set together with its frequency
- if the number of unique sample values is large, then a frequency distribution may still be too detailed
- if there are too many unique values, the data are categorized into broader groups
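As an illustration, a frequency distribution and a coarser grouped version can be built with `collections.Counter`. The data and the bin width `WIDTH` are assumptions made for this sketch, not values from the chapter.

```python
from collections import Counter

data = [3, 1, 2, 3, 3, 2, 5, 1, 3, 8]  # hypothetical sample

# Frequency distribution: each distinct value with its count, in ascending order.
frequency = dict(sorted(Counter(data).items()))

# Broader groups: bins of width 3, keyed by inclusive (low, high) bounds.
WIDTH = 3  # hypothetical bin width
grouped = Counter(
    (x // WIDTH * WIDTH, x // WIDTH * WIDTH + WIDTH - 1) for x in data
)
grouped = dict(sorted(grouped.items()))

print("frequency distribution:", frequency)
print("grouped:", grouped)
```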
Graphic displays of data
- bar graphs
- stem and leaf plots
- box and whisker plot
- scatter plot
- histogram
Bar graphs
- identity of the sample points within the respective groups is lost
Stem and leaf plots
- easy to compute the median and other quantiles
- each data point is split into a stem (leading digits) and a leaf (trailing digit)
- the collection of leaves indicates the shape of the data distribution
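A rough sketch of how such a plot can be built, assuming non-negative integers where the stem is everything but the last digit and the leaf is the last digit. The function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit).

    Assumes non-negative integers.
    """
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)


data = [23, 12, 38, 21, 41, 15, 34, 23, 39]  # hypothetical sample
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

print("median:", median(data))
```

Because the leaves are sorted within each stem, the median can be read off by counting to the middle leaf; the row lengths show the shape of the distribution.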
Box and whisker plot
- uses the relationships among the median, upper quartile, and lower quartile to describe the skewness or symmetry of a distribution
- a vertical bar connects the upper quartile to the largest non-outlying value in the sample
- a vertical bar connects the lower quartile to the smallest non-outlying value in the sample
Box and whisker plot (symmetric)
- upper and lower quartiles should be approximately equally spaced from the median
Box and whisker plot (positively skewed)
- upper quartile is farther from the median than the lower quartile
Box and whisker plot (negatively skewed)
- lower quartile is farther from the median than the upper quartile
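The three box plot shapes above can be expressed as a comparison of the two gaps around the median. This is only a sketch: the function name `box_plot_shape` and the `tolerance` used to decide what counts as "approximately equally spaced" are assumptions, not part of the chapter.

```python
def box_plot_shape(lower_quartile, median, upper_quartile, tolerance=0.1):
    """Classify the shape suggested by the spacing of the quartiles.

    tolerance is a hypothetical cutoff: gaps that differ by no more than this
    fraction of the box width are treated as approximately equal.
    """
    upper_gap = upper_quartile - median
    lower_gap = median - lower_quartile
    box_width = upper_quartile - lower_quartile
    if box_width == 0 or abs(upper_gap - lower_gap) <= tolerance * box_width:
        return "approximately symmetric"
    return "positively skewed" if upper_gap > lower_gap else "negatively skewed"


print(box_plot_shape(10, 15, 20))
print(box_plot_shape(10, 12, 20))
print(box_plot_shape(10, 18, 20))
```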
Box and whisker plot (outlying value)
- x > upper quartile + 1.5 × IQR, or
- x < lower quartile - 1.5 × IQR
- IQR (interquartile range) = upper quartile - lower quartile
Box and whisker plot (extreme outlying value)
- x > upper quartile + 3.0 × IQR, or
- x < lower quartile - 3.0 × IQR
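The outlying and extreme outlying rules can be sketched as a single check, taking IQR as the upper quartile minus the lower quartile. The function name `classify_point` and the example quartiles are hypothetical; the quartiles are passed in directly because textbooks and software differ in how they compute them.

```python
def classify_point(x, lower_quartile, upper_quartile):
    """Label x using the 1.5 x IQR and 3.0 x IQR rules.

    An extreme outlying value also satisfies the outlying rule; the most
    severe label is returned.
    """
    iqr = upper_quartile - lower_quartile
    if x > upper_quartile + 3.0 * iqr or x < lower_quartile - 3.0 * iqr:
        return "extreme outlying"
    if x > upper_quartile + 1.5 * iqr or x < lower_quartile - 1.5 * iqr:
        return "outlying"
    return "not outlying"


# Hypothetical quartiles: lower = 10, upper = 20, so IQR = 10.
for x in (20, 36, 51, -6, -21):
    print(x, "->", classify_point(x, 10, 20))
```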