Chapter 3 Numerically Summarizing Data Flashcards
Measures of Central Tendency
Give a feel for where the center of gravity of the data set is: Mean, Median, Mode
Arithmetic Mean
The sum of all the data values divided by the number of observations; commonly called the average
Median
The value that lies in the middle of the data set when the data are arranged in ascending order
Mode
The most frequent observation of the variable that occurs in the data set
A data set can have no mode, one mode, or more than one mode
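A minimal Python sketch of the three measures, using made-up data. The helper name modes is hypothetical; it follows the convention on these cards that a data set in which no value repeats has no mode.

```python
import statistics
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:  # no observation occurs more than once, so there is no mode
        return []
    return [v for v, c in counts.items() if c == top]


data = [2, 3, 3, 5, 7, 10]  # made-up illustration data

mean = sum(data) / len(data)      # 30 / 6
median = statistics.median(data)  # average of the two middle values, 3 and 5

print(mean, median, modes(data))
print(modes([1, 2, 3]))           # no value repeats: no mode
print(modes([1, 1, 2, 2, 3]))     # two values tie: two modes
```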
The symbol for the mean of a population
The Greek letter mu (μ)
The symbol for the mean of a sample
x̄ (“x-bar”)
The symbol for the median of a population
M
The symbol for the median of a sample
m
Relation between the mean, median, and a distribution shape that is skewed to the left
Mean is substantially smaller than the median
Relation between the mean, median, and a distribution shape that is symmetric
Mean roughly equal to median
Relation between the mean, median, and a distribution shape that is skewed to the right
Mean substantially larger than median
What does it mean when it is said that a numerical summary is resistant?
Extreme values (very large or small) relative to the data do not affect its value substantially
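Resistance can be illustrated with a short Python sketch on made-up data: replacing one value with an extreme one moves the mean a long way but leaves the median where it was.

```python
import statistics

data = [10, 12, 13, 15, 16]           # made-up illustration data
with_outlier = [10, 12, 13, 15, 160]  # same data with one extreme value

mean_before = sum(data) / len(data)
mean_after = sum(with_outlier) / len(with_outlier)

median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # the mean shifts substantially
print(median_before, median_after)  # the median does not change
```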
Outlier
A data point that differs significantly from the other observations. Outliers can make a distribution appear skewed
What is the better measure of central tendency when the distribution is skewed?
The median
Measures of Dispersion
Show the degree to which the data in a population or sample is spread out: range, standard deviation, variance, interquartile range
Range
The difference between the largest data value and the smallest data value in a data set. Denoted as R
Range is not resistant to outlier values
Population Variance
The sum of the squared deviations from the population mean divided by the number of observations in the population, N. Denoted by the Greek letter sigma squared (σ^2)
Population Standard Deviation
The positive square root of the population variance. Denoted by the Greek letter sigma (σ).
The population variance and standard deviation are
Parameters
Sample Variance
The sum of the squared deviations from the sample mean divided by the size of the sample MINUS 1 (n - 1). Denoted by s squared (s^2).
Sample Standard Deviation
The positive square root of the sample variance. Denoted by s
The sample variance and standard deviation are
Statistics
When calculating the sample variance, the denominator is
n - 1
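A Python sketch of the variance and standard deviation formulas, assuming made-up data. The same five values are treated first as a whole population (divide by N) and then as a sample (divide by n - 1) purely for comparison; the results are checked against the standard library functions statistics.pvariance and statistics.variance.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # made-up illustration data
n = len(data)
mean = sum(data) / n    # 30 / 5

# Sum of the squared deviations about the mean
sum_sq_dev = sum((x - mean) ** 2 for x in data)

pop_var = sum_sq_dev / n           # population variance: divide by N
sample_var = sum_sq_dev / (n - 1)  # sample variance: divide by n - 1

pop_sd = math.sqrt(pop_var)        # population standard deviation
sample_sd = math.sqrt(sample_var)  # sample standard deviation

print(pop_var, sample_var)
print(pop_sd, sample_sd)

# The hand-computed values agree with the standard library
print(math.isclose(pop_var, statistics.pvariance(data)))
print(math.isclose(sample_var, statistics.variance(data)))
```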
The Empirical Rule
States that, if the summary measures of mean (μ) and standard deviation (σ) are known, and if the distribution is approximately bell-shaped:
≈ 68% of the data will lie within ±1σ of the mean
≈ 95% of the data will lie within ±2σ of the mean
≈ 99.7% of the data will lie within ±3σ of the mean
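As a small illustration, the three intervals can be computed directly in Python. The function name empirical_rule_intervals and the values μ = 100, σ = 15 are made up for the example, and the percentages apply only if the distribution is approximately bell-shaped.

```python
def empirical_rule_intervals(mu, sigma):
    """Return the intervals mu ± 1σ, mu ± 2σ, mu ± 3σ keyed by the number of σ."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}


# Approximate share of data expected in each interval (bell-shaped data only)
approx_percent = {1: 68, 2: 95, 3: 99.7}

intervals = empirical_rule_intervals(100, 15)  # hypothetical mean and standard deviation
for k, (low, high) in intervals.items():
    print(f"about {approx_percent[k]}% of the data between {low} and {high}")
```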
Outlier for a bell-shaped distribution
Any data point more than 3 standard deviations (3σ) below the mean or more than 3σ above the mean
Z-Score
Represents the distance that a data value is from the mean in terms of the number of standard deviations. We find it by subtracting the mean from the data value and dividing this result by the standard deviation. Round z-scores to the nearest hundredth.
Z-Score for a data value in a population
z = (x − μ)/σ, where μ is the population mean and σ is the population standard deviation
Z-Score for a data value in a sample
z = (x − x̄)/s, where x̄ is the sample mean and s is the sample standard deviation
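A short Python sketch of the z-score calculation, rounded to the nearest hundredth. The function name z_score and all numbers are made up for illustration; the sample case uses statistics.mean and statistics.stdev for the sample mean and sample standard deviation.

```python
import statistics


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean, rounded to 2 places."""
    return round((x - mean) / sd, 2)


# Population case: hypothetical mu = 100 and sigma = 15
z_above = z_score(130, 100, 15)
z_below = z_score(85, 100, 15)
print(z_above, z_below)

# Sample case: the mean and standard deviation come from the sample itself
sample = [4, 8, 6, 5, 7]  # made-up illustration data
z_sample = z_score(8, statistics.mean(sample), statistics.stdev(sample))
print(z_sample)
```

A positive z-score means the value is above the mean and a negative z-score means it is below the mean.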
kth percentile
A value such that at least k percent of the observations are less than or equal to this value, and at least (100-k) percent of the observations are greater than or equal to this value. Denoted Pk
1st quartile
Denoted Q1, divides the bottom 25% of the data from the top 75%. Therefore, the 1st quartile is equivalent to the 25th percentile.
2nd quartile
Denoted Q2, divides the bottom 50% of the data from the top 50% of the data, so that the 2nd quartile is equivalent to the 50th percentile, which is equivalent to the median.
3rd quartile
Denoted Q3, divides the bottom 75% of the data from the top 25% of the data, so that the 3rd quartile is equivalent to the 75th percentile.
Method for determining quartiles using Excel
To find Q1: =QUARTILE.EXC (highlight the data, 1)
To find Q2 (The Median, M): =QUARTILE.EXC (highlight the data, 2)
To find Q3: =QUARTILE.EXC (highlight the data, 3)
Method for determining quartiles by Inspection
If the data set is relatively small, the direct “by inspection” method can be used:
Step 1: Arrange the data in ascending order.
Step 2: Determine the median, M, or second quartile, Q2.
Step 3: Divide the data set into halves: the observations below (to the left of) M and the observations above M. The first quartile, Q1, is the median of the bottom half, and the third quartile, Q3, is the median of the top half.
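The three steps above can be sketched in Python as follows. The function name quartiles_by_inspection is hypothetical, and the sketch assumes that when the number of observations is odd the median itself is left out of both halves. Results from this method can differ from those of software functions that interpolate between values.

```python
import statistics


def quartiles_by_inspection(values):
    """Return (Q1, M, Q3) using the by-inspection method; needs at least 2 values."""
    data = sorted(values)                   # Step 1: ascending order
    n = len(data)
    m = statistics.median(data)             # Step 2: the median, M (Q2)
    half = n // 2
    lower = data[:half]                     # Step 3: observations below M
    upper = data[half:] if n % 2 == 0 else data[half + 1:]  # observations above M
    q1 = statistics.median(lower)           # Q1 is the median of the bottom half
    q3 = statistics.median(upper)           # Q3 is the median of the top half
    return q1, m, q3


odd_result = quartiles_by_inspection([13, 1, 9, 5, 11, 3, 7])  # made-up data, n = 7
even_result = quartiles_by_inspection([8, 2, 6, 4])            # made-up data, n = 4
print(odd_result)
print(even_result)
```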
Interquartile range
Defines the range of the middle 50% of the observations in a data set. Denoted as IQR = Q3 – Q1
When is it best to use the median as the measure of central tendency and the interquartile range as the measure of dispersion and why?
When the distribution of data is highly skewed or contains extreme observations; because these measures are resistant.
Method for checking a data set for outliers
Step 1: Determine the first and third quartiles of the data.
Step 2: Compute the interquartile range.
Step 3: Determine the fences. Fences serve as cutoff points for determining outliers.
Lower fence = Q1 − 1.5 (IQR)
Upper fence = Q3 + 1.5 (IQR)
Step 4: If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier.
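A Python sketch of the four steps on made-up data. The function name find_outliers is hypothetical, and the quartiles here come from statistics.quantiles with method="exclusive", which is one of several quartile conventions; a different convention can give slightly different fences.

```python
import statistics


def find_outliers(values):
    """Return (lower fence, upper fence, list of outliers) using the 1.5 * IQR rule."""
    # Step 1: first and third quartiles (assumed convention: 'exclusive' method)
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    # Step 2: interquartile range
    iqr = q3 - q1
    # Step 3: fences
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Step 4: values outside the fences are considered outliers
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]  # made-up illustration data
lower_fence, upper_fence, outliers = find_outliers(data)
print(lower_fence, upper_fence, outliers)
```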
Five-number summary
Consists of the smallest data value, Q1, the median, Q3, and the largest data value. Used to learn information about the extremes of the data set.
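A minimal Python sketch of the five-number summary on made-up data. The function name five_number_summary is hypothetical, and the quartiles again assume the "exclusive" convention of statistics.quantiles.

```python
import statistics


def five_number_summary(values):
    """Return (smallest value, Q1, median, Q3, largest value)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, median, q3, max(values)


data = [12, 15, 18, 20, 22, 25, 30]  # made-up illustration data
summary = five_number_summary(data)
print(summary)
```

These five values are the ones a box plot draws: the box spans Q1 to Q3 with a line at the median.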
Method for constructing a box plot using the TI-84 calculator
Step 1: Type data into L1
Step 2: 2nd > STAT PLOT
Step 3: Select PLOT 1 and set it to ON
Step 4: Select the box plot with outliers graph (4th from left)
Step 5: Press GRAPH button
Step 6: If graph is not visible: ZOOM > 9: ZOOM STAT
Resistant
A numerical summary of data is said to be resistant if extreme observations (very large or small) relative to the data do not affect its value substantially.
Multimodal
Describes a data set that has three or more values that occur with the highest frequency
Bias
Occurs whenever a statistic consistently underestimates or overestimates a parameter
Degrees of freedom
For the sample standard deviation, we call n−1 the degrees of freedom because the first n−1 observations have freedom to be whatever value they wish, but the nth observation has no freedom. It must be whatever value forces the sum of the deviations about the mean to equal zero.
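The constraint can be checked with a few lines of Python on made-up data: the deviations about the sample mean sum to zero, so the last deviation is determined by the first n − 1.

```python
data = [4, 8, 6, 5, 7]  # made-up illustration data
mean = sum(data) / len(data)

# Deviations about the sample mean
deviations = [x - mean for x in data]
total = sum(deviations)

# The nth deviation is forced to be the negative of the sum of the first n - 1
forced_last = -sum(deviations[:-1])

print(deviations)
print(total)
print(forced_last, deviations[-1])
```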
Describe the Distribution
Means to describe its shape (skewed left, skewed right, or symmetric), its center (mean or median), and its spread (standard deviation or interquartile range).
Boxplot
A graphical summary of quantitative data used to identify the shape of a distribution and outliers.
Bimodal
Describes a data set that has two values that occur with the highest frequency
Deviation about the mean
For the ith observation in a population: xi – μ.
For the ith observation in a sample: xi − x̄
No mode
Occurs if no observation occurs more than once in a data set
Quartiles
Divide data sets into fourths, or four equal parts.
Measures of Position
z-scores, percentiles, outliers
Method for Checking for Outliers by Using Quartiles
Step 1: Determine the first and third quartiles of the data.
Step 2: Compute the interquartile range.
Step 3: Determine the fences: LF = Q1 − 1.5 (IQR); UF = Q3 + 1.5 (IQR)
Step 4: If a data point value is less than the lower fence or greater than the upper fence, it is considered an outlier.