chapter 2: describing distributions with numbers Flashcards
measures
results of functions applied to the data
n
the number of observations in our dataset
mode
value that appears most often
we call the dataset “bimodal” or “multimodal” when…
two values (bimodal) or several values (multimodal) tie for appearing most often, so the dataset has more than one mode
𝑥𝑖
the value of the 𝑖th observation in an ordered dataset
The median M
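the observation that has just as many observations to the left of it as to the right of it, or the value in our dataset that is greater than just as many values in our dataset as it is less than. To find its location (not its value) in the ordered list, use (n+1)/2. When n is even, this location falls halfway between two observations, and the median is the mean of those two values.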
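The location rule can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `median` and the sample lists are made up for the example.

```python
def median(values):
    """Median via the (n + 1) / 2 location rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2               # 1-based position in the ordered list
    if location.is_integer():            # odd n: one middle observation
        return ordered[int(location) - 1]
    lower = ordered[int(location) - 1]   # location ends in .5:
    upper = ordered[int(location)]       # average the two neighbours
    return (lower + upper) / 2

odd_example = median([7, 1, 5])          # location 2 in [1, 5, 7] -> 5
even_example = median([7, 1, 5, 3])      # location 2.5 in [1, 3, 5, 7] -> 4.0
```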
The minimum (or min) and maximum (or max)
the first and last values in the ordered list, i.e. the smallest and the greatest values in our dataset, respectively
range
The difference between the max and the min
The first and third quartiles, (𝑄1 and 𝑄3)
the median of the values less than the median and the median of the values greater than the median, respectively. You calculate quartiles the way you calculate the median M
five-number summary
a listing of these five values: minimum, Q1, median, Q3, and maximum
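A rough Python sketch of the five-number summary follows. It assumes the quartiles are found by splitting the ordered list by position and leaving the middle observation out of both halves when n is odd; other conventions exist and can give slightly different quartiles. The helper names and sample data are hypothetical.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max); needs at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]                      # positions left of the median
    # For odd n, skip the middle observation itself.
    upper_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

summary = five_number_summary([1, 3, 4, 6, 7, 8, 10])   # (1, 3, 6, 8, 10)
```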
box plot
A visual representation of the five-number summary
inter-quartile range (IQR)
the difference between the third and first quartiles: IQR = 𝑄3 − 𝑄1
Outlier Rule
If an observation has a value greater than 𝑄3 + (1.5 × 𝐼𝑄𝑅) or less than 𝑄1 − (1.5 × 𝐼𝑄𝑅), then it can be considered an outlier
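As an illustration of the rule, the sketch below computes the two cut-offs and flags anything outside them. The function names and the data are invented for the example, and the quartiles (3.5 and 9) were worked out by hand for this particular list.

```python
def outlier_fences(q1, q3):
    """Lower and upper cut-offs from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def flag_outliers(values, q1, q3):
    """Observations that fall outside the two cut-offs."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]

data = [1, 3, 4, 6, 7, 8, 10, 25]
fences = outlier_fences(3.5, 9)           # (-4.75, 17.25)
suspects = flag_outliers(data, 3.5, 9)    # [25]
```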
The five-number summary is ideal for [blank]
skewed data or data with outliers
True or false: Boxplots of multiple populations can be graphed together to compare their centers and spreads
True
mean
an alternative measure of center that has the same units as our observations. Notice that when a median or quartile falls between two observations, we use the mean of their values
the mean x̄ formula
x̄ = (1/n) Σ 𝑥𝑖: add up every observation 𝑥𝑖 in the dataset (Σ means "the sum of") and divide by n, the number of observations. In other words, the average of the observations.

standard deviation s
an alternative measure of variability that also has the same units as the observations. Standard deviation is an “average” of how far observation values are from the mean of the dataset. It is the square root of the variance s²
the variance of the dataset s²
The variance s² of a set of observations is an average of the squares of the deviations of the observations from their mean.
the formula for standard deviation
s = √( (1/(n − 1)) Σ (𝑥𝑖 − x̄)² ): subtract the mean from each observation, square each deviation, add the squares up, divide by n − 1 (one fewer than the number of observations), and take the square root
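The formula can be traced step by step in Python. This is a sketch for study purposes with made-up function names and data; Python's standard `statistics.stdev` performs the same n − 1 calculation.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

def standard_deviation(values):
    """Square root of the variance; same units as the observations."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]    # mean is 5, squared deviations sum to 32
s_squared = variance(data)         # 32 / 7
s = standard_deviation(data)       # square root of 32 / 7, about 2.14
```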

s
The standard deviation. This measures variability about the mean and should be used only when the mean is chosen as the measure of center. s is always zero or greater than zero. s = 0 only when there is no variability. s has the same units of measurement as the original observations.
resistant measures
depend only upon the ordering of the data, so a few extreme observations barely affect them (e.g. the median and quartiles)
non-resistant measures
depend on the particular values of the observations (mean, standard deviation, and variance)
the mean and the median will coincide if…
our distribution is symmetric
the mean will be further out in the tail if…
our distribution is skewed
proportions for our dataset
e.g. the number of observations with a value less than a given value, divided by the size of the dataset
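A proportion of this kind is a count divided by n. A minimal sketch, with a hypothetical function name and sample data:

```python
def proportion_below(values, threshold):
    """Share of observations strictly less than the threshold."""
    count = sum(1 for x in values if x < threshold)
    return count / len(values)

share = proportion_below([1, 3, 4, 6, 7, 8, 10, 25], 7)   # 4 of 8 -> 0.5
```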
Σ
(capital sigma) means add them all up
x̄
the mean, or numeric average of the observations: add the values and divide by the number of observations

True or false: the mean x̄ is resistant to outliers
False. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.
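To see how one extreme observation pulls the mean while barely moving the median, here is a small comparison using Python's standard `statistics` module. The data values are invented for illustration.

```python
import statistics

data = [1, 3, 4, 6, 7]
with_outlier = data + [100]        # add one extreme observation

mean_before = statistics.mean(data)             # 4.2
mean_after = statistics.mean(with_outlier)      # 121 / 6, about 20.2
median_before = statistics.median(data)         # 4
median_after = statistics.median(with_outlier)  # (4 + 6) / 2 = 5.0
```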
n - 1
The degrees of freedom of the variance or standard deviation. When finding the variance, we divide the sum by one fewer than the number of observations. The reason is that the deviations 𝑥𝑖 − x̄ always sum to exactly 0, so that knowing n − 1 of them determines the last one.
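The claim that the deviations sum to zero, and that the last one is therefore fixed by the others, can be checked numerically. The data below are made up for the example.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]
x_bar = sum(data) / len(data)              # 5.0

deviations = [x - x_bar for x in data]
total = sum(deviations)                    # 0.0

# Knowing the first n - 1 deviations is enough to recover the last one.
recovered_last = -sum(deviations[:-1])     # 4.0, same as deviations[-1]
```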
choosing a summary (the five number summary vs. x̄ and s)
The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use x̄ and s only for reasonably symmetric distributions that are free of outliers.
what should you do when you find an outlier?
try to find an explanation for it
What is the four step process?
- STATE: What is the practical question, in the context of the real-world setting?
- PLAN: What specific statistical operations does this problem call for?
- SOLVE: Make the graphs and carry out the calculations needed for this problem.
- CONCLUDE: Give your practical conclusion in the setting of the real-world problem.