Chapters 1-3 Flashcards
How to calculate the mean?
How to calculate the median?
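A minimal sketch of both calculations, using Python's standard `statistics` module. The data values are made up for illustration and do not come from the course material.

```python
from statistics import mean, median

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

# Mean: add up the observations and divide by how many there are.
data_mean = sum(data) / len(data)

# Median: the middle value of the sorted data (or the average of the
# two middle values when the number of observations is even).
data_median = median(data)

print("mean:", data_mean, "(library check:", mean(data), ")")
print("median:", data_median)
```

In Excel the equivalent functions are =AVERAGE(A1:A7) and =MEDIAN(A1:A7), assuming the data sit in cells A1 through A7.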
How to calculate quartiles and a 5-number summary?
Excel doesn’t have a single command to calculate a five-number summary, but you can build one with basic functions that calculate the minimum, Q1, the median, Q3, and the maximum.
Step 1: Type your data into a single column in Excel. For this example question, type the values into cells A1 through A12.
Step 2: Click an empty cell, then type “=MAX(A1:A12)” (without the quotation marks). A1 is the first cell your data is in, A12 is the last. The two cell references are separated by a colon. Press “Enter” to find the max: 33.
Step 3: Repeat Step 2 for the minimum: “=MIN(A1:A12)”.
Step 4: Repeat Step 2 for the median: “=MEDIAN(A1:A12)”.
Step 5: Repeat Step 2 for the first quartile: “=QUARTILE(A1:A12,1)”. The “1” after A1:A12 lets Excel know you want the first quartile.
Step 6: Repeat Step 2 for the third quartile: “=QUARTILE(A1:A12,3)”. The “3” after A1:A12 lets Excel know you want the third quartile.
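The same five steps can be sketched outside Excel. The example below uses Python's `statistics` module on twelve made-up values (they are not the data from the example question). It assumes that `method="inclusive"` interpolates quartiles the same way Excel's QUARTILE does. The textbook hand method (median of each half) can give slightly different quartiles.

```python
from statistics import median, quantiles

# Hypothetical data: 12 values standing in for cells A1:A12.
data = [12, 15, 17, 18, 20, 21, 23, 25, 27, 29, 31, 33]

# n=4 splits the data into quarters and returns the three cut points.
# method="inclusive" is assumed here to match Excel's QUARTILE.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number = (min(data), q1, median(data), q3, max(data))
print(five_number)
```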
How to calculate standard deviation?
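The standard deviation measures spread by looking at how far the observations are from their mean. The variance s² is the average of the squared deviations of the observations from their mean, and the standard deviation s is the square root of the variance.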
Notice that the “average” in the variance divides the sum by 1 less than the number of observations, that is, n − 1 rather than n. The reason is that the deviations xᵢ − x̄ always sum to exactly 0, so that knowing n − 1 of them determines the last one. Only n − 1 of the squared deviations can vary freely, and we average by dividing the total by n − 1. The number n − 1 is called the “degrees of freedom” of the variance or standard deviation. Many calculators offer a choice between dividing by n and dividing by n − 1, so be sure to use n − 1.
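As a rough illustration of the n − 1 calculation, the sketch below works through the steps by hand on made-up data and compares the result with the library functions `stdev` (divides by n − 1) and `pstdev` (divides by n).

```python
from math import sqrt
from statistics import pstdev, stdev

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]
n = len(data)
x_bar = sum(data) / n

# Deviations from the mean; these always sum to 0.
deviations = [x - x_bar for x in data]

# Variance: sum of squared deviations divided by n - 1 (degrees of freedom).
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Standard deviation: square root of the variance.
s = sqrt(variance)

print("sum of deviations:", sum(deviations))
print("variance:", variance, "s:", s)
print("stdev (n - 1):", stdev(data), "pstdev (n):", pstdev(data))
```

In Excel, =STDEV.S divides by n − 1 and =STDEV.P divides by n.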
Chapter 1.2 Summary
A numerical summary of a distribution should report its center and its spread or variability.
- The mean x̄ and the median M describe the center of a distribution in different ways. The mean is the arithmetic average of the observations, and the median is the midpoint of the values.
- When you use the median to indicate the center of the distribution, describe its spread by giving the quartiles. The first quartile Q1 has one-fourth of the observations below it, and the third quartile Q3 has three-fourths of the observations below it.
- The five-number summary consisting of the median, the quartiles, and the high and low extremes provides a quick overall description of a distribution. The median describes the center, and the quartiles and extremes show the spread.
- Boxplots based on the five-number summary are useful for comparing several distributions. The box spans the quartiles and shows the spread of the central half of the distribution. The median is marked within the box. Lines extend from the box to the extremes and show the full spread of the data.
- The variance s² and especially its square root, the standard deviation s, are common measures of spread about the mean as center. The standard deviation s is zero when there is no spread and gets larger as the spread increases.
- A resistant measure of any aspect of a distribution is relatively unaffected by changes in the numerical value of a small proportion of the total number of observations, no matter how large these changes are. The median and quartiles are resistant, but the mean and the standard deviation are not.
- The mean and standard deviation are good descriptions for symmetric distributions without outliers. They are most useful for the Normal distributions, introduced in the next section. The five-number summary is a better exploratory summary for skewed distributions.
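To see what “resistant” means in practice, here is a small sketch on made-up data. It replaces the largest value with an extreme outlier and compares the summaries before and after.

```python
from statistics import mean, median, stdev

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

# Same data, but the largest value is replaced by an extreme outlier.
with_outlier = data[:-1] + [110]

for label, values in (("original", data), ("with outlier", with_outlier)):
    print(label,
          "| median:", median(values),
          "| mean:", round(mean(values), 2),
          "| s:", round(stdev(values), 2))
```

The median stays at 5 in both cases, while the mean and the standard deviation are both pulled up sharply by the single outlier.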
What are the steps to explore data on a single quantitative variable?
- Always plot your data: make a graph, usually a histogram or a stemplot.
- Look for the overall pattern (shape, center, spread) and for striking deviations such as outliers.
- Calculate a numerical summary to briefly describe center and spread.
- Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve.
What are the characteristics of density curves?
A density curve is a curve that
- is always on or above the horizontal axis and
- has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the proportion of all observations that fall in that range.
The median and mean of a density curve
Our measures of center and spread apply to density curves as well as to actual sets of observations. The median and quartiles are easy. Areas under a density curve represent proportions of the total number of observations. The median is the point with half the observations on either side. So the median of a density curve is the equal-areas point, the point with half the area under the curve to its left and the remaining half of the area to its right. The quartiles divide the area under the curve into quarters. One-fourth of the area under the curve is to the left of the first quartile, and three-fourths of the area is to the left of the third quartile. You can roughly locate the median and quartiles of any density curve by eye by dividing the area under the curve into four equal parts.
The mean of a set of observations is their arithmetic average. If we think of the observations as weights strung out along a thin rod, the mean is the point at which the rod would balance. This fact is also true of density curves. The mean is the point at which the curve would balance if made of solid material.
The median and mean are the same for a symmetric density curve. They both lie at the center of the curve. The mean of a skewed curve is pulled away from the median in the direction of the long tail.
What is the empirical rule of normal distributions?
The 68–95–99.7 Rule
In the Normal distribution with mean μ and standard deviation σ:
- Approximately 68% of the observations fall within σ of the mean μ.
- Approximately 95% of the observations fall within 2σ of μ.
- Approximately 99.7% of the observations fall within 3σ of μ.
We abbreviate the Normal distribution with mean μ and standard deviation σ as N(μ, σ). For example, the distribution of weights in the previous example is N(9.12, 0.15).
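The rule can be checked numerically. The sketch below uses `NormalDist` from Python's `statistics` module with the N(9.12, 0.15) distribution mentioned above. The helper name `within` is made up for this example.

```python
from statistics import NormalDist

# The N(9.12, 0.15) distribution from the text.
dist = NormalDist(mu=9.12, sigma=0.15)

def within(k):
    """Proportion of observations within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(f"within {k} sd: {within(k):.4f}")
```

The printed proportions come out close to 0.68, 0.95, and 0.997, and they would be the same for any other choice of μ and σ.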
What is a z-score and how do you calculate it?
Z-scores are standardized scores that tell you how many standard deviations above or below the mean a single data point or observation is.
Every data point has its own z-score. For Normally distributed data, z-scores usually fall between −3 and 3, that is, anywhere from 3 standard deviations below the mean to 3 standard deviations above the mean.
To calculate a z-score, subtract the population mean from the data point, and divide by the population standard deviation.
z = (x − μ) / σ = [the distance the data point is above or below the mean] / [the population standard deviation]
This tells us how many standard deviations away it is.
This standardized score allows us to compare scores from different data sets, such as MCAT scores versus GMAT scores. Which test did you do better on? Because the tests use different scales, you need to know the average score and the standard deviation for each test to tell how you did.
No matter where on the axis a data set lies (e.g. whether it has a mean of −5 and std dev. of 4 or a mean of 7 and std dev. of 6), if the data are normally distributed, we can standardize them into what’s called the “Standard Normal Distribution,” which is a normal distribution with mean = 0 and std dev. = 1. We do this by applying the same z-score formula to each data point.
IN EXCEL:
=STANDARDIZE(x, mean, standard_dev)
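A minimal sketch of the same calculation in Python. The function name `z_score` and all the test scores, means, and standard deviations are hypothetical and are not real MCAT or GMAT figures.

```python
def z_score(x, mean, std_dev):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Hypothetical scores on two tests with different scales.
test_a = z_score(x=520, mean=500, std_dev=10)
test_b = z_score(x=650, mean=550, std_dev=100)

print("test A z-score:", test_a)
print("test B z-score:", test_b)
```

Here the raw score on test B is higher, but the z-score on test A is larger, so relative to other test takers the test A result is the stronger one.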
How to calculate the cumulative proportion under a normal curve? (The cumulative proportion is the proportion of observations in a distribution that lie at or below a given value. When the distribution is given by a density curve, it is the area under the curve to the left of that value.)
Calculate the z-score. Then go to a z-score table and look up the z-score. The table returns a decimal that is the proportion of the data to the left of that z-score on a normal curve.
If you want to know the proportion of observations to the right of a given point, subtract the table value from 1.
IN EXCEL: =NORMDIST(x,mean,std dev, TRUE)
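The table lookup can also be sketched in Python with `NormalDist.cdf`, which plays the role of the z-score table. The example reuses the N(9.12, 0.15) distribution from earlier. The cutoff value 9.27 is made up for illustration.

```python
from statistics import NormalDist

weights = NormalDist(mu=9.12, sigma=0.15)
x = 9.27  # hypothetical cutoff, one standard deviation above the mean

# Cumulative proportion: area under the curve to the left of x.
below = weights.cdf(x)

# Proportion to the right: subtract the left-hand area from 1.
above = 1 - below

# Same answer by standardizing first, then using the standard Normal curve.
z = (x - weights.mean) / weights.stdev
below_via_z = NormalDist().cdf(z)

print(f"below: {below:.4f}  above: {above:.4f}  via z: {below_via_z:.4f}")
```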
What’s the formula for un-standardizing a score?
x = μ + zσ
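IN EXCEL: =NORM.INV(probability, mean, standard_dev) returns the value x that has the given cumulative proportion below it.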
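A short sketch of going backward from a proportion to a value, again with `NormalDist`. The 90% target and the N(9.12, 0.15) distribution are chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma = 9.12, 0.15
target = 0.90  # hypothetical: find the value with 90% of observations below it

# Step 1: find the z-score with that cumulative proportion (reverse table lookup).
z = NormalDist().inv_cdf(target)

# Step 2: un-standardize with x = mu + z * sigma.
x = mu + z * sigma

# Same answer in one step, similar in spirit to Excel's NORM.INV.
x_direct = NormalDist(mu, sigma).inv_cdf(target)

print(f"z = {z:.4f}, x = {x:.4f}, direct = {x_direct:.4f}")
```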
Chapter 1.3 summary
- We can sometimes describe the overall pattern of a distribution by a density curve. A density curve has total area 1 underneath it. An area under a density curve gives the proportion of observations that fall in a range of values.
- A density curve is an idealized description of the overall pattern of a distribution that smooths out the irregularities in the actual data. We write the mean of a density curve as μ and the standard deviation of a density curve as σ to distinguish them from the mean x̄ and standard deviation s of the actual data.
- The mean, the median, and the quartiles of a density curve can be located by eye. The mean μ is the balance point of the curve. The median divides the area under the curve in half. The quartiles and the median divide the area under the curve into quarters. The standard deviation σ cannot be located by eye on most density curves.
- The mean and median are equal for symmetric density curves. The mean of a skewed curve is located farther toward the long tail than is the median.
- The Normal distributions are described by a special family of bell-shaped, symmetric density curves, called Normal curves. The mean μ and standard deviation σ completely specify a Normal distribution N(μ, σ). The mean is the center of the curve, and σ is the distance from μ to the change-of-curvature points on either side.
- To standardize any observation x, subtract the mean of the distribution and then divide by the standard deviation. The resulting z-score z = (x − μ) / σ says how many standard deviations x lies from the distribution mean.
- All Normal distributions are the same when measurements are transformed to the standardized scale. In particular, all Normal distributions satisfy the 68–95–99.7 rule, which describes what percent of observations lie within one, two, and three standard deviations of the mean.
- If x has the N (μ, σ) distribution, then the standardized variable z = (x −μ)/σ has the standard Normal distribution N(0, 1) with mean 0 and standard deviation 1. Table A gives the proportions of standard Normal observations that are less than z for many values of z. By standardizing, we can use Table A for any Normal distribution.
- The adequacy of a Normal model for describing a distribution of data is best assessed by a Normal quantile plot, which is available in most statistical software packages. A pattern on such a plot that deviates substantially from a straight line indicates that the data are not Normal.
How to tell whether a set of data is normally distributed?
In addition to plotting a histogram and looking for a normal curve pattern, make a Normal quantile plot and look for a straight line:
- Order the data from smallest to largest, and calculate each data point’s percentile within the data set (you can do this in Excel with the PERCENTRANK function).
- Go to the z-score table and look up each data point’s percentile, and write down the z-score for that percentile. These are the Normal Scores.
- Create a scatterplot of the data points (or their z-scores) against the Normal Scores for their percentiles, and check whether the points form a straight line.
Any Normal distribution produces a straight line on the plot because standardizing turns any Normal distribution into a standard Normal distribution.
If the points on a Normal quantile plot lie close to a straight line, the plot indicates that the data are Normal. Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
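The steps above can be sketched as follows. The data are made up, and the percentile formula (i − 0.5)/n is one common convention assumed here. Software packages may use slightly different formulas. The sketch only prints the pairs that would be plotted.

```python
from statistics import NormalDist

# Hypothetical data, invented for illustration only.
data = [4.1, 5.6, 4.9, 5.2, 6.3, 4.6, 5.0, 5.4]

# Step 1: order the data and assign each point a percentile.
ordered = sorted(data)
n = len(ordered)
percentiles = [(i - 0.5) / n for i in range(1, n + 1)]  # assumed convention

# Step 2: convert each percentile to its Normal score (standard Normal z).
std_normal = NormalDist()
normal_scores = [std_normal.inv_cdf(p) for p in percentiles]

# Step 3: these (Normal score, data value) pairs are the points to plot.
for score, value in zip(normal_scores, ordered):
    print(f"{score:7.3f}  {value}")
```

If the printed pairs fall close to a straight line when plotted, the data are approximately Normal.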
What are Normal Scores?
Percentiles of the standard Normal distribution N(0, 1).
For example, z = −1.645 is the 5% point of the standard Normal distribution, and z = −1.282 is the 10% point.
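These two values can be reproduced with `NormalDist().inv_cdf`, which returns the z-score that has a given proportion of the standard Normal distribution below it. This is a quick check, not part of the original notes.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Normal: mean 0, standard deviation 1

five_percent_point = std_normal.inv_cdf(0.05)
ten_percent_point = std_normal.inv_cdf(0.10)

print(f"5% point:  {five_percent_point:.3f}")
print(f"10% point: {ten_percent_point:.3f}")
```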
Chapter 1 Summary
A. Data
- Identify the cases and variables in a set of data.
- Identify each variable as categorical or quantitative. Identify the units in which each quantitative variable is measured.
B. Displaying Distributions
- Make a bar graph, pie chart, and/or Pareto chart of the distribution of a categorical variable. Interpret bar graphs, pie charts, and Pareto charts.
- Make a histogram of the distribution of a quantitative variable.
- Make a stemplot of the distribution of a small set of observations. Round leaves or split stems as needed to make an effective stemplot.
C. Inspecting Distributions (Quantitative Variable)
- Look for the overall pattern and for major deviations from the pattern.
- Assess from a histogram or stemplot whether the shape of a distribution is roughly symmetric, distinctly skewed, or neither. Assess whether the distribution has one or more major peaks.
- Describe the overall pattern by giving numerical measures of center and spread in addition to a verbal description of shape.
- Decide which measures of center and spread are more appropriate: the mean and standard deviation (especially for symmetric distributions) or the five-number summary (especially for skewed distributions).
- Recognize outliers.
D. Time Plots
- Make a time plot of data, with the time of each observation on the horizontal axis and the value of the observed variable on the vertical axis.
- Recognize patterns in a time plot.
E. Measuring Center
- Find the mean x̄ of a set of observations.
- Find the median M of a set of observations.
- Understand that the median is more resistant (less affected by extreme observations) than the mean. Recognize that skewness in a distribution moves the mean away from the median toward the long tail.
F. Measuring Spread
- Find the quartiles Q1 and Q3 for a set of observations.
- Give the five-number summary and draw a boxplot; assess center, spread, symmetry, and skewness from a boxplot.
- Using a calculator or software, find the standard deviation s for a set of observations.
- Know the basic properties of s: s ≥ 0 always; s = 0 only when all observations are identical and increases as the spread increases; s has the same units as the original measurements; s is pulled strongly up by outliers or skewness.
G. Density Curves
- Know that areas under a density curve represent proportions of all observations and that the total area under a density curve is 1.
- Approximately locate the median (equal-areas point) and the mean (balance point) on a density curve.
- Know that the mean and median both lie at the center of a symmetric density curve and that the mean moves farther toward the long tail of a skewed curve.
H. Normal Distributions
- Recognize the shape of Normal curves and be able to estimate by eye both the mean and the standard deviation from such a curve.
- Use the 68–95–99.7 rule and symmetry to state what percent of the observations from a Normal distribution fall between two points when the points lie one, two, or three standard deviations on either side of the mean.
- Find the standardized value (z-score) of an observation. Interpret z-scores and understand that any Normal distribution becomes standard Normal N(0, 1) when standardized.
- Given that a variable has the Normal distribution with a stated mean μ and standard deviation σ, calculate the proportion of values above a stated number, below a stated number, or between two stated numbers.
- Given that a variable has the Normal distribution with a stated mean μ and standard deviation σ, calculate the point having a stated proportion of all values above it. Also calculate the point having a stated proportion of all values below it.
- Assess the Normality of a set of data by inspecting a Normal quantile plot.