AP Statistics Exam Review Flashcards
What graphs are appropriate for quantitative data?
dotplot, histogram, stemplot, boxplot
What graphs are appropriate for categorical data?
bar graph, pie graph (not in AP curriculum), 2-way table (I know–it’s not a graph)
When creating a graph by hand, always include these:
1. labels for both axes.
2. numerical scales, with equal intervals labeled, on BOTH axes.
What is the difference between marginal and conditional distributions?
Marginal distributions are made from numbers in the MARGINS. Conditional distributions are from single rows or columns that are NOT in the margins.
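A minimal sketch of the difference, assuming a made-up two-way table (the class years, answers, and counts below are hypothetical, not from any real survey):

```python
# Hypothetical two-way table: class year (rows) by survey answer (columns).
table = {
    "freshman": {"yes": 30, "no": 20},
    "senior": {"yes": 10, "no": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of class year: row totals divided by the grand total.
marginal_class = {
    year: sum(row.values()) / grand_total for year, row in table.items()
}

# Conditional distribution of answer GIVEN senior: one row divided by its own total.
senior_total = sum(table["senior"].values())
conditional_given_senior = {
    answer: count / senior_total for answer, count in table["senior"].items()
}

print(marginal_class)
print(conditional_given_senior)
```

The marginal distribution uses the totals for the whole table, while the conditional distribution looks only inside the senior row.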
When describing (or comparing) distributions, ALWAYS address:
shape, center, spread and outliers (SOCS)
What can usually be determined from a boxplot?
range, IQR, quartiles, MEDIAN
What can NOT be determined from a boxplot?
shape (at least not a complete description) and sample size
When writing COMPARISON statements, always be sure to
use COMPARATIVE language (“larger than…”, “both have…”, “more skewed than…,” “neither shows…,” etc.)
Stem plots require this for full credit:
A key. (Example: 4|3 = 43).
Features of a histogram:
equal bar (bin) widths, x-axis is a continuous number line, different bin widths may show different features of a distribution, Xscl in TI Window will change bin widths
When to use Mean vs. Median?
Generally use means with non-skewed data. Use medians with skewed data or data with outliers.
When is the mean higher than the median?
Generally, this happens when the data is skewed right, or has high outliers.
How can you estimate mean and median from a distribution?
The mean is the “balance point” if the distribution were made out of a solid material. The median is the “equal areas” location in a distribution.
What is standard deviation?
It is the “typical” (or average) deviation from the mean in a dataset.
When should IQR be used as a measure of spread instead of standard deviation?
IQR should be used when the data is skewed or has outliers. Standard deviation should be used when the data is roughly symmetric with no outliers.
What is the rule for determining outliers?
An outlier is any value more than 1.5 IQRs below Q1 or more than 1.5 IQRs above Q3.
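The 1.5 × IQR rule can be sketched as below. The data are made up, and the quartile method is an assumption: Python's "inclusive" method may give slightly different quartiles than a textbook or a TI calculator.

```python
import statistics

def iqr_outliers(data):
    """Return values more than 1.5 IQRs below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

scores = [2, 4, 5, 6, 7, 8, 9, 30]  # hypothetical dataset
outliers = iqr_outliers(scores)
print(outliers)
```

Here Q1 = 4.75 and Q3 = 8.25, so the fences are -0.5 and 13.5, and only 30 falls outside them.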
What is the percentile of x?
The percentage of the data that is less than x in a distribution (“less than or equal” is also acceptable).
What is “frequency” vs. “relative frequency?”
Frequency is counts (whole numbers). Relative frequency is a proportion or percentage of the total.
What is a standardized score?
The number of standard deviations from the mean.
How do you calculate a z-score?
z = (x – mean) ÷ (standard deviation)
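A quick sketch of the formula, using hypothetical values (x = 40.3 with mean 34 and standard deviation 4.2):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z = z_score(40.3, 34, 4.2)  # hypothetical values; about 1.5 SDs above the mean
print(round(z, 2))
```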
What statistics/measurements change when you multiply a dataset by a constant?
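Measures of center, location, and spread (mean, median, quartiles, range, IQR, standard deviation) are all multiplied (or divided) by that constant, assuming a positive constant. Shape, z-scores, and correlation do not change.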
What statistics/measurements change by adding (or subtracting) a constant to all data?
Only measures of location change. Measures of spread are not affected.
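A small check of the two cards above, assuming a made-up dataset: multiplying by a constant rescales both center and spread, while adding a constant moves the center only.

```python
import statistics

def summary(values):
    """A few measures of center and spread for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "range": max(values) - min(values),
    }

data = [2, 4, 4, 6, 9]             # hypothetical dataset
times3 = [3 * x for x in data]     # multiply every value by 3
plus10 = [x + 10 for x in data]    # add 10 to every value

base = summary(data)
scaled = summary(times3)
shifted = summary(plus10)

print(base)
print(scaled)   # mean, median, SD, and range are all 3 times larger
print(shifted)  # mean and median go up by 10; SD and range are unchanged
```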
What does a density curve show?
Overall patterns of a distribution are depicted. Also, the area under the curve is 1 (100%), so percentiles can also be depicted.
N(34, 4.2) means
a normal distribution with mean 34 and standard deviation 4.2
If a dataset has a mean of 34 inches, what will be the units of the standard deviation?
inches
What is the difference between a normal density curve and normal data?
The normal density curve will be perfectly normal and symmetric. Data will NEVER be perfectly normal, only APPROXIMATELY normal.
The three famous area “rules of thumb” for a normal density curve:
68% of the area is within 1 SD of the mean, 95% of the area is within 2 SD’s of the mean, and 99.7% of the area is within 3 SD’s of the mean.
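The three rules of thumb can be checked against an exact normal model. This sketch uses Python's standard-library `NormalDist`; the helper name `area_within` is just for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The exact areas are close to, but not exactly, 68%, 95%, and 99.7%, which is why these are called rules of thumb.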
When reading a z-table, you find that a z-score of 0.62 has a table value of 0.7324. What does this mean?
73.24% of the area under a normal model lies below 0.62 standard deviations above the mean.
normalcdf vs invNorm on a TI calculator
normalcdf can find area/percents under a normal model. invNorm can find a z-score given the area to the left (in decimal form).
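For comparison, the same two lookups can be sketched in Python. The pairing with the TI functions is an analogy, not an exact equivalent of the calculator syntax.

```python
from statistics import NormalDist

model = NormalDist(mu=0, sigma=1)  # standard normal model

# Area to the left of z = 0.62 (what normalcdf or a z-table gives).
area_below = model.cdf(0.62)

# z-score with area 0.7324 to its left (what invNorm gives).
z_cutoff = model.inv_cdf(0.7324)

print(round(area_below, 4))
print(round(z_cutoff, 2))
```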
When calculating answers using a Normal model, be sure to communicate:
The type of model you are using (Normal), the parameters (mean, SD), direction of shading, sufficient calculations, the answer in context.
What are the general names of the variables on a scatterplot?
x-axis: explanatory variable
y-axis: response variable
What should always be included in a description of a scatterplot?
direction, strength, form, outliers, CONTEXT
When should correlation be used and what does it measure?
Correlation should ONLY be used on linear data. It is a measurement of strength and direction of the association between two quantitative variables.
When stating the LSRL, include these:
- “predicted” (put a hat over the y-variable)
- correct slope and y-intercept values
- context
A LSRL was computed for value ($) and miles driven of a certain make of car. Interpret a slope of -0.134
For every additional 100 miles driven, the value of this car is estimated to decrease by $13.40, according to the LSRL.
What’s the difference between interpolation and extrapolation?
Interpolation is making a prediction within the range of the data. Extrapolation is making a prediction outside the range of the data.
What is a residual?
observed value – predicted value
Points above the LSRL have positive residuals; points below have negative residuals.
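The slope and residual cards above can be worked through with numbers. The slope -0.134 comes from the card; the intercept ($25,000) and the observed car are hypothetical values chosen for illustration.

```python
def predicted_value(miles, intercept=25000.0, slope=-0.134):
    """Predicted value ($) from a hypothetical LSRL: value-hat = intercept + slope * miles."""
    return intercept + slope * miles

# Slope interpretation: predicted change in value for 100 additional miles.
change_per_100_miles = predicted_value(100) - predicted_value(0)

# Residual = observed value - predicted value.
observed = 19000.0                   # hypothetical car with 50,000 miles
predicted = predicted_value(50000)
residual = observed - predicted

print(round(change_per_100_miles, 2))
print(round(predicted, 2), round(residual, 2))
```

The residual is positive, so this car sits above the LSRL: it is worth more than the model predicts.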
What is the best way to justify that a linear model is APPROPRIATE for a scatterplot?
Look at the residual plot. If there is random scatter (no overall curve pattern), then a linear model is appropriate.
How can you tell if a linear association is STRONG or not?
1. Look at how close the dots are to the LSRL. The closer they are, the stronger the linear relationship. 2. Look at r (correlation). Generally, an r between 0.8 and 1 (or between -1 and -0.8) is strong.
How can you tell how well a LSRL model FITS the data?
Look at s (standard deviation of the residuals) and r-squared. Lower s’s and higher r^2’s would generally mean a better fit.
What is r-squared?
The percent of variation in the response variable that is accounted for by the LSRL on the explanatory variable.
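A sketch of where s and r-squared come from, assuming a small made-up dataset. The LSRL is fitted by hand here so every step is visible.

```python
import math
import statistics

x = [1, 2, 3, 4, 5]                 # hypothetical explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical response values

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed - predicted, one per data point.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

sse = sum(e ** 2 for e in residuals)        # variation left over after the LSRL
sst = sum((yi - mean_y) ** 2 for yi in y)   # total variation in the response

r_squared = 1 - sse / sst                   # share of variation accounted for
s = math.sqrt(sse / (len(x) - 2))           # standard deviation of the residuals

print(round(slope, 2), round(intercept, 2))
print(round(r_squared, 4), round(s, 4))
```

A small s together with an r-squared near 1 suggests the LSRL predictions are close to the observed values.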
Outliers in the x-direction in a scatterplot:
typically influence the LSRL and the correlation.
A strong association in a scatterplot does not automatically imply
a cause-and-effect relationship.
How are the terms population, parameter, sample and statistic related?
The population is the entire group of interest. A parameter is a measurement from a population. A sample is a subset of a population; a measure from a sample is a statistic.
What is bias?
SYSTEMATIC error in a sample (typically an overestimate or an underestimate)
People who choose to be in a sample:
voluntary response sample
Generally the best way to get a representative sample:
random sample
What makes a simple random sample (SRS) of size n unique?
It guarantees that every SUBSET of size n from the population has an equal chance of being selected.
If we want to pick three different students from a group of 10 students, and we use a calculator’s random integer function, what instructions are necessary?
- Number students from 1-10.
- Use randInt(1, 10) on a calculator to generate random integers.
- IGNORING REPEATED NUMBERS, keep generating until three different students are selected.
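The steps above can be sketched in Python, with `random.randint` standing in for the calculator's random integer function (the function name `pick_students` is hypothetical):

```python
import random

def pick_students(n_students=10, n_pick=3):
    """Pick n_pick different student numbers from 1..n_students, skipping repeats."""
    chosen = []
    while len(chosen) < n_pick:
        number = random.randint(1, n_students)  # random integer from 1 to n_students
        if number not in chosen:                # ignore repeated numbers
            chosen.append(number)
    return chosen

picked = pick_students()
print(picked)
```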
What is stratified sampling?
Putting subjects into homogeneous groups (strata) and then selecting an SRS from each group (for example, an SRS of 20 from each).
What is cluster sampling?
Putting subjects into heterogeneous groups (clusters) and then randomly selecting several of these groups; everyone in the chosen groups is in your sample.
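A minimal sketch contrasting the two methods. The groups below (class years as strata, homerooms as clusters) and all the sizes are hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum):
    """Take an SRS of n_per_stratum individuals from EVERY stratum."""
    return {
        name: random.sample(members, n_per_stratum)
        for name, members in strata.items()
    }

def cluster_sample(clusters, n_clusters):
    """Randomly pick whole clusters and keep EVERYONE in the chosen ones."""
    chosen = random.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical strata: students grouped by class year.
strata = {
    "freshmen": list(range(1, 101)),
    "seniors": list(range(101, 181)),
}

# Hypothetical clusters: 8 homerooms of 25 students each.
clusters = {
    f"homeroom_{i}": [f"student_{i}_{j}" for j in range(25)]
    for i in range(1, 9)
}

stratified = stratified_sample(strata, 20)
clustered = cluster_sample(clusters, 2)
print({name: len(group) for name, group in stratified.items()})
print({name: len(group) for name, group in clustered.items()})
```

Stratified sampling takes some individuals from every group; cluster sampling takes every individual from some groups.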
What does a random sample help guarantee?
- We can generalize findings to the population.
- We avoid bias (systematic error in the results)
- We can invoke probability laws and draw conclusions.
What is undercoverage?
When some groups in the population are left out of the sampling process or have less chance of being chosen than they should.
What is nonresponse?
When an individual chosen for the sample cannot be reached or declines to respond.
Observational study vs. experiment
In an experiment, treatments are imposed on subjects and measurements are taken. In an observational study, subjects are only observed and measured.
What is confounding?
This occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other.
Factors, levels and treatments in an experiment about 3 dosages of fertilizer on tomato plants and two dosages of water.
Two factors: fertilizer and water
Three levels of fertilizer and two levels of water
Six treatments in all (all six combinations of fertilizer and water).
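The count of treatments can be sketched by listing every combination of levels. The level names ("low", "medium", "high") are hypothetical placeholders for the actual dosages.

```python
from itertools import product

fertilizer_levels = ["low", "medium", "high"]  # 3 levels (names are hypothetical)
water_levels = ["low", "high"]                 # 2 levels (names are hypothetical)

# Each treatment is one combination of a fertilizer level and a water level.
treatments = list(product(fertilizer_levels, water_levels))

for fertilizer, water in treatments:
    print(f"fertilizer={fertilizer}, water={water}")
print(len(treatments))
```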
Sixteen tanks, each with 20 fish, are set up for a fish food experiment. Four types of fish food are randomly given to 4 tanks each. What are the experimental units?
The 16 tanks. The treatments were assigned to the tanks, not to the individual fish.
Describe how you would randomly assign 300 cats to three treatment groups.
One of many correct ways: assign each cat a number 1-300. Put slips of paper with #s 1-300 in a large box and shake well. The 1st 100 #’s picked will be given treatment 1, the 2nd 100 #s will be treatment 2, and the rest will be treatment 3.
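The slips-in-a-box method above can be sketched in Python, with a random shuffle playing the role of shaking the box:

```python
import random

cats = list(range(1, 301))   # number the cats 1-300
random.shuffle(cats)         # "shake the box": put the numbers in random order

groups = {
    "treatment 1": cats[:100],     # first 100 numbers drawn
    "treatment 2": cats[100:200],  # next 100 numbers drawn
    "treatment 3": cats[200:],     # the remaining 100
}

for name, members in groups.items():
    print(name, len(members))
```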
What are the principles of good experimental design?
Random assignment, control, replication, comparison (and sometimes blocking)