AP exam flashcards
interpret standard deviations
- standard deviation measures how much the values typically vary from the mean
“The heights of students typically varied by about 3.2 inches from the mean height of 64 inches”
scope of inference cause and effect
cause and effect conclusions can only be drawn if subjects were randomly assigned to treatments and we find a statistically significant difference
a difference is statistically significant if it is larger than what would be expected to happen by chance alone
generalizing to a larger population
we can generalize the results of a study to a larger population if the individuals were randomly selected from that population.
however, sampling variability affects estimates: different random samples of the same size from the same population will produce different estimates
replication and control
2 of the 4 principles of a good experiment
replication - giving each treatment to enough subjects or units so that any difference in the effect of treatments can be distinguished from chance differences
control - keeping other variables the same for all groups, especially variables that are likely to cause confounding (control helps reduce variability in the response variable)
experimental units, factors and levels, treatments
experimental units - the objects to which treatments are randomly assigned. when the units are people, they are often called “subjects”
factor - an explanatory variable that is manipulated and may cause a change in the response variable
level - different values of a factor
all combinations of levels are treatments
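A minimal Python sketch of how treatments arise as combinations of factor levels. The two factors and their levels are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical factors and their levels (not from the cards).
factors = {
    "fertilizer": ["none", "low", "high"],
    "watering": ["daily", "weekly"],
}

# Every combination of one level per factor is a treatment.
treatments = list(product(*factors.values()))

for treatment in treatments:
    print(dict(zip(factors, treatment)))
```

With 3 levels of one factor and 2 levels of the other, there are 3 x 2 = 6 treatments.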
control groups and blinding
the other 2 principles of a good experiment
control group - provides a baseline for comparing the effects of other treatments. a control group is often given an inactive treatment (placebo), an active treatment, or no treatment
blind - when the subjects don’t know which treatment they are receiving, or when the people recording or measuring the response variable don’t know which treatment each subject received. when neither group knows, the experiment is “double-blind”
blocking and matched pairs design
blocking - before random assignment, divide the experimental units into groups (blocks) that are expected to respond similarly. then randomly assign treatments within each block.
a matched pairs design uses blocks of size 2 or gives both treatments to each subject in random order
random assignment and completely randomized designs
random assignment - using a chance process to assign treatments, which creates groups of experimental units that are roughly equivalent at the beginning of the experiment
if treatments are assigned to experimental units completely at random (no blocking), the result is a completely randomized design
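One way to carry out a completely randomized design is sketched below in Python. The function name, the subject labels, and the two treatment names are all hypothetical; shuffling and dealing units out in turn is just one reasonable chance process.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Shuffle all units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experiment: 20 subjects, 2 treatments.
subjects = [f"subject_{i}" for i in range(1, 21)]
groups = completely_randomized(subjects, ["treatment", "control"], seed=1)
print({name: len(members) for name, members in groups.items()})
```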
simple random sample
an SRS of size n is chosen so that every group of n individuals in the population has an equal chance to be selected as the sample
bias
a statistical study shows bias if it is very likely to underestimate or overestimate the value you want to know
sources of bias - convenience sampling, voluntary response sampling, undercoverage, nonresponse, and response bias
using a random digit table to select a sample
label all members of the population with the same number of digits
pick a line of the table and read the digits from left to right in groups of that length, skipping any repeated labels and any labels not assigned to a member of the population
select the individuals whose labels you find
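The table-reading steps can be sketched in Python as follows. The row of digits is made up for illustration (it is not from a real table), and the function name and the assumption that labels run from 01 to the population size are hypothetical.

```python
def select_from_digits(digits, population_size, n):
    """Read fixed-width labels left to right, skipping repeats and unused labels."""
    width = len(str(population_size))   # every label has this many digits
    chosen = []
    for i in range(0, len(digits) - width + 1, width):
        label = int(digits[i:i + width])
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == n:
            break
    return chosen

# Made-up line of random digits; population labeled 01 to 30.
row = "1922391950340575628713"
sample = select_from_digits(row, population_size=30, n=4)
print(sample)  # [19, 22, 5, 13]
```

Here 39, 50, 34, 75, 62, and 87 are skipped because no individual has those labels, and the second 19 is skipped because it is a repeat.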
choosing a model
choose the model whose residual plot has the most random scatter
if more than one model has a randomly scattered residual plot, choose the model with the largest coefficient of determination, r^2
population, census, sample
the population in a statistical study is the entire group of individuals we want information about
a census collects data from every individual in the population
a sample is a subset of individuals from the population from which we collect data
experimental vs observational study
experimental study - researchers impose treatment(s) on the experimental units. well-designed experiments allow cause-and-effect conclusions to be made
observational study - researchers observe individuals and measure variables without imposing treatments, so the results cannot establish cause and effect
what is a chi square distribution
a chi square distribution is defined by a density curve that takes only nonnegative values and is skewed to the right
as df increases the chi square distributions become more variable, less skewed and centered at a larger value (mean = df)
the chi square test statistic measures how different the observed counts are from the expected counts
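A short Python sketch of the chi-square test statistic, the sum of (observed - expected)^2 / expected over all categories. The counts are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for 4 categories, 100 observations in total.
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 2))  # 12.32
```

The larger the statistic, the further the observed counts are from the expected counts.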
inference for regression
Linear - the association between the variables is linear
Independent - observations are independent; check the 10% condition if sampling without replacement
Normal - responses vary normally around the regression line for all x-values (or n > 30)
Equal SD - the standard deviation of the responses around the regression line is the same for all x-values
Random - data from a random sample or randomized experiment
outlier rule
outliers > Q3 + 1.5(IQR)
outliers < Q1 - 1.5(IQR)
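A Python sketch of the 1.5(IQR) rule. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd); other software may compute quartiles slightly differently. The data set is hypothetical.

```python
import statistics

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[half + 1:] if len(xs) % 2 else xs[half:]
    return statistics.median(lower), statistics.median(upper)

def find_outliers(data):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

scores = [2, 5, 6, 7, 8, 9, 10, 11, 30]  # hypothetical data
print(quartiles(scores))      # (5.5, 10.5)
print(find_outliers(scores))  # [30]
```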
what is a resistant measure
a resistant measure is not strongly affected by outliers
resistant measures: median, IQR, Q1, Q3
non-resistant measures: mean, SD, range, correlation, equation of the LSRL
Interpret a Z-score
“Jessica’s test score was 2.3 standard deviations below the mean”
z = -2.3
z - score formula
z = (value - mean) / standard deviation
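A quick Python check of the z-score formula. The score, mean, and standard deviation are hypothetical values picked so that z comes out to -2.3.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical test: mean 77, SD 10, and a score of 54.
z = z_score(value=54, mean=77, sd=10)
print(z)  # -2.3
```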
interpret standard deviation of residuals s
s measures the size of the typical residual
“The cost of a car typically varies by about $2375 from the price predicted by the LSRL with x = years”
residual formula
actual - predicted
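A small Python sketch of a residual calculation. It borrows the slope (-1285) and y intercept (23,450) quoted in the car example elsewhere in these cards; the car's age and actual cost are hypothetical.

```python
def predicted_cost(years):
    """LSRL from the car example: y intercept 23450, slope -1285 per year."""
    return 23450 - 1285 * years

def residual(actual, predicted):
    """residual = actual - predicted"""
    return actual - predicted

age = 4              # hypothetical age in years
actual_cost = 19810  # hypothetical actual cost in dollars

res = residual(actual_cost, predicted_cost(age))
print(predicted_cost(age))  # 18310
print(res)                  # 1500
```

A positive residual means the actual value is above the value predicted by the line.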
interpreting a residual plot
- if there is no leftover curvature the model used to make the plot is appropriate
- if there is leftover curvature the model used to make the plot is not appropriate
making predictions/extrapolation
extrapolation is the use of an LSRL to make predictions outside the interval of x-values used to obtain the line. the further we extrapolate, the less reliable the predictions
interpret slope and y intercept
slope - “The predicted cost of a car decreases by about $1285 for each additional year”
slope - the predicted change in y when x increases by one unit
y intercept - “The predicted cost of a car is about $23,450 when it is 0 years old (x = 0)”
y intercept - the predicted value of y when x is 0
interpret a residual
“The car cost $1500 more than the price predicted by the LSRL with x = years”
working with a power model
interpret coefficient of determination (r^2)
r^2 measures the percent of the variability in y that is accounted for by the LSRL of y on x
“48% of variability in the cost of a car is accounted for by the LSRL with x = years”
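A Python sketch of where r^2 comes from, assuming the usual computation r^2 = 1 - SSE/SST, where SSE is the sum of squared residuals from the LSRL and SST is the sum of squared deviations of y from its mean. The data and function names are hypothetical.

```python
def lsrl(xs, ys):
    """Least-squares slope and y intercept for y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def r_squared(xs, ys):
    """Fraction of the variability in y accounted for by the LSRL."""
    slope, intercept = lsrl(xs, ys)
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
slope, intercept = lsrl(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
print(round(r_squared(xs, ys), 2))           # 0.6
```

For this made-up data, about 60% of the variability in y is accounted for by the LSRL.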
cluster sampling
split the population into groups called clusters (often based on location), randomly select some clusters, and include every member of the selected clusters in the sample
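A minimal Python sketch of cluster sampling. The clusters (homerooms) and their members are hypothetical, as is the function name.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical population: 4 homerooms with 3 students each.
clusters = {
    "room_A": ["A1", "A2", "A3"],
    "room_B": ["B1", "B2", "B3"],
    "room_C": ["C1", "C2", "C3"],
    "room_D": ["D1", "D2", "D3"],
}
sample = cluster_sample(clusters, n_clusters=2, seed=7)
print(sample)
```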
confounding
two variables are associated in such a way that their effects on the response variable cannot be distinguished
systematic random sampling
select a sample from an ordered arrangement of the population by randomly selecting one of the first k individuals and then choosing every kth individual thereafter
k =
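A Python sketch of systematic random sampling. Here k is simply passed in as a parameter; the population of 100 numbered individuals and the choice k = 10 are hypothetical.

```python
import random

def systematic_sample(ordered_population, k, seed=None):
    """Random start among the first k individuals, then every kth one after."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0 to k-1, i.e. one of the first k
    return ordered_population[start::k]

population = list(range(1, 101))  # hypothetical ordered list of 100 individuals
sample = systematic_sample(population, k=10, seed=3)
print(sample)
```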
stratified random sampling
split the population into homogeneous (similar) groups called strata, based on anticipated response. select an SRS from each stratum and combine the SRSs to form the overall sample
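A Python sketch of stratified random sampling. The strata (grade levels), their members, and the choice of the same sample size in every stratum are assumptions made for illustration.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take an SRS from each stratum and combine them into one sample."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

# Hypothetical strata: students grouped by grade level.
strata = {
    "grade_9": [f"9-{i}" for i in range(1, 31)],
    "grade_10": [f"10-{i}" for i in range(1, 31)],
    "grade_11": [f"11-{i}" for i in range(1, 31)],
}
sample = stratified_sample(strata, n_per_stratum=5, seed=2)
print(sample)
```

Unlike cluster sampling, every stratum contributes some individuals to the sample.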
outliers, high leverage, and influential points in regression
high leverage - a point with much larger or much smaller x values than the other points
outlier - a point that does not follow the pattern of the data and has a large residual (actual - predicted)
influential point - a point that, if removed, substantially changes the slope, y-intercept, correlation, r^2, or standard deviation of the residuals
high leverage points and outliers can both be influential
how does shape affect measures of center
mean < median (Left Skew)
mean > median (Right Skew)
mean = median (Roughly Symmetric)
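A quick Python check of how skew pulls the mean away from the median. All three data sets are hypothetical.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]       # long tail to the right
left_skewed = [1, 17, 18, 18, 19, 19, 19, 20]  # long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, statistics.mean(data), statistics.median(data))
# right 4.75 3.0
# left 16.375 18.5
# symmetric 3 3
```

The mean is not resistant, so it is pulled toward the long tail; the median stays near the bulk of the data.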
association
two variables have an association if knowing the value of one variable helps to predict the value of the other variable
discrete vs continuous variables
a quantitative variable is discrete if its possible values have gaps between them (e.g., 1, 2, 3, 4)
a quantitative variable is continuous if its possible values have no gaps between them and it can take any value in an interval on the number line (e.g., every value from 1 to 1.7, including 1.1, 1.2, 1.3, and everything in between)
interpret r
correlation (r) measures the strength and direction of a linear association between two quantitative variables
r is always between -1 and 1
close to zero = very weak
close to 1 or -1 = strong
exactly 1 or -1 = perfectly straight line
positive r = positive correlation
negative r = negative correlation
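A Python sketch of computing r, assuming the usual formula: the sum of the products of the z-scores of x and y, divided by n - 1. The data sets are hypothetical.

```python
import statistics

def correlation(xs, ys):
    """r = sum of (z-score of x) * (z-score of y), divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

print(round(correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 3))  # 0.775
print(round(correlation([1, 2, 3], [2, 4, 6]), 3))              # 1.0
print(round(correlation([1, 2, 3], [6, 4, 2]), 3))              # -1.0
```

The last two data sets fall exactly on a line, so r is 1 or -1 depending on the direction.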
finding boundaries under a normal distribution
use invNorm and label inputs
empirical rule
finding area under a normal distribution
use normalcdf
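For checking calculator answers, the two calculator commands can be imitated with Python's `statistics.NormalDist`. The wrapper names below are hypothetical, and the example assumes a normal distribution with mean 64 and SD 3.2 purely for illustration.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0, sd=1):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area, mean=0, sd=1):
    """Boundary value with the given area to its left."""
    return NormalDist(mean, sd).inv_cdf(area)

# Assumed for illustration: a normal distribution with mean 64 and SD 3.2.
area = normalcdf(60, 68, mean=64, sd=3.2)
boundary = inv_norm(0.90, mean=64, sd=3.2)
print(round(area, 4))      # about 0.7887
print(round(boundary, 2))  # about 68.1
```

As on the calculator, label the inputs: lower and upper bounds (or area to the left), mean, and SD.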
standard normal distribution
the standard normal distribution is the normal distribution with mean 0 and SD 1
describing/comparing distributions of quantitative data
use SOCV
Shape
Outliers
Center
Variability
parameter vs statistic
a parameter is always about a population
a statistic is always about a sample
parameters include the population mean, population standard deviation, population proportion
statistics include the sample mean, sample standard deviation, sample proportion
marginal, joint, and conditional relative frequency
marginal - the values on the edge of the 2-way table
joint - the values that make up the body of the table
conditional - a joint frequency divided by the total for the given condition
ex: the probability that a survey respondent likes basketball the most, given that the respondent is male = 15 (males who like basketball) / 48 (all males, because that’s the condition)
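A Python sketch of the three kinds of relative frequency from a two-way table. Only the 15 and the male total of 48 come from the example above; every other count in the table is hypothetical.

```python
# Two-way table of counts. Only 15 and the male total of 48 come from the
# example; all other counts are hypothetical.
table = {
    "male": {"basketball": 15, "football": 20, "soccer": 13},
    "female": {"basketball": 16, "football": 8, "soccer": 28},
}

grand_total = sum(sum(row.values()) for row in table.values())
male_total = sum(table["male"].values())

marginal_male = male_total / grand_total              # edge of the table
joint = table["male"]["basketball"] / grand_total     # body of the table
conditional = table["male"]["basketball"] / male_total  # given male

print(marginal_male, joint, conditional)  # 0.48 0.15 0.3125
```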
percentiles
the pth percentile of a distribution is the value that has p% of the observations less than or equal to that value
example: a student who scores at the 90th percentile got the same score or a greater score than 90% of the other test takers
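A small Python sketch of the percentile definition on this card (percent of observations less than or equal to a value). The scores are hypothetical.

```python
def percentile_of(value, data):
    """Percent of observations less than or equal to value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical test scores
print(percentile_of(95, scores))  # 90.0
```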
describing an association in a scatterplot
use DUFS to describe association in a scatterplot
Direction - positive, negative, no association
Unusual features - clusters, outliers, or other unusual points
Form - linear, nonlinear
Strength - weak, moderate, strong
“There is a moderate, positive, linear association between height and weight for HS students”
empirical rule
if a distribution of data is approximately normal then,
- about 68% of the data will be within 1 SD of the mean
- about 95% of the data will be within 2 SD of the mean
- about 99.7% of the data will be within 3 SD of the mean
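The three percentages can be checked against the standard normal curve with Python's `statistics.NormalDist`. The helper name is hypothetical.

```python
from statistics import NormalDist

def within_k_sd(k):
    """Area under a normal curve within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```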
transforming data/ effect of changing units
adding “a” to every member of a data set adds “a” to the measures of center/position but does not change the measures of variability or shape
multiplying every member of a data set by a positive constant “b” multiplies the measures of center/position by “b” and multiplies most measures of variability by “b”, but does not change shape
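A quick Python check of both rules on a small hypothetical data set, with a = 10 and b = 2 chosen arbitrarily.

```python
import math
import statistics

data = [2, 4, 4, 6, 9]  # hypothetical data: mean 5, SD about 2.65
a, b = 10, 2            # arbitrary constants

shifted = [x + a for x in data]
scaled = [x * b for x in data]

# Adding a: center moves by a, variability is unchanged.
print(statistics.mean(shifted), statistics.stdev(shifted))
# Multiplying by b: center and variability are both multiplied by b.
print(statistics.mean(scaled), statistics.stdev(scaled))
```

The first line prints a mean of 15 with the original SD; the second prints a mean of 10 with twice the original SD.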
density
a density curve models the distribution of a quantitative variable with a curve that is always on or above the horizontal axis and has an area of exactly 1 underneath it
the area under the curve and above any interval of values on the horizontal axis estimates the proportion of all observations that fall in that interval