non highlighted Flashcards
categorical variable
a categorical variable places an individual into one of several groups or categories
quantitative variable
a quantitative variable has numerical values and it makes sense to find the average value
association
there is an association between two variables if knowing the value of one variable helps predict the value of the other
mean
average value of the observations
median
midpoint of the values, also called Q2
first and third quartiles
Q1 has about one-fourth of the observations below it, and Q3 has about three-fourths of the observations below it
interquartile range
IQR is the range of the middle 50% of the observations: IQR = Q3 - Q1
standard deviation
measures the typical distance of the values in a distribution from the mean
variance
average squared deviation of the values from the mean; the square of the standard deviation
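The summary-statistic cards above (mean through variance) can be sketched with Python's standard statistics module. The data values are made up for illustration. The quartiles assume the common textbook convention (Q1 and Q3 are the medians of the lower and upper halves), and the variance divides by n - 1; both are assumptions about the intended formulas, not something the cards state.

```python
import statistics

# Hypothetical sample of 8 observations (made-up values).
data = sorted([2, 4, 4, 5, 7, 9, 10, 15])
n = len(data)

mean = statistics.mean(data)      # average value of the observations
median = statistics.median(data)  # midpoint of the values (Q2)

# Assumed textbook convention: Q1 and Q3 are the medians of the lower
# and upper halves, leaving out the middle value when n is odd.
lower_half = data[: n // 2]
upper_half = data[(n + 1) // 2 :]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1                     # range of the middle 50%

variance = statistics.variance(data)  # sample variance, divides by n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance

print(mean, median, q1, q3, iqr, round(variance, 3), round(std_dev, 3))
```

For this list the mean is 7, the median is 6, Q1 = 4, Q3 = 9.5, and IQR = 5.5. Calculators and software may use slightly different quartile rules.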
shape
typical shapes of a distribution are roughly symmetric, skewed left and skewed right
center
mean for roughly symmetric distributions, median for skewed distributions
spread
standard deviation for roughly symmetric distributions, IQR for skewed distributions. Range = max - min as a last resort
transforming data by adding/subtracting a
measures of center (median and mean) and location (quartiles and percentiles) increase/decrease by a; measures of spread don’t change
transforming data by multiplying/dividing by b
measures of center, location, and spread are all multiplied/divided by b (spread by |b| when b is negative)
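The two transformation cards above can be checked numerically. This is a small sketch with made-up data and hypothetical constants a = 10 and b = 3; the helper name summary is an illustration only.

```python
import statistics

def summary(values):
    """Return (mean, median, standard deviation) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )

data = [2, 4, 4, 5, 7, 9, 10, 15]  # hypothetical observations
a, b = 10, 3                       # hypothetical constants

shifted = [x + a for x in data]    # add a to every value
scaled = [x * b for x in data]     # multiply every value by b

mean0, median0, sd0 = summary(data)
mean_s, median_s, sd_s = summary(shifted)
mean_m, median_m, sd_m = summary(scaled)

print("original:", mean0, median0, round(sd0, 3))
print("plus a:  ", mean_s, median_s, round(sd_s, 3))  # center moves, spread same
print("times b: ", mean_m, median_m, round(sd_m, 3))  # center and spread scale
```

Adding a moves the mean and median by a and leaves the standard deviation alone; multiplying by b scales all three.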
Density curve mean and median
the mean is the balance point of the curve. The median divides the area under the curve in half
uniform distribution
a distribution whose density curve has constant height over some interval of values
68-95-99.7 rule
approximate percent of observations that lie within one, two, and three standard deviations of the mean in a normal distribution
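The three percentages in the 68-95-99.7 rule can be recovered from the standard normal curve. A minimal sketch, assuming Python 3.8 or later, where statistics.NormalDist is available:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area under the curve within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD of the mean: {area:.4f}")
```

The areas come out near 0.68, 0.95, and 0.997, which is where the rule gets its name.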
normal probability plot
if the normal probability plot is roughly linear, then the data are approximately normal
if the normal probability plot is not roughly linear, then the data are not approximately normal
scatterplot
displays the relationship between two quantitative variables measured on the same individuals
explanatory variable, factor, response variable
if we think that a variable x may help explain, predict, or even cause changes in another variable y, we call x an explanatory variable and y a response variable
correlation r
measures the direction and strength of the linear relationship between two quantitative variables
r has no units, is between -1 and +1 and is not the value of the slope
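The correlation card above can be illustrated by computing r directly from its definition. The function name correlation and the data are hypothetical; this is a sketch, not a replacement for a calculator or library routine.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values

r = correlation(x, y)

# Changing the units of x (here, multiplying by 100) leaves r unchanged,
# which is what "r has no units" means in practice.
r_rescaled = correlation([100 * v for v in x], y)

print(round(r, 4), round(r_rescaled, 4))
```

Here r is about 0.775, while the least-squares slope for the same data is 0.6, so r and the slope are different quantities.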
correlation and causation
correlation does not imply causation, no matter how strong the correlation is; there may be confounding variables
least squares regression line
the straight line y hat = a + bx that minimizes the sum of the squares of the vertical distances of the observed points from the line
slope b
b is the predicted change in y when x increases by 1 unit, in context
y intercept a
predicted response value (y hat) when the explanatory variable x equals 0, in context
extrapolation
avoid extrapolation: the use of a regression line for prediction using values of the explanatory variable outside the range of the data
residual
y - y hat, the difference between the observed and predicted values of y
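The least-squares line, slope, intercept, and residual cards above fit together in a few lines of arithmetic. This sketch uses made-up data and the standard formulas b = Sxy / Sxx and a = (mean of y) - b * (mean of x); the variable names are chosen for illustration.

```python
x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: predicted change in y for a 1-unit increase in x.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
# Intercept: predicted y when x equals 0.
a = mean_y - b * mean_x

predicted = [a + b * xi for xi in x]                    # y hat values
residuals = [yi - yh for yi, yh in zip(y, predicted)]   # y - y hat

print(f"y hat = {a:.2f} + {b:.2f}x")
print([round(res, 2) for res in residuals])
```

For these points the line is y hat = 2.20 + 0.60x, and the residuals sum to zero, as they do for any least-squares line with an intercept.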
influential points
outliers that substantially change the correlation or the regression line’s slope or y intercept
census
a census collects data from every individual in the population
convenience sample
choose individuals who are easiest to reach
voluntary response sample
individuals choose to join the sample in response to an open invitation. Key terms: phone-in survey, TV survey
simple random sample
an SRS uses a chance process to give every possible sample of a given size the same chance to be chosen. Choose an SRS by labeling the members of the population and using slips of paper, technology, or a random digits table to select the sample
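The "technology" route for choosing an SRS can be sketched with Python's random module. The population labels and the seed are made up; random.sample draws without replacement, so every set of 5 labels is equally likely.

```python
import random

# Hypothetical population: 30 labeled individuals.
population = [f"student_{i:02d}" for i in range(1, 31)]

rng = random.Random(42)              # seeded only so the example repeats
srs = rng.sample(population, k=5)    # 5 distinct individuals, chosen by chance

print(srs)
```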
stratified random sample
divide the population into strata, groups of individuals that are similar in some way that might affect their responses. Then choose a separate SRS from each stratum and combine these SRSs to form the sample
cluster sample
divide the population into clusters, groups of individuals that are located near each other. Randomly select some of these clusters. All the individuals in the chosen clusters are included in the sample
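The difference between the stratified and cluster cards above is easiest to see side by side. In this sketch the population, the grade strata, and the homeroom clusters are all hypothetical, and the function names are chosen for illustration.

```python
import random

rng = random.Random(0)  # seeded only so the example repeats

# Hypothetical population: 80 students, 4 grades, 8 homerooms of 10.
population = [
    {"id": i, "grade": 9 + (i % 4), "homeroom": i // 10} for i in range(80)
]

def group_by(pop, key):
    """Split the population into groups that share the same value of key."""
    groups = {}
    for person in pop:
        groups.setdefault(person[key], []).append(person)
    return groups

def stratified_sample(pop, key, per_stratum, rng):
    """Take a separate SRS from every stratum and combine them."""
    sample = []
    for members in group_by(pop, key).values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

def cluster_sample(pop, key, n_clusters, rng):
    """Randomly pick whole clusters and keep everyone in them."""
    clusters = group_by(pop, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for c in chosen for person in clusters[c]]

stratified = stratified_sample(population, "grade", 5, rng)
clustered = cluster_sample(population, "homeroom", 2, rng)

print(len(stratified), sorted({p["grade"] for p in stratified}))
print(len(clustered), sorted({p["homeroom"] for p in clustered}))
```

The stratified sample contains some students from every grade, while the cluster sample contains every student from only two homerooms.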
undercoverage
when some members of the population cannot be chosen to be in the sample
response bias
when there is a systematic pattern of inaccurate answers
nonresponse bias
when people can’t be contacted or refuse to answer
wording bias
when confusing or leading questions introduce strong bias
observational study
gathers data on individuals as they are
experiment
deliberately imposes treatments on experimental units
experimental units
each of the individuals to which treatments are applied. Human experimental units are called subjects
confounding
variables are confounded when their effects on a response variable can’t be distinguished from those of the explanatory variable
completely randomized design
all experimental units are assigned to the treatments completely by chance
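A completely randomized design amounts to shuffling the units and dealing them out to treatments. A minimal sketch, assuming 12 hypothetical units, three hypothetical treatments labeled A, B, and C, and equal group sizes:

```python
import random

rng = random.Random(1)  # seeded only so the example repeats

units = [f"unit_{i}" for i in range(1, 13)]  # hypothetical experimental units
treatments = ["A", "B", "C"]                 # hypothetical treatment labels

shuffled = units[:]     # copy so the original list is left untouched
rng.shuffle(shuffled)   # chance alone decides the order

group_size = len(units) // len(treatments)
assignment = {
    t: shuffled[i * group_size : (i + 1) * group_size]
    for i, t in enumerate(treatments)
}

for treatment, group in assignment.items():
    print(treatment, group)
```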
placebo
a fake treatment for the control group. That prevents confounding due to the placebo effect, in which some patients get better because they expect the treatment to work.
double blind experiment
neither the subjects nor those interacting with them and measuring their responses know who is receiving which treatment. If one party knows and the other doesn’t then the experiment is single blind
randomized block design
use blocks of experimental units that are similar with respect to a variable that is expected to affect the response. Treatments are assigned at random within each block. Responses are then compared within each block and combined with the responses of other blocks after accounting for the differences between the blocks
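In a randomized block design the random assignment happens separately inside each block. This sketch assumes two hypothetical blocks labeled F and M with six units each and two treatments; the labels are made up for illustration.

```python
import random

rng = random.Random(2)  # seeded only so the example repeats

# Hypothetical units, each tagged with its block label.
units = [("F", i) for i in range(6)] + [("M", i) for i in range(6)]

# Form the blocks first.
blocks = {}
for unit in units:
    blocks.setdefault(unit[0], []).append(unit)

# Then randomize within each block, half to each treatment.
assignment = {}
for label, members in blocks.items():
    shuffled = members[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    for unit in shuffled[:half]:
        assignment[unit] = "treatment"
    for unit in shuffled[half:]:
        assignment[unit] = "control"

for unit in units:
    print(unit, assignment[unit])
```

Every block ends up with the same split between treatment and control, so the blocking variable cannot be confounded with the treatment.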
matched pairs design
in some matched pairs designs, each subject receives both treatments in a random order. In others, two very similar subjects are paired, and the two treatments are randomly assigned within each pair
mutually exclusive and independence
if two events are mutually exclusive (and each has nonzero probability), they cannot also be independent
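The card above can be checked on one roll of a fair die, using the test P(A and B) = P(A)P(B) for independence. The events chosen here (even, odd, and "at most 2") are an illustration only.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair six-sided die

def prob(event):
    """Probability of an event (a set of outcomes) with equally likely outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
odd = {1, 3, 5}
low = {1, 2}

# Even and odd are mutually exclusive: they share no outcomes.
p_even_and_odd = prob(even & odd)
even_odd_independent = p_even_and_odd == prob(even) * prob(odd)

# Even and low overlap (the outcome 2), and they happen to be independent.
even_low_independent = prob(even & low) == prob(even) * prob(low)

print(p_even_and_odd, even_odd_independent, even_low_independent)
```

For the mutually exclusive pair, P(A and B) = 0 while P(A)P(B) = 1/4, so the events are not independent: knowing one occurred tells you the other did not.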
probability distribution
the probability distribution of a random variable gives its possible values and their probabilities. A discrete random variable takes a fixed set of possible values with gaps between them
continuous random variable
a continuous random variable X takes all values in an interval of numbers. The probability distribution of X is described by a density curve. The probability of any event is the area under the density curve and above the values of X that make up the event
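For a uniform density curve, "probability is area" reduces to the area of a rectangle. A small sketch, with a hypothetical function name and a hypothetical interval from 0 to 4:

```python
def uniform_probability(low, high, a, b):
    """P(a <= X <= b) when X is uniform on [low, high]."""
    height = 1 / (high - low)          # constant height, total area is 1
    left, right = max(a, low), min(b, high)
    width = max(0.0, right - left)     # overlap of [a, b] with [low, high]
    return width * height              # area = width x height

p_middle = uniform_probability(0, 4, 1, 3)  # area above the interval 1 to 3
p_point = uniform_probability(0, 4, 2, 2)   # area above the single value 2
p_all = uniform_probability(0, 4, 0, 4)     # area under the whole curve

print(p_middle, p_point, p_all)
```

The single value has probability 0 because there is no area above one point, which is why only intervals matter for a continuous random variable.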
population parameter / sample statistic
a parameter is a number that describes a population. To estimate an unknown parameter, use a statistic calculated from a sample
sampling distribution
the sampling distribution of a statistic describes the values of the statistic in all possible samples of the same size from the same population
unbiased estimator
a statistic is an unbiased estimator if the center (mean) of its sampling distribution is equal to the true value of the parameter
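The sampling distribution and unbiased estimator cards can be illustrated by listing every possible sample from a tiny population. This sketch assumes a made-up population of five values and samples of size 3 drawn without replacement; the sample maximum is included only as a contrasting, biased statistic.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 4, 6, 8, 10]  # hypothetical population values
mu = Fraction(sum(population), len(population))  # population mean (a parameter)

# Every possible sample of size 3: this is what "all possible samples" means.
samples = list(combinations(population, 3))

# Sampling distribution of the sample mean, and its center.
sample_means = [Fraction(sum(s), len(s)) for s in samples]
center_of_means = sum(sample_means) / len(sample_means)

# Sampling distribution of the sample maximum, and its center.
sample_maxes = [max(s) for s in samples]
center_of_maxes = Fraction(sum(sample_maxes), len(sample_maxes))

print(len(samples), mu, center_of_means, center_of_maxes)
```

The sample means center exactly on the population mean of 6, so the sample mean is unbiased here. The sample maximums center on 9, below the population maximum of 10, so the sample maximum is a biased estimator of it.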