Midterm 1: Ch 1-10 Flashcards
What is statistics?
quantitative technology for empirical science: logic and methodology for the measurement of uncertainty, and for an examination of that uncertainty
What are the goals of statistics (2)?
- estimate the values of important parameters
- test hypotheses about those parameters
What is data?
measurements of one or more variables made on a collection of individuals
What is a variable?
characteristic measured on individuals drawn from a population under study
What are the two types of variables?
- response variable (dependent variable)
- explanatory variable (independent variable)
What is a response variable?
(dependent variable, y-axis) variable that we try to predict or explain from the explanatory variable
What is an explanatory variable?
(independent variable, x-axis) variable used to predict or explain the response variable
What are parameters?
descriptive measures of an entire population
- population parameters are constants
- e.g. mean length of salmon
What are estimates?
descriptive measures of a sample
- random variables: change from one random sample to the next, even from the same population
- e.g. mean of some sample of salmon
Do samples look exactly like the population?
no
What is a sample of convenience?
collection of individuals that happen to be available at the time; likely to be biased
What is bias?
systematic discrepancy between estimates and the true population characteristic
What are the goals of estimation? (2)
- accuracy
- precision
What is accuracy?
on average gets the correct answer
- accurate = unbiased
- inaccurate = biased
What is precision?
gives a similar answer repeatedly
What are some determinants of precision (when unbiased)?
- sample size
- precision of instrument
Unbiased and Precise
- on average, answer is correct
- repeated samples/estimates have very similar results
Unbiased and Imprecise
on average, answer is correct, BUT each individual estimate is off
Biased and Precise
- most dangerous: may not even realize there's a problem, and may have a lot of false confidence in the answer
- repeated samples/estimates have very similar results, BUT average value of estimates is off
Biased and Imprecise
- on average, answer is incorrect
- unconfident in the answer, but best guess would be wrong anyway; not as dangerous as being confident and wrong
What are properties of a good sample? (3)
- independent selection of individuals
- random selection of individuals
- sufficiently large
What is a random sample?
each member of a population has an equal and independent chance of being selected
What is independent sampling?
chance of an individual being included in the sample does NOT depend on who else is sampled
What is sampling error?
chance difference between an estimate and the population parameter being estimated
measure of precision: less sampling error = more precise
Do smaller or larger samples have smaller sampling error?
larger samples → smaller sampling error
on average
What is high sampling error?
estimates differ a lot from one sample to the next
low precision → large differences
What is low sampling error?
- higher precision → small differences
- low variance between different estimates (each time we do a study)
What are the two types of data?
- categorical variables (class or nominal variables)
- numerical variables (quantitative variables)
What are categorical variables?
fall into categories
What are the 2 types of numerical variables?
continuous: can be measured, e.g. arm length, height, weight, age*
discrete: can be counted, e.g. number of limbs, number of offspring, number of petals
What is a frequency table?
table listing how many times (frequency) each value or category occurs in the data
- frequency is NOT a variable: we are counting observations, not measuring individuals
What graph do you use for graphing categorical variables?
bar graph
What graph do you use for graphing numerical variables?
- histogram
- cumulative frequency distribution (CDF)
What data do histograms graph?
continuous numerical variable
- no gaps between bars: conveys that the variable is continuous, with values running together
- bar widths (bins) are the same
What is cumulative frequency of a value?
proportion of individuals equal to or less than that value
- 0 = none of the individuals are less than or equal to that value
- 1 = all individuals are less than or equal to that value
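A minimal Python sketch of cumulative frequency, using made-up measurements. The function name `cumulative_frequency` is an illustration only, not something from the course.

```python
def cumulative_frequency(data, value):
    """Proportion of observations less than or equal to `value`."""
    return sum(1 for x in data if x <= value) / len(data)

lengths = [2.1, 3.4, 3.4, 4.0, 5.2]  # hypothetical measurements

below_min = cumulative_frequency(lengths, 2.0)  # no observation is <= 2.0
at_middle = cumulative_frequency(lengths, 3.4)  # 3 of 5 observations are <= 3.4
at_max = cumulative_frequency(lengths, 5.2)     # every observation is <= 5.2
```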
What is a contingency table?
describes association between two (or more) categorical variables by displaying frequencies of all combinations of categories
What graphs are used for graphing the association between two categorical variables?
- contingency table
- grouped bar graph
- mosaic plot
What data do mosaic plots use?
relative frequencies scaled to 1; does NOT use raw counts
width of each bar indicates the relative number of individuals in that group (e.g. treatment)
What data do stacked bar plots use?
counts (frequencies)
What graphs are used for graphing the association between a categorical (x-axis) and numerical (y-axis) variable?
- multiple histogram
- cumulative frequency distribution (CDF)
- box plot
What graphs are used for graphing the association between two numerical variables?
scatter plot
What are two common descriptions of data?
- location: central tendency
- width: spread, i.e. how variable the data are
What are 3 measures of location?
- mean
- median
- mode
What is the mean (or average)?
sum of all measurements divided by the number of data points; the centre of gravity of the distribution
What is the median?
odd number: middle measurement in a set of ordered data
even number: average of two middle numbers in a set of ordered data
What is the mode?
most frequent measurement
Why might the mean and median be different?
skewed data: a lot of the weight is on one side of the distribution, and the mean is pulled toward the long tail more than the median is
Why might the mean and median be the same or similar?
symmetrical distribution of data, e.g. bell-shaped
Mean vs. Median
- mean has nice statistical properties; its sampling behaviour is well described by theory
- mean has good predictive behaviour
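A short Python sketch (made-up data) of the three measures of location, showing how one extreme value drags the mean but not the median. It uses the standard-library `statistics` module.

```python
import statistics

symmetric = [2, 3, 4, 4, 4, 5, 6]      # hypothetical, symmetric around 4
right_skewed = [2, 3, 4, 4, 4, 5, 20]  # same data with one large value in the right tail

sym_mean = statistics.mean(symmetric)          # 28 / 7
sym_median = statistics.median(symmetric)      # middle value of the ordered data
skew_mean = statistics.mean(right_skewed)      # 42 / 7, pulled toward the tail
skew_median = statistics.median(right_skewed)  # unchanged by the extreme value
most_common = statistics.mode(right_skewed)    # most frequent measurement
```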
What are the 4 measures of width?
- range
- variance
- standard deviation
- coefficient of variation
What is the range?
maximum minus minimum
- poor measure of distribution width: depends only on the two most extreme values and on sample size, so it is rarely useful in statistics
Is sample range a biased estimator of the true population range?
yes, smaller sample → lower estimates of range
- sample range is not expected to match population range
In the equation for variance, why do we square the deviations from the mean?
if we took the unsquared deviations, negative and positive deviations would cancel out (they always sum to zero)
What is sample variance?
unbiased estimator of population variance; used to try to learn about population variance
s^2 = Σ(Y_i - Ȳ)^2 / (n - 1)
What is standard deviation?
positive square root of the variance
σ: true (population) standard deviation
s: sample standard deviation, used to estimate σ (s^2 is unbiased for σ^2, but s itself slightly underestimates σ on average)
What is the coefficient of variation (CV)?
standard deviation expressed as a percentage of the mean: CV = 100% × s / Ȳ
good for comparing distributions of different magnitudes
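A worked Python sketch of the measures of width on a made-up sample. It assumes the usual sample formulas (n - 1 in the variance denominator, CV as a percentage of the mean).

```python
import math
import statistics

sample = [4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical measurements
mean = statistics.mean(sample)

deviations = [y - mean for y in sample]
sum_of_deviations = sum(deviations)  # positive and negative deviations cancel out

# Squaring removes the sign, so the deviations no longer cancel.
s2 = sum(d ** 2 for d in deviations) / (len(sample) - 1)  # sample variance
s = math.sqrt(s2)                                         # sample standard deviation
cv = 100 * s / mean                                       # coefficient of variation, in %
data_range = max(sample) - min(sample)                    # maximum minus minimum
```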
What is skew?
measurement of asymmetry; refers to the pointy tail of the distribution
right-skewed: pointy tail is on the right
left-skewed: pointy tail is on the left
Mean: Nomenclature
population parameter: μ
sample statistic: Ȳ
Variance: Nomenclature
population parameter: σ^2
sample statistic: s^2
Standard Deviation: Nomenclature
population parameter: σ
sample statistic: s
Manipulating Means
Mean of Sum of Two Variables
E[X + Y] = E[X] + E[Y]
Manipulating Means
Mean of Sum of Variable and Constant
E[X + c] = E[X] + c
e.g. temperature conversions
Manipulating Means
Mean of Product of Variable and Constant
E[c X] = c E[X]
e.g. measurement conversions
Manipulating Variance
Variance of Sum of Two Variables
Var[X + Y] = Var[X] + Var[Y]
ONLY if X and Y are independent
Manipulating Variance
Variance of Sum of Variable and Constant
Var[X + c] = Var[X]
spread of data has not changed, so variance is the same
e.g. adding 10 cm to every measurement
Manipulating Variance
Variance of Product of Variable and Constant
Var[c X] = c^2 Var[X]
variance in units^2, therefore multiply by constant^2
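The rules for manipulating means and variances can be checked exactly in Python. This sketch assumes two fair, independent dice as X and Y (an example chosen here, not from the cards) and uses `fractions.Fraction` so the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

def mean(values):
    return Fraction(sum(values), len(values))

def variance(values):
    """Population variance of a list of equally likely values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

die = [1, 2, 3, 4, 5, 6]
pairs = list(product(die, die))  # 36 equally likely outcomes of two independent dice
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
c = 10

sum_xy = [x + y for x, y in pairs]  # X + Y, with X and Y independent
shifted = [x + c for x in xs]       # X + c
scaled = [c * x for x in xs]        # c X
doubled = [x + x for x in xs]       # X + X: the two terms are NOT independent
```

Here `variance(sum_xy)` equals `variance(xs) + variance(ys)`, while `variance(doubled)` is four times `variance(xs)`, not twice, because the independence condition fails.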
What happens every time we take a sample from a population?
every sample will look different
What happens if we take many samples from a population?
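samples will look broadly similar to each other, and their estimates form a sampling distribution centred near the true parameter (if unbiased)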
How does variance change with sample size?
larger sample size = smaller variance of the sampling distribution of the mean
What is standard error of an estimate?
standard deviation of its sampling distribution
predicts the sampling error of the estimate
What is the problem with the equation for the standard error of the mean?
σ_Ȳ = σ / √n
in most cases, we don't know σ; we only have a sample
What is the estimate of the standard error of the mean?
SE_Ȳ = s / √n
gives some knowledge of the likely difference between sample mean and true population mean
What is the 95% confidence interval?
provides a plausible range for a parameter
- all values for the parameter within the interval are plausible
- all values for the parameter outside the interval are unlikely
What is the 2SE rule-of-thumb?
interval Ȳ ± 2 SE_Ȳ, which provides a rough estimate of the 95% CI for the mean
assuming normally distributed population and/or sufficiently large sample size
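A minimal Python sketch of the estimated standard error of the mean (s / √n) and the 2SE rule of thumb. The sample is made up, and with n = 5 the rule is only a rough illustration.

```python
import math
import statistics

sample = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical measurements
n = len(sample)

y_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)         # estimated standard error of the mean

# 2SE rule of thumb: rough 95% confidence interval for the population mean
ci_low = y_bar - 2 * se
ci_high = y_bar + 2 * se
```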
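Correct or Incorrect:
"we are 95% confident that the population mean lies within the 95% CI"
correct
Correct or Incorrect:
"there is a 95% probability that the population mean is within a particular 95% CI"
incorrect: the population mean is a constant, so a particular interval either contains it or does not; 95% describes how often intervals built this way capture it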
What is pseudoreplication?
error that occurs when samples are not independent, but they are treated as though they are
e.g. taking multiple measurements from one individual and treating each as a separate individual in the sample
EXAMPLE:
- taking 10 measurements from each of 6 climbers to get 60 measurements, then analysing them as n = 60
- to avoid pseudoreplication: take the mean of the 10 measurements for each climber, so that you have 6 values, one for each climber (n = 6)
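A small Python sketch of the fix for pseudoreplication: collapse repeated measurements to one mean per individual before analysis. The climber names and values are invented, and only 3 measurements per climber are shown to keep it short.

```python
import statistics

# Hypothetical repeated measurements: 6 climbers, 3 measurements each
measurements = {
    "climber_1": [71, 73, 72],
    "climber_2": [80, 82, 81],
    "climber_3": [65, 66, 64],
    "climber_4": [90, 88, 89],
    "climber_5": [77, 75, 76],
    "climber_6": [68, 70, 69],
}

# Pseudoreplicated sample size: counts every measurement as an individual
n_pseudo = sum(len(values) for values in measurements.values())

# Correct approach: one value (the mean) per climber
per_climber_means = [statistics.mean(values) for values in measurements.values()]
n = len(per_climber_means)
```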
What is the probability of an event?
its true relative frequency: proportion of times the event would occur if we repeated the same process over and over again
What does mutually exclusive mean?
when two events cannot both be true
Pr(A and B) = 0
What does independent mean?
when the occurrence of one event gives no information about whether the second event will occur
What is the probability distribution?
describes the true relative frequency of all possible values of a random variable
all probabilities have to sum to 1
What is the addition principle?
if two events A and B are mutually exclusive
Pr[A or B] = Pr[A] + Pr[B]
What is the probability of a range?
Pr[number ≥ 6] = Pr[6] + Pr[7] + Pr[8] + …
What is the probability of "not"?
Pr[not rolling a 2] = 1 - Pr[rolling a 2] = 5/6
What is the general addition principle?
Pr[A or B] = Pr[A] + Pr[B] - Pr[A and B]
need to subtract Pr[A and B], otherwise it'll be counted twice
What is the multiplication principle?
if two events A and B are independent
Pr[A and B] = Pr[A] x Pr[B]
What is the general multiplication principle?
Pr[A and B] = Pr[A] Pr[B | A]
Pr[A and B] = Pr[B] Pr[A | B]
therefore, Pr[A] Pr[B | A] = Pr[B] Pr[A | B]
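The addition and multiplication principles can be verified by brute-force counting in Python. This sketch assumes two fair dice (an example chosen here) and a helper `pr` that is purely illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely rolls of two dice

def pr(event):
    """Probability of an event = proportion of outcomes where it is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

first_six = lambda o: o[0] == 6
second_six = lambda o: o[1] == 6
sum_seven = lambda o: o[0] + o[1] == 7
sum_eleven = lambda o: o[0] + o[1] == 11

# Addition principle (mutually exclusive events)
pr_7_or_11 = pr(lambda o: sum_seven(o) or sum_eleven(o))

# General addition principle (events can both happen)
pr_any_six = pr(lambda o: first_six(o) or second_six(o))

# Multiplication principle (independent events)
pr_both_six = pr(lambda o: first_six(o) and second_six(o))

# General multiplication principle: Pr[A and B] = Pr[A] Pr[B | A]
pr_seven_and_first_six = pr(lambda o: sum_seven(o) and first_six(o))
pr_seven_given_first_six = pr_seven_and_first_six / pr(first_six)

# Probability of "not"
pr_first_not_two = 1 - pr(lambda o: o[0] == 2)
```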
What are dependent events?
probability of one event depends on the outcome of another event
Are variables always independent?
no
What is the conditional probability of an event?
probability of that event occurring given that a condition is met
Pr[X|Y]
probability of X given Y (if Y is true)
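Law of Total Probability
Pr[X] = Σ Pr[Y] Pr[X | Y], summed over all possible values of Y
When is Bayes' theorem used?
when you want to flip a conditional probability, i.e. get Pr[A | B] from Pr[B | A]
Pr[A | B] = Pr[B | A] Pr[A] / Pr[B]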
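A worked Python sketch of flipping a conditional probability. The diagnostic-test numbers are hypothetical and chosen only for illustration. The denominator Pr[positive] comes from the law of total probability.

```python
from fractions import Fraction

# Hypothetical inputs (assumptions for this example only)
pr_disease = Fraction(1, 100)               # Pr[disease]
pr_pos_given_disease = Fraction(9, 10)      # Pr[positive | disease]
pr_pos_given_healthy = Fraction(5, 100)     # Pr[positive | no disease]

pr_healthy = 1 - pr_disease

# Law of total probability: sum over both possible disease states
pr_positive = (pr_pos_given_disease * pr_disease
               + pr_pos_given_healthy * pr_healthy)

# Bayes' theorem: Pr[disease | positive] = Pr[positive | disease] Pr[disease] / Pr[positive]
pr_disease_given_pos = pr_pos_given_disease * pr_disease / pr_positive
```

With these made-up numbers, most positive results come from healthy individuals, because the disease is rare.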
What is hypothesis testing?
asks how unusual it is to get data that differ from the null hypothesis
- if the data would be quite unlikely under H0, we reject H0
- assumes random sampling
- about populations, but are tested with data from samples
What is the null hypothesis?
specific statement about a population parameter made for the purposes of argument
- simplest statement
- specific
- good H0 would be interesting if proven wrong
What is an alternate hypothesis?
represents all other possible parameter values except that stated in the null hypothesis
- statement of greatest interest
- non-specific
Steps of Hypothesis Testing
population → sample → estimate → test statistic
null hypothesis → test statistic
null hypothesis → construct a new population under H0 → imagined repeated sampling → sample from H0 and calculate test statistic → null distribution of test statistic
test statistic + null distribution of test statistic
- how weird would these data be if the null hypothesis were true?
- compare distribution from H0 to observed sample
- how likely would it be to obtain our data sample if H0 were true?
What is a test statistic?
number calculated to represent/summarize the match between a set of data and the null hypothesis
can be compared to a general distribution to infer probability: for any given value of a test statistic, we can say how likely that outcome is
What is a null distribution (sampling distribution under H0)?
probability distribution of alternative outcomes for a test statistic when a random sample is taken from a population corresponding to the null expectation
If H0 is true, do we expect variance between samples?
yes
need to evaluate the range and distribution of test statistics we would get if we sampled repeatedly
What is the P-value?
probability of getting the data, or something as or more unusual/extreme, if the null hypothesis were true
NOT probability H0 is true
NOT probability HA is true
How can we find P-values? (3)
- simulation
- parametric tests
- permutation
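A minimal Python sketch of finding a P-value by simulation. It assumes a hypothetical study with 18 trials, 14 observed successes, and H0: p = 0.5. Repeated sampling under H0 builds the null distribution, and the P-value is the share of simulated results at least as extreme as the observed one, in either tail.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

n_trials = 18          # hypothetical sample size
p0 = 0.5               # proportion stated by the null hypothesis
observed = 14          # hypothetical observed number of successes
expected = n_trials * p0
n_sims = 10_000

def simulate_successes():
    """One imagined sample drawn from a population where H0 is true."""
    return sum(1 for _ in range(n_trials) if random.random() < p0)

# Null distribution of the test statistic (number of successes)
null_counts = [simulate_successes() for _ in range(n_sims)]

# Two-tailed: count results as far or farther from the expectation as observed
as_extreme = sum(
    1 for x in null_counts if abs(x - expected) >= abs(observed - expected)
)
p_value = as_extreme / n_sims
```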
What is the significance level?
acceptable probability of rejecting a true null hypothesis
α = usually 0.05
What does it mean to be statistically significant?
if the P-value for a test is ≤ α, then H0 is rejected
What does it mean to be statistically insignificant?
if the P-value for a test is > α, then H0 is NOT rejected
How does sample size influence the range of test statistic we see under the null hypothesis?
larger sample → estimate has a narrower confidence interval, and the null distribution of the test statistic is narrower
larger sample → more power to reject a false null hypothesis
What is a Type I error?
rejecting a true null hypothesis
probability of Type I error is α (significance level)
What is a Type II error?
not rejecting a false null hypothesis
probability of Type II error is β, which depends on:
- what the real world looks like
- our sample size
- our α
smaller β = more power
What is power?
ability of a test to reject a false null hypothesis, i.e. how likely we are to reject it
power = 1 - β
How are power and sample size related?
larger sample size (more information) → larger power
increase sample size → decrease standard deviation of null distribution → increase power to reject a false H0
What is a two-tailed test?
deviation in either direction would reject the null hypothesis
- most tests are two-tailed
- normally α is divided into α/2 on one side, and α/2 on the other
When is a one-tailed test used?
only used when the other tail is nonsensical
e.g. comparing grades on a multiple choice test to those expected by random guessing
What is a critical value?
value of a test statistic beyond which the null hypothesis can be rejected
- we never "accept the null hypothesis"
Where in the 95% CI is the value proposed by the null hypothesis rejected?
in general, if a hypothesis test rejects a null hypothesis (P < 0.05), the value proposed by the null hypothesis is outside the 95% confidence interval
2-Sample T-tests
The more different the sample means are…
(when taking into account sample spread and size, and assuming we've randomly sampled), the less likely it is they were drawn from populations with the same mean
2-Sample T-tests
What would a P-value of 0.03 mean?
there is a 3% chance of getting sample means that are at least this different if they're drawn from populations with the same mean
2-Sample T-tests
Higher vs. Lower P-values
higher p-values:
- higher probability of 2 sample means being at least this different, if drawn from populations with same mean
- less evidence of differences between population means
lower p-values:
- lower probability of 2 sample means being at least this different, if drawn from populations with same mean
- more evidence of differences between population means
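To make "at least this different, if drawn from populations with the same mean" concrete, here is a Python sketch. Note that this is NOT a t-test: it swaps in an exact permutation test, which reassigns the pooled values to the two groups in every possible way and counts how often the difference in means is at least as large as the observed one. The data are made up.

```python
from fractions import Fraction
from itertools import combinations

group_a = [12, 15, 11, 14]  # hypothetical sample 1
group_b = [18, 17, 20, 16]  # hypothetical sample 2

def mean(values):
    return Fraction(sum(values), len(values))

observed_diff = abs(mean(group_a) - mean(group_b))

pooled = group_a + group_b
indices = range(len(pooled))

n_splits = 0
n_as_extreme = 0
# Every way of relabelling the pooled values into groups of the original sizes
for chosen in combinations(indices, len(group_a)):
    chosen = set(chosen)
    new_a = [pooled[i] for i in indices if i in chosen]
    new_b = [pooled[i] for i in indices if i not in chosen]
    n_splits += 1
    if abs(mean(new_a) - mean(new_b)) >= observed_diff:
        n_as_extreme += 1

p_value = Fraction(n_as_extreme, n_splits)  # two-tailed permutation P-value
```

A lower `p_value` means fewer relabellings produce a difference this large, i.e. more evidence that the population means differ.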
What is a confounding variable?
unmeasured variable that may be the cause of both X and Y
What is a proportion?
fraction of individuals having a particular attribute
What is a binomial distribution?
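describes the probability of a given number of "successes" from a fixed number of independent trials, each with the same probability of success
Pr[X successes] = (n choose X) p^X (1 - p)^(n - X)
What are the 2 properties of binomial distributions?
- mean of number of successes: n p
- variance of number of successes: n p (1 - p)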
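A short Python sketch of the binomial distribution, assuming the standard formula Pr[X] = (n choose X) p^X (1 - p)^(n - X) with n = 10 and p = 0.3 chosen arbitrarily. It checks numerically that the mean is n p and the variance is n p (1 - p).

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3  # arbitrary example values
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probs)  # a probability distribution sums to 1
mean = sum(x * pr for x, pr in enumerate(probs))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))
```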
What is the estimate of a proportion?
number of "successes" over total sample size: p̂ = X / n
What are the 2 properties of sample proportions?
- mean: p
- variance: p (1 - p) / n
How does the standard error of the estimate of a proportion change with sample size?
larger sample → lower standard error
What is the name for the 95% CI for a proportion?
Agresti-Coull confidence interval
How does the Agresti-Coull confidence interval change with sample size?
larger sample → narrower interval, more symmetrical around the sample proportion
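A Python sketch of the Agresti-Coull interval in its common "plus four" textbook form: p' = (X + 2) / (n + 4), with interval p' ± 1.96 √(p'(1 - p') / (n + 4)). The function name and the counts are illustrative assumptions.

```python
import math

def agresti_coull_95(successes, n):
    """Approximate 95% CI for a proportion (plus-four form)."""
    p_adj = (successes + 2) / (n + 4)  # adjusted proportion with the +2 and +4 factors
    half_width = 1.96 * math.sqrt(p_adj * (1 - p_adj) / (n + 4))
    return p_adj - half_width, p_adj + half_width

# Same observed proportion (0.4), two hypothetical sample sizes
small_low, small_high = agresti_coull_95(8, 20)
large_low, large_high = agresti_coull_95(80, 200)

small_width = small_high - small_low
large_width = large_high - large_low  # larger sample gives a narrower interval
```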
For the Agresti-Coull confidence interval, what are the +2 and +4 factors?
"fudge factors" in p' = (X + 2) / (n + 4) that improve the interval for small samples and asymmetrical distributions
What is Murphy's law?
anything that can go wrong will go wrong
What is the binomial test?
uses data to test whether a population proportion (p) matches a null expectation for the proportion
H0: relative frequency of successes in the population is p0
HA: relative frequency of successes in the population is not p0
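A Python sketch of an exact two-tailed binomial test. It assumes one common convention, doubling the probability of the tail containing the observed count, and uses hypothetical data (14 successes in 18 trials, p0 = 0.5).

```python
import math

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_test_two_tailed(observed, n, p0):
    """Two-tailed P-value by doubling the tail at or beyond the observed count."""
    if observed >= n * p0:
        tail = sum(binomial_pmf(x, n, p0) for x in range(observed, n + 1))
    else:
        tail = sum(binomial_pmf(x, n, p0) for x in range(0, observed + 1))
    return min(1.0, 2 * tail)

p_value = binomial_test_two_tailed(14, 18, 0.5)  # hypothetical data
reject_h0 = p_value <= 0.05                      # using alpha = 0.05
```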
What is a goodness-of-fit test?
compares count data to a probability distribution (expected frequencies) of a set of categories
What are the hypotheses for a χ^2 goodness-of-fit test?
H0: data come from a particular probability distribution
HA: data do NOT come from that distribution
What is the test statistic for χ^2 goodness-of-fit test?
χ^2 = Σ (Observed - Expected)^2 / Expected, summed over all categories
How does the number of categories affect χ^2 goodness-of-fit test?
the more categories you have, the more opportunities to deviate from expectations
What is the degree of freedom of a test?
specifies which of a family of distributions to use
What is the equation for degrees of freedom for χ^2 test?
df = (number of categories) - (number of parameters estimated from the data) - 1
What is the critical value?
value of the test statistic where P = α
if observed χ^2 > critical χ^2 for the given df, we reject the null hypothesis
if observed χ^2 < critical χ^2 for the given df, we DO NOT reject the null hypothesis
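A Python sketch of a χ^2 goodness-of-fit calculation with made-up counts in three categories and a null hypothesis of equal proportions. The critical value 5.99 (α = 0.05, df = 2) is taken from a standard χ^2 table and hard-coded here.

```python
observed = [10, 20, 30]              # hypothetical counts
null_proportions = [1/3, 1/3, 1/3]   # H0: all categories equally likely

n = sum(observed)
expected = [n * p for p in null_proportions]

# Test statistic: sum over categories of (Observed - Expected)^2 / Expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = (number of categories) - (parameters estimated from the data) - 1
n_estimated_parameters = 0  # the null proportions were not estimated from the data
df = len(observed) - n_estimated_parameters - 1

critical_value = 5.99  # from a chi-square table: alpha = 0.05, df = 2
reject_h0 = chi2 > critical_value

# Assumption check: no expected count below 5 in this example
assumptions_ok = all(e >= 5 for e in expected)
```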
What is a test statistic?
number calculated from the data and the null hypothesis that can be compared to a standard distribution to find the P-value of the test
Can χ^2 goodness-of-fit test substitute for binomial test?
yes, because it works even when there are only two categories
- very useful if the number of data points is large
- BUT not recommended if the binomial test is possible, since the binomial test is exact for two categories (success and failure)
What are the assumptions of the χ^2 goodness-of-fit test?
- no more than 20% of categories have Expected < 5
- no category with Expected ≤ 1
(if needed, combine categories to satisfy these requirements)
What is a discrete distribution?
probability distribution describing a discrete numerical random variable
e.g. number of heads from 10 flips of a coin
e.g. number of flowers in a square meter
e.g. number of disease outbreaks in a year
What is the Poisson distribution?
describes the probability that a certain number of events occur in a block of time or space, when those events happen independently of each other and occur with equal probability at every point in time or space
used to ask questions about random events (by chance)
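A brief Python sketch of the Poisson distribution, assuming the standard formula Pr[X] = e^(-μ) μ^X / X! and an arbitrary mean of 2 events per block of time or space.

```python
import math

def poisson_pmf(x, mu):
    """Probability of exactly x events when the mean number of events is mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

mu = 2.0  # arbitrary mean number of events per block
probs = [poisson_pmf(x, mu) for x in range(50)]  # terms beyond 50 are negligible

pr_zero = probs[0]            # probability that no events occur
total = sum(probs)            # approximately 1
mean = sum(x * pr for x, pr in enumerate(probs))
variance = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))
```

For a Poisson distribution the variance equals the mean, which the last two lines confirm numerically.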
What is contingency analysis?
test the independence of two or more categorical variables
What is the equation for degrees of freedom for χ^2 contingency analysis?
df = (# of columns - 1) (# of rows - 1)
What are the assumptions for χ^2 contingency analysis?
this test is just a special case of the χ^2 goodness-of-fit test, therefore the same rules apply
- no more than 20% of categories have Expected < 5
- no category with Expected ≤ 1
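A Python sketch of a χ^2 contingency analysis on a made-up 2 x 2 table. Under independence, the expected count of each cell is (row total × column total) / grand total.

```python
table = [
    [10, 20],  # hypothetical counts, row 1
    [30, 40],  # hypothetical counts, row 2
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Expected counts if the two categorical variables are independent
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

chi2 = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(table))
    for j in range(len(table[0]))
)

# df = (# of columns - 1) (# of rows - 1)
df = (len(table[0]) - 1) * (len(table) - 1)

# Assumption check: every expected count is at least 5 here
assumptions_ok = all(e >= 5 for row in expected for e in row)
```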
What is Fisher's exact test?
exact test of independence for a 2 x 2 contingency table
- does not make assumptions about the size of expectations
R (or other programs) will do it, but difficult to do by hand
What are odds?
probability of success divided by the probability of failure
What is odds ratio?
odds of success in one group divided by the odds of success in another group
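if "success" is an undesirable outcome (e.g. death) and the treatment group is in the numerator:
OR < 1 means treatment helps
OR > 1 means treatment makes things worse
OR = 1 means the odds are the same in both groups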
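A small Python sketch of odds and the odds ratio, with invented counts in which "success" is an undesirable outcome and the treatment group is in the numerator.

```python
from fractions import Fraction

def odds(successes, n):
    """Odds = probability of success divided by probability of failure."""
    p = Fraction(successes, n)
    return p / (1 - p)

# Hypothetical counts: number with the bad outcome out of 50 in each group
odds_treatment = odds(10, 50)  # p = 1/5
odds_control = odds(20, 50)    # p = 2/5

odds_ratio = odds_treatment / odds_control
treatment_helps = odds_ratio < 1  # lower odds of the bad outcome with treatment
```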
What is a normal distribution?
- distribution fully described by its mean and standard deviation
- symmetric around its mean
- mean, median, and mode are all the same
- about 68% (roughly two-thirds) of random draws from a normal distribution are within one standard deviation of the mean
- about 95% of random draws from a normal distribution are within two (more precisely 1.96) standard deviations of the mean
For a standard normal distribution, what is the mean?
mean (μ) = 0
For a standard normal distribution, what is the standard deviation?
standard deviation (σ) = 1
What is a standard normal table?
gives probability of getting a random draw from a standard normal distribution greater than a given value
Is a standard normal symmetric?
yes
Pr[Z > x] = Pr[Z < -x]
What is the total area under the curve of a standard normal distribution?
1
Pr[Z < x] = 1 - Pr[Z > x]
Are all normal distributions shaped the same?
yes, just with different means and variances
Can any normal distribution be converted to a standard normal distribution?
yes, by Z, the standard normal deviate: Z = (Y - μ) / σ
- Z tells us how many standard deviations Y is from the mean
- probability of getting a value > Y is the same as the probability of getting a value > Z from a standard normal distribution
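A Python sketch of the standard normal deviate using `statistics.NormalDist` from the standard library. The values μ = 100, σ = 15 and Y = 130 are arbitrary.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # arbitrary normal distribution
y = 130.0

z = (y - mu) / sigma  # how many standard deviations Y is from the mean

standard_normal = NormalDist()  # mean 0, standard deviation 1
pr_above_y = 1 - NormalDist(mu, sigma).cdf(y)  # Pr[value > Y] on the original scale
pr_above_z = 1 - standard_normal.cdf(z)        # Pr[Z > z] on the standard normal scale

# Symmetry: Pr[Z > x] = Pr[Z < -x]
pr_below_minus_z = standard_normal.cdf(-z)

# Share of draws within 1 and within 1.96 standard deviations of the mean
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_196_sd = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
```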
Are sample means normally distributed?
yes, if the variable itself is normally distributed (and approximately, for large samples, even if it is not: see the central limit theorem)
- mean of the sample means: μ_Ȳ = μ
- standard deviation of the sample means: σ_Ȳ = σ / √n
What is the standard error of an estimate of a mean?
the standard deviation of the distribution of sample means
What is the central limit theorem?
sum or mean of a large number of measurements randomly sampled from any population is approximately normally distributed
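A simulation sketch of the central limit theorem in Python. It assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1), with the sample size and number of samples chosen arbitrarily. The sample means should centre on μ with standard deviation close to σ / √n, and roughly 95% of them should fall within 2 standard errors of μ.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

n = 50            # size of each sample
n_samples = 2000  # number of repeated samples
mu, sigma = 1.0, 1.0  # mean and SD of an exponential population with rate 1

sample_means = [
    statistics.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = sigma / math.sqrt(n)

# Share of sample means within 2 standard errors of the population mean
share_within_2se = sum(
    1 for m in sample_means if abs(m - mu) <= 2 * theoretical_se
) / n_samples
```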
Why do we "fail to reject H0" rather than "accept H0" after a test in which the P-value is calculated to be greater than α?
failing to reject H0 does not mean H0 is correct, because the power of the test might be limited
null hypothesis is the default and is either rejected or not rejected