Sampling and Statistical Inference Flashcards
The Goal of Statistical Inference
1. We want to learn a quantity of interest about a particular group (a population parameter).
2. Often, information on all members of the group (the population) is not available.
3. We use sampling to collect a limited amount of information and use it to infer population properties (parameters).
4. Since we only have information on a subset of the population, we are uncertain about our inference (and there are other sources of uncertainty even if we observe the entire population).
5. All inferences are inherently uncertain.
The goal of statistical inference is to estimate population parameters and summarize our
uncertainty about these estimates
What is a population parameter?
A parameter describes a feature of the population. The parameter is fixed at some
value, and we will never be able to know it for sure.
What is the random sample that we observe?
What we observe is a random sample, drawn from the population. A random sample
is a proper subset of the population in which each member has an equal probability
of being selected.
What is an estimator of the population parameter?
A sample statistic estimates a population parameter. A sample statistic is a function
applied to the observed sample; this function is called the estimator of the
population parameter.
For example, we can calculate the mean of a random sample. The mean is then a sample
statistic, and the function that maps the observations of the random sample to this
sample statistic is the estimator.
Example of estimator, estimate and estimand
We rely on the observations in our sample and apply a linear regression function (the estimator) to estimate the causal effect of education on income. The resulting value in our sample is our estimate of the population-level causal effect (the estimand).
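A minimal sketch of this distinction, using simulated (hypothetical) data and an OLS slope as the estimator; the variable names and the "true" coefficient of 200 are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: years of education and income.
# The assumed population-level effect (the estimand) is 200 per year.
education = rng.uniform(8, 20, size=500)
income = 200 * education + rng.normal(0, 500, size=500)

# np.polyfit with deg=1 is the estimator (a linear regression function);
# applying it to the observed sample yields the estimate.
slope, intercept = np.polyfit(education, income, deg=1)
print(slope)  # the estimate: close to, but not exactly, 200
```

In a different random sample, `slope` would come out slightly different, which is exactly the sampling variability discussed below.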
The Principle of Sampling
Probability sampling: Select from a population with size N a number of individuals, n
(usually n ≪ N), such that each individual has a non-zero probability of being chosen
Sources of variation across samples
Sampling variability: Means and standard deviations of repeated samples will not be
identical
Sampling error: An estimate from a sample will not be identical to the value in the
population
The required sample size increases with the desired precision of the estimate: larger samples yield more precise estimates.
Conducting Statistical Inferences
We have: (1) a population, (2) a sample from this population, and (3) an estimate of a
population parameter.
* How uncertain are we about that estimate?
* Alternatively, how precisely can we estimate the population parameter?
* In a different sample, our estimate would be slightly different. Hence, estimates vary
over repeated samples.
* Applying an estimator to repeated samples yields a sampling distribution for this
statistic.
* Calculating the spread of this sampling distribution yields a measure of uncertainty.
Example of Statistical Inference and formula of standard error
- Let there be a country with 100,000 inhabitants.
- We want to know what the mean income of this country is.
- We sample 5000 individuals randomly from the population.
- The mean of the obtained sample is 1400 (= θ_hat) with a standard deviation of 2000 (= σ_hat).
- The standard error of the estimate of the population mean (θ_hat) is σ_pop/√n.
- For a large sample, the standard deviation of the sample (σ_hat) can be used as an
approximation of the population standard deviation (σ_pop):
SE(θ_hat) = σ_hat/√n = 2000/√5000 ≈ 28.3
The mean income of our population is estimated to be 1400 ± 28.3.
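The arithmetic of this example can be checked directly; this sketch just plugs the numbers from the example above into SE = σ_hat/√n:

```python
import math

# Numbers from the example above.
n = 5000
sample_mean = 1400   # θ_hat
sample_sd = 2000     # σ_hat, used as an approximation of σ_pop

# Standard error of the estimated population mean.
se = sample_sd / math.sqrt(n)
print(round(se, 1))  # 28.3
```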
Digression: Derivation of the Standard Error
- Let’s assume that we have a random sample from a population, i.e., we have n
random variables θ_1, …, θ_n that come from the same population, represented by a
distribution with mean µ and variance σ^2. Such random variables are called
independent and identically distributed (iid).
- Hence, we know that Var(θ_i) = σ^2 for all such random variables.
- Denote their mean, as an estimate of the population mean, by θ_hat = (1/n) Σ θ_i.
- Then, because independence lets the variance of a sum split into the sum of the
variances, we get the sampling variance (i.e., the variance of the estimator):
Var(θ_hat) = Var((1/n) Σ θ_i) = (1/n^2) Σ Var(θ_i) = (1/n^2) * n * σ^2 = σ^2/n
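The result Var(θ_hat) = σ^2/n can be checked by simulation; this sketch (with assumed values n = 50 and σ^2 = 4) draws many samples and compares the empirical variance of the sample means to σ^2/n:

```python
import numpy as np

rng = np.random.default_rng(1)

n, reps = 50, 20000
sigma2 = 4.0  # assumed population variance

# Draw `reps` independent samples of size n and record each sample mean.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
means = samples.mean(axis=1)

# The empirical variance of the means should be close to σ²/n = 4/50 = 0.08.
print(means.var())
```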
What is confidence interval of q%?
We call a confidence interval a q% confidence interval if it is constructed such that it
contains the true parameter at least q% of the time when we repeat the experiment a
large number of times.
Three Different Approaches to Assess Uncertainty
- Analytical
- Bootstrapping (resampling)
- Simulation (parametric)
Analytical Approach: CIs via Normal Approximation, how?
- We have a sample statistic θ_hat estimated for a parameter θ.
- If the sample is large enough, we can assume a normal sampling distribution with
mean θ_hat and variance Var(θ_hat).
- We can then construct a 95% confidence interval using the quantiles of the
standard normal distribution:
θ_hat ± 1.96 * √Var(θ_hat) = θ_hat ± 1.96 * σ_hat/√n
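Applied to the running income example (θ_hat = 1400, σ_hat = 2000, n = 5000), the normal-approximation interval can be sketched as:

```python
import math

theta_hat = 1400   # sample mean from the running example
sigma_hat = 2000   # sample standard deviation
n = 5000

# 95% CI: θ_hat ± 1.96 * σ_hat/√n
se = sigma_hat / math.sqrt(n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(ci)  # roughly (1344.6, 1455.4)
```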
What is bootstrapping?
Bootstrapping estimates the sampling distribution of θ by repeatedly sampling (with
replacement) from the original sample.
- Take s samples of size n (with replacement) from your data.
- Calculate the quantity of interest (θ_hat_i, e.g., the mean) for each of your s samples,
which yields a vector of length s.
- A simple confidence interval for your quantity can be obtained by calculating quantiles
(e.g., the 2.5th and 97.5th percentiles for a 95% CI) of this vector.
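The steps above can be sketched as follows; the data are simulated here for illustration (in practice `data` would be your observed sample), and s = 5000 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical observed sample of 500 incomes (simulated for illustration).
data = rng.exponential(scale=1400, size=500)

s = 5000
n = len(data)

# Resample with replacement s times and compute the mean of each resample.
boot_means = np.array([
    rng.choice(data, size=n, replace=True).mean() for _ in range(s)
])

# 95% CI from the 2.5th and 97.5th percentiles of the bootstrap means.
ci = np.percentile(boot_means, [2.5, 97.5])
print(ci)
```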
How does (parametric) simulation work?
- Create a (normal) sampling distribution from the mean and standard error of your
sample.
- Take s draws from that distribution, N(θ_hat, SE(θ_hat)^2).
- Calculate your quantity of interest for each of the s draws; this simulates its
sampling distribution.
- Calculate summaries, such as the mean and standard error, of the resulting vector of
length s. To construct a 95% CI, take the 2.5th and 97.5th percentiles.
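A sketch of the parametric simulation approach, reusing the estimate and standard error from the running income example (θ_hat = 1400, SE ≈ 28.3); s = 10000 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)

theta_hat = 1400   # estimated mean from the running example
se = 28.3          # its standard error

s = 10000
# Take s draws from the assumed normal sampling distribution N(θ_hat, SE²).
draws = rng.normal(loc=theta_hat, scale=se, size=s)

# Summaries of the simulated sampling distribution, including a 95% CI
# from the 2.5th and 97.5th percentiles.
ci = np.percentile(draws, [2.5, 97.5])
print(draws.mean(), ci)
```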