Statistics Flashcards
Population mean (Formula)
Population variance (Formula)
Sample mean (Formula)
The sample mean estimator is unbiased
Sample variance - unbiased (Formula)
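The formula sides of these cards aren't shown in the text; the standard definitions they presumably refer to are:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```

The n-1 divisor (Bessel's correction) is what makes s^2 unbiased, i.e. E[s^2] = sigma^2.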
Generalisability (Def)
Results from statistical inference are generalisable when estimates obtained from a sample are reflective of the target population’s parameter
Sampling distribution (Def)
If we take several samples from a population, the sample estimates will differ due to sampling variation. The sampling distribution is the distribution of the sampling estimates
Standard error (Def)
The standard deviation of the sampling distribution is a measure of the sampling variation; it is called the standard error
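For the sample mean, the standard error takes the familiar form (sigma the population standard deviation, estimated by s when unknown):

```latex
\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}
```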
Sampling methods - non-probability (Types)
- Convenience
- Systematic
- Purposive
- Quota
Sampling methods - probability (Types)
- Random
- Cluster
- Stratified
Convenience sampling
A type of non-probability sampling.
Sampling based on how convenient the subjects are to find.
Pro:
- Affordable, easy and quick
- Works ok if the population is homogeneous
Con:
- Not representative if the population is heterogeneous
Purposive sampling
A type of non-probability sampling.
The researcher relies on their own knowledge when choosing members of the population.
Pro:
- Beneficial when we want to access a subset of the population
Con:
- Requires domain knowledge
- Might not be representative
Quota sampling
A type of non-probability sampling.
Tailors the sample to be in proportion to some known characteristic of the population.
Pro:
- Affordable, easy and quick
- Accounts for differences in groups (strata)
Con:
- Selection bias if convenience sampling is used to fill the quotas
- Needs prior knowledge to know the strata
Systematic sampling
A type of non-probability sampling.
Sampling at regular intervals, taking one unit every k = N/n (N the population size, n the sample size)
Pro:
- Can extend the sampling procedure to the whole population (i.e. more representative)
Con:
- Needs knowledge of the whole population
- The order of the units can cause systematic bias
Bias of an estimator (Formula)
Precision of an estimator (Formula)
Mean Squared Error (Formula)
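The formula sides of these three cards are presumably the standard ones (theta_T the true parameter value; precision is usually measured via the estimator's variance, or its reciprocal):

```latex
\mathrm{Bias}(t) = E[t(X)] - \theta_T
\qquad
\mathrm{Var}(t) = E\big[(t(X) - E[t(X)])^2\big]
```

```latex
\mathrm{MSE}(t) = E\big[(t(X) - \theta_T)^2\big] = \mathrm{Var}(t) + \mathrm{Bias}(t)^2
```

The decomposition shows that a small amount of bias can be worth accepting if it buys a large reduction in variance.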
Population total (Formula)
Variance of the sample mean, for finite populations (Formula)
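These two cards presumably refer to the standard results (S^2 the population variance with divisor N-1):

```latex
\tau = \sum_{i=1}^{N} x_i = N\mu, \qquad \hat{\tau} = N\bar{x}
```

```latex
\mathrm{Var}(\bar{x}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)
```

The factor (1 - n/N) is the finite population correction: it vanishes as the sample approaches the whole population, since then there is no sampling variation left.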
Stratified sampling (Def)
If the population of interest is heterogeneous with respect to the characteristic (parameter) of interest, one sampling procedure that can increase PRECISION is Stratified sampling.
Def: Partitioning a population into non-overlapping groups and sampling within each group. Each group is called a STRATUM.
If the sampling is done randomly, it’s called Stratified random sampling
Stratified sampling (Principles)
(1) Strata should be non-overlapping
(2) Strata should form a partition of the total pop
(3) Units within a stratum should be more similar to each other than to units in other strata, with respect to the characteristic of interest
(4) We should aim for homogeneity within strata, relative to the pop
(5) Success depends on the choice of characteristic used to partition the pop
Sampling fraction of the stratum (Formula)
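Presumably the standard definition, consistent with the f_i used in the variance card below:

```latex
f_i = \frac{n_i}{N_i}
```

where n_i is the sample size taken from stratum i and N_i is the stratum size.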
Proportionate stratification (Def)
Each stratum is represented in the sample in proportion to (see pic)
W_i = N_i / N
is the proportion of the Pop. within stratum i
Strata estimator of the population mean (Formula)
The estimator is unbiased.
W_i = N_i / N
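The estimator itself is presumably the weighted average of the stratum sample means (x-bar_i the sample mean in stratum i, k the number of strata):

```latex
\bar{x}_{st} = \sum_{i=1}^{k} W_i\, \bar{x}_i, \qquad W_i = \frac{N_i}{N}
```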
Variance of the Strata estimator of the population mean (Formula)
Where:
W_i = N_i / N
f_i = n_i / N_i
S_i is the population standard dev WITHIN the stratum
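With the quantities defined on the card, the standard form of this variance is presumably:

```latex
\mathrm{Var}(\bar{x}_{st}) = \sum_{i=1}^{k} W_i^2\, (1 - f_i)\, \frac{S_i^2}{n_i}
```

Each stratum contributes its own finite population correction (1 - f_i).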
Cost of sampling
Where c_i is the cost of each unit in stratum i, and n_i is the sample size of stratum i
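The cost function is presumably linear in the stratum sample sizes (some texts, e.g. Cochran's formulation, add a fixed overhead term c_0; that term is an assumption here):

```latex
C = c_0 + \sum_{i=1}^{k} c_i n_i
```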
Strat sampling - Proportional allocation (Formula)
Choosing size of each stratum’s sample size based on the stratum’s proportion with respect to the population size
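Presumably the standard rule:

```latex
n_i = n\, \frac{N_i}{N} = n W_i
```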
Strat sampling - Optimal (Neyman) allocation (Formula)
Note: requires you to know the pop. standard deviation of the stratum upfront!
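A minimal Python sketch of both allocation rules, to make the contrast concrete. The stratum sizes and standard deviations below are made-up numbers, and the Neyman formula used is the standard n_i = n * N_i * S_i / sum_j(N_j * S_j):

```python
def proportional_allocation(n, N_strata):
    """n_i = n * N_i / N: allocate in proportion to stratum size."""
    N = sum(N_strata)
    return [n * Ni / N for Ni in N_strata]

def neyman_allocation(n, N_strata, S_strata):
    """n_i = n * N_i * S_i / sum_j N_j S_j: favours large, variable strata."""
    total = sum(Ni * Si for Ni, Si in zip(N_strata, S_strata))
    return [n * Ni * Si / total for Ni, Si in zip(N_strata, S_strata)]

# Hypothetical population: two strata of sizes 600 and 400,
# with within-stratum standard deviations 1 and 4.
N_strata = [600, 400]
S_strata = [1.0, 4.0]
print(proportional_allocation(100, N_strata))  # [60.0, 40.0]
print(neyman_allocation(100, N_strata, S_strata))
```

Neyman allocation shifts most of the sample into the smaller but much more variable stratum, which is exactly why it needs the S_i upfront.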
Stratified sampling (When and Pros/Cons)
Use when:
- The target population is heterogeneous
- Subgroups (strata) can be defined
- We want a sample to be representative of these groups
Pros:
- Works well if a pop. contains a wide variety of characteristics that may be used to group units
- Gives smaller standard errors and greater PRECISION vs. SRS (the more heterogeneity between strata, the greater precision)
Cons:
- Need prior knowledge to form strata
Cluster sampling (Def)
If it is not feasible to access all units, then Cluster sampling may be used.
Def: Cluster sampling is a sampling procedure where we sample units within a population using a sampling method where sampling units are clusters.
Assumption: units within a cluster are heterogeneous (each cluster mirrors the population), while clusters are homogeneous between themselves (similar to one another)
Cluster sampling (Principles)
(1) Divide the pop. in natural clusters based on some rule (e.g. geographic area)
(2) Treat each cluster as a sampling unit
(3) Sample clusters based on a sampling method
(4) Collect all info from all sampling units within a cluster
Cluster sampling (Pros and cons)
Pros:
- Do not need to access all units of a population (e.g. if we have no info or it’s too expensive)
Cons:
- Clusters may not reflect the true diversity of the population
Constructing clusters - Equal size
If we sample n clusters from N equally sized clusters and y-bar_i is the sample mean in cluster i, the mean of cluster means (see pic) is an unbiased estimator of the pop mean y-bar
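The "(see pic)" estimator is presumably the mean of the cluster sample means:

```latex
\bar{\bar{y}} = \frac{1}{n}\sum_{i=1}^{n} \bar{y}_i
```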
Constructing clusters - unequal size
N = no. of clusters
M_i = units within cluster i
M = pop. size
M-bar = M/N, avg. number of units in clusters
n = no. of clusters sampled
y-bar*N = avg. value of y across clusters
Summarising numerical data (3 Key aspects)
The distribution of a dataset can be summarised by:
- Location: the centre of the distribution (e.g. mean, median)
- Spread: variation/range of the data
- Shape: shape of the distribution
Median (Defs)
The middle value of the ordered observations (at depth (n+1)/2). If there’s an even number of observations, it’s the average of the two middle values.
- Range: max value minus min value (not robust, i.e influenced by outliers)
- LQ: at depth 1/4 * (n+1)
- UQ: at depth 3/4 * (n+1)
- IQR: UQ-LQ (robust)
Common plots
- Dotplot (discrete data)
- Histogram (cont data)
- Bar chart (categorical data)
- Stem and leaf diagram (cont data)
- Boxplot (cont data)
- Scatterplot (compare 2 cont vars)
Boxplot - Formula for outliers
Lower fence = LQ - 1.5 * IQR
UPPER fence = UQ + 1.5 * IQR
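A small Python sketch of the fence rule, using the depth-based quartile convention from the cards (depth 1/4*(n+1) and 3/4*(n+1), with linear interpolation); the data values are made up:

```python
def boxplot_fences(data):
    """Tukey's rule: points outside [LQ - 1.5*IQR, UQ + 1.5*IQR] are outliers."""
    xs = sorted(data)
    n = len(xs)

    def at_depth(d):
        # d is a 1-indexed depth; interpolate between adjacent order statistics
        lo = int(d) - 1
        frac = d - int(d)
        if lo + 1 >= n:
            return xs[-1]
        return xs[lo] + frac * (xs[lo + 1] - xs[lo])

    lq = at_depth(0.25 * (n + 1))
    uq = at_depth(0.75 * (n + 1))
    iqr = uq - lq
    lower, upper = lq - 1.5 * iqr, uq + 1.5 * iqr
    outliers = [x for x in xs if x < lower or x > upper]
    return lower, upper, outliers

print(boxplot_fences([1, 2, 3, 4, 5, 6, 7, 100]))  # (-4.5, 13.5, [100])
```

Note that other quartile conventions (e.g. NumPy's default) give slightly different fences on small samples.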
Parameter vs. Statistic
Parameter: single number summarising the variable of interest in the population
Statistic: same as above, but within a sample. It’s a FUNCTION of the data in the sample
P-value (Def)
It’s the probability of obtaining a value for your TEST STATISTIC that is at least as extreme as the observed value, assuming H_0 is true
Interpretation of Confidence Intervals (95%)
Under repeated sampling and recalculation, 95% of CIs would contain the true population value
Pivotal function (Def)
A Pivotal function is a function of the data, X, and a parameter of interest, θ, which when regarded as a r.v. calculated at θ_T (true value of θ), has a probability distribution whose form does not depend on any unknown parameter. We denote it by PIV(θ_T, X)
Pivotal function for t-test
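The formula side of this card is presumably the usual one-sample t pivot (s the sample standard deviation):

```latex
PIV(\mu, X) = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}
```

Its distribution is free of the unknown sigma, which is what makes it pivotal.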
Hypothesis testing framework
(1) Specify H_0 and H_1
(2) Define a test statistic (TS)
(3) Compute the observed value of the TS from the data
We reject H_0 in favour of H_1 when |t| (or TS) is too large to be consistent with H_0. For example, adopting a significance level of α=0.05, we reject H_0 if:
|t| > t_0.975(n-1)
This is a one-sample t-test
One-sample t-test (Assumptions)
- Our data x_1, … , x_n have arisen from a normal distribution
- Our data points are independent from one another
Paired t-test
If the data come from the same individuals (e.g. 2 measurements), the independence assumption does not stand. If appropriate, we can take the difference of the measurements, reducing it to a One-sample test.
In this case, the differences are assumed to be independent of each other
Non-parametric tests (Explanation)
Non-parametric tests make fewer assumptions (e.g. normality is not required), so they can be used instead of e.g. t-test. For example, the Wilcoxon signed ranks test performs inference on the median
Two-sample t-test (Assumptions)
(1) All values are independent
(2) The distribution of the variable of interest is the same for both populations, apart from possibly the mean (i.e. same variance)
(3) they are distributed normally
Two-sample t-test - Pivotal function (Formula)
Two-sample t-test - Pooled variance and CI (Formula)
We can get a better estimate of σ-hat by pooling the info from the two samples
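The formula sides of these two cards are presumably the standard pooled-variance results:

```latex
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
```

```latex
t = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{s_p\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}
\qquad
\text{CI: } (\bar{x} - \bar{y}) \pm t_{1-\alpha/2}(n_1 + n_2 - 2)\, s_p\sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}
```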
Proportion test - CLT derivation
Suppose X ~ Bi(n, θ). Then X is the number of successes in n independent trials, each trial having probability θ of success.
We can use the Normal approx to the Binomial distribution (CLT):
X ~ N(nθ, nθ*(1-θ))
This requires a discrete dist. to be approximated with a continuous one, introducing some inconsistencies. For this we use a continuity correction.
The CLT tells us that for sufficiently large n (e.g. n>=20, nθ>=5 and nθ*(1-θ)>=5) the test statistic of interest is:
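The statistic the card points to is presumably the approximate z-statistic (θ-hat = x/n; with continuity correction, x is replaced by x ± 0.5):

```latex
z = \frac{\hat{\theta} - \theta_0}{\sqrt{\theta_0(1 - \theta_0)/n}} \;\overset{approx}{\sim}\; N(0, 1) \quad \text{under } H_0
```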
Point estimators (Def and desirable properties)
A point estimate is a particular numeric value of the function t(x), obtained from a particular set of data x = (x_1, x_2, … , x_n).
Properties:
(1) Range of t(X) should be the same as the range of θ
(2) t(X) should be UNBIASED
(3) t(X) should be CONSISTENT
(4) MVUE - Minimum variance unbiased estimator
Note: the lower the variance, the more efficient the estimator
Maximum likelihood estimation (Motivation)
Selecting the value of θ (parameter) for a chosen probability distribution, for which our given set of observations has a maximum probability.
Given observed data and an assumed probability model, we want to find estimates for the population parameters that maximise the likelihood that our distribution fits the data
MLE (Def)
The MLE θ-hat is the value of the population parameter θ (for any probability distribution) which maximises the Likelihood function L(θ; x)
Likelihood function - Discrete (Formula)
Log-Likelihood function (Formula)
It’s often easier algebraically to work with the natural log of the L function: since ln(x) is a monotonic function, l(θ) reaches its maximum at the same value of θ as L(θ; x)
Likelihood function - Continuous (Formula)
For continuous distributions, the PDF of X_i evaluated at x_i does not directly represent the probability of the data. However, it is approximately PROPORTIONAL to the probability that X_i lies in a small interval around x_i, so it’s reasonable to take the likelihood of the parameter θ to be:
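The likelihood formulas on these cards are presumably the standard ones, assuming independent observations (p the pmf in the discrete case, f the pdf in the continuous case):

```latex
L(\theta; x) = \prod_{i=1}^{n} p(x_i; \theta) \quad \text{(discrete)}
\qquad
L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta) \quad \text{(continuous)}
```

```latex
l(\theta) = \ln L(\theta; x) = \sum_{i=1}^{n} \ln f(x_i; \theta)
```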
MLE (Steps)
(1) Evaluate the Likelihood function L(θ; x)
(2) Obtain the Log-Likelihood function l(θ)
(3) Differentiate wrt θ and set l’(θ) = 0
(4) Solve for θ
(5) Verify it’s a maximum, i.e. check l’’(θ-hat) < 0
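As a worked instance of these steps, take the exponential distribution Exp(λ) as a hypothetical example:

```latex
L(\lambda; x) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_i x_i}
\qquad
l(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} x_i
```

```latex
l'(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\;\Rightarrow\;
\hat{\lambda} = \frac{1}{\bar{x}},
\qquad
l''(\lambda) = -\frac{n}{\lambda^2} < 0 \quad \text{(maximum confirmed)}
```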
Newton-Raphson method (Def)
When it’s not possible to find the root of l’(θ) = 0 in closed form, we can use the method to find the MLE numerically
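A minimal sketch of the iteration θ ← θ − l’(θ)/l’’(θ). To make it checkable, it is applied to a case with a known closed-form answer (the exponential MLE above, λ-hat = 1/x-bar); the data values are hypothetical:

```python
def newton_raphson_mle(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - l'(theta)/l''(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Exp(lam): l'(lam) = n/lam - sum(x), l''(lam) = -n/lam^2, MLE = 1/mean(x)
data = [0.5, 1.2, 0.8, 2.1, 0.4]  # hypothetical observations
n, s = len(data), sum(data)
lam_hat = newton_raphson_mle(
    score=lambda lam: n / lam - s,
    hessian=lambda lam: -n / lam**2,
    theta0=0.5,
)
print(abs(lam_hat - n / s) < 1e-8)  # True
```

In practice the starting value matters: for this score function the iteration diverges if θ0 is too far from the root, which is why a rough moment-based estimate is a common starting point.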
Efficiency (Def)
An UNBIASED estimator is said to be efficient if it has the minimum possible variance; the efficiency of an unbiased estimator is the ratio of the minimum possible variance to the variance of the estimator