Weeks 4 & 5 - Stratified & cluster sampling Flashcards
Remarks about the 3 types of allocation for stratified sampling (+ general optimal allocation)
Define proportional and Neyman allocation [2011]
Proportional allocation
- Sample size in stratum proportional to stratum size (nh proportional to Nh)
1. Sample ‘mirrors’ population, representative
2. Simple estimation
Neyman allocation
- Sample size in stratum proportional to the product of stratum size and stratum standard deviation (nh proportional to NhSh)
1. A special case of optimal allocation when costs for all strata are the same.
2. If stratum standard deviations vary a lot, Neyman allocation gives the greatest advantage
- Variance is minimised when nh is proportional to NhSh
Equal allocation
1. If stratum variances are similar, equal sample sizes give similar standard errors across strata.
Therefore, the stratum point estimates are comparable
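A minimal sketch of the three allocations above, to make the proportionality rules concrete. The function name `allocate` and the stratum sizes and standard deviations are made up for illustration; sample sizes are left fractional (rounding is a separate step).

```python
def allocate(n, sizes, sds, method="proportional"):
    """Split a total sample size n across strata (fractional n_h)."""
    if method == "proportional":
        weights = list(sizes)                           # n_h proportional to N_h
    elif method == "neyman":
        weights = [N * S for N, S in zip(sizes, sds)]   # n_h proportional to N_h * S_h
    elif method == "equal":
        weights = [1] * len(sizes)                      # same n_h in every stratum
    else:
        raise ValueError(f"unknown method: {method}")
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: sizes N_h and standard deviations S_h
sizes = [600, 300, 100]
sds = [2.0, 5.0, 20.0]
for method in ("proportional", "neyman", "equal"):
    print(method, [round(x, 1) for x in allocate(100, sizes, sds, method)])
```

With these made-up numbers, Neyman allocation shifts sample towards the small but highly variable third stratum, which proportional allocation would under-sample.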
4 reasons for stratified sampling
- To study specific SUBPOPULATIONS
eg. F vs M, geographic regions
- To assist in implementing OPERATIONAL ASPECTS of survey
eg. big and small farms
- To improve representativeness
- To improve precision
^with homogeneous strata
Why is SRS possibly problematic? And why could stratified sampling be advantageous?
SRS can’t guarantee small variance b/c of the equal probability of selection -> results can vary a lot.
+ could also lead to extreme samples, eg. all men
Stratified sampling can have smaller variance (if strata are homogeneous within)
What are the primary sampling units & secondary sampling units in cluster sampling?
*we want clusters to have high heterogeneity within -> mini version of pop, good representation
PSU = clusters, eg. blocks, schools
SSU = eg. households in the blocks
Cluster sampling - estimated population size formula
M hat = N x (m bar)
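A quick sketch of the formula above, assuming N is the number of clusters in the population and m bar is the mean size of the sampled clusters. The function name and the numbers are hypothetical.

```python
def estimate_population_size(N, sampled_cluster_sizes):
    """M hat = N x (m bar), with m bar the mean size of the sampled clusters."""
    m_bar = sum(sampled_cluster_sizes) / len(sampled_cluster_sizes)
    return N * m_bar

# Hypothetical: 50 clusters in the population, 4 sampled with these sizes
M_hat = estimate_population_size(50, [12, 8, 10, 10])
print(M_hat)  # 500.0
```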
Cluster sampling - design effect formula
Comment on how you would decide between the unbiased estimator under simple random sampling and the ratio estimator under cluster sampling. [6m, 2018]
Var(t cl hat) / Var(t SRS hat) ~= [1 + (K-1)ICC ]
where ICC = intra-cluster correlation coefficient, which measures homogeneity within clusters (K is taken here to be the cluster size)
- When ICC=1, clusters are perfectly homogeneous & Var(t cl hat) > Var(t SRS hat)
- When ICC=0, elements within a cluster are no more alike than randomly chosen elements & Var(t cl hat) ~= Var(t SRS hat)
- The PRECISION of the ratio estimator of t under cluster sampling may be expected to be WORSE than the precision of the unbiased estimator of t under simple random sampling…
- b/c of a TENDENCY FOR HOMOGENEITY of cow weight on farms.
- The RELEVANT FACTOR is the INTRA-farm CORRELATION of weight of obese cows.
*Bear in mind that the goal is for the clusters to be just as heterogeneous as the whole pop, so that the selection of a given cluster will yield the same information as the random selection of individuals from the entire population.
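A small sketch of the design effect approximation 1 + (K-1)ICC, to show how quickly homogeneity costs precision. It assumes K is the common cluster size (the card does not define K); the function name and the values of K and ICC are illustrative only.

```python
def design_effect(K, icc):
    """Approximate Var(cluster) / Var(SRS) = 1 + (K - 1) * ICC."""
    return 1 + (K - 1) * icc

# Hypothetical cluster size of 20 elements
print(design_effect(20, 0.0))   # 1.0  -> no loss relative to SRS
print(design_effect(20, 1.0))   # 20.0 -> each cluster is worth about one element
print(design_effect(20, 0.05))  # even a small ICC nearly doubles the variance
```

The larger the clusters, the more a given ICC inflates the variance, which is why the intra-cluster correlation is the relevant factor when choosing between the two estimators.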
Describe briefly 1-stage cluster sampling. [2m]
The population is divided into N clusters, where cluster i is of size Mi
for i = 1, 2, …N
and where elements in cluster i are labelled j = 1, 2, …, Mi
- Sampling units are clusters, which are groups of elements. We usually select a simple random sample of n clusters, but any design can be used to select clusters.
- Regarding data collection, information should be collected on all elements in each sampled cluster.
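A sketch of estimation after 1-stage cluster sampling, assuming an SRS of n clusters out of N. It uses the standard unbiased estimator of the total, (N/n) x (sum of sampled cluster totals), and the ratio estimator of the mean, (sum of cluster totals) / (sum of cluster sizes). Function names and data are hypothetical.

```python
def cluster_total_unbiased(N, cluster_totals):
    """Unbiased estimate of the population total: (N / n) * sum of t_i."""
    n = len(cluster_totals)
    return N * sum(cluster_totals) / n

def cluster_mean_ratio(cluster_totals, cluster_sizes):
    """Ratio estimate of the mean per element: sum of t_i / sum of M_i."""
    return sum(cluster_totals) / sum(cluster_sizes)

# Hypothetical: N = 10 clusters, n = 3 sampled, every element observed
totals = [30, 50, 40]   # t_i: sum of y over all elements in cluster i
sizes = [3, 5, 4]       # M_i: number of elements in cluster i
print(cluster_total_unbiased(10, totals))  # 400.0
print(cluster_mean_ratio(totals, sizes))   # 10.0
```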
2 reasons why cluster sampling is used
- May not have a list of elements for a SAMPLING FRAME, but a list of clusters may be available
- May be CHEAPER to conduct the study if elements are CLUSTERED
Explain how cluster sampling may lead to less precise estimates. [2m]
- Elements within clusters tend to be similar, ie. clusters tend to be homogeneous
- Homogeneous clusters give less information than if the same no. of unrelated elements are selected
4 factors that should be considered in the choice of stratification scheme [4m, 2016]
- whether any SUBPOPULATIONS are of interest (in which case these might form strata);
- to improve PRECISION would like to stratify by variables STRONGLY RELATED to principal variables of interest (to achieve HOMOGENEOUS strata);
- whether COST of data collection varies by some factor in which case might wish to stratify by this factor;
- are there any reasons why different modes of data collection would be used for different kinds of firms (in which case these might define strata).
If the Q asks if the difference between 2 estimated means is significant, what should I do?
(from 2012)
Derive the standard error of the difference then say significant or not
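A sketch of that recipe, assuming the two estimated means are independent (eg. from different strata) so the variances add, and using a large-sample normal cut-off of 1.96 for the 5% level. The function name and the numbers are made up.

```python
import math

def difference_test(mean1, se1, mean2, se2, critical=1.96):
    """Return (difference, SE of difference, significant?) for independent estimates."""
    diff = mean1 - mean2
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)  # variances add under independence
    z = diff / se_diff
    return diff, se_diff, abs(z) > critical

# Hypothetical estimates: 52 (SE 3) vs 47 (SE 4)
diff, se_diff, significant = difference_test(52, 3, 47, 4)
print(diff, se_diff, significant)  # 5 5.0 False
```

If the two estimates come from the same sample and are correlated, a covariance term would be needed and this sketch does not apply.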
Explain why stratified sampling can be seen as an extreme form of cluster sampling.
[4m, 2011]
A special case of two-stage cluster sampling: all ‘clusters’ are sampled, then a sample from each cluster.
{I guess b/c there is a tendency for homogeneity within clusters?}
Prove the difference in variance between proportional and Neyman allocation is (1/n) x summation over i = 1 to k of Wi(Si - Sbar)^2
What is Sbar? Explain when the difference is greatest between the precision of the estimators.
[2011]
Sbar = summation(WiSi), ie. the weighted mean of the stratum std devs
See 2011 paper for workings
Difference is greatest when the stratum std devs Si vary a lot from each other.
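A numerical check of the identity (not a proof), assuming the finite population correction is ignored so that Var(prop) = (1/n) summation(WiSi^2) and Var(Neyman) = (1/n) (summation(WiSi))^2. Function names, weights and std devs are hypothetical.

```python
def var_proportional(n, W, S):
    """(1/n) * sum of W_i * S_i^2, fpc ignored."""
    return sum(w * s ** 2 for w, s in zip(W, S)) / n

def var_neyman(n, W, S):
    """(1/n) * (sum of W_i * S_i)^2, fpc ignored."""
    return sum(w * s for w, s in zip(W, S)) ** 2 / n

def variance_gap(n, W, S):
    """(1/n) * sum of W_i * (S_i - Sbar)^2, with Sbar = sum of W_i * S_i."""
    s_bar = sum(w * s for w, s in zip(W, S))
    return sum(w * (s - s_bar) ** 2 for w, s in zip(W, S)) / n

W = [0.6, 0.3, 0.1]    # stratum weights, sum to 1
S = [2.0, 5.0, 20.0]   # stratum std devs that vary a lot
lhs = var_proportional(100, W, S) - var_neyman(100, W, S)
rhs = variance_gap(100, W, S)
print(abs(lhs - rhs) < 1e-9)  # True
```

Setting all Si equal makes the gap vanish, which matches the remark that Neyman allocation only helps when the stratum std devs differ.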
What information would it be helpful to have in judging the suitability of a stratifying variable?
[2014]
Need information on the distribution of __ BETWEEN & WITHIN stratification categories to judge
How to describe if it makes sense to use stratified sampling given a BOXPLOT?
If suitable,
- strata are homogeneous
- within stratum variances clearly lower than overall variance
- so stratified sampling will improve precision (reduce variance) compared to SRS
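The same judgement can be made numerically rather than by eye, by comparing each within-stratum variance to the overall variance. This is a rough sketch with made-up data and a hypothetical function name; within-stratum variances well below the overall variance suggest stratification would give better precision than SRS.

```python
import statistics

def within_vs_overall(strata):
    """Return (overall sample variance, {stratum: within-stratum sample variance})."""
    all_values = [y for values in strata.values() for y in values]
    overall = statistics.variance(all_values)
    within = {name: statistics.variance(values) for name, values in strata.items()}
    return overall, within

# Hypothetical data, eg. output of small vs large farms
strata = {"small": [10, 12, 11, 13], "large": [50, 52, 49, 53]}
overall, within = within_vs_overall(strata)
print(all(v < overall for v in within.values()))  # True
```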