Stat 354 Flashcards
sampling theory vs. classical statistical theory
- concerned w/ finite populations
- different goals and restrictions
- no density function, limited use of models
If N = n
complete enumeration
census
why survey? (survey vs census)
time, cost, speed, scope, accuracy
Principal steps for surveying
objectives, resources, population, units of observation, data to collect, method of measurement, organization of field work, summary and analysis
steps for surveying, Objectives
precise statement of objectives
steps for surveying, resources
quantity of information “purchased”, cost of information for whole survey
resources (quantity) depend on
number of observations made (items sampled)
design of survey
Determining/setting resources
determine sample design to obtain:
- most information (lowest SE) for a given budget
- most observations/cost for a given level of precision (SE)
If resources cannot meet the objective
do not survey
Target population
population of interest
collection of elements about which we wish to make inference
Element
object from which we take a measurement
Target population example
collection of voters in a community
Element example
a registered voter
Sample population
population sampled from
Discussing the target population
be aware of assumptions made to make the leap from sample population to target population
Example of sample population
collection of registered* voters in a community
observational unit
element
sampling unit
unit selected for a sample
- may contain 1+ observational units
- non-overlapping collection of elements from the population
sampling unit example
a classroom
observational unit example
a student in a classroom
sampling frame
list of all sampling units in the population
sampling frame example
list of all students in the school
list of all registered voters in the community
reduced data quality
if you ask too many questions
-focus questions, be concise
measurement methods
self-administered questionnaires
telephone, email, door-to-door, internet
very important step in methods
test questionnaire on a small scale (pilot study, pre-test)
improve and re-assess
steps for surveying, organization of field work
- train people in goals and methods
- early quality checking
- plan for non-response
steps for surveying, summary and analysis
- edit questionnaire, record errors
- methods for handling non-response
- different estimation methods
- estimation of precision
Non-response
some elements of sample fail to provide responses to survey
Non-response bias
if non-responders have different opinions/measurements from responders, bias occurs
non-response bias especially important when
non-response rate is high
selection bias
some units more likely to be included in sample than others
-cannot be overcome by increased n
sample
collection of sampling units drawn from sampling frame (single or multiple frames)
Literary digest poll, 1936
predicted 57% for Landon
highest response in history, 2.4 million
Roosevelt won 62%
why did Literary digest fail
SRS from phone book and club membership – selection bias (only rich 1/4 of pop. had phones)
what to learn from Literary digest poll
when selection procedure is biased, no size of n will help
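A minimal simulation sketch of this lesson. All numbers are hypothetical (a population of 100,000 with 62% support, where the 25% who own phones support at only 43%); they are chosen to echo the cards, not taken from the 1936 data. Sampling only from phone owners keeps the estimate near 43% no matter how large n gets.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 1 = supports the candidate, 0 = does not.
phone_owners = [1] * 10_750 + [0] * 14_250   # 25,000 people, 43% support
non_owners = [1] * 51_250 + [0] * 23_750     # 75,000 people, about 68% support
population = phone_owners + non_owners
true_support = sum(population) / len(population)  # 0.62 overall

def biased_estimate(n):
    """Sample n people from phone owners only; return the sample proportion."""
    sample = rng.sample(phone_owners, n)
    return sum(sample) / n

# The estimate settles near 0.43, not 0.62, as n grows.
estimates = {n: biased_estimate(n) for n in (100, 1_000, 20_000)}
for n, est in estimates.items():
    print(n, round(est, 3), "true support:", true_support)
```

A larger n only shrinks the sampling error around the wrong target; the gap to the true value is set by the frame, not by the sample size.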
personal vs mailed surveys
personal ca. 65%
mailed ca. 25%
how to find out if a sample is any good
ask how it was taken
Gallup poll, 1936
George Gallup
n = 50,000 people
predicted Roosevelt victory (56% vs truth 62%)
predicted the Digest's own forecast for Roosevelt (44% vs the Digest's actual 43%)
Quota sampling
- interviewer assigned fixed number (quota) of subjects to interview
- numbers of subjects w/i each category are fixed
example of quota categories
residence
age
sex
economic status
goal of quota sampling
aims to be representative based on census data
ex. design sampling based on % men vs women in population
problems with quota sampling
- sample controls for certain variables, but not the one of interest (ex. can’t control for Republican vs Democrat)
- interviewers are free to choose who they want within quota
sources of error in surveys
Errors of non-observation
Errors of observation
Error of non-observation
sampling error
coverage error
non-response
sampling error
deviation between sample estimate and true population value
coverage error
sampling frame does not match perfectly w/ target population
errors of observation
interviewers
respondents
Interviewer error
affects the respondent's response in some way
example of interviewer error
body language
how to reduce sampling error
- sampling design
- sample size
- investigator
coverage error example
people who are unlisted in telephone book
Respondent error
differ in their ability and motivation to answer correctly
-response error
Response errors
recall bias
prestige bias
intentional deception
incorrect measurement
Recall bias
different responders recall differently
prestige bias
exaggerate to appear more prestigious
example of prestige bias
exaggerate income
Intentional deception example
don’t want to admit to breaking the law
incorrect measurement
respondent doesn’t understand measurement units
ex. report on cm vs m; cups of coffee vs travel mugs
how to reduce non-response in data collection
reward for responding
inform ahead of time
shortened, concise, focused questionnaire
callback, persistence
marketing - train interviewers to ‘sell it’
data cleaning - check for errors
sampling distribution of ȳ
distribution of values of ȳ over repeated samples of same size
characteristics of ȳ sampling distribution
- mean = µ
- standard deviation σ/√n
- approximately bell-shaped
- assumes population is infinite
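A small simulation sketch of these characteristics, assuming a normal population with hypothetical values µ = 50 and σ = 10 and samples of size n = 25. Over many repeated samples, the mean of ȳ should sit near µ and its standard deviation near σ/√n = 2.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
MU, SIGMA, N_OBS, REPS = 50.0, 10.0, 25, 5000  # hypothetical settings

# Draw REPS independent samples of size N_OBS and record each sample mean.
sample_means = [
    statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(N_OBS))
    for _ in range(REPS)
]

mean_of_means = statistics.fmean(sample_means)  # should be close to MU
sd_of_means = statistics.stdev(sample_means)    # should be close to SIGMA / sqrt(N_OBS)
theoretical_sd = SIGMA / N_OBS ** 0.5           # 2.0

print(round(mean_of_means, 2), round(sd_of_means, 2), theoretical_sd)
```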
sampling distribution if n is too big
shorter tails than normal
truncated
non-normal
covariance
large | Cov(y1, y2) | = greater dependence btw y1, y2
depends on scale of measurement (units)
standardize by correlation
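A short sketch of the scale dependence, using made-up height and weight values. Converting height from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged. The helper functions are written out by hand for illustration.

```python
def covariance(x, y):
    """Sample covariance with the usual n - 1 divisor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance standardized by the two standard deviations."""
    return covariance(x, y) / (covariance(x, x) * covariance(y, y)) ** 0.5

# Hypothetical measurements.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55, 65, 70, 72, 85]
height_cm = [100 * h for h in height_m]

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)    # 100 times cov_m
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)  # same as corr_m

print(cov_m, cov_cm, corr_m, corr_cm)
```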
SRSWR
n independent samples of size 1
may include duplicates
SRSWOR
every possible subset of n from N equally likely to be chosen
what is the probability of selecting an individual sample in SRSWOR
1/ (N choose n)
N choose n
N! / (n! (N-n)!)
n!
product of all positive integers less than or equal to n
ex. 5 ! = 5 × 4 × 3 × 2 × 1 = 120
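The counting formulas above can be checked with a few lines of Python. This is only a sketch; N = 10 and n = 3 are arbitrary illustrative values.

```python
import math

def n_choose_k(N, n):
    """N choose n, computed from the factorial formula N! / (n! (N-n)!)."""
    return math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

N, n = 10, 3                        # illustrative population and sample sizes
num_samples = n_choose_k(N, n)      # number of distinct SRSWOR samples
prob_one_sample = 1 / num_samples   # each sample is equally likely

print(math.factorial(5), num_samples, prob_one_sample)
```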
what is the probability that the ith unit is in the sample (πi)?
n/N
P(ith unit in sample) = n/N = πi
πi =
samples that contain i / total number of possible samples
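For a small population the counting argument can be verified by brute force. The sketch below assumes N = 6 and n = 3 (arbitrary values), lists every possible sample, and counts how many contain each unit.

```python
from fractions import Fraction
from itertools import combinations

N, n = 6, 3                       # illustrative sizes
units = range(1, N + 1)
all_samples = list(combinations(units, n))  # every possible SRSWOR sample

def inclusion_probability(i):
    """(number of samples containing unit i) / (total number of samples)."""
    containing = sum(1 for s in all_samples if i in s)
    return Fraction(containing, len(all_samples))

pi = {i: inclusion_probability(i) for i in units}
print(len(all_samples), pi)       # every unit has probability n/N
```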
ways to draw a SRS
- haphazard sampling
- list all (N choose n) subsets, choose at random
- random number generator
- blind sampling
- draw elements at random, include if not duplicates
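A sketch of two of the approaches listed: a random number generator (here Python's `random.sample`, which draws without replacement) and drawing one element at a time while skipping duplicates. The frame of 20 labelled units is hypothetical.

```python
import random

rng = random.Random(42)                        # fixed seed for repeatability
frame = [f"unit_{k}" for k in range(1, 21)]    # hypothetical sampling frame
n = 5

# Random number generator: sample n distinct units in one call.
srs = rng.sample(frame, n)

def draw_rejecting_duplicates(frame, n, rng):
    """Draw units one at a time; keep a draw only if it is not already chosen."""
    chosen = []
    while len(chosen) < n:
        candidate = rng.choice(frame)
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen

srs_by_rejection = draw_rejecting_duplicates(frame, n, rng)
print(srs, srs_by_rejection)
```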
haphazard sampling
using own judgement to draw a sample
≠ random sample
fpc
finite population correction
1 - (n/N)
when N is large relative to n, fpc is
ca. 1
1 - (n/N) = 1 - (ca. 0)
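A quick numeric sketch of the correction. The variance formula V(ȳ) = (1 - n/N) S²/n is the standard SRSWOR result and is an addition here, not something stated on the cards; the sizes and S² are made up.

```python
def fpc(n, N):
    """Finite population correction, 1 - n/N."""
    return 1 - n / N

def var_ybar_srswor(s2, n, N):
    """Variance of the sample mean under SRSWOR: fpc * S^2 / n (standard result)."""
    return fpc(n, N) * s2 / n

# Same n, very different population sizes (hypothetical values).
small_pop = fpc(100, 200)          # half the population sampled
large_pop = fpc(100, 1_000_000)    # tiny sampling fraction, fpc close to 1

print(small_pop, large_pop, var_ybar_srswor(25.0, 100, 200))
```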
CLT for SRSWOR
n → ∞ and N → ∞
n/N → c, with c less than 1
n, N, N-n must be ‘sufficiently large’
n ≥ 50 usually ok
in experimental design, what is used to reduce variability
blocking (analogous to stratification)
strata
division of population into a number of non-overlapping groups
stratified random sample
SRS drawn from each stratum
advantages of stratification
- may be more precise if sub-populations have different means
- administrative advantages
- can obtain separate estimates of each parameter for each stratum
ai
proportion of the total sample allocated to stratum i (ni = n × ai)
how do we decide ai
small variance
lowest cost
Best allocation is affected by
Ni (# of elements in each stratum)
Si^2 (variability in each stratum)
Cost of obtaining an observation in each stratum
How do factors that affect allocation impact sample size
larger sample sizes to strata w/ larger pop.’s
larger sample sizes to strata w/ larger variability
smaller sample sizes if costs are high
Types of allocation models
Optimal allocation
Neyman allocation
Proportional allocation
Optimal allocation
most information for least cost
choose ni to minimize V(yst) for a fixed C or minimize C for a fixed V(yst)
C = c0 + Σ ci ni
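A sketch of optimal allocation for a fixed budget. It uses the standard textbook result that ni is proportional to Ni·Si/√ci, which is not written on the cards. The stratum sizes, standard deviations, costs, and budget are hypothetical. Allocations are rounded down so the budget is not exceeded.

```python
import math

# Hypothetical strata: sizes, standard deviations, per-observation costs.
N_i = [155, 62, 93]
S_i = [5.0, 15.0, 10.0]
c_i = [9.0, 9.0, 16.0]
c0 = 0.0          # fixed overhead cost (assumed zero here)
budget = 500.0    # total cost C

def optimal_allocation(N_i, S_i, c_i, budget, c0=0.0):
    """Unrounded ni that spend exactly (budget - c0), with ni proportional to Ni*Si/sqrt(ci)."""
    denom = sum(N * S * math.sqrt(c) for N, S, c in zip(N_i, S_i, c_i))
    return [
        (budget - c0) * (N * S / math.sqrt(c)) / denom
        for N, S, c in zip(N_i, S_i, c_i)
    ]

raw = optimal_allocation(N_i, S_i, c_i, budget, c0)
n_i = [math.floor(x) for x in raw]   # round down: do not cross the budget
total_cost = c0 + sum(c * n for c, n in zip(c_i, n_i))

print([round(x, 2) for x in raw], n_i, total_cost)
```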
Neyman allocation
special case of optimal allocation
used when costs are equal in all strata
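A sketch of Neyman allocation, using the standard formula ni = n · Ni·Si / Σ Nk·Sk (optimal allocation with the cost terms cancelled). The strata and the total n = 40 are hypothetical. Each ni is rounded up, so the final total can slightly exceed the planned n.

```python
import math

# Hypothetical strata: sizes and standard deviations; costs assumed equal.
N_i = [155, 62, 93]
S_i = [5.0, 15.0, 10.0]
n = 40            # planned total sample size

def neyman_allocation(N_i, S_i, n):
    """Unrounded ni proportional to Ni * Si."""
    total = sum(N * S for N, S in zip(N_i, S_i))
    return [n * N * S / total for N, S in zip(N_i, S_i)]

raw = neyman_allocation(N_i, S_i, n)
n_i = [math.ceil(x) for x in raw]    # round up each stratum's sample size

print([round(x, 2) for x in raw], n_i, sum(n_i))
```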
Proportional allocation
split sample into strata w/ same proportion as population
ni/n = Ni/N
under proportional allocation, the stratified estimator (yst) equals the plain average of all sampled observations
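A small check of this property with made-up data. It assumes the usual stratified estimator yst = Σ (Ni/N) ȳi. When ni/n = Ni/N, the stratum weights match the sample shares, so yst coincides with the average of all sampled values.

```python
# Hypothetical strata sizes and a total sample of n = 10.
N_i = [50, 30, 20]
N = sum(N_i)
n = 10

# Proportional allocation: ni = n * Ni / N (exact integers for these values).
n_i = [n * Ni // N for Ni in N_i]          # [5, 3, 2]

# Made-up observations, one list per stratum, matching n_i.
samples = [[4, 6, 5, 7, 3], [10, 12, 11], [20, 22]]

stratum_means = [sum(s) / len(s) for s in samples]
y_st = sum(Ni / N * m for Ni, m in zip(N_i, stratum_means))   # stratified estimator
overall_mean = sum(sum(s) for s in samples) / n               # plain average

print(n_i, y_st, overall_mean)
```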
rounding rules
always round up for n, except for optimal allocation (don’t cross budget)