Sampling Flashcards
census data
data describing variables for every single case in the population of interest. Costly.
random sample
A procedure for drawing an equal probability sample: units are selected at random so that every unit in the population has the same probability of selection.
probability/equal probability sample
Probability sample: a sample in which every population unit has a known probability of selection. Equal probability sample: every unit has the same probability of selection.
sampling error
How far a sample statistic for a variable is from the true population value because of “random chance”. The larger the sample size, the more representative the sample and the closer its statistics are to the population’s.
- The data will be off, but not in any systematic way.
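The link between sample size and sampling error can be checked with a small simulation. This is a minimal sketch, not a formal proof: the population (10,000 made-up scores), the sample sizes, and the helper name `mean_abs_sampling_error` are all assumptions chosen for illustration.

```python
import random
import statistics

def mean_abs_sampling_error(population, n, draws, rng):
    """Average absolute gap between the sample mean and the population mean."""
    pop_mean = statistics.fmean(population)
    errors = []
    for _ in range(draws):
        sample = rng.sample(population, n)  # random sample without replacement
        errors.append(abs(statistics.fmean(sample) - pop_mean))
    return statistics.fmean(errors)

rng = random.Random(0)
# Hypothetical population: 10,000 scores from a normal distribution.
population = [rng.gauss(50, 10) for _ in range(10_000)]

small_n_error = mean_abs_sampling_error(population, 25, 500, rng)
large_n_error = mean_abs_sampling_error(population, 400, 500, rng)
print(small_n_error, large_n_error)
```

The error for n = 400 should come out well below the error for n = 25, and in neither case is the sample pushed in one particular direction.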
representative sample
A sample whose results are very similar to what one would have found using census data. Results can be generalized; there is no external validity problem.
non-response bias
Systematic bias arising from missing data on some or all variables for a given sample of cases. Cases with missing data are almost always systematically different from cases with complete data, so the results cannot be generalized to the overall population. The sample is not representative.
selection bias
A data set biased by sampling bias and/or non-response bias, with both external and internal validity problems; the cases missing from the data set are systematically different on the dependent variable.
external validity problem w/ selection bias
The bias in one’s sample is correlated with one’s dependent variable, meaning the sample is systematically different from the population in terms of “y” scores: too many/few high “y” scores or too many/few low “y” scores. The sample clearly has an external validity problem because it is not representative of the population on “y”, so its results cannot be generalized.
internal validity problem w/ selection bias
The correlation between one’s “y” and any “x” variable will be biased toward zero: it will tend to be closer to zero than the correlation one would find in the full set of population data (census). Because the correlation between x and y is the evidence used to judge whether “x” has a causal effect on “y”, causal inferences cannot be trusted when selection bias is present in the sample.
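The pull toward zero can be illustrated with simulated data. The sketch below assumes a hypothetical population where y = x + noise, then keeps only cases with y above 0 to mimic selection on the dependent variable; `pearson_r` is a small helper written here for the example, not a library function.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
# Hypothetical population in which x genuinely affects y.
xs = [rng.gauss(0, 1) for _ in range(20_000)]
ys = [x + rng.gauss(0, 1) for x in xs]
r_population = pearson_r(xs, ys)

# Selection on y: low-y cases are systematically missing from the data set.
kept = [(x, y) for x, y in zip(xs, ys) if y > 0]
r_selected = pearson_r([x for x, _ in kept], [y for _, y in kept])
print(r_population, r_selected)
```

In this setup the correlation in the selected cases is noticeably weaker than in the full population, even though the true x-y relationship never changed.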
post-stratification weights
Identify: weights calculated after the data are collected, based on things that went wrong w/ a sample, ex. non-response bias. They are used to make the sample more representative: if the sample has too many or too few cases from a certain group or groups, post-stratification weights count each of those cases as more or less than one.
Significance: the weighted sample looks more representative, but the weights can make the bias worse. Subjects who did not respond are most likely different from those who did, so counting respondents as more or less than one does not create a truly representative sample.
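A common way to compute a post-stratification weight is population share divided by sample share for each group. The numbers below (a 50/50 population and a sample that came back 70/30) are made up for illustration.

```python
# Hypothetical shares: the population is 50% women, but the achieved sample is 70% women.
population_share = {"women": 0.50, "men": 0.50}
sample_counts = {"women": 70, "men": 30}
n = sum(sample_counts.values())

# Weight = population share / sample share, computed after data collection.
weights = {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

# After weighting, each group's share of the sample matches the population.
weighted_share = {g: weights[g] * sample_counts[g] / n for g in sample_counts}
print(weights)
print(weighted_share)
```

Each woman counts as less than one case and each man as more than one. The weights fix the group totals only; they cannot make the men who responded resemble the men who did not.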
sampling weights
Sampling weights are calculated before data collection and are used to produce a population-correct average when one has oversampled a particular group or groups.
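One standard form of sampling weight is the inverse of the selection probability, i.e. group population size divided by group sample size. The sketch below uses invented group sizes and means to show how the weights recover the population average after oversampling a small group.

```python
# Hypothetical design: the small group is deliberately oversampled (100 cases from each group).
strata = {
    "majority":    {"pop_size": 9000, "sample_size": 100, "sample_mean": 40.0},
    "oversampled": {"pop_size": 1000, "sample_size": 100, "sample_mean": 60.0},
}

def design_weight(stratum):
    """Inverse of the selection probability: how many population units each case stands for."""
    return stratum["pop_size"] / stratum["sample_size"]

total_cases = sum(s["sample_size"] for s in strata.values())
unweighted_mean = sum(s["sample_size"] * s["sample_mean"] for s in strata.values()) / total_cases

weighted_total = sum(design_weight(s) * s["sample_size"] * s["sample_mean"] for s in strata.values())
weight_sum = sum(design_weight(s) * s["sample_size"] for s in strata.values())
weighted_mean = weighted_total / weight_sum
print(unweighted_mean, weighted_mean)
```

The unweighted mean treats the two groups as equally large; the weighted mean gives each group its true 90/10 share of the population.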
stratified random sample
Identify: a type of sampling that breaks the population into groups called strata and draws a random sample within each group.
Significance: Gold standard. Stratified random sampling eliminates sampling error on the stratifying variable by ensuring that the sample distribution is the same as the population distribution on that variable. It also reduces sampling error on any variable correlated with the stratifying variable. The sample’s representativeness increases as the sampling error decreases. More dispersed sample.
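A minimal sketch of proportional stratified sampling, assuming a made-up population of 600 urban and 400 rural units and a 10% sampling fraction. The function name `stratified_sample` and the data layout are illustrative choices.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # random draw within the stratum
    return sample

rng = random.Random(2)
# Hypothetical population: 60% urban, 40% rural.
population = [("urban", i) for i in range(600)] + [("rural", i) for i in range(400)]
sample = stratified_sample(population, lambda unit: unit[0], 0.10, rng)
counts = Counter(unit[0] for unit in sample)
print(counts)
```

Because the split is fixed by design, the sample is 60% urban and 40% rural on every draw, which is what "no sampling error on the stratifying variable" means in practice.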
systematic random sample
- Make a random draw to pick the starting point, then move forward at a fixed interval.
When is this most commonly used? 1. Economic research. 2. Exit polling in politics.
ex: pick a random # from 1-10, say 7. Interview every 7th person you see; who gets interviewed is chosen systematically.
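One common variant of the procedure picks a random start within the first interval and then takes every kth case from a list. The sketch below assumes a hypothetical list of 100 voters and an interval of 10; `systematic_sample` is a name made up for this example.

```python
import random

def systematic_sample(units, interval, rng):
    """Random start within the first interval, then every interval-th unit."""
    start = rng.randrange(interval)  # index 0 .. interval-1
    return units[start::interval]

rng = random.Random(3)
voters = list(range(1, 101))  # hypothetical voters numbered 1-100 in arrival order
sample = systematic_sample(voters, 10, rng)
print(sample)
```

Only the starting point is random; once it is drawn, every other selection follows from the fixed interval.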
cluster sampling
1) randomly select clusters. 2) randomly select cases w/in clusters.
ex: select 10 states, select 50 residents within each state. OR select 25 states, select 20 residents w/in each state.
- The second sample as a whole is less clustered and more dispersed because it covers more locations. Its sampling error is lower because its clusters, taken together, are more representative of the population.
Why use it?
1) less costly (having to travel to fewer places, for ex)
2) forced into cluster sampling when no population list exists.
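The two steps can be sketched as a two-stage draw. This assumes a hypothetical set of 50 clusters with 200 cases each, and compares the two designs from the example above (10 clusters of 50 cases versus 25 clusters of 20 cases); the function name is illustrative.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: randomly select clusters. Stage 2: randomly select cases within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        cases = rng.sample(clusters[name], n_per_cluster)
        sample.extend((name, case) for case in cases)
    return sample

rng = random.Random(4)
# Hypothetical frame: 50 clusters, each with 200 cases.
clusters = {f"cluster_{i:02d}": list(range(200)) for i in range(50)}

concentrated = two_stage_cluster_sample(clusters, 10, 50, rng)
dispersed = two_stage_cluster_sample(clusters, 25, 20, rng)
print(len(concentrated), len({name for name, _ in concentrated}))
print(len(dispersed), len({name for name, _ in dispersed}))
```

Both designs yield 500 cases, but the second spreads them over 25 locations instead of 10. Note that only a list of clusters is needed up front; case lists are needed only for the clusters that get selected.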
convenience sample
Non-probability sample that gathers a group of cases in whatever way is most convenient. The results cannot be generalized to the population, an external validity problem.