data collection Flashcards
what is data collection
Data collection is the systematic process of gathering and measuring information from various sources to answer research questions, test hypotheses, and make inferences.
what is a sampling unit
Sampling unit: an individual object, animal, or person, on which measurements can be made.
what is a target population
The target population is the entire group of sampling units we want to study or make inferences about.
what is a census
A census measures every individual in the target population. It is often an official survey conducted by governments to gather demographic data.
3 advantages of a census
Provides complete and accurate data.
No sampling error since everyone is included.
Useful for policy-making and resource allocation.
What are the disadvantages of a census?
Expensive and time-consuming.
Difficult to access the entire population.
Data may become outdated by the time analysis is complete.
what is a sampling protocol or design
The procedure or strategy used to select sampling units from the target population.
what is a sample
A subset of individuals or sampling units selected from a target population for analysis to estimate parameters or test hypotheses.
what is a variable
A variable is a characteristic of each sampling unit that is measured (e.g., age, blood group, voting preference), usually denoted by lowercase Roman letters (e.g. 𝑥, 𝑦).
what is a parameter
A parameter is a numerical summary of a variable for a population, usually represented by Greek letters (e.g. 𝜇 for the true mean).
what is a statistic/estimate
A statistic (or estimate) is a numerical summary of a variable for a sample, often used to estimate a population parameter (e.g. 𝑥̄ estimates 𝜇).
4 data collection methods
Censuses – measuring the entire target population.
Polls and surveys – collecting responses from a sample.
Randomized designed experiments – manipulating variables under controlled conditions.
Observational studies – collecting data without intervention.
what is a survey
A survey is the process of collecting data from a sample in order to obtain information about the whole population.
what is an opinion poll
An opinion poll assesses public opinion by questioning a random or representative sample. Often used for election forecasting.
Why use a survey instead of a census? (3)
- Cheaper
- Faster
- More practical (accessing the entire population may be difficult or impossible)
what is sampling error
Different samples from the same population give different values of a statistic. This variation between samples is called sampling error, and it is unavoidable unless a census is taken.
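A small simulation can make sampling error concrete. This is only a sketch: the population (the values 1 to 100), the seed, and the helper name `sample_mean` are assumptions chosen for illustration, not part of the course material.

```python
import random
import statistics

# Hypothetical population: the values 1..100, so the true mean (mu) is 50.5.
population = list(range(1, 101))
mu = statistics.fmean(population)

rng = random.Random(42)  # fixed seed so the run is repeatable


def sample_mean(n):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return statistics.fmean(rng.sample(population, n))


# Five samples of size 10: each sample mean generally differs from mu
# and from the other sample means. That variation is sampling error.
sample_means = [sample_mean(10) for _ in range(5)]
errors = [x_bar - mu for x_bar in sample_means]

# A census uses every unit, so its mean equals mu exactly.
census_mean = sample_mean(len(population))

print("true mean:", mu)
print("sample means:", sample_means)
print("errors:", errors)
print("census mean:", census_mean)
```

The sample means scatter around the true mean, while the "sample" that contains the whole population reproduces it exactly.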
Why is random sampling important? (4)
Gives each member of the population an equal chance of selection.
Reduces bias.
Allows calculation of sampling error.
Gives a more representative sample; larger samples improve representativeness.
What are accuracy, precision, and bias in statistics?
Accurate – Sample statistic is similar to the population parameter.
Precise – Statistic is consistent across multiple samples. A lack of precision may arise from sampling error, for example when sample sizes are very small.
Biased – Sample statistic tends to differ from the population parameter in a consistent way (there is a systematic error).
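The difference between precision and bias can be illustrated with repeated sampling. The sketch below is an illustration under assumptions: the population (1 to 100), the sample sizes, the number of repetitions, and the "upper half only" sampling frame are all hypothetical choices.

```python
import random
import statistics

population = list(range(1, 101))   # hypothetical population, true mean 50.5
mu = statistics.fmean(population)
rng = random.Random(0)             # fixed seed so the run is repeatable


def repeated_means(frame, n, reps=200):
    """Sample means from `reps` simple random samples of size n taken from `frame`."""
    return [statistics.fmean(rng.sample(frame, n)) for _ in range(reps)]


small = repeated_means(population, 5)             # unbiased, but imprecise
large = repeated_means(population, 50)            # unbiased and more precise
upper_only = repeated_means(population[50:], 20)  # biased: units 1..50 can never be chosen

for label, means in [("n=5", small), ("n=50", large), ("upper half only", upper_only)]:
    print(label,
          "| average estimate:", round(statistics.fmean(means), 2),
          "| spread (sd):", round(statistics.stdev(means), 2))
```

The two random samples should centre near the true mean, with the larger sample showing less spread. The restricted frame gives estimates that are always too high, and taking more samples from it would not remove that systematic error.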
what is the goal in sampling
To select a sample that reflects the variation in the whole population without sampling the entire population.
Why is careful sampling important?
Poor data collection can lead to flawed conclusions.
A well-chosen sample allows for accurate and robust decisions.
Uncertainty is inherent in sampling, so methods must minimize errors.
3 different sample strategies
Simple random sampling
Systematic random sampling
Stratified random sampling
what is simple random sampling
A method where each individual in the population has an equal chance of being selected.
What is the formula for the probability of selection in simple random sampling?
The chance, or probability, of being selected in a sample of size 𝑛 from a population of size 𝑁 is:
chance of selection=𝑛/𝑁
What is the probability of a student being selected in a sample of 20 students from a class of 130?
20/130 = 0.1538 = 15.38%
Given a University of St Andrews population of 13,484, what is the probability of being selected in a sample of 650?
650/13484 = 0.0482 = 4.82%
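The two calculations above can be checked in a few lines of Python, and the standard library can also draw the sample itself. This is a minimal sketch: the function name `chance_of_selection`, the unit labels 1 to 130, and the seed are assumptions for illustration.

```python
import random


def chance_of_selection(n, N):
    """Probability that a given unit is included in a simple random sample of size n from N."""
    return n / N


print(f"{chance_of_selection(20, 130):.2%}")     # class example: 15.38%
print(f"{chance_of_selection(650, 13484):.2%}")  # university example: 4.82%

# Drawing the sample itself: units are labelled 1..N (hypothetical labels)
# and n of them are chosen without replacement, each equally likely.
rng = random.Random(2024)
sample = rng.sample(range(1, 131), 20)
print(sorted(sample))
```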
what is systematic random sampling
A method where a sample is selected at regular intervals after a random start.
What are the advantages of systematic sampling?
Easier to implement (only one random number needed).
Ensures even distribution of the sample across the population.
Works well for time-based selection (e.g., traffic monitoring).
How do you calculate the fixed periodic interval (𝑘) in systematic sampling?
𝑘=𝑁/𝑛
𝑁 = Population size
𝑛 = Sample size
What is the fixed periodic interval 𝑘 for a sample of 650 from 13,484 individuals?
If the random start is 𝑞=9, what are the first three individuals selected using systematic sampling?
𝑘 = 13,484 / 650 = 20.74 ≈ 21
So, every 21st individual is selected after a random start.
𝑞, 𝑞+𝑘, 𝑞+2𝑘
9, 9+21, 9+2(21)
The first three selected individuals are 9, 30, and 51.
what is the process of systematic random sampling
Suppose there are 𝑁=1,000 sampling units in the target population and we want to take a sample of size 𝑛=20. A systematic sample can be selected as follows:
- Calculate 𝑘, the fixed periodic interval. This is the interval between successive samples. 𝑘=𝑁/𝑛 e.g. 𝑘=1000/20=50
- Randomly pick a starting number from 1 to 𝑘, inclusive, call it 𝑞. For this example, we want a number from 1 to 50, say 3 was chosen at random, 𝑞=3.
- Sample the 𝑞th individual, then the (𝑞+𝑘)th, then the (𝑞+2𝑘)th and so on. Therefore, the sample is 3, 53, 103, 153, …, 903, 953
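The three steps above can be sketched as a short function. It is an illustration only: the name `systematic_sample`, the choice to round 𝑘 to the nearest whole number, and the rule of stopping at the end of the population are assumptions rather than a prescribed method.

```python
import random


def systematic_sample(N, n, q=None, rng=random):
    """Return the unit numbers chosen by systematic sampling.

    N: population size, n: target sample size,
    q: random start in 1..k (drawn at random when omitted).
    """
    k = round(N / n)             # fixed periodic interval
    if q is None:
        q = rng.randint(1, k)    # random start from 1 to k, inclusive
    # q, q + k, q + 2k, ... stopping after n units or at the end of the population
    return [q + i * k for i in range(n) if q + i * k <= N]


sample = systematic_sample(1000, 20, q=3)
print(sample[:4], "...", sample[-2:])   # [3, 53, 103, 153] ... [903, 953]
```

Calling `systematic_sample(13484, 650, q=9)` starts with 9, 30 and 51, matching the university example. Because 𝑘 was rounded up to 21 there, the list runs out of population slightly before 650 units are reached.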
What is a potential issue with systematic sampling?
If the population has periodic variation and the fixed interval k matches the pattern, the sample could be biased.
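A made-up example shows how this can go wrong. The data below are entirely hypothetical: daily counts with a weekly cycle (10 on weekdays, 100 at weekends), sampled with an interval of 𝑘=7 that matches the cycle.

```python
import statistics

# Hypothetical daily counts for 52 weeks: 10 on weekdays, 100 at weekends.
# Day 1 is taken to be a Monday, so days 6, 7, 13, 14, ... are weekend days.
days = range(1, 365)
counts = {day: (100 if (day - 1) % 7 >= 5 else 10) for day in days}

true_mean = statistics.fmean(list(counts.values()))

k = 7   # interval equal to the length of the weekly cycle
q = 6   # a random start that happens to land on a Saturday
sampled_days = list(range(q, 365, k))
sample_mean = statistics.fmean([counts[day] for day in sampled_days])

print("true mean:", round(true_mean, 2))        # 35.71
print("systematic sample mean:", sample_mean)   # 100.0
```

Every sampled day is a Saturday, so the estimate is far too high. A start of 1 to 5 would have selected only weekdays and given an estimate that is too low.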
what is stratified random sampling
A method where the population is divided into strata (categories), and random samples are taken from each stratum.
3 advantages of stratified random sampling
Ensures representation of all groups.
Provides better precision than simple random sampling.
Can be more convenient to organise
How do you calculate the proportion of a stratum (𝑝𝑖)?
𝑝𝑖=𝑁𝑖/𝑁
where 𝑁𝑖 is the total number of units in stratum 𝑖.
How do you calculate the sample size for a stratum (𝑛𝑖)?
𝑛𝑖=𝑛×𝑝𝑖
n = Total sample size
𝑝𝑖 = Proportion of stratum 𝑖
How many staff members should be sampled from a total sample of 650, if there are 3,250 staff in a 13,484-person university?
Calculate proportion of staff: 3250/13484 = 0.241
Calculate sample size for staff: 650 × 0.241 = 156.65 ≈ 157
157 staff members should be sampled.
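The same allocation can be computed for every stratum at once. This is a sketch under assumptions: the function name `proportional_allocation` is hypothetical, and the "non-staff" stratum is simply the remainder of the 13,484 people after removing the 3,250 staff.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n across strata in proportion to stratum size."""
    N = sum(stratum_sizes.values())
    allocation = {}
    for name, N_i in stratum_sizes.items():
        p_i = N_i / N                      # proportion of stratum i
        allocation[name] = round(n * p_i)  # sample size for stratum i
    return allocation


# The staff count is from the example; "non-staff" is the remainder (an assumption).
strata = {"staff": 3250, "non-staff": 13484 - 3250}
allocation = proportional_allocation(strata, 650)
print(allocation)   # {'staff': 157, 'non-staff': 493}
```

Here the rounded stratum sizes add up to 650. With more strata, rounding can leave the total one or two units off, in which case the allocation needs a small manual adjustment.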
why are sampling errors unavoidable
There is always some difference between the sample statistic used as an estimate and the true parameter for the population.
This difference arises from random variation between samples.
We want random variation to be the only source of any difference between the sample and the truth.
6 non sampling errors
- selection bias
- non response bias
- self selection bias
- question effects
- survey format
- behavioural considerations
what is selection bias
A non-representative sample due to flawed selection methods.
what is non response bias
When some selected participants do not respond, and non-respondents tend to differ from those who do respond.
what is self selection bias
When people choose to participate, and those who do may differ from those who don’t.
Self-selection bias can be a problem in more ‘serious’ studies, because much behavioural research on people can only use volunteers, for ethical reasons.
what is question bias
The way questions are worded affects responses.
4 ways to prevent question bias
Fixed wording helps reduce bias.
Questions should not lead or prompt respondents.
Logical order prevents confusion.
Keep the questionnaire short: the respondent may get tired or bored if there are too many questions.
How can survey format affect responses?
The way a survey is conducted (postal, online, telephone, in-person) can influence answers.
What are some behavioural factors that influence survey responses?
Social stigma (e.g., “Have you ever been arrested?”)
Social prestige (e.g., “How much do you earn?”)
Social acceptability (e.g., “How much alcohol do you drink per week?”)
Misunderstanding or misremembering questions
How can we improve survey accuracy?
Ask participants to keep a daily diary for accurate reporting.
Use neutral wording to avoid social desirability bias.
Pre-test the questionnaire for clarity and consistency.
what is an experiment
A study where conditions are controlled to test the effect of a treatment.
What are some examples of designed experiments?
Medical study: Testing if a drug improves survival rates.
Physics study: Measuring how heat affects electrical resistance.
Agricultural study: Checking if fertilizer increases crop yield.
What two types of variation exist in experiments?
Variation due to treatment (effect of drug, heat, fertilizer, etc.)
Natural variation (differences between individuals, external factors)
what is a randomised designed experiment
A randomised experiment is where the researcher randomly allocates treatments to sampling units to ensure groups are similar, and any differences in response can be attributed to the treatment.
What is the purpose of random allocation in an experiment?
Random allocation ensures that treatment groups are similar on average, avoiding bias and ensuring that differences in response can be attributed to the treatment.
What is the role of randomisation in an experiment?
Randomisation ensures treatment groups are similar, eliminating selection bias and improving the credibility of results.
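Random allocation is straightforward to carry out in code. The sketch below is one possible approach, not a prescribed procedure: the function name `randomly_allocate`, the patient IDs, and the seed are hypothetical.

```python
import random


def randomly_allocate(units, treatments=("treatment", "control"), seed=None):
    """Shuffle the units, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


patients = [f"P{i:02d}" for i in range(1, 21)]   # hypothetical patient IDs
groups = randomly_allocate(patients, seed=1)
for name, members in groups.items():
    print(name, sorted(members))
```

Because chance alone decides which group each unit joins, neither the researcher nor the subjects can influence the make-up of the groups.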
What is replication in a randomised designed experiment?
Replication involves repeating the experiment multiple times to:
Assess natural variation.
Increase precision (more replicates = more precise but higher cost).
What is blocking in experiments?
Blocking involves partitioning sampling units into different strata or groups (e.g., male/female) before random allocation to treatments. This reduces natural variability, improving precision.
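Blocking can be sketched by randomising separately inside each block. The code is illustrative only: the function name `blocked_allocation`, the subject IDs, and the two blocks "F" and "M" are assumptions.

```python
import random
from collections import defaultdict


def blocked_allocation(units, block_of, treatments=("treatment", "control"), seed=None):
    """Randomly allocate treatments separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of[unit]].append(unit)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                      # random order within the block
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


# Hypothetical subjects: S01..S06 are in block "F", S07..S12 in block "M".
subjects = [f"S{i:02d}" for i in range(1, 13)]
block_of = {s: ("F" if i < 6 else "M") for i, s in enumerate(subjects)}
assignment = blocked_allocation(subjects, block_of, seed=7)
for s in subjects:
    print(s, block_of[s], assignment[s])
```

Each block ends up with the same number of units on each treatment, so a difference between blocks cannot be mistaken for a treatment effect.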
what is a placebo
A placebo is a substance with no active effect (e.g., sugar pills) given to a control group to avoid psychological effects influencing the results.
what is double blinding in clinical trials
The doctor doesn’t know which treatment is being given.
The patient doesn’t know which treatment they are receiving.
two methods used in clinical trials
placebo
double blinding
What is the significance of randomised experiments in establishing causation?
If a randomised experiment shows a significant effect, it provides strong evidence for causation, i.e., the treatment caused the observed effect.
What was the design of Jonas Salk’s polio vaccine trial in 1954?
The trial used a stratified randomised design, where children were randomly allocated to the treatment or control group.
Treatment group: Received the vaccine.
Control group: Received a saline solution.
Double-blind: Neither the participants nor the doctors knew who received the vaccine and who received the saline solution.
Results: The vaccine was shown to be effective in reducing polio cases.
What are the 5 key components of a randomised designed experiment?
Randomisation: Assign subjects randomly to treatment groups.
Replication: Repeat the treatment on multiple subjects to assess natural variation.
Control group: A group that does not receive the active treatment.
Blocking/Stratification: Group subjects by similar characteristics before randomisation, i.e. experimental units may first be divided into groups (e.g. on the basis of age).
Causality: A significant result can suggest causality between treatment and effect.
what is an observational study, and how does it differ from a randomised experiment?
In an observational study, the researcher does not control conditions. They simply observe and measure variables.
In a randomised experiment, the researcher manipulates the variables and controls conditions in order to establish causality.
what is a confounding variable
A confounding variable is an unmeasured factor that affects both the potential cause and the effect, making it difficult to determine the true relationship between the variables.
challenge in observational studies
Confounding variables can make it difficult to argue for causality.
why are observational studies used as evidence for an effect
It may be unethical to carry out a randomised experiment
It may be difficult to carry out a randomised experiment
What are the 3 types of observational studies?
- cohort studies
- case control studies
- cross sectional studies
what is a cohort study
A cohort is any group of people who are linked in some way. Researchers track the group over time and compare outcomes based on exposure to a certain variable.
what is a case control study
Compares individuals with a health issue (cases) to those without it (controls) to study exposure.
what is a cross sectional study
Measures variables at a specific point in time to understand prevalence.
what is a Retrospective Cohort Study
Looks at historical data to assess past exposure and outcomes.
what can cohort studies be
prospective or retrospective
what are prospective cohort studies
none of the subjects have the outcome of interest (e.g. disease) when the study commences; the subjects are followed over a period of time to determine whether the disease develops.
what do case control studies tend to be
Case-control studies tend to be retrospective and examine previous exposure in relation to the outcome.
what is almost impossible to do with an observational study
infer causation because it is difficult to exclude the possible effects of confounding variables.
advantages of observational studies (5)
cheaper
less time-consuming
effects can be investigated that would be unethical to manipulate
they reflect real-world scenarios
they are suited for long-term studies or assessing trends over time.
advantages of experiments (5)
easier to establish causality. Changes can be more confidently attributed to the treatment.
in a controlled environment the influence of confounding variables is reduced.
often easier to perform again, increasing the ability to compare and confirm results.
minimise bias (with randomisation) to ensure that treatment groups are comparable.
allow the investigation of variables that might not occur naturally.