Stat: Data Collection Flashcards
Primary data
The data is collected by, or on behalf of, the person who is going to use the data.
Secondary Data
The data is not collected by, nor on behalf of, the person who is to use the data. The data is second-hand.
Population
The whole set of individuals or items that are of interest.
Census
Every member of the population is observed or measured.
Sample
- A carefully selected sub-set of the population.
- It should be representative of the population.
Sample Survey
This is where information about the population is inferred from the information obtained from a sample.
Sampling Unit
An individual member of the population.
Sampling Frame
A list identifying every single sampling unit that is in the population.
Random Sample meaning
Every possible sample of size n has an equal chance of being selected.
Simple Random Number Sampling
- Each item or individual is given a number.
- Selection is then carried out via random number tables or generators.
- (random numbers generated using a calculator or Electronic Random Number Indicator Equipment, ERNIE)
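The numbering-and-selection steps above can be sketched in Python. This is a minimal illustration, not a prescribed method: the frame of 100 numbered units, the sample size of 10, and the function name simple_random_sample are all assumptions made for the example, and Python's random module stands in for a random number table.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Pick n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # the seed is only there to make the example repeatable
    return rng.sample(frame, n)

# Hypothetical sampling frame: 100 units numbered 1 to 100.
frame = list(range(1, 101))
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```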
Systematic Sampling
Elements are chosen at regular intervals from an ordered list.
- (if the sample size is 20 from a population of 100, you'd take every fifth person, since 100/20 = 5)
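A rough sketch of systematic sampling for the 20-from-100 case. The random starting point within the first interval is a common convention that the card does not state, so treat it as an assumption; the function name and the numbered population are also hypothetical.

```python
import random

def systematic_sample(ordered_list, n, seed=None):
    """Take every k-th element, where k = population size // sample size."""
    k = len(ordered_list) // n        # sampling interval, here 100 // 20 = 5
    rng = random.Random(seed)
    start = rng.randrange(k)          # assumed: random start within the first interval
    return ordered_list[start::k][:n]

# Hypothetical ordered list of 100 people, numbered 1 to 100.
population = list(range(1, 101))
sample = systematic_sample(population, 20, seed=0)
print(sample)
```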
Stratified Sampling
- Population is divided into mutually exclusive strata.
- A simple random sample is taken from each stratum.
- The proportion of each stratum in the sample is the same as that in the population.
Stratified sampling formula
Number sampled in a stratum = (number in stratum / number in population) × overall sample size
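As a worked example of the formula, here is a short sketch. The stratum names and sizes (100, 60 and 40 people, overall sample of 40) are made up for illustration. With less convenient numbers the rounded stratum sizes may not add up exactly to the overall sample size.

```python
def stratum_sample_sizes(stratum_sizes, sample_size):
    """Apply: number in stratum / number in population x overall sample size."""
    population = sum(stratum_sizes.values())
    return {
        name: round(size * sample_size / population)
        for name, size in stratum_sizes.items()
    }

# Hypothetical strata: a population of 200 split into three groups.
strata = {"Year 11": 100, "Year 12": 60, "Year 13": 40}
sizes = stratum_sample_sizes(strata, 40)
print(sizes)
```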
Quota Sampling
- Population is divided into groups by gender, social class, etc.
- Number of individuals selected is set to reflect the population.
Opportunity sampling
- taking the sample from people who are available at the time the study is carried out
Census Advantages
- It should give a completely accurate result.
Census Disadvantages
- It is time consuming and expensive.
- It cannot be used when testing leads to destruction (e.g. testing the lifetime of batteries).
- lots of data, so can be difficult to process.
Sample survey Advantages
- It is cheaper than a Census.
- Results are quicker compared to a Census.
- Less data to deal with than a Census.
Sample survey Disadvantages
- The data may not be as accurate, since the sample may not be representative of the population.
- Sample may not be large enough to provide information about the small sub-groups of the population.
Random sampling with replacement
Each unit is returned to the population before the next selection is made. So each unit can appear more than once in the sample.
Sampling without replacement
If a unit is selected, it’s not replaced. So for each draw only the sampling units that have not been selected previously are eligible for the next draw.
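The difference between the two schemes can be seen in a small sketch. The five lettered units are hypothetical; random.choices draws with replacement and random.sample draws without.

```python
import random

rng = random.Random(42)               # fixed seed so the example is repeatable
units = ["A", "B", "C", "D", "E"]     # hypothetical sampling units

# With replacement: a unit goes back after each draw, so repeats are possible.
with_replacement = rng.choices(units, k=5)

# Without replacement: a selected unit is not eligible again, so no repeats.
without_replacement = rng.sample(units, k=5)

print(with_replacement)
print(without_replacement)
```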
Quota sampling advantages
- quick since a representative sample can be achieved with a small sample size
- cheap
- easy
Quota sampling disadvantages
- the person picking the sampling units can introduce bias
- inaccurate, since it's impossible to estimate the sampling errors as the process is not random
- non-responses are not recorded
Opportunity sampling advantages
- easy to carry out
- cheap
Opportunity sampling disadvantages
- unlikely to provide a representative sample
- dependent on individual researcher
Measures of location
Comparing mean and median
Measures of spread
Standard deviation and IQR
Which measures to use together?
Mean and standard deviation
Median and IQR
What does it mean if the mean is greater than the median?
- There are fewer large distances
- The distribution is positively skewed
The higher the standard deviation/IQR…
The greater the spread
Which month has the least sunshine?
October, since it has the least sunshine
A stratified sample must have…
a sampling frame
Difference between stratified sampling and quota sampling
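Same method of dividing the population into groups, but stratified sampling selects randomly within each group and quota sampling does not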
Stratified vs quota
- sampling frame is required vs not required
- random sampling error can be estimated vs cannot be estimated
When to use median and IQR?
If there are outliers (or the data is skewed), use the median and IQR, since outliers affect the mean and standard deviation
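A small sketch of why this matters, using made-up data: adding one outlier moves the mean a long way but barely moves the median. The values below are hypothetical and chosen only to make the effect obvious.

```python
import statistics

data = [10, 12, 13, 14, 15]           # hypothetical data with no outliers
with_outlier = data + [100]           # the same data plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)

print("mean moved by", mean_shift)
print("median moved by", median_shift)
```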
How will extra values below the median affect values?
- Q2 will be lower
- Q3 will be lower
Statistic
A random variable that is a function of a random sample that contains no unknown parameters
Explain what you understand by the sampling distribution of Y
The probability distribution of Y
What’s not a statistic?
An expression that contains unknown parameters
Sampling distribution
All the possible values of a statistic, together with their associated probabilities
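To make the definition concrete, here is a sketch that lists a sampling distribution in full. It assumes a tiny hypothetical population {1, 2, 3}, samples of size 2 drawn with replacement (so all 9 ordered samples are equally likely), and the sample mean as the statistic.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3]    # hypothetical population, each value equally likely
n = 2                     # sample size

# Every ordered sample of size n drawn with replacement.
samples = list(product(population, repeat=n))

# Count how often each value of the statistic (the sample mean) occurs.
counts = Counter(Fraction(sum(s), n) for s in samples)

# Value of the statistic -> its probability.
sampling_distribution = {
    value: Fraction(count, len(samples))
    for value, count in sorted(counts.items())
}

for value, probability in sampling_distribution.items():
    print(value, probability)
```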
Give a reason we should include outliers and a reason why we shouldn’t
- it’s a piece of data so we should consider all data
- it’s an outlier that could affect the results
The range of distances in m that corresponds to the recorded value 0 for daily mean visibility
0-500m
Use mean or median to analyse data
If there are outliers, use the median, since outliers affect the mean
Quota vs stratified
Stratified: Take a (simple) random sample from (mutually exclusive) groups of the population, with sample sizes within strata in strict proportion to the numbers in each stratum in the population
Quota: Non-random sampling from groups of the population
Scatter diagram
X
High standard deviation
data are more spread out
Low standard deviation
clustered around the mean
Extrapolation
estimating an unknown value by extending a trend beyond the range of the known values
Dangers of extrapolation
- can be unreliable, since the trend might not continue (especially when there are disparities in the existing data sets)
- doesn’t account for qualitative factors that can trigger changes in future values, and hardly accounts for causal factors in the observation
How to know if a PMCC value is wrong
If it is greater than 1 or less than -1
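A sketch of the check, assuming the usual definition of the PMCC as Sxy / sqrt(Sxx × Syy). The paired data and the helper names pmcc and is_valid_pmcc are hypothetical; the point is only that a correctly calculated value always lies between -1 and 1.

```python
import math

def pmcc(xs, ys):
    """Product moment correlation coefficient: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def is_valid_pmcc(r):
    """A PMCC outside the range -1 to 1 must come from a calculation error."""
    return -1 <= r <= 1

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pmcc(xs, ys)
print(r, is_valid_pmcc(r))
```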
State two variables from the large data set that are not suitable to be modelled by a normal distribution.
- daily wind speed (Beaufort) since it is qualitative data
- rainfall (since not symmetric)
Comment on the suitability of Sara’s sampling method of this study
Too few days measured, so too little data
Suggest how Sara could make better use of the large data set for her study
Use more data from more of the UK locations and more of the months
From your knowledge of the large data set, explain why this process may not generate a large enough sample size
In the large data set, some days might have gaps because the data was not recorded
Big IQR/range or spread
Larger standard deviation
If the median increases but the mean (22.5) stays the same, suggest values
Both values must be greater than median and values must add to 45
Additional values are added. Explain why the standard deviation will be lower
Both values must be less than 1 standard deviation from the mean
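The last two cards can be checked together on made-up numbers. The original data set below is hypothetical, chosen to have mean 22.5. The two added values (21 and 24) sum to 45, are both above the original median, and both lie within one standard deviation of the mean. The population standard deviation (pstdev) is used here as an assumption.

```python
import statistics

original = [10, 15, 20, 45]    # hypothetical data with mean 22.5
added = [21, 24]               # sum to 45, both above the original median
updated = original + added

mean_before = statistics.mean(original)
mean_after = statistics.mean(updated)
median_before = statistics.median(original)
median_after = statistics.median(updated)
sd_before = statistics.pstdev(original)
sd_after = statistics.pstdev(updated)

print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("standard deviation:", sd_before, "->", sd_after)
```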
Non random sampling methods
Opportunity or quota
Why might a stratified random sampling not be used?
It is not possible to have a sampling frame
Qualitative variables
Wind speed
Smaller mean but larger standard deviation