STATS Flashcards
What is a population?
The whole set of items of interest
What is a census?
A census observes every member of a population
What is a sample?
A selection of observations taken from a subset of the population, used to find out information about the population as a whole
Pros and cons of a census
It should give a completely accurate result
Time consuming
Cannot be used when testing involves destruction of item
Hard to process such a large quantity of data
Pros and cons of a sample
Less time consuming than a census
Fewer people required to respond
Less data to process than in a census
Data may not be as accurate as census
Sample may not be large enough to give information about small subgroups of the population
How can the size of a sample affect its validity?
Larger sample - more accurate, more resources needed
A varied population needs a larger sample than a uniform population
Different samples lead to different conclusions due to natural variation in population
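A rough simulation of the point about sample size and natural variation. The population (10,000 values around 50) and the sample sizes 5 and 100 are made up for illustration; the idea is only that means from larger samples vary less from sample to sample.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, repeats, rng):
    """Take many samples of the same size and measure how much their means vary."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)

rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(50, 10) for _ in range(10_000)]  # made-up population

small = spread_of_sample_means(population, 5, 200, rng)
large = spread_of_sample_means(population, 100, 200, rng)
print(f"spread of means, n=5:   {small:.2f}")
print(f"spread of means, n=100: {large:.2f}")
```

Different samples give different means in both cases, but the larger samples should agree with each other much more closely.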
What is a sampling unit?
Individual units of a population
What is a sampling frame?
A list in which sampling units of a population are named/numbered
Characteristics of random sampling
+ what are the three types
Every member of the population has an equal chance of being selected - sample is therefore representative, and should be free of bias
Simple random
Systematic
Stratified
How to perform simple random sampling
+pros/cons
Use a sampling frame, e.g. a list of the units. Each unit is allocated a specific number and a number is selected at random. Can be done using a random number generator (if you generate a repeated number, ignore and go again)
No bias, easy to implement for small populations/samples, each unit has a known and equal chance of selection.
not suitable for large populations/sample sizes
Sampling frame required
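A minimal sketch of the simple random sampling steps above, assuming the sampling frame is just a Python list. The function name, the made-up frame of 50 students and the seed are all illustrative.

```python
import random

def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Number the units 1..N, then draw random numbers until enough distinct ones are found."""
    if sample_size > len(sampling_frame):
        raise ValueError("sample size cannot exceed the size of the sampling frame")
    rng = random.Random(seed)
    chosen_numbers = set()
    while len(chosen_numbers) < sample_size:
        number = rng.randint(1, len(sampling_frame))  # inclusive at both ends
        chosen_numbers.add(number)  # a repeated number changes nothing, so it is ignored
    return [sampling_frame[number - 1] for number in sorted(chosen_numbers)]

frame = [f"student_{i}" for i in range(1, 51)]  # hypothetical sampling frame
sample = simple_random_sample(frame, 5, seed=1)
print(sample)
```

In practice `random.sample(frame, 5)` does the same job in one call; the loop is written out here only to mirror the "ignore repeats and go again" step.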
How to carry out systematic sampling
+pros/cons
Required units are chosen at regular intervals from a randomly ordered list.
If a sample of size 20 was required from a population of 100, the interval is 100 ÷ 20 = 5: choose a random starting number between 1 and 5, then pick every 5th item after it.
Simple, suitable for large samples/populations
Sampling frame needed, bias could be present if list isn’t randomly ordered
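A sketch of systematic sampling, assuming the sampling frame is a list and the interval is population size ÷ sample size (rounded down). The frame of 100 numbered units is made up.

```python
import random

def systematic_sample(sampling_frame, sample_size, seed=None):
    """Pick a random start within the first interval, then take every k-th unit."""
    if sample_size > len(sampling_frame):
        raise ValueError("sample size cannot exceed the size of the sampling frame")
    rng = random.Random(seed)
    interval = len(sampling_frame) // sample_size  # e.g. 100 // 20 = 5
    start = rng.randint(1, interval)  # random start between 1 and the interval
    positions = range(start, start + interval * sample_size, interval)
    return [sampling_frame[position - 1] for position in positions]

frame = list(range(1, 101))  # hypothetical frame: units numbered 1 to 100
sample = systematic_sample(frame, 20, seed=0)
print(sample)
```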
How to carry out stratified sampling
+pros/cons
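Divide the population into mutually exclusive strata, e.g. males and females, and take a random sample from each stratum in proportion to its size
Number sampled from a stratum = (stratum size ÷ population size) × overall sample size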
Sample accurately reflects population structure. Proportional representation guaranteed.
Population must be clearly classified into distinct strata. Selection in each stratum suffers same cons as simple random
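A sketch of proportional stratified sampling, assuming each stratum is stored as a list of units. The strata (60 males, 40 females) and the sample size of 10 are made up; rounding is used for the stratum counts, so for awkward proportions the total may differ slightly from the requested size.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """strata maps a stratum name to the list of units in that stratum."""
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # number from stratum = (stratum size / population size) * sample size
        count = round(len(units) * sample_size / population_size)
        sample[name] = rng.sample(units, count)  # simple random sample within the stratum
    return sample

strata = {
    "male": [f"m{i}" for i in range(60)],    # hypothetical stratum of 60
    "female": [f"f{i}" for i in range(40)],  # hypothetical stratum of 40
}
sample = stratified_sample(strata, 10, seed=2)
print({name: len(units) for name, units in sample.items()})  # {'male': 6, 'female': 4}
```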
Types of non random sampling
Quota sampling - interviewer selects a sample that reflects the characteristics of the whole population. Divide the population into groups according to a given characteristic; the size of each group determines the proportion of the sample with that characteristic. The interviewer meets people, assesses their group, interviews them, then allocates them to the correct quota. Repeat until each quota is filled. If someone refuses to answer, just move on to the next person.
Opportunity sampling - taking the sample from people available at the time who fit the criteria, e.g. standing outside Tesco to ask people if they shop at Tesco 3x a week
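The quota procedure can be sketched as a loop over people in the order the interviewer meets them. The group names, the quotas and the stream of passers-by below are invented for illustration.

```python
def quota_sample(passers_by, quotas):
    """passers_by: (person, group, willing) tuples in meeting order.
    quotas: required number of people for each group."""
    filled = {group: [] for group in quotas}
    for person, group, willing in passers_by:
        if not willing:
            continue  # refusal: move on to the next person, nothing is recorded
        if group in filled and len(filled[group]) < quotas[group]:
            filled[group].append(person)  # allocate to the matching quota
        if all(len(filled[g]) == quotas[g] for g in quotas):
            break  # stop once every quota is full
    return filled

stream = [  # hypothetical passers-by
    ("A", "under_30", True),
    ("B", "30_plus", False),
    ("C", "under_30", True),
    ("D", "30_plus", True),
    ("E", "under_30", True),
    ("F", "30_plus", True),
]
result = quota_sample(stream, {"under_30": 2, "30_plus": 2})
print(result)  # {'under_30': ['A', 'C'], '30_plus': ['D', 'F']}
```

Nobody is chosen at random here: who ends up in the sample depends entirely on who walks past first, which is where the bias comes from.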
Pros/cons of quota sampling
Allows for comparison between different groups
Allows a small sample to still be representative of population
No sampling frame needed
Quick, easy and cheap
Non random sampling can introduce bias
Population must be divided into groups - could be costly or inaccurate
Increasing scope of study increases number of groups - can be costly/time consuming
Non-responses aren’t recorded
Pros/cons of opportunity sampling
Easy to carry out, inexpensive
Unlikely to provide a representative sample, results depend on interviewer (chariz)
Quantitative vs Qualitative data
Discrete vs continuous
Quantitative - variables/data associated with numerical values
Qualitative - associated with non numerical values
Continuous - can take any value in a given range
Discrete - can only take specific values
What is variance?
remember proof
The average squared distance from the mean
(A measure of spread that takes all values into account, using the fact that each data point deviates from the mean by an amount x-x̄)
https://www.youtube.com/watch?v=9EgRztlWQH4
Variance equation + units
Variance (σ²) = Σ(x-x̄)²/n
= Sₓₓ/n (Sₓₓ is a summary statistic)
or Σx²/n - (Σx/n)²
units are in units of data squared
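A quick check that the definition Σ(x-x̄)²/n and the shortcut Σx²/n - (Σx/n)² give the same variance. The data set is made up, and the function names are just for this sketch.

```python
import math

def variance_from_definition(data):
    """Mean of the squared distances from the mean: S_xx / n."""
    n = len(data)
    mean = sum(data) / n
    s_xx = sum((x - mean) ** 2 for x in data)
    return s_xx / n

def variance_shortcut(data):
    """Mean of the squares minus the square of the mean."""
    n = len(data)
    return sum(x * x for x in data) / n - (sum(data) / n) ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5
var_definition = variance_from_definition(data)
var_shortcut = variance_shortcut(data)
std_dev = math.sqrt(var_definition)
print(var_definition, var_shortcut, std_dev)  # 4.0 4.0 2.0
```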
What is standard deviation?
The square root of the variance: σ = √(Sₓₓ/n)
Variance for grouped data
Σf(x-x̄)² ÷ Σf
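basically find the mean of the data, then for each value (or class midpoint) multiply its squared distance from the mean by its frequency, add these up and divide by the total frequency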
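A sketch of Σf(x-x̄)² ÷ Σf for a frequency table, assuming x is the class midpoint for grouped data. The midpoints and frequencies are made up.

```python
import math

def grouped_variance(midpoints, frequencies):
    """Variance for a frequency table: sum of f(x - mean)^2 divided by sum of f."""
    total_frequency = sum(frequencies)
    mean = sum(f * x for x, f in zip(midpoints, frequencies)) / total_frequency
    weighted_squares = sum(
        f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies)
    )
    return weighted_squares / total_frequency

midpoints = [5, 15, 25]   # hypothetical class midpoints
frequencies = [2, 5, 3]   # hypothetical frequencies, total 10

variance = grouped_variance(midpoints, frequencies)
print(variance, math.sqrt(variance))  # 49.0 7.0
```

Here the mean is 160 ÷ 10 = 16, and the weighted squared distances are 2×121 + 5×1 + 3×81 = 490, giving 490 ÷ 10 = 49.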
Effects of different linear coding on standard deviation and mean
Addition/subtraction - shifts the mean by the amount added/subtracted. No change to standard deviation
Multiplication/division - multiplies/divides the mean by the scale factor, and the standard deviation by the same factor
Any scale factor applied to the standard deviation is squared for the variance
note squaring all values won’t square the mean
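A numerical check of the coding rules, using the made-up coding y = 3x + 10 on a made-up data set, plus the point about squaring.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]     # made-up data: mean 5, standard deviation 2
coded = [3 * x + 10 for x in data]  # hypothetical coding y = 3x + 10

mean_x = statistics.mean(data)
sd_x = statistics.pstdev(data)       # population standard deviation (divide by n)
mean_y = statistics.mean(coded)
sd_y = statistics.pstdev(coded)
var_y = statistics.pvariance(coded)

print(mean_y, 3 * mean_x + 10)  # mean is scaled and shifted
print(sd_y, 3 * sd_x)           # standard deviation is only scaled
print(var_y, 9 * sd_x ** 2)     # variance is scaled by 3 squared

# Squaring every value does not square the mean.
mean_of_squares = statistics.mean([x * x for x in data])
print(mean_of_squares, mean_x ** 2)
```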
For a random sample of size n
every member of the population is equally likely to be included
all subsets of the population of size n must be possible
or that every possible sample of size n must be equally likely to occur.
Snowball sampling
Interview one person, who then refers further people to be sampled: each person can refer one other (linear) or several (exponential)
Used when participants are hard to find + to reach hidden populations
Short duration
May only be able to reach a small population. Can lead to sampling bias
Exponential discriminative - each person gives several referrals but only one of them is recruited
Cluster sampling
Cluster sampling is where a population is split into groups (clusters) and then one or more whole groups, chosen at random, are used as the sample
Quota and convenience sampling
Quota sampling is similar to stratified except the members from each group are not chosen randomly. For example, 15 fish from a lake are required for a sample. 10 should be trout and 5 should be cod. The person doing the experiment might just use the first 10 trout and the first 5 cod they catch as their sample.
Convenience sampling is where the person doing the experiment uses whatever method is easiest. For example, they could ask the first 100 people to walk past them on a street.
Simple vs unrestricted random sampling
Simple - each subject can be selected at most once (sampling without replacement)
Unrestricted - a subject can be selected more than once (sampling with replacement)
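A small illustration of the difference, using Python's standard `random` module: `sample` draws without replacement (simple random sampling), `choices` draws with replacement (unrestricted random sampling). The population of 10 numbered units and the seed are made up.

```python
import random

rng = random.Random(3)           # fixed seed so the run is repeatable
population = list(range(1, 11))  # hypothetical units numbered 1 to 10

# Simple random sampling: no unit can appear twice.
simple = rng.sample(population, 5)

# Unrestricted random sampling: the same unit may be drawn more than once.
unrestricted = rng.choices(population, k=5)

print("simple:      ", simple)
print("unrestricted:", unrestricted)
```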
Plotting cumulative frequency vs polygon
Cumulative frequency is plotted at the upper class boundary, frequency polygon is plotted at the class midpoint
What to comment on when comparing data sets
Measures of location
Measures of spread
Correlation is only used
To describe a linear relationship; variables with no linear correlation could still have a (non-linear) relationship
Bivariate data
Data with pairs of values for two variables
Least squares regression line
Line that minimises the sum of the squares of the vertical distances from each data point to the line
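A sketch of finding the least squares line y = a + bx, using the standard results b = Sxy/Sxx and a = ȳ - b·x̄. These formulas are not stated in the notes above, and the five data points are made up.

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises the squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx          # gradient
    a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return a, b

xs = [1, 2, 3, 4, 5]  # made-up bivariate data
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y = {a:.1f} + {b:.1f}x")  # y = 2.2 + 0.6x
```

Here Sxy = 6 and Sxx = 10, so b = 0.6 and a = 4 - 0.6×3 = 2.2.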