Lecture 18 - Resampling Statistics Flashcards
Why were resampling statistics introduced?
- Most of our stats tests are based on equations developed between 1800 and 1930
- Developed by talented mathematicians calculating probabilities from mathematical models by hand (i.e. the computations had to be simple enough for pen and paper)
- As a result, each test is based on one particular model of the underlying data:
- Sometimes the model makes a lot of assumptions
- Sometimes it makes fewer assumptions (but is usually weaker)
- Resampling techniques are a newer approach that makes far fewer assumptions but retains power: they need neither derived equations nor distributional assumptions, yet keep roughly the power of parametric tests
Why use resampling techniques?
Fewer assumptions:
- So more accurate if assumptions not met
Very general:
- A few basic ideas that can be modified and reused
- No equations or tables to look up – the maths is actually easier
- Designing the test forces us to think about our data (and what the null hypothesis actually means); we might even realise that the problem lies in the data
Why are resampling approaches not more popular?
- They are new (1979 is recent for stats) and assumed (incorrectly?) to be more complex
- Parametric stats do a reasonably good job, and are discussed in simple (ish?) language in textbooks
- Resampling does require:
- A computer (not widely available in 1979) e.g. have to calculate a mean say 10000 times
- Some programming (not available in SPSS)
- A lot of people don’t like thinking about their data
What are permutation tests?
- One common use for resampling
- For comparing groups/conditions (e.g. t-test replacement)
- Shuffle data according to your conditions
How is resampling used for hypothesis testing?
- The point of inferential statistics:
- To determine the probability that the differences we measured were caused by sampling error (i.e. that our sample does not reflect the rest of the population)
- The principle of resampling techniques is to measure that sampling error by repeating the sampling process a large number of times:
- We can determine the likely error introduced by the sampling by looking at the variability in the resampling
What are between-subject randomisation tests?
- Example question: are Smurfs just dwarfs painted blue?
- Experiment: measure heights and check for difference
- Null hypothesis: Smurfs and dwarfs have the same height
- Would this generalise to the whole population, or is it just these seven?
What is the process of between-subjects tests?
- We want to determine the likelihood of getting differences this extreme if the data all came from a single ‘population’
- Simulate running the experiment many times, with the data coming from one population; check what range of values commonly occur
- In practice: keep the measured values but shuffle them (randomly assigning them to the two groups); count how often the difference between the shuffled-group means is bigger than the difference between the measured means
- We assume the numbers themselves are real and sensible values
- We do not assume anything about their distribution
- Repeat process a large number of times
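The shuffling procedure above can be sketched in a few lines of Python. The heights below are invented illustrative values, not the lecture's actual data:

```python
import random

# Hypothetical heights (cm) for the two groups -- invented values only
smurfs = [95, 102, 88, 110, 97, 101, 93]
dwarfs = [120, 115, 108, 125, 118, 112, 122]

def mean(xs):
    return sum(xs) / len(xs)

observed = abs(mean(smurfs) - mean(dwarfs))   # the real difference

pooled = smurfs + dwarfs
random.seed(1)                                # fixed seed for reproducibility
n_resamples = 10000
count = 0
for _ in range(n_resamples):
    random.shuffle(pooled)                    # forces H0: group labels are arbitrary
    diff = abs(mean(pooled[:7]) - mean(pooled[7:]))
    if diff >= observed:                      # at least as extreme as the real data
        count += 1

p = count / n_resamples
```

The p value is simply the fraction of shuffled datasets whose mean difference is at least as extreme as the real one.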
Between subjects tests summary
- Repeat simulated experiment a large number of times, forcing the null hypothesis to be true, and check how extreme the real value was
- No equation needed, except for the statistic of interest, e.g. mean
- No table needed: the data themselves give the p value
What is the generalisation of between-subject tests?
- Suppose our hypothesis is that the groups differ in variability (SD) rather than in the mean
- Randomise
- Repeat process a large number of times
- We do not need a whole new test if we change our opinion of what is interesting in the data
- We do not need a parametric and a non-parametric version of the test
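Swapping the statistic really is a one-line change: the same shuffling machinery, but computing an SD difference instead of a mean difference (values invented for illustration):

```python
import random
import statistics

group_a = [100, 101, 99, 102, 98, 100, 101]   # hypothetical: similar mean, low spread
group_b = [90, 112, 85, 118, 95, 108, 99]     # hypothetical: similar mean, high spread

# same permutation machinery as for the mean, but the statistic is now the SD
observed = abs(statistics.stdev(group_a) - statistics.stdev(group_b))

pooled = group_a + group_b
random.seed(2)
count = 0
for _ in range(10000):
    random.shuffle(pooled)
    if abs(statistics.stdev(pooled[:7]) - statistics.stdev(pooled[7:])) >= observed:
        count += 1
p = count / 10000
```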
- Very similar approach for within-subjects design
What is a within-subjects randomisation test?
- H0: taking steroids has no effect on a dwarf’s height
- Now the populations that we randomise are within subjects: in each resample, the values are shuffled for each subject, rather than across the whole dataset: we just randomise the sign of the difference for each pair
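Randomising the sign of each pair's difference can be sketched as follows (the before/after heights are invented, not the lecture's data):

```python
import random
import statistics

# Hypothetical heights (cm) before and after steroids for 8 dwarfs
before = [110, 115, 108, 120, 112, 118, 109, 114]
after  = [113, 118, 107, 125, 114, 121, 112, 115]
diffs  = [a - b for a, b in zip(after, before)]
observed = statistics.mean(diffs)

random.seed(3)
count = 0
for _ in range(10000):
    # under H0 the direction of each subject's change is arbitrary,
    # so we flip the sign of each pair's difference at random
    flipped = [d * random.choice((1, -1)) for d in diffs]
    if abs(statistics.mean(flipped)) >= abs(observed):
        count += 1
p = count / 10000
```

Note that values are only ever shuffled within a subject, never across subjects.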
How is number of participants accounted for?
- t-tests use n (number of subjects) in their equation
- How is that accounted for here?
- The sample size for the resamples has to be the same as the original data
- The variance in the mean differences will automatically reflect the number of subjects
- With 100 people (rather than 10), it's unlikely one person will have a big effect (so the null distribution becomes tighter)
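The claim that the null distribution tightens with n can be checked directly. A sketch using simulated within-subject differences (all values invented via a random number generator):

```python
import random
import statistics

random.seed(4)

def null_width(n, resamples=2000):
    # hypothetical within-subject differences under H0 (true mean of zero)
    diffs = [random.gauss(0, 1) for _ in range(n)]
    means = []
    for _ in range(resamples):
        # sign-flip resampling, as in the within-subjects test
        flipped = [d * random.choice((1, -1)) for d in diffs]
        means.append(statistics.mean(flipped))
    return statistics.stdev(means)   # spread of the null distribution

wide = null_width(10)     # few subjects: wide null distribution
narrow = null_width(100)  # many subjects: tight null distribution
```

The sample size is baked into each resample, so no n term ever appears in an equation.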
What is bootstrap resampling?
- For generating confidence intervals (e.g. make error bars)
- Resample-with-replacement the values in a sample
What are bootstrap resamples?
- Bootstrap resamples can be used to calculate confidence intervals such as:
- Confidence interval of a mean (e.g. 95%)
- Standard error of the mean
- They can also determine whether some test value is inside or outside the 95% confidence interval (like a one-sample test)
- They can be used for confidence of simple values (like mean) or for fitted parameters (like gradient of a line)
Bootstrap example
- This needs a different type of resample: simply shuffling a single sample would give the same mean every time
- Each original value can appear in a resample once, more than once, or not at all (the resample draws from the same numbers, but each of the n draws is selected at random with replacement)
- Resampling with replacement
- The result is a distribution of sample means (not a distribution of individual people)
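Resampling with replacement is one call in Python. The sample below is invented, chosen so its mean matches the 111.2 quoted later in these notes:

```python
import random
import statistics

# Hypothetical sample of 10 IQ scores (invented values; mean = 111.2)
sample = [108, 104, 94, 121, 123, 118, 96, 109, 112, 127]

random.seed(5)
boot_means = []
for _ in range(10000):
    # resample WITH replacement: same n, but each value may appear
    # once, more than once, or not at all
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))
```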
How do you calculate SEM?
- SD of means (how variable would your mean have been)
- The SEM is the standard deviation of the means of all possible samples
- It can be estimated from the standard deviation of the bootstrap means
- In our example:
- SEM based on the formula: 4.27
- SEM based on the bootstrap resamples: 4.10
- To know the true SEM, would have to actually rerun study on full population repeatedly
- The difference is due to the fact that they are both estimates, calculated in two different ways
- If the sample is very skewed, the bootstrap estimate is the more trustworthy of the two
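The two SEM estimates can be compared side by side. A sketch using the same invented sample as above (so the numbers will not match the lecture's 4.27 and 4.10 exactly):

```python
import random
import statistics

sample = [108, 104, 94, 121, 123, 118, 96, 109, 112, 127]  # invented IQ scores

random.seed(6)
boot_means = [statistics.mean(random.choices(sample, k=len(sample)))
              for _ in range(10000)]

boot_sem = statistics.stdev(boot_means)                      # SD of the bootstrap means
formula_sem = statistics.stdev(sample) / len(sample) ** 0.5  # classic SD / sqrt(n)
```

The two values differ slightly because they are two different estimates of the same quantity.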
How do you calculate confidence intervals?
- The 95% confidence interval from the bootstraps represents the range of values that 95% of the means take
- It is calculated from the bootstrapped means by ordering them and cutting off the highest and lowest 2.5%
- In our example (with 10,000 resamples), the 250th value is 103 and the 9,750th value is 119: the mean is 111.2, so the 95% confidence interval is 103-119
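The order-and-cut step looks like this in Python (same invented sample as before, so the cut-off values will not exactly match the 103 and 119 above):

```python
import random
import statistics

sample = [108, 104, 94, 121, 123, 118, 96, 109, 112, 127]  # invented IQ scores

random.seed(7)
boot_means = sorted(statistics.mean(random.choices(sample, k=len(sample)))
                    for _ in range(10000))

# cut off the lowest and highest 2.5% of the ordered bootstrap means
ci_low  = boot_means[249]    # 250th value
ci_high = boot_means[9749]   # 9,750th value
```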
One-sample test
- Let us consider the hypothesis that this population is ‘above average IQ’. Then we might expect: IQ>100
- Null hypothesis: IQ<=100, i.e. we want to know whether the mean IQ is significantly greater than 100
- If H0 is true, how likely was this data? i.e. how often does a resampled mean of 100 or less occur?
- We can simply count how often it occurs within the bootstraps:
- Order the data and find how many values are <=100
- In our example: 45 values <= 100, so p= 45/10000 = 0.0045 < 0.05
- Note: we could have run a one-sample t-test
- We get: t=2.62, df=9, p=0.014 (SPSS reports 0.028 for a 2-tailed test but ours was 1-tailed)
- Different values because two different estimates of how likely the null hypothesis is to give this data
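The counting step of the one-sample test is a single line (invented sample again, so the p value will not match the lecture's 0.0045 exactly):

```python
import random
import statistics

sample = [108, 104, 94, 121, 123, 118, 96, 109, 112, 127]  # invented IQ scores

random.seed(8)
boot_means = [statistics.mean(random.choices(sample, k=len(sample)))
              for _ in range(10000)]

# one-tailed test of H0: mean IQ <= 100
# p is simply the fraction of bootstrap means at or below 100
p = sum(m <= 100 for m in boot_means) / len(boot_means)
```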
How do you bootstrap with a model fit?
- Comparing the mean to a specific value is effectively having a very simple model of the world
- Bootstrapping generalizes easily to more complex models than just the mean
- E.g. we could fit a straight line to some data and use bootstrapping on the values of the gradient
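A sketch of bootstrapping a fitted gradient, using invented (x, y) data with a roughly linear trend; the key point is that (x, y) pairs are resampled together:

```python
import random

# Hypothetical (x, y) data with a roughly linear trend (invented values)
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1),
        (5, 9.8), (6, 12.2), (7, 13.9), (8, 16.1)]

def gradient(points):
    # least-squares gradient of a straight-line fit
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

random.seed(9)
boot_grads = []
while len(boot_grads) < 10000:
    pts = random.choices(data, k=len(data))   # resample (x, y) PAIRS together
    if len({x for x, _ in pts}) > 1:          # skip degenerate resamples (all same x)
        boot_grads.append(gradient(pts))
boot_grads.sort()
ci = (boot_grads[249], boot_grads[9749])      # 95% CI on the gradient
```

Exactly the same percentile cut as for the mean, just applied to a fitted parameter.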
What are advantages of bootstrap?
- Very general method: any type of model can be used and confidence intervals on any of its parameters can be estimated
- Can also be used to perform hypothesis testing (for one-sample tests)
- Not based on any assumptions about the data
- No tables, no equations (except for the model)
What are other resample approaches?
Jack-knife
- Similar to the bootstrap, but rather than randomly sampling with replacement, each resample is made by selecting all the data except one point (which shows how much impact each individual has)
- Can be done without a computer, though that is no longer a good reason to prefer it: a computer can easily run 10,000 bootstrap resamples
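A minimal jack-knife sketch (invented sample): one leave-one-out resample per data point, with no randomness needed:

```python
import statistics

sample = [108, 104, 94, 121, 123, 118, 96, 109, 112, 127]  # invented IQ scores

# leave-one-out means: exactly one resample per data point
loo_means = [statistics.mean(sample[:i] + sample[i + 1:])
             for i in range(len(sample))]

# how much each individual pulls the overall mean
influence = [statistics.mean(sample) - m for m in loo_means]
```

The most extreme data point produces the largest influence value, which is exactly what the jack-knife is designed to reveal.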
Monte-Carlo method
- Create data based on model simulations and compare these to real data
- E.g. if a neuron has a certain spike rate and a ‘Poisson’ spike generating mechanism what is the chance of seeing a particular pattern of spikes
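A sketch of the spike-count example, assuming a hypothetical firing rate of 5 spikes/s and asking how often 10 or more spikes occur in a one-second window (parameters invented for illustration):

```python
import math
import random

rate = 5.0      # hypothetical mean firing rate (spikes per second)
window = 1.0    # observation window (seconds)

def poisson_count(lam):
    # Knuth's algorithm: draw one Poisson-distributed spike count
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

random.seed(10)
counts = [poisson_count(rate * window) for _ in range(10000)]

# Monte Carlo estimate: how likely is a window with 10 or more spikes?
p = sum(c >= 10 for c in counts) / len(counts)
```

Here the resamples come from a model of the process rather than from the measured data.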
What are some issues and concerns with resampling?
- How many data samples (participants) do I need? There is no a priori answer: run a pilot study, estimate the effect size, and use that to choose the number of participants
- How many resamples must I generate? Typically 1,000-10,000, depending on how accurate you need p to be (with too few, you get a different p value each run)
- Which type of resampling should I use? Whatever best simulates the original sampling (force the null hypothesis to be true while keeping as much of the original information as possible)
- What if my data are not representative of the population? Garbage in = garbage out (the same is true of the t-test; no test can fix garbage data)
Resampling summary
- (1) Simulate recollecting the data
- Forcing the null hypothesis to be true, to build up the null distribution (which statisticians previously calculated for us)
- From a single sample to generate confidence intervals
- (2) If the original data look unlikely under your null distribution, then the null hypothesis is presumed not to be a good model of your data
- (3) These tests make very few assumptions about your data (unlike parametric tests)
- (4) They don’t throw away information (unlike rank-based non-parametric tests)
- Maintain original power