measures of spread Flashcards
1
Q
measures of spread
A
- range
- interquartile range
- deviation
- variance - sample and population
- standard deviation
2
Q
range
A
- the distance between the smallest and largest values in a dataset
3
Q
interquartile range
A
- a slightly more useful measure than the range is the interquartile range (IQR)
- involves splitting the data into quarters:
- find the median to split the data in half
- split each of the halves into half again
- the IQR is the range covered by the middle 2 quarters (50%) of the data
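The range and the median-split IQR described above can be sketched in a few lines of Python. The data and the function names are made up for illustration, and the sketch assumes one common convention: when the number of values is odd, the median itself is left out of both halves. Library routines that interpolate between points can give slightly different quartiles.

```python
from statistics import median

def value_range(data):
    """Distance between the smallest and largest values."""
    return max(data) - min(data)

def iqr_by_halves(data):
    """IQR found by splitting the sorted data in half at the median,
    then taking the median of each half (the two quartiles)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # bottom half (median excluded if n is odd)
    upper = ordered[n - half:]    # top half (median excluded if n is odd)
    return median(upper) - median(lower)

scores = [2, 4, 4, 5, 7, 8, 9, 11]  # hypothetical data
print(value_range(scores))    # 11 - 2 = 9
print(iqr_by_halves(scores))  # 8.5 - 4.0 = 4.5
```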
4
Q
range and IQR
A
- the range and the IQR only tell us limited information
- two datasets can have the same range and IQR but still look very different
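As a small illustration of this point, the two made-up datasets below share the same range and the same IQR even though one is spread fairly evenly and the other is clumped around two values. The helper `spread_summary` is a hypothetical name, and the IQR uses the median-split convention (an assumption; other quartile methods exist).

```python
from statistics import median

def spread_summary(data):
    """Return (range, IQR), with the IQR found by the median-split method."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return max(ordered) - min(ordered), median(upper) - median(lower)

even_spread = [0, 2, 4, 5, 5, 6, 8, 10]  # hypothetical data
clumped = [0, 3, 3, 3, 7, 7, 7, 10]      # hypothetical data

print(spread_summary(even_spread))  # (10, 4.0)
print(spread_summary(clumped))      # (10, 4.0)
```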
5
Q
deviation
A
- the range and IQR depend on only two points
- to get a more fine-grained idea of the spread we need to take every data-point into account
- one way to do this is to take each data-point and calculate how far it is away from some reference point, such as the mean
- this is known as the deviation
- once we have the deviation values, then what do we do with them?
- if we add them up then the sum will just be bigger whenever we have more data
- but it’s possible to have bunched up large datasets and spread out small datasets and our measure should be able to account for this.
- instead of adding up the deviations we could work out the average of the deviations
- but some deviations will be negative and some will be positive, so they’ll just average out to 0
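A quick check of that last point, using made-up data: deviations from the mean always cancel out, so their sum (and therefore their average) is 0.

```python
from statistics import mean

data = [2, 4, 4, 5, 7, 8, 9, 9]  # hypothetical data, mean = 48 / 8 = 6
centre = mean(data)

# How far each data point is from the mean (negative below, positive above)
deviations = [x - centre for x in data]

print(deviations)        # [-4, -2, -2, -1, 1, 2, 3, 3]
print(sum(deviations))   # 0
print(mean(deviations))  # 0
```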
6
Q
squared deviations
A
- we can make sure all the deviations are positive by squaring the values
- the mean of the squared deviations will be the basis for our next measure of spread, the variance
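A minimal sketch of the squaring step, again with made-up data: once squared, every deviation is zero or positive, so the average no longer cancels to 0.

```python
from statistics import mean

data = [2, 4, 4, 5, 7, 8, 9, 9]  # hypothetical data, mean = 6
centre = mean(data)

# Squaring removes the sign, so the values can no longer cancel out
squared_deviations = [(x - centre) ** 2 for x in data]
mean_squared_deviation = mean(squared_deviations)

print(squared_deviations)      # [16, 4, 4, 1, 1, 4, 9, 9]
print(mean_squared_deviation)  # 48 / 8 = 6
```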
7
Q
variance
A
- the population variance - the mean of the squared deviations from the population mean
- but we don’t usually know the value of the population mean, so can we just use the sample mean instead?
8
Q
squared deviations from the population mean
A
- We’ll start off with a population where we know the population mean (100) and the variance of the population (225).
- We’ll take samples from this population, and work out the average of the squared deviations from the population mean.
- The value we calculate varies from sample to sample, but what does it do on average?
- We can repeat what we did with the sample mean and see what happens with the average squared deviations from the population mean.
- We can plot the running average of the average squared deviations from the population mean.
- On average, the mean squared deviation from the population mean will be equal to the variance of the population.
- Now let’s repeat the process but use the deviation from the sample mean instead
- In figure 5, we can see the running average of average squared deviations from the sample mean.
- Now we can see the problem of using deviation from the sample mean instead of deviation from the population mean.
- Our calculated value will on average not be the same as the variance of the population.
- So what’s the solution?
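The two experiments on this card can be sketched as a small simulation. The population mean (100) and variance (225) come from the card; everything else is an assumption made for illustration: a normally distributed population, a sample size of 10, 20,000 repeated samples, and the helper name `mean_sq_dev`.

```python
import random

POP_MEAN, POP_SD = 100, 15   # variance = 15 ** 2 = 225, as on the card
SAMPLE_SIZE = 10             # assumption: not stated in the source
N_SAMPLES = 20_000           # assumption: number of repeated samples

rng = random.Random(1)       # fixed seed so the run is repeatable

def mean_sq_dev(sample, reference):
    """Average squared deviation of the sample from a reference point."""
    return sum((x - reference) ** 2 for x in sample) / len(sample)

around_pop_mean = []
around_sample_mean = []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / len(sample)
    around_pop_mean.append(mean_sq_dev(sample, POP_MEAN))
    around_sample_mean.append(mean_sq_dev(sample, sample_mean))

# Long-run averages across all the samples
avg_pop = sum(around_pop_mean) / N_SAMPLES
avg_sample = sum(around_sample_mean) / N_SAMPLES

print(round(avg_pop, 1))     # close to the population variance, 225
print(round(avg_sample, 1))  # noticeably below 225
```

With these settings the second average should settle near 225 × 9/10 = 202.5, which is the shortfall the card is pointing at.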
9
Q
sample variance
A
- when we only have access to information from the sample (e.g., sample mean) then we have to calculate a quantity known as the sample variance
- Dividing by N-1 rather than taking a simple average (dividing by N) means that on average the sample variance will be equal to the variance of the population.
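The two calculations differ only in the divisor. The sketch below uses made-up data and compares the hand calculation with Python's standard library, where `statistics.pvariance` divides by N and `statistics.variance` divides by N-1.

```python
from statistics import mean, pvariance, variance

sample = [2, 4, 4, 5, 7, 8, 9, 9]  # hypothetical data
n = len(sample)
sample_mean = mean(sample)

# Sum of squared deviations from the sample mean (48 for this data)
sum_sq = sum((x - sample_mean) ** 2 for x in sample)

divide_by_n = sum_sq / n              # simple average: 48 / 8
divide_by_n_minus_1 = sum_sq / (n - 1)  # sample variance: 48 / 7

print(divide_by_n, pvariance(sample))
print(divide_by_n_minus_1, variance(sample))
```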
10
Q
sample variance and population variance
A
- If you have access to the entire population (e.g., you can compute the population mean) then you can calculate the population variance (divide by N).
- If you only have access to the sample characteristics (e.g., you can only calculate the sample mean) then you must calculate the sample variance (divide by N-1).
- The confusing part is the sample variance is an unbiased estimator of the variance of the population.
- This just means that, averaged over many samples, the sample variance will converge to the variance of the population.
- Using the population variance formula with sample values is a biased estimator of the variance of the population.
- This just means that, averaged over many samples, it won’t converge to the variance of the population.
- Remember, what we really want to know are the features of the population (its mean and variance) but we need to estimate these from the sample.
11
Q
standard deviation
A
- the variance is a good measure of spread, and it’s a commonly used measure, but it can be a little difficult to interpret
- For example, think back to the salary example from lecture 6
- If salary is measured in USD
- Then the variance is measured in USD squared
- fortunately there is a solution, just take the square root of the variance
- this measure is called standard deviation
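A short sketch of the units point, using invented salary figures (not the ones from the lecture): the variance comes out in squared dollars, and taking the square root brings the measure back to dollars. `statistics.stdev` does the same thing in one call, using the N-1 divisor.

```python
from math import sqrt
from statistics import stdev, variance

# Hypothetical salaries in USD
salaries_usd = [30_000, 35_000, 40_000, 45_000, 100_000]

variance_usd_squared = variance(salaries_usd)  # units: USD squared
sd_usd = sqrt(variance_usd_squared)            # units: USD

print(variance_usd_squared)
print(sd_usd)
print(stdev(salaries_usd))  # same value as sd_usd
```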
12
Q
why the squared deviations and not the absolute value?
A
- When we worked out the deviations, we squared them to turn the negative values into positive values.
- But could we just take the absolute value?
- Below we have two data sets made up of four data points each
- The data in A are more spread out than the data in B
- So let’s calculate the average of the squared deviations and the average of the absolute value of the deviations.
- First, the data in A:
- The mean of the absolute deviations is 70
- The mean of the squared deviations is 7400
- Then the data in B:
- The mean of the absolute deviations is 70
- The mean of the squared deviations is 4900
- So even though the two sets of data have different amounts of spread, the mean of absolute deviations doesn’t pick it up, but the mean of the squared deviations does.
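The original data sets are not reproduced on this card, so the sketch below uses hypothetical values chosen to give the same results (70 and 7400 for A, 70 and 4900 for B). The function names are also made up for illustration.

```python
from statistics import mean

def mean_abs_dev(data):
    """Mean of the absolute deviations from the mean."""
    centre = mean(data)
    return mean(abs(x - centre) for x in data)

def mean_sq_dev(data):
    """Mean of the squared deviations from the mean."""
    centre = mean(data)
    return mean((x - centre) ** 2 for x in data)

# Hypothetical data: both sets have a mean of 200
set_a = [80, 180, 220, 320]   # deviations: -120, -20, 20, 120
set_b = [130, 130, 270, 270]  # deviations: -70, -70, 70, 70

print(mean_abs_dev(set_a), mean_sq_dev(set_a))  # 70 7400
print(mean_abs_dev(set_b), mean_sq_dev(set_b))  # 70 4900
```

Squaring gives the two far-out points in A (±120) much more weight than the near ones (±20), which is why the squared measure separates the two sets while the absolute measure does not.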
13
Q
the relationship between samples and populations
A
- Now that we have tools for describing the centre/typical value of a set of measurements (mean) and the spread of a set of measurements (variance/standard deviation) we can put these two ideas together.
- In lecture 6 we saw that individual sample means were spread out around the population mean.
- We can quantify that spread using the idea of the standard deviation.
- But we’re no longer calculating the spread of our sample or even the spread of the population.
- We’re now calculating the spread of sample means around the population mean.
- This kind of standard deviation has a special name - the standard error of the mean.
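The idea can be sketched as a simulation: draw many samples, record each sample mean, and measure how spread out those means are around the population mean. The population values (mean 100, standard deviation 15) follow the earlier variance card; the normal population, the sample size of 25, and the number of samples are assumptions for illustration. The comparison value σ/√N is the standard textbook formula for the standard error of the mean and is not taken from these cards.

```python
import random
from math import sqrt

POP_MEAN, POP_SD = 100, 15  # population variance = 225
SAMPLE_SIZE = 25            # assumption: not stated in the source
N_SAMPLES = 5_000           # assumption: number of repeated samples

rng = random.Random(2)      # fixed seed so the run is repeatable

sample_means = []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

# Standard deviation of the sample means around the population mean
spread_of_means = sqrt(
    sum((m - POP_MEAN) ** 2 for m in sample_means) / N_SAMPLES
)

# Textbook value for comparison: sigma / sqrt(N) = 15 / 5 = 3.0
textbook_value = POP_SD / sqrt(SAMPLE_SIZE)

print(round(spread_of_means, 2))  # close to 3
print(textbook_value)             # 3.0
```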
14
Q
the standard error of the mean
A
- the standard error of the mean in technical terms is the standard deviation of the sampling distribution of the mean
- to fully appreciate the concept of the standard error of the mean we’ll need to understand the concept of the sampling distribution
- and to understand the sampling distribution we’ll first need to understand what distributions are, what they look like, and why they look the way they do