Data description, populations and the normal distribution Flashcards
Two measures of “spread” in a sample?
IQR and range (maximum-minimum)
Two ways of using histograms?
Show the raw frequency in each bin, or show the frequency as a proportion of the total number of observations
The only parameters of a normal distribution?
Mean (μ) and standard deviation (σ)
Calculating SD for a sample?
Subtract the sample mean (m) from each value and square the result; then add these squared deviations, divide by N-1, and take the square root. Without squaring, the positive and negative deviations would cancel and always sum to 0. Dividing by N-1 stops the SD simply increasing as the sample gets bigger.
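A minimal sketch of the sample SD steps in Python. The function name `sample_sd` and the example heights are made up for illustration.

```python
import math

def sample_sd(values):
    """Sample standard deviation, dividing by N-1."""
    n = len(values)
    m = sum(values) / n                              # sample mean
    squared_devs = [(x - m) ** 2 for x in values]    # squaring stops signs cancelling
    return math.sqrt(sum(squared_devs) / (n - 1))

# Hypothetical heights in cm (illustrative values only)
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
m = sum(heights) / len(heights)

raw_deviation_sum = sum(x - m for x in heights)      # cancels to 0 without squaring
sd = sample_sd(heights)

print(raw_deviation_sum, round(sd, 2))
```

The standard library function `statistics.stdev` uses the same N-1 divisor, so it should agree with this sketch.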
Why is median better than mean for skewed data?
Median will not be unduly affected by a select few very large values; mean will be.
AUC up to X?
= P; the probability that an individual will have a value (e.g. height) below X. The total AUC is therefore 1. The value of P corresponding to a given X is the “cumulative probability of the distribution at X”. The probability that a value is above X is 1-P.
Symmetry of the curve for AUC?
If Y is d units below the mean, then P at Y = 0.24 (for example). Because the normal distribution is symmetrical, the value d units above the mean has the same probability (0.24) of being exceeded. The probability of being between these two values is therefore 1-(0.24+0.24) = 0.52.
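A short sketch of the cumulative probability and the symmetry argument, using `statistics.NormalDist` from the Python standard library. The mean of 170, SD of 10 and distance of 7 are assumed values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical population of heights (assumed mu and sigma)
population = NormalDist(mu=170, sigma=10)
d = 7  # distance from the mean, in the same units as the heights

p_below_lower = population.cdf(170 - d)       # AUC up to the lower value
p_above_upper = 1 - population.cdf(170 + d)   # AUC above the upper value

# Symmetry: the two tail areas match, so the middle is 1 minus both tails
p_between = 1 - (p_below_lower + p_above_upper)

print(round(p_below_lower, 3), round(p_above_upper, 3), round(p_between, 3))
```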
What is the inverse cumulative probability?
Goes from a probability back to a value: allows you to use the AUC of the normal distribution to find, for example, the height below which 3% of boys fall.
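As a sketch, `NormalDist.inv_cdf` in the Python standard library performs this inverse lookup. The mean of 120 and SD of 6 are assumed values for a hypothetical population of boys' heights.

```python
from statistics import NormalDist

# Hypothetical population (assumed mu and sigma, in cm)
mu, sigma = 120, 6
population = NormalDist(mu, sigma)

# Height below which 3% of the population falls
third_centile = population.inv_cdf(0.03)

# The same lookup on the standard normal gives the Z value (about -1.88)
z = NormalDist().inv_cdf(0.03)

print(round(z, 2), round(third_centile, 1))
```

Feeding `third_centile` back into `population.cdf` should return 0.03, which is a quick check that the two functions are inverses.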
How does P depend on the equation for X, where X= μ+Zσ?
Only through Z: if the population changes (different μ and σ), then P at the same number of SDs (Z) away from μ is the same, even though X itself will be different.
Significance of P only depending on μ and σ through Z?
Means that for Z of 1.96, P=0.975 i.e. 97.5%, whatever μ and σ are. So 2.5% of people have height above μ+1.96σ, and also 2.5% of people have height below μ-1.96σ. Therefore 95% of people have height within μ±1.96σ. This is the basis of Z scores!
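A quick numerical check that P depends only on Z, sketched with `statistics.NormalDist`. The three (μ, σ) pairs are arbitrary assumed populations.

```python
from statistics import NormalDist

z = 1.96
p_standard = NormalDist().cdf(z)   # about 0.975 on the standard normal

# Arbitrary assumed populations: (mu, sigma)
populations = [(170, 10), (100, 15), (3.5, 0.5)]

# P at X = mu + Z*sigma is the same for every population
p_values = [NormalDist(mu, sigma).cdf(mu + z * sigma) for mu, sigma in populations]

within = 2 * p_standard - 1        # proportion inside mu ± 1.96 sigma

print(round(p_standard, 4), [round(p, 4) for p in p_values], round(within, 3))
```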
Use of Z and μ and σ for quartiles?
For Z of 0.675, get P of 0.75, and therefore the proportion within Z SDs of μ is (2*0.75)-1=0.5, i.e. the middle 50%. The IQR is therefore (μ+0.675σ)-(μ-0.675σ) = 1.35σ.
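The 0.675 and 1.35 figures can be recovered from the standard normal, as in this sketch. The helper `estimate_sigma_from_iqr` is a hypothetical name, and it only makes sense if the data are roughly normal.

```python
from statistics import NormalDist

z_upper = NormalDist().inv_cdf(0.75)   # Z at the upper quartile, about 0.675
iqr_in_sds = 2 * z_upper               # IQR expressed in SDs, about 1.35

def estimate_sigma_from_iqr(iqr):
    """Estimate sigma from an IQR, assuming a normal distribution."""
    return iqr / iqr_in_sds

# Illustrative IQR of 13.5 units gives a sigma of about 10
sigma_estimate = estimate_sigma_from_iqr(13.5)

print(round(z_upper, 3), round(iqr_in_sds, 2), round(sigma_estimate, 1))
```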
Why is IQR better than range for estimating spread?
As IQR = 1.35σ for any normal distribution, can estimate σ by dividing the sample IQR by 1.35. However, for the range (maximum-minimum), the ratio of range to σ is not a constant but depends on sample size (the expected range will be bigger from a sample of 1000 than from a sample of 10). This means that the range reflects not only the spread of a sample but also the sample size, and so is not good for estimating population spread.
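A small simulation sketch of this point: samples are drawn from a standard normal (σ = 1), and the average range and IQR are compared at two sample sizes. The sizes 50 and 5000, the 50 repeats and the seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, quantiles

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def average_spread(sample_size, repeats=50):
    """Average range and IQR over repeated standard-normal samples."""
    ranges, iqrs = [], []
    for _ in range(repeats):
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        q1, _median, q3 = quantiles(sample, n=4)   # three quartile cut points
        ranges.append(max(sample) - min(sample))
        iqrs.append(q3 - q1)
    return mean(ranges), mean(iqrs)

range_small, iqr_small = average_spread(50)
range_large, iqr_large = average_spread(5000)

# The range grows with sample size; the IQR stays near 1.35 * sigma
print(round(range_small, 2), round(range_large, 2))
print(round(iqr_small, 2), round(iqr_large, 2))
```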
Sample means, normality and sample size?
Even for skewed data, sample means are often approximately normally distributed, and they become closer to normal as sample size increases.
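This can be illustrated with a simulation sketch: means of samples from a strongly right-skewed exponential distribution become less skewed as the sample size grows. The exponential distribution, the sample sizes 2 and 50, and the seed are assumptions made for the example.

```python
import random
from statistics import pstdev

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Simple moment-based skewness; 0 for a symmetric distribution."""
    m = sum(values) / len(values)
    s = pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

def sample_means(sample_size, n_samples=3000):
    """Means of repeated samples from a skewed (exponential) population."""
    return [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

skew_small = skewness(sample_means(2))    # means of tiny samples: still skewed
skew_large = skewness(sample_means(50))   # means of larger samples: closer to 0

print(round(skew_small, 2), round(skew_large, 2))
```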
Using Z score to estimate third centile of a population?
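Calculate m-1.88s; this gives the value below which 3% of the population lie. This is because Z of 1.88 gives P of 0.97, so (P*2)-1=0.94 lie within m±1.88s (3% either side). Only works if the data are normal! But it gives you the precision that counting would only achieve with a much larger sample.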
Assuming normality for data that can only be positive?
Can be a problem; want negative values in this situation to have low probability (1-5%). As about 16% of a normal population falls below μ-σ, but only about 2.3% below μ-2σ, then if the estimated mean (m) is below the estimated SD (s), the normal is probably inappropriate. As m becomes ~2*s, it becomes okay, as there is a low probability of negative values.
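A sketch of this check: the probability a fitted normal assigns to negative values is its cumulative probability at 0. The function name `p_negative` and the example values of m and s are illustrative assumptions.

```python
from statistics import NormalDist

def p_negative(m, s):
    """Probability that a normal with mean m and SD s gives a value below 0."""
    return NormalDist(m, s).cdf(0)

p_when_m_equals_s = p_negative(10, 10)   # 0 is one SD below the mean
p_when_m_is_2s = p_negative(20, 10)      # 0 is two SDs below the mean

print(round(p_when_m_equals_s, 3), round(p_when_m_is_2s, 3))
```

With m equal to s, roughly 16% of the fitted distribution is negative; with m at twice s this drops to roughly 2%, which is within the 1-5% target.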