The analysis of categorical data Flashcards
Categorical and ordinal data?
If not continuous, then is categorical. If ordered, e.g. tumour stage I-IV, then is ordinal. If not, e.g. ABO blood group, then unordered. Unlike continuous data, the gaps between ordinal categories have no numerical meaning: a patient who is stage IV is not twice as bad as one who is stage II.
Parameters in binary data?
Only thing that can change is the % of the population that have an attribute (π). Can also be interpreted as the probability that a randomly chosen member of the population has an attribute. As before, π is unknown.
Estimating π?
Very simple. If have n patients, r will have the attribute and n-r will not. The estimator of π is simply r/n or 100*r/n (as a percentage). Although this is effectively a mean for the 0s and 1s, we need different methods.
Why do we need new methods for binary data?
To do with SD. In continuous variables, have μ (estimated by sample mean m) and σ (which is independent of it). In binary, there is only one parameter which is clearly an analogue of μ. Must have SD that relates to the mean.
SD/SE in binary?
Look at how well r/n estimates π: as this incorporates sample size it is SE-like. Continuous variable SE = σ/√n; for binary do √[π(1-π)/n]. Both have n on denominator so SE shrinks with larger sample. Key difference is that once π has been estimated there are no other variables needed for spread. This is why methods for continuous variables must change.
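A minimal sketch of the estimate and its SE, assuming (as is usual in practice) that the unknown π in the SE formula is replaced by the estimate r/n. The function name and the counts are made up for illustration.

```python
import math

def proportion_and_se(r, n):
    """Return the estimate p = r/n and its standard error sqrt(p(1-p)/n)."""
    p = r / n
    # Plug-in SE: the unknown pi is replaced by its estimate p (an assumption).
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical samples with the same proportion but different sizes.
p_small, se_small = proportion_and_se(2, 10)
p_large, se_large = proportion_and_se(200, 1000)
print(p_small, se_small)
print(p_large, se_large)
```

Both samples give the same estimate, but the larger one has the smaller SE, and no second parameter is needed to get the spread.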
Example for why binary SE formula used?
If have π=0, it is impossible for r (the count in the sample) to be anything other than 0 and so the SE must be 0. This is why *π is in the numerator. Similarly, if π=1 then r=n (so r/n=1) and there can be no error. This is why *(1-π) is in the numerator: again gives 0 and no error.
What is χ2 test an analogue of?
Unpaired T test!
Null hypothesis in χ2?
That π1=π2, i.e. that the proportion with the attribute is the same in the two populations. π looks at proportion; work out how many you would expect to (die) in each group if the population proportions were equal.
χ2 size and sign?
Can never be negative, and only 0 when the observed and expected tables are exactly identical.
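A minimal sketch of the calculation for a 2*2 table, assuming no continuity correction. The expected counts come from the margins, which is what equal population proportions imply. The counts are made up, and the P value uses the fact that a χ2 on 1 degree of freedom is a squared standard normal.

```python
import math

def chi_square_2x2(table):
    """Return (expected counts, chi-square statistic, P value) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # Expected count under the null = row total * column total / grand total.
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
                for i in range(2)]
    # Sum of (O - E)^2 / E over the four cells: squared terms, so never negative.
    chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # Upper tail of chi-square with 1 df (no continuity correction applied).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value

# Hypothetical counts: rows = two groups, columns = attribute present / absent.
expected, chi2, p_value = chi_square_2x2([[15, 35], [30, 20]])
print(expected, chi2, p_value)

# A table whose observed counts already equal the expected counts gives 0.
print(chi_square_2x2([[10, 20], [20, 40]])[1])
```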
Things to remember about χ2?
Must be counts, not %. This is because must account for sample sizes i.e. 2/10 not the same strength as 200/1000, and the latter will be much more sensitive to departures from expected. Must also be independent counts i.e. remember counting children again and again. Otherwise will make P value much more convincing purely because the counts are larger, but the difference is no more real. A useful way to check this is to make sure that in the margins, the bottom right number i.e. the grand total is the same as the number of independent units.
When is χ2 not appropriate?
When any expected value is less than 5 in a 2*2 table. If table is larger, then when over 20% of cells have E values below 5, or any cell has E below 1. Use Fisher’s instead!
Why is Fisher’s different to others?
Calculates the P value directly from the data, rather than via a test statistic like the χ2 or T score.
Why is it called Fisher’s exact?
No need for the asymptotic approximation seen in χ2, just use actual tables.
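A minimal sketch of how the P value can come straight from the tables, assuming the margins are held fixed and using one common two-sided convention: sum the probabilities of all tables that are no more probable than the observed one. The function name and counts are illustrative only.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact P value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)   # smallest possible top-left count
    highest = min(row1, col1)      # largest possible top-left count
    p_observed = prob(a)
    # Add up every possible table that is as improbable as the observed one,
    # with a small tolerance so ties are not lost to rounding.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical small table where expected counts are all below 5.
p_value = fisher_exact_two_sided([[3, 1], [1, 3]])
print(p_value)
```

Every possible table has to be enumerated, which is why the cost grows with the counts.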
Why not use Fisher’s all the time?
Computing power. Also, get wide CIs.
Finding probability from odds?
If odds = 2, then probability = 2/(1+2)
Finding odds from probability?
If probability = 2/3, then odds =(2/3)/(1-2/3)=2=2:1
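The two conversions can be sketched as a pair of small functions (the names are illustrative), using the same numbers as the cards above.

```python
def odds_to_probability(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)

def probability_to_odds(p):
    """odds = p / (1 - p); undefined when p = 1"""
    return p / (1 - p)

print(odds_to_probability(2))      # odds of 2, i.e. 2:1
print(probability_to_odds(2 / 3))  # probability of 2/3
```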
Three ways for describing a difference between parameters π1 and π2?
- Absolute difference D = π1-π2
- Relative risk R = π1/π2
- Odds ratio OR = (π1/(1-π1))/(π2/(1-π2)), i.e. calculate the odds from each π normally, then divide
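A minimal sketch computing all three measures from a pair of proportions. The values of pi1 and pi2 are made up, and the function name is illustrative.

```python
def compare_proportions(pi1, pi2):
    """Return (absolute difference, relative risk, odds ratio) for two proportions."""
    difference = pi1 - pi2
    relative_risk = pi1 / pi2
    odds1 = pi1 / (1 - pi1)
    odds2 = pi2 / (1 - pi2)
    odds_ratio = odds1 / odds2
    return difference, relative_risk, odds_ratio

# Hypothetical proportions in the two groups.
d, rr, odds_ratio = compare_proportions(0.3, 0.6)
print(d, rr, odds_ratio)

# When the proportions are equal, each measure takes its null value.
print(compare_proportions(0.4, 0.4))
```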
Null hypothesis for absolute difference, relative risk and odds?
D = 0, R = 1, OR = 1
Significance of CIs for D?
If P<0.05 for π1=π2, then 95% CI will necessarily not include the null value, 0.
Standard error for OR?
First: SE(lnOR) = √(1/34 + 1/68 + etc.), i.e. the square root of the sum of the reciprocals of the four cell counts seen in the 2*2 table.
Confidence intervals for lnOR?
Calculate OR, then lnOR, then SE(lnOR). 95% CI for lnOR = lnOR ± 1.96*SE(lnOR), then take antilogs of the two limits to get the CI for OR.
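The steps above can be sketched as follows, assuming a 2*2 table with cells a, b (first row) and c, d (second row) and no zero cells. The counts are made up for illustration.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return (OR, lower limit, upper limit) using the log-scale method."""
    odds_ratio = (a / b) / (c / d)
    log_or = math.log(odds_ratio)
    # SE of ln(OR): square root of the sum of reciprocals of the four counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Interval on the log scale, then antilog each limit.
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts.
odds_ratio, lower, upper = odds_ratio_ci(15, 35, 30, 20)
print(odds_ratio, lower, upper)
# The interval is symmetric about ln(OR), not about OR itself.
print((lower + upper) / 2)
```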
Confidence interval significance for lnOR?
Again, if P<0.05, then the 95% CI for the OR will not include its null value, 1 (equivalently, the CI for lnOR will not include 0).
Midpoint of CIs?
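For most things, the midpoint of the 95% CI will be your point estimate. This is not the case for OR: the CI is symmetric about lnOR on the log scale, so after taking antilogs the OR is no longer in the middle.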
What does interaction mean?
Effect of one variable depends on the level of another i.e. effect modifier
Why does difference in P values re interaction not mean that there is a difference?
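Because P value is a composite of standard error and treatment effect. Two subgroups with the same treatment effect can give different P values simply because their SEs (e.g. sample sizes) differ.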
Calculating SE of difference in means?
As means themselves have their own SEs (se1 and se2), SEdiff = √(se1² + se2²)
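A minimal sketch of using this SE to compare two subgroup effects directly, rather than comparing their P values. It assumes independent subgroups and a normal approximation; the effects and SEs are made up for illustration.

```python
import math

def se_of_difference(se1, se2):
    """SE of the difference between two independent estimates."""
    return math.sqrt(se1 ** 2 + se2 ** 2)

def two_sided_p(estimate, se):
    """Two-sided P value for estimate/se under a normal approximation."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))

# Hypothetical subgroups: identical effect, different precision.
effect1, se1 = 5.0, 2.0
effect2, se2 = 5.0, 4.0

p1 = two_sided_p(effect1, se1)   # significant on its own
p2 = two_sided_p(effect2, se2)   # not significant on its own
# Test of interaction: is the difference between the effects non-zero?
p_interaction = two_sided_p(effect1 - effect2, se_of_difference(se1, se2))
print(p1, p2, p_interaction)
```

Here one subgroup is "significant" and the other is not, yet the test of the difference gives no evidence that the effects differ.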
Selection of subgroups problems?
Must choose which are to be analysed before the start of the study. Otherwise, if have ten variables, there are 45 potential interactions so reasonable chance that one would be significant.
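A rough sketch of the arithmetic behind this. The count of pairwise interactions is exact; the chance of a false positive assumes every test is independent and run at the 0.05 level, which is a simplifying assumption rather than something stated in these notes.

```python
from math import comb

n_variables = 10
# Number of ways to pick 2 variables out of 10.
n_pairs = comb(n_variables, 2)

# Assumption: independent tests, each with a 5% false-positive rate.
alpha = 0.05
p_at_least_one = 1 - (1 - alpha) ** n_pairs
print(n_pairs, p_at_least_one)
```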