The analysis of categorical data Flashcards
Categorical and ordinal data?
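If data are not continuous, they are categorical. If the categories are ordered, e.g. tumour stage I-IV, the data are ordinal. If not, e.g. ABO blood group, they are unordered (nominal). Unlike continuous data, the categories are not on a numerical scale: a patient who is stage IV is not twice as bad as one who is stage II.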
Parameters in binary data?
The only parameter is the proportion of the population that has the attribute (π). It can also be interpreted as the probability that a randomly chosen member of the population has the attribute. As before, π is unknown.
Estimating π?
Very simple. If have n patients, r will have the attribute and n-r will not. The estimator of π is simply r/n or 100*r/n (as a percentage). Although this is effectively a mean for the 0s and 1s, we need different methods.
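A minimal sketch of the estimate, using a hypothetical sample of 12 patients out of 40 (the numbers and the function name `estimate_proportion` are illustrative, not from the notes). It also shows that r/n equals the mean of the 0/1 coding.

```python
def estimate_proportion(r, n):
    """Sample proportion r/n, used as the estimate of the population proportion pi."""
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need n > 0 and 0 <= r <= n")
    return r / n

# Hypothetical sample: 12 of 40 patients have the attribute.
r, n = 12, 40
p_hat = estimate_proportion(r, n)

# Coding each patient as 1 (has attribute) or 0 (does not) gives the same number as a mean.
coded = [1] * r + [0] * (n - r)
mean_of_coded = sum(coded) / len(coded)

print(f"estimate of pi: {p_hat:.3f} ({100 * p_hat:.1f}%)")
print(f"mean of 0/1 coding: {mean_of_coded:.3f}")
```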
Why do we need new methods for binary data?
To do with SD. For continuous variables there are two parameters: μ (estimated by the sample mean m) and σ (which is independent of it). For binary data there is only one parameter, π, which is clearly the analogue of μ. The SD must therefore be determined by the mean.
SD/SE in binary?
Look at how well r/n estimates π: as this incorporates sample size it is SE-like. For a continuous variable SE = σ/√n; for binary data SE = √[π(1-π)/n]. Both have n in the denominator, so the SE shrinks with a larger sample. The key difference is that once π has been estimated, no other parameter is needed to describe the spread. This is why the methods for continuous variables must change.
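The two SE formulas can be put side by side in a short sketch. The values π=0.3 and σ=10 are arbitrary illustrations and the function names are made up for this example. Quadrupling n halves both SEs.

```python
import math

def se_proportion(pi, n):
    """SE of a sample proportion: sqrt(pi * (1 - pi) / n). Needs only pi and n."""
    return math.sqrt(pi * (1 - pi) / n)

def se_mean(sigma, n):
    """SE of a sample mean: sigma / sqrt(n). Needs a separate spread parameter sigma."""
    return sigma / math.sqrt(n)

# Illustrative values only: pi = 0.3 for the binary case, sigma = 10 for the continuous case.
for n in (25, 100, 400):
    print(n, round(se_proportion(0.3, n), 4), round(se_mean(10.0, n), 4))
```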
Example for why binary SE formula used?
If π=0, it is impossible for r (the number in the sample with the attribute) to be anything other than 0, so the SE must be 0. This is why π is a factor in the numerator. Similarly, if π=1 then r=n and there can be no error. This is why (1-π) is also a factor in the numerator: it again gives 0 and no error.
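To see the behaviour at the extremes, a small sketch that sweeps π for a fixed, arbitrarily chosen n=50: the SE is 0 at π=0 and π=1 and largest at π=0.5.

```python
import math

def se_proportion(pi, n):
    """SE of a sample proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)

n = 50  # arbitrary sample size, for illustration only
for pi in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    # The product pi * (1 - pi) vanishes at both ends, so the SE does too.
    print(f"pi = {pi:.2f}  SE = {se_proportion(pi, n):.4f}")
```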
What is χ2 test an analogue of?
The unpaired t-test! Both compare two independent groups: proportions here, means there.
Null hypothesis in χ2?
That π1=π2, i.e. that the proportion with the attribute is the same in the two populations from which the samples were drawn. Work out how many you would expect to have the outcome (e.g. die) in each group if the population proportions were equal, and compare with the observed counts.
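A sketch of the expected counts under the null hypothesis, assuming the standard rule (row total × column total / grand total). The 2×2 table is hypothetical: two groups of 100, with 15 and 25 deaths.

```python
def expected_counts(table):
    """Expected cell counts if the row proportions were equal in the population."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical data: rows are groups, columns are (died, survived).
observed = [[15, 85],
            [25, 75]]
expected = expected_counts(observed)

# Overall 40 of 200 died, so each group of 100 is expected to have 20 deaths.
for obs_row, exp_row in zip(observed, expected):
    print(obs_row, exp_row)
```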
χ2 size and sign?
Can never be negative, and is 0 only when the observed and expected tables are exactly identical.
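Both properties follow from the form of the statistic. The sketch below assumes the usual Pearson form, the sum of (observed - expected)²/expected over all cells, without a continuity correction. The tables are hypothetical.

```python
def chi_square(observed, expected):
    """Pearson chi-squared statistic: sum over cells of (O - E)^2 / E."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical observed counts with their expected counts under the null.
observed = [[15, 85], [25, 75]]
expected = [[20.0, 80.0], [20.0, 80.0]]

# Every term is a square divided by a positive number, so the sum cannot be negative.
print(chi_square(observed, expected))
# Identical observed and expected tables give exactly 0.
print(chi_square(expected, expected))
```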
Things to remember about χ2?
Must be counts, not %. This is because the test must account for sample size, i.e. 2/10 is not the same strength of evidence as 200/1000, and the latter will be much more sensitive to departures from expected. Counts must also be independent, i.e. remember the example of counting the same children again and again. Otherwise the P value will look much more convincing purely because the counts are larger, but the difference is no more real. A useful check is to make sure that the bottom right number in the margins, i.e. the grand total, is the same as the number of independent units.
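A sketch of why counts matter: two hypothetical tables with identical percentages (20% vs 50%) but 100 times the sample size. It assumes the Pearson statistic without continuity correction and a 1-degree-of-freedom P value computed as erfc(√(χ2/2)). The small table has expected counts below 5, so it is used here only to show the scaling.

```python
import math

def chi_square_from_table(table):
    """Pearson chi-squared statistic computed from a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_1df(stat):
    """Upper-tail P value of a chi-squared statistic on 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

small = [[2, 8], [5, 5]]            # 2/10 vs 5/10
large = [[200, 800], [500, 500]]    # 200/1000 vs 500/1000: same percentages

chi_small = chi_square_from_table(small)
chi_large = chi_square_from_table(large)
p_small = p_value_1df(chi_small)
p_large = p_value_1df(chi_large)

# Multiplying every count by 100 multiplies the statistic by 100.
print(chi_small, p_small)
print(chi_large, p_large)
```

The same inflation happens if one independent unit is counted many times, which is why the grand total should match the number of independent units.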
When is χ2 not appropriate?
When any expected value is less than 5 in a 2×2 table. If the table is larger, when over 20% of cells have expected values below 5, or any cell has an expected value below 1. Use Fisher’s exact test instead!
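The rule of thumb above can be written as a check on the expected counts. This is a sketch of the rule as stated in this card, with hypothetical function names and example tables.

```python
def expected_counts(table):
    """Expected cell counts: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_appropriate(table):
    """Apply the expected-count rule of thumb from the card above."""
    cells = [e for row in expected_counts(table) for e in row]
    if len(table) == 2 and len(table[0]) == 2:
        # 2x2 table: every expected value must be at least 5.
        return all(e >= 5 for e in cells)
    # Larger table: at most 20% of cells below 5, and none below 1.
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return min(cells) >= 1 and share_below_5 <= 0.20

print(chi_square_appropriate([[15, 85], [25, 75]]))  # expected counts 20 and 80
print(chi_square_appropriate([[2, 8], [5, 5]]))      # expected counts 3.5 and 6.5
```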
Why is Fisher’s different to others?
Calculates the P value directly from the data, rather than going through a test statistic like χ2 or the t statistic.
Why is it called Fisher’s exact?
No need for the asymptotic approximation seen in χ2: the P value comes from the exact probabilities of the possible tables.
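A minimal sketch of the exact calculation for a 2×2 table, assuming fixed margins (so each possible table has a hypergeometric probability) and one common two-sided convention: sum the probabilities of all tables no more likely than the observed one. The data are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact P value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability of the table whose top-left cell is x, with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so that ties with the observed table are included despite rounding.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical small table where chi-squared would not be appropriate.
p = fisher_exact_two_sided([[3, 1], [1, 3]])
print(round(p, 4))
```

No reference distribution is consulted at any point: the P value is just a sum of exact table probabilities.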
Why not use Fisher’s all the time?
Computing power, which grows quickly for large counts and large tables. Also, get wide CIs.
Finding probability from odds?
If odds = 2, then probability = 2/(1+2) = 2/3. In general, probability = odds/(1+odds).
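The conversion in both directions, as a short sketch (function names are illustrative):

```python
def odds_to_probability(odds):
    """probability = odds / (1 + odds)"""
    if odds < 0:
        raise ValueError("odds cannot be negative")
    return odds / (1 + odds)

def probability_to_odds(p):
    """odds = p / (1 - p); undefined when p = 1."""
    if not 0 <= p < 1:
        raise ValueError("need 0 <= p < 1")
    return p / (1 - p)

prob = odds_to_probability(2)          # the card's example: odds of 2
odds_back = probability_to_odds(prob)  # converting back recovers the odds
print(prob, odds_back)
```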