Chapter 18 Flashcards
Chi-square distribution
a probability distribution of the sum of squares of several normally distributed variables. It tends to be used to test hypotheses about categorical data, and to test the fit of models to the observed data.
Chi-square test
although this term can apply to any test statistic having a chi-square distribution, it generally refers to Pearson’s chi-square test of the independence of two categorical variables. Essentially it tests whether two categorical variables forming a contingency table are associated.
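The statistic described above can be sketched in a few lines of Python; the 2 x 2 counts below are hypothetical, purely for illustration:

```python
# Pearson's chi-square test of independence, computed by hand.
# The 2 x 2 counts are made up for illustration, not taken from the text.
observed = [[30, 10],
            [15, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence: (row total * column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over every cell
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

print(round(chi_square, 3))  # 11.429
```

The resulting value is compared against a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom to obtain a p-value.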
Contingency table
a table representing the cross-classification of two or more categorical variables. The levels of each variable are arranged in a grid, and the number of observations falling into each category is noted in the cells of the table. For example, if we took the categorical variables of glossary (with two categories: whether an author was made to write a glossary or not), and mental state (with three categories: normal, sobbing uncontrollably and utterly psychotic), we could construct a table as in the textbook. This instantly tells us that 127 authors who were made to write a glossary ended up as utterly psychotic, compared to only 2 who did not write a glossary.
Cramér’s V
a measure of the strength of association between two categorical variables used when one of these variables has more than two categories. It is a variant of phi used because when one or both of the categorical variables contain more than two categories, phi fails to reach its maximum value of 1 (indicating a perfect association).
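Given a chi-square statistic, the definition above translates into a one-line formula; the numbers here are hypothetical, chosen only to show the calculation:

```python
import math

# Cramér's V from a chi-square statistic (hypothetical values for a
# 3 x 4 contingency table; not taken from the text).
chi_square = 22.4   # Pearson's chi-square for the table
n = 80              # total number of observations
k = min(3, 4)       # the smaller of the number of rows and columns

v = math.sqrt(chi_square / (n * (k - 1)))
print(round(v, 3))  # 0.374
```

Dividing by n × (k − 1) is what keeps V between 0 and 1 regardless of the table's size.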
Fisher’s exact test
Fisher’s exact test (Fisher, 1922) is not so much a test as a way of computing the exact probability of a statistic. It was designed originally to overcome the problem that with small samples the sampling distribution of the chi-square statistic deviates substantially from a chi-square distribution. It should be used with small samples.
Goodman and Kruskal’s λ
measures the proportional reduction in error that is achieved when membership of a category of one variable is used to predict category membership of the other variable. A value of 1 means that one variable perfectly predicts the other, whereas a value of 0 indicates that one variable in no way predicts the other.
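The proportional-reduction-in-error idea can be made concrete with a small example; the counts below are invented for illustration:

```python
# Goodman and Kruskal's lambda for predicting the column variable from
# the row variable (hypothetical 2 x 2 counts, not from the text).
observed = [[40, 10],
            [ 5, 45]]

n = sum(sum(row) for row in observed)
col_totals = [sum(col) for col in zip(*observed)]

# Errors made without the predictor: always guess the modal column
errors_without = n - max(col_totals)
# Errors made with the predictor: within each row, guess that row's modal column
errors_with = sum(sum(row) - max(row) for row in observed)

lam = (errors_without - errors_with) / errors_without
print(round(lam, 3))  # 0.667: knowing the row cuts prediction errors by two-thirds
```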
Loglinear analysis
a procedure used as an extension of the chi-square test to analyse situations in which we have more than two categorical variables and we want to test for relationships between these variables. Essentially, a linear model is fitted to the data that predicts expected frequencies (i.e., the number of cases expected in a given category). In this respect it is much the same as analysis of variance but for entirely categorical data.
Odds ratio
the ratio of the odds of an event occurring in one group compared to another. So, for example, if the odds of dying after writing a glossary are 4, and the odds of dying after not writing a glossary are 0.25, then the odds ratio is 4/0.25 = 16. This means that the odds of dying if you write a glossary are 16 times higher than if you don’t. An odds ratio of 1 would indicate that the odds of a particular outcome are equal in both groups.
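The arithmetic in the entry above is easy to verify directly, using the same figures:

```python
# Odds ratio using the figures from the glossary entry above.
odds_glossary = 4.0      # odds of dying after writing a glossary
odds_no_glossary = 0.25  # odds of dying after not writing one

odds_ratio = odds_glossary / odds_no_glossary
print(odds_ratio)  # 16.0
```

In practice each group's odds would first be computed as the number of events divided by the number of non-events in that group.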
Phi
a measure of the strength of association between two categorical variables. Phi is used with 2 x 2 contingency tables (tables which have two categorical variables and each variable has only two categories). Phi is a variant of the chi-square statistic, χ², calculated as φ = √(χ²/n), in which n is the total number of observations.
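Using that formula, phi is a single line of code; the chi-square value and sample size below are hypothetical:

```python
import math

# Phi coefficient from a chi-square statistic for a 2 x 2 table
# (hypothetical values, not taken from the text).
chi_square = 11.43  # Pearson's chi-square for the 2 x 2 table
n = 80              # total number of observations

phi = math.sqrt(chi_square / n)
print(round(phi, 3))  # 0.378
```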
Saturated model
a model that perfectly fits the data and, therefore, has no error. It contains all possible main effects and interactions between variables.
Yates’s continuity correction
an adjustment made to the chi-square test when the contingency table is 2 rows by 2 columns (i.e., there are two categorical variables both of which consist of only two categories). In large samples the adjustment makes little difference and is slightly dubious anyway (see Howell, 2012).
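A minimal sketch of the adjustment, assuming made-up 2 x 2 counts: subtract 0.5 from each |observed − expected| before squaring, which pulls the statistic down slightly:

```python
# Yates's continuity correction on a 2 x 2 table (hypothetical counts,
# not from the text).
observed = [[12, 5],
            [ 4, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

cells = [(o, e) for obs_row, exp_row in zip(observed, expected)
                for o, e in zip(obs_row, exp_row)]

uncorrected = sum((o - e) ** 2 / e for o, e in cells)
corrected = sum((abs(o - e) - 0.5) ** 2 / e for o, e in cells)

# The corrected statistic is always a little smaller than the uncorrected one
print(round(uncorrected, 3), round(corrected, 3))
```

With large samples the gap between the two values becomes negligible, which is the point made in the entry above.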