Lecture 11- The Chi Square Distributions Flashcards
Chi-square distributions
are used for several different types of tests, including goodness-of-fit tests and tests of population variance; our concern in this lecture is the use of Chi-square tests to conduct tests of independence.
Chi-square distributions explained
If we can classify the members of a population according to two attributes, a test of independence aims to determine whether the attributes are independent of each other or have some bearing on each other.
*We make use of a concept we encountered in probability theory: contingency tables.
A contingency table is a tabular presentation of the results of a random sample that relates to two random variables.
Contingency tables-example
Consider the table of sample data below.
This data relates to A’ Level grades (otherwise called the performance) for a random sample of sixth form students drawn from four of the top secondary schools (PC, CIC, NGHS & BAHS) in Trinidad and Tobago.
Due to the manner in which the data is recorded, such a table is called a contingency table.
           PC   CIC   NGHS   BAHS   Total
Grade A    40    25     15     10      90
Grade B    20    10     10      5      45
Grade C    40    15      5      5      65
Total     100    50     30     20     200
In our example, there are two random variables: (1) the A’ Level grade of the student and (2) the secondary school attended by the student.
*The interesting question about this table is: was the performance influenced by the school attended?
*In other words, do we have evidence here to suggest that the row variable (performance) is influenced by the columns (school attended)?
*Put yet another way, are the rows independent of the columns?
*This question is the basis of the very famous χ² (Chi Square) test in statistics.
Generic form of the hypotheses
In other words, this sample is to be used to test:
*The null hypothesis that the row variable is independent of the column variable.
*The alternative hypothesis that the row variable is dependent on the column variable.
Accordingly, the related test is called a test of independence.
Test of independence
The requirements of a Test of Independence are not different from earlier tests, namely:
–Null hypothesis
–Alternative hypothesis
–Significance level
–Test Statistic
–Critical Region
–Conclusion
The Null Hypothesis
*H0: Row Variable is independent of the Column Variable
The Alternative Hypothesis
*H1: Row Variable is dependent on the Column Variable
The Multiplication Law
The Multiplication Law: P(B and C) = P(B|C) x P(C)
*Do you remember how we defined an independent event?
*Two events are said to be independent if the occurrence of one does not affect the probability of the occurrence of the other, that is, P(B | C) = P(B).
*Therefore, the special case of the multiplication law of probability states that given any two independent events B and C from the same sample space,
P(B and C) = P(B) x P(C).
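The special case above is what turns the null hypothesis into numbers: if rows and columns are independent, the expected count in a cell is n x P(row) x P(column), which simplifies to (row total x column total) / n. The sketch below applies this to the lecture's table. The function name `expected_counts` is illustrative and does not come from the lecture.

```python
# Observed counts from the lecture's contingency table.
# Rows: Grade A, Grade B, Grade C; columns: PC, CIC, NGHS, BAHS.
observed = [
    [40, 25, 15, 10],
    [20, 10, 10, 5],
    [40, 15, 5, 5],
]


def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for label, row in zip(["Grade A", "Grade B", "Grade C"], expected):
    print(label, row)
```

For example, the Grade A / PC cell has an expected count of 90 x 100 / 200 = 45, against an observed count of 40.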
The Chi Square Distribution
Go to the Chi Square Distribution Tables (look now at Table 8, which is the third of your Statistical Tables)
*You can see the shape of the Chi-Square Distribution, which is positively skewed.
*The columns give you the “100α percentage points”; in other words, for a 5% level of significance, you look at the column labelled 0.05, and so on.
*The Chi Square Distribution has only one parameter, i.e. the degrees of freedom.
*The Chi Square test statistic follows a Chi Square Distribution with (r - 1) x (c - 1) degrees of freedom, where r is the number of rows and c is the number of columns of the contingency table.
*The rows give “ν degrees of freedom”. What does this mean?
The Chi Square Distribution
The next step of Hypothesis Testing is to find the Critical Region or the Rejection Region, against which we compare the Test Statistic
*Let us choose a 5% significance level
*In our example, the table has 3 rows and 4 columns, so d.f. = (3 - 1) x (4 - 1) = 6.
*For 6 d.f. and a 5% significance level, we have a critical value of 12.592.
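The remaining steps can be sketched in a few lines. The notes do not spell out the test statistic, so this sketch assumes the usual Pearson form, the sum over all cells of (observed - expected)^2 / expected. It also checks the tabulated critical value with a closed-form upper-tail probability that holds only for an even number of degrees of freedom. The function names are illustrative.

```python
import math

# Observed counts from the lecture's table (rows: grades A, B, C).
observed = [
    [40, 25, 15, 10],
    [20, 10, 10, 5],
    [40, 15, 5, 5],
]


def chi_square_statistic(table):
    """Pearson statistic (assumed form): sum of (O - E)^2 / E over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat


def upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with a positive, even df.

    Closed form: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form only covers positive even df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** j / math.factorial(j) for j in range(df // 2)
    )


df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1) x (c - 1)
critical_value = 12.592  # from the table: 6 d.f., 5% significance level
statistic = chi_square_statistic(observed)
reject_h0 = statistic > critical_value

print("d.f.:", df)
print("statistic:", round(statistic, 3))
print("tail area beyond the critical value:",
      round(upper_tail_even_df(critical_value, df), 3))
print("reject H0:", reject_h0)
```

Under these assumptions the statistic for the sample table comes out below 12.592, so the sketch would not reject the null hypothesis of independence at the 5% level.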
Limitations of the Chi Square Test
The test is limited to two variables.
*The contingency table must have at least 2 rows and 2 columns.
*Too many cells with expected frequency less than 5 limit the accuracy of the decision arising from the test. Accordingly, the number of cells with expected frequency less than 5 must be limited to 20% of all cells; otherwise, the decision will be invalid.
*The quality of the decision is influenced by the quality of the data collection.
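The 20% rule above is easy to check before running the test. The sketch below computes the share of cells whose expected frequency falls below 5 for the lecture's table. The helper name `small_expected_share` and the default threshold argument are illustrative choices, not part of the lecture.

```python
# Observed counts from the lecture's table (rows: grades A, B, C).
observed = [
    [40, 25, 15, 10],
    [20, 10, 10, 5],
    [40, 15, 5, 5],
]


def small_expected_share(table, threshold=5):
    """Fraction of cells whose expected frequency is below the threshold."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = [r * c / n for r in row_totals for c in col_totals]
    return sum(e < threshold for e in cells) / len(cells)


share = small_expected_share(observed)
rule_satisfied = share <= 0.20  # at most 20% of cells may fall below 5
print(f"{share:.1%} of cells have expected frequency below 5")
print("20% rule satisfied:", rule_satisfied)
```

In the sample table only the Grade B / BAHS cell (expected frequency 4.5) falls below 5, which is 1 cell out of 12.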
Note that we can accommodate a contingency table with 2 rows and 2 columns by applying Yates' correction.
*Yates' correction involves subtracting 0.5 from the absolute difference between observed and expected frequencies before squaring. The effect of the correction becomes negligible as the d.f. increase.
*Should the number of cells with expected frequency less than 5 exceed 20% of all cells, adjoining rows and columns must be merged and the test repeated on the amended contingency table.
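To make Yates' correction concrete, the sketch below compares the statistic with and without the correction on a 2 x 2 table. The counts are made up purely for illustration and do not come from the lecture. The statistic is assumed to take the usual Pearson form, and the function name `chi_square_2x2` is hypothetical.

```python
# Hypothetical 2 x 2 table, invented for illustration only.
table_2x2 = [
    [12, 8],
    [5, 15],
]


def chi_square_2x2(table, yates=False):
    """Pearson statistic for a table, optionally with Yates' correction.

    With yates=True, 0.5 is subtracted from |O - E| before squaring,
    as described in the notes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            diff = abs(obs - exp)
            if yates:
                diff -= 0.5
            stat += diff ** 2 / exp
    return stat


uncorrected = chi_square_2x2(table_2x2)
corrected = chi_square_2x2(table_2x2, yates=True)
print("without correction:", round(uncorrected, 3))
print("with Yates' correction:", round(corrected, 3))
```

The corrected statistic is smaller than the uncorrected one, which makes the test more conservative for 2 x 2 tables.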