Topic 12 Flashcards
Why might we use a Chi square test instead of a T test
This is because tests such as T tests only handle quantitative data, whereas Chi square tests can handle qualitative (categorical) data. Additionally, Chi square tests allow testing across three or more categories, while a T test can compare at most two groups
What are the different functions of the Chi square tests?
(3 different ways)
Goodness of fit
Homogeneity
Independence
How can χ2 tests be used for goodness of fit
Goodness of fit tests whether the observed values match up with the expected values.
They test a hypothesis about the distribution (model) of a qualitative variable in a population
I.e. do eye colours follow the following distribution: brown 45%, blue 27%, hazel 18%, green 10%? These percentages give the expected values, which are then compared with the observed values
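A minimal sketch of a goodness of fit test in R - the eye-colour counts here are made up for illustration:

observed <- c(brown = 60, blue = 30, hazel = 25, green = 10)   # hypothetical observed counts
model_props <- c(0.45, 0.27, 0.18, 0.10)                       # expected proportions (must sum to 1)
chisq.test(observed, p = model_props)                          # goodness of fit test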
How can χ2 tests be used to test for homogeneity
Tests a hypothesis about the distribution of a qualitative variable in several populations
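A sketch of a homogeneity test in R, assuming hypothetical counts of the same qualitative variable collected from two populations:

counts <- rbind(pop1 = c(45, 27, 18, 10),   # hypothetical counts, one row per population
                pop2 = c(50, 20, 20, 10))
chisq.test(counts)   # tests whether the distribution is the same in each population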
How can χ2 tests be used to test for independence
Tests a hypothesis about the relationship between two qualitative variables in a population - i.e. whether they are independent or not (is there an association between the two)
I.e. is there an association between parent eye colour (qualitative) and child eye colour (qualitative)
How is the test statistic for all of the tests above calculated
χ2 (test statistic) = sum over all categories of ( (observed frequency - expected frequency)^2 / expected frequency )
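A sketch of this calculation in R, reusing the hypothetical eye-colour counts from above:

observed <- c(60, 30, 25, 10)
expected <- sum(observed) * c(0.45, 0.27, 0.18, 0.10)   # expected counts under the model
stat <- sum((observed - expected)^2 / expected)         # the χ2 test statistic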
What are the hypotheses for the chi square test for goodness of fit
H0 - the model fits the data (observed frequencies match the expected frequencies)
H1 - the model doesn't fit the data
What are the assumptions involved in chi square tests for goodness of fit
None of the expected counts are 0, and no more than 20% of the expected counts are less than 5 (Cochran's rule)
What is Cochran’s rule
No more than 20% of the expected counts are less than 5 - in other words, we want at least 80% of the categories to have an expected count of 5 or more
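A quick way to check Cochran's rule in R, reusing observed and model_props from the goodness of fit sketch above (note chisq.test() itself also warns when expected counts are small):

expected <- chisq.test(observed, p = model_props)$expected   # expected counts under H0
all(expected > 0)                                            # no empty expected cells
mean(expected < 5) <= 0.2                                    # TRUE if no more than 20% are below 5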
How do we calculate the number of degrees of freedom from a chi square test
df = n-1, where n = number of categories
How do we find the p-value from a chi square test
We use the χ2 curve with n-1 degrees of freedom (n = number of categories) to find the upper tail area
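In R the upper tail area can be read off with pchisq(), continuing the hypothetical sketch above:

pchisq(stat, df = length(observed) - 1, lower.tail = FALSE)   # p-value = upper tail area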
How would we use chi square test to test for independence between two variables
We typically summarise the two qualitative variables in a contingency table, then visualise the table with a mosaic plot, before running the test.
What is the code for the chi square test
chisq.test(dataset)
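A sketch of the full workflow in R - the data frame eyes and its columns parent_colour and child_colour are hypothetical:

tab <- table(eyes$parent_colour, eyes$child_colour)   # contingency table of the two variables
mosaicplot(tab)                                       # visualise the association
chisq.test(tab)                                       # test for independence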
What are the hypotheses involved with the chi square test for independence
H0 - variables are independent
H1 - variables aren't independent (there is an association)
What are the assumptions involved with the chi square test for independence
None of the expected counts are empty, and no more than 20% are less than 5 (Cochran's rule) - the same assumptions as the chi square test for goodness of fit
How do we calculate the test statistic from chi square test for independence
χ2 = sum over all categories of ( (observed frequency - expected frequency)^2 / expected frequency )
How do we calculate p - values for chi square test for independence
We use the χ2 curve with df = (m-1)(n-1) degrees of freedom to find the upper tail area, where m = number of categories of variable 1 and n = number of categories of variable 2
How do we calculate degrees of freedom for chi square test for independence
We can use df = (m-1)(n-1), where m = number of categories of variable 1 (i.e. rows) and n = number of categories of variable 2 (i.e. columns). E.g. a 3 x 4 table gives df = 2 x 3 = 6
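In R this falls out of the table's dimensions, reusing the hypothetical tab from the independence sketch above:

dim(tab)             # m rows, n columns
prod(dim(tab) - 1)   # df = (m - 1)(n - 1)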
What happens when df = 1, or there is a 2 x 2 contingency table
When this occurs, R automatically applies the Yates continuity correction to the p-value obtained.
What does Yates continuity correction do
It makes the test more conservative (larger p-values), because with small sample sizes or a small number of categories the χ2 approximation can be biased. This ultimately helps reduce the number of type 1 errors (false positives)
How can we turn off Yates continuity correction
We can set correct = FALSE in chisq.test()
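For example, with a hypothetical 2 x 2 contingency table tab2:

chisq.test(tab2, correct = FALSE)   # p-value without the Yates continuity correction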
So what are the differences in the HATPC process for the use of chi square test for independence vs goodness of fit
Only differences are in the hypotheses, and the way the degrees of freedom are calculated
Is it possible to use T tests to test the significance of a linear slope
Yes, it is possible, and it is often used
What are the hypotheses typically involved in testing for the significance of the slope
H0 - no significant linear trend (slope = 0)
H1 - significant linear trend (slope ≠ 0)
(normally)
What are the assumptions involved in testing for the significance of a linear slope
Residuals need to be independent
Residuals should follow a normal distribution
Residuals should have a constant variance
Relationship between dependent and independent variable should look linear
How can we check for homoscedasticity
Check residual plot for no observable pattern
How can we check for residuals following normal distribution
Use a QQ plot and also a Shapiro-Wilk test
How can we check that residuals are independent
Check residual plot
How can we check that there is a linear relationship between dependent and independent variables
Check the scatter plot for a linear relationship
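A sketch of these checks in R, assuming a hypothetical data frame dat with columns x and y:

model <- lm(y ~ x, data = dat)               # fit the linear model
plot(dat$x, dat$y)                           # linearity: scatter plot should look linear
plot(fitted(model), resid(model))            # homoscedasticity/independence: no pattern
qqnorm(resid(model)); qqline(resid(model))   # normality: points should hug the line
shapiro.test(resid(model))                   # formal normality test on the residuals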
What is the test statistic for testing for significant linear relationship
T = (OV - EV) / SE = (estimated slope - 0) / SE(slope), with n-2 degrees of freedom (n = number of observations)
How is p - value obtained for linear relationship
We compare the test statistic against the T curve with n-2 degrees of freedom, and find the tail areas
We can also read it straight off the summary() output of the fitted model
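Continuing the hypothetical model above, the slope's t statistic and p-value appear in the coefficients table:

summary(model)                 # the x row gives the slope estimate, SE, t value and p-value
summary(model)$coefficients    # the same table as a matrix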