Week 9, 10, 11, 12 Flashcards
Expected contingency table
The contingency table you expect from a population where the null hypothesis is true.
Hypothesis testing for categorical data is based on __
contingency tables
Difference between 1 way and 2 way expected contingency tables
A 1-way table looks at whether there are differences in the counts between the levels of a single variable
-Expectation: Counts are distributed equally among cells - take the total count and divide by the number of levels
A 2-way table looks at whether the counts are independent between the two variables
-Expectation: Counts are distributed independently among cells
How to calculate independence for 2 way tables
Calculating independence requires first calculating the marginal distribution as proportions
The expected count for each cell is the product of its row proportion and column proportion, multiplied by the table total
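A minimal sketch of this calculation in Python with NumPy (the observed counts below are invented for illustration; the course itself may use different software):

```python
import numpy as np

# Hypothetical observed 2-way contingency table
# (rows = levels of variable 1, columns = levels of variable 2)
observed = np.array([[30, 10],
                     [20, 40]])

total = observed.sum()
row_props = observed.sum(axis=1) / total   # marginal row proportions
col_props = observed.sum(axis=0) / total   # marginal column proportions

# Expected cell = row proportion * column proportion * table total
expected = np.outer(row_props, col_props) * total
print(expected)   # [[20. 20.] [30. 30.]]
```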
What does the chi-square score measure
Measures the distance between observed and expected contingency tables. It works by calculating the squared difference between the two tables on a cell-by-cell basis
The four steps in calculating the chi-square score
Take the difference between each observed and expected cell
Square the difference
Divide by the expected value
Sum over all cells in the table
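A minimal, self-contained sketch of these four steps, using the same invented counts as above:

```python
import numpy as np

# Hypothetical observed and expected tables (same invented counts as above)
observed = np.array([[30, 10],
                     [20, 40]], dtype=float)
expected = np.array([[20, 20],
                     [30, 30]], dtype=float)

# 1) difference, 2) square, 3) divide by expected, 4) sum over all cells
chi_sq = ((observed - expected) ** 2 / expected).sum()
print(round(chi_sq, 2))   # 16.67
```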
Chi Square Distribution
The null distribution for hypothesis testing with categorical data
It is the distribution of chi-square scores you would get from sampling an imaginary statistical population where the null hypothesis was true
ONLY POSITIVE VALUES
Degrees of freedom for 1 way tables vs 2 way tables
1 way: df=k-1
Number of cells minus 1
2 way: df=(r-1)(c-1)
(rows-1 multiplied by columns-1)
Statistical test conclusions for chi square scores
-Reject the null hypothesis if X^2 observed > X^2 critical or if p < alpha
-Fail to reject the null hypothesis if X^2 observed < X^2 critical or if p > alpha
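A sketch of the decision rule, assuming SciPy is available; the observed chi-square score and degrees of freedom are carried over from the hypothetical example above:

```python
from scipy import stats

chi_sq_obs = 16.67   # hypothetical observed chi-square score
df = 1               # 2x2 table: (2-1)*(2-1) = 1
alpha = 0.05

chi_sq_crit = stats.chi2.ppf(1 - alpha, df)   # critical value
p_value = stats.chi2.sf(chi_sq_obs, df)       # p-value (upper tail)

# Reject H0 if observed > critical, or equivalently if p < alpha
reject = (chi_sq_obs > chi_sq_crit) or (p_value < alpha)
print(chi_sq_crit, p_value, reject)
```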
Reports for chi square tests should include:
Name of Test
Degrees of freedom
Total count in the observed table
The observed chi-squared value (two decimal places)
P-value (three decimal places)
Write out the formula for the t-observed for correlation
Refer to formula sheet
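For reference, the usual form of this statistic (double-check it against the course formula sheet) is:

$$ t_{\text{obs}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad df = n-2 $$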
Reporting of correlation test should include
Symbol for the test
Degrees of freedom
Observed correlation value
P-value (three decimal places)
Correlation
Evaluate the association between two numerical variables (looking for a pattern)
No implied causation between the variables
Both variables are assumed to have variation
Not used for prediction
Pearson’s correlation coefficient (r for a sample, ρ for a population) measures the strength of the association
Write out the formula for r
refer to formula sheet
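For reference, the usual form of Pearson's r (double-check it against the course formula sheet) is:

$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^{2}\,\sum_{i}(y_i - \bar{y})^{2}}} $$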
Difference between correlation and linear regression
Correlation cannot be used for prediction; linear regression can (e.g. in experimental studies)
What is the linear equation? Describe the components
y=a+b(Xi)
Slope (b):
-Amount that the response variable (y) increases or decreases for every unit change in the predictor variable (x)
Intercept (a):
-The value of the response variable (y) when the predictor variable (x) is at 0
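A minimal sketch of fitting and using the linear equation in Python with NumPy; the x and y values are invented for illustration:

```python
import numpy as np

# Hypothetical data: predictor x and response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Least-squares fit of y = a + b*x; polyfit returns [slope, intercept] for degree 1
b, a = np.polyfit(x, y, 1)

print(f"intercept a = {a:.2f}")   # predicted y when x = 0
print(f"slope b = {b:.2f}")       # change in y per unit change in x
y_pred = a + b * x                # predictions from the linear equation
```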
Statistical Model
-3 components:
Systematic Component: describes the mathematical function used for predictions
Linear equation
Random component: describes the probability distribution for sampling error (for linear regression, the Normal distribution)
Error distribution
Only occurs in response variable
Link Function: connects the systematic component to the random component
How to calculate the sum of squares
- Calculate the residual for each data point
- Take the square of each residual
- Sum the squared residuals across all data points
- Divide by the degrees of freedom (this last step turns the sum of squares into the mean square, i.e. the residual variance)
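A minimal sketch of these steps in Python, reusing the invented data from the linear-equation example:

```python
import numpy as np

# Hypothetical data and fitted line y = a + b*x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
b, a = np.polyfit(x, y, 1)

residuals = y - (a + b * x)            # residual for each data point
ss_residual = np.sum(residuals ** 2)   # sum of squared residuals
df = len(y) - 2                        # simple linear regression: n - 2
ms_residual = ss_residual / df         # mean square (residual variance)
print(ss_residual, ms_residual)
```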
Null and alternative hypothesis for linear regression
Intercept: How does the intercept (a) relate to a reference value (βa)?
Slope: How does the slope (b) relate to a reference value (βb)?
What is the null distribution in linear regression hypothesis tests
a t-distribution
What are the 4 main assumptions of linear regression
- Linearity
The response variable is a linear combination of the predictor variable (y = a + bx is a straight line)
- Independence
The residuals along the predictor variable should be independent of each other
Evaluated qualitatively using a plot of residuals against the predictor variable
- Normality
Residual variation should be Normally distributed
Evaluated using the Shapiro-Wilk test
- Homoscedasticity
The residual variation should be similar across the range of the predictor variable
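A sketch of how the Normality assumption could be checked with SciPy's Shapiro-Wilk test on the residuals (the data are invented; independence and homoscedasticity are checked visually with a residual plot):

```python
import numpy as np
from scipy import stats

# Hypothetical data and fitted line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# Normality: Shapiro-Wilk test on the residuals
w_stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")

# Independence and homoscedasticity: inspect a plot of residuals
# against the predictor (e.g. plt.scatter(x, residuals) with matplotlib)
```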
F test
Evaluates the difference in variance between two groups
Null and alternative hypothesis of F-test
H0: the ratio of the variances is 1
Ha: the ratio of the variances is not 1
What to report for F-score
Mean, sd, sample size for each group
Observed F score
Degrees of freedom for each group
P-value
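A sketch of the F-test calculation, assuming SciPy; the group measurements are invented, and the two-sided p-value shown is one common convention:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two groups
group1 = np.array([4.1, 5.0, 6.2, 5.5, 4.8, 5.9])
group2 = np.array([3.9, 7.1, 2.8, 6.5, 8.0, 4.2])

var1 = np.var(group1, ddof=1)   # sample variances
var2 = np.var(group2, ddof=1)

f_obs = var1 / var2                          # observed F = ratio of variances
df1, df2 = len(group1) - 1, len(group2) - 1  # degrees of freedom per group
p_value = 2 * min(stats.f.cdf(f_obs, df1, df2),
                  stats.f.sf(f_obs, df1, df2))   # two-sided p-value
print(f"F = {f_obs:.2f}, df = ({df1}, {df2}), p = {p_value:.3f}")
```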
Single Factor ANOVA
Used when working with a numerical response variable and a categorical predictor variable
Group variation: the variation among means of the categorical levels
Residual variation: the variation among sampling units within each level of the categorical variable
ANOVA evaluates whether there is a difference in means among categorical levels
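A minimal single-factor ANOVA sketch using scipy.stats.f_oneway; the three groups of measurements are invented:

```python
from scipy import stats

# Hypothetical measurements for three levels of a categorical variable
level_a = [4.1, 5.0, 6.2, 5.5]
level_b = [6.8, 7.4, 6.1, 7.9]
level_c = [5.2, 4.7, 5.9, 5.1]

# Single-factor ANOVA: F is the ratio of group variation to residual variation
f_stat, p_value = stats.f_oneway(level_a, level_b, level_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```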
Post-Hoc tests
Secondary statistical test designed to indicate which groups have different means
Only used if the ANOVA F-test indicates rejecting the null hypothesis
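One common post-hoc choice is Tukey's HSD; a sketch using statsmodels' pairwise_tukeyhsd (the course may prescribe a different post-hoc test, and the data are invented):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical data: one response value per sampling unit plus its group label
values = np.array([4.1, 5.0, 6.2, 5.5, 6.8, 7.4, 6.1, 7.9, 5.2, 4.7, 5.9, 5.1])
groups = np.array(["a"] * 4 + ["b"] * 4 + ["c"] * 4)

# Tukey's HSD post-hoc test: pairwise comparisons of the group means
result = pairwise_tukeyhsd(values, groups, alpha=0.05)
print(result)
```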
Two Factor ANOVA
Two categorical variables and their interaction
Two factor ANOVA is used to answer 3 questions
Main effect of A
-Differences among the levels of factor A, averaging across the levels of factor B
Main effect of B
-Differences among the levels of factor B, averaging across the levels of factor A
Interactions
Differences among the levels of one factor within each level of the other factor
Cell-by-cell comparisons
Interactions are deviations from____
additivity: when the effects of the levels are their simple sum
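A sketch of a two-factor ANOVA with an interaction term, assuming the statsmodels formula interface; the design and response values are invented:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical balanced design: 2 levels of factor A x 2 levels of factor B,
# 3 replicates per cell
data = pd.DataFrame({
    "a": ["low"] * 6 + ["high"] * 6,
    "b": (["cold"] * 3 + ["warm"] * 3) * 2,
    "response": [4.1, 4.5, 3.9, 6.2, 6.8, 6.5,
                 5.0, 5.3, 4.8, 9.1, 9.6, 9.3],
})

# Two-factor ANOVA with interaction: main effect A, main effect B, and A:B
model = smf.ols("response ~ C(a) * C(b)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))
```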