Statistics exam 4 Agresti Flashcards
Regression, nonparametric tests, ANOVA
What is ANOVA and when is it used?
Analysis of variance
Comparing the means of a quantitative response variable across the groups of a categorical explanatory variable
What is the difference between a one-, two- and three-way ANOVA?
One: 1 independent variable in a between groups design
Two: 2 independent variables in a factorial design (for example 2x2)
Three: 3 independent variables in a factorial design (for example 2x3x3)
What is the difference between variability between and within?
Between: distance between the tops (means) of the group distributions
Within: spread of the observations around the mean within one distribution
What does var. between > var. within mean?
The group means differ more than the chance variation within the groups would explain: evidence of a true difference between the groups
What type of distribution is used for ANOVA and how does it look?
F-distribution
- One right tail
- High F = small p value
What are the assumptions for an ANOVA test?
- Quantitative variable in more than 2 groups
- Independent random sampling
- Equal standard deviations (largest sd < 2x smallest sd)
- Response variable is normally distributed within each group
- Equal n (for now)
What do the hypotheses for ANOVA look like?
H0: mu1 = mu2 = … = mu g
HA: at least 2 population means are different
What are the steps for calculating F statistic in ANOVA test?
- Calculate within variability
- Calculate between variability
- Fill both into the F statistic formula: F = between variability / within variability
How do you calculate the p-value in ANOVA testing?
1 - F.DIST(F; df1; df2; TRUE), with df1 = g - 1 and df2 = N - g
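The calculation steps above can be sketched in Python. This is a minimal sketch for a one-way design; the function name one_way_anova_f and the toy scores are made up for illustration. Turning F into a p-value still needs an F-distribution function (such as the Excel formula on this card), which is left out here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df1, df2) for a list of groups of numbers (hypothetical helper)."""
    g = len(groups)                                # number of groups
    n_total = sum(len(grp) for grp in groups)      # total sample size N
    grand_mean = sum(sum(grp) for grp in groups) / n_total

    # Between variability: group means around the grand mean
    ss_between = sum(len(grp) * (mean(grp) - grand_mean) ** 2 for grp in groups)
    # Within variability: observations around their own group mean
    ss_within = sum((y - mean(grp)) ** 2 for grp in groups for y in grp)

    df1 = g - 1
    df2 = n_total - g
    ms_between = ss_between / df1   # mean square between groups
    ms_within = ss_within / df2     # mean square within groups (error)
    return ms_between / ms_within, df1, df2

# Toy data, invented for illustration: 3 groups of 3 scores
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df1, df2 = one_way_anova_f(groups)
print(f_stat, df1, df2)
```

Here the sums of squares are 26 (between) and 6 (within), so the mean squares are 13 and 1.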
What is the conclusion if p < alpha in ANOVA test?
At least 2 groups differ, but you don’t know which ones
What is MS and SS?
MS: mean square = estimate of the variability between groups (MSg) or within groups (MSe)
SS: sum of squares = mean square times its degrees of freedom (MSg x df1 or MSe x df2)
What is the Fisher method in ANOVA?
Confidence intervals for the difference between each pair of group means, based on the within-group variability from the ANOVA. If you have 3 groups, you have 3 intervals
These intervals are usually narrower than separate two-sample t intervals, because the standard deviation is pooled over all groups (df = N - g)
Why would you use the Fisher method instead of doing three separate t-tests?
Separate t-tests capitalize on chance. By doing the test over and over again, the chance of at least one type I error grows beyond alpha
What is the Bonferroni method?
Adjusted alpha per test = overall alpha / number of tests (K)
It corrects for the capitalization on chance that comes from doing t-tests over and over again
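The arithmetic behind capitalizing on chance can be sketched as follows. The sketch assumes the K tests are independent, which is a simplification, and the function names are made up for illustration.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one type I error in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test alpha under the Bonferroni method."""
    return alpha / k

alpha = 0.05
k = 3  # e.g. 3 pairwise comparisons for 3 groups

fwer_uncorrected = familywise_error_rate(alpha, k)
per_test_alpha = bonferroni_alpha(alpha, k)
fwer_corrected = familywise_error_rate(per_test_alpha, k)

print(fwer_uncorrected)  # about 0.143 instead of 0.05
print(per_test_alpha)    # about 0.0167
print(fwer_corrected)    # stays just below 0.05
```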
What is an alternative for the Bonferroni method?
Tukey method
When do you use non-parametric tests?
When you cannot rely on the central limit theorem because the groups are too small and the data are not normally distributed
How do you deal with ties in non-parametric tests?
Give each tied value the average of the ranks the tied values would otherwise get
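A minimal Python sketch of ranking with ties. The helper name average_ranks is made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at position pos
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end hold ranks pos+1..end+1; tied values get their average
        avg = (pos + 1 + end + 1) / 2
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks

ranks = average_ranks([10, 20, 20, 30])
print(ranks)  # the two 20s would get ranks 2 and 3, so both receive 2.5
```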
What are the three types of nonparametric tests and when do you use them?
- Wilcoxon: nonparametric alternative to the t-test for comparing 2 independent groups
- Kruskal-Wallis: nonparametric alternative to ANOVA for comparing several groups in a between groups design
- Sign test: for paired observations / dependent samples (alternative to the paired t-test), such as a pre-posttest design or matched individuals
What are the assumptions for the Wilcoxon test?
- Observations can be rank ordered
- 2 independent samples
- No assumptions regarding the shape of the distribution
What do the hypotheses for the Wilcoxon test look like?
H0: equal expected values for sample mean ranks and identical population distribution
H1: different expected values for sample mean ranks (two sided)
H1: higher/lower expected values for sample mean ranks (one sided)
What distribution can you use for samples larger than 20 in a Wilcoxon test? What do you have to do in other cases?
Use the z-distribution if n > 20
In other cases: W = mean rank (treatment) - mean rank (control). Read the P-value from the sampling distribution of W
What is sample space in the Wilcoxon test? What is thought of these possibilities under H0?
All possible rank combinations.
All these possibilities are equally likely under H0
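For small samples the sample space can be listed completely. The sketch below enumerates every way of assigning ranks to the treatment group and counts how often W is at least as large as the observed value (a one-sided P-value). It assumes there are no ties; the function name and the toy scores are made up for illustration.

```python
from itertools import combinations

def wilcoxon_exact_p(treatment, control):
    """Return (observed W, one-sided P(W >= observed)) by full enumeration.

    W = mean rank (treatment) - mean rank (control). Assumes no tied values.
    """
    pooled = sorted(treatment + control)
    rank_of = {value: i + 1 for i, value in enumerate(pooled)}  # no ties assumed
    n_t, n_c = len(treatment), len(control)
    all_ranks = range(1, len(pooled) + 1)
    rank_total = sum(all_ranks)

    def w_stat(treatment_ranks):
        t_sum = sum(treatment_ranks)
        return t_sum / n_t - (rank_total - t_sum) / n_c

    observed = w_stat([rank_of[v] for v in treatment])
    # Sample space: every possible set of ranks for the treatment group,
    # all equally likely under H0
    sample_space = [w_stat(c) for c in combinations(all_ranks, n_t)]
    as_extreme = sum(1 for w in sample_space if w >= observed - 1e-12)
    return observed, as_extreme / len(sample_space)

# Toy data, invented for illustration
observed_w, p_upper = wilcoxon_exact_p([7, 8, 9], [1, 2, 3])
print(observed_w, p_upper)
```

With 3 observations per group there are 20 possible rank combinations, and only one of them gives a W as large as the observed one.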
What distribution does the Kruskal-Wallis test use?
Chi-square distribution (df = g - 1)
What are the assumptions for the sign test?
- Small n, not normally distributed
- Random sampling
- Unequal values for each pair (no equal pre/posttest values)
What do the hypotheses of a sign test look like?
H0: P(+) = 0.5
H1: P(+) ≠ 0.5 (two sided)
H1: P(+) > 0.5 (one sided)
H1: P(+) < 0.5 (one sided)
What distribution does the sign test use?
The normal z-distribution when n is large (about 30 pairs or more); for smaller n, the binomial distribution with p = 0.5
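Under H0 the number of + signs follows a binomial distribution with p = 0.5, so for a small number of pairs the P-value can be computed exactly instead of through z. A minimal sketch; the function name and the counts are made up for illustration.

```python
from math import comb

def sign_test_p_two_sided(n_plus, n_minus):
    """Exact two-sided sign test P-value from the binomial(n, 0.5) distribution."""
    n = n_plus + n_minus          # pairs with equal values are already left out
    k = max(n_plus, n_minus)
    # P(X >= k) for X ~ binomial(n, 0.5), doubled for the two-sided test
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

# Toy counts, invented for illustration: 8 pairs improved, 2 got worse
p_value = sign_test_p_two_sided(8, 2)
print(p_value)
```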
What is the difference between a regression line and a correlation?
Regression line predicts the value for a response variable
Correlation indicates strength of the association
What is extrapolation?
Using regression line to predict y for x outside of the range of the data
In what case is r = b?
If x and y have the same standard deviation (b = r * sy / sx, so sx = sy gives b = r)
What is a residual?
Vertical distance between an observed y and the regression line (y - yhat)
What happens to b and r when the scale changes?
b changes
r doesn’t change, because it’s standardized
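The relation between b and r, and the effect of a change of scale, can be checked with a short Python sketch. The function name and the toy data are made up for illustration; x is first measured in metres and then converted to centimetres.

```python
from statistics import mean, stdev

def slope_and_correlation(x, y):
    """Return (b, r): least-squares slope and correlation of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

# Toy data, invented for illustration
x_m = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b_m, r_m = slope_and_correlation(x_m, y)
x_cm = [100 * xi for xi in x_m]            # same data on another scale
b_cm, r_cm = slope_and_correlation(x_cm, y)

b_from_r = r_m * stdev(y) / stdev(x_m)     # b = r * sy / sx

print(b_m, b_cm)   # the slope shrinks by a factor 100
print(r_m, r_cm)   # the correlation stays the same
```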
How do you calculate the correlation in excel?
Function CORREL(array1; array2): select the x column as array1 and the y column as array2
What is R squared?
Proportion of variation in y values that is accounted for by the linear relationship of x and y
It describes the predictive power
= proportional reduction in error
What is the case for R squared = 0?
All values of estimated y are the same (horizontal line)
Are correlation and regression line resistant to outliers?
No
What is a lurking variable?
Variable, usually unobserved, that influences the association between the variables of primary interest. It has the potential to be confounding
What is Simpson's paradox?
The direction of an association reverses after adjusting for a lurking variable
Ignoring the classes (subgroups) within the association leads to a wrong interpretation
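A small numeric sketch of such a reversal. All counts are hypothetical and chosen only to show the effect: treatment A does better within each subgroup, yet B does better when the subgroups are lumped together, because A was mostly used in the subgroup with low success rates.

```python
# Hypothetical counts: (successes, total) per treatment within each subgroup
data = {
    "mild cases":   {"A": (9, 10),   "B": (80, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, total):
    return successes / total

# Within each subgroup, A has the higher success rate
a_wins_within = all(
    rate(*grp["A"]) > rate(*grp["B"]) for grp in data.values()
)

# Ignoring the subgroups (the lurking variable), B looks better
overall = {}
for treatment in ("A", "B"):
    successes = sum(grp[treatment][0] for grp in data.values())
    total = sum(grp[treatment][1] for grp in data.values())
    overall[treatment] = rate(successes, total)

print(a_wins_within)   # True
print(overall)         # A: 39/110, B: 82/110
```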
What is regression towards the mean?
Extreme values tend to be followed by less extreme values over time
|r| < 1: so the predicted y is always relatively closer to its mean than x is to its mean
If x is 2 sd away from its mean and r = 0.5, the predicted y is 0.5 * 2 = 1 sd away from its mean
What is the difference between the residual and the total? How do you summarize this?
Residual = distance data to regression line
Total = distance data to mean
Summarize by summing the squared distances over all data points: residual sum of squares (RSS) and total sum of squares (TSS)
You look if the regression line predicts the data better than the mean
What does this mean:
Sum (y-yhat)^2 < Sum (y-ymean)^2 or RSS < TSS? What does this mean for R square?
If RSS is much smaller than TSS: strong association. The regression line is a better predictor than the mean
- R square is large
What happens with R square when:
RSS = TSS
RSS = 0
0 < RSS < TSS
RSS = TSS -> R square = 0 (b = 0)
RSS = 0 -> R square = 1 (the best!)
0 < RSS < TSS -> 0 < R square < 1
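These cases follow from R square = 1 - RSS / TSS. A minimal Python sketch that computes RSS, TSS and R square for a least-squares line; the function name and the toy data are made up for illustration.

```python
from statistics import mean

def rss_tss_r_squared(x, y):
    """Fit yhat = a + b*x by least squares and return (RSS, TSS, R square)."""
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]

    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # data to regression line
    tss = sum((yi - y_bar) ** 2 for yi in y)               # data to mean
    return rss, tss, 1 - rss / tss

# Toy data, invented for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
rss, tss, r_squared = rss_tss_r_squared(x, y)
print(rss, tss, r_squared)
```

For these data RSS = 2.4 and TSS = 6, so the regression line removes 60% of the prediction error that remains when using the mean.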
What does R square = 0.5 mean?
The prediction error (sum of squared errors) when using the regression line yhat to predict y is 50% smaller than when using ybar to predict y
50% of total variance explained
Variance around regression line is 50% less than total variance
What is ecological fallacy?
Using a correlation found in aggregated (group-level) data to predict values for a specific individual. This can be very misleading
What are the assumptions for regression analysis?
- The population means of y follow a straight line in x (linearity)
- Data are gathered randomly
- For each x, y follows normal distribution
- The standard deviation for y should be the same for all values of x
What distribution does regression analysis use?
t-distribution (df = n - 2)