Statistics exam 4 Agresti Flashcards

Regression, non parametrics, ANOVA

1
Q

What is ANOVA and when is it used?

A

Analysis of variance

Used to compare the means of a quantitative response variable across the groups of a categorical explanatory variable

2
Q

What is the difference between a one-, two-, and three-way ANOVA?

A

One: 1 independent variable (factor) in a between-groups design

Two: 2 factors in a factorial design, e.g. 2x2

Three: 3 factors in a factorial design, e.g. 2x3x3

3
Q

What is the difference between variability between and within?

A

Between: how far the group means (the centers of the distributions) lie from each other

Within: how far the observations in a group are spread around their own group mean

4
Q

What does var. between > var. within mean?

A

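The group means differ more than the variation within groups would explain by chance (large F), which is evidence of a true difference between the groups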

5
Q

What type of distribution is used for ANOVA and how does it look?

A

F-distribution
- One right tail
- High F = small p value

6
Q

What are the assumptions for an ANOVA test?

A
  • Quantitative response variable measured in several groups (typically more than 2)
  • Independent random sampling
  • Equal standard deviations (largest sd < 2x smallest sd)
  • Response is normally distributed within each group
  • Equal n (for now)
7
Q

What do the hypotheses for ANOVA look like?

A

H0: mu1 = mu2 = … = mu g
HA: at least 2 population means are different

8
Q

What are the steps for calculating F statistic in ANOVA test?

A
  1. Calculate within-groups variability
  2. Calculate between-groups variability
  3. Fill both into the F statistic formula: F = between-groups variability / within-groups variability
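
The three steps above can be sketched in Python. This is a minimal illustration, not a formula taken from the card: the function name `f_statistic` and the example data are made up, and it assumes the usual one-way ANOVA definitions (between-groups mean square with df1 = g - 1, within-groups mean square with df2 = N - g).

```python
def f_statistic(groups):
    """Return (F, df1, df2) for a one-way ANOVA on a list of groups."""
    g = len(groups)
    n_total = sum(len(grp) for grp in groups)
    grand_mean = sum(sum(grp) for grp in groups) / n_total
    group_means = [sum(grp) / len(grp) for grp in groups]

    # Step 1: within-groups variability (spread around each group's own mean)
    ss_within = sum((y - m) ** 2
                    for grp, m in zip(groups, group_means) for y in grp)
    df2 = n_total - g
    ms_within = ss_within / df2

    # Step 2: between-groups variability (group means around the grand mean)
    ss_between = sum(len(grp) * (m - grand_mean) ** 2
                     for grp, m in zip(groups, group_means))
    df1 = g - 1
    ms_between = ss_between / df1

    # Step 3: fill both into the F formula
    return ms_between / ms_within, df1, df2


# Hypothetical data: three groups of three observations each
F, df1, df2 = f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(F, df1, df2)
```

For this made-up data the between-groups mean square is 13 and the within-groups mean square is 1, so F = 13 with df1 = 2 and df2 = 6.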
9
Q

How do you calculate the p-value in ANOVA testing?

A

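In Excel: =1-F.DIST(F; df1; df2; TRUE), with df1 = g - 1 (number of groups minus 1) and df2 = N - g (total sample size minus number of groups)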

10
Q

What is the conclusion if p < alpha in ANOVA test?

A

Reject H0: at least 2 group means differ, but the F test doesn’t tell you which ones

11
Q

What is MS and SS?

A

MS: mean square = estimate of the variability between groups (MSg) or within groups (MSe)

SS: sum of squares = the mean square times its degrees of freedom (MSg times df1, or MSe times df2)

12
Q

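What is the Fisher method in ANOVA?

A

A follow-up to the ANOVA F test: a confidence interval for the difference between each pair of group means. If you have 3 groups, you have 3 pairs and therefore 3 intervals

These intervals are usually narrower than separate two-sample t intervals, because they use the pooled standard deviation of all groups (the square root of MSe)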

13
Q

Why would you use the Fisher method instead of running three separate t-tests?

A

Running separate t-tests capitalizes on chance. With every additional test, the chance of at least one type I error increases above alpha
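
A small sketch of how quickly the chance of at least one type I error grows with the number of tests. It assumes the tests are independent, which is a simplification (pairwise comparisons of the same groups are not fully independent); the function name is made up for illustration.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one type I error in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


# Three tests at alpha = 0.05, as with three pairwise comparisons
rate = familywise_error_rate(0.05, 3)
print(round(rate, 6))
```

Under this independence assumption, three tests at alpha = 0.05 give a chance of about 0.14 of at least one false rejection.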

14
Q

What is the Bonferroni method?

A

Adjusted alpha per test = overall alpha / number of tests (K)

It corrects for capitalization on chance when running many t-tests
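
The correction can be sketched as follows. The helper names are hypothetical, and the sketch assumes K is the number of pairwise comparisons among g groups, which is g(g - 1)/2.

```python
from math import comb


def number_of_pairs(g):
    """Number of pairwise comparisons among g groups."""
    return comb(g, 2)


def bonferroni_alpha(alpha, k):
    """Alpha to use for each single test so the overall level stays near alpha."""
    return alpha / k


k = number_of_pairs(3)            # 3 groups give 3 pairwise tests
per_test_alpha = bonferroni_alpha(0.05, k)
print(k, per_test_alpha)
```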

15
Q

What is an alternative for the Bonferroni method?

A

Tukey method

16
Q

When do you use non-parametric tests?

A

When the groups are too small to rely on the central limit theorem and the data cannot be assumed to be normally distributed

17
Q

How do you deal with ties in non-parametric tests?

A

Give each tied observation the average of the ranks the ties would otherwise get
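
A minimal sketch of ranking with averaged ties. The function name `average_ranks` and the data are made up for illustration.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at position pos
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end would get ranks pos+1..end+1; use their average
        shared_rank = (pos + 1 + end + 1) / 2
        for j in range(pos, end + 1):
            ranks[order[j]] = shared_rank
        pos = end + 1
    return ranks


ranks = average_ranks([10, 20, 20, 30])
print(ranks)  # [1.0, 2.5, 2.5, 4.0]
```

The two tied values would have taken ranks 2 and 3, so each gets 2.5.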

18
Q

What are the three types of non-parametric tests and when do you use them?

A
  • Wilcoxon: non-parametric alternative to the t-test for comparing 2 independent groups
  • Kruskal-Wallis: non-parametric alternative to the one-way ANOVA for comparing several independent groups
  • Sign test: for paired observations / dependence / alternative to the paired t-test / pre-posttest design / matched individuals
19
Q

What are the assumptions for the Wilcoxon test?

A
  • Observations can be rank ordered
  • 2 independent samples
  • No assumptions regarding the shape of the population distribution
20
Q

What do the hypotheses for the Wilcoxon test look like?

A

H0: equal expected values for sample mean ranks and identical population distribution

H1: different expected values for sample mean ranks (two sided)

H1: higher/lower expected values for sample mean ranks (one sided)

21
Q

What distribution can you use for samples larger than 20 in a Wilcoxon test? What do you have to do in other cases?

A

Use the z distribution if n > 20

In other cases: W = mean rank (treatment) - mean rank (control). Read the P-value from the exact sampling distribution of W

22
Q

What is sample space in the Wilcoxon test? What is thought of these possibilities under H0?

A

All possible rank combinations.
All these possibilities are equally likely under H0
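
The idea of an equally likely sample space can be sketched by enumerating every way the ranks could be assigned to the treatment group. This is an illustration only: the function name and data are made up, it assumes no ties, and it computes a one-sided P-value for "treatment ranks higher".

```python
from itertools import combinations


def wilcoxon_exact_p(treatment, control):
    """Return (observed W, one-sided P-value); W = difference in mean ranks.

    Assumes no tied values across the two samples.
    """
    pooled = sorted(treatment + control)
    rank_of = {value: i + 1 for i, value in enumerate(pooled)}
    n_t, n = len(treatment), len(pooled)
    all_ranks = range(1, n + 1)
    total = sum(all_ranks)

    def mean_rank_diff(t_ranks):
        s = sum(t_ranks)
        return s / n_t - (total - s) / (n - n_t)

    observed = mean_rank_diff([rank_of[v] for v in treatment])
    # Sample space: every set of ranks the treatment group could have received
    diffs = [mean_rank_diff(c) for c in combinations(all_ranks, n_t)]
    at_least_as_extreme = sum(1 for d in diffs if d >= observed - 1e-12)
    return observed, at_least_as_extreme / len(diffs)


# Hypothetical data: every treatment value is above every control value
w, p = wilcoxon_exact_p([8, 9, 10], [1, 2, 3])
print(w, p)
```

With 3 observations per group there are 20 possible rank combinations, and only one of them is as extreme as the observed one, so P = 1/20 = 0.05.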

23
Q

What distribution does the Kruskal-Wallis test use?

A

Chi-square distribution (with df = g - 1, the number of groups minus 1)

24
Q

What are the assumptions for the sign test?

A
  • Small n, not normally distributed
  • Random sampling
  • Unequal values for each pair (no equal pre/posttest values)
25
Q

What do the hypotheses of a sign test look like?

A

H0: P(+) = 0.5
H1: P(+) ≠ 0.5 (two sided)
H1: P(+) > 0.5 (one sided)
H1: P(+) < 0.5 (one sided)

26
Q

What distribution does the sign test use?

A

The standard normal (z) distribution when n is large. For small n, use the binomial distribution with p = 0.5
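
A minimal sketch of both routes to a sign test result. The function names and the numbers are made up; the z statistic assumes the usual large-sample form with standard error sqrt(0.25 / n), and the exact version uses the binomial distribution with p = 0.5, which is the safer choice for small n.

```python
from math import comb, sqrt


def sign_test_z(n_plus, n):
    """z statistic for H0: P(+) = 0.5, given n_plus positive differences out of n."""
    p_hat = n_plus / n
    return (p_hat - 0.5) / sqrt(0.25 / n)


def sign_test_exact_p(n_plus, n):
    """Two-sided exact P-value from the binomial(n, 0.5) distribution."""
    k = max(n_plus, n - n_plus)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Hypothetical pre/posttest study: 8 of 10 pairs improved (no zero differences)
z = sign_test_z(8, 10)
p_exact = sign_test_exact_p(8, 10)
print(z, p_exact)
```

Here the exact two-sided P-value is 2 * 56/1024 = 0.109375.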

27
Q

What is the difference between a regression line and a correlation?

A

Regression line predicts the value for a response variable
Correlation indicates strength of the association

28
Q

What is extrapolation?

A

Using regression line to predict y for x outside of the range of the data

29
Q

In what case is r = b?

A

When x and y have the same standard deviation, because b = r * (sd of y / sd of x)

30
Q

What is a residual?

A

Vertical distance between an observed data point and the regression line: y - yhat

31
Q

What happens to b and r when the scale changes?

A

b changes
r doesn’t change, because it’s standardized
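
A short sketch showing the effect of a change of scale. The function name and data are made up for illustration; x is multiplied by 100, as when converting meters to centimeters.

```python
from math import sqrt


def slope_and_correlation(x, y):
    """Return (b, r) for the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((c - mean_y) ** 2 for c in y)
    return sxy / sxx, sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, r = slope_and_correlation(x, y)

# Same data with x on a scale that is 100 times larger
x_rescaled = [100 * value for value in x]
b_rescaled, r_rescaled = slope_and_correlation(x_rescaled, y)

print(b, b_rescaled)   # the slope shrinks by a factor of 100
print(r, r_rescaled)   # the correlation stays the same
```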

32
Q

How do you calculate the correlation in Excel?

A

Function =CORREL(array1; array2), selecting one column of data for each argument

33
Q

What is R squared?

A

Proportion of variation in y values that is accounted for by the linear relationship of x and y
It describes the predictive power

= proportional reduction in error

34
Q

What is the case for R squared = 0?

A

All values of estimated y are the same (horizontal line, b = 0)

35
Q

Are correlation and regression line resistant to outliers?

A

No

36
Q

What is a lurking variable?

A

Variable that influences association between variables of primary interest. It has the potential to be confounding

37
Q

What is Simpson’s paradox?

A

The direction of an association reverses after adjusting for a lurking variable.
Ignoring the separate classes within the data leads to a wrong interpretation of the association

38
Q

What is regression towards the mean?

A

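Extreme values tend to be followed by less extreme values

|r| < 1, so the predicted y is always relatively closer to its mean than x is to its mean (measured in standard deviations)

If x is 2 sd away from its mean and r = 0.5, the predicted y is 0.5 * 2 = 1 sd away from its mean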

39
Q

What is the difference between the residual and the total? How do you summarize this?

A

Residual = distance from data to regression line
Total = distance from data to mean

Summarize each by summing the squared distances: residual sum of squares (RSS) and total sum of squares (TSS)

You check whether the regression line predicts the data better than the mean does

40
Q

What does this mean:
Sum (y-yhat)^2 < Sum (y-ymean)^2 or RSS < TSS? What does this mean for R square?

A

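If RSS < TSS: the regression line is a better predictor than the mean. The smaller RSS is relative to TSS, the stronger the association
- R square = (TSS - RSS) / TSS, so R square is large when RSS is much smaller than TSS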
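
A minimal sketch of the comparison between RSS and TSS. The function name and data are made up for illustration, and the line is fitted by ordinary least squares.

```python
def rss_tss_r_squared(x, y):
    """Return (RSS, TSS, R square) for the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
    a0 = mean_y - b * mean_x                      # intercept
    y_hat = [a0 + b * value for value in x]

    rss = sum((c - p) ** 2 for c, p in zip(y, y_hat))   # data to regression line
    tss = sum((c - mean_y) ** 2 for c in y)             # data to mean
    return rss, tss, (tss - rss) / tss


rss, tss, r_squared = rss_tss_r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(rss, tss, r_squared)
```

For this data RSS is 2.4 and TSS is 6, so R square = (6 - 2.4) / 6 = 0.6: the prediction error shrinks by 60% when the line is used instead of the mean.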

41
Q

What happens with R square when:
RSS = TSS
RSS = 0
0 < R < 1

A

RSS = TSS -> R square = 0 (b = 0)
RSS = 0 -> R square = 1 (the best!)
0<R<1 -> 0<Rsquare<1

42
Q

What does R square = 0.5 mean?

A

The error using regression line yhat to predict y is 50% smaller than using ybar to predict y

50% of total variance explained

Variance around regression line is 50% less than total variance

43
Q

What is ecological fallacy?

A

Using a correlation found at the group level to predict values for a specific individual. This can be very misleading

44
Q

What are the assumptions for regression analysis?

A
  • The relationship between x and the mean of y is linear in the population
  • Data are randomly gathered
  • For each x, y follows a normal distribution
  • The standard deviation of y is the same for all values of x
45
Q

What distribution does regression analysis use?

A

t-distribution (with df = n - 2 for a single explanatory variable)