ANOVA and Regression Flashcards

1
Q

Define the terms scatter plot, correlation, and regression line.

A

Scatter plot - a two-dimensional graph of paired data values.

Correlation - a statistic that measures the strength and direction of a linear relationship between two quantitative variables.

Regression line - an equation that describes the average relationship between a quantitative response variable and an explanatory variable.
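A minimal sketch of all three ideas with NumPy, using hypothetical data (the values are illustrative only):

```python
import numpy as np

# Hypothetical paired data values (the scatter plot would plot y against x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Correlation: strength and direction of the linear relationship.
r = np.corrcoef(x, y)[0, 1]

# Regression line: average relationship between response and explanatory variable.
slope, intercept = np.polyfit(x, y, deg=1)

print(r, slope, intercept)
```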

2
Q

What is Pearson’s sample correlation coefficient (r), what are its bounds, and how is it calculated?

A

Pearson's r measures the strength and direction of the linear relationship between two quantitative variables. Bounds: -1 <= r <= 1. Calculation: r = Sxy / sqrt(Sxx*Syy).
3
Q

What are typical questions to ask from a scatter plot

A
  1. What is the average pattern? Does the scatter plot look like a straight line or a curve?
  2. What is the direction of the pattern? Negative or positive association?
  3. How much do individual points vary from the average pattern?
  4. Are there any unusual data points?
4
Q

What is the meaning if r= 1,0,-1?

A

r = 1: all points fall exactly on a straight line with positive slope. r = 0: the best straight line through the data is exactly horizontal (no linear relationship). r = -1: all points fall exactly on a straight line with negative slope.

5
Q

Equation for a straight regression line

A

The model is Yi = B0 + B1*Xi + eps_i; the fitted line is yhat = B0hat + B1hat*x.
6
Q

Three general types of regression?

A

Simple linear regression, polynomial regression, multiple linear regression

7
Q

Assumptions for error term in simple linear model

A

The eps_i are independent with mean zero and constant variance sig^2; for testing and interval construction they are assumed normal: eps_i ~ iid N(0, sig^2).
8
Q

What are some topics of interest in regression?

A
  1. Is there a linear relationship?
  2. How to describe the relationship?
  3. How to predict a new value of the response?
  4. How to predict the value of the explanatory variable that causes a specified response?
9
Q

What is the E[Yi] for a simple linear regression model

A

E[Yi] = B0 + B1*Xi
10
Q

Definitions of B1 and B0

A

B1 - the slope of the regression line, which gives the change in the mean of the probability distribution of Y per unit increase in X.

B0 - the intercept of the regression line. If 0 is in the domain of X, then B0 gives the mean of the probability distribution of Y at X = 0.

11
Q

Are Y, X, B, eps random/fixed and known/unknown?

A

Y - random, known
X - fixed, known
B - fixed, unknown
eps - random, unknown

12
Q

Describe the process of least squares estimation

A

Choose B0hat and B1hat to minimize the sum of squared deviations Q = sum(Yi - B0 - B1*Xi)^2. Setting the partial derivatives of Q with respect to B0 and B1 to zero gives the normal equations, whose solution is the least squares estimators.
13
Q

Equation for a residual

A

ei = Yi - Yihat = Yi - (B0hat + B1hat*Xi)
14
Q

Sxx, Syy, Sxy

A

Sxx = sum(Xi - Xbar)^2, Syy = sum(Yi - Ybar)^2, Sxy = sum(Xi - Xbar)(Yi - Ybar)
15
Q

Gauss-Markov Theorem

A

Under certain assumptions (mean-zero, independent, homoskedastic errors), the least squares estimators have minimum variance among all linear unbiased estimators; that is, they are BLUE (best linear unbiased estimators).

16
Q

Best equations for B0 and B1 using least squares estimation

A

B1hat = Sxy/Sxx, B0hat = Ybar - B1hat*Xbar
17
Q

For simple linear regression, equation for SSE, degrees of freedom, relation to sig^2

A

SSE = sum(Yi - Yihat)^2 = Syy - B1hat*Sxy, with n-2 degrees of freedom; MSE = SSE/(n-2) is an unbiased estimator of sig^2.
18
Q

Maximum likelihood estimation, explain what changes with regression from LSE.

A

MLE assumes normality of the errors. The B estimators are the same as least squares, but the estimators for sig^2 differ: MLE gives SSE/n, which is biased but asymptotically unbiased. The normality assumption is necessary for testing and interval construction.
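
A small sketch of the difference on simulated data (the data-generating values here are hypothetical): the beta estimates are the same under LSE and MLE, so only the two sig^2 estimators are compared.

```python
import numpy as np

# Simulated simple linear regression data (hypothetical values).
rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# Least squares fit (identical to the MLE for the betas under normality).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)

sigma2_mle = sse / n          # biased, asymptotically unbiased
sigma2_lse = sse / (n - 2)    # MSE, unbiased
print(sigma2_mle, sigma2_lse)
```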

19
Q

J and n in terms of 1 vectors

A

J = 11’ (the n x n matrix of ones)
n = 1’1

20
Q

H matrix

A

X(X’X)^-1 X’
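
A quick numerical sketch: build H for a small (hypothetical) design matrix and verify the properties used throughout these cards, symmetry, idempotence, and trace equal to the number of parameters.

```python
import numpy as np

# Small design matrix: intercept column plus one covariate (hypothetical data).
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H @ H, H), np.allclose(H, H.T), np.trace(H))
```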

21
Q

Linear form of y

A

By

22
Q

Quadratic form of y

A

y’Ay

23
Q

Quadratic forms are common in linear models as a way of _____ The sum of squares can be decomposed in terms of _______ A quadratic form of normal Y is _______ Independence of quadratic forms is based on _________

A

1. expressing variation
2. quadratic forms
3. Chi-squared distributed
4. idempotent matrices

24
Q

If l1=B1y and l2=B2y then what is cov(l1,l2)

A

cov(l1,l2)=B1cov(y)B2’

25
Q

What does the trace function do?

A

It returns the sum of the diagonal elements of a square matrix.

26
Q

If q=y’Ay where Y~N(u,V) then E[q]=

A

E[q]=u’Au+tr(AV)
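
A Monte Carlo sketch of this identity, using a hypothetical mean vector u, covariance V, and matrix A: the simulated average of y'Ay should sit close to u'Au + tr(AV).

```python
import numpy as np

# Hypothetical u, V, A for checking E[y'Ay] = u'Au + tr(AV).
rng = np.random.default_rng(1)
u = np.array([1.0, -1.0, 0.5])
V = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.5]])
A = np.diag([1.0, 2.0, 3.0])

# Simulate many draws of y ~ N(u, V) and average the quadratic form y'Ay.
y = rng.multivariate_normal(u, V, size=200_000)
mc = np.mean(np.einsum('ij,jk,ik->i', y, A, y))
exact = u @ A @ u + np.trace(A @ V)
print(mc, exact)
```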

27
Q

Matrix expression of LSE

A

Bhat = (X’X)^-1 X’y
28
Q

(X’X)^-1

A

For simple linear regression, (X’X)^-1 = (1/(n*Sxx)) * [ sum(Xi^2)  -sum(Xi) ; -sum(Xi)  n ]
29
Q

Matrix expression of e and var(e)

A

e = y - yhat = (I - H)y, and Var(e) = sig^2 (I - H)
30
Q

Matrix expression of SST, SSE, SSR

A

SST = y’(I - (1/n)J)y, SSE = y’(I - H)y, SSR = y’(H - (1/n)J)y
31
Q

E[Bhat] and Var[Bhat]

A

E[Bhat] = B and Var[Bhat] = sig^2 (X’X)^-1
32
Q

For y~N(u,V), if l=By and q=y’Ay with A symmetric and idempotent, then how to show l and q are independent?

A

Show BVA=0

33
Q

For y~N(u,V), q1=y’A1y and q2=y’A2y then how to show q1 and q2 are independent.

A

Show A1VA2=0

34
Q

For y~N(0,V), q=y’Ay then q~____ where ____ idempotent

A

q ~ Chi-squared with rank(A) degrees of freedom, where AV is idempotent.

35
Q

For y~N(0,1), q=y’Ay then how to obtain t distribution

A

Take z ~ N(0,1) and an independent q ~ Chi-squared with r degrees of freedom; then t = z / sqrt(q/r) ~ t_r.
36
Q

How to obtain F distribution (two ways)

A

1) F = (q1/r1)/(q2/r2), the ratio of two independent chi-squared variables each divided by its degrees of freedom, ~ F(r1, r2). 2) The square of a t random variable: t_r^2 ~ F(1, r).
37
Q

Why use centered regression?

A

Centered regression (using xi - xbar) reduces the ill effects of high correlation among the columns (covariates) of the design matrix. This is collinearity: a “near-linear” relationship (high correlation coefficient) among covariates. Under exact collinearity det(X’X) = 0, so (X’X)^-1 does not exist; near collinearity makes X’X nearly singular and the estimates unstable.
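
A sketch of why centering helps, using a hypothetical covariate far from zero (so its column is nearly collinear with the intercept column of ones): centering makes X'X much better conditioned.

```python
import numpy as np

# Covariate far from zero: its column is nearly collinear with the 1s column.
x = np.array([100.0, 101.0, 102.0, 103.0, 104.0])
X_raw = np.column_stack([np.ones_like(x), x])
X_cen = np.column_stack([np.ones_like(x), x - x.mean()])  # centered regression

# Condition number of X'X: large means near-singular, unstable inverse.
cond_raw = np.linalg.cond(X_raw.T @ X_raw)
cond_cen = np.linalg.cond(X_cen.T @ X_cen)
print(cond_raw, cond_cen)
```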

38
Q

Cov(B*_0,B*_1), [centered regression]

A

Cov(B*_0, B*_1) = 0; centering makes the columns of the design matrix orthogonal, so the intercept and slope estimators are uncorrelated.
39
Q

t distribution and statistic for simple linear regression

A
40
Q

What to show for a t-distribution (3 things)

A

1) The numerator is normally distributed. 2) The square of the denominator is chi-square distributed (divided by its degrees of freedom). 3) The numerator is independent of the denominator.

41
Q

CI for B0 and B1 in simple linear regression

A

Bjhat ± t(alpha/2; n-2) * se(Bjhat), where se(B1hat) = sqrt(MSE/Sxx) and se(B0hat) = sqrt(MSE*(1/n + Xbar^2/Sxx)).
42
Q

Testing procedure for if multiple slopes are zero

A

General linear (full vs. reduced model) F test. H0: the slopes in question are all zero. Fit the reduced model without those terms and the full model with them, then F = [(SSE_R - SSE_F)/(df_R - df_F)] / MSE_F, compared against F(df_R - df_F, df_F).
43
Q

ANOVA table for simple linear regression

A

Source | df | SS | MS | F
Regression | 1 | SSR | MSR = SSR/1 | F = MSR/MSE
Error | n-2 | SSE | MSE = SSE/(n-2) |
Total | n-1 | SST | |
44
Q

R^2 (two ways)

A

R^2 = SSR/SST = 1 - SSE/SST
45
Q

R^2 in centering

A
46
Q

What are the two meanings for prediction

A

1) Predicting the mean response E(Y|X = x_new); 2) predicting a new individual observation Y_new at x_new.
47
Q

What distribution does B_0 hat + B_1 hat x_new follow?

A

Normal, with mean B0 + B1*x_new and variance sig^2*(1/n + (x_new - Xbar)^2/Sxx).

48
Q

(1-alpha)% CI for E(Y|X=x_new)

A

yhat_new ± t(alpha/2; n-2) * sqrt(MSE*(1/n + (x_new - Xbar)^2/Sxx))

49
Q

(1-alpha)% PI for E(Y|X=x_new)

A

yhat_new ± t(alpha/2; n-2) * sqrt(MSE*(1 + 1/n + (x_new - Xbar)^2/Sxx))

50
Q

What in words is the Bonferroni correction method?

A

Divide the alpha level by m, where m is the number of confidence intervals for which simultaneous coverage is desired.

51
Q

Explain the Scheffe method

A

A simultaneous inference method that covers all linear combinations of the parameters at once: replace the t critical value with sqrt(p * F(alpha; p, n-p)). It is conservative but, unlike Bonferroni, does not depend on how many intervals are desired.

52
Q

Explain in words the assumption of linearity and how to check

A

The assumption that a linear model is actually a good fit for the data. It can be checked by inspecting scatter plots for a linear relationship between each explanatory variable and the response, as well as by inspecting residual plots for patterns.

53
Q

Explain in words the assumption of Randomness and how to check

A

The assumption that there is no structure in the residuals (no pattern). The runs test is a nonparametric method that detects structure in the residuals by counting the number of sequences of points above or below the mean/median residual. The Durbin-Watson test is applicable if the data can be arranged in time order; it has its own table of critical values and tests whether the serial correlation of the errors equals 0.

54
Q

Explain in words the assumption of homoskedasticity (constant variance) and how to check

A

The assumption of constant variance for the residuals. This can be checked using a scatter plot, a residual plot, or tested formally using the Brown-Forsythe (BF) test (also called Levene’s test for groups) and the Breusch-Pagan (BP) test for general constant variance.

55
Q

Explain in words the assumption of Normality of error and how to check

A

Normality of error can be checked by plotting the residuals using a box plot, histogram, or normal probability plot. It can also be tested formally using the Shapiro-Wilk test, Kolmogorov-Smirnov test, or the Anderson-Darling test. Note however that normal probability plots provide no information if the assumptions of linearity or homoskedasticity have been violated.
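
A sketch of the formal tests named on this card, applied to hypothetical residuals (simulated here as normal draws, so they stand in for residuals from a well-behaved fit):

```python
import numpy as np
from scipy import stats

# Hypothetical residuals, simulated as standard normal for illustration.
rng = np.random.default_rng(2)
resid = rng.normal(size=100)

sw_stat, sw_p = stats.shapiro(resid)             # Shapiro-Wilk
ks_stat, ks_p = stats.kstest(resid, 'norm')      # Kolmogorov-Smirnov vs N(0,1)
ad = stats.anderson(resid, dist='norm')          # Anderson-Darling

print(sw_p, ks_p, ad.statistic)
```

Large p-values mean no evidence against normality; the Anderson-Darling statistic is compared against its table of critical values.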

56
Q

Define an influential point

A

One that simultaneously has a large absolute residual and high leverage

57
Q

Define leverage

A

Leverage measures the potential effect of a point on the fitted regression; the leverage of the ith point is the diagonal element hii of the hat matrix.

58
Q

Three types of residuals and their definitions

A
59
Q

Influence: how to measure, factors of influence, situations for high influence, measuring high influence

A

Cook’s distance measures influence. It depends on two factors: leverage and the size of the residual. Three situations can cause high influence: high residual with moderate leverage, high leverage with moderate residual, or both high. Cook’s distance is considered large if Di > F(alpha; p, n-p).
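
A sketch computing Cook's distance from its two ingredients, using the standard formula D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2) on hypothetical data in which the last point has high leverage:

```python
import numpy as np

# Hypothetical data: the last x value sits far from the others (high leverage).
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 12.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y                      # residuals
mse = e @ e / (n - p)
h = np.diag(H)                     # leverages h_ii

D = e**2 * h / (p * mse * (1 - h)**2)   # Cook's distance
print(h, D)
```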

60
Q

General rule of thumb for large residual

A
61
Q

General depiction of Lack of fit Test

A
62
Q

Model for Lack of fit Test

A
63
Q

ANOVA for LoFT

A
64
Q

SSPE and SSLOF in matrix notation

A
65
Q

Remedial approaches (for heterogeneity of variance and nonlinearity combinations)

A

If nonlinearity but homogeneity of variance: change the model.

If linearity but heterogeneity of variance: use WLS or transform.

If heterogeneity of variance and nonlinearity: apply a transformation.

If right-skewed data with heterogeneity and nonlinearity: log transformation.

If count data: square root transformation.

If proportions: arcsine square root transformation.

66
Q

Box-Cox transformation (what it accomplishes, how, when)

A

A power transformation of the response, Y^lambda (with log Y at lambda = 0), chosen to make the errors closer to normal with constant variance; lambda is selected by maximum likelihood. Use it when the residuals show nonnormality and/or heterogeneity of variance.

67
Q

Prediction intervals for a transformed model and confidence intervals for a transformed model*

A
68
Q

What to do when needing inference without normality assumption and procedure to accomplish this.

A

Bootstrap. Procedure: 1) Take a sample of size n from the dataset with replacement. 2) Compute the statistic of interest on that resample (usually using mean to compute parameter [x(x’x)^-1x’]). 3) Repeat N times and order the N results. 4) Depending on alpha, find the corresponding percentiles and compute the interval.
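
The four steps above can be sketched as a percentile bootstrap interval, here for the mean of a hypothetical sample (the data and the choice of statistic are illustrative):

```python
import numpy as np

# Hypothetical skewed sample where a normal-theory interval may be doubtful.
rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=80)

N = 5000
boot = np.empty(N)
for b in range(N):
    resample = rng.choice(data, size=len(data), replace=True)  # step 1
    boot[b] = resample.mean()                                  # step 2
boot.sort()                                                    # step 3

alpha = 0.05                                                   # step 4
lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(lo, hi)
```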

69
Q

Weighted Least squares: how to accomplish, when to accomplish

A

Minimize the weighted sum of squares sum wi*(Yi - Xi’B)^2, giving Bhat_w = (X’WX)^-1 X’Wy with W = diag(wi). Use when the error variance is not constant, with weights wi proportional to 1/Var(eps_i).
70
Q

Prediction interval and confidence intervals for WLS

A
71
Q

What is the model for polynomial regression

A

Yi = B0 + B1*Xi + B2*Xi^2 + ... + Bk*Xi^k + eps_i
72
Q

General ANOVA test procedure for polynomial relationship

A
73
Q

In regression, what are possible repercussions for including a higher ordered term/not including a higher ordered term when there shouldn’t be one/should be one.

A

If a higher-order term that is not in the true relationship is included, the result is higher prediction variance, though the estimators remain unbiased. If a higher-order term that should be there is not included, the estimators are no longer unbiased.

74
Q

Model for Multiple linear regression

A

Yi = B0 + B1*Xi1 + B2*Xi2 + ... + Bk*Xik + eps_i
75
Q

What theorem still applies in Polynomial and multiple linear regression?

A

Gauss-Markov Theorem

76
Q

General ANOVA test procedure for multiple linear regression

A
77
Q

SSR(A|B)= (2 ways)

A

SSR(A,B)-SSR(B)

SSE(B)-SSE(A,B)
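
A sketch of the second identity, SSR(A|B) = SSE(B) - SSE(A,B), on hypothetical data: fit the reduced model (B only) and the full model (A and B) by least squares and take the drop in SSE.

```python
import numpy as np

# Hypothetical data with two covariates.
rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)

def sse(X, y):
    """SSE from a least squares fit of y on the columns of X."""
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ bhat
    return e @ e

one = np.ones(n)
sse_b = sse(np.column_stack([one, x1]), y)         # reduced model: B (x1) only
sse_ab = sse(np.column_stack([one, x1, x2]), y)    # full model: A and B
ssr_a_given_b = sse_b - sse_ab                     # extra sum of squares
print(ssr_a_given_b)
```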

78
Q

R2y,x1|x2 (2 interpretations/definitions)

A
79
Q

What is the hat fact?

A

H_R H_F = H_F H_R = H_R (where H_R and H_F are the reduced-model and full-model hat matrices)
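
A numerical sketch of the hat fact: when the reduced model's columns are contained in the full model's columns, the two hat matrices multiply to the reduced one.

```python
import numpy as np

# Hypothetical design: reduced model uses x1; full model adds x2.
rng = np.random.default_rng(6)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
XR = np.column_stack([np.ones(n), x1])        # reduced model
XF = np.column_stack([np.ones(n), x1, x2])    # full model (contains XR's columns)

def hat(X):
    return X @ np.linalg.inv(X.T @ X) @ X.T

HR, HF = hat(XR), hat(XF)
print(np.allclose(HR @ HF, HR), np.allclose(HF @ HR, HR))
```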

80
Q

What are added variable plots (partial regression plots) and what is a valuable heuristic from their evaluation

A

They are plots of two sets of residuals: the residuals of Y regressed on the covariates already in the model against the residuals of the candidate covariate regressed on those same covariates. If there is a clear linear relationship in the added variable plot, one should add that candidate covariate into the model.

81
Q

Describe the 3 Sequential Variable Selection Methods and their various measurements of selection

A
  1. Forward selection (start with the null model and add the best variable individually)
  2. Backward selection (start with the full model and subtract the worst variable individually)
  3. Stepwise selection (start with the null model and add/subtract to maximize the desired measurement criteria)

Measurement criteria: Adj R2, Mallow’s Cp, AIC/BIC

82
Q

Adjusted R2 definition (2 ways)

A

Adj R^2 = 1 - (SSE/(n-p)) / (SST/(n-1)) = 1 - ((n-1)/(n-p))*(1 - R^2). Unlike R^2, it penalizes adding predictors.
83
Q

Mallow’s CP definition and how to evaluate

A

Cp = SSE_p/MSE_full - (n - 2p), where p is the number of parameters in the candidate model. Good candidate models have small Cp close to p.
84
Q

Collinearity definition

A

Collinearity is a “near-linear” relationship (a high correlation coefficient) among covariates. It increases the variance of the estimators.
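
A sketch of the variance-inflation effect via the VIF, 1/(1 - R_k^2), on hypothetical nearly collinear covariates (the data-generating values are illustrative):

```python
import numpy as np

# x2 is nearly a linear function of x1, i.e. near-collinear.
rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)

def r2(X, y):
    """R^2 from a least squares fit of y on the columns of X."""
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ bhat
    return 1 - (e @ e) / np.sum((y - y.mean()) ** 2)

# Regress x2 on the other covariates (here just x1, plus intercept).
X_other = np.column_stack([np.ones(n), x1])
vif_x2 = 1 / (1 - r2(X_other, x2))
print(vif_x2)
```

A common rule of thumb reads a VIF above 10 as serious collinearity; here the VIF is far above that.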

85
Q

What does standardized regression accomplish and how does it do this?

A
86
Q

VIF definition for two variables (2 ways) and rule of thumb for VIF indication

A

VIF_k = 1/(1 - R_k^2), where R_k^2 is the R^2 from regressing X_k on the other covariates; for two variables this is 1/(1 - r^2), with r their correlation. Rule of thumb: VIF > 10 indicates serious collinearity.
87
Q

Some indicators of collinearity

A
88
Q

4 important remarks on R2

A

1) Not an estimate of any population quantity unless the data are multivariate normal
2) Can be dramatically changed by how the x’s are selected
3) Does not capture nonlinear relationships, only linear ones
4) Non-decreasing in the number of predictors. Adding an extra predictor will not cause R2 to decrease

89
Q

Regression model using data from two sources (Different intercepts but same slope)

A

Yi = B0 + B1*Xi + B2*Zi + eps_i, where Zi is a 0/1 indicator for the data source; B2 shifts the intercept while the slope B1 is common to both sources.
90
Q

How to pick with AIC/BIC

A

Go with the smallest

91
Q

Given a contingency table (what does this look like?), how would one obtain a table of proportions and also calculate pij and pj|i?

A
92
Q

Two categorical response variables are independent (in a contingency table) if…

A

All joint probabilities equal the product of their marginal probabilities (pij=p.jpi. for all i,j)
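
Under independence each expected count is n * pi. * p.j, which is exactly what the chi-squared test of independence compares against. A sketch on a hypothetical 2x2 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of frequency counts.
table = np.array([[30, 10],
                  [20, 40]])

chi2, pval, dof, expected = chi2_contingency(table)
print(dof, expected[0, 0], pval)
```

The `expected` array holds the counts implied by pij = pi. * p.j; a small p-value rejects independence.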

93
Q

3 measures of relationships for square contingency tables

A

Difference in proportions, relative risk, odds

94
Q

For a 2x2 proportion table, define difference in proportions for fixing column 1 or fixing row 1, the range, and when there is statistical independence of row/col classification

A
95
Q

Relative risk definition, range, meaning of 1, and comparison to difference in proportions

A
96
Q

Definition of odds, range, inverse relationship to proportion of success

A
97
Q

Large sample distribution of log(theta hat [estimated odds ratio]), difference of proportions, and log(r hat [estimated relative risk])

A
98
Q

Definition of odds ratio (various ways to calculate), relation to relative risk, all possible calculations given IJ table

A
99
Q

3 ways to test independence for contingency tables and how

A
100
Q

Test of Goodness of Fit

A
101
Q

Test of Homogeneity

A
102
Q

Test of Symmetry (matched pairs test, McNemar’s Test)

A
103
Q

Simpson’s paradox

A

Occurs when the data are grouped together without a relevant (lurking) factor, so that an association within each group reverses or disappears in the aggregate. It calls for a higher-dimensional table to truly address the problem.
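
A numeric sketch with hypothetical counts (illustrative values only): treatment A wins within each group, yet loses in the pooled table.

```python
# (successes, trials) per treatment within each hypothetical group.
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(s, n):
    return s / n

# Within each group, A has the higher success rate...
within = all(rate(*g["A"]) > rate(*g["B"]) for g in groups.values())

# ...but pooled over groups, B has the higher success rate.
sa = sum(g["A"][0] for g in groups.values())
na = sum(g["A"][1] for g in groups.values())
sb = sum(g["B"][0] for g in groups.values())
nb = sum(g["B"][1] for g in groups.values())
pooled_flipped = rate(sb, nb) > rate(sa, na)

print(within, pooled_flipped)
```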

104
Q

Form of GLM and three components

A

g(mu_i) = xi’B. Three components: 1) the random component (the distribution of Y, from an exponential family), 2) the systematic component (the linear predictor eta_i = xi’B), and 3) the link function g relating mu_i = E[Yi] to the linear predictor.
105
Q

Definition of GLM

A

GLMs extend ordinary regression models to encompass nonnormal response distributions and modeling functions of the mean.

106
Q

The response variable in a GLM follows a distribution in the _____. What is the formula for each yi, and the canonical link?

A

The exponential family: f(yi; theta_i, phi) = exp{ (yi*theta_i - b(theta_i))/a(phi) + c(yi, phi) }. The canonical link is the g for which g(mu_i) = theta_i.
107
Q

When Y={0,1} what GLM to use? What is this process?

A
108
Q

When Y={0,1,2,…} what GLM to use? What is this process?

A
109
Q

Deviance for Poisson and Binomial GLM

A
110
Q

Confidence for Poisson and Binomial GLM

A
111
Q

Testing procedure for slope of Weighted Least squares

A
112
Q

Matrix formula for r2y,xk|{xi≠k}

A
113
Q

Regression model using data from two sources (Different intercepts and slope)

A
114
Q

Regression using data from two sources (Different intercepts and same slope) testing procedure that there is only one regression line

A

H0: There is only one regression line (B2=0)

H1: There are two regression lines with different intercepts (B2≠0)

The test statistic is distributed t_(n+m-3)

115
Q

Regression using data from two sources (Different intercepts and slopes) testing procedure that there is

1) Same intercept
2) Same slope
3) Same slope and intercept
4) The two lines are connected at x=c

A
116
Q

What is the definition of a contingency table

A

A rectangular table having I rows for X categories and J columns for Y categories. The cells contain frequency counts of outcomes for a sample. The IxJ table is also called a cross classification table.