ANOVA and Regression Flashcards
Define the terms scatter plot, correlation, and regression line.
Scatter plot- a 2-dimensional graph of data values.
Correlation- A statistic that measures the strength and direction of a linear relationship between two quantitative variables
Regression line- an equation that describes the average relationship between a quantitative response variable and an explanatory variable
What is Pearson’s sample correlation coefficient (r), what are its bounds, and how is it calculated?
What are typical questions to ask from a scatter plot?
1. What is the average pattern? Does the scatter plot look like a straight line or a curve? 2. What is the direction of the pattern? Negative/positive association? 3. How much do individual points vary from the average pattern? 4. Are there any unusual data points?
What is the meaning if r= 1,0,-1?
r=1: all points fall on a straight line with positive slope. r=0: the best straight line through the data is exactly horizontal. r=-1: all points fall on a straight line with negative slope.
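The three cases above can be checked numerically. This is a minimal sketch, assuming the usual formula r = Sxy / sqrt(Sxx * Syy); the function name `pearson_r` and the small data sets are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Sample correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only (not from the flashcards).
x = [1.0, 2.0, 3.0, 4.0]
r_pos = pearson_r(x, [2.0, 4.0, 6.0, 8.0])   # points on a rising line
r_neg = pearson_r(x, [8.0, 6.0, 4.0, 2.0])   # points on a falling line
r_zero = pearson_r(x, [1.0, 2.0, 2.0, 1.0])  # best-fit line is horizontal
```

Because r is Sxy scaled by the two standard deviations, it always lies between -1 and 1.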
Equation for a straight regression line
Three general types of regression?
Simple linear regression, polynomial regression, multiple linear regression
Assumptions for error term in simple linear model

What are some topics of interest in regression?
1. Is there a linear relationship? 2. How to describe the relationship 3. How to predict a new value 4. How to predict the value of the explanatory variable that causes a specified response
What is the E[Yi] for a simple linear regression model
Definitions of B1 and B0
B1- the slope of the regression line, which indicates the change in the mean of the probability distribution of Y per unit increase in X. B0- the intercept of the regression line. If 0 is in the domain of X, then B0 gives the mean of the probability distribution of Y at X=0.
Are Y, X, B, eps random/fixed and known/unknown?
Y- Random, known; X- Fixed, known; B- Fixed, unknown; eps- Random, unknown
Describe the process of least squares estimation
Equation for a residual
Sxx, Syy, Sxy
Gauss-Markov Theorem
Under certain assumptions (mean zero, independent, homoskedastic errors) the least squares estimators have minimum variance among all linear unbiased estimators (they are the best linear unbiased estimators)
Best equations for B0 and B1 using least squares estimation
For simple linear regression, equation for SSE, degrees of freedom, relation to sig^2
Maximum likelihood estimation, explain what changes with regression from LSE.
MLE assumes normality. The B estimators are the same, but the estimators for sig^2 differ: MLE gives SSE/n, which is biased but asymptotically unbiased. The normal assumption is necessary for testing and interval construction.
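A minimal sketch of the comparison, assuming the standard simple linear regression formulas B1 = Sxy/Sxx and B0 = ybar - B1*xbar. The function name `fit_simple_regression` and the data are made up for illustration.

```python
def fit_simple_regression(x, y):
    """Least squares fit of y = b0 + b1*x, plus both estimators of sig^2."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "b0": b0,
        "b1": b1,
        "sse": sse,
        "sigma2_unbiased": sse / (n - 2),  # least squares: SSE / (n - 2)
        "sigma2_mle": sse / n,             # maximum likelihood: SSE / n
    }

# Illustrative data only (not from the flashcards).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = fit_simple_regression(x, y)
```

The two variance estimates differ only by the factor (n-2)/n, which tends to 1 as n grows; that is the sense in which SSE/n is asymptotically unbiased.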
J and n in terms of 1 vectors
J = 11’, n = 1’1
H matrix
X(X’X)^-1 X’
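The hat matrix is a projection, so it should be symmetric and idempotent, with trace equal to the number of columns of X when X has full column rank. A small NumPy sketch on a made-up design matrix (intercept plus one predictor):

```python
import numpy as np

# Illustrative design matrix: a column of ones and one predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])

# H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
trace_H = np.trace(H)  # equals the number of columns of X (here 2)
```

The diagonal entries of `H` are the leverages of the individual points.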
Linear form of y
By
Quadratic form of y
y’Ay
Quadratic forms are common in linear models as a way of _____. The sum of squares can be decomposed in terms of _______. A quadratic form of normal Y is _______. Independence of quadratic forms is based on _________.
expressing variation; quadratic forms; Chi-squared distributed; idempotent matrices
If l1=B1y and l2=B2y then what is cov(l1,l2)
cov(l1,l2)=B1cov(y)B2’
What does the trace function do?
It is the sum of the diagonal elements of a square matrix
If q=y’Ay where Y~N(u,V) then E[q]=
E[q]=u’Au+tr(AV)
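A short derivation of this result, written with the same symbols (u for the mean, V for the covariance). It uses only E[yy’] = V + uu’ and the cyclic property of the trace, so normality is not needed for the expectation itself:

```latex
\begin{aligned}
E[q] &= E[y'Ay] = E[\operatorname{tr}(y'Ay)] = E[\operatorname{tr}(Ayy')] \\
     &= \operatorname{tr}\big(A\,E[yy']\big) = \operatorname{tr}\big(A(V + uu')\big) \\
     &= \operatorname{tr}(AV) + \operatorname{tr}(Auu') = \operatorname{tr}(AV) + u'Au
\end{aligned}
```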
Matrix expression of LSE
Bhat = (X’X)^-1 X’y
Matrix expression of e and var(e)
Matrix expression of SST, SSE, SSR
E[Bhat] and Var[Bhat]
For y~N(u,V), if l=By and q=y’Ay with A symmetric and idempotent, then how to show l and q are independent?
Show BVA=0
For y~N(u,V), q1=y’A1y and q2=y’A2y then how to show q1 and q2 are independent.
Show A1VA2=0
For y~N(0,V), q=y’Ay then q~____ where ____ idempotent
Chi-squared with rank(A) degrees of freedom; AV idempotent
For y~N(0,I), q=y’Ay then how to obtain t distribution
How to obtain F distribution (two ways)
Why use centered regression?
Centered regression (using xi-xbar in place of xi) helps to reduce the ill effects caused by high correlations among the columns of X (the covariates). This is collinearity: under exact collinearity det(X’X)=0, so (X’X)^-1 does not exist, and under near collinearity det(X’X) is close to 0, so (X’X)^-1 is unstable. “Collinearity” means a “near-linear” relationship (high correlation coefficient) among covariates.
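One way to see the effect of centering is to compare the condition number of X’X before and after. This is a sketch on made-up data whose x values sit far from zero, so the intercept column and the x column are nearly proportional:

```python
import numpy as np

# Illustrative predictor values far from 0 (not from the flashcards).
x = np.array([101.0, 102.0, 103.0, 104.0, 105.0])
ones = np.ones_like(x)

X_raw = np.column_stack([ones, x])
X_centered = np.column_stack([ones, x - x.mean()])

# A large condition number means X'X is close to singular.
cond_raw = np.linalg.cond(X_raw.T @ X_raw)
cond_centered = np.linalg.cond(X_centered.T @ X_centered)

# After centering, the intercept column is orthogonal to the predictor column.
off_diagonal = (X_centered.T @ X_centered)[0, 1]
```

With the off-diagonal entry equal to zero, X’X is diagonal for the centered model, and so is its inverse.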
Cov(B*_0,B*_1), [centered regression]
t distribution and statistic for simple linear regression
What to show for a t-distribution (3 things)
1) the numerator is normally distributed 2) the denominator is the square root of a chi-square variable divided by its degrees of freedom 3) the numerator is independent of the denominator
CI for B0 and B1 in simple linear regression
Testing procedure for if multiple slopes are zero
ANOVA table for simple linear regression
R^2 (two ways)
R^2 in centering
What are the two meanings for prediction
What distribution does B_0 hat + B_1 hat x_new follow?
100(1-alpha)% CI for E(Y|X=x_new)
100(1-alpha)% PI for a new observation Y_new at X=x_new
What in words is the Bonferroni correction method?
Divide the alpha level by m, where m is the number of confidence intervals for which simultaneous coverage is desired, and construct each interval at level alpha/m
Explain the Scheffe method
Explain in words the assumption of linearity and how to check
The assumption that a linear model is actually a good fit for the data. It can be checked by inspecting scatter plots for a linear relationship between each explanatory variable and the response, as well as by inspecting residual plots for patterns.
Explain in words the assumption of Randomness and how to check
The assumption that there is no structure (no pattern) in the residuals. The runs test is a nonparametric method for detecting structure in the residuals by counting the number of sequences (runs) of points above or below the mean/median residual. The Durbin-Watson test is applicable if the data can be arranged in time order; it uses tabulated critical values and tests whether the correlation between successive errors is 0.
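Both checks reduce to simple computations on the ordered residuals. A minimal sketch, assuming the usual Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2) and a plain count of runs about a center value. The function names and residual sequences are made up, and the critical values still have to come from the tables.

```python
def durbin_watson(residuals):
    """d is near 2 with no serial correlation, near 0 with strong positive
    correlation, and near 4 with strong negative correlation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def count_runs(residuals, center=0.0):
    """Number of maximal sequences of residuals on the same side of center."""
    signs = [e > center for e in residuals if e != center]
    if not signs:
        return 0
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Illustrative residual sequences in time order (not from the flashcards).
trending = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]      # slow drift
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips every step

d_trending = durbin_watson(trending)
d_alternating = durbin_watson(alternating)
runs_trending = count_runs(trending)
runs_alternating = count_runs(alternating)
```

Too few runs (as in `trending`) or too many (as in `alternating`) both suggest the residuals are not random.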
Explain in words the assumption of homoskedasticity (constant variance) and how to check
The assumption that the errors have constant variance. This can be checked using a scatter plot or a residual plot, or tested using the Brown-Forsythe (BF) test (also called the modified Levene test, for groups) and the Breusch-Pagan (BP) test for general constant variance.
Explain in words the assumption of Normality of error and how to check
Normality of error can be checked by plotting the residuals using a box plot, histogram, or normal probability plot. It can also be tested formally using the Shapiro-Wilk test, Kolmogorov-Smirnov test, or Anderson-Darling test. Note, however, that normal probability plots provide no information if the assumptions of linearity or homoskedasticity have been violated.
Define an influential point
One that simultaneously has a large absolute residual and high leverage
Define leverage
Leverage measures how far a point’s x value lies from the other x values, and hence its potential effect on the fitted regression. The leverage of the ith point is the diagonal element hii of the hat matrix.
Three types of residuals and their definitions
Influence: how to measure, factors of influence, situations for high influence, measuring high influence
Cook’s distance measures influence. It depends on two factors: leverage and the size of the residual. There are three situations which can cause high influence: high residual + moderate leverage, high leverage + moderate residual, or both high. Cook’s distance is large if D_i > F_{alpha, p, n-p}.
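A sketch of how the two factors combine, assuming the common form D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2), with e_i the ordinary residual and h_ii the leverage. The function name `cooks_distance` and the data are made up; the last point is placed far out in x and off the trend of the others.

```python
import numpy as np

def cooks_distance(X, y):
    """Return (Cook's distances, leverages) for the least squares fit of y on X."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
    h = np.diag(H)                        # leverages h_ii
    e = y - H @ y                         # ordinary residuals
    mse = e @ e / (n - p)
    D = e ** 2 * h / (p * mse * (1 - h) ** 2)
    return D, h

# Illustrative data only (not from the flashcards).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.0])
X = np.column_stack([np.ones_like(x), x])

D, h = cooks_distance(X, y)
most_influential = int(np.argmax(D))
```

Here the last point does not have the largest residual, but its leverage is so high that its Cook’s distance dominates the others.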
General rule of thumb for large residual
General depiction of Lack of fit Test
Model for Lack of fit Test
ANOVA for LoFT
SSPE and SSLOF in matrix notation
Remedial approaches (for heterogeneity of variance and nonlinearity combinations)
If nonlinearity but homogeneity of variance: change model
Linearity but heterogeneity of variance: WLS or transform
Heterogeneity of variance and nonlinearity: Apply a transformation
If right-skewed data with heterogeneity and nonlinearity: log transformation
If count data: square root transformation
If proportions: arcsin square root transformation
Box-Cox transformation (what it accomplishes, how, when)
Prediction intervals for a transformed model and confidence intervals for a transformed model*
What to do when needing inference without normality assumption and procedure to accomplish this.
Bootstrap. Procedure: 1) Draw a sample of size n from the dataset with replacement 2) Compute the statistic of interest on that resample (usually using mean to compute parameter [x(x’x)^-1x’]) 3) Repeat N times and order the N results 4) Depending on alpha, find the corresponding lower and upper percentiles of the ordered results and use them to compute the interval
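The four steps can be sketched for a regression slope as follows. This assumes the simple percentile version of the bootstrap with resampling of (x, y) pairs; the function names, the data, and the defaults (2000 resamples, alpha = 0.05, a fixed seed) are illustrative choices rather than part of the flashcard.

```python
import random

def slope(pairs):
    """Least squares slope Sxy / Sxx for a list of (x, y) pairs."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    return sxy / sxx

def bootstrap_percentile_ci(pairs, stat, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(pairs)
    stats = []
    while len(stats) < n_boot:
        # Step 1: resample n cases with replacement.
        resample = [rng.choice(pairs) for _ in range(n)]
        try:
            # Step 2: compute the statistic on the resample.
            stats.append(stat(resample))
        except ZeroDivisionError:
            continue  # degenerate resample (all x equal); draw again
    # Step 3: order the results.
    stats.sort()
    # Step 4: read off the alpha/2 and 1 - alpha/2 percentiles.
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Illustrative data only (not from the flashcards).
data = list(zip(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [2.3, 3.8, 6.1, 8.4, 9.7, 12.2, 13.9, 16.3, 17.8, 20.1],
))
estimate = slope(data)
lower, upper = bootstrap_percentile_ci(data, slope)
```

Resampling whole cases avoids relying on the normality of the errors, which is the motivation given in the card.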
Weighted Least squares: how to accomplish, when to accomplish
Prediction interval and confidence intervals for WLS
What is the model for polynomial regression
General ANOVA test procedure for polynomial relationship
In regression, what are the possible repercussions of including a higher-order term that should not be there, or of omitting a higher-order term that should be there?
If a higher-order term is included that isn’t in the true relationship, then the estimators remain unbiased but the prediction variance is higher. If a higher-order term which should be there is not included, then the estimators are no longer unbiased.
Model for Multiple linear regression
What theorem still applies in Polynomial and multiple linear regression?
Gauss-Markov Theorem
General ANOVA test procedure for multiple linear regression
SSR(A|B)= (2 ways)
SSR(A,B)-SSR(B)
SSE(B)-SSE(A,B)
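The two expressions agree because SST = SSR + SSE holds for each model that includes an intercept. A quick numerical check on simulated data; the variable names, coefficients, and seed are made up for illustration:

```python
import numpy as np

def sse_and_ssr(X, y):
    """Error and regression sums of squares for the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    sse = float(np.sum((y - fitted) ** 2))
    ssr = float(np.sum((fitted - y.mean()) ** 2))
    return sse, ssr

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 30
a = rng.normal(size=n)
b = rng.normal(size=n)
y = 1.0 + 2.0 * a - 1.0 * b + rng.normal(scale=0.5, size=n)
ones = np.ones(n)

sse_b, ssr_b = sse_and_ssr(np.column_stack([ones, b]), y)       # model with B only
sse_ab, ssr_ab = sse_and_ssr(np.column_stack([ones, a, b]), y)  # model with A and B

extra_from_ssr = ssr_ab - ssr_b  # SSR(A,B) - SSR(B)
extra_from_sse = sse_b - sse_ab  # SSE(B) - SSE(A,B)
```

Both quantities measure how much of the variation left unexplained by B is picked up once A is added.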
R^2_{y,x1|x2} (2 interpretations/definitions)
What is the hat fact?
H_R H_F = H_F H_R = H_R, where H_R and H_F are the hat matrices of the reduced and full models
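A numerical check of this identity, assuming R and F denote a reduced model and a full model whose columns include those of the reduced model. The simulated predictors and the helper name `hat` are illustrative.

```python
import numpy as np

def hat(X):
    """Hat matrix X (X'X)^-1 X' for a full-column-rank design matrix."""
    return X @ np.linalg.inv(X.T @ X) @ X.T

# Simulated predictors, for illustration only.
rng = np.random.default_rng(0)
n = 12
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

X_reduced = np.column_stack([np.ones(n), x1])    # intercept and x1
X_full = np.column_stack([np.ones(n), x1, x2])   # adds x2

H_R = hat(X_reduced)
H_F = hat(X_full)

left = H_R @ H_F
right = H_F @ H_R
```

Projecting onto the larger column space and then onto the smaller one inside it gives the same result as projecting onto the smaller one directly.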
What are added variable plots (partial regression plots) and what is a valuable heuristic from their evaluation
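They are plots of two sets of residuals against each other: ei(Y|Xm), from regressing Y on the predictors Xm already in the model, and ei(Xk|Xm), from regressing the candidate predictor Xk on Xm. If there is a clear linear relationship in the added variable plot, one should add Xk to the model.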
Describe the 3 Sequential Variable Selection Methods and their various measurements of selection
- Forward selection (start with the null model and add the best variable individually)
- Backward selection (start with the full model and subtract the worst variable individually)
- Stepwise selection (start with the null model and add/subtract to maximize the desired measurement criteria)
Measurement criteria: Adjusted R2, Mallows’ Cp, AIC/BIC
Adjusted R2 definition (2 ways)
Mallows’ Cp definition and how to evaluate
Collinearity definition
Collinearity is a “near-linear” relationship (a high correlation coefficient) among covariates. It increases the variance of the estimators.
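The variance increase can be quantified for two predictors with the variance inflation factor 1 / (1 - r12^2), where r12 is their sample correlation. A sketch on simulated, nearly collinear predictors (the data and seed are made up); the same number appears on the diagonal of the inverse correlation matrix.

```python
import numpy as np

# Simulated predictors, for illustration only: x2 is x1 plus a little noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)

r12 = np.corrcoef(x1, x2)[0, 1]

# Two-predictor VIF computed from the correlation coefficient.
vif_from_r = 1.0 / (1.0 - r12 ** 2)

# The same value read from the inverse of the 2x2 correlation matrix.
vif_from_inverse = np.linalg.inv(np.corrcoef(x1, x2))[0, 0]
```

As r12 approaches 1 in absolute value, the factor grows without bound, which is the near-singularity of X’X seen from another angle.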
What does standardized regression accomplish and how does it do this?
VIF definition for two variables (2 ways) and rule of thumb for VIF indication
Some indicators of collinearity
4 important remarks on R2
1) Not an estimate of any population quantity unless the data are multivariate normal
2) Can be dramatically changed by how the x’s are selected
3) Does not capture nonlinear relationships, only linear ones
4) Non-decreasing in the number of predictors. Adding an extra predictor will not cause R2 to decrease
Regression model using data from two sources (Different intercepts but same slope)
How to pick a model using AIC/BIC
Choose the model with the smallest value
Given a contingency table (what does this look like?), how would one obtain a table of proportions and also calculate pij and pj|i?
Two categorical response variables are independent (in a contingency table) if…
All joint probabilities equal the product of their marginal probabilities (p_ij = p_i. p_.j for all i,j)
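A small sketch of the computation: joint proportions come from dividing each count by the grand total, marginals from summing rows and columns, and conditionals from dividing each joint proportion by its row marginal. The 2x2 counts are made up, with proportional rows so that the independence condition holds exactly.

```python
# Illustrative counts only (not from the flashcards); rows are proportional.
counts = [[20, 30],
          [40, 60]]

n = sum(sum(row) for row in counts)
n_rows, n_cols = len(counts), len(counts[0])

# Joint proportions p_ij.
p = [[c / n for c in row] for row in counts]

# Marginal proportions p_i. (rows) and p_.j (columns).
p_row = [sum(p[i]) for i in range(n_rows)]
p_col = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]

# Conditional proportions p_{j|i} = p_ij / p_i.
p_cond = [[p[i][j] / p_row[i] for j in range(n_cols)] for i in range(n_rows)]

# Independence: every joint proportion equals the product of its marginals.
independent = all(
    abs(p[i][j] - p_row[i] * p_col[j]) < 1e-12
    for i in range(n_rows)
    for j in range(n_cols)
)
```

Under independence the conditional distribution in every row equals the column marginal, which is what `p_cond` shows here.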
3 measures of relationships for square contingency tables
Difference in proportions, relative risk, odds ratio
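For a 2x2 table these measures are short calculations on the two row proportions. A sketch on a made-up table whose rows are two groups and whose columns are success and failure; the odds ratio is built from the two rows’ odds.

```python
# Illustrative 2x2 counts only: rows are groups, columns are (success, failure).
table = [[30, 70],
         [10, 90]]

p1 = table[0][0] / sum(table[0])  # proportion of successes in row 1
p2 = table[1][0] / sum(table[1])  # proportion of successes in row 2

difference = p1 - p2              # difference in proportions, between -1 and 1
relative_risk = p1 / p2           # ratio of proportions, nonnegative

odds1 = p1 / (1 - p1)             # odds of success in row 1
odds2 = p2 / (1 - p2)             # odds of success in row 2
odds_ratio = odds1 / odds2

# Cross-product form of the odds ratio computed directly from the counts.
cross_product = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])
```

When the row and column classifications are independent, the difference is 0 and both ratios equal 1.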
For a 2x2 proportion table, define difference in proportions for fixing column 1 or fixing row 1, the range, and when there is statistical independence of row/col classification
Relative risk definition, range, meaning of 1, and comparison to difference in proportions
Definition of odds, range, inverse relationship to proportion of success
Large sample distribution of log(theta hat [estimated odds ratio]), difference of proportions, and log(r hat [estimated relative risk])
Definition of odds ratio (various ways to calculate), relation to relative risk, all possible calculations given IJ table
3 ways to test independence for contingency tables and how
Test of Goodness of Fit
Test of Homogeneity
Test of Symmetry (matched pairs test, McNemar’s Test)
Simpson’s paradox
Occurs when the data are incorrectly grouped together without the relevant factor. It calls for a higher dimensional table to truly address the problem
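A small numerical illustration with hypothetical counts: treatment A has the higher success rate inside each group, yet B looks better once the groups are pooled, because the treatments were applied to the groups in very different proportions. All numbers and names here are made up.

```python
# Hypothetical (successes, trials) for two treatments within two groups.
data = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

# Success rates within each group (the higher dimensional table).
within = {
    group: {t: s / n for t, (s, n) in arms.items()}
    for group, arms in data.items()
}

def pooled_rate(treatment):
    """Success rate after collapsing over the grouping factor."""
    successes = sum(data[g][treatment][0] for g in data)
    trials = sum(data[g][treatment][1] for g in data)
    return successes / trials

overall = {t: pooled_rate(t) for t in ("A", "B")}

a_wins_within_every_group = all(r["A"] > r["B"] for r in within.values())
b_wins_overall = overall["B"] > overall["A"]
```

Keeping the grouping factor in the table is what resolves the apparent contradiction.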
Form of GLM and three components
Definition of GLM
GLMs extend ordinary regression models to encompass nonnormal response distributions and modeling functions of the mean
The response variable in a GLM follows a distribution in the _____. What is the formula for each yi, and the canonical link?
When Y={0,1} what GLM to use? What is this process?
When Y={0,1,2,…} what GLM to use? What is this process?
Deviance for Poisson and Binomial GLM
Confidence for Poisson and Binomial GLM
Testing procedure for slope of Weighted Least squares
Matrix formula for r^2_{y,xk|{xi, i≠k}}
Regression model using data from two sources (Different intercepts and slope)
Regression using data from two sources (Different intercepts and same slope) testing procedure that there is only one regression line
H0: There is only one regression line (B2=0)
H1: There are two regression lines with different intercepts (B2≠0)
Test statistic is distributed t_{n+m-3}
Regression using data from two sources (Different intercepts and slopes) testing procedure for whether there is:
1) Same intercept
2) Same slope
3) Same slope and intercept
4) The two lines are connected at x=c
What is the definition of a contingency table
A rectangular table having I rows for X categories and J columns for Y categories. The cells contain frequency counts of outcomes for a sample. The IxJ table is also called a cross classification table.