Statistics with SAS on Coursera: Week 4 Flashcards

Question 1

Q

What are the assumptions when performing a regression analysis?

Answer

A

1) A linear relationship fits the data adequately
2) The errors are normally distributed with a mean of 0
3) The errors have equal variance at each value of the predictor variable
4) The errors are independent

Question 2

Q

What should the scatter plot look like when testing that a linear model fits the data adequately?

Answer

A

The data should hover around the regression line.

Check for outliers that might influence the slope of the line, and for non-linear patterns such as curvilinear relationships or the autocorrelation common in time series data.

Question 3

Q

What is a residual?

Answer

A

A residual is the difference between an observed value of Y and its predicted value.
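
As a minimal sketch of that definition, the Python snippet below fits a one-predictor least-squares line and subtracts each predicted value from the observed one. The data and the helper name `fit_line` are made up for illustration; the course itself works in SAS, not Python.

```python
# Hypothetical toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(x, y):
    """Least-squares intercept and slope for a one-predictor regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]

# Residual = observed Y minus predicted Y, one per observation.
residuals = [yi - pi for yi, pi in zip(y, predicted)]
```

With an intercept in the model, least-squares residuals sum to zero (up to rounding), which is why the reference line in a residual plot sits at 0.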

Question 4

Q

How can you check for violations of equal variances, linearity, and independence?

Answer

A

Examine the shape of the scatter in the residuals versus predicted values chart.

You want to see a random scatter of the residual values above and below the reference line at 0.

No patterns should be visible in the residuals.

This indicates that the model assumptions are valid.

Question 5

Q

What does it mean when you see patterns or trends in the residual values?

Answer

A

Patterns or trends in the residuals suggest that one or more assumptions of the linear regression model are violated, so the model might not be adequate for the data.

Question 6

Q

What does the Q-Q plot look like when the residuals are normally distributed?

Answer

A

The Q-Q plot should appear to be a straight, diagonal line if the residuals are normally distributed.

Question 7

Q

Given the properties of the standard normal distribution, between which two values would approximately 95% of the studentized residuals fall?

Answer

A

-2 and 2

If we think of these STUDENT residuals as following the standard normal distribution and apply the 68-95-99.7 rule, we would expect about 5% of them to fall beyond the -2 and +2 limits by chance alone.
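
The "about 95%" figure can be checked directly against the standard normal distribution. This is a small Python sketch using the standard library's `statistics.NormalDist`; it is an illustration, not part of the course's SAS material.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

# Probability mass between -2 and +2, and the remainder in the two tails.
inside = z.cdf(2.0) - z.cdf(-2.0)
outside = 1.0 - inside

print(f"within +/-2: {inside:.4f}, beyond: {outside:.4f}")
```

The share inside the limits comes out a little above 95%, so roughly 1 residual in 20 is expected beyond them even when nothing is wrong with the model.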

Question 8

Q

What’s the difference between an outlier and an influential observation?

Answer

A

An outlier is an unusual data point, whereas an influential observation is an unusual data point that singlehandedly exerts influence on the regression model.

Question 9

Q

Which parts of your model might be affected by influential observations?

Answer

A

Influential observations could affect the model coefficients, the standard errors, or the predicted values.

For example, if deleting an observation results in a large change in parameter estimates, then that observation has a significant influence on the parameters.

If deleting an observation results in a change in the standard errors, then the observation influences the precision of the parameters.

Question 10

Q

Which diagnostic statistic may be used to detect outliers?

Answer

A

The STUDENT residuals (also known as studentized or standardized residuals)

Question 11

Q

What diagnostic statistics may be used to detect influential observations?

Answer

A

Cook’s D statistics, RSTUDENT residuals, and DFFITS statistics.

Question 12

Q

What diagnostic statistic do you use to determine which predictor variable is being influenced?

Answer

A

DFBETAS (difference in betas)

Question 13

Q

What does it mean if the RSTUDENT value differs from the STUDENT residual?

Answer

A

The observation is probably influential
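
To see why the two values can diverge, the Python sketch below computes both for a one-predictor regression. It assumes the usual textbook definitions: the STUDENT residual scales each residual by the error variance estimated from all observations, while RSTUDENT re-estimates that variance with the observation in question left out. The data and the function name are made up for illustration.

```python
import math

# Hypothetical toy data, made up for illustration: the last point sits well off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 25.0]

def student_and_rstudent(x, y):
    """Return (STUDENT, RSTUDENT) residual lists for a one-predictor regression."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)
    # Leverage of each observation in simple linear regression.
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    student, rstudent = [], []
    for e, h in zip(resid, leverage):
        # STUDENT: error variance estimated from all observations.
        student.append(e / math.sqrt(mse * (1.0 - h)))
        # RSTUDENT: error variance re-estimated with this observation deleted.
        mse_deleted = (sse - e * e / (1.0 - h)) / (n - p - 1)
        rstudent.append(e / math.sqrt(mse_deleted * (1.0 - h)))
    return student, rstudent

student, rstudent = student_and_rstudent(x, y)
for i, (s, r) in enumerate(zip(student, rstudent)):
    print(f"obs {i}: STUDENT={s:.2f}  RSTUDENT={r:.2f}")
```

For the last observation the two values differ sharply, because deleting it shrinks the estimated error variance; for the well-behaved points they stay close.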

Question 14

Q

Which statistic is most useful for identifying influential observations for explanatory models when the purpose of your model is parameter estimation?

Answer

A

Cook’s D

Question 15

Q

What does Cook’s D measure?

Answer

A

The Cook’s D statistic measures the distance between the set of parameter estimates obtained with a given observation deleted from your regression analysis and the set of parameter estimates obtained with all the observations included.
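
A rough Python sketch of the calculation for a one-predictor regression follows. It uses the standard closed form, which gives the same value as refitting without each observation, and flags points above 4/n. That cutoff is a common rule of thumb and an assumption here, since the cards do not state which cutoff the plot uses. Data and function name are made up for illustration.

```python
# Hypothetical toy data, made up for illustration: the last point sits well off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 25.0]

def cooks_d(x, y):
    """Cook's D for each observation in a one-predictor regression."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Closed form: D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2
    return [
        (e * e) / (p * mse) * h / (1.0 - h) ** 2
        for e, h in zip(resid, leverage)
    ]

d_values = cooks_d(x, y)
cutoff = 4.0 / len(x)  # assumed rule-of-thumb cutoff, not taken from the cards
flagged = [i for i, d in enumerate(d_values) if d > cutoff]
print("flagged observations:", flagged)
```

Only the last observation exceeds the cutoff in this toy data set: it combines a large residual with high leverage, which is what moves the parameter estimates.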

Question 16

Q

How can you determine the influential observations when examining the Cook’s D plot?

Answer

A

Look for observations above the horizontal cutoff line.

Question 17

Q

Which statistic is best for identifying influential observations when building a predictive model?

Answer

A

DFFITS is most useful for predictive models

Question 18

Q

Which test should you use to determine which predictor variable is being influenced?

Answer

A

DFBETAS, which stands for Difference in Betas

Question 19

Q

What is stored in the automatic macro variable _GLSIND generated when running PROC GLMSELECT?

Answer

A

The list of effects selected by PROC GLMSELECT

Question 20

Q

What should you do with influential observations?

Answer

A

1) Recheck for data entry errors
2) Consider whether you have an adequate model, and whether to change the model to accommodate unusual observations, for example by adding a categorical predictor to distinguish among groups
3) Determine whether the influential observation is valid and merely unusual

Question 21

Q

What problem do you have when two or more predictor variables are strongly correlated with one another?

Answer

A

Collinearity, also called multicollinearity, indicates that you have redundant information in your model.

When multiple variables try to explain the same variation in the response, it leads to inflated standard errors and instability in the regression model.
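
One way to quantify that inflation is the variance inflation factor (VIF), a standard collinearity diagnostic that these cards do not name. The Python sketch below covers the two-predictor case, where the R-squared from regressing one predictor on the other is just their squared correlation. The data, the helper name `pearson_r`, and the "VIF above 10" threshold are assumptions for illustration.

```python
import math

# Hypothetical predictors, made up for illustration: x2 is almost exactly 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

r = pearson_r(x1, x2)

# With exactly two predictors, R^2 of one regressed on the other equals r^2,
# so both predictors share the same VIF = 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)

# A VIF above 10 is a common rule of thumb for problematic collinearity (assumed here).
print(f"r = {r:.4f}, VIF = {vif:.1f}, collinear: {vif > 10}")
```

The VIF is the factor by which a coefficient's sampling variance is inflated relative to uncorrelated predictors, so its square root gives the inflation of the standard error.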

Question 22

Q

If there is no correlation among the predictor variables, can there still be collinearity in the model?

Answer

A

Collinearity occurs when there is correlation present among predictor variables in the model. If the predictor variables are not correlated, then there’s no collinearity present in the model.

Question 23

Q

When you use honest assessment, which model would be considered the best?

Answer

A

The best model is the simplest (the most parsimonious) model that has the best performance on the validation data. The training data is used to fit the model and generate the possible models to be assessed.

Study materials associated with Week 4 of the Statistics with SAS course offered on Coursera. (23 cards)