Statistics with SAS on Coursera: Week 4 Flashcards

Study materials associated with Week 4 of the Statistics with SAS course offered on Coursera.

1
Q

What are the assumptions when performing a regression analysis?

A

1) A linear relationship fits the data adequately
2) The errors are normally distributed with a mean of 0
3) The errors have equal variance at each value of the predictor variable
4) The errors are independent

2
Q

What should the scatter plot look like when testing that a linear model fits the data adequately?

A

The data should hover around the regression line.

Check for possible outliers that influence the slope of the line, and for non-linear patterns in the data, such as curvilinear relationships or the autocorrelation common in time series data.

3
Q

What is a residual?

A

A residual is the difference between an observed value of Y and its predicted value.
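
As a minimal sketch of this definition (plain Python rather than SAS, with made-up data and hypothetical helper names `fit_line` and `residuals`), the residuals of a simple least-squares line can be computed like this:

```python
# Least-squares fit of y = b0 + b1*x, then residual = observed - predicted.
# The data and function names are illustrative, not taken from the course.

def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def residuals(x, y):
    """Observed Y minus predicted Y for every observation."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
res = residuals(x, y)
```

With an intercept in the model, least-squares residuals sum to zero (up to rounding), which is why the reference line in a residual plot sits at 0.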

4
Q

How can you check for violations of equal variances, linearity, and independence?

A

Examine the shape of the scatter in the residuals versus predicted values chart.

You want to see a random scatter of the residual values above and below the reference line at 0.

No patterns should be visible in the residuals.

This indicates that the model assumptions are valid.

5
Q

What does it mean when you see patterns or trends in the residual values?

A

One or more assumptions of the linear regression model may be violated, so the model might have problems.

6
Q

What does the Q-Q plot look like when the residuals are normally distributed?

A

The Q-Q plot should appear to be a straight, diagonal line if the residuals are normally distributed.

7
Q

Given the properties of the standard normal distribution, between which two values would approximately 95% of the studentized residuals fall?

A

-2 and 2

If we think of these STUDENT residuals as following the standard normal distribution and apply the 68/95/99.7 rule, we would expect about 5% of them to fall beyond the -2, +2 limits by chance.
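
The -2/+2 screen can be sketched in plain Python (not SAS). The sketch assumes a simple linear regression with one predictor, where the leverage of observation i is 1/n + (x_i - mean(x))^2 / Sxx and the studentized residual is the raw residual divided by its estimated standard error. The data and the function name `student_residuals` are made up for illustration.

```python
import math

def student_residuals(x, y):
    """Internally studentized residuals for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(ei ** 2 for ei in e) / (n - 2)            # two estimated parameters
    h = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]  # leverage of each point
    return [ei / math.sqrt(mse * (1 - hi)) for ei, hi in zip(e, h)]

# Illustrative data: the 9th response was deliberately pushed away from the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 15.0, 10.1]

# Zero-based positions of observations outside the -2, +2 limits.
flagged = [i for i, r in enumerate(student_residuals(x, y)) if abs(r) > 2]
```

Because roughly 5% of well-behaved observations exceed the limits by chance, a flagged point is a candidate for inspection rather than proof of a problem.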

8
Q

What’s the difference between an outlier and an influential observation?

A

An outlier is an unusual data point, whereas an influential observation is an unusual data point that singlehandedly exerts influence on the regression model.

9
Q

Which parts of your model might be affected by influential observations?

A

Influential observations could affect the model coefficients, the standard errors, or the predicted values.

For example, if deleting an observation results in a large change in parameter estimates, then that observation has a significant influence on the parameters.

If deleting an observation results in a change in the standard errors, then the observation influences the precision of the parameters.

10
Q

Which diagnostic statistic may be used to detect outliers?

A

The STUDENT residuals (also known as studentized or standardized residuals)

11
Q

What diagnostic statistics may be used to detect influential observations?

A

Cook’s D statistics, RSTUDENT residuals, and DFFITS statistics.

12
Q

What diagnostic statistic do you use to determine which predictor variable is being influenced?

A

DFBETAS (difference in betas)

13
Q

What does it mean if the RSTUDENT value differs from the STUDENT residual?

A

The observation is probably influential

14
Q

Which statistic is most useful for identifying influential observations for explanatory models when the purpose of your model is parameter estimation?

A

Cook’s D

15
Q

What does Cook’s D measure?

A

The Cook’s D statistic measures the distance between the set of parameter estimates with that observation deleted from your regression analysis, and the set of parameter estimates with all the observations in your regression analysis.
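
The delete-one idea can be sketched in plain Python (not SAS) for a one-predictor model. The sketch refits the line with each observation removed and measures how far the fitted values move, scaled by the number of parameters (2) times the mean squared error. The data, the helper names, and the 4/n cutoff (a commonly used rule of thumb) are assumptions for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - b1 * mean_x, b1

def cooks_d(x, y):
    """Cook's D for each observation, computed by leave-one-out refitting."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2)
    result = []
    for i in range(n):
        # Refit without observation i.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Squared shift in all n fitted values caused by deleting observation i.
        shift = sum((fi - (a0 + a1 * xi)) ** 2 for xi, fi in zip(x, fitted))
        result.append(shift / (2 * mse))
    return result

# Illustrative data: the 9th response was deliberately pushed away from the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 15.0, 10.1]
d = cooks_d(x, y)
above_cutoff = [i for i, di in enumerate(d) if di > 4 / len(x)]
```

Refitting n times is wasteful for large data sets; statistical software normally uses an equivalent closed form based on the residual and the leverage, but the loop makes the "with and without the observation" comparison explicit.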

16
Q

How can you determine the influential observations when examining the Cook’s D plot?

A

Look for observations above the horizontal cutoff line.

17
Q

Which statistic is best for identifying influential observations when building a predictive model?

A

DFFITS is most useful for predictive models

18
Q

Which statistic should you use to determine which predictor variable is being influenced?

A

DFBETAS, which stands for Difference in Betas

19
Q

What is stored in the automatic macro variable _GLSIND generated when running PROC GLMSELECT?

A

The list of effects selected by PROC GLMSELECT

20
Q

What should you do with influential observations?

A

1) Re-check for data entry errors
2) Consider whether you have an adequate model, and whether to change the model to accommodate unusual observations, for example by adding a categorical predictor to distinguish among groups
3) Determine whether the influential observation is valid and merely unusual

21
Q

What problem do you have when two or more predictor variables are strongly correlated with one another?

A

Collinearity, also called multicollinearity, indicates that you have redundant information in your model.

When multiple variables try to explain the same variation in the response, it leads to inflated standard errors and instability in the regression model.

22
Q

If there is no correlation among the predictor variables, can there still be collinearity in the model?

A

Collinearity occurs when there is correlation present among predictor variables in the model. If the predictor variables are not correlated, then there’s no collinearity present in the model.
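
For the special case of exactly two predictors, the link between correlation and inflated standard errors can be sketched in plain Python (not SAS) with the variance inflation factor, which in that case reduces to 1 / (1 - r^2), where r is the correlation between the two predictors. The data and function names are made up for illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model.

    Raises ZeroDivisionError when the predictors are perfectly correlated.
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# x2 is close to 2 * x1, so the two predictors carry nearly the same information.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
vif_redundant = vif_two_predictors(x1, x2)

# These two predictors are exactly uncorrelated, so nothing is inflated.
vif_uncorrelated = vif_two_predictors([-1, 0, 1], [1, -2, 1])
```

A VIF of 1 means no inflation. With more than two predictors the VIF of each predictor comes from regressing it on all the others, which this two-variable shortcut does not cover.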

23
Q

When you use honest assessment, which model would be considered the best?

A

The best model is the simplest (the most parsimonious) model that has the best performance on the validation data. The training data is used to fit the model and generate the possible models to be assessed.
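
The selection rule can be sketched in plain Python (not SAS): pick the candidate with the lowest validation error, and break ties in favor of the model with fewer effects. The candidate names, effect counts, and error values below are invented for illustration.

```python
# Each candidate: (name, number of effects, error measured on the validation data).
# All values are hypothetical.
candidates = [
    ("model_a", 2, 4.10),
    ("model_b", 4, 3.25),
    ("model_c", 7, 3.25),
    ("model_d", 9, 3.40),
]

def pick_best(cands):
    """Lowest validation error first; among ties, the fewest effects."""
    return min(cands, key=lambda c: (c[2], c[1]))[0]

best = pick_best(candidates)
```

Here `model_b` and `model_c` tie on validation error, so the rule prefers `model_b` because it is the more parsimonious of the two.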