Statistics with SAS on Coursera: Week 4 Flashcards
Study materials associated with Week 4 of the Statistics with SAS course offered on Coursera.
What are the assumptions when performing a regression analysis?
1) A linear relationship fits the data adequately
2) The errors are normally distributed with a mean of 0
3) The errors have equal variance at each value of the predictor variable
4) The errors are independent
What should the scatter plot look like when testing that a linear model fits the data adequately?
The data should hover around the regression line.
Check for outliers that may be influencing the slope of the line, and for non-linear patterns such as curvilinear relationships or the autocorrelation common in time series data.
What is a residual?
A residual is the difference between each observed value of Y and its predicted value.
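As a minimal sketch of this definition, the Python snippet below fits a simple least-squares line to a small made-up data set and subtracts each predicted value from the observed value. The data and the function names are illustrative assumptions, not part of the course materials.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y):
    """Residual = observed Y minus predicted Y, for each observation."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
res = residuals(x, y)
print(res)
```

When the model includes an intercept, least-squares residuals sum to zero (up to rounding), which is a quick sanity check on the calculation.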
How can you check for violations of equal variances, linearity, and independence?
Examine the shape of the scatter in the residuals versus predicted values chart.
You want to see a random scatter of the residual values above and below the reference line at 0.
No patterns should be visible in the residuals.
This indicates that the model assumptions are valid.
What does it mean when you see patterns or trends in the residual values?
One or more of the model assumptions may be violated, so the linear regression model might not be valid.
What does the Q-Q plot look like when the residuals are normally distributed?
The Q-Q plot should appear to be a straight, diagonal line if the residuals are normally distributed.
Given the properties of the standard normal distribution, between which two values would approximately 95% of the studentized residuals fall?
-2 and 2
If we think of these STUDENT residuals as following the standard normal distribution and apply the 68-95-99.7 rule, we would expect about 5% of them to fall beyond the -2, +2 limits by chance alone.
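The rule of thumb above can be sketched in Python. The sketch assumes the usual definition of an internally studentized residual (the residual divided by its estimated standard error) for a simple linear regression with one predictor. The data set is hypothetical, with one deliberately unusual point at index 4.

```python
import math


def student_residuals(x, y):
    """Internally studentized residuals for a one-predictor regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Leverage of each observation in simple linear regression
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Mean squared error with n - 2 degrees of freedom (intercept + slope)
    mse = sum(e ** 2 for e in resid) / (n - 2)
    return [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, leverage)]


# Hypothetical data: the fifth observation (index 4) is far from the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 15.0, 12.1, 13.9, 16.2, 17.8, 20.1]

student = student_residuals(x, y)
flagged = [i for i, r in enumerate(student) if abs(r) > 2]

# Share of a standard normal distribution that lies between -2 and +2
share_within_two = math.erf(2 / math.sqrt(2))
print(flagged, round(share_within_two, 4))
```

The exact normal probability between -2 and +2 is slightly above 95%, which is why the 5% figure is an approximation.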
What’s the difference between an outlier and an influential observation?
An outlier is an unusual data point, whereas an influential observation is an unusual data point that singlehandedly exerts influence on the regression model.
Which parts of your model might be affected by influential observations?
Influential observations could affect the model coefficients, the standard errors, or the predicted values.
For example, if deleting an observation results in a large change in parameter estimates, then that observation has a significant influence on the parameters.
If deleting an observation results in a change in the standard errors, then the observation influences the precision of the parameters.
Which diagnostic statistic may be used to detect outliers?
The STUDENT residuals (also known as studentized or standardized residuals)
What diagnostic statistics may be used to detect influential observations?
Cook’s D statistics, RSTUDENT residuals, and DFFITS statistics.
What diagnostic statistic do you use to determine which predictor variable is being influenced?
DFBETAS (difference in betas)
What does it mean if the RSTUDENT value differs from the STUDENT residual?
The observation is probably influential
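To illustrate the comparison, the sketch below computes both residual types for a one-predictor regression. It assumes the usual textbook definitions: STUDENT uses the error variance estimated from all observations, while RSTUDENT uses the variance estimated with the current observation deleted. The data set and names are hypothetical.

```python
import math


def student_and_rstudent(x, y):
    """Return (STUDENT, RSTUDENT) lists for a one-predictor regression."""
    n = len(x)
    p = 2  # parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    mse = sum(e ** 2 for e in resid) / (n - p)
    student = [e / math.sqrt(mse * (1 - h)) for e, h in zip(resid, leverage)]
    # Closed-form link between the two residuals (no refitting needed)
    rstudent = [r * math.sqrt((n - p - 1) / (n - p - r ** 2)) for r in student]
    return student, rstudent


# Hypothetical data: index 4 does not follow the trend of the other points
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 15.0, 12.1, 13.9, 16.2, 17.8, 20.1]

student, rstudent = student_and_rstudent(x, y)
for i, (s, r) in enumerate(zip(student, rstudent)):
    print(i, round(s, 3), round(r, 3))
```

For the well-behaved points the two values are nearly identical. For the unusual point, RSTUDENT is far larger in magnitude, because deleting that point shrinks the estimated error variance considerably.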
Which statistic is most useful for identifying influential observations for explanatory models when the purpose of your model is parameter estimation?
Cook’s D
What does Cook’s D measure?
The Cook’s D statistic measures the distance between the set of parameter estimates with that observation deleted from your regression analysis, and the set of parameter estimates with all the observations in your regression analysis.
How can you determine the influential observations when examining the Cook’s D plot?
Look for observations above the horizontal cutoff line.
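A minimal sketch of the Cook's D calculation for a one-predictor regression follows. It uses the standard closed form based on the studentized residual and the leverage, and it assumes a cutoff of 4/n, which is a commonly used convention rather than something stated in these cards. The data set is hypothetical, with a high-leverage point at index 9.

```python
def cooks_d(x, y):
    """Cook's D for each observation in a one-predictor regression."""
    n = len(x)
    p = 2  # parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    mse = sum(e ** 2 for e in resid) / (n - p)
    # D_i = (squared studentized residual / p) * h_i / (1 - h_i)
    return [
        (e ** 2 / (mse * (1 - h))) / p * h / (1 - h)
        for e, h in zip(resid, leverage)
    ]


# Hypothetical data: the last point is far out in x and off the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 25.0]

d_values = cooks_d(x, y)
cutoff = 4 / len(x)  # assumed convention for the cutoff line
flagged = [i for i, d in enumerate(d_values) if d > cutoff]
print(flagged)
```

Only the high-leverage point exceeds the assumed cutoff here, which mirrors reading the Cook's D plot for observations above the horizontal line.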
Which statistic is best for identifying influential observations when building a predictive model?
DFFITS is most useful for predictive models
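As a rough illustration, DFFITS can be computed from the RSTUDENT residual and the leverage. The sketch below assumes the standard closed form for a one-predictor regression and a cutoff of 2 * sqrt(p / n), a common convention that is not taken from these cards. The data set is hypothetical.

```python
import math


def dffits(x, y):
    """DFFITS for each observation in a one-predictor regression."""
    n = len(x)
    p = 2  # parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    mse = sum(e ** 2 for e in resid) / (n - p)
    values = []
    for e, h in zip(resid, leverage):
        student = e / math.sqrt(mse * (1 - h))
        rstudent = student * math.sqrt((n - p - 1) / (n - p - student ** 2))
        # Scaled change in the predicted value when the point is deleted
        values.append(rstudent * math.sqrt(h / (1 - h)))
    return values


# Hypothetical data: the last point is far out in x and off the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 25.0]

values = dffits(x, y)
cutoff = 2 * math.sqrt(2 / len(x))  # assumed convention
most_influential = max(range(len(values)), key=lambda i: abs(values[i]))
print(most_influential, abs(values[most_influential]) > cutoff)
```

Because DFFITS measures how much an observation's own predicted value shifts when it is left out, it speaks directly to prediction quality.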
Which test should you use to determine which predictor variable is being influenced?
DFBETAS, which stands for Difference in Betas
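The idea behind DFBETAS can be sketched by refitting the model with each observation left out and scaling the change in every coefficient. The code below assumes the usual definition (change in the estimate divided by its standard error computed with the deleted-observation variance) and a cutoff of 2 / sqrt(n), which is a common convention and an assumption here. Data and names are hypothetical.

```python
import math


def fit(x, y):
    """Return (intercept, slope, sum of squared errors) of the OLS line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, sse


def dfbetas(x, y):
    """Return a list of (intercept DFBETAS, slope DFBETAS) per observation."""
    n = len(x)
    b0, b1, _ = fit(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # Diagonal elements of (X'X)^-1 for the intercept and the slope
    c00 = sum(xi ** 2 for xi in x) / (n * sxx)
    c11 = 1 / sxx
    result = []
    for i in range(n):
        d0, d1, sse_del = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        s_del = math.sqrt(sse_del / (n - 1 - 2))  # deleted-observation RMSE
        result.append(((b0 - d0) / (s_del * math.sqrt(c00)),
                       (b1 - d1) / (s_del * math.sqrt(c11))))
    return result


# Hypothetical data: the last point is far out in x and off the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 25.0]

values = dfbetas(x, y)
slope_change = [abs(s) for _, s in values]
cutoff = 2 / math.sqrt(len(x))  # assumed convention
worst = max(range(len(x)), key=lambda i: slope_change[i])
print(worst, slope_change[worst] > cutoff)
```

Each observation gets one value per coefficient, so a large value in the slope column points to the predictor whose estimate that observation is pulling.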
What is stored in the automatic macro variable _GLSIND generated when running PROC GLMSELECT?
The list of effects selected by PROC GLMSELECT
What should you do with influential observations?
1) Re-check for data entry errors
2) Consider whether or not you have an adequate model, and whether to change the model to accommodate unusual observations, for example by adding a categorical predictor to distinguish among groups
3) Determine whether or not the influential observation is valid but unusual.
What problem do you have when two or more predictor variables are strongly correlated with one another?
Collinearity, also called multicollinearity, indicates that you have redundant information in your model.
When multiple variables try to explain the same variation in the response, it leads to inflated standard errors and instability in the regression model.
If there is no correlation among the predictor variables, can there still be collinearity in the model?
Collinearity occurs when there is correlation present among predictor variables in the model. If the predictor variables are not correlated, then there’s no collinearity present in the model.
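One simple way to look for this kind of redundancy is to compute the pairwise correlation among predictors. The sketch below does that in plain Python and also reports the variance inflation factor for the two-predictor case, 1 / (1 - r^2), which is a common collinearity measure that is an addition here and not part of these cards. The predictor values are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(a, b):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(a, b)
    return 1 / (1 - r ** 2)


# Hypothetical predictors: x2 is almost exactly twice x1, x3 is unrelated
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
x3 = [5, 1, 4, 2, 3]

r12 = pearson_r(x1, x2)
r13 = pearson_r(x1, x3)
vif12 = vif_two_predictors(x1, x2)
vif13 = vif_two_predictors(x1, x3)
print(round(r12, 4), round(r13, 4), round(vif12, 1), round(vif13, 3))
```

The nearly redundant pair produces a correlation close to 1 and a very large inflation factor, while the weakly correlated pair stays close to 1, reflecting little inflation of the standard errors.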
When you use honest assessment, which model is considered the best?
The best model is the simplest (the most parsimonious) model that has the best performance on the validation data. The training data is used to fit the candidate models that are then assessed on the validation data.
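The selection rule can be sketched as a small function: find the best validation error, then choose the candidate with the fewest parameters among those that match it. The candidate list, the error values, and the optional tolerance parameter are all hypothetical and only illustrate the idea.

```python
def pick_best_model(candidates, tolerance=0.0):
    """Pick the simplest model among those with the best validation error.

    candidates: list of (name, number_of_parameters, validation_error)
    tolerance:  how close to the best error a model must be to count as tied
    """
    best_error = min(error for _, _, error in candidates)
    contenders = [c for c in candidates if c[2] <= best_error + tolerance]
    # Among the contenders, prefer the most parsimonious model
    return min(contenders, key=lambda c: c[1])


# Hypothetical candidates fitted on training data, scored on validation data
candidates = [
    ("A", 2, 4.10),
    ("B", 3, 3.25),
    ("C", 5, 3.25),
    ("D", 8, 3.40),
]

best = pick_best_model(candidates)
print(best[0])
```

Models B and C tie on validation error in this made-up example, so the rule prefers B because it uses fewer parameters.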