Test Flashcards
Pearson coefficient of correlation
It measures the strength and direction of the linear relationship between two variables
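A minimal sketch of computing r in Python with scipy (the data are made up for illustration):
```python
import numpy as np
from scipy.stats import pearsonr

# Made-up data with a strong positive linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r, p_value = pearsonr(x, y)  # r near +1; near -1 would be strongly negative
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```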
Multicollinearity
Is a condition that occurs when two or more independent variables are highly correlated
R^2
Measures the percentage of variation in the dependent variable that is explained by the set of all independent variables in the model
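A quick sketch with statsmodels on made-up data:
```python
import numpy as np
import statsmodels.api as sm

# Made-up data: y depends linearly on two predictors plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared)      # share of variation in y explained by the model
print(model.rsquared_adj)  # version adjusted for the number of predictors
```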
Nested model
Two models are said to be nested if one contains all the variables of the other model plus at least one extra variable
Mallows' Cp
- Popular model selection criterion
- Mallows' Cp is related to adjusted R^2 but imposes a penalty for increasing the number of independent variables
- It is called a parsimonious decision criterion
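A sketch assuming the common form Cp = SSE_p / MSE_full − (n − 2p), where p counts the candidate model's parameters including the intercept (data made up):
```python
import numpy as np
import statsmodels.api as sm

def mallows_cp(sse_p, mse_full, n, p):
    # Cp = SSE_p / MSE_full - (n - 2p); p includes the intercept
    return sse_p / mse_full - (n - 2 * p)

# Made-up data: the full model has 3 predictors, but only the
# first one actually matters
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 1 + 2 * X[:, 0] + rng.normal(size=60)

full = sm.OLS(y, sm.add_constant(X)).fit()
sub = sm.OLS(y, sm.add_constant(X[:, [0]])).fit()

# Cp close to p suggests the smaller model predicts about as well
print(mallows_cp(sub.ssr, full.mse_resid, n=60, p=2))
```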
PRESS statistic
PRESS is based on the leave-one-out, or jackknife, technique, in which one fits the model without the ith observation and uses this fitted model to predict the response when x = xi. The PRESS residuals are defined as e(i) = yi − y hat(i). The process is repeated for all n observations.
- The lower the value of PRESS, the better the predictive model (a worked sketch follows the Predicted R^2 card below)
Predicted R^2
It indicates how well a regression model predicts responses for new observations. This statistic helps you determine when the model fits the original data but is less capable of providing valid predictions for new observations.
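A minimal sketch covering this card and the PRESS card with plain numpy: refit without each observation to accumulate PRESS, then predicted R^2 = 1 − PRESS / SST (made-up data):
```python
import numpy as np

# Made-up data for a simple linear regression
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1 + 2 * x + rng.normal(size=30)
X = np.column_stack([np.ones_like(x), x])

# Leave-one-out: refit without observation i, predict y_i, accumulate PRESS
press = 0.0
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press += (y[i] - X[i] @ beta) ** 2

sst = np.sum((y - y.mean()) ** 2)
print(press, 1 - press / sst)  # predicted R^2 = 1 - PRESS / SST
```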
VIF (variance inflation factor)
Test for multicollinearity. It quantifies the degree to which the variance of an estimated regression coefficient is increased due to collinearity among the predictor variables.
- If a VIF is bigger than 5, the model probably has a problem with multicollinearity
- If all VIFs are less than 1/(1 - R^2), then multicollinearity is not strong enough to affect the coefficient estimates
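A sketch using statsmodels' variance_inflation_factor on made-up predictors, where x2 is nearly a copy of x1 and so should show a large VIF:
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # almost collinear with x1
x3 = rng.normal(size=100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

for i in range(1, X.shape[1]):  # skip the constant column
    print(f"VIF for predictor {i}: {variance_inflation_factor(X, i):.1f}")
```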
Heteroscedasticity
Occurs when regression results produce error terms that are of significantly varying degrees across settings of the independent variables
- The variance might grow larger as the independent variables get larger
How to stabilize heteroscedasticity?
Transform y, for example:
- ln(y)
- square root of y
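A tiny illustration of the transforms (values made up):
```python
import numpy as np

y = np.array([1.2, 3.5, 8.0, 20.1, 55.3])  # spread grows with the level

y_log = np.log(y)    # ln(y): strong stabilizer for multiplicative spread
y_sqrt = np.sqrt(y)  # square root: milder, common for count-like data
```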
Test for Heteroscedasticity
Divide the sample observations into two subgroups based on the values of y hat (or, equivalently, the values of x)
We next calculate the variance of the observations in subgroups 1 and 2 and perform a test of hypothesis for the ratio of the variances
F = larger variance / smaller variance
We look in the F table; if the test statistic > the table value, we reject equal variances
Df: one for each variance (larger, smaller), each equal to the number of observations used to compute that variance minus 1
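A sketch of the variance-ratio test with scipy, using made-up residuals from the low-x and high-x halves of a sample:
```python
import numpy as np
from scipy.stats import f

e_low = np.array([-1.1, 0.8, -0.5, 1.2, -0.9, 0.6])   # small-x subgroup
e_high = np.array([-3.2, 2.8, -4.1, 3.5, -2.9, 3.8])  # large-x subgroup

v1, v2 = np.var(e_high, ddof=1), np.var(e_low, ddof=1)
F = max(v1, v2) / min(v1, v2)   # larger variance over smaller
df1 = df2 = len(e_low) - 1      # one df per variance: subgroup size - 1
F_crit = f.ppf(0.95, df1, df2)  # the 5% point from the F table
print(F, F_crit, F > F_crit)    # True -> reject equal variances
```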
Anderson-darling
Test for normality
H0: distribution is normal
H1: distribution is not normal
If the p-value of the AD test is > 0.05, there is no reason to conclude that the distribution is not normal
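A sketch with scipy. Note that scipy's anderson reports the AD statistic and critical values rather than a p-value (tools like Minitab report a p-value instead), so here we compare against the 5% critical value (data made up):
```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(4)
x = rng.normal(loc=10, scale=2, size=200)  # genuinely normal data

result = anderson(x, dist='norm')
# significance_level is [15, 10, 5, 2.5, 1], so index 2 is the 5% level
crit_5pct = result.critical_values[2]
# statistic below the critical value -> no reason to reject normality
print(result.statistic, crit_5pct, result.statistic < crit_5pct)
```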
Standardized residuals
The standardized residual, denoted z, for the ith observation is the residual for that observation, e, divided by the standard error of the estimate, s
If an observation has a standardized residual greater than 3 in absolute value, it is considered an outlier
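A sketch computing z = e / s with statsmodels, on made-up data with one planted outlier:
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=40)
y = 2 + 3 * x + rng.normal(size=40)
y[0] += 8  # plant an outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
s = np.sqrt(fit.mse_resid)      # standard error of the estimate
z = fit.resid / s               # standardized residuals e_i / s
print(np.where(np.abs(z) > 3))  # flag observations beyond |z| = 3
```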
Cook's D
Cook's D is an overall measure of the impact of the ith observation on the n fitted values. Observations with large D values may be outliers. Because D is calculated using leverage values and standardized residuals, it considers whether an observation is unusual with respect to both x and y values
The calculated percentile (of the corresponding F distribution) is:
- between 0 and 0.30: conclude not influential
- between 0.30 and 0.50: conclude mildly influential
- greater than 0.50: conclude influential
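A sketch using statsmodels' influence diagnostics and converting D to an F-distribution percentile as the card describes (made-up data with one planted influential point):
```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f

rng = np.random.default_rng(6)
x = rng.normal(size=40)
y = 1 + 2 * x + rng.normal(size=40)
x[0], y[0] = 4.0, -5.0  # high leverage in x AND poorly fit in y

fit = sm.OLS(y, sm.add_constant(x)).fit()
d, _ = fit.get_influence().cooks_distance  # Cook's D per observation
p = 2                                      # parameters incl. intercept
percentile = f.cdf(d, p, len(y) - p)       # percentile of F(p, n - p)
print(percentile.max())  # > 0.5 -> influential by the rule above
```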
Durbin watson
It is used for time series data to detect serial correlation of residuals
- Highly positively correlated: d = 0
- Uncorrelated: d = 2
- Highly negatively correlated: d = 4
- Lower-tail test:
H0: no residual correlation
H1: positive residual correlation
• d has to be smaller than d_L (the lower tabled value) to show evidence of positive correlation
- Upper-tail test:
H0: no residual correlation
H1: negative residual correlation
• Rejection region: (4 − d) < d_L shows evidence of negative correlation
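A sketch with statsmodels, on made-up AR(1) residuals so that d should fall well below 2:
```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
e = np.empty(100)
e[0] = rng.normal()
for t in range(1, 100):  # positively autocorrelated residuals
    e[t] = 0.8 * e[t - 1] + rng.normal()

d = durbin_watson(e)
print(d)  # well below 2; compare against d_L / d_U from a DW table
```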
What do deseasonalized data contain?
T × C × I
In the multiplicative model Y = T × S × C × I, so we divide Y by S to remove the seasonal component
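A tiny worked example with made-up quarterly values and seasonal indexes:
```python
import numpy as np

# Made-up quarterly series Y and seasonal indexes S
y = np.array([120.0, 80.0, 60.0, 140.0, 132.0, 88.0, 66.0, 154.0])
s = np.array([1.2, 0.8, 0.6, 1.4, 1.2, 0.8, 0.6, 1.4])

deseasonalized = y / s  # what is left is T x C x I
print(deseasonalized)   # 100 ... then 110 ...: the trend is now visible
```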
PACF (partial autocorrelation function)
The partial correlation between two variables is the amount of correlation between those variables which is not explained by their mutual correlations with a given set of other variables
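A sketch with statsmodels on a made-up AR(1) series, where the PACF should spike at lag 1 and be near zero afterwards:
```python
import numpy as np
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(8)
y = np.empty(200)
y[0] = rng.normal()
for t in range(1, 200):  # AR(1) with coefficient 0.7
    y[t] = 0.7 * y[t - 1] + rng.normal()

# Returns lags 0..5; lag 0 is 1 by definition, lag 1 should be ~0.7
print(pacf(y, nlags=5))
```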
AIC and BIC
Model selection criteria that include a parsimony factor; select the model having minimum AIC and BIC
Note: r is the total number of parameters, including the constant term
They can be used as techniques for variable selection in regression analysis
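A sketch assuming one common regression form, AIC = n ln(SSE/n) + 2r and BIC = n ln(SSE/n) + r ln(n); the SSE values are made up:
```python
import numpy as np

def aic_bic(sse, n, r):
    # Smaller is better; r counts all parameters incl. the constant
    aic = n * np.log(sse / n) + 2 * r
    bic = n * np.log(sse / n) + r * np.log(n)
    return aic, bic

# A bigger model barely lowers SSE, so the parsimony penalty
# should favor the smaller one on both criteria
print(aic_bic(sse=520.0, n=100, r=3))
print(aic_bic(sse=515.0, n=100, r=6))
```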
Ljung-Box Q statistic
It is a test for overall model adequacy, a sort of lack-of-fit test.
It belongs to a class of tests known as portmanteau tests
Instead of studying the correlation coefficients rk one at a time, the idea is to consider a whole set of rk values, for example r1 through r12, all at one time
It tests whether the entire set is significantly different from a zero set
If the p-value is smaller than 0.05, the model is considered inadequate
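A sketch with statsmodels, testing r1 through r12 as one set on made-up residuals:
```python
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(9)
resid = rng.normal(size=200)  # residuals from an adequate model

# p-value > 0.05 -> no evidence of inadequacy at the 5% level
print(acorr_ljungbox(resid, lags=[12]))  # columns lb_stat, lb_pvalue
```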
Bonferroni correction
Is an adjustment made to alpha values when several statistical tests are being performed simultaneously
We divide alpha by the number of comparisons made
The correction is made to reduce the chance of obtaining false positive results (Type I errors): the probability of identifying at least one significant result due to chance increases as more hypotheses are tested
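A tiny worked example (the p-values are made up):
```python
alpha = 0.05
m = 5  # number of simultaneous tests

alpha_per_test = alpha / m  # 0.01: each test must clear a stricter bar
p_values = [0.003, 0.02, 0.04, 0.30, 0.008]
significant = [p < alpha_per_test for p in p_values]
print(alpha_per_test, significant)  # only 0.003 and 0.008 survive
```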
Detecting unequal variances
Hartley's test: test statistic = maximum variance / minimum variance; reject H0 for large values of the test statistic
Bartlett's test: follows a chi-squared distribution with p − 1 df
Modified Levene's test: similar to a one-way ANOVA based on the absolute deviations of the observations in each sample from their medians
If the populations are normal, use Bartlett's test
If the populations are not normal, use Levene's test
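A sketch with scipy; levene's center='median' option gives the modified (Brown-Forsythe) version (samples made up, with clearly unequal spread):
```python
import numpy as np
from scipy.stats import bartlett, levene

rng = np.random.default_rng(10)
a = rng.normal(scale=1.0, size=30)
b = rng.normal(scale=3.0, size=30)  # three times the spread of a

print(bartlett(a, b))                 # use when populations are normal
print(levene(a, b, center='median'))  # modified Levene's test otherwise
```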
What is autocorrelation?
Correlation of a series with its own previous values
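A sketch with statsmodels on a made-up trending series:
```python
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(11)
y = np.cumsum(rng.normal(size=200))  # random walk: strongly autocorrelated

# Lags 0..5; lag 0 is 1 by definition, later lags stay close to 1 here
print(acf(y, nlags=5))
```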