stats-linear regression Flashcards
for statsmodels regression, a high p-value (e.g., > 0.05) suggests?
the coefficient is not significantly different from zero; the variable holds little predictive power
for statsmodels regression, a low p-value (e.g., < 0.05) suggests?
the coefficient is significantly different from zero; the variable holds predictive power
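A minimal sketch of reading these p-values from statsmodels (the data and variable names below are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)  # y depends on x, so x should be significant

X = sm.add_constant(x)        # prepend the intercept column
results = sm.OLS(y, X).fit()
print(results.pvalues)        # one p-value per coefficient (intercept first)
```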
regression: sum of squares total
sum of ((dependent variable - dependent variable mean)^2). Measures the total variability of the dataset; dividing SST by n - 1 gives the sample variance of the dependent variable.
regression: sum of squares regression
sum of ((predicted value - mean of observed values)^2). Measures the variability explained by the regression line.
regression: sum of squares error
sum of ((observed value - predicted value)^2). Measures the variability left unexplained by the regression line: the "error".
regression: relationship among SST, SSE and SSR
SST = SSR + SSE. Total variability = explained variability + unexplained variability.
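A quick numeric check of the three definitions and the identity, on made-up data. Note that statsmodels' own attribute names invert this deck's convention: its `ssr` is the residual sum of squares (SSE here) and its `ess` is the explained sum of squares (SSR here).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)
results = sm.OLS(y, sm.add_constant(x)).fit()

y_hat = results.fittedvalues
sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # unexplained variability ("error")
print(np.isclose(sst, ssr + sse))      # True: SST = SSR + SSE
```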
OLS
ordinary least squares; estimates the regression coefficients by minimizing SSE
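For intuition, the SSE-minimizing coefficients have a closed form (the normal equations); a numpy-only sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x])  # intercept column + x
beta = np.linalg.solve(X.T @ X, X.T @ y)   # minimizes sum((y - X @ beta)**2)
print(beta)                                # roughly [2, 3]
```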
R-squared
SSR/SST: variability explained / total variability, in [0, 1]. A higher R-squared means the model explains more of the variability, though R-squared alone can be inflated by adding more variables (see adjusted R-squared).
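A sketch checking SSR/SST against the library's value (statsmodels' `ess` and `centered_tss` correspond to this deck's SSR and SST):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)
results = sm.OLS(y, sm.add_constant(x)).fit()

r2 = results.ess / results.centered_tss  # explained / total variability
print(np.isclose(r2, results.rsquared))  # True
```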
adjusted R-squared
1 - (1 - R^2)(n - 1)/(n - p - 1), where n = number of observations and p = number of predictors. It is below R-squared whenever the model has predictors, because it penalizes excessive use of variables.
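A sketch of the penalty formula, checked against statsmodels on made-up data that includes one pure-noise predictor:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 2.0 + 3.0 * x[:, 0] + rng.normal(size=100)   # second predictor is pure noise
results = sm.OLS(y, sm.add_constant(x)).fit()

n, p = results.nobs, results.df_model            # observations, predictors (no intercept)
adj_r2 = 1 - (1 - results.rsquared) * (n - 1) / (n - p - 1)
print(np.isclose(adj_r2, results.rsquared_adj))  # True
```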
when to drop an independent variable
When including it lowers the adjusted R-squared, when it lowers the F-statistic, or when its p-value is high (e.g., > 0.05).
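To see these signals in one place, a sketch that fits the model with and without a made-up noise variable and compares:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)                       # unrelated to y: a drop candidate
y = 2.0 + 3.0 * x1 + rng.normal(size=100)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
reduced = sm.OLS(y, sm.add_constant(x1)).fit()

print(full.pvalues[-1])                         # x2's p-value: expect it to be high
print(full.rsquared_adj, reduced.rsquared_adj)  # adjusted R-squared with vs. without x2
```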
F-statistics
tests the overall significance of the model, i.e. the null hypothesis that all slope coefficients (betas) are jointly zero. A higher F-statistic is stronger evidence against that null. Prob(F-statistic) is the p-value of the F-test and should be close to zero for a good model.
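The corresponding statsmodels result attributes, on made-up data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)
results = sm.OLS(y, sm.add_constant(x)).fit()

print(results.fvalue)    # F-statistic for H0: all slope coefficients are zero
print(results.f_pvalue)  # its p-value, shown as Prob(F-statistic) in the summary
```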