Linear Regression Flashcards
What does it mean to say that an estimate is unbiased?
It means the estimator does not systematically over- or under-estimate the true population parameter. On the basis of one set of observations we might over- or under-estimate the population value, but if we averaged a huge number of estimates from a huge number of sets of observations, the average would equal the population value exactly.
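A quick simulation sketch of this idea (the true slope of 2.0, the noise level, and the sample sizes here are all made up for illustration): any single OLS slope estimate may miss, but the average across many independent samples sits right on the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
true_intercept, true_slope = 1.0, 2.0

estimates = []
for _ in range(10_000):
    x = rng.uniform(0, 10, size=50)
    y = true_intercept + true_slope * x + rng.normal(0, 2, size=50)
    slope, intercept = np.polyfit(x, y, deg=1)  # OLS fit
    estimates.append(slope)

# Individual estimates scatter around 2.0, but their mean is
# essentially exactly 2.0 -- the estimator is unbiased.
print(np.mean(estimates))
```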
What is a standard error?
Roughly, the standard error tells us the average amount that an estimate varies from the true population value across repeated samples; formally, it is the standard deviation of the estimate's sampling distribution.
Standard errors are used to compute confidence intervals.
They are also used to perform hypothesis tests on coefficients.
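For reference, in simple linear regression the standard error of the slope estimate has a well-known closed form (here σ² is the variance of the error term, estimated in practice by the RSE discussed below):

```latex
\mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
```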
What does a 95% confidence interval mean?
Loosely, it is a range of values that, with 95% probability, contains the true unknown value of the parameter. (This phrasing gets some pushback: for any one computed interval, the fixed-but-unknown parameter either is or is not inside it, so the probability statement is really about the interval-building procedure.)
If we take repeated samples and construct a confidence interval for each sample, 95% of intervals will contain the true unknown value of the parameter.
Confidence intervals are often misinterpreted. The logic behind them may be a bit confusing. Remember that when we’re constructing a confidence interval we are estimating a population parameter when we only have data from a sample. We don’t know if our sample statistic is less than, greater than, or approximately equal to the population parameter. And, we don’t know for sure if our confidence interval contains the population parameter or not.
The correct interpretation of a 95% confidence interval is that "we are 95% confident that the population parameter is between [lower bound] and [upper bound]."
https://online.stat.psu.edu/stat200/lesson/4/4.2/4.2.1
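A simulation sketch of the repeated-sampling interpretation (true mean, noise level, and sample size are assumptions for illustration): build many intervals from independent samples and check how often they cover the known true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, n_sims = 5.0, 25, 10_000

covered = 0
for _ in range(n_sims):
    sample = rng.normal(true_mean, 3.0, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)          # estimated standard error
    t_crit = stats.t.ppf(0.975, df=n - 1)          # 95% t critical value
    lo, hi = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    covered += (lo <= true_mean <= hi)

print(covered / n_sims)  # should land very close to 0.95
```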
What does a p-value signify?
The probability of observing an association as strong as (or stronger than) the one in the sample purely by chance, assuming the null hypothesis of no true association is correct. A small p-value means such an association would be unlikely to arise by chance alone.
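As a concrete sketch (the t-statistic and degrees of freedom below are hypothetical values, as would come from a two-sided test on a regression coefficient):

```python
from scipy import stats

t_stat, df = 2.3, 48                        # hypothetical test statistic and df
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided p-value
print(p_value)
```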
What is variance inflation factor (VIF)?
Variance inflation factor is a measure of the amount of multicollinearity in a set of multiple regression variables. It provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity.
For a given predictor, it is the ratio of the variance of that predictor's coefficient estimate in the full model (all other predictors included) to its variance in a model containing only that predictor. Equivalently, VIF_j = 1 / (1 − R²_j), where R²_j is the r-squared from regressing predictor j on all of the other predictors. A VIF of 1 indicates no collinearity; values above roughly 5 to 10 are commonly treated as problematic.
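A minimal sketch using statsmodels' `variance_inflation_factor` (the variable names and simulated data are made up for illustration; x2 is built to be nearly collinear with x1):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each predictor (skip the constant at index 0)
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X.values, i))
# x1 and x2 should show very large VIFs; x3 should sit near 1
```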
When do you use the t-distribution vs. the z-distribution for confidence intervals?
The t-distribution is used when the population standard deviation is unknown and must be estimated from the sample; this matters most for small samples, usually less than 30.
The z-distribution is used when the population standard deviation is known, or as a large-sample approximation, since the t-distribution converges to the normal distribution as the sample size grows.
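A quick comparison of 95% critical values showing the convergence (sample degrees of freedom chosen for illustration):

```python
from scipy import stats

for df in (5, 10, 30, 100):
    print(df, round(stats.t.ppf(0.975, df), 3))  # t critical value shrinks with df
print("z:", round(stats.norm.ppf(0.975), 3))     # 1.96
```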
What is the residual standard error (RSE)?
An estimate of the standard deviation of the error term. It is the average amount that the response will deviate from the true regression line.
It is measured in the units of the outcome variable. So an RSE of 25 would mean that actual observations typically deviate from the true regression line by 25 units of the response variable.
The smaller the RSE, the more closely the model fits the data.
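A minimal sketch of computing the RSE by hand for a simple regression (simulated data; with two estimated parameters the degrees of freedom are n − 2):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 2.5, size=100)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)
rss = np.sum(residuals ** 2)                # residual sum of squares
rse = np.sqrt(rss / (len(x) - 2))           # degrees of freedom: n - 2
print(rse)  # should be near the true noise SD of 2.5
```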
What does the r-squared statistic capture?
Proportion of variability in Y that can be explained using the model; always takes value between 0 and 1.
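In formula form, with TSS the total sum of squares and RSS the residual sum of squares:

```latex
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},
\qquad \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2,
\qquad \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```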
What is a correlation?
A measure of the strength of the linear relationship between two variables. It quantifies the association between a single pair of variables.
What is the relationship between correlation and r-squared?
With only one explanatory variable in a regression model, r-squared is identical to the squared correlation between the predictor and the response.
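A quick numerical check of this identity on simulated data (simple regression only; it does not hold with multiple predictors):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]                 # sample correlation
slope, intercept = np.polyfit(x, y, deg=1)
rss = np.sum((y - (intercept + slope * x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss
print(r ** 2, r_squared)  # identical up to floating-point error
```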
How do you estimate regression coefficients in linear regression?
The least squares approach: choose the coefficient estimates that minimize the sum of squared residuals (RSS).
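A sketch of the fit via the normal equations, with `np.linalg.lstsq` doing the minimization numerically (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=50)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes sum of squared residuals
print(coef)  # [intercept, slope], should be near [1.0, 2.0]
```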
In simple regression you use a confidence interval based on the t-distribution to test the null hypothesis that a regression coefficient is zero. What statistic is used to test the null hypothesis with multiple linear regression?
The F-statistic; a significant p-value indicates that at least one of the predictors is associated with the outcome.
In fact, the individual p-values reported in multiple regression can be viewed as coming from an F-statistic that compares the entire model to a model with that particular predictor taken out.
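With statsmodels, the overall F-statistic and its p-value come straight out of the fitted results (data simulated for illustration; only the first predictor truly matters here):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = 1.5 * X[:, 0] + rng.normal(size=100)   # only the first predictor matters

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.fvalue, results.f_pvalue)    # overall F-test
print(results.pvalues)                     # per-coefficient t-test p-values
```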
Given that we get individual p-values for each variable in multiple regression, why do we need to look at the overall F-statistic?
Especially with a large number of predictors, we are likely to see significant relationships between predictors and the response purely by chance. For example, with 100 predictors and no true associations, we would expect about 5 predictors to have p-values below 0.05 by chance alone. The F-test does not suffer from this problem, since it adjusts for the number of predictors: if the null hypothesis that all coefficients are zero is true, there is only a 5% chance that the F-statistic will yield a p-value below 0.05.
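A simulation sketch of this multiple-testing point (sample size and predictor count chosen for illustration): with 100 pure-noise predictors, individual t-tests throw up false positives at roughly the 5% rate, while the overall F-test stays honest.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, p = 500, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # y is unrelated to every predictor

results = sm.OLS(y, sm.add_constant(X)).fit()
n_false = np.sum(results.pvalues[1:] < 0.05)  # skip the intercept
print(n_false)            # around 5 "significant" predictors by chance
print(results.f_pvalue)   # the overall F-test is usually not significant
```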
Steps in multiple regression:
- F-statistic to determine if at least 1 predictor is associated with the response
- Select the proper subset of predictors
- Assess model fit - r-squared & RSE
- Determine whether model meets assumptions of the analysis
- Generate predictions
What are the 3 classical approaches to selecting the proper subset of predictor variables?
- Forward selection - start with no predictors and add one variable at a time, at each step adding the variable that yields the lowest RSS; stop when some stopping rule is satisfied (see the sketch after this list)
- Backward selection - start with all predictors in the model, remove the variable with the largest p-value, and continue until all remaining predictors are below a chosen p-value threshold
- Mixed selection - a combination of forward and backward: start with forward selection, but remove variables whose p-values get too high; continue until all predictors in the model have low p-values and any predictor outside the model would have a large p-value if added
Mixed selection is generally considered the best of the three.
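A minimal forward-selection sketch (the fixed cap on the number of predictors is an assumed stopping rule for illustration; real implementations stop on criteria such as AIC or adjusted r-squared):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """Greedily add the predictor whose inclusion most reduces the RSS."""
    n = len(y)
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            design = np.column_stack([np.ones(n), X[:, cols]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ coef) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Simulated check: only columns 3 and 7 actually drive the response
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 3] - 1.0 * X[:, 7] + rng.normal(size=200)
print(forward_selection(X, y, max_vars=2))  # should pick columns 3 and 7
```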