Linear Regression Flashcards

1
Q

What does it mean to say that an estimate is unbiased?

A

It means that the estimator does not systematically overestimate or underestimate the true population parameter. On the basis of one set of observations we might estimate above or below the true value, but if we averaged a huge number of estimates from a huge number of sets of observations, the average would equal the population value exactly.

2
Q

What is a standard error?

A

The standard error tells us the average amount by which an estimate varies from the true population value across repeated samples.

Standard errors are used to compute confidence intervals.

They are also used to perform hypothesis tests on coefficients.

3
Q

What does a 95% confidence interval mean?

A

A common informal reading: it is a range of values that contains the true unknown value of the parameter with 95% probability. (This interpretation draws some pushback, since once the interval is computed, the fixed-but-unknown parameter is either inside it or not.)

If we take repeated samples and construct a confidence interval for each sample, 95% of intervals will contain the true unknown value of the parameter.

Confidence intervals are often misinterpreted. The logic behind them may be a bit confusing. Remember that when we’re constructing a confidence interval we are estimating a population parameter when we only have data from a sample. We don’t know if our sample statistic is less than, greater than, or approximately equal to the population parameter. And, we don’t know for sure if our confidence interval contains the population parameter or not.

The correct interpretation of a 95% confidence interval is that “we are 95% confident that the population parameter is between [lower bound] and [upper bound].”

https://online.stat.psu.edu/stat200/lesson/4/4.2/4.2.1
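The repeated-sampling interpretation above can be checked numerically. This is a minimal sketch (sample size, seed, and distribution are arbitrary choices for illustration): build many 95% intervals for a known mean and count how often they cover it.

```python
# Simulation: construct many 95% confidence intervals for a known
# population mean and check what fraction of them cover it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n, n_sims = 10.0, 3.0, 25, 2000
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

covered = 0
for _ in range(n_sims):
    sample = rng.normal(true_mu, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)        # estimated standard error
    lo = sample.mean() - t_crit * se
    hi = sample.mean() + t_crit * se
    covered += lo <= true_mu <= hi

coverage = covered / n_sims
print(f"Empirical coverage: {coverage:.3f}")    # close to 0.95
```

The empirical coverage hovers near 0.95, matching the "95% of intervals contain the true value" interpretation.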

4
Q

What does a p-value signify?

A

The probability of observing an association at least as extreme as the one in our sample purely by chance, assuming the null hypothesis of no true association holds. A small p-value means the observed association would be unlikely under the null.

5
Q

What is variance inflation factor (VIF)?

A

Variance inflation factor is a measure of the amount of multicollinearity in a set of multiple regression variables. It provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity.

It is the ratio of the variance of a coefficient when its predictor is fit in a model alongside the other predictors to the variance of that coefficient when the predictor is fit alone.
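The VIF for predictor j can be computed as 1 / (1 − R²_j), where R²_j comes from regressing x_j on the other predictors. A minimal numpy sketch (the `vif` helper and the toy data are made up for illustration):

```python
# Compute VIF for each predictor by regressing it on the others
# with ordinary least squares (intercept included).
import numpy as np

def vif(X):
    """Return the variance inflation factor for each column of X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()               # R^2 of x_j on the rest
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)   # x1 and x2 get large VIFs; x3 stays near 1
```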

6
Q

When do you use the t-distribution vs. the z-distribution for confidence intervals?

A

The t-distribution is used when the population standard deviation is unknown and must be estimated from the sample. This matters most for smaller samples, usually less than 30, where the t-distribution's heavier tails yield appropriately wider intervals.

The z-distribution is used when the population standard deviation is known, or as an approximation for large samples, where the t-distribution converges to the normal.
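The convergence is easy to see by comparing critical values (a small scipy sketch; the degrees-of-freedom values are arbitrary):

```python
# The t-distribution's heavier tails give larger critical values for
# small samples; as df grows, they approach the normal (z) value.
from scipy import stats

z975 = stats.norm.ppf(0.975)           # about 1.96
for df in (5, 10, 30, 1000):
    t975 = stats.t.ppf(0.975, df)
    print(f"df={df:5d}  t={t975:.3f}  z={z975:.3f}")
```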

7
Q

What is the residual standard error (RSE)?

A

An estimate of the standard deviation of the error term. It is the average amount that the response will deviate from the true regression line.

It is measured in the units of the outcome variable. So an RSE of 25 would mean that actual observations typically deviate from the true regression line by 25 units of the response variable.

The smaller the RSE, the closer the model fits the data.
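In simple regression, RSE = sqrt(RSS / (n − 2)). A short numpy sketch with simulated data (the true error sd of 5 is an arbitrary choice, so the computed RSE should land near it):

```python
# RSE = sqrt(RSS / (n - 2)) for simple regression, in units of the response.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 5.0, size=100)   # true error sd = 5

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
rss = np.sum(resid ** 2)
rse = np.sqrt(rss / (len(x) - 2))   # n - 2 degrees of freedom
print(f"RSE = {rse:.2f}")           # near the true error sd of 5
```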

8
Q

What does the r-squared statistic capture?

A

The proportion of variability in Y that can be explained using the model; it always takes a value between 0 and 1.

9
Q

What is a correlation?

A

A measure of the strength and direction of the linear relationship between two variables.

It quantifies the association between a single pair of variables.

10
Q

What is the relationship between correlation and r-squared?

A

With only one explanatory variable in a regression model, the squared correlation between X and Y is identical to the r-squared statistic.
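This identity can be verified numerically (a minimal sketch; the simulated data are arbitrary):

```python
# Check that squared correlation equals R^2 in simple regression.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(size=50)

r = np.corrcoef(x, y)[0, 1]                        # sample correlation
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
r_squared = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

print(r ** 2, r_squared)   # the two values agree
```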

11
Q

How do you estimate regression coefficients in linear regression?

A

The least squares approach: choose the coefficient estimates that minimize the sum of squared residuals (RSS).
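The least squares solution has a closed form via the normal equations. A minimal numpy sketch (the design matrix and true coefficients are made up for illustration):

```python
# Least squares: minimize the sum of squared residuals.
# Closed-form solution via the normal equations (X'X) beta = X'y.
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 predictor
true_beta = np.array([2.0, -1.5])
y = X @ true_beta + rng.normal(0, 0.1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
print(beta_hat)                                # close to [2.0, -1.5]
```

In practice `np.linalg.lstsq` is preferred over forming X'X explicitly, for numerical stability.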

12
Q

In simple regression you use a confidence interval based on the t-distribution to test the null hypothesis that a regression coefficient is zero. What statistic is used to test the null hypothesis in multiple linear regression?

A

The F-statistic. A significant p-value for the F-statistic indicates that at least one of the predictors is associated with the outcome.

The individual p-values reported in multiple regression are equivalent to partial F-tests, each comparing the full model to the model with that particular predictor taken out.

13
Q

Given that we get individual p-values for each variable in multiple regression, why do we need to look at the overall F-statistic?

A

Especially with a large number of predictors, we are likely to see significant relationships between predictor and response purely by chance. For example, if we have 100 predictors, we would expect on average about 5 of them to have p-values below 0.05 by chance alone. The F-test does not suffer from this problem because it adjusts for the number of predictors: if the null hypothesis that all coefficients are zero is true, there is only a 5% chance that the F-statistic will result in a p-value below 0.05.
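The "5 out of 100 by chance" claim can be demonstrated by simulation (a minimal sketch; sample sizes and seed are arbitrary):

```python
# With 100 pure-noise predictors, roughly 5 will show "significant"
# (p < 0.05) correlations with the response by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, n_predictors = 100, 100
y = rng.normal(size=n)

false_hits = 0
for _ in range(n_predictors):
    x = rng.normal(size=n)            # unrelated to y by construction
    _, p = stats.pearsonr(x, y)
    false_hits += p < 0.05

print(f"{false_hits} of {n_predictors} noise predictors look significant")
```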

14
Q

Steps in multiple regression:

A
  1. F-statistic to determine if at least 1 predictor is associated with the response
  2. Select the proper subset of predictors
  3. Assess model fit - r-squared & RSE
  4. Determine whether model meets assumptions of the analysis
  5. Generate Predictions
15
Q

What are the 3 classical approaches to selecting the proper subset of predictor variables?

A
  1. Forward selection - start with no predictors and add 1 variable at a time, choosing the one that yields the lowest RSS; stop when some rule is satisfied
  2. Backward selection - start with all predictors in the model, remove the variable with the largest p-value, and continue until all remaining predictors are below a certain p-value threshold
  3. Mixed selection - a combination of forward and backward: proceed as in forward selection, but remove variables whose p-values get too high; continue until all predictors in the model have low p-values and any predictor outside the model would have a large p-value if added

Mixed selection generally best
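Forward selection can be sketched in a few lines of numpy. This is a toy version with a fixed number of steps k as the stopping rule (real implementations stop via AIC/BIC, adjusted R², or p-values); the helper names and data are made up for illustration:

```python
# Forward selection sketch: greedily add the predictor that most
# reduces the residual sum of squares, stopping after k variables.
import numpy as np

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_select(X, y, k):
    n, p = X.shape
    chosen, remaining = [], list(range(p))
    for _ in range(k):
        def score(j):
            cols = [np.ones(n)] + [X[:, c] for c in chosen + [j]]
            return rss(np.column_stack(cols), y)
        best = min(remaining, key=score)   # biggest RSS reduction
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 2] - 3.0 * X[:, 5] + rng.normal(size=300)  # only cols 2, 5 matter
print(forward_select(X, y, 2))   # picks out columns 2 and 5
```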

16
Q

What is the additivity and linearity assumption of linear models?

A

Additivity - the association between a predictor X and the response Y does not depend on the values of the other predictors.

Linearity - the change in Y associated with a one-unit change in X is constant, regardless of the value of X.

17
Q

What is the hierarchical principle? (think interaction models)

A

The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

The reason is that if the interaction term is significant, it is of little interest whether or not the individual coefficients are exactly zero. Leaving the main-effect terms out can also alter the meaning of the interaction.

18
Q

What is polynomial regression?

A

It is a way to extend linear regression to include non-linear relationships
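Polynomial regression is still a linear model: it is linear in the coefficients, with x², x³, etc. added as derived predictors. A minimal numpy sketch with simulated quadratic data (the true coefficients are arbitrary choices):

```python
# Fit a quadratic: the model is linear in the coefficients even though
# the relationship between x and y is curved.
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 120)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.2, size=x.size)

coeffs = np.polyfit(x, y, deg=2)   # highest degree first: [x^2, x, const]
print(coeffs)                      # close to [0.5, -2.0, 1.0]
```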

19
Q

What problems can occur with linear regression models?

A
  1. non-linearity of response-predictor relationships
  2. correlation of error terms (autocorrelation)
  3. non-constant variance of error terms (assume homoscedasticity or homogeneity of variances)
  4. outliers
  5. high-leverage points
  6. collinearity
20
Q

Residual plots are useful to detect what?

A

issues related to non-linearity of the model, as well as heteroscedasticity (non-constant error variance)

21
Q

What is correlation of error terms and why is it problematic?

A

It refers to the correlation of error terms. If error terms are correlated, it drives down estimates of the true standard errors. Thus, confidence/prediction intervals will be narrower than they should be and p-values will be lower than they otherwise would be.

Basically it creates more confidence in a model than is warranted.

As an extreme example, imagine that you accidentally doubled your data. The parameters wouldn’t change but now your sample size has doubled which impacts confidence intervals and p-values.

This is commonly seen in time-series data.

This is a different problem from collinearity, which refers to correlation among the predictor variables rather than among the error terms.

Recall: SE = sd / sqrt(n), so inflating n (even artificially, as in the doubled-data example) shrinks the standard error.
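The doubled-data example works out exactly in code (a small sketch; the data are arbitrary):

```python
# Duplicating every observation leaves the mean and sd unchanged,
# but SE = sd / sqrt(n) shrinks by a factor of sqrt(2).
import numpy as np

rng = np.random.default_rng(8)
data = rng.normal(50, 10, size=40)
doubled = np.concatenate([data, data])

se = data.std(ddof=0) / np.sqrt(len(data))
se_doubled = doubled.std(ddof=0) / np.sqrt(len(doubled))
print(se, se_doubled)   # second value is smaller by a factor of sqrt(2)
```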

22
Q

What is a studentized residual and what is it used for?

A

Studentized residuals are calculated by dividing each residual by its estimated standard error. They are used to detect outliers in the data. Typically, observations with a studentized residual larger than 3 in absolute value are candidates for removal.
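A minimal numpy sketch of internally studentized residuals for simple regression, using the hat matrix (the planted outlier and all data are made up for illustration):

```python
# Internally studentized residual: e_i / (sigma_hat * sqrt(1 - h_ii)),
# where h_ii is the i-th leverage (diagonal of the hat matrix).
import numpy as np

rng = np.random.default_rng(9)
n = 60
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.0 * x + rng.normal(0, 1.0, size=n)
y[10] += 12.0                                 # plant one clear outlier

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
resid = y - H @ y
sigma = np.sqrt(resid @ resid / (n - 2))      # residual standard error
student = resid / (sigma * np.sqrt(1 - np.diag(H)))

print(np.where(np.abs(student) > 3)[0])       # indices flagged as outliers
```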

23
Q

What is an outlier? What problems do outliers cause in a regression model?

A

An outlier is a point for which the outcome (Y) is far from the value predicted by the model.

Sometimes outliers don't matter much for the least squares fit itself, but they inflate the RSE. Since the RSE is used to calculate confidence intervals and p-values, this can greatly impact the interpretation of a model and its fit.

Outliers can also drag down r-squared.

24
Q

What is a high-leverage point? What do they impact?

A

It is an unusual value for a predictor. They tend to greatly impact the least-squares line, much more so than an outlier.

25
Q

What is collinearity? Why is it a problem?

A

When 2 or more predictor variables are closely related to one another. It is a problem because it can be difficult to isolate the effect of 1 predictor on the response if another predictor is always changing with it.

Collinearity reduces the accuracy of the estimated regression coefficients, causing their standard errors to grow. This increases the chance that we will fail to reject the null hypothesis that a coefficient is zero.

26
Q

What is a VIF?

A

Variance inflation factor. This refers to the ratio of the variance of a regression coefficient when fit in the full model vs fit on its own. VIF values greater than 5 or 10 are usually problematic.

27
Q

What do you do when you have collinearity problems?

A

You can drop one of the problematic variables, since it likely carries information redundant with another variable.

Can also combine problematic variables into a single index if it makes sense.