Week 5 DSE Flashcards
What is the disadvantage of running simple linear regressions separately instead of multiple?
It ignores potential confounding factors and synergy (interaction) effects, leading to misleading results.
If there are 2 or more independent variables, how do we fit the model?
Use least squares to find the regression plane that best fits the data.
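The least-squares criterion behind this card can be written out explicitly (standard notation; b0, …, bp are the fitted coefficients):

```latex
\hat{y}_i = b_0 + b_1 x_{i1} + \dots + b_p x_{ip}, \qquad
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Least squares chooses b0, …, bp to minimize RSS.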
If there are 2 predictors, how many parameters are we supposed to estimate?
3
(in general p + 1: p slopes plus the intercept)
What is the error term of the ith point?
ϵi = yi − ŷi
What does b2 =188 mean?
Holding the expenditure on x1 constant, each 1-unit increase in x2 increases predicted sales by 188 units.
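As a sketch of the fitted model this card refers to (x1 and x2 are the two predictors; 188 is the card's example value of b2):

```latex
\widehat{\text{sales}} = b_0 + b_1 x_1 + 188\, x_2
```

Raising x2 by 1 while x1 is held fixed raises predicted sales by 188.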
How many degrees of freedom does the RSE have in a multiple linear model?
n-(p+1)
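Written as a formula (using the RSS from the least-squares fit):

```latex
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - (p+1)}}
```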
What is R^2?
The fraction of the variance in Y explained by the model.
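In symbols (TSS is the total sum of squares around the mean of Y):

```latex
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad
\mathrm{TSS} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
```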
What is a flaw of R^2 in multiple linear regression?
Its value never decreases, even if we add redundant variables to the regression model.
What causes the flaw in R^2 in multiple linear regression?
The least-squares equations choose the coefficients so that RSS is minimized.
If a new variable does not improve the model fit, its estimated coefficient will be zero and R^2 remains unchanged.
If the variable improves the model fit, its estimated coefficient will be nonzero, so R^2 increases.
Hence R^2 can never decrease; it can only stay the same or increase.
Remedy: use adjusted R^2.
Can adjusted R^2 be negative?
Yes
What is the formula for penalization factor?
It is always ______
(n − 1) / (n − (p + 1))
It is always larger than 1.
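Putting the pieces together, adjusted R^2 applies this penalization factor to the unexplained fraction 1 − R^2:

```latex
R^2_{\text{adj}} = 1 - \frac{\mathrm{RSS}/\bigl(n-(p+1)\bigr)}{\mathrm{TSS}/(n-1)}
                 = 1 - \left(1 - R^2\right)\frac{n-1}{n-(p+1)}
```

Because the factor (n − 1)/(n − (p + 1)) exceeds 1, adjusted R^2 is always below R^2 and can even go negative.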
How does adjusted R^2 compare to R^2?
- Always smaller, and it can be negative.
- When a new independent variable is added, adjusted R^2 can decrease (if the variable does not improve the fit enough to offset the penalty).
- It includes the penalization factor.
What are H0 and H1 for multiple linear regression?
H0 : β1 = β2 = … = βp = 0
H1 : at least one βj is not zero.
What test is used for multiple regression hypothesis testing?
F-statistic
Refer to the p-value of the F-statistic in the R output table.
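The F-statistic reported by R for this overall test has the standard form:

```latex
F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/\bigl(n - (p+1)\bigr)}
```

A large F (small p-value) is evidence against H0: β1 = β2 = … = βp = 0.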
What is interpolation?
Predicting Y for a value of X that is within the range of the original data
What is extrapolation?
Predicting Y for a value of X that is outside the range of the original data.
How does overfitting arise?
Using too many variables or too complex a model can often lead to overfitting.
How do you know when you are overfitting?
The model fits the current data (almost) perfectly but cannot predict any new observations well.
What is the problem of collinearity in multiple linear regression?
In interpreting coefficients, we can only hold other factors constant figuratively.
When two or more Xs are highly correlated, we cannot even hold factors constant figuratively, so individual coefficients become unreliable.
What are 2 solutions to collinearity?
Remove the redundant variable. You will need subject knowledge to understand which one is redundant.
Combine the collinear variables into a single variable
How to write operations within linear regression model equations in R?
Use I()
lm_mpg2 = lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
If x1 = a and x2 = a^2, is it still a linear model? Why?
Yes. The model is linear in the coefficients.
How to exclude variables in R?
Use `. -`
e.g. lm(mpg ~ . - name, ..)
`. - name` means every variable EXCEPT name.
If you have a categorical variable with n categories, how many coefficients for it will you see in R?
n − 1
If you figure out 2 of the indicators, you can figure out the last 1. (If the first 2 are 0, the last one must be 1.)
How to add labels to nominal categorical data?
Auto$origin = factor(Auto$origin,
    labels = c("American", "European", "Japanese"))
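With the three labels above, R treats the first level as the baseline, so the fitted model looks roughly like this (the b coefficients are illustrative; "American" as baseline follows R's default first-level convention):

```latex
\widehat{\text{mpg}} = b_0 + b_1\,\mathbb{1}[\text{European}] + b_2\,\mathbb{1}[\text{Japanese}] + \dots
```

An American car gets neither indicator, so its prediction starts from b0 alone.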
How to test whether there is a relationship between Y and any of the Xs?
Look at the F-statistic and its p-value.
Reject H0 if the p-value is small.
What are some modelling guidelines?
Start from simple to complex model
Make sure you understand what the parameters mean, especially after using transformations like log(Y ).
After fixing the initial model, add other variables one at a time, or by logical groups like demographics
Watch out for signs of collinearity (high or perfect correlation, large changes in estimated coefficients AFTER ADDING A VARIABLE, reversed signs, etc.)
As long as you know that conceptually a variable should be in the model, keep it in the model.
Run model diagnostics (make sure assumptions are met; may need to try a polynomial).
Check for interactions
Sometimes simplicity of presentation is preferred to a better-fitting model.
Should you include a variable in a model if it is not statistically significant?
Yes, as long as you know that conceptually a variable should be in the model, then the variable should be in the model
If the difference in fit is major, is simplicity still preferred to the better-fitting model?
NO!!