Topic 1-4 Flashcards
What is statistical learning?
Statistical learning refers to a set of approaches for estimating the relationship between a response variable Y and predictors X. This relationship is modeled as Y=f(X)+ϵ, where f(X) is the unknown function to be estimated, and ϵ is the error term representing noise.
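A minimal simulation of the model Y = f(X) + ϵ, using a hypothetical linear f and numpy; the point is that even a perfect estimate of f leaves the irreducible error ϵ behind:

```python
import numpy as np

rng = np.random.default_rng(0)

# True (normally unknown) function f and the noise term epsilon
def f(x):
    return 2.0 + 3.0 * x

n = 100
X = rng.uniform(0, 10, n)
eps = rng.normal(0, 1.0, n)   # irreducible error with variance 1
Y = f(X) + eps                # the model Y = f(X) + eps

# Even with f known exactly, Y - f(X) is still eps:
# no method can predict below this noise floor
noise_var = np.var(Y - f(X))
```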
What are the other names for X and Y?
Y: Response variable (output or dependent variable)
X: Predictors (inputs, features, or independent variables).
Pros and cons of parametric and non-parametric methods:
Parametric Methods:
Pros: Simple and interpretable; require fewer observations.
Cons: May fail if the assumed model is far from the true f.
Non-parametric Methods:
Pros: More flexible; can model complex shapes of f.
Cons: Require more data; less interpretable.
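A sketch of the contrast on hypothetical data: a straight-line fit (parametric) versus a hand-rolled K-nearest-neighbours average (non-parametric) on a curved f. The data, the KNN helper, and K = 5 are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Curved truth: a linear (parametric) fit is biased here,
# while KNN (non-parametric) can adapt to the shape of f
X = np.sort(rng.uniform(0, 2 * np.pi, 200))
Y = np.sin(X) + rng.normal(0, 0.2, X.size)

# Parametric: straight-line least-squares fit (assumes f is linear)
b1, b0 = np.polyfit(X, Y, 1)
lin_pred = b0 + b1 * X

# Non-parametric: average the K nearest neighbours of each point
def knn_predict(x0, K=5):
    idx = np.argsort(np.abs(X - x0))[:K]
    return Y[idx].mean()

knn_pred = np.array([knn_predict(x) for x in X])

mse_lin = np.mean((Y - lin_pred) ** 2)
mse_knn = np.mean((Y - knn_pred) ** 2)
# On this curved f, the flexible KNN fit tracks the data far better
```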
For the best prediction performance, which of our tested statistical learning methods should we choose?
Select the method that achieves the lowest test MSE by balancing bias and variance.
What is the relationship between test performance and model flexibility?
Test error decreases initially with flexibility (reducing bias) but increases beyond a certain point due to variance (overfitting).
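This U-shape can be seen numerically by varying polynomial degree (a stand-in for flexibility) on hypothetical train/test data; the truth sin(3x), the noise level, and the degrees chosen are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical nonlinear truth; flexibility = polynomial degree
def gen(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.3, n)

x_tr, y_tr = gen(50)      # small training set
x_te, y_te = gen(1000)    # large independent test set

train_mse, test_mse = {}, {}
for deg in (1, 5, 15):    # increasing flexibility
    coef = np.polyfit(x_tr, y_tr, deg)
    train_mse[deg] = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    test_mse[deg] = np.mean((y_te - np.polyval(coef, x_te)) ** 2)

# Training MSE always falls as flexibility grows, but test MSE is
# typically U-shaped: degree 1 underfits (bias), a very high degree
# tends to overfit (variance), and an intermediate degree does best.
```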
What is the RSE (residual standard error) useful for? And what is the formula?
It gives a quick sense of how far predictions typically are from the actual values, measured in the units of Y. So if RSE = 400, predictions are typically off by about 400 units.
RSE = √(RSS / (n − 2))
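A short worked computation of RSS and RSE on a small hypothetical dataset (six made-up points), using numpy for the simple linear fit:

```python
import numpy as np

# Hypothetical data: simple linear regression of Y on X
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

b1, b0 = np.polyfit(X, Y, 1)      # least-squares slope and intercept
resid = Y - (b0 + b1 * X)
RSS = np.sum(resid ** 2)
n = X.size
RSE = np.sqrt(RSS / (n - 2))      # RSE = sqrt(RSS / (n - 2))
# RSE is in the units of Y: predictions are typically off by about RSE
```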
What are the two p-values associated with multiple linear regression?
Those of the F-test and the t-test.
The F-test checks whether any parameter in the regression is non-zero; if so, H0 (all βj = 0) is rejected.
The t-test gives the number of standard errors a parameter estimate is from 0; if the estimate is many standard errors from 0, the predictor is likely valuable to the response variable.
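Both statistics can be computed by hand with numpy on hypothetical data; the design below (one useful predictor x1, one useless x2) and the formulas for F and the standard errors are the standard least-squares ones, not from the cards:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical multiple regression with two predictors
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.0 * x2 + rng.normal(0, 1.0, n)  # x2 is useless

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept
p = 2                                        # number of predictors

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
RSS = resid @ resid
TSS = np.sum((y - y.mean()) ** 2)

# F-statistic: tests H0 that ALL slope coefficients are zero
F = ((TSS - RSS) / p) / (RSS / (n - p - 1))

# t-statistics: how many standard errors each coefficient is from zero
sigma2 = RSS / (n - p - 1)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t = beta / se
# Expect a large F (x1 matters), |t| large for x1, |t| small for x2
```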
Difference between qualitative and quantitative predictors:
Qualitative = categorical variables (e.g., race).
Quantitative = numerical variables (e.g., income).
With different coding schemes for Qualitative variables, what happens to the regression lines when interaction effects are implemented into the scheme?
They are no longer parallel to one another.
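A sketch of why: with a 0/1 dummy and an interaction term x·dummy, each group gets its own fitted slope. The data below (a made-up "student" dummy with different slopes per group) is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: outcome vs. x with a 0/1 "student" dummy.
# Without the interaction, both groups share one slope (parallel lines);
# the x * student term lets each group have its own slope.
n = 200
x = rng.uniform(0, 100, n)
student = rng.integers(0, 2, n)          # 0/1 dummy coding
y = 200 + 6 * x + 300 * student - 2 * x * student + rng.normal(0, 50, n)

X = np.column_stack([np.ones(n), x, student, x * student])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_non_student = beta[1]              # slope when student = 0
slope_student = beta[1] + beta[3]        # slope when student = 1
# The fitted slopes differ, so the two regression lines are not parallel
```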
What are assumptions of a simple linear regression?
Linearity: The relationship between
X and Y is linear.
Independence: Observations are independent.
Homoscedasticity: Constant variance of errors (ϵ).
Normality: Errors (ϵ) are normally distributed.
What are the five potential problems with a regression model (model diagnostics)?
- Non-linearity
- Non-constant variance of error terms
- Outliers
- High leverage points
- Collinearity
What is collinearity?
Two predictors are very closely related to one another, making it difficult to determine how each separately affects the response.
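One way to detect it is the variance inflation factor (VIF), computed here from scratch with numpy on hypothetical predictors; the `vif` helper and the data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical collinear predictors: x2 is nearly a copy of x1
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, n)       # highly correlated with x1
x3 = rng.normal(size=n)                # unrelated predictor

def vif(target, others):
    """VIF = 1 / (1 - R^2), where R^2 comes from regressing
    one predictor on all the remaining predictors."""
    X = np.column_stack([np.ones(n)] + others)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

vif_x1 = vif(x1, [x2, x3])   # large: x1 is well explained by x2
vif_x3 = vif(x3, [x1, x2])   # near 1: x3 is not collinear
# A common rule of thumb flags VIF values above 5 or 10
```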
Why is there a need for model diagnostics?
To check whether any of the model's assumptions are violated.
What are RSS, RSE, and Training MSE:
RSS (Residual Sum of Squares): Measures the total deviation of predicted values from actual values.
RSE (Residual Standard Error): An estimate of the standard deviation of ϵ, derived from RSS.
Training MSE: RSS divided by the number of observations, n. It is a measure of training error.
How to assess coefficient accuracy?
Standard errors (SE) of coefficients and their p-values indicate accuracy.