Topic 1-4 Flashcards
What is statistical learning?
Statistical learning refers to a set of approaches for estimating the relationship between a response variable Y and predictors X. This relationship is modeled as Y=f(X)+ϵ, where f(X) is the unknown function to be estimated, and ϵ is the error term representing noise.
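A minimal numpy sketch of this setup, with a made-up linear f and Gaussian noise (both are assumptions of the simulation, not part of the definition):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true relationship f(X) = 2 + 3X (unknown to us in practice).
def f(x):
    return 2 + 3 * x

n = 200
X = rng.uniform(0, 10, n)
eps = rng.normal(0, 1, n)   # irreducible error term
Y = f(X) + eps              # the model Y = f(X) + eps

# Even if we knew f exactly, the noise eps would remain:
residual_if_f_known = Y - f(X)
print(round(residual_if_f_known.std(), 2))  # close to the true noise sd of 1
```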
What are the names of X and Y?
Y: Response variable (output or dependent variable)
X: Predictors (inputs, features, or independent variables).
Pros/cons of the parametric and non-parametric methods:
Parametric Methods:
Pros: Simple and interpretable; require fewer observations.
Cons: May fail if the assumed model is far from the true f.
Non-parametric Methods:
Pros: Can model complex shapes of f; more flexible.
Cons: Require more data; less interpretable.
For the best prediction performance, which of our tested statistical learning methods should we choose?
Select the method that achieves the lowest test MSE by balancing bias and variance.
What is the relationship between test performance and model flexibility?
Test error decreases initially with flexibility (reducing bias) but increases beyond a certain point due to variance (overfitting).
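A small sketch of this U-shape, using polynomial degree as the stand-in for flexibility (the cubic truth and the noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical truth: a cubic. Polynomial degree plays the role of flexibility.
def f(x):
    return x**3 - 2 * x

x_tr = rng.uniform(-2, 2, 60)
y_tr = f(x_tr) + rng.normal(0, 1, 60)
x_te = rng.uniform(-2, 2, 500)
y_te = f(x_te) + rng.normal(0, 1, 500)

test_mse = {}
for deg in (1, 3, 15):                    # low, matched, and high flexibility
    coefs = np.polyfit(x_tr, y_tr, deg)
    pred = np.polyval(coefs, x_te)
    test_mse[deg] = np.mean((y_te - pred) ** 2)

# Typically degree 1 underfits (high bias) and degree 15 overfits (high
# variance), with the matched degree 3 near the bottom of the U.
print({d: round(m, 2) for d, m in test_mse.items()})
```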
What is the RSE (residual standard error) useful for? And what is the formula?
For getting a quick sense of how far off a prediction typically is: RSE estimates the standard deviation of the error term, measured in the units of Y. So if RSE = 400, predictions are typically off by about 400 units.
RSE = sqrt(RSS / (n - 2))
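A quick numpy check of the formula on simulated data (the true noise sd of 2 is an assumption of the simulation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated simple linear regression with true error sd sigma = 2.
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, n)

b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
resid = y - (b0 + b1 * x)
rss = np.sum(resid ** 2)              # residual sum of squares
rse = np.sqrt(rss / (n - 2))          # RSE = sqrt(RSS / (n - 2))

# RSE estimates sigma, so it should land near 2 here.
print(round(rse, 2))
```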
What are the 2 p-values associated with multiple linear regression?
The F-test and the t-test.
The F-test checks whether any parameter in the regression is non-zero; if one is, H0 (all Bj = 0) is rejected.
The t-test gives the number of standard errors a parameter lies from 0; if a parameter is many standard errors from 0, it is likely valuable for predicting the response variable.
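A hand-computed t-statistic for a simple regression slope on simulated data (with a single predictor, the F-statistic is just t squared); all numbers here are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate y = 1 + 0.5x + noise and test H0: beta1 = 0 by hand.
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, n)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))

# SE(b1) = RSE / sqrt(sum of squared deviations of x); t = b1 / SE(b1)
se_b1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = b1 / se_b1

# Many standard errors from 0 -> the slope very likely matters.
print(round(t_stat, 1))
```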
Difference between Qualitative and Quantitative predictors
Qualitative = categorical values, e.g. race
Quantitative = numerical values, e.g. income
With different coding schemes for Qualitative variables, what happens to the regression lines when interaction effects are implemented into the scheme?
They are no longer parallel to one another.
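A numpy sketch of 0/1 dummy coding with an interaction term (the coefficients are invented): the fitted slope differs between the two groups, so the lines are not parallel.

```python
import numpy as np

rng = np.random.default_rng(4)

# y = 1 + 2x + 3d + 1.5(x*d) + noise, with a 0/1 dummy d for the group.
n = 200
x = rng.uniform(0, 10, n)
d = rng.integers(0, 2, n)
y = 1 + 2 * x + 3 * d + 1.5 * x * d + rng.normal(0, 1, n)

# Design matrix: intercept, x, dummy, and the interaction column x*d.
A = np.column_stack([np.ones(n), x, d, x * d])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

slope_group0 = beta[1]              # slope when d = 0
slope_group1 = beta[1] + beta[3]    # slope when d = 1: shifted by interaction
print(round(slope_group0, 1), round(slope_group1, 1))
```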
What are assumptions of a simple linear regression?
Linearity: The relationship between X and Y is linear.
Independence: Observations are independent.
Homoscedasticity: Constant variance of errors (ϵ).
Normality: Errors (ϵ) are normally distributed.
What are the five potential problems with a regression model (model diagnostics)?
- Non-linearity
- Non-constant variance of error terms
- Outliers
- High leverage points
- Collinearity
What is collinearity?
Two predictors are very closely related to one another, so it is difficult to determine how they separately affect the response.
Why is there a need for model diagnostics?
To see if any of the assumptions of the model are violated
What are RSS, RSE, and Training MSE:
RSS (Residual Sum of Squares): Measures the total deviation of predicted values from actual values.
RSE (Residual Standard Error): An estimate of the standard deviation of ϵ, derived from RSS.
Training MSE: RSS divided by the number of observations n. It is a measure of training error.
How to assess coefficient accuracy?
Standard errors (SE) of coefficients and their p-values indicate accuracy.
How can you check whether the model fits well?
RSS and RSE
What are assumptions of a multiple linear regression?
Linearity: The relationship between X and Y is linear.
Independence: Observations are independent.
Homoscedasticity: Constant variance of errors (ϵ).
Normality: Errors (ϵ) are normally distributed.
No multicollinearity among predictors
Inclusion of all relevant predictors
What are solutions to the 5 problems a regression model might have?
- Non-linearity: use nonlinear transformations of the predictors
- Non-constant variance of error terms: transform the response variable Y, e.g. to log(Y)
- Outliers: check the studentized residual plot and remove clear outliers
- High leverage points: limit the values of X
- Collinearity: use VIF to see the severity of the collinearity, then drop one of the collinear predictors
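A hand-rolled VIF computation (VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others); the data, with x2 built as a near-copy of x1, is made up:

```python
import numpy as np

rng = np.random.default_rng(5)

n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.1, n)   # nearly collinear with x1
x3 = rng.normal(0, 1, n)          # unrelated predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Regress column j on the other columns; VIF = 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

vifs = [vif(X, j) for j in range(3)]
print([round(v, 1) for v in vifs])   # x1 and x2: huge VIF; x3: near 1
```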
What are the assumptions of LDA?
The predictor variable is normally distributed under each response class
If there is more than one predictor, the predictors follow a multivariate normal distribution
The predictors have the same variance (covariance matrix) in every class
Why is the Bayes classifier the gold standard to compare other classifiers against?
Because it classifies using the true conditional distribution P(Y|X), which gives the lowest possible test error rate (the Bayes error rate). In the real world we do not know this distribution, so the Bayes classifier is purely theoretical.
What are the assumptions of Naive Bayes?
Within each class, the predictors are independent. This means that for each class the joint density factors into a product of one function per predictor: fk(x) = fk1(x1) * fk2(x2) * ... * fkp(xp)
k = class
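A from-scratch Gaussian naive Bayes sketch on made-up two-class data, showing the product of per-predictor densities (class means, noise level, and sample size are all invented):

```python
import numpy as np

rng = np.random.default_rng(6)

# Two classes in 2-D with well-separated (assumed) means and unit noise.
n = 200
y = rng.integers(0, 2, n)
means = np.array([[0.0, 0.0], [3.0, 3.0]])
X = means[y] + rng.normal(0, 1, (n, 2))

# Estimate per-class, per-predictor means/sds and the class priors.
mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
sd = np.array([X[y == k].std(axis=0) for k in (0, 1)])
prior = np.array([(y == k).mean() for k in (0, 1)])

def gauss(x, m, s):
    return np.exp(-((x - m) ** 2) / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# Class-conditional density = PRODUCT of the per-predictor densities.
dens = np.stack([prior[k] * gauss(X, mu[k], sd[k]).prod(axis=1) for k in (0, 1)])
pred = dens.argmax(axis=0)
print(round((pred == y).mean(), 2))  # high accuracy on well-separated classes
```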
What is the main difference between LDA and QDA?
LDA assumes the predictors have class-specific means and a shared variance (covariance matrix). QDA assumes class-specific means and class-specific variances (covariance matrices).
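A 1-D sketch of the difference on made-up data whose class variances genuinely differ, which violates the shared-variance assumption behind LDA; equal class priors are assumed and dropped from the discriminants:

```python
import numpy as np

rng = np.random.default_rng(8)

# Two classes: different means, and class 1 is much more spread out.
n = 4000
y = rng.integers(0, 2, n)
true_mu = np.array([0.0, 2.0])
true_sd = np.array([0.5, 2.0])       # unequal variances favour QDA
x = rng.normal(true_mu[y], true_sd[y])

mu = np.array([x[y == k].mean() for k in (0, 1)])
var_k = np.array([x[y == k].var() for k in (0, 1)])  # QDA: one per class
var_pool = var_k.mean()            # LDA: one shared value (simple average)

def log_dens(t, m, v):             # Gaussian log-density up to a constant
    return -0.5 * np.log(v) - (t - m) ** 2 / (2 * v)

lda_pred = (log_dens(x, mu[1], var_pool) > log_dens(x, mu[0], var_pool)).astype(int)
qda_pred = (log_dens(x, mu[1], var_k[1]) > log_dens(x, mu[0], var_k[0])).astype(int)

lda_acc = (lda_pred == y).mean()
qda_acc = (qda_pred == y).mean()
print(round(lda_acc, 2), round(qda_acc, 2))  # QDA should come out ahead here
```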
What are the performance measures for all classifiers?
Accuracy = (TN +TP) / (N + P)
Error rate = (FN +FP) / (N + P)
Sensitivity = TP / P
Specificity = TN / N
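These formulas applied to a small hypothetical confusion matrix (all counts are invented):

```python
# Hypothetical counts: 100 actual positives (P) and 100 actual negatives (N).
TP, FN = 80, 20
TN, FP = 90, 10
P, N = TP + FN, TN + FP

accuracy = (TN + TP) / (N + P)      # 0.85
error_rate = (FN + FP) / (N + P)    # 0.15
sensitivity = TP / P                # 0.8
specificity = TN / N                # 0.9
print(accuracy, error_rate, sensitivity, specificity)
```

Note that accuracy and error rate always sum to 1, so they carry the same information.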
What are the 2 extra performance measures for LDA and QDA? (Explain them in your head as well)
They are called the:
-ROC (receiver operating characteristics)
-AUC (Area under curve)
Both are based on a curve with the true positive rate (sensitivity = TP/P) on the Y-axis and the false positive rate (1 - specificity) on the X-axis.
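AUC can also be computed directly as the fraction of (positive, negative) score pairs the classifier ranks correctly, which is what the area under the ROC curve measures; the scores below are made up:

```python
import numpy as np

# Scores some classifier assigned to actual positives and actual negatives.
pos = np.array([0.9, 0.8, 0.7, 0.4])
neg = np.array([0.5, 0.3, 0.2, 0.1])

# Compare every positive score against every negative score.
wins = (pos[:, None] > neg[None, :]).mean()   # correctly ranked pairs
ties = (pos[:, None] == neg[None, :]).mean()  # ties count half
auc = wins + 0.5 * ties
print(auc)  # → 0.9375 (15 of the 16 pairs are ranked correctly)
```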
How to find the optimal K-value with KNN?
Cross validation
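A from-scratch 5-fold CV over a few candidate K values for a 1-D KNN classifier (the data and the candidate grid are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

# Noisy 1-D classification problem with a decision boundary near x = 0.
n = 200
X = rng.uniform(-3, 3, n)
y = (X + rng.normal(0, 1, n) > 0).astype(int)

def knn_predict(X_tr, y_tr, X_te, k):
    d = np.abs(X_te[:, None] - X_tr[None, :])   # pairwise distances
    nearest = np.argsort(d, axis=1)[:, :k]      # k nearest training points
    return (y_tr[nearest].mean(axis=1) > 0.5).astype(int)

idx = rng.permutation(n)
folds = np.array_split(idx, 5)

cv_error = {}
for k in (1, 9, 51):
    fold_errs = []
    for f in folds:
        tr = np.setdiff1d(idx, f)               # train on everything else
        pred = knn_predict(X[tr], y[tr], X[f], k)
        fold_errs.append((pred != y[f]).mean())
    cv_error[k] = float(np.mean(fold_errs))

best_k = min(cv_error, key=cv_error.get)        # K with the lowest CV error
print(cv_error, best_k)
```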
What are the three CV methods?
- K-fold CV
- LeaveOneOut CV
- Validation Set approach
What are the two Scenarios for Resampling Methods?
Model assessment: Estimating the test error rate when a test set is unavailable.
Model selection: Selecting the model with an appropriate level of flexibility.
List the pros and cons for each CV method
Validation Set:
Pros: Simple and quick.
Cons: High variance due to randomness in splitting data; training on fewer observations can lead to overestimation of the test error.
LOOCV:
Pros: Uses nearly all the data for training; reduces bias.
Cons: Computationally intensive; requires refitting the model n times.
K-Fold Cross Validation:
Pros: Balances bias and variance well; computationally efficient compared to LOOCV.
Cons: Still computationally expensive with very large datasets.
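A tiny index-level sketch of the refitting cost above: LOOCV is k-fold CV with k = n, so it needs one refit per observation where 5-fold needs only 5 (n = 20 is arbitrary):

```python
import numpy as np

# How each scheme partitions n = 20 observation indices into held-out folds.
n = 20
idx = np.arange(n)

kfold_splits = np.array_split(idx, 5)   # 5-fold CV: 5 held-out folds of 4
loocv_splits = np.array_split(idx, n)   # LOOCV: n folds of a single point

# One model refit is needed per held-out fold:
print(len(kfold_splits), "vs", len(loocv_splits), "refits")
```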