Topic 1-4 Flashcards

1
Q

What is statistical learning?

A

Statistical learning refers to a set of approaches for estimating the relationship between a response variable Y and predictors X. This relationship is modeled as Y=f(X)+ϵ, where f(X) is the unknown function to be estimated, and ϵ is the error term representing noise.
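The Y = f(X) + ϵ setup can be sketched in a few lines of numpy. The linear f, sample size, and noise level below are all made up for illustration; the point is only that a fitted model recovers an estimate of the unknown f from noisy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true relationship: f(X) = 2 + 3X, plus irreducible noise epsilon
n = 200
X = rng.uniform(0, 10, n)
eps = rng.normal(0, 1, n)          # the error term epsilon
Y = 2 + 3 * X + eps                # Y = f(X) + eps

# Estimate f with a least-squares line (np.polyfit returns highest degree first)
slope, intercept = np.polyfit(X, Y, 1)
# slope and intercept should land close to the true values 3 and 2
```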

2
Q

What are the names of X and Y?

A

Y: Response variable (output or dependent variable)
X: Predictors (inputs, features, or independent variables).

3
Q

Pros/cons of the parametric and non-parametric methods:

A

Parametric Methods:
Pros: Simple and interpretable; require fewer observations.
Cons: May fail if the assumed model is far from the true f.
Non-parametric Methods:
Pros: Can model complex shapes of f; more flexible.
Cons: Require more data; less interpretable.

4
Q

For the best prediction performance, which of our tested statistical learning methods should we choose?

A

Select the method that achieves the lowest test MSE by balancing bias and variance.

5
Q

What is the relationship between test performance and model flexibility?

A

Test error decreases initially with flexibility (reducing bias) but increases beyond a certain point due to variance (overfitting).
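This U-shape can be sketched by fitting polynomials of increasing flexibility to simulated data. The quadratic true function, sample sizes, and seed below are all hypothetical; degree 1 typically underfits (bias), degree 10 typically overfits (variance), and degree 2 sits near the bottom of the U.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                            # hypothetical true function (quadratic)
    return x**2 - 2 * x

n_train, n_test = 30, 200
x_train = rng.uniform(-3, 3, n_train)
x_test = rng.uniform(-3, 3, n_test)
y_train = f(x_train) + rng.normal(0, 1, n_train)
y_test = f(x_test) + rng.normal(0, 1, n_test)

test_mse = {}
for degree in (1, 2, 10):            # low, appropriate, and high flexibility
    coefs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coefs, x_test)
    test_mse[degree] = np.mean((pred - y_test) ** 2)

# Typically test_mse[2] is the smallest of the three
```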

6
Q

What is the RSE (residual standard error) useful for? And what is the formula?

A

For quickly seeing how far a prediction typically is from the actual value; it is measured in the units of Y. So if RSE = 400, a prediction is typically off by about 400 units.

RSE = sqrt(RSS / (n - 2))

7
Q

What are the 2 p-values associated with multicollinearity?

A

The F-test and the t-test.
The F-test checks whether there is any non-zero parameter in the regression; if there is, H0 (all βj = 0) is rejected.
The t-test gives the number of standard errors a parameter is from 0; if a parameter is many standard errors from 0, it is likely valuable for predicting the response variable.

8
Q

Difference between Qualitative and Quantitative predictors

A

Qualitative: categorical values with no numeric scale (e.g., race, gender).
Quantitative: numeric values (e.g., income, age).

9
Q

With different coding schemes for Qualitative variables, what happens to the regression lines when interaction effects are implemented into the scheme?

A

They are no longer parallel to one another.
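A small numpy sketch (with made-up coefficients) of why this happens: once the interaction term x*d enters the model, the slope for the d = 1 group differs from the d = 0 group, so the two fitted lines cannot be parallel.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two groups coded by a 0/1 dummy d, with different slopes
n = 100
x = rng.uniform(0, 10, n)
d = rng.integers(0, 2, n)               # qualitative predictor, dummy-coded
y = 1 + 2 * x + 3 * d + 1.5 * x * d + rng.normal(0, 1, n)

# Design matrix with an interaction term x*d
X = np.column_stack([np.ones(n), x, d, x * d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_group0 = beta[1]                  # slope when d = 0
slope_group1 = beta[1] + beta[3]        # slope when d = 1: lines not parallel
```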

10
Q

What are assumptions of a simple linear regression?

A

Linearity: The relationship between X and Y is linear.
Independence: Observations are independent.
Homoscedasticity: Constant variance of errors (ϵ).
Normality: Errors (ϵ) are normally distributed.

11
Q

What are the five potential problems with a regression model (model diagnostics)?

A
  • Non-linearity
  • Non-constant variance of error terms
  • Outliers
  • High leverage points
  • Collinearity
12
Q

What is collinearity?

A

Two predictors are very closely related to one another, so it's difficult to determine how they separately affect the response.

13
Q

Why is there a need for model diagnostics?

A

To see if any of the assumptions of the model are violated

14
Q

What are RSS, RSE, and Training MSE?

A

RSS (Residual Sum of Squares): The sum of squared deviations of predicted values from actual values.
RSE (Residual Standard Error): An estimate of the standard deviation of ϵ, computed from RSS (in simple linear regression, sqrt(RSS / (n - 2))).
Training MSE: RSS divided by the number of observations (RSS / n). It is a measure of training error.
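The three quantities are easy to compute side by side. A numpy sketch on hypothetical simple-regression data (true noise sd = 3, so RSE should land near 3):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical simple linear regression data with noise sd = 3
n = 50
x = rng.uniform(0, 10, n)
y = 5 + 2 * x + rng.normal(0, 3, n)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

rss = np.sum(residuals ** 2)            # residual sum of squares
rse = np.sqrt(rss / (n - 2))            # estimates the sd of epsilon
train_mse = rss / n                     # training mean squared error
```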

15
Q

How to assess coefficient accuracy?

A

Standard errors (SE) of coefficients and their p-values indicate accuracy.

16
Q

How can you check if the model fitting is good?

A

RSS and RSE

17
Q

What are assumptions of a multiple linear regression?

A

Linearity: The relationship between X and Y is linear.
Independence: Observations are independent.
Homoscedasticity: Constant variance of errors (ϵ).
Normality: Errors (ϵ) are normally distributed.
No multicollinearity among predictors.

Inclusion of all relevant predictors.

18
Q

What are solutions to the 5 problems a regression model might have?
- Non-linearity
- Non-constant variance of error terms
- Outliers
- High leverage points
- Collinearity

A
  • Use nonlinear transformations of predictors
  • Transform the response variable Y with a nonlinear function such as log(Y)
  • Check the studentized residuals plot and remove outliers
  • Limit the values of X
  • Use VIF to see the severity of collinearity and drop one of the collinear predictors
19
Q

What are the assumptions of LDA?

A

The predictor variable is normally distributed under each response class

If there is more than one predictor, the predictors follow a multivariate normal distribution

The predictors have the same variance in every class (a covariance matrix shared across classes)

20
Q

Why is the Bayes classifier the gold standard to compare other classifiers against?

A

Because it uses the true conditional distribution of Y given X, it achieves the lowest possible test error rate. In the real world we do not know the true distribution of our predictors, so the Bayes classifier is purely theoretical.

21
Q

What are the assumptions of Naive Bayes?

A

Within each class, the predictors are independent. This means that for each class the joint density factorizes into a product of per-predictor densities: fk(x) = fk1(x1) × fk2(x2) × ... × fkp(xp), where k indexes the class.
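A tiny stdlib sketch of this factorization, assuming (hypothetically) Gaussian per-predictor densities with made-up means and standard deviations for one class k:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical per-predictor parameters (mean, sd) for class k, predictors x1 and x2
params_k = [(0.0, 1.0), (5.0, 2.0)]

def class_density(x, params):
    """f_k(x) = f_k1(x1) * f_k2(x2) * ... under the independence assumption."""
    density = 1.0
    for xj, (mu, sigma) in zip(x, params):
        density *= gaussian_pdf(xj, mu, sigma)
    return density
```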

22
Q

What is the main difference between LDA and QDA?

A

LDA assumes the predictor has a class-specific mean and a variance shared across classes. QDA assumes the predictor has a class-specific mean and a class-specific variance.

23
Q

What are the performance measures for all classifiers?

A

Accuracy = (TN +TP) / (N + P)
Error rate = (FN +FP) / (N + P)
Sensitivity = TP / P
Specificity = TN / N
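These four formulas map directly to code. A small sketch computing them from confusion-matrix counts (the example counts are made up):

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard performance measures from confusion-matrix counts."""
    p = tp + fn                  # actual positives
    n = tn + fp                  # actual negatives
    return {
        "accuracy": (tn + tp) / (n + p),
        "error_rate": (fn + fp) / (n + p),
        "sensitivity": tp / p,   # true positive rate
        "specificity": tn / n,   # true negative rate
    }

# Hypothetical example: 40 TP, 45 TN, 5 FP, 10 FN
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```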

24
Q

What are the 2 extra performance measures for LDA and QDA? (Explain them in your head as well)

A

They are called:
- ROC (receiver operating characteristic) curve
- AUC (area under the curve)

Both are based on a curve with the true positive rate (sensitivity = TP/P) on the Y-axis and the false positive rate (1 - specificity) on the X-axis.
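The AUC has an equivalent rank interpretation: it is the probability that a randomly chosen positive scores higher than a randomly chosen negative. A numpy sketch using that pairwise formulation (rather than integrating the ROC curve):

```python
import numpy as np

def auc(scores, labels):
    """AUC = P(random positive scores higher than random negative); ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Perfectly separated scores give AUC = 1.0; random guessing is about 0.5
```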

25
Q

How to find the optimal K-value with KNN?

A

Cross validation

26
Q

What are the three CV methods?

A
  • K-fold CV
  • Leave-one-out CV (LOOCV)
  • Validation set approach
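The three methods differ only in how they split the observation indices. A numpy sketch on a hypothetical dataset of 12 observations:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 12
indices = rng.permutation(n)

# Validation set approach: one random train/validation split
split = n // 2
train_idx, val_idx = indices[:split], indices[split:]

# K-fold CV: k folds, each used once as the validation set
k = 3
folds = np.array_split(indices, k)      # 3 folds of 4 observations

# LOOCV: n folds of a single observation each (K-fold with k = n)
loo_folds = np.array_split(indices, n)
```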
27
Q

What are the two Scenarios for Resampling Methods?

A

Model assessment: Estimating the test error rate when a test set is unavailable.
Model selection: Selecting the model with an appropriate level of flexibility.

28
Q

List the pros and cons of each CV method

A

Validation Set:
Pros: Simple and quick.
Cons: High variance due to randomness in splitting data; training on fewer observations can lead to overestimation of the test error.
LOOCV:
Pros: Uses nearly all the data for training; reduces bias.
Cons: Computationally intensive; requires refitting the model n times.
K-Fold Cross Validation:
Pros: Balances bias and variance well; computationally efficient compared to LOOCV.
Cons: Still computationally expensive with very large datasets.