Linear Regression Flashcards
Potential Problems with Linear Regression
1. Non-linearity of the response-predictor relationship 2. Correlation of error terms 3. Non-constant variance of error terms (heteroscedasticity) 4. Outliers 5. High-leverage points 6. Collinearity
Residual Plots in Linear Regression
(1) If the residuals show a non-linear pattern, this suggests a non-linear relationship in the data. (2) Residual plots can also be used to detect heteroscedasticity, which shows up as a funnel shape. (3) Look for tracking in the residuals, which may occur if the error terms are correlated - think time series data.
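A minimal sketch of such a residual plot, using statsmodels on simulated data (the quadratic relationship and all variable names here are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x**2 + rng.normal(scale=3, size=200)   # true relationship is quadratic

fit = sm.OLS(y, sm.add_constant(x)).fit()            # deliberately misspecified straight-line fit

plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Curvature = non-linearity; funnel = heteroscedasticity; tracking = correlated errors")
plt.show()
```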
Two Assumptions of the Linear Model
(1) Additive: the effect of a change in one predictor on Y does not depend on the values of the other predictors. (2) Linear: the change in Y for a one-unit change in a predictor is constant, whatever the value of that predictor.
Outliers in Linear Regression
Outliers can skew the model, so you may want to remove them. There is also a statistic called the studentized residual; an absolute value greater than 3 suggests a possible outlier.
Leverage in Linear Regression
A high-leverage point has an unusual (extreme) X value. You can compute a leverage statistic to identify such points.
Will a Correlation Matrix Detect Collinearity?
Yes, but not in all cases: collinearity can exist among combinations of three or more variables (multicollinearity) even when no single pair is highly correlated. The variance inflation factor (VIF) handles this; a VIF above 5 indicates problematic collinearity.
Variance Inflation Factor
VIF for a coefficient is 1 / (1 - R²), where R² comes from regressing that predictor on all of the other predictors. The smallest possible value is 1 (no collinearity at all); a VIF above 5 (some use 10) indicates problematic collinearity.
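A hedged sketch of computing VIFs with statsmodels; the three predictors are simulated, with x1 and x2 deliberately made nearly identical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)   # nearly a copy of x1 -> collinear pair
x3 = rng.normal(size=300)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for j, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X.values, j))
# x1 and x2 should show VIFs far above 5, while x3 stays close to 1.
```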
Studentized Residual
The residual divided by its estimated standard error; helps detect outliers in Y. An absolute value greater than 3 suggests a possible outlier.
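A small sketch of flagging outliers via externally studentized residuals in statsmodels, with one outlier planted in simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
y[10] += 12                                   # plant an outlier in the response

fit = sm.OLS(y, sm.add_constant(x)).fit()
stud = fit.get_influence().resid_studentized_external
print(np.where(np.abs(stud) > 3)[0])          # index 10 should be flagged
```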
Leverage Statistic
The average leverage is always (p + 1) / n, and each observation's leverage lies between 1/n and 1. If a point's leverage greatly exceeds (p + 1) / n, the corresponding point may have high leverage.
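A sketch of computing leverage (hat) values with statsmodels; one extreme X value is planted in simulated data, and 3x the average leverage is used as an illustrative cutoff:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
x[0] = 8.0                                    # extreme predictor value -> high leverage
y = 1 + 2 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
hat = fit.get_influence().hat_matrix_diag
p, n = 1, len(x)
avg_leverage = (p + 1) / n                    # average leverage = (p + 1) / n
print(np.where(hat > 3 * avg_leverage)[0])    # index 0 should stand out
```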
Possible Solutions for Collinearity
(1) Drop one of the collinear variables, or (2) combine them into a single predictor, e.g., take the average of their standardized versions (see the sketch below).
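A tiny sketch of option (2); the variable names (limit, rating) and the data are made up, standing in for two predictors that would typically be highly collinear:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
limit = rng.normal(size=200)
rating = limit + rng.normal(scale=0.1, size=200)      # highly collinear pair
df = pd.DataFrame({"limit": limit, "rating": rating})

z = (df - df.mean()) / df.std()                       # standardize first
df["combined"] = (z["limit"] + z["rating"]) / 2       # single combined predictor
print(df.head())
```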
Bias vs Variance Tradeoff
Variance: how much the fitted model would change if it were estimated on a different training set - the tendency to overfit. Bias: the error introduced by approximating the true relationship with a simpler model - how accurate the model can be. More flexible models trade lower bias for higher variance.
Parametric vs. Non-Parametric Model Accuracy
Parametric methods tend to outperform non-parametric ones when n is small relative to p, because of high dimensionality. Think about what high dimensionality does to KNN.
When order doesn't matter, permutations or combinations? Formula for each
Order doesn't matter = combination. Combination: n choose k = n! / (k! (n - k)!). Permutation: P(n, k) = n! / (n - k)!
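A quick check of both formulas with Python's standard library (n = 5, k = 2 chosen arbitrarily):

```python
import math

n, k = 5, 2
print(math.comb(n, k))   # combinations: 5! / (2! * 3!) = 10  (order doesn't matter)
print(math.perm(n, k))   # permutations: 5! / 3!        = 20  (order matters)
```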
Confusion Matrix
Y axis (left): Predicted status. X axis (top): Actual (true) status.
Sensitivity vs Specificity
Sensitivity: the fraction of actual positives correctly identified (true positive rate), TP / (TP + FN).
Specificity: the fraction of actual negatives correctly identified (true negative rate), TN / (TN + FP).
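A small sketch computing both rates from a 2x2 confusion matrix laid out as in the cards above (rows = predicted, columns = true); all counts are made up. Note that some libraries, e.g. scikit-learn's confusion_matrix, use the transposed convention (rows = true).

```python
import numpy as np

#                true +  true -
cm = np.array([[   80,     30],    # predicted +
               [   20,    870]])   # predicted -

tp, fp = cm[0]
fn, tn = cm[1]
sensitivity = tp / (tp + fn)       # fraction of actual positives caught
specificity = tn / (tn + fp)       # fraction of actual negatives correctly ruled out
print(sensitivity, specificity)    # 0.8 and ~0.967
```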
ROC Curve
Y axis: true positive rate (sensitivity)
X axis: false positive rate (1 - specificity)
Want to catch as many true positives as possible without any false positives - want to catch lots of fish but no dolphins.
There is a diagonal reference line representing classification by chance. Example: randomly label 20% of the population as guilty; by chance you would identify 20% of the true positives (TPR = 0.2), and 20% of the rest of the population would also be flagged as guilty (FPR = 0.2), which puts you on the diagonal.
Want to maximize the area under the ROC curve (AUC).
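A sketch of plotting an ROC curve and its AUC with scikit-learn; the data come from make_classification and a logistic regression fit, purely for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, probs)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, probs):.2f}")
plt.plot([0, 1], [0, 1], "k--", label="chance")        # the diagonal reference line
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```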
Compare the following classification methods:
- Logistic Regression
- LDA
- QDA
- KNN
LR and LDA both produce linear decision boundaries. LDA assumes the observations in each class are drawn from a normal distribution with a common covariance matrix, so it will outperform LR when that assumption roughly holds; otherwise LR will win.
QDA gives a quadratic decision boundary but is still parametric, so a more flexible boundary is possible while still performing well when n is small. It makes the opposite assumption from LDA: each class has its own covariance matrix (different correlations between the predictors in each class). QDA is a middle ground between the linear methods and KNN - "moderately non-linear".
KNN is completely non-parametric and can find extremely non-linear boundaries.
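A sketch fitting all four classifiers on the same simulated data with scikit-learn and comparing test accuracy (the dataset and every parameter here are arbitrary, so the ranking will vary with the data-generating process):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(n_neighbors=10),
}
for name, model in models.items():
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```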
Curse of Dimensionality
When p is large, there tends to be a deterioration in the performance of KNN and other local approaches that predict using only the observations near the test observation: in high dimensions, the observations "near" a given test point are actually far away, so only a very small fraction of the available observations is really being used to make the prediction. This is why non-parametric approaches tend to perform poorly when p is large relative to n.
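A small demonstration of the idea: for points drawn uniformly in the unit hypercube, the distance from a test point to its nearest neighbor grows with the number of dimensions p (the sample size and dimensions chosen here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
for p in [1, 2, 10, 50, 100]:
    X = rng.uniform(size=(n, p))              # training points in the unit hypercube
    test = rng.uniform(size=p)                # one test point
    nearest = np.min(np.linalg.norm(X - test, axis=1))
    print(f"p = {p:3d}  distance to nearest neighbor = {nearest:.3f}")
```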