Linear Regression Flashcards

1
Q

Potential Problems with Linear Regression

A
  1. Non-linearity of the response-predictor relationship
  2. Correlation of error terms
  3. Non-constant variance of error terms (heteroscedasticity)
  4. Outliers
  5. High-leverage points
  6. Collinearity
2
Q

Residual Plots in Linear Regression

A

(1) A non-linear pattern in the residuals suggests a non-linear relationship in the data. (2) Residual plots can also detect heteroscedasticity in the dataset, which shows up as a funnel shape. (3) Look for tracking in the residuals, which can occur when error terms are correlated, as in time series data.

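Below is a minimal sketch of such a residual plot in Python (statsmodels + matplotlib); the synthetic data, and the noise scale that grows with x to produce the funnel, are invented for illustration.

```python
# Illustrative sketch: residual plot showing a funnel (heteroscedasticity).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(scale=1 + 0.3 * x, size=200)  # noise grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```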
3
Q

Two assumptions of the linear model

A

(1) Additive: the effect of a change in one predictor on Y does not depend on the values of the other predictors. (2) Linear: the change in Y for a one-unit change in a predictor is constant, whatever the value of that predictor.

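A small illustrative sketch of relaxing each assumption with statsmodels formulas: an interaction term relaxes additivity and a squared term relaxes linearity. The data frame and coefficients are invented.

```python
# Illustrative sketch: an interaction term (x1:x2) relaxes additivity;
# I(x1**2) relaxes linearity.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = 1 + 2 * df.x1 + 3 * df.x2 + 4 * df.x1 * df.x2 + rng.normal(size=300)

additive = smf.ols("y ~ x1 + x2", data=df).fit()            # assumes both
flexible = smf.ols("y ~ x1 * x2 + I(x1**2)", data=df).fit() # relaxes both
print(additive.rsquared, flexible.rsquared)  # flexible model fits far better
```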
4
Q

Outliers in Linear Regression

A

Outliers can skew the model, so you may want to remove them. There is also a statistic called the studentized residual; an absolute value greater than 3 suggests a possible outlier.

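A minimal sketch of the studentized-residual check using statsmodels; the planted outlier and the threshold of 3 follow the card, everything else is made up.

```python
# Illustrative sketch: flag possible outliers via studentized residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
y[0] += 8                                    # plant an outlier in y

fit = sm.OLS(y, sm.add_constant(x)).fit()
stud = fit.get_influence().resid_studentized_external
print(np.where(np.abs(stud) > 3)[0])         # indices of suspected outliers
```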
5
Q

Leverage in Linear Regression

A

An observation with an extreme X value has high leverage, meaning it can strongly influence the fitted line; the leverage statistic quantifies this.

6
Q

Will a Correlation Matrix Detect Collinearity?

A

Yes, but not in all cases. It is possible for combinations of three or more variables to be collinear (multicollinearity) even when no single pair is highly correlated. The Variance Inflation Factor (VIF) detects this: a VIF above 5 (or 10, by some rules of thumb) indicates collinearity.

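A minimal sketch using statsmodels' variance_inflation_factor; the three synthetic predictors are constructed so that pairwise correlations stay moderate while the VIFs explode.

```python
# Illustrative sketch: VIF catches collinearity a correlation matrix understates.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)   # collinear combination
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

print(X.drop(columns="const").corr().round(2))   # pairwise corrs only moderate
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))  # all very large
```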
7
Q

Variance Inflation Factor

A

VIF > 5 (or 10, by some rules of thumb) indicates problematic collinearity. The VIF for a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on all the others; its minimum value is 1 (no collinearity).

8
Q

Studentized Residual

A

Helps detect outliers in Y: divide each residual by its estimated standard error; an absolute value greater than 3 suggests a possible outlier.

9
Q

Leverage Statistic

A

If an observation's leverage statistic greatly exceeds the average value (p + 1) / n, the corresponding point may have high leverage. Leverage values always lie between 1/n and 1.

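A minimal sketch comparing hat values to the (p + 1)/n average with statsmodels; the extreme X value and the "3x the average" cutoff are illustrative choices, not a fixed rule.

```python
# Illustrative sketch: compare each hat value to the average (p + 1) / n.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=100)
x[0] = 10                                    # one extreme X value
y = 1 + 2 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
h = fit.get_influence().hat_matrix_diag      # leverage statistics
avg = (1 + 1) / len(x)                       # (p + 1) / n with p = 1
print(np.where(h > 3 * avg)[0])              # points far above the average
```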
10
Q

Possible Solutions for Collinearity

A

(1) Drop one of the collinear variables from the regression. (2) Combine them into a single predictor, e.g., average the (standardized) variables.

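A toy sketch of both fixes in pandas; the column names (limit, rating) are invented stand-ins for a collinear pair.

```python
# Illustrative sketch of both fixes for a collinear pair of columns.
import pandas as pd

df = pd.DataFrame({"limit": [3000.0, 5000.0, 7000.0, 4000.0],
                   "rating": [310.0, 480.0, 690.0, 400.0]})

dropped = df.drop(columns=["rating"])        # fix 1: drop one of the pair
z = (df - df.mean()) / df.std()              # fix 2: average standardized vars
df["credit"] = (z["limit"] + z["rating"]) / 2
print(df)
```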
11
Q

Bias vs Variance Tradeoff

A

Variance: how much the fit would change with different training data; high variance means a tendency to overfit. Bias: error introduced by the model's simplifying assumptions; high bias means a tendency to underfit. More flexible models trade higher variance for lower bias, so aim to minimize the sum of the two.

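A small sketch of the tradeoff using scikit-learn: an underfit degree-1 polynomial (high bias), a reasonable degree-3 fit, and an overfit degree-15 fit (high variance) on invented sine data.

```python
# Illustrative sketch: test error vs. polynomial degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

for degree in (1, 3, 15):   # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(degree, mean_squared_error(y_te, model.predict(x_te)))
```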
12
Q

Parametric vs. Non-parametric Model Accuracy

A

Parametric methods tend to outperform non-parametric ones when n is small relative to p, because of high dimensionality. Think about what high dimensionality does to KNN: with few observations spread over many dimensions, the "nearest" neighbors are not actually near.

13
Q

When order doesn’t matter, permutations or combinations? Formula for each

A
Order doesn't matter = combination.
Combination ("P choose K"): P! / (K! (P − K)!)
Permutation (order matters): P! / (P − K)!
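Python's math module (3.8+) implements both formulas directly:

```python
# math.comb and math.perm match the formulas above.
import math

P, K = 5, 2
print(math.comb(P, K))   # 10 = 5! / (2! * 3!)   -- order doesn't matter
print(math.perm(P, K))   # 20 = 5! / 3!          -- order matters
```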
14
Q

Confusion Matrix

A
Rows (left): Predicted status
Columns (top): Actual or true status
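For reference, a quick sketch with scikit-learn. Note that sklearn.metrics.confusion_matrix puts the actual status on the rows and the predicted status on the columns, i.e., the transpose of the layout on this card.

```python
# scikit-learn convention: rows = ACTUAL, columns = PREDICTED.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)                 # [[TN FP], [FN TP]]
print(tn, fp, fn, tp)     # 3 1 1 3
```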
15
Q

Sensitivity vs Specificity

A

Sensitivity: % of actual positives correctly identified (true positive rate).
Specificity: % of actual negatives correctly identified (true negative rate).

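A minimal sketch computing both from confusion-matrix counts; the tiny label vectors are made up.

```python
# Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))   # 0.75 -- 3 of 4 positives caught
print("specificity:", tn / (tn + fp))   # 0.75 -- 3 of 4 negatives caught
```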
16
Q

ROC Curve

A

Y axis: True positive rate (sensitivity)
X axis: False positive rate (1 − specificity)

You want to catch as many true positives as possible without false positives: catch lots of fish but no dolphins.
The diagonal line represents classification by chance: if you randomly label 20% of the population as guilty, you expect to catch 20% of the true positives by chance, and 20% of the remaining (innocent) population would be flagged as guilty as well.
Maximize the area under the ROC curve (AUC).

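A minimal sketch of an ROC curve with scikit-learn; the synthetic dataset and logistic-regression scorer are illustrative choices.

```python
# Illustrative sketch: ROC curve, chance diagonal, and AUC.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, scores):.2f}")
plt.plot([0, 1], [0, 1], "k--", label="chance")  # classify-by-chance reference
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```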
17
Q

Compare the following classification methods:

  1. Logistic Regression
  2. LDA
  3. QDA
  4. KNN
A

LR and LDA both produce linear decision boundaries. LDA additionally assumes the observations in each class are drawn from a normal distribution with a common covariance matrix, so it will outperform LR when that assumption roughly holds; otherwise LR will win.

QDA produces a quadratic decision boundary but is still parametric, so a more flexible boundary is possible while still performing well when n is small. It makes the opposite assumption from LDA: each class has its own covariance matrix, i.e., different correlations between predictors in each class. It is a tradeoff between the linear methods and KNN: "moderately non-linear".

KNN is completely non-parametric and can find extremely non-linear boundaries.

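A minimal sketch fitting all four methods on the same synthetic data with scikit-learn; the moon-shaped dataset is chosen to make the boundary non-linear, so the ranking noted in the comment is only what one would typically expect here.

```python
# Illustrative sketch: all four classifiers on one synthetic dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)  # curved boundary
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LR": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(n_neighbors=10),
}
for name, model in models.items():
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
# On this curved boundary, KNN and QDA typically beat the linear methods.
```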
18
Q

Curse of Dimensionality

A

When p is large, there tends to be a deterioration in the performance of KNN and other local approaches that predict using only observations near the test observation: spreading the data over many dimensions means even the nearest observations are far away, so each prediction effectively uses only a tiny fraction of the available data. This is why non-parametric approaches perform poorly when p is large.
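A small sketch that makes the effect concrete: as p grows, the distance to the nearest of 1,000 uniform points approaches the distance to a typical point, so "local" neighborhoods stop being local.

```python
# Illustrative sketch: nearest-neighbor distances grow with dimension p.
import numpy as np

rng = np.random.default_rng(6)
n = 1000
for p in (1, 10, 100):
    X = rng.uniform(size=(n, p))             # n points in the unit hypercube
    test = rng.uniform(size=p)
    d = np.linalg.norm(X - test, axis=1)
    print(f"p={p:3d}  nearest={d.min():.3f}  median={np.median(d):.3f}")
# As p grows, even the nearest point is nearly as far as the typical point.
```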