Chapter 6 Flashcards
Multiple Linear Regression
What is Multiple Linear Regression?
A regression model with more than one input variable
Used to fit a linear relationship between a quantitative dependent variable Y and a set of predictors X
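A minimal sketch of fitting such a model, assuming scikit-learn and a hypothetical DataFrame with two inputs and one quantitative target (all names made up):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: two numeric inputs, one quantitative target
    df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                       "x2": [2, 1, 4, 3, 5],
                       "y":  [3, 4, 8, 9, 13]})

    # More than one input column -> multiple linear regression
    model = LinearRegression().fit(df[["x1", "x2"]], df["y"])
    print(model.intercept_, model.coef_)  # b0 and (b1, b2)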
Linear Regression
The most commonly used predictive modeling technique; fits a “best fit” line through the data
Input, Target variables
Must be numeric; categorical predictors must first be converted into dummy variables
Best fit
Minimizes the sum of squared vertical distances from the data points to the line.
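A sketch of the criterion itself, assuming NumPy; np.linalg.lstsq finds the coefficients that minimize exactly this sum of squared vertical distances:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 7.8])

    # Design matrix with an intercept column; lstsq minimizes
    # sum((y - X @ b) ** 2), the squared vertical distances
    X = np.column_stack([np.ones_like(x), x])
    b, resid_ss, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b)         # [intercept, slope] of the best-fit line
    print(resid_ss)  # the minimized sum of squares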
Goodness of Fit
Assessed using the differences between predicted and actual values, called residuals
R squared
A value from 0 to 1 that measures the proportion of variance in the target explained by the inputs.
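Computed by hand it is 1 minus (residual sum of squares / total sum of squares); a sketch assuming NumPy arrays of actual and predicted values:

    import numpy as np

    def r_squared(y, y_pred):
        sse = np.sum((y - y_pred) ** 2)    # unexplained variation (residuals)
        sst = np.sum((y - y.mean()) ** 2)  # total variation around the mean
        return 1 - sse / sst               # proportion of variance explained

    y      = np.array([3.0, 4.0, 8.0, 9.0, 13.0])
    y_pred = np.array([3.2, 4.5, 7.6, 9.4, 12.3])
    print(r_squared(y, y_pred))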
Y
Dependent Variable, aka outcome or response variable
X
Predictors, aka independent or input variables, regressors, covariates
B
Coefficients
E
The noise or unexplained part
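Putting the four notation cards together, the model has the standard textbook form (not written out in the cards above):

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon

where the betas are the coefficients B and epsilon is the noise term E.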
The data are used to estimate
the coefficients and the variability of the noise
Objectives of fitting a model related to a quantitative outcome
Understanding the relationship between factors (focus of classical stats)
Predicting the outcome of new cases (focus of Data Mining)
Explanatory vs Predictive Modeling
The choice of model is closely tied to which of these is the goal
Both use a dataset to fit a model (i.e. estimate coefficients)
However, there are several differences between the two:
Explanatory fits data closely - Predictive predicts new cases accurately
Explanatory uses entire data set - Predictive splits into partitions
Performance measures:
Explanatory: How well the model fits the data
Predictive: Predictive accuracy
Explanatory Modeling
Goal: Explain relationship between predictors (explanatory variables) and target
Familiar use of regression in data analysis
Model Goal: Fit the data well and understand the contribution of explanatory variables to the model
“goodness-of-fit”: R2, residual analysis, p-values
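A sketch of the explanatory workflow, assuming statsmodels and made-up column names; the model is fit to the entire data set, and the summary reports R2 and per-coefficient p-values, with the residuals available for residual analysis:

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data; explanatory modeling uses all of it
    df = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6],
                       "x2": [2, 1, 4, 3, 6, 5],
                       "y":  [3, 4, 8, 9, 13, 14]})

    X = sm.add_constant(df[["x1", "x2"]])  # add the intercept term
    fit = sm.OLS(df["y"], X).fit()
    print(fit.summary())                   # R-squared, coefficient p-values
    residuals = fit.resid                  # inputs to residual analysis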
Predictive Modeling
Goal: predict target values in other data where we have predictor values, but not target values
Classic data mining context
Model Goal: Optimize predictive accuracy
Train model on training data
Assess performance on validation (hold-out) data
Explaining role of predictors is not primary purpose (but useful)
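A sketch of the predictive workflow under the same kind of assumptions (scikit-learn, synthetic data): fit on the training partition, judge only by accuracy on the hold-out partition:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for a real data set
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

    # Partition, train on one part, assess on the hold-out part
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.4, random_state=1)
    model = LinearRegression().fit(X_train, y_train)
    rmse = mean_squared_error(y_valid, model.predict(X_valid)) ** 0.5
    print(rmse)  # predictive accuracy on cases the model never saw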
*You cannot include all the binary dummies; in regression, including all m dummies for an m-category variable causes perfect multicollinearity (a multicollinearity error), so include only m − 1 of them.
Other data mining methods can use all the dummies.
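With pandas, drop_first=True implements the m − 1 rule (hypothetical "region" column):

    import pandas as pd

    df = pd.DataFrame({"region": ["east", "west", "north", "east"]})

    # 3 categories -> 2 dummies; dropping one avoids perfect
    # multicollinearity with the intercept (the "dummy variable trap")
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
    print(dummies)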
Selecting Subsets of Predictors
Goal: Find parsimonious model (the simplest model that performs sufficiently well)
More robust
Higher predictive accuracy
We will assess predictive accuracy on validation data.
Partial Search Algorithms
Forward
Backward
Stepwise
Exhaustive Search = Best Subset
All possible subsets of predictors assessed (single, pairs, triplets, etc.)
Computationally intensive, not feasible for big data
Judge by “adjusted R2”
Adjusted R2 for the models with 1 predictor, 2 predictors, 3 predictors, etc. (exhaustive search method)
Adjusted R2 rises until about 7-8 predictors, then stabilizes, so choose the 7-predictor model according to the adjusted-R2 criterion
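The penalty behind this criterion is visible in the standard formula, where n is the number of observations and p the number of predictors; a small sketch (all numbers made up):

    def adjusted_r2(r2, n, p):
        # Shrinks R-squared for model size, so adding a near-useless
        # predictor can lower the score instead of raising it
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(0.750, n=200, p=7))   # ~0.741
    print(adjusted_r2(0.751, n=200, p=12))  # ~0.735: barely-better fit loses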
Forward Selection
Start with no predictors
Add them one by one (add the one with largest contribution)
Stop when the addition is not statistically significant
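One way to run it, assuming scikit-learn's SequentialFeatureSelector (which adds predictors by cross-validated score rather than by the significance test in the stopping rule above):

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    # Synthetic data: 6 candidate predictors, only the first two matter
    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 6))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=120)

    # direction="forward": start with none, add the best one each step
    sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2,
                                    direction="forward").fit(X, y)
    print(sfs.get_support())  # boolean mask of the selected predictors

The same class with direction="backward" gives backward elimination.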
Backward Elimination
Start with all predictors
Successively eliminate least useful predictors one by one
Stop when all remaining predictors have statistically significant contribution
p value
The probability of seeing a result at least as extreme by chance alone if the predictor truly had no effect; if it is less than 0.05, the predictor is conventionally considered statistically significant
Stepwise
Like Forward Selection
Except at each step, also consider dropping non-significant predictors
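A hand-rolled sketch of the stepwise loop using statsmodels p-values (function name, thresholds, and stopping rule are illustrative choices, with the drop threshold set above the entry threshold to avoid cycling):

    import pandas as pd
    import statsmodels.api as sm

    def stepwise(df, target, candidates, enter=0.05, drop=0.10):
        selected = []
        while True:
            changed = False
            # Forward step: add the most significant remaining candidate
            best_p, best_c = 1.0, None
            for c in (c for c in candidates if c not in selected):
                X = sm.add_constant(df[selected + [c]])
                p = sm.OLS(df[target], X).fit().pvalues[c]
                if p < best_p:
                    best_p, best_c = p, c
            if best_c is not None and best_p < enter:
                selected.append(best_c)
                changed = True
            # Drop step: remove a predictor that has become non-significant
            if selected:
                X = sm.add_constant(df[selected])
                pvals = sm.OLS(df[target], X).fit().pvalues.drop("const")
                worst = pvals.idxmax()
                if pvals[worst] > drop:
                    selected.remove(worst)
                    changed = True
            if not changed:
                return selected

    # e.g. stepwise(df, "y", ["x1", "x2", "x3", "x4"])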