Chapter 3 - Linear Models Flashcards

1
Q

Types of variables (2; 3.1.1)

A

1) Numeric - take the form of numbers with a well-defined order and associated range. Leads to supervised regression problems.
a) Discrete - restricted to only certain numeric values in that range
b) Continuous - can assume any value in a continuum

2) Categorical - take the form of predefined values in a countable collection of categories (called levels or classes). Leads to supervised classification problems.
a) Binary - can only take two possible levels (Y/N, etc.)
b) Multi-level - can take more than two possible levels (e.g., state)

2
Q

Supervised vs. unsupervised problems (2; 3.1.1)

A

1) Supervised learning problems - the target variable "supervises" the analysis; the goal is to understand the relationship between the target variable and the predictors and/or to make accurate predictions for the target based on the predictors

2) Unsupervised learning problems - No target variable supervising the analysis. Goal is to extract relationships and structures between different variables in the data.

3
Q

The model building process (6; 3.1.2)

A

1) Problem Definition
2) Data Collection and Validation
3) Exploratory Data Analysis - with the use of descriptive statistics and graphical displays, clean the data for incorrect, unreasonable, and inconsistent entries, and understand the characteristics of and key relationships among variables in the data.
4) Model Construction, Evaluation, and Selection
5) Model Validation
6) Model Maintenance

4
Q

Characteristics of predictive modeling problems (6; 3.1.2)

A

1) Issue - there is a clearly defined business issue that needs to be addressed

2) Questions - the issue can be addressed with a few well-defined questions
a) What data do we need?
b) What is the target or outcome?
c) What are the success criteria (how will model performance be evaluated)?

3) Data - Good and useful data is available for answering the questions above

4) Impact - the predictions will likely drive actions or increase understanding

5) Better solution - predictive analytics likely produces a solution better than any existing approach

6) Update - We can continue to monitor and update the models when new data becomes available

5
Q

Defining the problem (2; 3.1.2)

A

1) Hypotheses - use prior knowledge of the business problem to ask questions and develop hypotheses to guide analysis efforts in a clearly defined way.

2) Key performance indicators - to provide a quantitative basis to measure the success of the project

6
Q

Data design considerations (3; 3.1.2)

A

1) Relevance - representative of the environment where our predictive model will operate
a) Population - data source aligns with the true population of interest
b) Time frame - should best reflect the business environment in which the model will be implemented

2) Sampling - process of taking a manageable subset of observations from the data source. Methods include:
a) Random sampling
b) Stratified sampling - dividing the population into strata and randomly sampling a set number of observations from each stratum (see the sketch after this list)

3) Granularity - refers to how precisely a variable in a dataset is measured / how detailed the information contained by the variable is
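
A minimal sketch of stratified sampling using pandas; the DataFrame df and the stratum column "region" are hypothetical, made up only for illustration:

import pandas as pd

# Hypothetical data: each row is an observation; "region" defines the strata
df = pd.DataFrame({
    "region": ["A"] * 50 + ["B"] * 30 + ["C"] * 20,
    "claims": range(100),
})

# Stratified sampling: draw a set number of observations from each stratum
n_per_stratum = 10
sample = (
    df.groupby("region", group_keys=False)
      .apply(lambda g: g.sample(n=n_per_stratum, random_state=42))
)

print(sample["region"].value_counts())  # 10 observations per region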

7
Q

Data quality considerations (3; 3.1.2)

A

1) Reasonableness - data values should be reasonable
2) Consistency - records in the data should be inputted consistently
3) Sufficient documentation - should at least include the following:
a) A description of the dataset overall, including the data source
b) A description of each variable in the data, including its name, definition, and format
c) Notes about any past updates or other irregularities of the dataset
d) A statement of accountability for the correctness of the dataset
e) A description of the governance processes used to manage the dataset

8
Q

Other data issues (3; 3.1.2)

A

1) PII/PHI - Data with PII/PHI should be de-identified and should have sufficient data security protections
2) Variables with legal/ethical concerns - variables with sensitive information or of protected classes may lead to unfair discrimination and raise equity concerns. Care should also be taken with proxy variables of prohibited variables (e.g., occupation may be a proxy for gender)
3) Target leakage - when predictors in a model include information about the target variable that will not be available when the model is applied in practice (e.g., if the target variable is inpatient length of stay, the number of lab procedures may be a predictor, but it will not be known until the inpatient stay concludes)
a) When target leakage occurs, may develop a predictive model for the leaky predictor, then use the predicted value in a separate model to predict the target variable

9
Q

Training/test dataset split (3.1.2)

A

1) Training set (typically 70-80% of the full data) - used for training or developing the predictive model to estimate the signal function and model parameters
2) Test set (typically 20-30% of the full data) - apply the trained model to make a prediction on each observation in the test set to assess the prediction performance
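
A minimal sketch of a 75/25 train/test split with scikit-learn; the synthetic data generated here is purely for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the full dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)

# Hold out 25% of observations as the test set; train on the remaining 75%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (750, 5) (250, 5)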

10
Q

Common performance metrics (3; 3.1.2)

A

1) Square loss - squared difference between observed and predicted values
a) Root Mean Squared Error (RMSE) - square root of the average of all observations' square losses (the sum of square losses divided by the number of observations)

2) Absolute loss - absolute difference between observed and predicted values (used less often than square loss because the absolute value function is not differentiable at zero)
a) Mean Absolute Error (MAE) - average of all observations' absolute losses

3) Zero-one loss - 1 if the predicted and observed values are not equal, 0 if equal (commonly used in classification problems)
a) Classification error rate = average zero-one loss, i.e., the proportion of observations that are misclassified
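
A short numeric sketch of these metrics (the toy values are made up for illustration):

import numpy as np

y_obs  = np.array([3.0, 5.0, 2.5, 7.0])   # observed target values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # predicted values

rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))   # root mean squared error
mae  = np.mean(np.abs(y_obs - y_pred))           # mean absolute error

# Zero-one loss for a classifier: 1 when predicted and observed classes differ
cls_obs  = np.array([1, 0, 1, 1])
cls_pred = np.array([1, 1, 0, 1])
error_rate = np.mean(cls_obs != cls_pred)        # classification error rate

print(round(rmse, 3), round(mae, 3), error_rate)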

11
Q

Cross-validation (3.1.2)

A

Method for tuning hyperparameters (parameters that control some aspect of the fitting process itself) without having to further divide the training set

1) Randomly split the training data into k folds of approximately equal size (k = 10 is the default in many model-fitting functions in R)
2) One of the k folds is left out and the predictive model is fitted to the remaining k-1 folds. The fitted model then predicts for each observation for the left out fold and a performance metric is computed on that fold.
3) Repeat the process for all folds, resulting in k performance metric values.
4) Overall prediction performance of the model can be estimated as the average of the k performance values, known as the CV error.

This technique can be used on each set of hyperparameter values under consideration to select the combination that produces the best model performance.
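
A sketch of using k-fold CV to tune a hyperparameter, here the regularization strength of a ridge model in scikit-learn; the candidate values and the synthetic data are assumptions for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:          # candidate hyperparameter values
    # 10-fold CV: fit on 9 folds, score the held-out fold, repeat for all folds
    scores = cross_val_score(
        Ridge(alpha=alpha), X, y,
        cv=10, scoring="neg_root_mean_squared_error",
    )
    cv_error = -scores.mean()                 # average RMSE over the 10 folds (the CV error)
    print(f"alpha={alpha}: CV RMSE = {cv_error:.2f}")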

12
Q

Considerations for selecting the best model (3; 3.1.2)

A

1) Prediction performance
2) Interpretability - model predictions should be easily explained in terms of the predictors and lead to specific actions or insights
3) Ease of implementation - models should not require prohibitive resources to construct and maintain

13
Q

Model validation techniques (3; 3.1.2)

A

1) Training set - for GLMs, there is a set of model diagnostic tools designed to check the model assumptions based on the training set
2) Test set - compare predicted values and the observed values of the target variable on the test set
3) Compare to an existing, baseline model - use a primitive model to provide a benchmark which any selected model should beat as a minimum requirement

14
Q

Model maintenance steps (5; 3.1.2)

A

1) Adjust the business problem to account for new information
2) Consult with subject matter experts - if there are new findings that don't fit the current understanding of the business problem, or modeling issues that cannot be easily resolved, or to understand limitations on what can reasonably be implemented
3) Gather additional data - gather new observations and retrain model or gather new variables
4) Apply new types of models - when new technology or implementation possibilities are available
5) Refine existing models - try new combinations of predictors, alternative hyperparameter values, alternative performance metrics, etc.

15
Q

Bias/variance trade-off (4; 3.1.3)

A

1) Bias - the difference between the expected value of the estimated signal function and the true signal function
a) the more complex/flexible a model, the lower the bias due to its higher ability to capture the signal in the data
b) Corresponds to accuracy

2) Variance - the amount by which the estimated signal function would change if it were estimated using a different training set
a) the more flexible a model, the higher the variance due to its attunement to the training set
b) Corresponds to precision

3) Irreducible error - variance of noise, which is independent of the predictive model but inherent in the random nature of the target variable

4) Bias/variance trade-off - a more flexible model generally has a lower bias but a higher variance than a less flexible model
a) Underfitting - while a model is underfitted, increasing its flexibility reduces bias faster than it increases variance, so prediction error falls
b) Overfitting - once a model becomes overfitted, additional flexibility increases variance faster than it reduces bias, so prediction error rises
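
These three components add up in the standard decomposition of the expected squared test error at a point x0 (a standard result, stated here for reference):

E[(Y0 - f_hat(x0))^2] = [Bias(f_hat(x0))]^2 + Var(f_hat(x0)) + Var(error)

i.e., bias squared + variance + irreducible error. A more flexible model shifts error from the bias term to the variance term; the irreducible error is unaffected by the choice of model.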

16
Q

Feature generation and feature selection (2; 3.1.4)

A

1) Feature generation is the process of generating new features (i.e., derivatives of raw variables) based on existing variables in the data.
a) Predictive power - transform the data so that a predictive model can better “absorb” the information
b) Interpretability - a new feature can also make a model easier to interpret by transforming the original variables into something more meaningful or interpretable

2) Feature selection is the process of dropping features or variables with limited predictive power and therefore reducing the dimension of the data
a) Predictive power - feature selection is an attempt to control model complexity and prevent overfitting
b) Interpretability - preference for simpler, cleaner (parsimonious) models

17
Q

Common strategies for reducing the dimensionality of a categorical predictor (3; 3.1.4)

A

1) Combining sparse categories with others - categories with very few observations should be folded into more populous categories in which the target variable exhibits a similar behavior

2) Combining similar categories - if the target variable behaves similarly in two categories of a predictor, these categories can be consolidated without losing much information

3) Using the prior knowledge of a categorical variable - e.g., reducing day of week variable into weekday/weekend
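
A minimal sketch of folding sparse categories into a catch-all level with pandas; the variable and its levels are hypothetical, and in practice sparse levels would be merged into a populous level where the target behaves similarly:

import pandas as pd

# Hypothetical categorical predictor with two sparse levels
vehicle = pd.Series(
    ["sedan"] * 60 + ["suv"] * 35 + ["hearse"] * 3 + ["limo"] * 2,
    name="vehicle_type",
)

counts = vehicle.value_counts()
sparse_levels = counts[counts < 5].index              # levels with very few observations

# Fold the sparse levels into a combined "other" category
vehicle_reduced = vehicle.where(~vehicle.isin(sparse_levels), "other")

print(vehicle_reduced.value_counts())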

18
Q

Differences between granularity and dimensionality (2; 3.1.4)

A

1) Applicability - dimensionality is a concept specific to categorical variables, while granularity applies to both numerical and categorical variables

2) Comparability - categorical variables can always be ordered by dimension (number of levels), but variables cannot always be ordered by granularity

19
Q

Linear Model Formulation (2; 3.2.1)

A

1) Model equation: Y = B0 + B1X1 + B2X2 + … + BpXp + E
Y = target variable
X1 - Xp = p predictors
B0 = intercept, the expected value of Y when all predictors equal zero
B1 - Bp = unknown regression coefficients (slopes)
E = unobservable random error term, assumed to follow a normal distribution with zero mean and a common variance

2) Model fitting - most popular way is to estimate the regression coefficients using ordinary least squares (OLS) approach. Select Bj’s to minimize the sum of the squared differences between the observed target values and fitted values.
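
A minimal sketch of fitting a linear model by OLS with statsmodels; the simulated coefficients 1.0, 2.0, and -0.5 are arbitrary values chosen for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulate data from Y = B0 + B1*X1 + B2*X2 + E with normal errors
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# OLS chooses the coefficients that minimize the sum of squared residuals
X_design = sm.add_constant(X)      # prepend a column of 1s for the intercept B0
model = sm.OLS(y, X_design).fit()

print(model.params)                # estimates of B0, B1, B2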

20
Q

Linear model goodness-of-fit measures (2; 3.2.2)

A

1) Residual sum of squares (RSS) - Sum of squares of residuals (from the training set); residual = difference between observed target and fitted value.
a) Absolute goodness-of-fit measure with no upper bound
b) The smaller the RSS, the better the fit of the linear model to the training set

2) Coefficient of determination (R^2)
a) R^2 = 1 - RSS/TSS
b) Total sum of squares (TSS) = sum of squared differences between each observation's target value and the mean of the target values
c) Range of 0 to 1. The higher the R^2, the better the fit of the model to the training set

Because they are computed on the training set, R^2 and RSS will always favor more complex models
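
A quick numeric sketch of RSS, TSS, and R^2 on a handful of toy training observations (the values are made up):

import numpy as np

y_obs = np.array([10.0, 12.0, 9.0, 15.0, 14.0])   # observed target values
y_fit = np.array([10.5, 11.0, 9.5, 14.0, 15.0])   # fitted values from a linear model

rss = np.sum((y_obs - y_fit) ** 2)                 # residual sum of squares
tss = np.sum((y_obs - y_obs.mean()) ** 2)          # total sum of squares
r_squared = 1 - rss / tss                          # coefficient of determination

print(rss, tss, round(r_squared, 3))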

21
Q

Traditional model selection methods: Hypothesis testing (2; 3.2.2)

A

1) t-test - the t-statistic for a particular predictor is the ratio of its associated OLS estimate to the estimated standard deviation (or standard error) of the OLS estimate
a) Measure of the effect of adding the predictor to the model after accounting for the effects of other variables
b) The larger the absolute value of the t-statistic, the stronger the evidence of a linear association between the predictor and the target variable

2) F-test - assesses the joint significance of the entire set of predictors; the null hypothesis is that all slope coefficients are zero, tested against the alternative that at least one of the regression coefficients is non-zero

22
Q

General model selection measures (2; 3.2.2)

A

1) Akaike Information Criterion (AIC) - defined as -2l + 2*(p+1)
l = maximized loglikelihood of the linear model on the training set
p = number of predictors

a) Goodness of fit to the training data is measured by -2l (the higher the loglikelihood l, and hence the lower -2l, the better the fit)
b) Complexity measured by 2(p+1) - the more parameters, the more complex
c) Goal is to minimize the AIC

2) Bayesian Information Criterion (BIC) - defined as -2l + ln(size of training dataset) * (p+1)
a) Uses a heavier complexity penalty: each parameter costs ln(size of training set) rather than 2, so BIC penalizes complexity more than AIC (whenever the training set has at least 8 observations)
b) Same goal, to minimize BIC
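
A small worked sketch of the two formulas; the loglikelihood, predictor count, and training size below are hypothetical numbers:

import numpy as np

loglik = -350.0     # maximized loglikelihood of the fitted model on the training set
p = 4               # number of predictors
n_train = 500       # number of training observations

aic = -2 * loglik + 2 * (p + 1)
bic = -2 * loglik + np.log(n_train) * (p + 1)

print(aic, bic)     # smaller values indicate a better fit/complexity trade-off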

23
Q

Properties of a well-defined linear model (3; 3.2.2)

A

1) No special patterns - the residuals should cluster around zero in a random fashion, both on their own and when plotted against the fitted values
2) Homoscedasticity (constant variance) - The residuals should possess approximately the same variance
3) Normality - The residuals should be approximately normally distributed

24
Q

Linear model plots and interpretation (2; 3.2.2)

A

1) Residuals vs Fitted plot - plots the residuals of the model against fitted values (with a smooth curve superimposed)
a) Residuals should display no prominent patterns and spread symmetrically in either direction
b) Systematic patterns (e.g., a U shape, implying a missing quadratic term in the regression function) or non-uniform spread in the residuals (e.g., a funnel shape, implying that the variance increases or decreases with the fitted values) are symptomatic of an inadequate model specification or of heteroscedasticity (non-constant residual variance)
c) Variance-stabilizing transformations can be applied, such as the log transformation (requires all target values to be positive; a constant can be added first if needed) or the square root transformation (works well if the target variable is non-negative)

2) Normal Q-Q plot - graphs the empirical quantiles of the standardized residuals (residuals divided by their standard error) against the theoretical standard normal quantiles. Can be used for checking the normality of the random errors.
a) Points on the plot are expected to lie closely on the 45 degree line passing through the origin if residuals are normally distributed
b) Systematic departures from that line suggest that the normality assumption is not entirely fulfilled and a distribution with a heavier tail for the target variable is warranted.
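
A sketch of producing both diagnostic plots for a fitted OLS model with statsmodels and matplotlib; the simulated data is for illustration only:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=200)

fit = sm.OLS(y, X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs Fitted: look for random scatter around zero, no funnel or U shape
ax1.scatter(fit.fittedvalues, fit.resid, alpha=0.6)
ax1.axhline(0, color="grey", linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal Q-Q plot: points should lie close to the 45-degree reference line
sm.qqplot(fit.resid, line="45", fit=True, ax=ax2)

plt.tight_layout()
plt.show()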

25
Q

Methods to modify a linear model to handle more complex, non-linear relationships of numeric predictors (3; 3.2.3)

A

1) Polynomial regression - expanding the regression function to higher powers of X
a) i.e. X^2, …, X^m are treated as new, separate features, all with regression coefficients
b) Pros - able to take care of substantially more complex relationships between the target variable and predictors
c) Cons - regression coefficients are more difficult to interpret; no simple way to choose the value of m (although unusual to use m larger than 3 or 4)

2) Binning (using piecewise constant functions) - bin (or band) the numeric variable and convert it into an ordered categorical variable whose levels are defined as non-overlapping intervals over the range of the original variable
a) Pros - Liberates the regression function from assuming any particular shape, allowing the target mean to vary irregularly over the bins. The larger the number of bins used, the wider the variety of relationships between the target variable and the original numeric predictor (and the more flexible the model).
b) Con - no simple rule for how many bins to create or how the associated boundaries should be selected (introduces hyperparameters)
c) Con - results in a loss of information, ignoring the variation of the target variable within each band
d) Con - small changes in X (that result in switching bins) may lead to an abrupt change in the target mean

3) Using piecewise linear functions - i.e., B1X becomes B1X + B2(X - c)+, where (X - c)+ = max(0, X - c) (known as a call payoff or hinge function)
a) The regression function is linear over each of the two intervals (broken at c), but the slope changes at c
b) Pros - simple, but powerful, with less of the drawbacks of binning. Can be easily interpreted.
c) Cons - the break points must be user-specified in advance (hyperparameters) (same problem in all three options)
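
A sketch of generating all three kinds of features from a single numeric predictor with pandas/numpy; the variable, the bin boundaries, and the break point c = 45 are arbitrary choices for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(18, 80, size=200)            # hypothetical numeric predictor, e.g., age

features = pd.DataFrame({"x": x})

# 1) Polynomial terms: X^2 and X^3 become separate features
features["x_sq"] = x ** 2
features["x_cube"] = x ** 3

# 2) Binning: convert x into an ordered categorical with non-overlapping intervals
features["x_band"] = pd.cut(x, bins=[18, 30, 45, 60, 80], include_lowest=True)

# 3) Piecewise linear (call payoff) term with a break point at c = 45
c = 45
features["x_hinge"] = np.maximum(0, x - c)   # (x - c)+ = max(0, x - c)

print(features.head())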

26
Q

Handling categorical predictors in a linear model (2; 3.2.3)

A

1) Binarization - feature generation where a categorical variable is split into a collection of binary dummy variables, each of which serves as an indicator of only one level of the categorical predictor.
a) Advantages - binarizing before feature selection allows individual levels to be retained or dropped, as opposed to an 'all-or-nothing' treatment of the entire categorical predictor
b) Disadvantages - increases computational time, may lead to non-intuitive or nonsensical results if only a handful of levels of a categorical predictor are retained
2) Baseline level - for a categorical variable with r levels, r-1 dummy variables will be generated, with the excluded level becoming the baseline level.
a) The intercept represents the mean of the target variable at the baseline level
b) Baseline level will often be the most populous level or the level that makes the most inherent sense to be included as the default
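
A minimal sketch of binarization with a baseline level using pandas; the predictor "region" and its levels are hypothetical:

import pandas as pd

# Hypothetical categorical predictor with r = 3 levels
df = pd.DataFrame({"region": ["east", "west", "north", "east", "west"]})

# drop_first=True keeps r - 1 dummy variables; the dropped level ("east", the
# first alphabetically) becomes the baseline absorbed into the intercept
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)

print(dummies)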

27
Q

Interactions in a linear model (4; 3.2.3)

A

1) Definition - an interaction arises if the association between one predictor and the target variable depends on the value of another predictor
2) Interactions between two numeric predictors - if X1 and X2 interact, an interaction variable X1X2 can be added with a separate regression coefficient
3) Interactions between numeric and categorical predictors - add the product of the numeric predictor and each dummy variable of the categorical predictor, each with a separate regression coefficient (allowing the slope of the numeric predictor to vary by level)
4) Interactions between two categorical predictors - add the products of the dummy variables of the two predictors, each with a separate regression coefficient
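
A minimal sketch of building interaction features as products of existing columns; the predictors age, income, and smoker are hypothetical:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.uniform(18, 80, size=100),              # numeric predictor
    "income": rng.uniform(20_000, 150_000, size=100),  # numeric predictor
    "smoker": rng.choice([0, 1], size=100),            # dummy for a binary categorical predictor
})

# Numeric x numeric interaction: product of the two predictors
df["age_x_income"] = df["age"] * df["income"]

# Numeric x categorical interaction: product of the numeric predictor with each dummy variable
df["age_x_smoker"] = df["age"] * df["smoker"]

print(df.head())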

28
Q

Collinearity (4; 3.2.3)

A

1) Definition - two variables are collinear if one is approximately a linear function (e.g., a multiple) of the other; more generally, collinearity exists when one predictor is close to a linear combination of other predictors

2) Problems
a) Variance inflation - coefficient estimates may exhibit high variance, which can lead to counter-intuitive, nonsensical model results (e.g., wildly large positive coefficient for one feature and similarly large negative coefficient for another feature)
b) Interpretation of coefficients - the usual interpretation of a coefficient (the effect of one feature with the others held constant) breaks down, because collinear features move together

3) Detecting collinearity - can look at the correlation matrix of the numeric predictors. An element of this matrix that is close to 1 or -1 is an indication that there is a pair of highly correlated predictors.

4) Solutions
a) Delete one of the problematic predictors causing collinearity
b) Pre-process the data using dimension reduction techniques, which combine the collinear predictors into a much smaller number of predictors which are far less correlated
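
A minimal sketch of detecting collinearity via the correlation matrix; x2 is deliberately constructed as nearly a multiple of x1 for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.05, size=300),  # nearly a multiple of x1
    "x3": rng.normal(size=300),                         # unrelated predictor
})

# Entries close to +/-1 flag pairs of highly correlated (collinear) predictors
print(df.corr().round(2))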

29
Q

Linear model feature selection methods (2; 3.2.4)

A

1) Best subset selection - involves fitting a separate linear model for each possible combination of the available features and selecting the model which fares best according to a pre-specified criterion (such as AIC or BIC)
a) Requires 2^p models for p predictors. Therefore infeasible when p >= 20 due to the large search space (2^20 > 1 million possible models)

2) Stepwise selection - stepwise selection algorithms determine the best model from a carefully restricted list of candidate models by sequentially adding or dropping features, one at a time
a) Backward selection - start with the model containing all features and drop the feature whose removal improves the model the most. Continue until removing any remaining feature no longer improves the model.
b) Forward selection - start with the model containing just the intercept and add the feature that improves the model the most. Continue until no feature can be added that improves the model.
c) Forward selection is more likely to get a simpler model because the starting model is much simpler
d) Maximum # of linear models to fit is 1 + (p * (p + 1))/2
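
A rough sketch of forward selection using AIC as the criterion; the simulated predictors and the use of statsmodels are illustrative assumptions, not a prescribed implementation:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 1.0 + 2.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=n)

selected, remaining = [], list(X.columns)
best_aic = sm.OLS(y, np.ones(n)).fit().aic        # start with the intercept-only model

# Add the feature that lowers AIC the most; stop when no addition improves AIC
while remaining:
    trials = {f: sm.OLS(y, sm.add_constant(X[selected + [f]])).fit().aic
              for f in remaining}
    best_feature = min(trials, key=trials.get)
    if trials[best_feature] >= best_aic:
        break
    best_aic = trials[best_feature]
    selected.append(best_feature)
    remaining.remove(best_feature)

print(selected, round(best_aic, 1))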

30
Q

Linear model regularization (5; 3.2.5)

A

1) Definition - alternative to stepwise selection for choosing features and reducing the complexity of a linear model

2) Process - consider a single model hosting all of the potentially useful features and fit the model using unconventional techniques that regularize, or shrink, the coefficient estimates towards zero

3) Formula - goal is to minimize the following formula:
SUM over all training observations i of [ Yi - (B0 + B1*Xi1 + … + Bp*Xip) ]^2 + lambda * fR(B)
lambda >= 0 is the regularization parameter that controls the extent of regularization and quantifies our preference for simpler models
fR(B) is the penalty function that captures the size of the regression coefficients

4) Common choices of penalty function
a) Ridge regression - sum of squares of the slope coefficients (but not intercept!)
b) Lasso - sum of absolute values of the slope coefficients
c) Elastic net - (1-a) * sum of squares + a * sum of absolute values, where a is the mixing coefficient

5) Lasso has the effect of forcing the coefficient estimates to exactly zero when lambda is sufficiently large, whereas coefficients are reduced, but not to exactly zero in ridge regression. Lasso therefore leads to simpler models.
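
A sketch of the three penalty choices with scikit-learn; alpha plays the role of lambda, the alpha = 1.0 values are arbitrary, and note that scikit-learn's elastic net mixing parameter l1_ratio is parametrized slightly differently from the formula above:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=400, n_features=10, n_informative=4,
                       noise=20.0, random_state=5)

models = {
    "ridge": Ridge(alpha=1.0),                           # penalty: sum of squared slopes
    "lasso": Lasso(alpha=1.0),                           # penalty: sum of absolute slopes
    "elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # mix of the two penalties
}

for name, model in models.items():
    # Standardize features before applying the penalty, then fit
    pipe = make_pipeline(StandardScaler(), model).fit(X, y)
    n_zero = (pipe[-1].coef_ == 0).sum()
    print(f"{name}: {n_zero} coefficients shrunk exactly to zero")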

31
Q

Pros and cons of regularization techniques for feature selection (5; 3.2.5)

A

Pros
1) Categorical predictors - via the use of model matrices, penalized regression automatically binarizes categorical predictors, allowing us to assess the significance of individual factor levels, not just the significance of the entire categorical predictor
2) Tuning by CV - tuning hyperparameters by CV is more conducive to picking a model with good prediction performance than using stepwise selection
3) Computationally more efficient than stepwise selection algorithms

Cons
1) Applicability - can’t accommodate all of the distributions for GLMs
2) Interpretability - may not produce the most interpretable model, especially for ridge regression. All numeric features are standardized, making their coefficients slightly less intuitive.

32
Q

Categories of predictive modeling problems (3; 3.1.1)

A

1) Descriptive - descriptive analytics focuses on what happened in the past and aims to describe or explain the observed trends by identifying the relationships between variables in the data

2) Predictive - predictive analytics focuses on what will happen in the future and is concerned with making accurate predictions

3) Prescriptive - prescriptive analytics uses a combination of optimization and simulation to investigate and quantify the impact of prescribed actions in different scenarios

33
Q

Desired properties of key performance indicators (2; 3.1.2)

A

1) Relevance - the KPIs should align with the overall business objective and the interest of the client as closely as possible
2) Measurability - should be easily measurable and provide an objective, quantitative basis to measure the success of the project