Chapter 3 - Linear Models Flashcards

1
Q

Types of variables (2; 3.1.1)

A

1) Numeric - take the form of numbers with a well-defined order and associated range. Leads to supervised regression problems.
a) Discrete - restricted to only certain numeric values in that range
b) Continuous - can assume any value in a continuum

2) Categorical - take the form of predefined values in a countable collection of categories (called levels or classes). Leads to supervised classification problems.
a) Binary - can only take two possible levels (Y/N, etc.)
b) Multi-level - can take more than two possible levels (e.g., state)

2
Q

Supervised vs. unsupervised problems (2; 3.1.1)

A

1) Supervised learning problems - the target variable "supervises" the analysis; the goal is to understand the relationship between the target variable and the predictors and/or to make accurate predictions for the target based on the predictors

2) Unsupervised learning problems - No target variable supervising the analysis. Goal is to extract relationships and structures between different variables in the data.

3
Q

The model building process (6; 3.1.2)

A

1) Problem Definition
2) Data Collection and Validation
3) Exploratory Data Analysis - with the use of descriptive statistics and graphical displays, clean the data for incorrect, unreasonable, and inconsistent entries, and understand the characteristics of and key relationships among variables in the data.
4) Model Construction, Evaluation, and Selection
5) Model Validation
6) Model Maintenance

4
Q

Characteristics of predictive modeling problems (6; 3.1.2)

A

1) Issue - there is a clearly defined business issue that needs to be addressed

2) Questions - the issue can be addressed with a few well-defined questions
a) What data do we need?
b) What is the target or outcome?
c) What are the success criteria (how will model performance be evaluated)?

3) Data - Good and useful data is available for answering the questions above

4) Impact - the predictions will likely drive actions or increase understanding

5) Better solution - predictive analytics likely produces a solution better than any existing approach

6) Update - We can continue to monitor and update the models when new data becomes available

5
Q

Defining the problem (2; 3.1.2)

A

1) Hypotheses - use prior knowledge of the business problem to ask questions and develop hypotheses to guide analysis efforts in a clearly defined way.

2) Key performance indicators - to provide a quantitative basis to measure the success of the project

6
Q

Data design considerations (3; 3.1.2)

A

1) Relevance - representative of the environment where our predictive model will operate
a) Population - data source aligns with the true population of interest
b) Time frame - should best reflect the business environment in which the model will be implemented

2) Sampling - process of taking a manageable subset of observations from the data source. Methods include:
a) Random sampling
b) Stratified sampling - dividing the population into strata and randomly sampling a set number of observations from each stratum (see the sketch after this list)

3) Granularity - refers to how precisely a variable in a dataset is measured / how detailed the information contained by the variable is
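
A minimal sketch of stratified sampling using pandas; the DataFrame df and the stratum column "region" are hypothetical, made up only for illustration:

import pandas as pd

# Hypothetical data: each row is an observation; "region" defines the strata
df = pd.DataFrame({
    "region": ["A"] * 50 + ["B"] * 30 + ["C"] * 20,
    "claims": range(100),
})

# Stratified sampling: draw a set number of observations from each stratum
n_per_stratum = 10
sample = (
    df.groupby("region", group_keys=False)
      .apply(lambda g: g.sample(n=n_per_stratum, random_state=42))
)

print(sample["region"].value_counts())  # 10 observations per region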

7
Q

Data quality considerations (3; 3.1.2)

A

1) Reasonableness - data values should be reasonable
2) Consistency - records in the data should be inputted consistently
3) Sufficient documentation - should at least include the following:
a) A description of the dataset overall, including the data source
b) A description of each variable in the data, including its name, definition, and format
c) Notes about any past updates or other irregularities of the dataset
d) A statement of accountability for the correctness of the dataset
e) A description of the governance processes used to manage the dataset

8
Q

Other data issues (3; 3.1.2)

A

1) PII/PHI - Data with PII/PHI should be de-identified and should have sufficient data security protections
2) Variables with legal/ethical concerns - variables with sensitive information or of protected classes may lead to unfair discrimination and raise equity concerns. Care should also be taken with proxy variables of prohibited variables (e.g., occupation may be a proxy for gender)
3) Target leakage - when predictors in a model include information about the target variable that will not be available when the model is applied in practice (e.g., if the target variable is inpatient length of stay, the number of lab procedures may be a predictor, but it will not be known until the inpatient stay concludes)
a) When target leakage occurs, may develop a predictive model for the leaky predictor, then use the predicted value in a separate model to predict the target variable

9
Q

Training/test dataset split (3.1.2)

A

1) Training set (typically 70-80% of the full data) - used for training or developing the predictive model to estimate the signal function and model parameters
2) Test set (typically 20-30% of the full data) - apply the trained model to make a prediction on each observation in the test set to assess the prediction performance
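
A minimal sketch of a 75/25 train/test split with scikit-learn; the synthetic data generated here is purely for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the full dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)

# Hold out 25% of observations as the test set; train on the remaining 75%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (750, 5) (250, 5)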

10
Q

Common performance metrics (3; 3.1.2)

A

1) Square loss - squared difference between observed and predicted values
a) Root Mean Squared Error (RMSE) - square root of the average of all observations' square losses (the sum of square losses divided by the number of observations)

2) Absolute loss - absolute difference between observed and predicted values (used less often than square loss because the absolute value function is not differentiable at zero)
a) Mean Absolute Error (MAE) - average of all observations' absolute losses

3) Zero-one loss - 1 if the predicted and observed values are not equal, 0 if equal (commonly used in classification problems)
a) Classification error rate = average zero-one loss, i.e., the proportion of observations that are misclassified
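
A short numeric sketch of these metrics (the toy values are made up for illustration):

import numpy as np

y_obs  = np.array([3.0, 5.0, 2.5, 7.0])   # observed target values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # predicted values

rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))   # root mean squared error
mae  = np.mean(np.abs(y_obs - y_pred))           # mean absolute error

# Zero-one loss for a classifier: 1 when predicted and observed classes differ
cls_obs  = np.array([1, 0, 1, 1])
cls_pred = np.array([1, 1, 0, 1])
error_rate = np.mean(cls_obs != cls_pred)        # classification error rate

print(round(rmse, 3), round(mae, 3), error_rate)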

11
Q

Cross-validation (3.1.2)

A

Method for tuning hyperparameters (parameters that control some aspect of the fitting process itself) without having to further divide the training set

1) Randomly split the training data into k folds of approximately equal size (k = 10 is the default in many model-fitting functions in R)
2) One of the k folds is left out and the predictive model is fitted to the remaining k-1 folds. The fitted model then predicts for each observation for the left out fold and a performance metric is computed on that fold.
3) Repeat the process for all folds, resulting in k performance metric values.
4) Overall prediction performance of the model can be estimated as the average of the k performance values, known as the CV error.

This technique can be used on each set of hyperparameter values under consideration to select the combination that produces the best model performance.
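
A sketch of using k-fold CV to tune a hyperparameter, here the regularization strength of a ridge model in scikit-learn; the candidate values and the synthetic data are assumptions for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)

for alpha in [0.01, 0.1, 1.0, 10.0]:          # candidate hyperparameter values
    # 10-fold CV: fit on 9 folds, score the held-out fold, repeat for all folds
    scores = cross_val_score(
        Ridge(alpha=alpha), X, y,
        cv=10, scoring="neg_root_mean_squared_error",
    )
    cv_error = -scores.mean()                 # average RMSE over the 10 folds (the CV error)
    print(f"alpha={alpha}: CV RMSE = {cv_error:.2f}")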

12
Q

Considerations for selecting the best model (3; 3.1.2)

A

1) Prediction performance
2) Interpretability - model predictions should be easily explained in terms of the predictors and lead to specific actions or insights
3) Ease of implementation - models should not require prohibitive resources to construct and maintain

13
Q

Model validation techniques (3; 3.1.2)

A

1) Training set - for GLMs, there is a set of model diagnostic tools designed to check the model assumptions based on the training set
2) Test set - compare predicted values and the observed values of the target variable on the test set
3) Compare to an existing, baseline model - use a primitive model to provide a benchmark which any selected model should beat as a minimum requirement

14
Q

Model maintenance steps (5; 3.1.2)

A

1) Adjust the business problem to account for new information
2) Consult with subject matter experts - if there are new findings that don't fit the current understanding of the business problem, or modeling issues that cannot be easily resolved, or to understand limitations on what can reasonably be implemented
3) Gather additional data - gather new observations and retrain model or gather new variables
4) Apply new types of models - when new technology or implementation possibilities are available
5) Refine existing models - try new combinations of predictors, alternative hyperparameter values, alternative performance metrics, etc.

15
Q

Bias/variance trade-off (4; 3.1.3)

A

1) Bias - the difference between the expected value of the estimated signal function and the true signal function
a) the more complex/flexible a model, the lower the bias due to its higher ability to capture the signal in the data
b) Corresponds to accuracy

2) Variance - the amount by which the estimated signal function would change if it were estimated using a different training set
a) the more flexible a model, the higher the variance due to its attunement to the training set
b) Corresponds to precision

3) Irreducible error - variance of noise, which is independent of the predictive model but inherent in the random nature of the target variable

4) Bias/variance trade-off - a more flexible model generally has a lower bias but a higher variance than a less flexible model
a) Underfitting - while a model is underfitted, increasing its flexibility reduces bias faster than it increases variance, so prediction error falls
b) Overfitting - once a model becomes overfitted, additional flexibility increases variance faster than it reduces bias, so prediction error rises
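
These three components add up in the standard decomposition of the expected squared test error at a point x0 (a standard result, stated here for reference):

E[(Y0 - f_hat(x0))^2] = [Bias(f_hat(x0))]^2 + Var(f_hat(x0)) + Var(error)

i.e., bias squared + variance + irreducible error. A more flexible model shifts error from the bias term to the variance term; the irreducible error is unaffected by the choice of model.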

16
Q

Feature generation and feature selection (2; 3.1.4)

A

1) Feature generation is the process of generating new features (i.e., derivatives of raw variables) based on existing variables in the data.
a) Predictive power - transform the data so that a predictive model can better “absorb” the information
b) Interpretability - a new feature can also make a model easier to interpret by transforming the original variables into something more meaningful or interpretable

2) Feature selection is the process of dropping features or variables with limited predictive power and therefore reducing the dimension of the data
a) Predictive power - feature selection is an attempt to control model complexity and prevent overfitting
b) Interpretability - preference for simpler, cleaner (parsimonious) models

17
Q

Common strategies for reducing the dimensionality of a categorical predictor (3; 3.1.4)

A

1) Combining sparse categories with others - categories with very few observations should be folded into more populous categories in which the target variable exhibits a similar behavior

2) Combining similar categories - if the target variable behaves similarly in two categories of a predictor, these categories can be consolidated without losing much information

3) Using the prior knowledge of a categorical variable - e.g., reducing day of week variable into weekday/weekend
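
A minimal sketch of folding sparse categories into a catch-all level with pandas; the variable and its levels are hypothetical, and in practice sparse levels would be merged into a populous level where the target behaves similarly:

import pandas as pd

# Hypothetical categorical predictor with two sparse levels
vehicle = pd.Series(
    ["sedan"] * 60 + ["suv"] * 35 + ["hearse"] * 3 + ["limo"] * 2,
    name="vehicle_type",
)

counts = vehicle.value_counts()
sparse_levels = counts[counts < 5].index              # levels with very few observations

# Fold the sparse levels into a combined "other" category
vehicle_reduced = vehicle.where(~vehicle.isin(sparse_levels), "other")

print(vehicle_reduced.value_counts())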

18
Q

Differences between granularity and dimensionality (2; 3.1.4)

A

1) Applicability - dimensionality is a concept specific to categorical variables, while granularity applies to both numerical and categorical variables

2) Comparability - categorical variables can always be ordered by dimension (number of levels), but variables cannot always be ordered by granularity

19
Q

Linear Model Formulation (2; 3.2.1)

A

1) Model equation: Y = B0 + B1X1 + B2X2 + … + BpXp + E
Y = target variable
X1 - Xp = p predictors
B0 = intercept, the expected value of Y when all predictors equal zero
B1 - Bp = unknown regression coefficients (slopes)
E = unobservable random error term, assumed to follow a normal distribution with zero mean and a common variance

2) Model fitting - most popular way is to estimate the regression coefficients using ordinary least squares (OLS) approach. Select Bj’s to minimize the sum of the squared differences between the observed target values and fitted values.
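
A minimal sketch of fitting a linear model by OLS with statsmodels; the simulated coefficients 1.0, 2.0, and -0.5 are arbitrary values chosen for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulate data from Y = B0 + B1*X1 + B2*X2 + E with normal errors
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# OLS chooses the coefficients that minimize the sum of squared residuals
X_design = sm.add_constant(X)      # prepend a column of 1s for the intercept B0
model = sm.OLS(y, X_design).fit()

print(model.params)                # estimates of B0, B1, B2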

20
Q

Linear model goodness-of-fit measures (2; 3.2.2)

A

1) Residual sum of squares (RSS) - Sum of squares of residuals (from the training set); residual = difference between observed target and fitted value.
a) Absolute goodness-of-fit measure with no upper bound
b) The smaller the RSS, the better the fit of the linear model to the training set

2) Coefficient of determination (R^2)
a) R^2 = 1 - RSS/TSS
b) Total sum of squares (TSS) = sum of squared differences between each observation's target value and the mean of the target values
c) Range of 0 to 1. The higher the R^2, the better the fit of the model to the training set

Because they are computed on the training set, R^2 and RSS will always favor more complex models
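
A quick numeric sketch of RSS, TSS, and R^2 on a handful of toy training observations (the values are made up):

import numpy as np

y_obs = np.array([10.0, 12.0, 9.0, 15.0, 14.0])   # observed target values
y_fit = np.array([10.5, 11.0, 9.5, 14.0, 15.0])   # fitted values from a linear model

rss = np.sum((y_obs - y_fit) ** 2)                 # residual sum of squares
tss = np.sum((y_obs - y_obs.mean()) ** 2)          # total sum of squares
r_squared = 1 - rss / tss                          # coefficient of determination

print(rss, tss, round(r_squared, 3))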

21
Q

Traditional model selection methods: Hypothesis testing (2; 3.2.2)

A

1) t-test - the t-statistic for a particular predictor is the ratio of its associated OLS estimate to the estimated standard deviation (or standard error) of the OLS estimate
a) Measure of the effect of adding the predictor to the model after accounting for the effects of other variables
b) The larger the absolute value of the t-statistic, the stronger the evidence of a linear association between the predictor and the target variable

2) F-test - assesses the joint significance of the entire set of predictors; the null hypothesis is that all slope coefficients are zero, tested against the alternative that at least one of the regression coefficients is non-zero

22
Q

General model selection measures (2; 3.2.2)

A

1) Akaike Information Criterion (AIC) - defined as -2l + 2*(p+1)
l = maximized loglikelihood of the linear model on the training set
p = number of predictors

a) Goodness of fit to the training data is measured by -2l (the higher the loglikelihood l, and hence the lower -2l, the better the fit)
b) Complexity measured by 2(p+1) - the more parameters, the more complex
c) Goal is to minimize the AIC

2) Bayesian Information Criterion (BIC) - defined as -2l + ln(size of training dataset) * (p+1)
a) Uses a heavier complexity penalty: each parameter costs ln(size of training set) rather than 2, so BIC penalizes complexity more than AIC (whenever the training set has at least 8 observations)
b) Same goal, to minimize BIC
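
A small worked sketch of the two formulas; the loglikelihood, predictor count, and training size below are hypothetical numbers:

import numpy as np

loglik = -350.0     # maximized loglikelihood of the fitted model on the training set
p = 4               # number of predictors
n_train = 500       # number of training observations

aic = -2 * loglik + 2 * (p + 1)
bic = -2 * loglik + np.log(n_train) * (p + 1)

print(aic, bic)     # smaller values indicate a better fit/complexity trade-off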

23
Q

Properties of a well-defined linear model (3; 3.2.2)

A

1) No special patterns - the residuals should cluster around zero in a random fashion, both on their own and when plotted against the fitted values
2) Homoscedasticity (constant variance) - The residuals should possess approximately the same variance
3) Normality - The residuals should be approximately normally distributed

24
Q

Linear model plots and interpretation (2; 3.2.2)

A

1) Residuals vs Fitted plot - plots the residuals of the model against fitted values (with a smooth curve superimposed)
a) Residuals should display no prominent patterns and spread symmetrically in either direction
b) Systematic patterns (e.g., a U shape, implying a missing quadratic term in the regression function) or non-uniform spread in the residuals (e.g., a funnel shape, implying that the variance increases or decreases with the fitted values) are symptomatic of an inadequate model specification or of heteroscedasticity (non-constant residual variance)
c) Variance-stabilizing transformations can be applied, such as the log transformation (requires all target values to be positive; a constant can be added first if needed) or the square root transformation (works well if the target variable is non-negative)

2) Normal Q-Q plot - graphs the empirical quantiles of the standardized residuals (residuals divided by their standard error) against the theoretical standard normal quantiles. Can be used for checking the normality of the random errors.
a) Points on the plot are expected to lie closely on the 45 degree line passing through the origin if residuals are normally distributed
b) Systematic departures from that line suggest that the normality assumption is not entirely fulfilled and a distribution with a heavier tail for the target variable is warranted.
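
A sketch of producing both diagnostic plots for a fitted OLS model with statsmodels and matplotlib; the simulated data is for illustration only:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=200)

fit = sm.OLS(y, X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs Fitted: look for random scatter around zero, no funnel or U shape
ax1.scatter(fit.fittedvalues, fit.resid, alpha=0.6)
ax1.axhline(0, color="grey", linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal Q-Q plot: points should lie close to the 45-degree reference line
sm.qqplot(fit.resid, line="45", fit=True, ax=ax2)

plt.tight_layout()
plt.show()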

25
Q

Methods to modify a linear model to handle more complex, non-linear relationships of numeric predictors (3; 3.2.3)

A

1) Polynomial regression - expanding the regression function to higher powers of X
a) i.e. X^2, …, X^m are treated as new, separate features, all with regression coefficients
b) Pros - able to take care of substantially more complex relationships between the target variable and predictors
c) Cons - regression coefficients are more difficult to interpret; no simple way to choose the value of m (although unusual to use m larger than 3 or 4)

2) Binning (using piecewise constant functions) - bin (or band) the numeric variable and convert it into an ordered categorical variable whose levels are defined as non-overlapping intervals over the range of the original variable
a) Pros - Liberates the regression function from assuming any particular shape, allowing the target mean to vary irregularly over the bins. The larger the number of bins used, the wider the variety of relationships between the target variable and the original numeric predictor (and the more flexible the model).
b) Con - no simple rule for how many bins to create or how the associated boundaries should be selected (introduces hyperparameters)
c) Con - results in a loss of information, ignoring the variation of the target variable within each band
d) Con - small changes in X (that result in switching bins) may lead to an abrupt change in the target mean

3) Using piecewise linear functions - i.e., B1X becomes B1X + B2(X - c)+, where (X - c)+ = max(0, X - c) (known as a call payoff or hinge function)
a) The regression function is linear over each of the two intervals (broken at c), but the slope changes at c
b) Pros - simple, but powerful, with less of the drawbacks of binning. Can be easily interpreted.
c) Cons - the break points must be user-specified in advance (hyperparameters) (same problem in all three options)
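
A sketch of generating all three kinds of features from a single numeric predictor with pandas/numpy; the variable, the bin boundaries, and the break point c = 45 are arbitrary choices for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(18, 80, size=200)            # hypothetical numeric predictor, e.g., age

features = pd.DataFrame({"x": x})

# 1) Polynomial terms: X^2 and X^3 become separate features
features["x_sq"] = x ** 2
features["x_cube"] = x ** 3

# 2) Binning: convert x into an ordered categorical with non-overlapping intervals
features["x_band"] = pd.cut(x, bins=[18, 30, 45, 60, 80], include_lowest=True)

# 3) Piecewise linear (call payoff) term with a break point at c = 45
c = 45
features["x_hinge"] = np.maximum(0, x - c)   # (x - c)+ = max(0, x - c)

print(features.head())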

26
Q

Handling categorical predictors in a linear model (2; 3.2.3)

A

1) Binarization - feature generation where a categorical variable is split into a collection of binary dummy variables, each of which serves as an indicator of only one level of the categorical predictor.
a) Advantages - binarizing before feature selection allows individual levels to be retained or dropped, as opposed to an 'all-or-nothing' treatment of the entire categorical predictor
b) Disadvantages - increases computational time, may lead to non-intuitive or nonsensical results if only a handful of levels of a categorical predictor are retained
2) Baseline level - for a categorical variable with r levels, r-1 dummy variables will be generated, with the excluded level becoming the baseline level.
a) The intercept represents the mean of the target variable at the baseline level
b) Baseline level will often be the most populous level or the level that makes the most inherent sense to be included as the default
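
A minimal sketch of binarization with a baseline level using pandas; the predictor "region" and its levels are hypothetical:

import pandas as pd

# Hypothetical categorical predictor with r = 3 levels
df = pd.DataFrame({"region": ["east", "west", "north", "east", "west"]})

# drop_first=True keeps r - 1 dummy variables; the dropped level ("east", the
# first alphabetically) becomes the baseline absorbed into the intercept
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)

print(dummies)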

27
Q

Interactions in a linear model (4; 3.2.3)

A

1) Definition - an interaction arises if the association between one predictor and the target variable depends on the value of another predictor
2) Interactions between two numeric predictors - if X1 and X2 interact, an interaction variable X1X2 can be added with a separate regression coefficient
3) Interactions between numeric and categorical predictors - add the product of the numeric predictor and each dummy variable of the categorical predictor, each with a separate regression coefficient (allowing the slope of the numeric predictor to vary by level)
4) Interactions between two categorical predictors - add the products of the dummy variables of the two predictors, each with a separate regression coefficient
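
A minimal sketch of building interaction features as products of existing columns; the predictors age, income, and smoker are hypothetical:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.uniform(18, 80, size=100),              # numeric predictor
    "income": rng.uniform(20_000, 150_000, size=100),  # numeric predictor
    "smoker": rng.choice([0, 1], size=100),            # dummy for a binary categorical predictor
})

# Numeric x numeric interaction: product of the two predictors
df["age_x_income"] = df["age"] * df["income"]

# Numeric x categorical interaction: product of the numeric predictor with each dummy variable
df["age_x_smoker"] = df["age"] * df["smoker"]

print(df.head())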

28
Q

Collinearity (4; 3.2.3)

A

1) Definition - two variables are collinear if one is approximately a linear function (e.g., a multiple) of the other; more generally, collinearity exists when one predictor is close to a linear combination of other predictors

2) Problems
a) Variance inflation - coefficient estimates may exhibit high variance, which can lead to counter-intuitive, nonsensical model results (e.g., wildly large positive coefficient for one feature and similarly large negative coefficient for another feature)
b) Interpretation of coefficients - the usual interpretation of a coefficient (the effect of one feature with the others held constant) breaks down, because collinear features move together

3) Detecting collinearity - can look at the correlation matrix of the numeric predictors. An element of this matrix that is close to 1 or -1 is an indication that there is a pair of highly correlated predictors.

4) Solutions
a) Delete one of the problematic predictors causing collinearity
b) Pre-process the data using dimension reduction techniques, which combine the collinear predictors into a much smaller number of predictors which are far less correlated
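
A minimal sketch of detecting collinearity via the correlation matrix; x2 is deliberately constructed as nearly a multiple of x1 for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.05, size=300),  # nearly a multiple of x1
    "x3": rng.normal(size=300),                         # unrelated predictor
})

# Entries close to +/-1 flag pairs of highly correlated (collinear) predictors
print(df.corr().round(2))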

29
Q

Linear model feature selection methods (2; 3.2.4)

A

1) Best subset selection - involves fitting a separate linear model for each possible combination of the available features and selecting the model which fares best according to a pre-specified criterion (such as AIC or BIC)
a) Requires 2^p models for p predictors. Therefore infeasible when p >= 20 due to the large search space (2^20 > 1 million possible models)

2) Stepwise selection - stepwise selection algorithms determine the best model from a carefully restricted list of candidate models by sequentially adding or dropping features, one at a time
a) Backward selection - start with the model containing all features and drop the feature whose removal improves the model the most. Continue until removing any remaining feature no longer improves the model.
b) Forward selection - start with the model containing just the intercept and add the feature that improves the model the most. Continue until no feature can be added that improves the model.
c) Forward selection is more likely to get a simpler model because the starting model is much simpler
d) Maximum # of linear models to fit is 1 + (p * (p + 1))/2
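
A rough sketch of forward selection using AIC as the criterion; the simulated predictors and the use of statsmodels are illustrative assumptions, not a prescribed implementation:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(1, 6)])
y = 1.0 + 2.0 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=n)

selected, remaining = [], list(X.columns)
best_aic = sm.OLS(y, np.ones(n)).fit().aic        # start with the intercept-only model

# Add the feature that lowers AIC the most; stop when no addition improves AIC
while remaining:
    trials = {f: sm.OLS(y, sm.add_constant(X[selected + [f]])).fit().aic
              for f in remaining}
    best_feature = min(trials, key=trials.get)
    if trials[best_feature] >= best_aic:
        break
    best_aic = trials[best_feature]
    selected.append(best_feature)
    remaining.remove(best_feature)

print(selected, round(best_aic, 1))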

30
Q

Linear model regularization (5; 3.2.5)

A

1) Definition - alternative to stepwise selection for choosing features and reducing the complexity of a linear model

2) Process - consider a single model hosting all of the potentially useful features and fit the model using unconventional techniques that regularize, or shrink, the coefficient estimates towards zero

3) Formula - goal is to minimize the following formula:
SUM over all training observations i of [ Yi - (B0 + B1*Xi1 + … + Bp*Xip) ]^2 + lambda * fR(B)
lambda >= 0 is the regularization parameter that controls the extent of regularization and quantifies our preference for simpler models
fR(B) is the penalty function that captures the size of the regression coefficients

4) Common choices of penalty function
a) Ridge regression - sum of squares of the slope coefficients (but not intercept!)
b) Lasso - sum of absolute values of the slope coefficients
c) Elastic net - (1-a) * sum of squares + a * sum of absolute values, where a is the mixing coefficient

5) Lasso has the effect of forcing the coefficient estimates to exactly zero when lambda is sufficiently large, whereas coefficients are reduced, but not to exactly zero in ridge regression. Lasso therefore leads to simpler models.
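
A sketch of the three penalty choices with scikit-learn; alpha plays the role of lambda, the alpha = 1.0 values are arbitrary, and note that scikit-learn's elastic net mixing parameter l1_ratio is parametrized slightly differently from the formula above:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=400, n_features=10, n_informative=4,
                       noise=20.0, random_state=5)

models = {
    "ridge": Ridge(alpha=1.0),                           # penalty: sum of squared slopes
    "lasso": Lasso(alpha=1.0),                           # penalty: sum of absolute slopes
    "elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # mix of the two penalties
}

for name, model in models.items():
    # Standardize features before applying the penalty, then fit
    pipe = make_pipeline(StandardScaler(), model).fit(X, y)
    n_zero = (pipe[-1].coef_ == 0).sum()
    print(f"{name}: {n_zero} coefficients shrunk exactly to zero")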

31
Q

Pros and cons of regularization techniques for feature selection (5; 3.2.5)

A

Pros
1) Categorical predictors - via the use of model matrices, penalized regression automatically binarizes categorical predictors, allowing us to assess the significance of individual factor levels, not just the significance of the entire categorical predictor
2) Tuning by CV - tuning hyperparameters by CV is more conducive to picking a model with good prediction performance than using stepwise selection
3) Computationally more efficient than stepwise selection algorithms

Cons
1) Applicability - can’t accommodate all of the distributions for GLMs
2) Interpretability - may not produce the most interpretable model, especially for ridge regression. All numeric features are standardized, making their coefficients slightly less intuitive.

32
Q

Categories of predictive modeling problems (3; 3.1.1)

A

1) Descriptive - descriptive analytics focuses on what happened in the past and aims to describe or explain the observed trends by identifying the relationships between variables in the data

2) Predictive - predictive analytics focuses on what will happen in the future and is concerned with making accurate predictions

3) Prescriptive - prescriptive analytics uses a combination of optimization and simulation to investigate and quantify the impact of prescribed actions in different scenarios

33
Q

Desired properties of key performance indicators (2; 3.1.2)

A

1) Relevance - the KPIs should align with the overall business objective and the interest of the client as closely as possible
2) Measurability - should be easily measurable and provide an objective, quantitative basis to measure the success of the project