Chapter 3 - Linear Models Flashcards
Types of variables (2; 3.1.1)
1) Numeric - take the form of numbers with a well-defined order and associated range. Leads to supervised regression problems.
a) Discrete - restricted to only certain numeric values in that range
b) Continuous - can assume any value in a continuum
2) Categorical - take the form of predefined values in a countable collection of categories (called levels or classes). Leads to supervised classification problems.
a) Binary - can only take two possible levels (Y/N, etc.)
b) Multi-level - can take more than two possible levels (State, etc.)
Supervised vs. unsupervised problems (2; 3.1.1)
1) Supervised learning problems - target variable "supervises" the analysis; goal is to understand the relationship between the target variable and the predictors and/or to make accurate predictions for the target based on the predictors
2) Unsupervised learning problems - No target variable supervising the analysis. Goal is to extract relationships and structures between different variables in the data.
The model building process (6; 3.1.2)
1) Problem Definition
2) Data Collection and Validation
3) Exploratory Data Analysis - with the use of descriptive statistics and graphical displays, clean the data for incorrect, unreasonable, and inconsistent entries, and understand the characteristics of and key relationships among variables in the data.
4) Model Construction, Evaluation, and Selection
5) Model Validation
6) Model Maintenance
Characteristics of predictive modeling problems (6; 3.1.2)
1) Issue - there is a clearly defined business issue that needs to be addressed
2) Questions - the issue can be addressed with a few well-defined questions
a) What data do we need?
b) What is the target or outcome?
c) What are the success criteria (how will model performance be evaluated)?
3) Data - Good and useful data is available for answering the questions above
4) Impact - the predictions will likely drive actions or increase understanding
5) Better solution - predictive analytics likely produces a solution better than any existing approach
6) Update - We can continue to monitor and update the models when new data becomes available
Defining the problem (2; 3.1.2)
1) Hypotheses - use prior knowledge of the business problem to ask questions and develop hypotheses to guide analysis efforts in a clearly defined way.
2) Key performance indicators - to provide a quantitative basis to measure the success of the project
Data design considerations (3; 3.1.2)
1) Relevance - representative of the environment where our predictive model will operate
a) Population - data source aligns with the true population of interest
b) Time frame - should best reflect the business environment in which the model will be implemented
2) Sampling - process of taking a manageable subset of observations from the data source. Methods include:
a) Random sampling
b) Stratified sampling - dividing the population into strata and randomly sampling a set number of observations from each stratum (see the sketch after this list)
3) Granularity - refers to how precisely a variable in a dataset is measured / how detailed the information contained by the variable is
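A minimal sketch of both sampling methods in base R, assuming a hypothetical data frame dat with a categorical column region to stratify on:

    # Random sampling: draw 1,000 rows without replacement
    random_sample <- dat[sample(nrow(dat), 1000), ]

    # Stratified sampling: draw 200 rows from each level of region
    # (assumes every stratum has at least 200 rows)
    strata <- split(dat, dat$region)
    stratified_sample <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 200), ]))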
Data quality considerations (3; 3.1.2)
1) Reasonableness - data values should be reasonable
2) Consistency - records in the data should be entered consistently (same formats, units, and definitions across records)
3) Sufficient documentation - should at least include the following:
a) A description of the dataset overall, including the data source
b) A description of each variable in the data, including its name, definition, and format
c) Notes about any past updates or other irregularities of the dataset
d) A statement of accountability for the correctness of the dataset
e) A description of the governance processes used to manage the dataset
Other data issues (3; 3.1.2)
1) PII/PHI - Data with PII/PHI should be de-identified and should have sufficient data security protections
2) Variables with legal/ethical concerns - variables containing sensitive information or protected class attributes may lead to unfair discrimination and raise equity concerns. Care should also be taken with proxy variables of prohibited variables (e.g., occupation may be a proxy for gender)
3) Target leakage - when predictors in a model include information about the target variable that will not be available when the model is applied in practice (e.g., with inpatient length of stay (IP LOS) as the target, the number of lab procedures may be a predictor, but it will not be known in practice until the inpatient stay concludes)
a) When target leakage occurs, one may develop a predictive model for the leaky predictor, then use the predicted value in a separate model to predict the target variable, as sketched below
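A minimal sketch of that two-stage workaround in R, with hypothetical names (target los, leaky predictor num_labs, admission-time predictors age and admit_type):

    # Stage 1: model the leaky predictor from information known at admission
    stage1 <- lm(num_labs ~ age + admit_type, data = train)

    # Stage 2: predict the target using the stage-1 prediction, not the actual value
    train$pred_labs <- predict(stage1, newdata = train)
    stage2 <- lm(los ~ age + admit_type + pred_labs, data = train)

    # In practice, only admission-time fields are needed to score a new stay
    new_obs$pred_labs <- predict(stage1, newdata = new_obs)
    predict(stage2, newdata = new_obs)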
Training/test dataset split (3.1.2)
1) Training set (typically 70-80% of the full data) - used for training or developing the predictive model to estimate the signal function and model parameters
2) Test set (typically 20-30% of the full data) - apply the trained model to make a prediction on each observation in the test set to assess the prediction performance
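A minimal sketch of a 70/30 split in base R, assuming a hypothetical data frame dat:

    set.seed(42)  # arbitrary seed, for reproducibility
    idx   <- sample(nrow(dat), size = floor(0.7 * nrow(dat)))
    train <- dat[idx, ]   # 70% used to fit the model
    test  <- dat[-idx, ]  # 30% held out to assess prediction performance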
Common performance metrics (3; 3.1.2)
1) Square loss - squared difference between observed and predicted values
a) Root Mean Squared Error (RMSE) - square root of the average of all observations' square losses
2) Absolute loss - absolute difference between observed and predicted values (used less often than square loss because the absolute value function is not differentiable at zero)
a) Mean Absolute Error (MAE) - average of all observations' absolute losses
3) Zero-one loss - 1 if the predicted and observed values are not equal, 0 if equal (commonly used in classification problems)
a) Classification error rate = proportion of misclassified observations (equivalently, the average zero-one loss)
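In symbols, with n observations, observed values y_i, and predicted values \hat{y}_i:

    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad
    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert, \qquad
    \text{error rate} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{y_i \neq \hat{y}_i\}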
Cross-validation (3.1.2)
Method for tuning hyperparameters (parameters that control some aspect of the fitting process itself) without having to further divide the training set
1) Randomly split the training data into k folds of approximately equal size (k = 10 is the default in many model fitting functions in R)
2) One of the k folds is left out and the predictive model is fitted to the remaining k-1 folds. The fitted model then makes a prediction for each observation in the left-out fold, and a performance metric is computed on that fold.
3) Repeat the process for all folds, resulting in k performance metric values.
4) Overall prediction performance of the model can be estimated as the average of the k performance values, known as the CV error.
This technique can be used on each set of hyperparameter values under consideration to select the combination that produces the best model performance.
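A minimal sketch of k-fold CV in base R for a linear model, with hypothetical names (data frame train, target y) and RMSE as the performance metric:

    k     <- 10
    folds <- sample(rep(1:k, length.out = nrow(train)))  # random fold assignment
    cv_rmse <- sapply(1:k, function(i) {
      fit   <- lm(y ~ ., data = train[folds != i, ])     # fit on the other k-1 folds
      preds <- predict(fit, newdata = train[folds == i, ])
      sqrt(mean((train$y[folds == i] - preds)^2))        # RMSE on the left-out fold
    })
    mean(cv_rmse)  # CV error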
Considerations for selecting the best model (3; 3.1.2)
1) Prediction performance
2) Interpretability - model predictions should be easily explained in terms of the predictors and lead to specific actions or insights
3) Ease of implementation - models should not require prohibitive resources to construct and maintain
Model validation techniques (3; 3.1.2)
1) Training set - for GLMs, there is a set of model diagnostic tools designed to check the model assumptions based on the training set
2) Test set - compare predicted values and the observed values of the target variable on the test set
3) Compare to an existing, baseline model - use a primitive model to provide a benchmark which any selected model should beat as a minimum requirement
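A minimal sketch of techniques 1 and 2 in R, assuming a fitted glm object fit and a held-out data frame test with target y (names hypothetical):

    plot(fit)  # base-R diagnostic plots: residuals vs. fitted, Q-Q, scale-location, leverage
    preds <- predict(fit, newdata = test, type = "response")
    sqrt(mean((test$y - preds)^2))  # test RMSE; a baseline model's test RMSE should be worse (higher)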
Model maintenance steps (5; 3.1.2)
1) Adjust the business problem to account for new information
2) Consult with subject matter experts - if there are new findings that don't fit the current understanding of the business problem, or modeling issues that cannot be easily resolved, or to understand limitations on what can reasonably be implemented
3) Gather additional data - gather new observations and retrain model or gather new variables
4) Apply new types of models - when new technology or implementation possibilities are available
5) Refine existing models - try new combinations of predictors, alternative hyperparameter values, alternative performance metrics, etc.
Bias/variance trade-off (4; 3.1.3)
1) Bias - the difference between the expected value of the estimated signal function and the true signal function
a) the more complex/flexible a model, the lower the bias due to its higher ability to capture the signal in the data
b) Corresponds to accuracy
2) Variance - the amount by which the estimated signal function would change if it were estimated using a different training set
a) the more flexible a model, the higher the variance due to its attunement to the training set
b) Corresponds to precision
3) Irreducible error - variance of noise, which is independent of the predictive model but inherent in the random nature of the target variable
4) Bias/variance trade-off - a more flexible model generally has a lower bias but a higher variance than a less flexible model
a) Underfitting - while a model is underfitted, making it more flexible reduces bias faster than it increases variance, so test error falls
b) Overfitting - once a model starts to become overfitted, variance increases faster than bias drops, so test error rises (see the decomposition below)
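The trade-off follows from the standard decomposition of the expected squared test error at a point x_0, with \hat{f} the estimated signal function, f the true signal function, and \varepsilon the noise:

    \mathbb{E}\bigl[(y_0 - \hat{f}(x_0))^2\bigr]
      = \mathrm{Var}\bigl(\hat{f}(x_0)\bigr)
      + \bigl[\mathbb{E}\hat{f}(x_0) - f(x_0)\bigr]^2
      + \mathrm{Var}(\varepsilon)

The three terms are the variance, the squared bias, and the irreducible error; model flexibility shifts error between the first two terms but cannot reduce the third.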