Misc. Flashcards
Describe the characteristics of predictive modeling problems
- (Issue) There is a clearly identified and defined business issue to be addressed
- (Questions) The issue can be addressed with a few well-defined questions
- (Data) Good and useful data are available for answering the questions above
- (Outcomes) The predictions will likely drive actions or increase understanding
- (Better solution) Predictive analytics likely produces a solution better than any existing approach
- (Update) We can continue to monitor and update the models when new data becomes available
How do you produce a meaningful problem definition?
General Strategy: get to the root cause of the business issue and make it specific enough to be solvable
Specific Strategies:
* (Hypotheses) Use prior knowledge of the business problem to ask questions and develop testable hypotheses
* (KPIs) Select appropriate key performance indicators to provide a quantitative basis for measuring success
Define granularity
Granularity refers to how precisely a variable is measured, i.e., the level of detail of the information contained in the variable (see the sketch below)
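As a concrete illustration, here is a minimal pandas sketch (the dates are hypothetical) showing the same information stored at three levels of granularity:

```python
import pandas as pd

# Hypothetical dates, originally recorded at daily granularity.
dates = pd.Series(pd.to_datetime(["2023-01-15", "2023-01-20", "2023-02-03"]))

daily = dates                       # most granular: exact day
monthly = dates.dt.to_period("M")   # coarser: 2023-01, 2023-01, 2023-02
yearly = dates.dt.year              # coarsest: 2023, 2023, 2023
print(monthly.tolist(), yearly.tolist())
```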
What is the goal of exploratory data analysis (EDA)?
The goal is to use descriptive statistics and graphical displays to gain insight into the distribution of each variable on its own and in relation to other variables, especially the target variable
How do you perform EDA?
- Clean the data to make it ready for analysis
- Identify potentially useful predictors
- Generate useful features
- Decide which type of model (GLMs or trees) is more suitable; for a highly non-linear relationship, trees may do better (a pandas sketch of these steps follows)
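A minimal pandas sketch of these steps, assuming a small hypothetical dataset (the DataFrame and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset with a numeric target and two predictors.
df = pd.DataFrame({
    "target": [1.0, 2.5, 3.0, 10.0],
    "x1": [0.5, 1.0, 1.5, 4.0],
    "x2": ["a", "b", "a", "b"],
})

# Clean: drop rows with missing values (one of several possible strategies).
df = df.dropna()

# Descriptive statistics for each variable on its own.
print(df.describe(include="all"))

# A numeric predictor in relation to the target.
print(df[["x1", "target"]].corr())

# A categorical predictor in relation to the target: mean target by level.
print(df.groupby("x2")["target"].mean())
```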
List the common issues for numeric variables
- Right skewness
- Presence of outliers
- Highly correlated predictors
What is the issue with right skewness for numeric variables and what are possible solutions?
Right skewness is a problem because extreme values distort visualizations and exert a disproportionate effect on the model fit.
The solution is to apply a transformation that corrects the skewness and symmetrizes the distribution, which improves the fit of GLMs when the variable serves as a predictor (both transformations are sketched below):
* Log transformation (works only for strictly positive variables)
* Square root transformation (works for non-negative variables)
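A minimal numpy sketch of both transformations on hypothetical right-skewed data:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 50.0])   # right-skewed, strictly positive

log_x = np.log(x)     # log transformation: requires x > 0
sqrt_x = np.sqrt(x)   # square root transformation: requires x >= 0

# A common workaround when x contains zeros is log(1 + x),
# though this changes the interpretation of the transformed values.
log1p_x = np.log1p(x)
print(log_x, sqrt_x, log1p_x)
```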
How can you handle the presence of outliers?
- Remove: If an outlier is not likely to have a material effect on the model, then it’s okay to remove it
- Keep: If the outliers make up only an insignificant proportion of the data, then it’s okay to leave them in the data
- Modify: Modify the outliers to make them more reasonable
- Use robust model forms: fit the model by minimizing the absolute error, rather than the squared error, between predicted and observed values. Absolute error places much less relative weight on large errors, which reduces the impact of outliers on the fitted model (see the sketch below)
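As one illustration of a robust model form (my choice of estimator, not prescribed by the source), scikit-learn's QuantileRegressor with quantile=0.5 fits a linear model by minimizing absolute error; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=100)
y[:5] += 50   # inject a few large outliers

ols = LinearRegression().fit(X, y)                        # minimizes squared error
lad = QuantileRegressor(quantile=0.5, alpha=0).fit(X, y)  # minimizes absolute error

# The absolute-error coefficient is pulled far less toward the outliers.
print(ols.coef_, lad.coef_)
```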
How can you handle highly correlated predictors?
- Drop one of the predictors
- Use PCA to compress the correlated predictors into a few principal components (sketched below)
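A minimal scikit-learn sketch of the PCA approach on two hypothetical, highly correlated predictors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # highly correlated with x1
X = np.column_stack([x1, x2])

# Standardize first so no predictor dominates due to scale.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)              # compress two predictors into one PC
pc1 = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)   # close to 1.0 for these data
```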
List some reasons why a numeric variable should be converted to a factor
- If the variable has a small number of distinct values
- If variable values are merely numeric labels with no sense of numeric order
- If the variable has a complex relationship with the target variable, since factor conversion gives GLMs more flexibility to capture the relationship (the conversion is sketched below)
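In pandas terms (R's factor corresponds roughly to pandas' categorical dtype; the column name below is hypothetical), a minimal sketch of the conversion:

```python
import pandas as pd

# Numeric labels with no sense of numeric order (e.g., region codes).
df = pd.DataFrame({"region_code": [1, 2, 3, 1, 2]})

# Convert to a categorical ("factor") so models treat each level separately.
df["region_code"] = df["region_code"].astype("category")

# For a GLM, the factor is then expanded into dummy variables.
print(pd.get_dummies(df["region_code"], prefix="region"))
```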
List some reasons why a numeric variable should not be converted to a factor
- If the variable has a large number of distinct values, since converting it to a factor would create a dummy variable for each level and greatly inflate the dimension of the data
- If variable values have a sense of numeric order
- If the variable has a simple monotonic relationship with the target variable, since its effect can then be captured by a GLM with a single coefficient and factor conversion is unnecessary
- If future observations will have new variable values
What is the common issue for categorical predictors and how should we handle them?
The common issue for categorical predictors is sparse levels
* Motivation: sparse factor levels (common for a high-dimensional categorical predictor) reduce the robustness of models and cause overfitting
* What to do: combine sparse levels with more populous levels where the target variable behaves similarly to form representative groups
* Trade-off: strikes a balance between ensuring each level has a sufficient number of observations and preserving the differences in the behavior of the target variable among factor levels for prediction (a simple frequency-based sketch follows)
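A minimal pandas sketch that folds sparse levels into an "Other" group using a frequency threshold (the column name and threshold are hypothetical; in practice you would also check that the target behaves similarly across the levels being combined):

```python
import pandas as pd

df = pd.DataFrame({"occupation": ["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2})

counts = df["occupation"].value_counts()
sparse = counts[counts < 5].index   # levels with fewer than 5 observations

# Fold the sparse levels into a single combined level.
df["occupation"] = df["occupation"].where(~df["occupation"].isin(sparse), "Other")
print(df["occupation"].value_counts())
```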
What is the difference between interaction and correlation?
Interaction concerns a 3-way relationship with 1 target variable and 2 predictors: the effect of one predictor on the target depends on the value of the other. Correlation concerns the 2-way relationship between 2 numeric predictors (see the sketch below)
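A minimal sketch of the distinction on synthetic data, using statsmodels' formula interface (my choice of library for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1 + df.x1 + df.x2 + 2 * df.x1 * df.x2 + rng.normal(size=200)

# Correlation: a 2-way relationship between the predictors themselves.
print(df[["x1", "x2"]].corr())

# Interaction: x1's effect on y depends on x2. In the formula interface,
# 'x1 * x2' expands to x1 + x2 + x1:x2 (the interaction term).
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)
```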
Why should we split the data into training data and test data?
- Model performance on the training set tends to be overly optimistic and to favor complex models
- The test set provides a more objective basis for assessing the performance of models on new, unseen data
- The split replicates the way the models will be used in practice (a sketch of the split follows)
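A minimal scikit-learn sketch of the split (the feature matrix and target are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # hypothetical predictors
y = np.arange(10)                  # hypothetical target

# Hold out 25% of the data as a test set for objective evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```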
Why should we use stratified sampling?
To produce training and test sets that are representative with respect to the target variable (not with respect to the predictors); see the sketch below
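For a categorical target, the stratify argument of scikit-learn's train_test_split implements this; a minimal sketch on a hypothetical imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)   # imbalanced binary target (75% / 25%)

# stratify=y preserves the class mix in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())   # both close to 0.25
```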