Linear Models Flashcards
Explain the difference between descriptive, predictive, and prescriptive modeling
- Descriptive: Focuses on what happened in the past and aims to describe/explain observed trends by identifying relationships between variables
- Predictive: Focuses on what will happen in the future and aims to make accurate predictions
- Prescriptive: Uses a combination of optimization and simulation to investigate and quantify impacts of prescribed actions/decisions to answer “what if” questions
Explain the difference between supervised and unsupervised learning and what their goals are
- Supervised learning: Problems where there is a target variable supervising predictive analysis. Goals are to (1) understand the relationship between the target variable and the predictors and (2) make accurate predictions for the target variable based on the predictors
- Unsupervised learning: Problems where there is no target variable supervising predictive analysis. Goal is to identify relationships, structures, and patterns between different variables in the data
Explain how stratified sampling contributes to a more representative sample than random sampling
Stratified sampling ensures that every stratum is properly represented in the collected data. This is done by dividing the underlying population into non-overlapping groups in a non-random fashion, then randomly sampling a set number of observations from each stratum.
* Oversampling and undersampling –> designed for unbalanced data
* Systematic sampling –> draw observations according to a set pattern to arrive at pre-determined sampled observations (no random mechanism)
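As a minimal sketch (in Python, with made-up record and field names), stratified sampling can be implemented by grouping on the stratum variable and then drawing a fixed number of observations at random from each group:

```python
import random

def stratified_sample(records, stratum_key, n_per_stratum, seed=0):
    """Group records by stratum, then draw a fixed number at random from each.

    `records`, `stratum_key`, and `n_per_stratum` are illustrative names,
    not from any particular library.
    """
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(rec[stratum_key], []).append(rec)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(n_per_stratum, len(group))))
    return sample

# Toy population: 90 urban policyholders, only 10 rural ones.
population = [{"area": "urban", "id": i} for i in range(90)] + \
             [{"area": "rural", "id": i} for i in range(90, 100)]

sample = stratified_sample(population, "area", 5)
# Each stratum contributes exactly 5 observations, so rural records are
# properly represented even though they are only 10% of the population.
```

A purely random sample of size 10 could easily contain zero or one rural record; the stratified draw guarantees each stratum its quota.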
Explain the three data quality issues one should examine in practice
- Reasonableness: Do the key statistics for the variables make sense?
- Consistency: Are the records in the data inputted consistently?
- Sufficient documentation: Can other users easily gain an understanding of different aspects of the data?
Explain the problem with target leakage in predictive analytics
Target leakage is when some predictors in a model leak information about the target variable that will not be available when the model is applied in practice. This causes a problem because these variables cannot serve as predictors in practice, and mistakenly including them would lead to artificially good model performance.
Explain how to use a time variable to make the training/test set split and the advantage of doing so
A time variable can be used to make the training/test split on the basis of time: allocate the older observations to the training set and the more recent observations to the test set. This is useful for evaluating how well a model extrapolates time trends observed in the past to future, unseen years.
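The split described above can be sketched in a few lines of Python (the record fields and the 70/30 proportion are illustrative assumptions):

```python
# Sort by the time variable, then allocate the oldest 70% to training
# and the most recent 30% to testing. Field names are hypothetical.
records = [{"year": y, "loss": 100 + 3 * (y - 2010)} for y in range(2010, 2020)]

records.sort(key=lambda r: r["year"])
cutoff = int(0.7 * len(records))
train, test = records[:cutoff], records[cutoff:]
# Every training year precedes every test year, so evaluating on `test`
# measures how well the fitted model extrapolates forward in time.
```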
Explain what hyperparameters are and why they are important for a predictive model
Hyperparameters = tuning parameters, which are parameters that control some aspect of the fitting process itself. Their values must be set before the fitting begins (they are not estimated from the data) and are typically selected by cross-validation. They are important because they often control the complexity of the fitted model, and therefore its bias-variance trade-off and predictive performance.
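As a concrete sketch, consider ridge regression with a single predictor, where the closed-form slope is Sxy / (Sxx + lambda). The penalty lambda is a hyperparameter: it is fixed before fitting, and the slope is then estimated from the data under that choice (the data values below are made up):

```python
# Ridge regression on one predictor: lambda is a hyperparameter fixed
# *before* fitting; the slope is then estimated given that lambda.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

def ridge_slope(xs, ys, lam):
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / (sxx + lam)   # larger lambda shrinks the slope toward 0

ols = ridge_slope(xs, ys, 0.0)      # lambda = 0 recovers ordinary least squares
shrunk = ridge_slope(xs, ys, 10.0)  # heavier penalty -> smaller slope
```

In practice one would refit this for a grid of lambda values and keep the one with the best cross-validated error.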
Explain the difference between bias and variance in a predictive analytics context
Bias = the difference between the expected value of prediction and the true value of the signal function
* Part of the test error caused by the model not being flexible enough (underfitting)
Variance = the amount of variability of prediction
* Part of the test error caused by the model being too complex (overfitting)
Explain the difference between variables and features in a predictive analytic context
Variables = predictors as they appear in the original dataset, before any transformations
Features = derivations from the original variables to provide a more useful view of the information in the dataset
Explain the difference between dimensionality and granularity
There are two main differences:
1. Applicability: Dimensionality is a concept specific to categorical variables. Granularity applies to both numeric and categorical variables.
2. Comparability: We can always order two categorical variables by dimension (simply count their levels), but it is not always possible to order them by granularity; that requires one variable's levels to be a refinement of the other's.
Explain the problem with RSS and R squared as model selection measures
They are merely goodness-of-fit measures of a linear model on the training data, with no explicit regard for model complexity or predictive performance on unseen data.
Explain the rationale behind and the difference between the AIC and BIC
Both the AIC and BIC can be used as model selection criteria; each trades goodness of fit against model complexity via a penalty term. However, the per-parameter penalty for the BIC (ln n, for n observations) is higher than that for the AIC (2) whenever n ≥ 8, so the BIC tends to result in a simpler final model than the AIC.
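A toy comparison (with made-up log-likelihoods and parameter counts) shows the two criteria disagreeing exactly as described:

```python
import math

# Two hypothetical fits: the larger model (k = 6 parameters) achieves a
# slightly better log-likelihood than the smaller one (k = 3).
n = 200
models = {"small": (3, -452.0), "large": (6, -448.0)}

aic = {name: 2 * k - 2 * ll for name, (k, ll) in models.items()}
bic = {name: k * math.log(n) - 2 * ll for name, (k, ll) in models.items()}

# AIC's penalty per parameter is 2; BIC's is ln(200) ~= 5.3, so BIC
# demands a bigger likelihood gain to justify the extra parameters.
aic_choice = min(aic, key=aic.get)  # "large" -- AIC accepts the complexity
bic_choice = min(bic, key=bic.get)  # "small" -- BIC prefers the simpler model
```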
Explain the advantages and disadvantages of polynomial regression
Pros: Polynomial regression can accommodate substantially more complex relationships between the target variable and the predictors than linear ones: the more polynomial terms included, the more flexible the fit.
Cons: Interpretability and the choice of the degree m. Regression coefficients in polynomial regression are more difficult to interpret. Additionally, there is no simple rule for choosing m, although it can be tuned by cross-validation (CV), and exploratory data analysis (EDA) can also help.
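A brief sketch with numpy: simulate data from a quadratic signal and fit a polynomial with `np.polyfit`, where the degree m is exactly the knob that CV or EDA would tune (the signal coefficients are made up):

```python
import numpy as np

# Data from an assumed quadratic signal y = 1 + 2x - 0.5x^2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.1, 200)

# Fitting with m = 2 is still *linear* regression -- linear in the
# coefficients of the columns 1, x, x^2 of the design matrix.
coefs = np.polyfit(x, y, 2)   # highest power first: [b2, b1, b0]
# With this little noise, the fit recovers roughly [-0.5, 2, 1].
```

Choosing m too small underfits the curvature; choosing it too large chases the noise, which is why m is treated as a hyperparameter.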
Explain the meaning of interaction
An interaction arises if the association between one predictor and the target variable depends on the value/level of another predictor
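In a linear model an interaction is captured by a product term: with y = b0 + b1·x1 + b2·x2 + b3·x1·x2, the slope in x1 is b1 + b3·x2, which depends on x2. A tiny sketch (all coefficients are made up):

```python
# Hypothetical fitted model with an interaction term b3 * x1 * x2.
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=3.0):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# The effect of a one-unit increase in x1 changes with the level of x2:
slope_at_x2_0 = predict(1, 0) - predict(0, 0)   # b1      = 2.0
slope_at_x2_1 = predict(1, 1) - predict(0, 1)   # b1 + b3 = 5.0
```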
Explain how best subset selection works and its limitations
Best subset selection is performed by fitting all p models that contain exactly one predictor (where p is the total number of predictors being considered) and picking the one with the smallest deviance, then fitting all p choose 2 models that contain exactly two predictors and picking the one with the smallest deviance, and so forth. A single best model is then selected from these picked models using a metric such as the AIC. Because every one of the 2^p possible subsets is examined, the procedure is guaranteed to find the global minimum of the selection criterion, but the search space grows exponentially and quickly becomes computationally infeasible as p increases.
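The exhaustive enumeration at the heart of the procedure can be sketched with `itertools.combinations` (the predictor names are hypothetical, and the model fitting/deviance scoring is left abstract):

```python
from itertools import combinations

# Enumerate every non-empty subset of p candidate predictors; a real
# implementation would fit a model to each subset and record its deviance.
predictors = ["age", "bmi", "smoker"]          # hypothetical names, p = 3
subsets = [s for r in range(1, len(predictors) + 1)
           for s in combinations(predictors, r)]

n_models = len(subsets)   # 2**3 - 1 = 7 non-empty subsets
# The count doubles with each added predictor, which is why best subset
# selection becomes infeasible for large p.
```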