Misc. Flashcards
Describe the characteristics of predictive modeling problems
- (Issue) There is a clearly identified and defined business issue to be addressed
- (Questions) The issue can be addressed with a few well-defined questions
- (Data) Good and useful data are available for answering the questions above
- (Outcomes) The predictions will likely drive actions or increase understanding
- (Better solution) Predictive analytics likely produces a solution better than any existing approach
- (Update) We can continue to monitor and update the models when new data becomes available
How do you produce a meaningful problem definition?
General Strategy: get to the root cause of the business issue and make it specific enough to be solvable
Specific Strategies:
* (Hypotheses) Use prior knowledge of the business problem to ask questions and develop testable hypotheses
* (KPIs) Select appropriate key performance indicators to provide a quantitative basis for measuring success
Define granularity
Granularity refers to how precisely a variable is measured, i.e., the level of detail of the information contained in the variable (see the sketch below)
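As a concrete illustration, here is a minimal pandas sketch (the dates are hypothetical) showing the same information stored at three levels of granularity:

```python
import pandas as pd

# Hypothetical dates, originally recorded at daily granularity.
dates = pd.Series(pd.to_datetime(["2023-01-15", "2023-01-20", "2023-02-03"]))

daily = dates                       # most granular: exact day
monthly = dates.dt.to_period("M")   # coarser: 2023-01, 2023-01, 2023-02
yearly = dates.dt.year              # coarsest: 2023, 2023, 2023
print(monthly.tolist(), yearly.tolist())
```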
What is the goal of exploratory data analysis (EDA)?
The goal is to use descriptive statistics and graphical displays to gain insight into the distribution of each variable on its own and in relation to other variables, especially the target variable
How do you perform EDA?
- Clean the data to make it ready for analysis
- Identify potentially useful predictors
- Generate useful features
- Decide which type of model (GLMs or trees) is more suitable; for a highly non-linear relationship, trees may do better (a pandas sketch of these steps follows)
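A minimal pandas sketch of these steps, assuming a small hypothetical dataset (the DataFrame and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset with a numeric target and two predictors.
df = pd.DataFrame({
    "target": [1.0, 2.5, 3.0, 10.0],
    "x1": [0.5, 1.0, 1.5, 4.0],
    "x2": ["a", "b", "a", "b"],
})

# Clean: drop rows with missing values (one of several possible strategies).
df = df.dropna()

# Descriptive statistics for each variable on its own.
print(df.describe(include="all"))

# A numeric predictor in relation to the target.
print(df[["x1", "target"]].corr())

# A categorical predictor in relation to the target: mean target by level.
print(df.groupby("x2")["target"].mean())
```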
List the common issues for numeric variables
- Right skewness
- Presence of outliers
- Highly correlated predictors
What is the issue with right skewness for numeric variables and what are possible solutions?
Right skewness is a problem because extreme values distort visualizations and exert a disproportionate effect on the model fit.
The solution is to apply a transformation that corrects the skewness and symmetrizes the distribution, which improves the fit of GLMs when the variable serves as a predictor (both transformations are sketched below):
* Log transformation (works only for strictly positive variables)
* Square root transformation (works for non-negative variables)
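A minimal numpy sketch of both transformations on hypothetical right-skewed data:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 50.0])   # right-skewed, strictly positive

log_x = np.log(x)     # log transformation: requires x > 0
sqrt_x = np.sqrt(x)   # square root transformation: requires x >= 0

# A common workaround when x contains zeros is log(1 + x),
# though this changes the interpretation of the transformed values.
log1p_x = np.log1p(x)
print(log_x, sqrt_x, log1p_x)
```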
How can you handle the presence of outliers?
- Remove: If an outlier is not likely to have a material effect on the model, then it’s okay to remove it
- Keep: If the outliers make up only an insignificant proportion of the data, then it’s okay to leave them in the data
- Modify: Modify the outliers to make them more reasonable
- Use robust model forms: fit the model by minimizing the absolute error, rather than the squared error, between predicted and observed values. Absolute error places much less relative weight on large errors, which reduces the impact of outliers on the fitted model (see the sketch below)
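As one illustration of a robust model form (my choice of estimator, not prescribed by the source), scikit-learn's QuantileRegressor with quantile=0.5 fits a linear model by minimizing absolute error; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=100)
y[:5] += 50   # inject a few large outliers

ols = LinearRegression().fit(X, y)                        # minimizes squared error
lad = QuantileRegressor(quantile=0.5, alpha=0).fit(X, y)  # minimizes absolute error

# The absolute-error coefficient is pulled far less toward the outliers.
print(ols.coef_, lad.coef_)
```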
How can you handle highly correlated predictors?
- Drop one of the predictors
- Use PCA to compress the correlated predictors into a few principal components (sketched below)
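A minimal scikit-learn sketch of the PCA approach on two hypothetical, highly correlated predictors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # highly correlated with x1
X = np.column_stack([x1, x2])

# Standardize first so no predictor dominates due to scale.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)              # compress two predictors into one PC
pc1 = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)   # close to 1.0 for these data
```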
List some reasons why a numeric variable should be converted to a factor
- If the variable has a small number of distinct values
- If variable values are merely numeric labels with no sense of numeric order
- If the variable has a complex relationship with the target variable, since factor conversion gives GLMs more flexibility to capture the relationship (the conversion is sketched below)
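In pandas terms (R's factor corresponds roughly to pandas' categorical dtype; the column name below is hypothetical), a minimal sketch of the conversion:

```python
import pandas as pd

# Numeric labels with no sense of numeric order (e.g., region codes).
df = pd.DataFrame({"region_code": [1, 2, 3, 1, 2]})

# Convert to a categorical ("factor") so models treat each level separately.
df["region_code"] = df["region_code"].astype("category")

# For a GLM, the factor is then expanded into dummy variables.
print(pd.get_dummies(df["region_code"], prefix="region"))
```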
List some reasons why a numeric variable should not be converted to a factor
- If the variable has a large number of distinct values, since converting it to a factor would create a dummy variable for each level and greatly inflate the dimension of the data
- If variable values have a sense of numeric order
- If the variable has a simple monotonic relationship with the target variable, since its effect can then be captured by a GLM with a single coefficient and factor conversion is unnecessary
- If future observations will have new variable values
What is the common issue for categorical predictors and how should we handle them?
The common issue for categorical predictors is sparse levels
* Motivation: sparse factor levels (common for a high-dimensional categorical predictor) reduce the robustness of models and cause overfitting
* What to do: combine sparse levels with more populous levels where the target variable behaves similarly to form representative groups
* Trade-off: strikes a balance between ensuring each level has a sufficient number of observations and preserving the differences in the behavior of the target variable among factor levels for prediction (a simple frequency-based sketch follows)
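A minimal pandas sketch that folds sparse levels into an "Other" group using a frequency threshold (the column name and threshold are hypothetical; in practice you would also check that the target behaves similarly across the levels being combined):

```python
import pandas as pd

df = pd.DataFrame({"occupation": ["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2})

counts = df["occupation"].value_counts()
sparse = counts[counts < 5].index   # levels with fewer than 5 observations

# Fold the sparse levels into a single combined level.
df["occupation"] = df["occupation"].where(~df["occupation"].isin(sparse), "Other")
print(df["occupation"].value_counts())
```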
What is the difference between interaction and correlation?
Interaction concerns a 3-way relationship with 1 target variable and 2 predictors: the effect of one predictor on the target depends on the value of the other. Correlation concerns the 2-way relationship between 2 numeric predictors (see the sketch below)
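A minimal sketch of the distinction on synthetic data, using statsmodels' formula interface (my choice of library for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1 + df.x1 + df.x2 + 2 * df.x1 * df.x2 + rng.normal(size=200)

# Correlation: a 2-way relationship between the predictors themselves.
print(df[["x1", "x2"]].corr())

# Interaction: x1's effect on y depends on x2. In the formula interface,
# 'x1 * x2' expands to x1 + x2 + x1:x2 (the interaction term).
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)
```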
Why should we split the data into training data and test data?
- Model performance on the training set tends to be overly optimistic and to favor complex models
- The test set provides a more objective basis for assessing the performance of models on new, unseen data
- The split replicates the way the models will be used in practice (a sketch of the split follows)
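A minimal scikit-learn sketch of the split (the feature matrix and target are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # hypothetical predictors
y = np.arange(10)                  # hypothetical target

# Hold out 25% of the data as a test set for objective evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```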
Why should we use stratified sampling?
To produce training and test sets that are representative with respect to the target variable (not with respect to the predictors); see the sketch below
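For a categorical target, the stratify argument of scikit-learn's train_test_split implements this; a minimal sketch on a hypothetical imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)   # imbalanced binary target (75% / 25%)

# stratify=y preserves the class mix in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(y_train.mean(), y_test.mean())   # both close to 0.25
```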