Misc. Flashcards

1
Q

Describe the characteristics of predictive modeling problems

A
  • (Issue) There is a clearly identified and defined business issue to be addressed
  • (Questions) The issue can be addressed with a few well-defined questions
  • (Data) Good and useful data are available for answering the questions above
  • (Outcomes) The predictions will likely drive actions or increase understanding
  • (Better solution) Predictive analytics likely produces a solution better than any existing approach
  • (Update) We can continue to monitor and update the models when new data becomes available
2
Q

How do you produce a meaningful problem definition?

A

General Strategy: get to the root cause of the business issue and make it specific enough to be solvable

Specific Strategies:
* (Hypotheses) Use prior knowledge of the business problem to ask questions and develop testable hypotheses
* (KPIs) Select appropriate key performance indicators to provide a quantitative basis for measuring success

3
Q

Define granularity

A

Granularity refers to how precisely a variable is measured, i.e. the level of detail of the information the variable contains

4
Q

What is the goal of exploratory data analysis (EDA)?

A

The goal is to use descriptive statistics and graphical displays to gain insights into the distribution of variables on their own and in relation to one another, especially in relation to the target variable

5
Q

How do you perform EDA?

A
  • Clean the data to make it ready for analysis
  • Identify potentially useful predictors
  • Generate useful features
  • Decide which type of model (GLMs or trees) is more suitable (for highly non-linear relationships, trees may do better)
6
Q

List the common issues for numeric variables

A
  • Right skewness
  • Presence of outliers
  • Highly correlated predictors
7
Q

What is the issue with right skewness for numeric variables and what are possible solutions?

A

The problem with right skewness is that extreme values distort visualizations and exert a disproportionate effect on the model fit.

The solution is to apply a transformation that reduces the right skewness and makes the distribution more symmetric. If the variable serves as a predictor, this can improve the fit of GLMs
* Log transformation (works only for strictly positive variables)
* Square root transformations (works for non-negative variables)
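
As a rough illustration of the two transformations, the sketch below compares the skewness of a small right-skewed sample before and after each one. The `claims` values and the `skewness` helper are made up for this example and are not from any particular dataset or library.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the average cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed, strictly positive data (e.g. claim amounts).
claims = [120, 150, 180, 200, 240, 300, 450, 900, 2500, 12000]

log_claims = [math.log(v) for v in claims]    # requires every value > 0
sqrt_claims = [math.sqrt(v) for v in claims]  # requires every value >= 0

for name, data in [("raw", claims), ("sqrt", sqrt_claims), ("log", log_claims)]:
    print(f"{name:>4}: skewness = {skewness(data):.2f}")
```

On this sample the log transformation reduces skewness more than the square root does, but neither removes it entirely.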

8
Q

How can you handle the presence of outliers?

A
  • Remove: If an outlier is not likely to have a material effect on the model, then it’s okay to remove it
  • Keep: If the outliers make up only an insignificant proportion of the data, then it’s okay to leave them in the data
  • Modify: Modify the outliers to make them more reasonable
  • Use robust model forms: fit models by minimizing the absolute error rather than squared error between predicted values and observed values. This is because absolute error places much less relative weight on the large errors and reduces the impact of outliers on the fitted model
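
To see why absolute error is more robust, consider the simplest possible model, a single constant prediction. The constant that minimizes squared error is the mean, and the constant that minimizes absolute error is the median. The sketch below, with made-up numbers, shows how far each fit moves when one outlier is added.

```python
import statistics

# Hypothetical observations, then the same data with one outlier appended.
data = [10, 11, 12, 13, 14]
with_outlier = data + [100]

# Squared-error fit of a constant model = mean.
mean_fit_shift = statistics.fmean(with_outlier) - statistics.fmean(data)

# Absolute-error fit of a constant model = median.
median_fit_shift = statistics.median(with_outlier) - statistics.median(data)

print(f"shift in squared-error fit (mean):    {mean_fit_shift:.2f}")
print(f"shift in absolute-error fit (median): {median_fit_shift:.2f}")
```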
9
Q

How can you handle highly correlated predictors?

A
  • Drop one of the predictors
  • Use PCA to compress the correlated predictors into a few principal components
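
A minimal sketch of the PCA option for the special case of two standardized predictors, where the result has a closed form: the first principal component is (z1 ± z2)/√2 and it captures a share (1 + |r|)/2 of the total variance, with r the correlation. The data and function names are hypothetical; a real analysis with more predictors would use a PCA routine from a statistics package.

```python
import math
import statistics


def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def first_principal_component(x, y):
    """Return (scores, variance_share) for the first PC of two predictors."""
    zx, zy = standardize(x), standardize(y)
    # Pearson correlation of the two standardized predictors.
    r = statistics.fmean([a * b for a, b in zip(zx, zy)])
    sign = 1.0 if r >= 0 else -1.0
    # Loadings of the first PC are (1, sign) / sqrt(2).
    scores = [(a + sign * b) / math.sqrt(2) for a, b in zip(zx, zy)]
    variance_share = (1 + abs(r)) / 2
    return scores, variance_share


# Hypothetical, highly correlated predictors.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

scores, share = first_principal_component(x, y)
print(f"first PC explains {share:.1%} of the total variance")
```

When the share is close to 100%, the single column of scores can stand in for both original predictors.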
10
Q

List some reasons why a numeric variable should be converted to a factor

A
  • If the variable has a small number of distinct values
  • If variable values are merely numeric labels with no sense of numeric order
  • If the variable has a complex relationship with the target variable. This is because factor conversion gives GLMs more flexibility to capture the relationship
11
Q

List some reasons why a numeric variable should not be converted to a factor

A
  • If the variable has a large number of distinct values. This is because converting it into a factor would create one dummy variable per non-baseline level and inflate the dimension of the data
  • If variable values have a sense of numeric order
  • If the variable has a simple monotonic relationship with the target variable. This is because its effect can be effectively captured by a GLM with a single coefficient and wouldn’t need factor conversion
  • If future observations will have new variable values
12
Q

What is the common issue for categorical predictors and how should we handle them?

A

The common issue for categorical predictors is sparse levels
* Motivation: sparse factor levels (often for a high dimensional categorical predictor) reduce robustness of models and cause overfitting
* What to do: combine sparse levels with more populous levels where the target variable behaves similarly to form representative groups
* Trade-off: strike a balance between ensuring each level has a sufficient number of observations and preserving the differences in the behavior of the target variable among different factor levels for prediction
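
One way the combining step might look in code is sketched below: each sparse level is merged into the populous level whose target mean is closest. The function name, the `min_count` threshold, and the data are all hypothetical, and the sketch assumes at least one level meets the threshold. In practice the target mean of a sparse level is noisy, so judgment about which levels are genuinely similar still matters.

```python
import statistics
from collections import defaultdict


def combine_sparse_levels(levels, targets, min_count):
    """Map each level to itself, or to the populous level with the closest target mean."""
    groups = defaultdict(list)
    for level, target in zip(levels, targets):
        groups[level].append(target)

    means = {level: statistics.fmean(vals) for level, vals in groups.items()}
    populous = [level for level, vals in groups.items() if len(vals) >= min_count]

    mapping = {}
    for level in groups:
        if level in populous:
            mapping[level] = level
        else:
            # Sparse level: merge into the populous level that behaves most similarly.
            mapping[level] = min(populous, key=lambda p: abs(means[p] - means[level]))
    return mapping


# Hypothetical data: level "C" has a single observation.
levels = ["A"] * 5 + ["B"] * 5 + ["C"]
targets = [10, 11, 9, 10, 10, 20, 21, 19, 20, 20, 19]

mapping = combine_sparse_levels(levels, targets, min_count=3)
print(mapping)
```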

13
Q

What is the difference between interaction and correlation?

A

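Interaction concerns a 3-way relationship with 1 target variable and 2 predictors: the effect of one predictor on the target depends on the value of the other predictor. Correlation concerns the relationship between 2 numeric predictors and does not involve the target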

14
Q

Why should we split the data into training data and test data?

A
  • Model performance on the training set tends to be overly optimistic and favor complex models
  • Test set provides a more objective ground for assessing the performance of models on new, unseen data
  • The split replicates the way the models will be used in practice
15
Q

Why should we use stratified sampling?

A

To produce representative training and test sets with respect to the target variable (not with respect to the predictors)
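
A minimal sketch of a stratified split for a binary target: each target class is shuffled and split separately, so both sets keep the same class proportions. The function name, the 25% test fraction, and the data are assumptions for illustration. For a numeric target, a common variant is to stratify on quantile groups of the target instead.

```python
import random
from collections import defaultdict


def stratified_split(targets, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each target class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, t in enumerate(targets):
        by_class[t].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced binary target: 20% positives.
targets = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(targets)

train_rate = sum(targets[i] for i in train_idx) / len(train_idx)
test_rate = sum(targets[i] for i in test_idx) / len(test_idx)
print(f"positive rate: train = {train_rate:.2f}, test = {test_rate:.2f}")
```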

16
Q

Explain how cross validation works

A
  1. Randomly split the training data into k folds of approximately equal size
  2. Train the model on all but one fold and measure the performance on the left-out fold
  3. Repeat with each fold left out in turn to get k performance values
  4. Average to get the overall cross validation error
17
Q

List some common examples of GLMs, and their common distributions and link functions

A
  • Real-valued with a bell-shaped distribution: Normal (Gaussian) distribution and identity link
  • Binary (0/1): Binomial distribution and logit link
  • Count: Poisson distribution and log link
  • Positive, continuous with right skew: Gamma/inverse Gaussian distribution and log link
  • Positive, continuous with a large mass at zero: Tweedie distribution and log link

Gamma and inverse Gaussian require the target variable to be strictly positive. Zero values are not allowed

18
Q

Explain what binning is and the pros and cons

A

Binning refers to converting a numeric variable into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable
* Pros: the coefficients of the dummy variables for different bins are not constrained to follow any order, so the estimated target mean can vary irregularly across bins. This lets a GLM capture non-linear, even non-monotonic, relationships
* Cons: (1) usually no clear choice of the number of bins and the associated boundaries, and (2) results in a loss of information
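
A minimal sketch of binning with left-closed intervals, using the standard library's `bisect` module. The cut points and ages are hypothetical, which also illustrates the first con: the boundaries here are an arbitrary choice.

```python
import bisect


def bin_label(value, boundaries):
    """Return the left-closed interval [lower, upper) containing value.

    boundaries must be sorted interior cut points.
    """
    idx = bisect.bisect_right(boundaries, value)
    lower = "-inf" if idx == 0 else str(boundaries[idx - 1])
    upper = "inf" if idx == len(boundaries) else str(boundaries[idx])
    return f"[{lower}, {upper})"


# Hypothetical cut points for an age variable.
age_cuts = [25, 45, 65]
ages = [18, 25, 44, 65, 80]

labels = [bin_label(a, age_cuts) for a in ages]
for age, label in zip(ages, labels):
    print(age, "->", label)
```

Ages 25 and 44 land in the same bin, so the model can no longer distinguish them. This is the loss of information noted above.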

19
Q

What are some general statements you can make when interpreting GLMs?

A
  • Coefficient estimates capture the effects (magnitude + direction) of features on the target mean
  • p-values express statistical significance of features; the smaller the p-value, the more significant
20
Q

What are some specific statements you can make when interpreting GLMs with log link?

A
  • Numeric case: For a unit increase in a numeric predictor with estimated coefficient β, holding the other predictors fixed, the target mean is multiplied by e^β. Equivalently, the proportional change in the target mean is e^β - 1 (multiply by 100 for the percentage change)
  • Factor case: For a non-baseline level with estimated coefficient β, the estimated mean for the non-baseline level is equal to e^β * the estimated mean for the baseline level
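
A short worked example of both cases, using made-up coefficient values (they are not from any fitted model):

```python
import math

# Hypothetical estimated coefficients from a log-link GLM.
beta_numeric = 0.05   # coefficient of a numeric predictor
beta_level = -0.20    # coefficient of a non-baseline factor level
baseline_mean = 100.0 # estimated target mean at the baseline level

# Numeric case: effect of a unit increase in the predictor.
multiplicative_change = math.exp(beta_numeric)
percent_change = (math.exp(beta_numeric) - 1) * 100

# Factor case: estimated mean at the non-baseline level.
level_mean = baseline_mean * math.exp(beta_level)

print(f"multiplicative change: {multiplicative_change:.4f}")
print(f"percent change:        {percent_change:.2f}%")
print(f"non-baseline mean:     {level_mean:.2f}")
```

Note that a coefficient of 0.05 corresponds to roughly a 5.13% increase, not exactly 5%; the approximation e^β - 1 ≈ β only holds for small β.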
21
Q

What are weights and offsets?

A

Weights and offsets are modeling techniques:
* Offsets: usually used when the target observations are aggregated (summed) over exposure units, so the larger the exposure, the larger the target mean. With a log-link GLM, the log of the exposure enters the linear predictor as the offset, which makes the target mean directly proportional to the exposure
* Weights: usually used when the target observations are averaged over an exposure unit. The larger the exposure, the more precise the observations (which means a lower variance). Observations with a larger weight will play a more important role in the estimation of the model coefficients
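
In symbols, under the usual GLM setup (the notation is an assumption here: E_i is the exposure, η_i the linear predictor without the offset, φ the dispersion parameter, V the variance function, and w_i the weight):

```latex
% Offset with a log link: ln(E_i) is added with its coefficient fixed at 1
\ln \mu_i = \ln E_i + \eta_i
\quad\Longrightarrow\quad
\mu_i = E_i \, e^{\eta_i}

% Weights: a larger weight w_i means a smaller variance for observation i
\operatorname{Var}(Y_i) = \frac{\phi \, V(\mu_i)}{w_i}
```

The first line shows the direct proportionality between the target mean and the exposure; the second shows why heavily weighted observations are treated as more precise.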

22
Q

Explain accuracy vs precision in the context of predictive analytics

A

Accuracy and precision measure different aspects of prediction performance. Bias quantifies the accuracy (when predictions capture the true signal) and variance quantifies the precision (when predictions are concentrated in a small region rather than spread out)
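
The link between the two can be written out with the standard decomposition of the expected squared prediction error at a point x_0. This assumes Y_0 = f(x_0) + ε, where ε has mean zero and is independent of the training data, and where \hat{f} is the fitted model:

```latex
\mathbb{E}\!\left[\left(Y_0 - \hat{f}(x_0)\right)^2\right]
= \underbrace{\left(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\right)^2}_{\text{squared bias (accuracy)}}
+ \underbrace{\operatorname{Var}\!\left(\hat{f}(x_0)\right)}_{\text{variance (precision)}}
+ \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}}
```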