Section 1: General Model Building Steps Flashcards

1
Q

What is Descriptive Analytics?

A

Focus: What happened in the past?
Aim: to “describe” or interpret observed trends by identifying relationships between variables

2
Q

What is Predictive Analytics?

A

Focus: What will happen in the future?
Aim: to make accurate “predictions”

3
Q

What is Prescriptive Analytics?

A

Focus: The impacts of different “prescribed” decisions (assumptions)
Aim: to answer the “what if?” and “what is the best course of action?” questions

4
Q

Why do we need relevant data?

A

need to ensure that the data is unbiased (i.e., representative of the environment where the model will operate)

5
Q

What is Random Sampling?

A

“Randomly” draw observations from the underlying population without replacement; each record is equally likely to be sampled
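
A minimal pandas sketch (the DataFrame df and its columns are made-up stand-ins, not from the source):

```python
import pandas as pd

# Hypothetical population data
df = pd.DataFrame({"age": [25, 37, 41, 52, 29, 63],
                   "income": [40, 55, 61, 72, 38, 80]})

# Draw 3 records without replacement; each record is equally likely to be sampled
sample = df.sample(n=3, replace=False, random_state=42)
print(sample)
```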

6
Q

What is Stratified Sampling?

A

Divide the underlying population into a number of non-overlapping “strata” non-randomly, then randomly sample a set number of observations from each stratum (this helps you get a more representative sample)
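
A sketch under the same assumptions (hypothetical data; the stratum column "region" is illustrative):

```python
import pandas as pd

# Hypothetical population divided non-randomly into two strata by region
df = pd.DataFrame({
    "region": ["A"] * 6 + ["B"] * 4,
    "income": [40, 55, 61, 72, 38, 80, 45, 50, 66, 58],
})

# Randomly sample a set number of observations (here 2) from each stratum
sample = df.groupby("region", group_keys=False).sample(n=2, random_state=42)
print(sample)
```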

7
Q

What is a special case of Stratified Sampling?

A

Systematic sampling: draw observations according to a set pattern; no random mechanism controlling which observations are sampled

8
Q

What is Granularity?

A

refers to how precisely a variable is measured (i.e., the level of detail of the information contained by the variable)

9
Q

What are examples of data quality issues? (name at least 3)

A
  1. Reasonableness (ex. variables such as age, time, and income are non-negative)
  2. Consistency (ex. same measurement unit for numeric variables, same coding scheme for categorical variables)
  3. Sufficient Documentation (ex. clear description of each variable)
  4. Personally identifiable info (PII)
  5. Variables with legal/ethical concerns
  6. Target leakage (see the next flashcard)

10
Q

What is Target Leakage?

A

when predictors in a model “leak” information about the target variable that would not be available when the model is deployed in practice

11
Q

Univariate exploration tools (numeric and categorical)

A

Numeric - mean, median, variance, min/max. Visuals: histograms, boxplots

Categorical - class frequencies. Visuals: bar charts
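
A pandas/matplotlib sketch of both cases (the dataset and column names are invented for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with one numeric and one categorical variable
df = pd.DataFrame({
    "income": [40, 55, 61, 72, 38, 80, 45, 300],   # 300 is an outlier
    "region": ["A", "B", "A", "B", "A", "A", "B", "A"],
})

# Numeric: summary statistics, then a histogram and a boxplot
print(df["income"].describe())      # mean, std, min/max, quartiles
df["income"].plot.hist()
plt.show()
df["income"].plot.box()
plt.show()

# Categorical: class frequencies, then a bar chart
print(df["region"].value_counts())
df["region"].value_counts().plot.bar()
plt.show()
```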

12
Q

Bivariate exploration tools (numeric × numeric, numeric × categorical, categorical × categorical)

A

Numeric × Numeric - correlations. Visuals: scatterplots

Numeric × Categorical - mean/median of the numeric variable split by the levels of the categorical variable. Visuals: split boxplots, histograms

Categorical × Categorical - two-way frequency table. Visuals: bar charts
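
A sketch of all three pairings (hypothetical data; column names are illustrative only):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset
df = pd.DataFrame({
    "age":    [25, 37, 41, 52, 29, 63, 45, 33],
    "income": [40, 55, 61, 72, 38, 80, 66, 50],
    "region": ["A", "B", "A", "B", "A", "A", "B", "B"],
    "owner":  ["Y", "N", "Y", "Y", "N", "Y", "N", "Y"],
})

# Numeric x numeric: correlation and a scatterplot
print(df["age"].corr(df["income"]))
df.plot.scatter(x="age", y="income")
plt.show()

# Numeric x categorical: numeric summaries split by level, plus split boxplots
print(df.groupby("region")["income"].agg(["mean", "median"]))
df.boxplot(column="income", by="region")
plt.show()

# Categorical x categorical: two-way frequency table and a bar chart
tab = pd.crosstab(df["region"], df["owner"])
print(tab)
tab.plot.bar()
plt.show()
```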

13
Q

What are the three common data issues for numeric variables?

A
  1. Highly correlated predictors
  2. Skewness (esp. right skewness due to outliers)
  3. Should they be converted to a factor?

14
Q

Highly correlated predictors (problems)

A

a) difficult to separate out the individual effects of different predictors on the target variable
b) For GLMs, coefficient estimates can vary widely in sign and magnitude, making them difficult to interpret

15
Q

Highly correlated predictors (solutions)

A

a) drop one of the strongly correlated predictors
b) Use PCA to compress the correlated predictors into a few PCs
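
Solution (b) might look like the following scikit-learn sketch (the two predictors are simulated; in practice the predictors would usually be standardized first):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated hypothetical predictors
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Compress the two correlated predictors into a single PC
pca = PCA(n_components=1)
pc1 = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # near 1: one PC retains most of the variation
```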

16
Q

Skewness (problems)

A

Extreme values:
a) exert a disproportionate effect on model fit
b) distort visualizations (e.g., axes expanded inordinately to accommodate the outliers)

17
Q

Skewness (solutions)

A

a) transformations to reduce right skewness (log, square root)
b) options to handle outliers (remove, keep, modify, use robust model forms)
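
A numpy sketch of the transformations in (a) (the values are invented; note the domain restrictions):

```python
import numpy as np

# Hypothetical right-skewed values (e.g., claim sizes)
x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])

log_x  = np.log(x)    # log transform (requires strictly positive values)
sqrt_x = np.sqrt(x)   # square-root transform (requires non-negative values)

# Both transformed scales pull in the extreme value
print(log_x.max() / log_x.min(), sqrt_x.max() / sqrt_x.min())
```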

18
Q

A common issue for categorical predictors

A

Sparse levels (reduce the robustness of models and may cause overfitting)
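
One common fix, sketched with a hypothetical pandas Series (the threshold of 5 is arbitrary), is to fold sparse levels into a catch-all "Other" level:

```python
import pandas as pd

# Hypothetical categorical variable with two sparse levels (C and D)
s = pd.Series(["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2)

# Combine levels appearing fewer than 5 times into "Other"
counts = s.value_counts()
sparse_levels = counts[counts < 5].index
s2 = s.where(~s.isin(sparse_levels), "Other")
print(s2.value_counts())
```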

19
Q

What is an interaction?

A

relationship between a predictor and the target variable depends on the value/level of another predictor

20
Q

Regression problems are used when a target is __________.

A

numeric (quantitative)

21
Q

Classification problems are used when a target is __________.

A

categorical (qualitative)

22
Q

metrics on the training set measure

A

the goodness of fit to the training data

23
Q

metrics on the test set measure

A

the prediction performance on new, unseen data

24
Q

What does a loss function do?

A

captures the discrepancy between the actual and predicted values for each observation of the target variable

25
Q

examples of loss functions

A

a) Square loss (most common for numeric targets)
b) Absolute loss
c) Zero-one loss (mostly for categorical targets)
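
A numpy sketch of the three losses (function names and the toy values are mine, not from the source):

```python
import numpy as np

def square_loss(y, y_hat):
    # most common loss for numeric targets
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return np.abs(y - y_hat)

def zero_one_loss(y, y_hat):
    # mostly for categorical targets: 1 if misclassified, 0 if correct
    return (y != y_hat).astype(int)

y, y_hat = np.array([3.0, 1.0, 4.0]), np.array([2.5, 1.5, 4.0])
print(square_loss(y, y_hat).mean(), absolute_loss(y, y_hat).mean())

y_cls, y_cls_hat = np.array(["pos", "neg", "pos"]), np.array(["pos", "pos", "pos"])
print(zero_one_loss(y_cls, y_cls_hat).mean())
```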

26
Q

Confusion matrices

A

table showing prediction versus reference (actual) counts

27
Q

Accuracy

A

proportion of correctly classified obs

28
Q

classification error rate

A

proportion of misclassified obs.

29
Q

Sensitivity

A

proportion of +ve obs. correctly classified as +ve

30
Q

Specificity

A

proportion of -ve obs. correctly classified as -ve

31
Q

Precision

A

proportion of +ve predictions truly belonging to +ve class

32
Q

accuracy weighted average relation

A

accuracy = (n₋/n) × specificity + (n₊/n) × sensitivity, where n₋ and n₊ are the numbers of -ve and +ve observations and n = n₋ + n₊
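
A sketch tying the last few cards together (the confusion-matrix counts are invented): compute each metric from a 2×2 confusion matrix and verify the weighted-average identity numerically.

```python
# Hypothetical confusion-matrix counts (prediction vs. reference)
TP, FN = 30, 10   # +ve observations classified as +ve / -ve
TN, FP = 50, 10   # -ve observations classified as -ve / +ve

n_pos, n_neg = TP + FN, TN + FP
n = n_pos + n_neg

accuracy    = (TP + TN) / n
error_rate  = (FP + FN) / n
sensitivity = TP / n_pos        # +ve obs. correctly classified as +ve
specificity = TN / n_neg        # -ve obs. correctly classified as -ve
precision   = TP / (TP + FP)    # +ve predictions truly belonging to +ve class

# accuracy = (n-/n) * specificity + (n+/n) * sensitivity
assert abs(accuracy - ((n_neg / n) * specificity + (n_pos / n) * sensitivity)) < 1e-12
print(accuracy, error_rate, sensitivity, specificity, precision)
```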

33
Q

common uses of CV

A

a) Model assessment (can assess model performance without a separate test set)
b) Hyperparameter tuning
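
A minimal scikit-learn sketch of both uses (Ridge regression and the bundled diabetes data are stand-ins, not from the source):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_diabetes(return_X_y=True)

# (a) Model assessment: estimate prediction performance via 5-fold CV,
#     without setting aside a separate test set
print(cross_val_score(Ridge(alpha=1.0), X, y, cv=5).mean())

# (b) Hyperparameter tuning: choose alpha by cross-validated performance
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```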

34
Q

Hyperparameters

A

parameters with values supplied in advance; not optimized by the model fitting algorithm

35
Q

considerations when selecting the best model

A

a) prediction performance
b) interpretability
c) ease of implementation

36
Q

Unbalanced data for binary targets (problems)

A

a) the classifier implicitly places more weight on the majority class and tries to fit those observations well, but the minority class may be the +ve class of interest
b) a high accuracy can be deceptive

37
Q

Unbalanced data for binary targets (solution)

A

a) Undersampling - keep all obs. from the minority class but draw fewer obs. from the majority class (con: less data to train on)
b) Oversampling - keep all obs. from the majority class, but draw more obs. (with replacement) from the minority class (con: more data, heavier computational burden)
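
A pandas sketch of both approaches (the 90/10 class split and column name "y" are invented):

```python
import pandas as pd

# Hypothetical unbalanced binary data: 90 -ve obs., 10 +ve obs.
df = pd.DataFrame({"y": ["neg"] * 90 + ["pos"] * 10})
pos, neg = df[df["y"] == "pos"], df[df["y"] == "neg"]

# Undersampling: keep all minority obs., draw fewer majority obs.
under = pd.concat([pos, neg.sample(n=len(pos), random_state=1)])

# Oversampling: keep all majority obs., draw minority obs. with replacement
over = pd.concat([neg, pos.sample(n=len(neg), replace=True, random_state=1)])

print(under["y"].value_counts(), over["y"].value_counts(), sep="\n")
```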

38
Q

Effects of undersampling and oversampling on model results

A

+ve class becomes more prevalent in the balanced data → predicted probabilities for the +ve class will increase → for a fixed cutoff, sensitivity increases but specificity decreases

39
Q

what is overfitting?

A

Model is trying too hard to capture not only the signal, but also the noise specific to the training data

40
Q

What indicates overfitting?

A

Small training error, but large test error

41
Q

What problems come from overfitting?

A

An overfitted model fits the training data well, but does not generalize well to new, unseen data (poor predictions)

42
Q

relationship between complexity, bias, variance, training error and test error.

A

as complexity increases, variance increases, bias decreases, training error decreases, and the test error has a U-shape

43
Q

mathematical definition of bias

A

difference between the expected value of the prediction and the true value

44
Q

mathematical definition of variance

A

amount of variability of the prediction

45
Q

significance of Bias in PA

A

part of the test error caused by the model not being flexible enough to capture the signal (underfitting)

46
Q

significance of Variance in PA

A

part of the test error caused by the model being too complex (overfitting)

47
Q

dimensionality applicability

A

specific to categorical variables

48
Q

granularity applicability

A

applies to both numeric and categorical variables

49
Q

dimensionality comparability

A

two categorical variables can always be ordered by dimension

50
Q

granularity comparability

A

not always possible to order two variables by granularity

51
Q

What is the aim of model validation?

A

to check that the selected model has no obvious deficiencies and the model assumptions are largely satisfied

52
Q

for a “nice” GLM, the deviance residuals should:

A

1) (purely random) have no systematic patterns
2) (homoscedasticity) have approximately constant variance upon standardization
3) (normality) be approximately normal (for most target distributions)
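
These checks might be carried out as in the following statsmodels sketch (the Poisson data and model are simulated purely for illustration):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical Poisson GLM
x = rng.uniform(0, 2, size=200)
y = rng.poisson(np.exp(0.5 + 0.8 * x))
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# 1) and 2): plot deviance residuals against fitted values; look for
# no systematic pattern and roughly constant spread
plt.scatter(fit.fittedvalues, fit.resid_deviance)
plt.axhline(0)
plt.show()

# 3): normality check via a Q-Q plot of the deviance residuals
sm.qqplot(fit.resid_deviance, line="45")
plt.show()
```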

53
Q

what are the two validation methods based on the test set?

A

1) predicted vs actual values of the target - the two sets of values should be close (can check this quantitatively or graphically)
2) benchmark model - show that the recommended model outperforms a benchmark model, if one exists (e.g., intercept-only GLM, purely random classifier), on the test set

54
Q

Next steps after model validation

A

1) Adjust the business problem - changes in external factors may cause initial assumptions to shift, so we modify the business problem to incorporate the new conditions
2) consult with subject matter experts - seek validation of model results from external subject matter experts
3) gather additional data - enlarge the training data with new obs. and/or variables, and retrain the model to improve robustness
4) apply new types of models
5) refine existing models
6) field test proposed model