Section 1: General Model Building Steps Flashcards
What is Descriptive Analytics?
Focus: What happened in the past?
Aim: to “describe” or interpret observed trends by identifying relationships between variables
What is Predictive Analytics?
Focus: What will happen in the future?
Aim: to make accurate “predictions”
What is Prescriptive Analytics?
Focus: The impacts of different “prescribed” decisions (assumptions)
Aim: to answer the “what if?” and “what is the best course of action?” questions
Why do we need relevant data?
need to ensure that the data is unbiased (i.e., representative of the environment where the model will operate)
What is Random Sampling?
“Randomly” draw observations from the underlying population without replacement; each record is equally likely to be sampled
What is Stratified Sampling?
Divide the underlying population into a number of non-overlapping “strata” non-randomly, then randomly sample a set number of observations from each stratum (this helps you get a more representative sample)
What is a special case of Stratified Sampling?
Systematic sampling: draw observations according to a set pattern; no random mechanism controlling which observations are sampled
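A minimal sketch of both sampling schemes in Python, assuming pandas and scikit-learn; the DataFrame, the region stratum, and the sample sizes are all hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical population: 10,000 records, 80% urban / 20% rural
df = pd.DataFrame({
    "region": ["urban"] * 8000 + ["rural"] * 2000,
    "income": range(10000),
})

# Random sampling: draw without replacement; each record equally likely
random_sample = df.sample(n=1000, replace=False, random_state=1)

# Stratified sampling: preserve the urban/rural proportions exactly
strat_sample, _ = train_test_split(
    df, train_size=1000, stratify=df["region"], random_state=1
)

print(random_sample["region"].value_counts(normalize=True))
print(strat_sample["region"].value_counts(normalize=True))
```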
What is Granularity?
refers to how precisely a variable is measured (i.e., the level of detail for the information contained by the variable)
What are examples of data quality issues? (name at least 3)
- Reasonableness (ex. variables such as age, time, and income are non-negative)
- Consistency (ex. same measurement unit for numeric variables, same coding scheme for categorical variables)
- Sufficient Documentation (ex. clear description of each variable)
- Personally identifiable info (PII)
- Variables with legal/ethical concerns
- Target Leakage (on other flashcard)
what is Target Leakage?
when predictors in a model “leak” information about the target variable that would not be available when the model is deployed in practice
Univariate exploration tools (numeric and categorical)
Numeric - mean, median, variance, min/max. visuals: histograms, boxplots
Categorical - class frequencies. visuals: bar charts
Bivariate exploration tools (numericXnumeric, numericXcategorical, categoricalXcategorical)
NumericXNumeric - correlations. visuals: scatterplots
NumericXCategorical - mean/median of numeric variable split by categorical variable. visuals: split boxplots, histograms
CategoricalXCategorical - two-way frequency table. visuals: bar charts
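A quick pandas sketch of these exploration tools; the dataset and variable names are made up for illustration:

```python
import pandas as pd

# Hypothetical dataset with numeric and categorical variables
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 62, 38],
    "income": [40, 52, 88, 91, 75, 60],
    "smoker": ["no", "yes", "no", "no", "yes", "no"],
    "gender": ["F", "M", "M", "F", "M", "F"],
})

# Univariate: numeric summary statistics and class frequencies
print(df["age"].describe())           # mean, std, min/max, quartiles
print(df["smoker"].value_counts())    # class frequencies

# Bivariate, numeric x numeric: correlation
print(df["age"].corr(df["income"]))

# Bivariate, numeric x categorical: mean/median split by level
print(df.groupby("smoker")["income"].agg(["mean", "median"]))

# Bivariate, categorical x categorical: two-way frequency table
print(pd.crosstab(df["smoker"], df["gender"]))
```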
What are the three common data issues for numeric variables?
- Highly correlated predictors
- Skewness (esp. right skewness due to outliers)
- Should they be converted to a factor?
Highly correlated predictors (problems)
a) difficult to separate out the individual effects of different predictors on the target variable
b) For GLMs, coefficient estimates vary widely in sign and magnitude and are difficult to interpret
Highly correlated predictors (solutions)
a) drop one of the strongly correlated predictors
b) Use PCA to compress the correlated predictors into a few PCs (see the sketch below)
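A minimal sketch of option (b), assuming scikit-learn; the two strongly correlated predictors are simulated:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
X = np.column_stack([x1, x2])

Z = StandardScaler().fit_transform(X)       # standardize before PCA
pca = PCA(n_components=2).fit(Z)
print(pca.explained_variance_ratio_)        # first PC captures ~all the variance

pc1 = pca.transform(Z)[:, :1]               # single replacement predictor
```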
Skewness (problems)
Extreme values:
a) exert a disproportionate effect on model fit
b) distort visualizations (e.g., axes expanded inordinately to accommodate outliers)
Skewness (solutions)
a) transformations to reduce right skewness (log, square root)
b) options to handle outliers (remove, keep, modify, use robust model forms)
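A small sketch of option (a) on a simulated right-skewed variable (numpy/scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=1000)   # right-skewed, strictly positive

print(stats.skew(x))            # strongly positive skewness
print(stats.skew(np.log(x)))    # near zero after the log transform
print(stats.skew(np.sqrt(x)))   # square root gives a milder correction

# np.log1p(x) is a common variant when the variable can be exactly zero
```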
A common issue for categorical predictors
Sparse levels (reduce the robustness of models and may cause overfitting)
What is an interaction?
relationship between a predictor and the target variable depends on the value/level of another predictor
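A hypothetical illustration of an interaction using a statsmodels formula, where the slope of x depends on the level of group:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=300),
    "group": rng.choice(["A", "B"], size=300),
})
# The slope of x depends on the level of group: +1 for A, -1 for B
slope = np.where(df["group"] == "A", 1.0, -1.0)
df["y"] = slope * df["x"] + rng.normal(scale=0.2, size=300)

# "x * group" expands to x + group + x:group (the interaction term)
fit = smf.ols("y ~ x * group", data=df).fit()
print(fit.params)
```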
Regression problems are used when a target is __________.
numeric (quantitative)
Classification problems are used when a target is __________.
categorical (qualitative)
metrics on the training set measure
the goodness of fit of the model to the training data
metrics on the test set measure
the prediction performance on new, unseen data
What does a loss function do?
captures the discrepancy between the actual and predicted values for each observation of the target variable
examples of loss functions
a) Square Loss (most common for numeric targets)
b) Absolute loss
c) Zero-one loss (mostly for categorical targets)
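Minimal numpy implementations of these three loss functions, with made-up targets and predictions:

```python
import numpy as np

def square_loss(y, y_hat):
    return (y - y_hat) ** 2           # penalizes large errors heavily

def absolute_loss(y, y_hat):
    return np.abs(y - y_hat)          # more robust to outliers

def zero_one_loss(y, y_hat):
    return (y != y_hat).astype(int)   # 1 if misclassified, 0 otherwise

y, y_hat = np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0])
print(square_loss(y, y_hat).mean())    # MSE ≈ 0.417
print(absolute_loss(y, y_hat).mean())  # MAE = 0.5

labels, preds = np.array(["a", "b", "a"]), np.array(["a", "a", "a"])
print(zero_one_loss(labels, preds).mean())  # misclassification rate ≈ 0.333
```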
Confusion matrices
table showing prediction versus reference (actual) counts
Accuracy
proportion of correctly classified obs
classification error rate
proportion of misclassified obs.
Sensitivity
proportion of +ve obs. correctly classified as +ve
Specificity
proportion of -ve obs. correctly classified as -ve
Precision
proportion of +ve predictions truly belonging to +ve class
accuracy weighted average relation
accuracy = (n₋/n) × specificity + (n₊/n) × sensitivity, where n₊ and n₋ are the counts of +ve and -ve observations and n = n₊ + n₋ (see the check below)
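A numeric check of these metrics and the weighted-average relation on a hypothetical confusion matrix:

```python
import math

# Hypothetical counts: TP/FN from actual positives, TN/FP from actual negatives
TP, FN = 40, 10     # n_plus  = 50 actual positives
TN, FP = 130, 20    # n_minus = 150 actual negatives
n_plus, n_minus = TP + FN, TN + FP
n = n_plus + n_minus

sensitivity = TP / n_plus           # 0.80
specificity = TN / n_minus          # ~0.867
precision   = TP / (TP + FP)        # ~0.667
accuracy    = (TP + TN) / n         # 0.85
error_rate  = 1 - accuracy          # 0.15

# Weighted-average relation: accuracy is a mix of specificity and sensitivity
assert math.isclose(accuracy,
                    (n_minus / n) * specificity + (n_plus / n) * sensitivity)
```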
common uses of CV
a) Model assessment (can be performed without setting aside a separate test set)
b) Hyperparameter tuning
Hyperparameters
parameters with values supplied in advance; not optimized by the model fitting algorithm
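A sketch of both CV uses with scikit-learn's GridSearchCV; the model (Lasso), the alpha grid, and the simulated data are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# alpha is a hyperparameter: supplied in advance, not estimated by the fit
grid = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```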
considerations when selecting the best model
a) prediction performance
b) interpretability
c) ease of implementation
Unbalanced data for binary targets (problems)
a) classifier implicitly places more weight on the majority class and tries to fit those observations well, but the minority class is often the +ve class of interest
b) a high accuracy can be deceptive
Unbalanced data for binary targets (solution)
a) Undersampling - keep all obs. from minority class but draw fewer obs. from majority class (con: you have less data)
b) Oversampling - keep all obs. from majority class, but draw more observations from minority class with replacement (con: larger dataset, heavier computational burden)
Effects of undersampling and oversampling on model results
+ve class becomes more prevalent in the balanced data –> predicted probabilities for +ve class will increase –> for a fixed cutoff, sensitivity increases but specificity decreases
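A minimal pandas sketch of both resampling schemes on a hypothetical 900/100 unbalanced training set:

```python
import pandas as pd

# Hypothetical unbalanced training set: 900 negatives, 100 positives
df = pd.DataFrame({"target": [0] * 900 + [1] * 100})
neg, pos = df[df["target"] == 0], df[df["target"] == 1]

# Undersampling: keep all positives, draw fewer negatives (without replacement)
under = pd.concat([pos, neg.sample(n=len(pos), random_state=1)])

# Oversampling: keep all negatives, draw extra positives (with replacement)
over = pd.concat([neg, pos.sample(n=len(neg), replace=True, random_state=1)])

print(under["target"].value_counts())  # 100 / 100
print(over["target"].value_counts())   # 900 / 900
```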
what is overfitting?
Model is trying too hard to capture not only the signal, but also the noise specific to the training data
What indicates overfitting?
Small training error, but large test error
What problems come from overfitting?
An overfitted model fits the training data well but does not generalize well to new, unseen data (poor predictions)
relationship between complexity, bias, variance, training error and test error.
as complexity increases, variance increases, bias decreases, training error decreases, and the test error has a U-shape
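A small simulation of this relationship, assuming scikit-learn: polynomial degree plays the role of complexity, training error falls steadily while test error typically traces the U-shape:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)   # signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in [1, 3, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # keeps falling
          mean_squared_error(y_te, model.predict(X_te)))   # typically U-shaped
```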
mathematical definition of bias
difference between the expected value of the prediction and the true value
mathematical definition of variance
amount of variability of the prediction
significance of Bias in PA
part of the test error caused by the model not being flexible enough to capture the signal (underfitting)
significance of Variance in PA
part of the test error caused by the model being too complex (overfitting)
dimensionality applicability
specific to categorical variables
granularity applicability
applies to both numeric and categorical variables
dimensionality comparability
two categorical variables can always be ordered by dimension
granularity comparability
not always possible to order two variables by granularity
what is the aim of model validation?
to check that the selected model has no obvious deficiencies and the model assumptions are largely satisfied
for a “nice” GLM, the deviance residuals should:
1) (purely random) have no systematic patterns
2) (homoscedasticity) have approximately constant variance upon standardization
3) (normality) be approximately normal (for most target distributions)
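A minimal sketch of extracting deviance residuals from a fitted GLM, assuming statsmodels and a simulated Poisson target:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 2)))
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))

glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Deviance residuals: plot against fitted values; look for systematic
# patterns or non-constant spread (neither should appear for a "nice" fit)
dev_res = glm.resid_deviance
print(dev_res.mean(), dev_res.std())
```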
what are the two validation methods based on the test set?
1) predicted vs actual values of the target - the two sets of values should be close (can check this quantitatively or graphically)
2) benchmark model - show that the recommended model outperforms a benchmark model, if one exists (e.g., intercept-only GLM, purely random classifier), on the test set
Next steps after model validation
1) Adjust the business problem - changes in external factors may cause initial assumptions to shift, so we modify business problem to incorporate the new conditions
2) consult with the subject matter experts - seek validation of model results from external subject matter experts
3) gather additional data - enlarge training data with new obs. and/or variables, and retrain the model to improve robustness
4) apply new types of models
5) refine existing models
6) field test proposed model