Section 1: General Model Building Steps Flashcards
What is Descriptive Analytics?
Focus: What happened in the past?
Aim: to “describe” or interpret observed trends by identifying relationships between variables
What is Predictive Analytics?
Focus: What will happen in the future?
Aim: to make accurate “predictions”
What is Prescriptive Analytics?
Focus: The impacts of different “prescribed” decisions (assumptions)
Aim: to answer the “what if?” and “what is the best course of action?” questions
Why do we need relevant data?
To ensure that the data is unbiased (i.e., representative of the environment where the model will operate)
What is Random Sampling?
“Randomly” draw observations from the underlying population without replacement. Each record is equally likely to be sampled.
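A minimal sketch of random sampling without replacement, using Python's standard library. The population of 1,000 record IDs, the sample size of 100, and the seed are assumptions made up for illustration.

```python
import random

# Hypothetical population: 1,000 record IDs (illustrative only).
population = list(range(1000))

# A seeded generator makes the draw reproducible.
rng = random.Random(42)

# random.sample draws without replacement, so no record appears twice,
# and every record is equally likely to be chosen.
sample = rng.sample(population, k=100)
```

Because the draw is without replacement, the sample contains 100 distinct records.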
What is Stratified Sampling?
Divide the underlying population into a number of non-overlapping “strata” non-randomly, then randomly sample a set number of observations from each stratum (this helps you get a more representative sample)
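The two steps above (split into strata, then sample randomly within each) can be sketched as follows. The function name `stratified_sample`, the `region` field, and the sample sizes are hypothetical choices for illustration, not a prescribed implementation.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n_per_stratum, seed=0):
    """Group records into non-overlapping strata, then randomly sample
    up to n_per_stratum records from each stratum (without replacement)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)  # non-random assignment to strata

    sample = []
    for key in sorted(strata):
        members = strata[key]
        k = min(n_per_stratum, len(members))  # guard against small strata
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 180 "urban" and 20 "rural" records.
records = [{"id": i, "region": "urban" if i % 10 else "rural"} for i in range(200)]
sample = stratified_sample(records, lambda r: r["region"], n_per_stratum=10)
```

Here the small "rural" stratum is guaranteed 10 records, whereas a simple random sample of 20 would be expected to contain only about 2.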
What is a special case of Stratified Sampling?
Systematic sampling: draw observations according to a set pattern; no random mechanism controlling which observations are sampled
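As a sketch of a set pattern, the snippet below takes every 10th record starting from the 5th. The step, the starting position, and the record IDs are illustrative assumptions.

```python
def systematic_sample(records, step, start=0):
    """Take every `step`-th record beginning at index `start`.
    No random mechanism is involved: the pattern fully determines the sample."""
    return records[start::step]

# Hypothetical record IDs 1..100.
records = list(range(1, 101))
sample = systematic_sample(records, step=10, start=4)
```

Running the function twice on the same records returns the same sample, which is what distinguishes it from the random schemes above.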
What is Granularity?
Refers to how precisely a variable is measured (i.e., level of detail for the information contained by the variable)
What are examples of data quality issues? (name at least 3)
- Reasonableness (ex. variables such as age, time, and income are non-negative)
- Consistency (ex. same measurement unit for numeric variables, same coding scheme for categorical variables)
- Sufficient Documentation (ex. clear description of each variable)
- Personally identifiable info (PII)
- Variables with legal/ethical concerns
- Target Leakage (on other flashcard)
What is Target Leakage?
when predictors in a model “leak” information about the target variable that would not be available when the model is deployed in practice
Univariate exploration tools (numeric and categorical)
Numeric - mean, median, variance, min/max. visuals: histograms, boxplots
Categorical - class frequencies. visuals: bar charts
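A small sketch of the univariate summaries listed above, using only the standard library. The `ages` and `regions` values are made-up illustrative data; plotting (histograms, boxplots, bar charts) is left out.

```python
import statistics
from collections import Counter

# Hypothetical data for one numeric and one categorical variable.
ages = [23, 35, 35, 41, 52, 67]
regions = ["urban", "rural", "urban", "urban", "rural", "urban"]

# Numeric variable: summary statistics.
numeric_summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "variance": statistics.variance(ages),  # sample variance (n - 1 divisor)
    "min": min(ages),
    "max": max(ages),
}

# Categorical variable: class frequencies.
class_frequencies = Counter(regions)
```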
Bivariate exploration tools (numericXnumeric, numericXcategorical, categoricalXcategorical)
NumericXNumeric - correlations. visuals: scatterplots
NumericXCategorical - mean/median of numeric variable split by categorical variable. visuals: split boxplots, histograms
CategoricalXCategorical - 2 way frequency table. visuals: bar charts
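Two of the bivariate summaries above, sketched with the standard library: the mean of a numeric variable split by a categorical variable, and a two-way frequency table. The field names (`region`, `smoker`, `claim`) and the rows are hypothetical.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical records with two categorical variables and one numeric variable.
rows = [
    {"region": "urban", "smoker": "yes", "claim": 120.0},
    {"region": "urban", "smoker": "no", "claim": 80.0},
    {"region": "rural", "smoker": "no", "claim": 60.0},
    {"region": "rural", "smoker": "yes", "claim": 100.0},
    {"region": "urban", "smoker": "no", "claim": 100.0},
]

# Numeric x categorical: mean claim split by region.
by_region = defaultdict(list)
for row in rows:
    by_region[row["region"]].append(row["claim"])
mean_claim_by_region = {k: statistics.mean(v) for k, v in by_region.items()}

# Categorical x categorical: two-way frequency table keyed by (region, smoker).
two_way = Counter((row["region"], row["smoker"]) for row in rows)
```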
What are the three common data issues for numeric variables?
- Highly correlated predictors
- Skewness (esp. right skewness due to outliers)
- Should they be converted to a factor?
Highly correlated predictors (problems)
a) difficult to separate out the individual effects of different predictors on the target variable
b) For GLMs, coefficient estimates can vary widely in sign and magnitude, making them difficult to interpret
Highly correlated predictors (solutions)
a) drop one of the strongly correlated predictors
b) Use PCA to compress the correlated predictors into a few PCs
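Before applying either solution, the strongly correlated pairs have to be found. The sketch below computes pairwise Pearson correlations and flags pairs above a cutoff. The predictor names, the data, and the 0.9 threshold are all assumptions for illustration; the right cutoff depends on the problem.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predictors: the two height columns carry the same information.
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40, 25, 60, 33, 51],
}

THRESHOLD = 0.9  # assumed cutoff, not a universal rule

# Flag every pair whose absolute correlation exceeds the cutoff.
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson(predictors[a], predictors[b])) > THRESHOLD
]
```

Only the two height variables are flagged, so one of them is a candidate to drop.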
Skewness (problems)
Extreme values:
a) exert a disproportionate effect on model fit
b) distort visualizations (e.g., axes expanded inordinately to take care of outliers)
Skewness (solutions)
a) transformations to reduce right skewness (Log, Square root)
b) options to handle outliers (remove, keep, modify, or use robust model forms)
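To illustrate solution (a), the sketch below compares the skewness of a right-skewed variable before and after a log transformation. The claim amounts are hypothetical values that double at each step, chosen so that their logs are evenly spaced; real data will rarely become this symmetric.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by the
    second central moment raised to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed claim amounts (all strictly positive,
# which the log transformation requires).
claims = [100, 200, 400, 800, 1600, 3200, 6400]
log_claims = [math.log(v) for v in claims]

raw_skew = skewness(claims)      # clearly positive (right-skewed)
log_skew = skewness(log_claims)  # close to zero for this data
```

The log transformation only applies to strictly positive values; a square root transformation also accepts zeros.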
A common issue for categorical predictors
Sparse levels (reduce the robustness of models and may cause overfitting)
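One common remedy, which the card itself does not state, is to combine sparse levels into a single catch-all level. A minimal sketch, where the function name `combine_sparse_levels`, the "Other" label, the occupation data, and the minimum count of 5 are all hypothetical:

```python
from collections import Counter

def combine_sparse_levels(values, min_count, other_label="Other"):
    """Replace every level observed fewer than min_count times
    with a single catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical categorical predictor with two sparse levels.
occupations = ["clerk"] * 40 + ["driver"] * 35 + ["pilot"] * 2 + ["diver"] * 1
combined = combine_sparse_levels(occupations, min_count=5)
combined_counts = Counter(combined)
```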
What is an interaction?
relationship between a predictor and the target variable depends on the value/level of another predictor
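A toy linear model can make this concrete. In the sketch below, the coefficients and the predictors `age` and `smoker` are invented for illustration; the product term `age * smoker` is the interaction, so the effect of one extra year of age differs between smokers and non-smokers.

```python
def predict(age, smoker, b0=100.0, b_age=2.0, b_smoker=50.0, b_interaction=3.0):
    """Hypothetical linear predictor with an age-by-smoker interaction term."""
    return b0 + b_age * age + b_smoker * smoker + b_interaction * age * smoker

def age_effect(smoker):
    """Change in the prediction when age rises from 40 to 41, holding smoker fixed."""
    return predict(41, smoker) - predict(40, smoker)

effect_nonsmoker = age_effect(0)  # b_age only
effect_smoker = age_effect(1)     # b_age + b_interaction
```

Without the interaction term (`b_interaction=0`), the two effects would be identical.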
Regression problems are used when a target is __________.
numeric (quantitative)
Classification problems are used when a target is __________.
categorical (qualitative)