Modeling definitions Flashcards
Modeling Considerations - Before
Descriptive analytics focuses on studying the past to identify relationships and patterns among the variables
Predictive analytics focuses on anticipating the future by using models to make accurate predictions
Prescriptive analytics focuses on the outcome of decisions
A project can involve all three types, but it is important to recognize which ones are essential
Modeling Considerations - During
When things do not go according to plan:
-adjust the business problem
-consult an expert
-collect more data
-attempt different models
-refine current models
Modeling Considerations - After
Modeling work concludes by either implementing or abandoning the model
Implementation is seldom straightforward. When others need to be convinced of the model's ability, consider conducting a field test, where the model is applied in a real setting but without acting on its results
If it becomes clear that something critical to success is missing, abandoning the model avoids wasting further resources
Supervised vs Unsupervised Learning
Supervised learning studies the data with a target -> everything centers on analyzing the target through the predictors, hence it is the focus of predictive analytics
Unsupervised learning analyzes the data without a target -> the idea is to identify patterns that may exist in the data, but there are no clear objectives or ways to verify the quality of the findings.
–These techniques can be used to create features
–Features used to model predictor impact on a target should NOT be based on the target; doing so results in target leakage, where the model would need to know the target in order to predict the target, which is inappropriate.
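A toy sketch of target leakage (the dataset and field names here are hypothetical, purely for illustration): a feature derived from information only available once the target is known reproduces the target and must not be used as a predictor.

```python
# Hypothetical toy records: the target is whether a claim was filed (1/0).
records = [
    {"age": 25, "claim_filed": 1, "claim_amount": 1200.0},
    {"age": 40, "claim_filed": 0, "claim_amount": 0.0},
    {"age": 33, "claim_filed": 1, "claim_amount": 800.0},
]

# BAD: claim_amount is only known after the claim outcome is known --
# a positive amount implies claim_filed = 1, so using it as a predictor
# leaks the target into the features.
leaky_feature = [r["claim_amount"] > 0 for r in records]

# The leaky feature reproduces the target exactly:
assert leaky_feature == [r["claim_filed"] == 1 for r in records]

# OK: age is known before the claim outcome, so it is a legitimate predictor.
safe_feature = [r["age"] for r in records]
```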
Regression vs Classification
Many supervised learning problems can be divided into two types
-Regression involves continuous or count targets
-Classification involves a categorical target (often binary)
Sometimes, we might prefer to reframe a classification problem in terms of a regression problem, such as treating a binary target as numeric
Decomposing the Target
Studying the target by its two parts
1. systematic component
-describes the value that the target gravitates towards (f)
-when viewing the target as a random variable, f is its mean
-want f to be a function of the predictors
-this theory proposes that the mean target depends on the predictors, thus f captures the systematic relationship between the target and the predictors
-In practice, f is unknown, and the first step to making model predictions is to estimate f (f^)
2. random component
-captures things about the target that cannot be explained with any predictor, sometimes denoted as e (epsilon)
Y = f(x_1, …, x_p) + e OR 'signal' + noise
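The decomposition can be simulated directly. In this sketch the "true" f is assumed known (a simple linear signal chosen for illustration; in practice f is unknown), and the random component is mean-zero noise, so residuals against f average out to roughly zero.

```python
import random

random.seed(0)

# Assumed systematic component f (unknown in practice; chosen here
# only to generate toy data).
def f(x):
    return 2.0 + 3.0 * x

# Simulate targets as signal plus noise (the random component epsilon).
xs = [i / 10 for i in range(100)]
ys = [f(x) + random.gauss(0.0, 1.0) for x in xs]

# The noise has mean 0, so the residuals against the true f should
# average out close to zero across the sample.
residuals = [y - f(x) for x, y in zip(xs, ys)]
avg_residual = sum(residuals) / len(residuals)
```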
Parametric vs Non-Parametric
Parametric - specifies a functional form for f that includes free parameters
->data is used to estimate these parameters; the downside to this approach is that the choice of functional form can be arbitrary, so the chosen form may differ significantly from the true f
->(MLR, Stepwise selection and Regularization, GLM)
Non-parametric - makes no assumption about f's functional form; there are no parameters to estimate
->having no functional form to depend on means that f^ relies solely on the data; these methods require an abundance of observations to be effective
->(regular decision trees, ensembles of decision trees)
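A side-by-side sketch of the two approaches on toy data (all values hypothetical): the parametric fit assumes f(x) = b0 + b1*x and estimates two free parameters, while the non-parametric fit assumes no form at all and predicts from the data directly via a k-nearest-neighbour average.

```python
# Toy sample where the true relationship is roughly linear.
data = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]

# Parametric: assume f(x) = b0 + b1*x and estimate the two free
# parameters by least squares.
n = len(data)
xbar = sum(x for x, _ in data) / n
ybar = sum(y for _, y in data) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in data) / \
     sum((x - xbar) ** 2 for x, _ in data)
b0 = ybar - b1 * xbar

def parametric_pred(x):
    return b0 + b1 * x

# Non-parametric: no functional form; predict with the average of the
# k nearest observations (k = 2 chosen arbitrarily here).
def knn_pred(x, k=2):
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k
```

The parametric prediction depends on the data only through the two estimated parameters, while the k-NN prediction must consult the raw observations every time, which is why non-parametric methods need plenty of data.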
Flexibility
Flexibility describes how closely f^ is able to follow the data. A more flexible f^ follows the data more closely than a less flexible one.
-For parametric methods, higher flexibility often comes from having more free parameters in the functional form.
-Non-parametric methods typically have a flexibility measure unique to each method; they are generally considered more flexible than parametric methods because they are not confined to a functional form for f.
Flexibility and accuracy do not always go hand-in-hand. It may be possible to create an f^ that makes perfect predictions on past data, but that is not the objective. An f^ with good predictive ability on future data is usually one that is not overly flexible.
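The last point can be demonstrated with a maximally flexible fit. A 1-nearest-neighbour predictor (used here as a stand-in for any model flexible enough to memorize the data; all values hypothetical) achieves zero training error, yet still misses on a new observation.

```python
# Toy training data generated from the pattern y = 2x + 1.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def one_nn(x):
    # Predict with the single closest training point: this memorizes
    # the training data, giving zero training error.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Perfect on the training data...
train_errors = [abs(one_nn(x) - y) for x, y in train]

# ...but on a new point the memorized step-function fit can miss: if the
# true pattern is y = 2x + 1, the truth at x = 2.6 is 6.2, while 1-NN
# returns the nearest memorized value, 7.0.
new_x, truth = 2.6, 2 * 2.6 + 1
pred = one_nn(new_x)
```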
Interpretability
Describes how easy it is to understand f^. When a model has complicated components in f, the relationship between the target and predictors becomes more difficult to understand.
Flexibility vs Interpretability
Flexibility is often inversely related to interpretability. A highly flexible f^ can follow the data closely, but the complex mathematical parts that enable it are often challenging to interpret.
Less flexible, more interpretable -> Stepwise selection and Regularization
Moderately flexible and interpretable -> MLR, GLM, Regular decision trees
More flexible, less interpretable -> ensembles of decision trees
Comparing models
glm_lasso is a simpler model. It only considers whether the animal is a cat or dog, its age, and whether it arrived via Public Assist. glm_drop includes three additional predictors, making it more cumbersome to explain to a non-technical audience at the animal shelter.
The higher AUC suggests that glm_drop is slightly better at classifying adoptions in this train/test data partition. However, the glm_lasso model has fewer predictors, protecting against overfitting and adding confidence that the model performance will be stable with unseen data.
The interpretability and robustness of glm_lasso outweigh the slight decrease in predictive performance.
-If a more flexible model has a worse test metric than a less flexible model, the more flexible model is likely overfit to the training data.
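For reference, AUC can be computed from scratch as the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties counted as half). The labels, scores, and model names below are hypothetical stand-ins, not the actual glm_lasso/glm_drop fits.

```python
def auc(labels, scores):
    # Mann-Whitney formulation: fraction of (positive, negative) pairs
    # where the positive case receives the higher score.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    pairs = [(p > q) + 0.5 * (p == q) for p in pos for q in neg]
    return sum(pairs) / (len(pos) * len(neg))

# Hypothetical test-set labels and predicted scores from two models.
y = [1, 1, 0, 0, 1, 0]
scores_complex = [0.9, 0.5, 0.4, 0.3, 0.8, 0.6]  # e.g. the larger model
scores_simple = [0.8, 0.4, 0.5, 0.2, 0.7, 0.6]   # e.g. the simpler model

auc_complex = auc(y, scores_complex)
auc_simple = auc(y, scores_simple)
```

As in the glm comparison above, a modest AUC edge for the more complex model does not by itself outweigh the simpler model's interpretability and robustness.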