Modeling Process Flashcards
(245 cards)
How are the DR Reduced Features Created?
DR takes the best-performing non-blender model on the leaderboard and creates a feature list using its Feature Impact scores.
Feature Impact is calculated using permutation importance, and DR keeps the set of features that accounts for 95% of the accumulated impact for the model. If that set contains more than 100 features, only the top 100 are used.
DR also automatically removes redundant features.
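A minimal sketch of that selection rule (hypothetical impact scores; not DataRobot's internal code):

```python
# Hypothetical impact scores: keep features covering 95% of cumulative impact, capped at 100.
impact = {"age": 0.40, "income": 0.30, "region": 0.20, "tenure": 0.08, "id_hash": 0.02}

ranked = sorted(impact.items(), key=lambda kv: kv[1], reverse=True)
total = sum(score for _, score in ranked)

selected, cumulative = [], 0.0
for name, score in ranked:
    selected.append(name)
    cumulative += score
    if cumulative >= 0.95 * total:
        break

selected = selected[:100]  # cap at the top 100 features
print(selected)  # ['age', 'income', 'region', 'tenure']
```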
What are informative features?
These are identified using ACE during EDA2, along with ‘reasonableness’ checks from EDA1.
DR automatically excludes features that are redundant, have too few values, or are unique identifiers.
DR also looks for and removes features that may present target leakage.
What are Raw Features?
These are all the features/columns from the user-uploaded dataset. They exclude user-derived features, but include features that DR deemed non-informative.
What are Univariate Selections/Feature Importances?
These are calculated during EDA2, which runs after the user clicks the start modeling button.
They make use of ACE (alternating conditional expectations), which detects non-linear correlation with the target variable; a feature has to meet a certain threshold (0.005) to be deemed informative. During EDA2, DR calculates the feature importance of every variable in the informative feature list against the target and displays it on the project data page.
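ACE itself is not readily available in common Python libraries; as a hedged stand-in, mutual information also measures non-linear association with the target, and the sketch below mirrors the idea of an informativeness cutoff around 0.005:

```python
# Stand-in for ACE: mutual information also detects non-linear association
# with the target. The 0.005 cutoff mirrors the threshold mentioned above.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)  # non-linear in feature 0 only

scores = mutual_info_regression(X, y, random_state=0)
informative = [i for i, s in enumerate(scores) if s > 0.005]
print(dict(enumerate(scores)))
print(informative)  # feature 0 scores far above the cutoff
```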
What is EDA1 & when does it happen?
EDA1 occurs after the user imports their dataset into the DR platform for modeling.
If the dataset is larger than 500 MB, DR takes a sample and performs the subsequent calculations on it.
The steps that happen during this phase include:
- Inferring the feature schema type (categorical, numeric, text, etc.)
- Calculating summary statistics for numeric features
- Computing value distributions (top 50 values)
- Checking column validity (duplicate columns, too many unique values, etc.)
For date features, DR automatically performs date/time feature transformations. A rough sketch of this kind of per-column profiling follows below.
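This is purely illustrative (pandas, made-up columns), not DataRobot's implementation:

```python
# Illustrative per-column profiling: infer a schema type, then compute
# summary statistics, value distributions, or date ranges accordingly.
import pandas as pd

df = pd.DataFrame({
    "amount": [10.5, 22.0, 13.7, None],
    "state": ["CA", "NY", "CA", "TX"],
    "signup": pd.to_datetime(["2021-01-02", "2021-02-03", "2021-03-04", "2021-04-05"]),
})

for col in df.columns:
    s = df[col]
    if pd.api.types.is_numeric_dtype(s):
        kind = "numeric"
        summary = s.describe()[["mean", "std", "min", "max"]].to_dict()
    elif pd.api.types.is_datetime64_any_dtype(s):
        kind = "date"
        summary = {"min": s.min(), "max": s.max()}
    else:
        kind = "categorical"
        summary = s.value_counts().head(50).to_dict()  # top values only
    print(col, kind, summary)
```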
What are the size limits for EDA1/EDA2?
DR takes a sample of up to 500 MB of data for EDA1/EDA2.
What is EDA2 & when does it happen?
EDA2 happens after the user presses the start button. DR selects the data profiled in EDA1, but excludes the rows that will be in the holdout set (to prevent data snooping).
It performs many calculations:
- Re-calculation of the numeric statistics computed in EDA1
- Feature correlation with the target (the feature importance calculation); a minimal sketch follows below
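A minimal sketch of that idea, assuming a hypothetical `partition` column marking the holdout rows:

```python
# Illustrative: recompute stats and target correlation on everything except the holdout.
import pandas as pd

df = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "target":  [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
    "partition": ["train"] * 8 + ["holdout"] * 2,  # hypothetical partition labels
})

eda2 = df[df["partition"] != "holdout"]       # holdout excluded to avoid data snooping
print(eda2["feature"].describe())             # re-computed numeric statistics
print(eda2["feature"].corr(eda2["target"]))   # simple linear stand-in for the importance calc
```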
What are the four types of modeling modes in DR?
- Quick
- AutoPilot
- Manual
- Comprehensive
What are the types of models that are supported in DR?
- Regression
- Time Series
- Binary Classification
- Multi-class Classification
- Anomaly Detection
How does Quick AutoPilot Mode work?
DR selects a subset of models for its blueprints and runs only one round of elimination.
It uses 32% of the training data in the first round, and chooses the top four models from that round to move to the second round. The top two winning models are then blended together.
How does the AutoPilot Mode work?
DR selects a candidate set of blueprints after looking at the target variable and the schema types of the input variables.
Autopilot by default runs on the informative feature list, which is calculated during the EDA1/EDA2 process.
It runs through three rounds of elimination on the leaderboard, starting with 16% of the data and selecting the top 16 models to go to the next round.
In the second round, it feeds 32% of the training data to the models and chooses the top 8.
In the last round, the top 8 models are fed all of the training data (64% of the total), and the top results are calculated.
Blenders are created from the top models of the final round.
Models are initially capped to 500 MB of data, but this can be changed either from the repository or, after the run has completed, from the leaderboard.
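A hedged sketch of those elimination rounds; `score_model` is a hypothetical helper that trains a blueprint on the given sample fraction and returns a validation score (higher is better):

```python
# Illustrative survival-of-the-fittest rounds: 16% -> top 16, 32% -> top 8, 64% -> final ranking.
def run_autopilot(blueprints, score_model):
    rounds = [(0.16, 16), (0.32, 8), (0.64, None)]  # (sample fraction, survivors to keep)
    candidates = list(blueprints)
    for sample_pct, keep in rounds:
        scored = sorted(candidates, key=lambda bp: score_model(bp, sample_pct), reverse=True)
        candidates = scored[:keep] if keep is not None else scored
    return candidates  # blenders are then built from the top of this final ranking
```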
What are the benefits of using a ‘survival of the fittest’ approach?
Beyond being a marketing gimmick:
- It increases the number of model types you can try out quickly
- You can visualize a learning curve that shows how your loss metric improves as the sample size grows, and whether it would be worth investing in more data
- Faster run time, since the initial models are capped to 500 MB of data
How does the Manual modeling mode work?
This gives the user full control over which model to execute. You can choose from the repository which models you want to try out.
How does Comprehensive modeling mode work?
This runs all repository blueprints on the max Autopilot sample size (64%). This will result in extended build times.
When does DR use H2O or SparkML models?
These are specific to Hadoop installations and have to be specified using the Advanced options.
What are workers, and how are they used in the modeling process?
Workers are computational units that process the modeling workflow.
Workers responsible for EDA and uploading data are shared across an org.
Modeling workers are assigned by the admin to a specific user.
What are feature associations, and when are they calculated?
Feature associations are an output of EDA2, computed on the features deemed informative for modeling purposes in EDA1.
They give information about the correlation between features, using metrics like Cramér's V and mutual information.
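As an illustration (not DataRobot's code), Cramér's V between two categorical features can be computed from a contingency table:

```python
# Illustrative: Cramér's V between two categorical features.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "green", "green"],
    "size":  ["S",   "S",   "L",    "L",    "M",     "M"],
})

table = pd.crosstab(df["color"], df["size"])
chi2 = chi2_contingency(table)[0]
n = table.to_numpy().sum()
r, k = table.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, k) - 1)))
print(cramers_v)  # 1.0 here: the two features are perfectly associated
```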
How are missing values handled?
Models like XGBoost handle missing values natively.
For linear models, DR handles missing values as follows:
- Median imputation using the non-missing values
- Adding a missing-value flag, enabling the model to recognize patterns in structurally missing data
For tree-based models, DR imputes with an arbitrary value (e.g., -9999), which is algorithmically faster but gives just as accurate results.
For missing categorical values, DR treats the missing value as another level of the category.
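A minimal pandas sketch of the three strategies described above (illustrative column names, not DataRobot's implementation):

```python
# Illustrative missing-value handling.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 75_000], "region": ["east", None, "west"]})

# Linear models: median imputation plus a missing-value indicator flag.
df["income_missing"] = df["income"].isna().astype(int)
df["income_linear"] = df["income"].fillna(df["income"].median())

# Tree-based models: impute with an arbitrary sentinel value such as -9999.
df["income_tree"] = df["income"].fillna(-9999)

# Categoricals: treat missing as its own level.
df["region"] = df["region"].fillna("==Missing==")
print(df)
```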
My customer wants to do clustering, what should I tell them?
Try to figure out what the underlying business problem is that is driving the perceived need for clustering. For example, if a customer in marketing wants to do clustering and then choose certain clusters to market to, they will get much better results with a propensity model that directly predicts who will respond to a marketing effort. These models can give a huge lift over the naive clustering approach. In other words, you can find clusters like “clients likely to buy, clients not likely to buy” with supervised learning. Oftentimes cluster analysis has been used because better models were too hard to use, but DataRobot lets them use better models easily.
What is the difference between ‘ordinal encoder’ and ‘ordinal variable’?
Ordinal encoding refers to coding categorical features as numbers, an alternative to one-hot encoding. The phrase “ordinal variable” describes a categorical variable in which the values have an order, for example “good”, “better”, “best”.
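A quick illustration of the contrast (pandas, made-up data):

```python
# Illustrative: ordinal encoding vs. one-hot encoding of the same categorical column.
import pandas as pd

s = pd.Series(["good", "better", "best", "good"])

# Ordinal encoding: each level becomes a single integer code.
ordinal = s.map({"good": 0, "better": 1, "best": 2})

# One-hot encoding: each level becomes its own indicator column.
one_hot = pd.get_dummies(s)

print(ordinal.tolist())  # [0, 1, 2, 0]
print(one_hot)
```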
What does DR do if there are missing values in my target?
Records with missing values in the target are ignored in EDA2 and modeling. This provides a nice hack: to train on a subset of the data after it’s imported, you can derive a new feature that is set to a missing value when certain criteria are met (the criteria that define the records you want to omit). Now set that variable as the target and DR will drop the records where it is missing. The easiest way to produce a missing value in DR is log(0), as follows: where({num_medications}<10,{readmitted},log(0))
What are the pros/cons of univariate selections?
Univariate selections are done quickly, and they capture non-linear relationships between each variable and the target; however, they do not capture the importance of a variable in the presence of interactions.
What data partition is used in the histograms on the Data tab?
Histograms produced via EDA1 use all of the data, up to 500MB. For datasets larger than 500MB, a random 500MB sample is used. For histograms produced by EDA2, all the data except for the holdout (if any) and rows with missing target values are used.
What do the histograms on the Data tab represent the sum of?
Row count or sum of exposures.