Modeling Process Flashcards
How are the DR Reduced Features Created?
DR takes the best-performing non-blender model from the leaderboard and creates a feature list using Feature Impact scores.
Feature Impact is calculated using permutation importance, and DR keeps the set of features that provides 95% of the accumulated impact for the model. If that set contains more than 100 features, only the top 100 are used.
DR also automatically removes redundant features.
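A minimal sketch of that selection logic (the 95% cutoff and 100-feature cap come from the card above; the function and input names are illustrative, not DR internals):

```python
# Hypothetical sketch: select the top features covering 95% of total impact.
def reduced_feature_list(impact, coverage=0.95, max_features=100):
    """`impact` is a dict of {feature_name: feature impact score}."""
    ranked = sorted(impact.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(score for _, score in ranked)
    selected, cumulative = [], 0.0
    for name, score in ranked:
        if cumulative >= coverage * total or len(selected) >= max_features:
            break
        selected.append(name)
        cumulative += score
    return selected

print(reduced_feature_list({"age": 5.0, "income": 3.0, "zip": 1.5, "id": 0.5}))
# -> ['age', 'income', 'zip']  (9.5 of 10.0 total impact = 95%)
```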
What are informative features?
These are identified using ACE scores during EDA2, along with ‘reasonableness’ checks from EDA1.
DR automatically excludes features that are redundant, have too few values, or are unique identifiers from the list.
DR also looks for and removes features that may present target leakage.
What are Raw Features?
These are all the features/columns from the user-uploaded dataset. They exclude user-derived features, but include features that DR deemed non-informative.
What are Univariate Selections/Feature Importances?
These are calculated during EDA2, which runs after the user clicks the start modeling button.
They make use of ACE (alternating conditional expectations), which detects non-linear correlation with the target variable. A feature has to meet a certain threshold (0.005) to be deemed informative. During EDA2, DR calculates the feature importance of every variable in the informative feature list against the target, and displays the results on the project data page.
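DR's actual scoring uses ACE; as a hedged analogue, here is univariate non-linear screening using scikit-learn's mutual information, with a small cutoff standing in for the 0.005 threshold:

```python
# Hedged analogue of ACE-style univariate screening (not DR's implementation).
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + 0.01 * rng.normal(size=500)  # non-linear in feature 0 only

scores = mutual_info_regression(X, y, random_state=0)
informative = [i for i, s in enumerate(scores) if s > 0.005]
print(scores.round(3), informative)  # feature 0 scores high; 1 and 2 near zero
```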
What is EDA1 & when does it happen?
EDA1 occurs after the user imports their dataset into the DR platform for modeling.
If the dataset is larger than 500MB, DR takes a 500MB sample and performs the subsequent calculations on it.
The steps that happen during this phase include:
- inferring the feature schema type (categorical, numeric, text, etc.)
- For numerics, calculating summary statistics
- Frequency distributions for the top 50 values
- Column validity checks (duplicate columns, too many unique values, etc.)
For date features, DR automatically performs date/time feature transformations.
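A rough pandas approximation of these EDA1 steps (the file name is hypothetical, and DR's real type inference and validity checks are more sophisticated):

```python
# Hedged EDA1-style profiling sketch, not DR internals.
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical input file
for col in df.columns:
    inferred = "numeric" if pd.api.types.is_numeric_dtype(df[col]) else "categorical/text"
    print(col, inferred)
    if inferred == "numeric":
        print(df[col].describe())            # summary statistics
    print(df[col].value_counts().head(50))   # frequency distribution, top 50 values
    if df[col].nunique() == len(df):
        print(f"  warning: {col} looks like a unique identifier")
```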
What are the size limits for EDA1/EDA2?
DR takes a sample of up to 500MB worth of data for EDA1/EDA2.
What is EDA2 & when does it happen?
EDA2 happens after the user presses the start button. DR selects the data from EDA1, but excludes the data that will be in the holdout set (to prevent data snooping).
It performs many calculations:
- re-calculation of numerical stats done in EDA1
- Feature correlation to the target (feature importance calculation)
What are the four types of modeling modes in DR?
- Quick
- AutoPilot
- Manual
- Comprehensive
What are the types of models that are supported in DR?
- Regression
- Time Series
- Binary Classification
- Multi-class Classification
- Anomaly Detection
How does Quick AutoPilot Mode work?
DR selects a subset of models for its blueprints, and only runs one round of elimination.
It uses 32% of the training data in the first round and chooses the top four models from that round to move to the second round. The top two winning models are then blended together.
How does the AutoPilot Mode work?
DR selects a candidate set of blueprints after looking at the target variable and the schema types of the input features.
Autopilot by default runs on the informative feature list, which is calculated during the EDA1/EDA2 process.
It runs through three rounds of elimination on the leaderboard, starting with 16% of the data and selecting the top 16 models to move to the next round.
In the second round, it feeds 32% of the training data to the models, and chooses the top 8.
In the last round, the top 8 models are fed all of the training data (64% of the total), and the top results are calculated.
Blenders are created from the top models of the final round.
Models are initially capped to 500MB worth of data, but this can be changed either from the repository or from the leaderboard after a model has completed.
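A toy sketch of the elimination rounds described above, assuming a hypothetical fit_and_score(blueprint, pct) helper that trains a blueprint on that fraction of the data and returns a loss (lower is better):

```python
# Toy 'survival of the fittest' rounds; fit_and_score is a hypothetical helper.
def autopilot(blueprints, fit_and_score):
    round1 = sorted(blueprints, key=lambda bp: fit_and_score(bp, 0.16))[:16]
    round2 = sorted(round1, key=lambda bp: fit_and_score(bp, 0.32))[:8]
    final = sorted(round2, key=lambda bp: fit_and_score(bp, 0.64))
    return final  # blenders are then built from the top finishers
```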
What are the benefits of using a ‘survival of the fittest’ approach?
Beyond being a marketing gimmick,
- This increases the number of model types you can try out quickly
- You can visualize a learning curve that shows how your loss metric improves over time, and whether it would be worth investing in more data
- Faster run time, as the initial models are capped to 500MB worth of data
How does the Manual modeling mode work?
This gives the user full control over which model to execute. You can choose from the repository which models you want to try out.
How does Comprehensive modeling mode work?
This runs all repository blueprints on the max Autopilot sample size (64%). This will result in extended build times.
When does DR use H2O or SparkML models?
These are specific to Hadoop installations and have to be specified using the Advanced options.
What are workers, and how are they used in the modeling process?
Workers are computational units that process the modeling workflow.
Workers responsible for EDA and uploading data are shared within an org.
Modeling workers are assigned by the admin to a specific user.
What are feature associations, and when are they calculated?
Feature associations are an output of EDA2, on the features that are deemed to be informative for modeling purposes from EDA1.
They give information about the correlation between features, using metrics like Cramér's V and mutual information.
How are missing values handled?
Models like XGBoost handle missing values natively.
For linear models, DR handles missing values as follows:
- Median imputation of the non-missing values
- Adding a missing-value flag, enabling the model to recognize patterns in structurally missing data
For tree-based models, DR imputes with an arbitrary value (e.g. -9999), which is algorithmically faster but gives just as accurate results.
For missing categorical values, DR treats missingness as another level of the category.
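A hedged illustration of these strategies using scikit-learn (this mirrors the described behavior, not DR's internal blueprints):

```python
# Hedged imputation sketch mirroring the card above.
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# Linear models: median imputation plus a missing-value indicator column.
median = SimpleImputer(strategy="median").fit_transform(X)
flag = MissingIndicator().fit_transform(X)

# Tree-based models: impute an arbitrary out-of-range constant instead.
arbitrary = SimpleImputer(strategy="constant", fill_value=-9999).fit_transform(X)
```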
My customer wants to do clustering, what should I tell them?
Try to figure out what the underlying business problem is that is driving the perceived need for clustering. For example, if a customer in marketing wants to do clustering and then choose certain clusters to market to, they will get much better results with a propensity model that directly predicts who will respond to a marketing effort. These models can give a huge lift over the naive clustering approach. In other words, you can find clusters like “clients likely to buy, clients not likely to buy” with supervised learning. Oftentimes cluster analysis has been used because better models were too hard to use, but DataRobot lets them use better models easily.
What is the difference between ‘ordinal encoder’ and ‘ordinal variable’?
Ordinal encoding refers to coding categorical features as numbers, an alternative to one-hot encoding. The phrase “ordinal variable” describes a categorical variable in which the values have an order, for example “good”, “better”, “best”.
What does DR do if there are missing values in my target?
Records with missing values in the target are ignored in EDA2 and modeling. This provides a nice hack: to train on a subset of data after it’s imported, you can derive a new feature that is set to a missing value if certain criteria are met (the criteria that define the records you want to omit). Now set that variable as the target and DR will drop records where that variable is missing. The easiest way to produce a missing value in DR is log(0), as follows: where({num_medications}<10,{readmitted},log(0))
What are the pros/cons of univariate selections?
Univariate selections are done quickly, and they capture non-linear relationships between each variable and the target, however they do not capture the importance of a variable in the presence of an interaction.
What data partition is used in the histograms on the Data tab?
Histograms produced via EDA1 use all of the data, up to 500MB. For datasets larger than 500MB, a random 500MB sample is used. For histograms produced by EDA2, all the data except for the holdout (if any) and rows with missing target values are used.
What do the histograms on the Data tab represent the sum of?
Row count or sum of exposures.
What is Fast EDA vs EDA1 vs EDA2?
EDA1 happens when the data is initially ingested. This is done on the full dataset (or a 500MB sample if the dataset is larger than 500MB). EDA1 determines feature types, summary statistics, and frequency distributions for the top 50 items, and identifies informative features. EDA2 is done on the same dataset as EDA1 but excludes the holdout and any rows where the target is missing. EDA2 recalculates summary stats and calculates ACE scores. Fast EDA applies to datasets over 5MB with <10k columns; it shows preliminary EDA results based on the uploaded subset of data, and when the upload is complete, EDA1 calculates normally and all EDA results reflect the full EDA1 process.
What is the default partitioning used in DataRobot?
By default, DataRobot creates a 20% holdout and five-fold cross-validation.
What is an “accuracy-optimized metablueprint” and how is it run?
It runs XGBoost models with a lower learning rate and more trees, as well as an XGBoost forest blueprint.
What does a weight do?
Weights are used to control how much influence each record has in model fitting.
What do the green “importance” bars represent on the Data page?
ACE scores, or “Alternating Conditional Expectations” scores, measure a feature’s correlation with the target. ACE scores are capable of detecting non-linear relationships but, as they are univariate, are unable to detect interaction effects.
What are the pros/cons of the green “importance” bars on the Data page?
Pros: ACE scores are computed quickly and can detect non-linear relationships with the target. Cons: as they are univariate, they are unable to detect interaction effects.
What do asterisks on the leaderboard metrics mean?
The asterisks essentially mean that the scores are evaluated on in-sample training data.
My data has no missing values, why does feature fit (and feature effects) show a missing category?
FE and FF show a missing value for numeric variables so that you can see the effect of scoring a record with a missing value in that field. For categorical variables, the mode is used when missing values are present, which is the value with the biggest bar in the histogram.
What data partition is used to calculate feature impact?
Feature Impact uses up to 2,500 rows selected from the training partition via smart sampling. Smart sampling tries to make the distribution for unbalanced binary targets closer to 50/50 and adjusts the sample weights used for scoring, similar to smart downsampling.
What does DR do with the model that is recommended for deployment?
DR identifies the most accurate non-blender model and prepares it for deployment in three steps. First, DR calculates feature impact. Second, DR retrains the model (on the same partition the last model was trained on) with a reduced feature list. Third, for non-time-aware models, DR takes the better of the two models (the original model and the reduced-feature-list model) and retrains it on data including the validation partition (if that doesn’t exceed the autopilot size threshold). For time-aware models, DR retrains on the most recent data.
What is the difference between “word cloud” and “text mining”?
These both show the same information in different formats. Text mining shows coefficients in a bar graph format. The word cloud shows the normalized version of those coefficients in a more creative format.
What is a “stop word” and what does “filter stop words” mean?
Stop words are the most common words that often have no value in text modeling, i.e. words like “the”, “at”, “and”, “of”, etc. Filtering stop words removes them from the word cloud.
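A tiny illustration using scikit-learn's built-in English stop-word list as a stand-in for DR's:

```python
# Stop-word filtering illustration; DR's own list may differ.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

tokens = "the claim was filed at the office of the insurer".split()
print([t for t in tokens if t not in ENGLISH_STOP_WORDS])
# -> ['claim', 'filed', 'office', 'insurer']
```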
What is CodeGen and DR Prime? compare/contrast.
Both are downloadable scoring code. CodeGen is not available for all models, but for those it is available for, it allows you to download Java code that will match API predictions exactly. DR Prime is a model that is run to approximate another model; DR Prime allows you to download Python or Java scoring code, but as this model is an approximation of another model, the predictions returned won’t match exactly. DR Prime is a good option when the model you want to deploy doesn’t support CodeGen but you need scoring code. Neither CodeGen nor DR Prime gives prediction explanations.
What are the pros and cons of the 4 DR deployment options?
(1) GUI: simple, but uses modeling workers. (2) Dedicated prediction server via API: fast and supports prediction explanations, but requires some coding for either the API call or the batch scoring script. (3) Scoring code: fast, but no prediction explanations. (4) Hadoop in-place scoring: brings the models to the data rather than moving the data to the models, but prediction explanations are not available (confirm).
What happens when we change the number of rules on a completed DR Prime model? Why would we want to change the number of rules?
When you change the number of rules, DR refits the rulefit classifier using the number of rules that you choose. You might want to change the number of rules if, for example, decreasing the number of rules leads to a simpler and easier to understand model while only suffering a minor decrease in accuracy.
What happens when I click “Add New Deployment”? What is the purpose of this?
This button allows you to upload prediction data and optionally training data for a model built outside of DR. This allows you to assess model performance via DR model management capabilities.
What do I do if I want to use a different model in deployment? Are there any requirements to do this?
This is done within the deployments section, see the docs. And if the replacement model differs from the current model—because of either features with different names or features with the same name but different data types—DataRobot issues a warning.
What does service health track?
Service health tracks basic functionality of the API pipeline, it does not evaluate model performance in any way. It tracks things like errors, latency, volume, etc.
What are data errors, system errors?
We capture the percentage of requests that returned a prediction request error (4xx) or that returned a server error (5xx).
What is cache hit rate?
The percentage of requests that used a cached model (the model was recently used by other predictions). If not cached, DataRobot has to look the model up, which can cause delays. The prediction server cache holds 16 models by default, dropping the least-recently-used model when the limit is reached.
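A minimal sketch of such a least-recently-used cache with the 16-model default mentioned above (load_model is a hypothetical loader, not a DR API):

```python
# LRU model-cache sketch; capacity mirrors the 16-model default above.
from collections import OrderedDict

class ModelCache:
    def __init__(self, load_model, capacity=16):
        self.load_model, self.capacity = load_model, capacity
        self.cache = OrderedDict()

    def get(self, model_id):
        if model_id in self.cache:
            self.cache.move_to_end(model_id)   # cache hit: mark as recently used
            return self.cache[model_id]
        model = self.load_model(model_id)       # cache miss: slow lookup
        self.cache[model_id] = model
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the least-recently-used
        return model
```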
What is model management all about and what is a “deployment” in DataRobot?
Model management is about monitoring your models once deployed for any evidence of problems. Those problems could be technical, in terms of latency or errors, or they could be that the data you’re scoring is very different from the data used to train the model. The latter doesn’t necessarily mean there is a problem, but it is something you want to look into and be aware of. The Deployment object was introduced in 2018 to make it easier to change models used in deployment. Prior to this change, deploying a model via API required the API call to be embedded in the customer’s systems, and the API pointed to a model in DataRobot. If you later wanted to change the model used to power those predictions, you had to change the parameters of the API call, which usually meant getting IT resources involved. With the introduction of the Deployment object, the API still requires IT resources for initial setup, but the API points to a deployment, not to a model. The Deployment then points to the model. This means that if you want to change the model used in deployment, you now do it by pointing the Deployment at a different model, which you do from within DataRobot, and you do not have to involve IT as no modifications to the API are needed.
What format does the data need to be in that I submit via API?
The data needs to be a CSV or JSON file.
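A hedged sketch of posting a CSV for scoring with the requests library; the host, path, deployment ID, and headers below are placeholders, not a documented DR endpoint:

```python
# Placeholder prediction request; swap in your real host, deployment, and token.
import requests

with open("to_score.csv", "rb") as f:  # hypothetical scoring file
    resp = requests.post(
        "https://example.datarobot.com/deployments/<DEPLOYMENT_ID>/predictions",
        headers={"Content-Type": "text/csv", "Authorization": "Bearer <API_TOKEN>"},
        data=f,
    )
print(resp.json())
```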
What are the thresholds for the red/yellow/green indicators on the dashboard? Can I change these?
The color coding on the main deployments dashboard gives an overview of all models’ performance and is not modifiable.
What are the thresholds for the red/yellow/green indicators on the Feature Drift? Can I change these?
By default, the drift threshold defaults to 0.15. The Y-axis scales from 0 to the higher of 0.25 and the highest observed drift value. If you are the project owner, you can click the gear icon in the upper right chart corner to change these values.
What is a Feature List and How Should it be Used?
A feature list is analogous to a playlist from a larger music library. It is a list of features, and like a playlist, you can have as many feature lists as you might want as long as those features exist in the larger ‘library’. The primary purpose of feature lists is to tell DR which features to use for modeling. But feature lists are also used to specify features that need to be monotonically increasing or decreasing.
What is a Frozen Run
A frozen run is created when you retrain a model on an increased sample of data but you leave the hyperparameters frozen from the prior run. Hyperparameters control how rigid or flexible the model is when it is fitting to data. One of the reasons you might do this is to save time fitting the model, particularly for larger datasets. Another is for regulation. You may need to ensure the same approved hyperparameters are applied when retraining on a larger set of data.
What is a rating table? What type of models generate rating tables?
Rating tables are generated by Generalized Additive Models. They look and feel very much like the output of a GLM: an intercept along with multiplicative coefficients. These were added to DR to support the insurance industry, as this is the format traditionally used for pricing plans.
What is step 6 of Analyzing Features actually doing?
“Creating CV and Holdout partitions” is actually partitioning the data into the different folds used for model evaluation and scoring.
What does DataRobot do with “Length” type features? (feet, inches, etc)
Currently, DataRobot will recognize a feet/inches length (such as 15’ 9”) and convert it automatically to inches, treating it as a numeric (so 15’ 9” becomes 189). This is the only time you will see a “Length” type feature.
What happens when I select AUC as an optimization metric?
If you select AUC, the models on the Leaderboard will be sorted by AUC and grid search will be done via AUC, but feature impact will use logloss. Also, the models themselves will use their own optimization criteria (e.g. gini or entropy for an RF, logloss for an elastic net).
What happens to my deployment if I delete a model that it is using?
DataRobot will not allow you to delete a model that is deployed.
What does snowflake mean near a model in the leaderboard tab?
This indicates a frozen run, which means the model is a retrained version of another model, where hyperparameters from the other model are frozen and the model is simply retrained on more observations.
Is it possible to provide a user-defined list of stop words to use in the word cloud?
Currently (May 2019) this functionality does not exist in DR.
What is Data Drift? How is this different than model drift?
Data Drift refers to changes in the distribution of prediction data vs. training data. If you see Data Drift alerts, it’s telling you that the data you’re making predictions on looks different from the data the model used to train. DR uses PSI, or “Population Stability Index”, to measure this. (This is an alert that you want to look into; perhaps you need to retrain your model to be better aligned with the new population.) Models themselves cannot drift; once they are fit, they are static. However, some might use the term model drift to refer to drift in the predictions, which would simply be an indication that the average predicted value is changing over time.
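A common formulation of PSI, as a hedged sketch (DR's exact binning and smoothing may differ):

```python
# Standard PSI formulation; not necessarily DR's exact implementation.
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between training ('expected') and scoring ('actual') distributions."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return np.sum((a - e) * np.log(a / e))

rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000)))  # drifted -> large PSI
```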
What does the hue of the color (light or dark red/blue) represent in the Hotspot plot?
The color of the rule indicates the relative difference between the average target value for the group defined by the rule and the overall population.
What is the difference between feature impact / feature importance / ace
First, ACE scores are a univariate measure of correlation between each feature and the response. These are not related to any models. They capture non-linearities, but as they are univariate, they do not measure predictive impact in interactions.
Feature Impact is calculated AFTER a model is done fitting. It perturbs the dataset and then uses the model to make predictions, measuring the overall impact on accuracy from each perturbation. This method directly measures each feature’s complete predictive power, and it can be applied to any model. While this is the best direct measure of a feature’s predictive power, it can take hours if the dataset is large or has many variables.
Tree-based variable importance is available for trees only, much like coefficients are available for linear models. Tree-based variable importance measures a variable’s importance indirectly by measuring how the variable is used in the tree (often vs. infrequently, etc.). While this doesn’t measure each feature’s predictive power directly, it is close, and it’s available immediately when the model is complete, unlike feature impact, which can take hours to run.
What are the pros & cons for downsampling and weighting
The pros are that it speeds up runtime. Because we apply a weight after downsampling, the resulting downsampled dataset essentially retains the same class balance when modeled. When you downsample, you randomly choose 1 record to represent n records. The assumption is that the 1 you kept is representative of the n-1 you discarded, or said differently, that the sample you kept is representative of the population in your dataset. This is a safe assumption for large datasets, but the more features you have, the more complex the features, or the noisier the target, the smaller your 1:n ratio should be.
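A hedged sketch of majority-class downsampling with compensating weights (the helper below mirrors the idea, not DR's smart downsampling implementation):

```python
# Downsample the majority class, then reweight so class balance is preserved.
import numpy as np
import pandas as pd

def downsample_with_weights(df, target, keep_frac=0.1, seed=0):
    minority = df[df[target] == 1]   # keep all rare-class rows
    majority = df[df[target] == 0].sample(frac=keep_frac, random_state=seed)
    out = pd.concat([minority, majority])
    # each kept majority row stands in for 1/keep_frac discarded rows
    out["weight"] = np.where(out[target] == 1, 1.0, 1.0 / keep_frac)
    return out
```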
Is it possible for users to change pre-processing method?
Users cannot modify blueprints, which is where pre-processing is found. However, often a desired preprocessing step not found in one blueprint may be found in another. If the user has preprocessing they want done that they don’t find in a blueprint, they should do this preprocessing outside of DR and load the processed dataset in. This becomes seamless when using the R/Python clients.
What happens if I run feature impact on a model before autopilot is done, will I have to wait until autopilot finishes before the feature impact calculation starts?
You will not have to wait. As of May 2019, the queue logic was modified so that feature impact calculations are highest priority and thus get processed by the first available workers.
What does the diagonal gray line in the ROC Curve represent?
This represents the result you’d see theoretically if your model were randomly guessing with each prediction.
What insights do the Learning Curves provide? How would you discuss their interpretation with a client?
The phrase “learning curve” is used to describe a few different things, but they all relate to how a model improves. Learning curves are often used to show model accuracy (both in-sample and out-of-sample) when tuning hyperparameters. Another variant on learning curves, and this is the one shown in DR, shows how a model’s performance improves as the model is trained on an increasing number of observations. This is useful to know because the question often comes up as to whether it would be worth training the model on more data.
What does the small ‘i’ symbol in the feature list signify?
This indicates that the feature has been derived, either by DR or by the user. DR automatically derives date-related features from dates, e.g. day of week, month of year, etc., and these are indicated with the ‘i’ icon. If a user creates a new feature via the “var type transform” functionality, or via the “create f(x) transform”, the icon identifies these features as well.
What are some custom feature transformations that you may recommend to a user with numeric features?
This very much depends on the problem. Any feature transforms that can be thought of outside the context of a specific problem are likely already built into DR. A few things are worth mentioning. (1) If a numeric variable has a partial dependence (from a top model) that looks like a step function, it could be worth creating a new categorical feature that maps the numeric to each of the steps. The idea is that while the top model found that (non-linear) pattern, other models may not have, so by giving the other models that variable as a binned categorical, you give them a chance to detect that relationship. (2) Sometimes customers ask for interactions beyond pairwise. GA2M models show meaningful 2-way interactions, so if you encode these 2-way interactions in the dataset and then run it through DR in a new project, the GA2M models can interact those 2-way variables with other variables and detect 3- and 4-way interactions.
What data types are automatically detected when uploading data? (e.g. text vs. categorical, numeric vs. categorical)
DataRobot automatically detects numeric, categorical, date, percentage, currency, length, and unstructured text types. An easy way to convert a numeric to a categorical is to append a letter to the values.
How does DR handle outliers in my target? and what does “calculate outliers” do?
By default, DR excludes outliers from the histograms. Pressing this button adds a toggle that lets you switch between the histogram with and without outliers. Note the histogram bins will likely change as you toggle.
What are stacked predictions?
Stacked predictions are essentially a safe way of making predictions on training data. We cannot use the final model to make predictions on the training partition because the data would be in-sample and would lead to overly optimistic predictions. Stacked predictions come from the model’s internal cross-validation.
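A minimal scikit-learn sketch of the same idea: each row is scored by the fold model that never saw it:

```python
# Out-of-fold ("stacked") predictions via cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
oof = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5,
                        method="predict_proba")[:, 1]
# `oof` is safe to use for insights on training rows: no row is scored in-sample
```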
How do I get help?
Contact support@datarobot.com, your CFDS, FE, AISM, or AE. Alternatively, use the “blowhorn” icon in the top right of the app.
How do I report a bug?
Contact your CFDS or FE or use the “blowhorn” icon in the top right of the app. Alternatively, can email AISM, AE, or support@datarobot.com
How do I suggest a feature?
Contact your CFDS or FE or use the “blowhorn” icon in the top right of the app. Alternatively, can email AISM, AE, or support@datarobot.com
I want to start my project over from the beginning, do I have to upload the data again?
No, from the manage projects page, click the ‘hamburger’ symbol and select ‘copy project’. This will make a new instance of the project, but will bring you back to the setup screens. Any feature lists created in the original project persist in the cloned project.
In the manage projects screen, I click a model but nothing happens, how do I open it?
When you click the model, you are now inside that project! Go to the Data or Models screen to view it.
How do I specify a weight in DR?
Weights are specified in Advanced Options. One column in your dataset will contain the weight that you want models to put on each record. Weights are used to control how much influence each record has in model fitting. This is not to be confused with optimization metrics, which also change the amount of influence each record has in the model fit. For example, changing from RMSE to gamma deviance in a regression model will cause records with very large response values to have less influence on the model fit, but for different reasons. In this example with the optimization metric, the gamma deviance metric is built on the assumption that larger response values are associated with larger variance, i.e. more noise, and the models fit with gamma deviance therefore put a premium on fitting smaller values, not larger ones, whereas models fit using RMSE try to fit all values equally. Contrast this with weights; for example, putting a weight of 2 on a record has the same effect as having that record in the dataset two times.
How do I set exposures and offsets?
Offsets and exposures are commonly used for insurance loss modeling. See the links for more information, but call in a SME to help with this.
How many explanations can I get for each prediction?
DR will give you up to ten.
Is there a dataset size limit on GUI drag-and-drop predictions?
1GB, as stated in the GUI on the batch predictions tab.
Is it possible to do feature transformation in DataRobot?
Yes - you can manually transform individual predictors using a number of built-in or user-defined mathematical functions.
How do I delete a project?
From the manage projects screen.
How does DR handle highly imbalanced data?
Class imbalance is an issue if we evaluate the models using simple metrics like accuracy %. However, DataRobot directly optimizes the models for objectives that are both aligned with the project metric and robust to imbalanced targets (such as logloss).
How do I force a feature to have a monotonic relationship with the target?
The high-level workflow is to create a feature list containing the features you want to be monotonically increasing (and another list for decreasing). In Advanced Options you give DR these feature lists in the monotonicity constraints section. These feature lists must be a subset of the feature list you use for modeling.
https://app.datarobot.com/docs/modeling/analyze-models/describe/monotonic.html
I choose an optimization metric before pressing start, so how can I now choose it again on the leaderboard?
The metric you set in Advanced Options prior to modeling is optimized, however once the models are fit, those models can be evaluated using any metric. When you change the metric on the leaderboard, you are asking DR to evaluate each model with that metric, which is very different from telling DR to fit the model to optimize the metric.
How is the lift chart calculated when I choose CV?
k models are built, each validated on a different CV fold. To score fold k, we use model k, which was built on data that excluded fold k. This means that multiple models are being used to create the lift chart on CV data.
How is the actual, predicted and partial dependance calculated for feature fit (and feature effects)?
Actual is the average actual response; predicted is the average predicted response. Partial dependence is computed for a particular feature by setting all rows to the same value for that feature and computing the average prediction, then iteratively doing the same for each possible value of the variable. This shows what happens to the average prediction as the value of that variable changes.
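A minimal sketch of that partial dependence computation for a numeric feature (model is any fitted estimator with a predict method):

```python
# Partial dependence: fix one feature's value for all rows, average predictions.
import numpy as np

def partial_dependence(model, X, feature_idx, grid):
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value          # force all rows to the same value
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)
```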
How is feature impact calculated?
Feature impact is calculated with a technique sometimes called “permutation importance”. It is calculated AFTER a model is built, and it is a technique that can be applied to any modeling algorithm. The idea is to take the dataset and ‘destroy the information’ in each column (by randomly shuffling the contents of the feature across the dataset), one at a time, make predictions on all the resulting records, and calculate the overall model performance. The permuted variable that had the largest impact on model performance is the most impactful feature, and so forth.
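A minimal permutation-importance sketch matching this description (metric is e.g. sklearn.metrics.mean_squared_error; the names are illustrative, not DR's implementation):

```python
# Permutation importance: shuffle one column at a time, measure the score drop.
import numpy as np

def permutation_importance(model, X, y, metric, rng=np.random.default_rng(0)):
    baseline = metric(y, model.predict(X))
    impacts = {}
    for j in range(X.shape[1]):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, j])              # destroy the information in column j
        impacts[j] = metric(y, model.predict(X_perm)) - baseline
    return impacts                              # bigger score drop = more impactful
```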
How should I determine how long a realtime prediction will take to score?
The best way to determine this is to test it in your environment.
How does DR decide which model to recommend for deployment?
DR first identifies the most accurate non-blender model and then prepares it for deployment; the resulting prepared model is labeled “Recommended for Deployment”. The rationale for this is that non-blenders are faster to score than blenders. Prediction latency isn’t a concern for some applications, so the user should understand this and choose accordingly.
How is tree-based variable importance calculated?
Tree-based variable importance, at a high level, considers how/where a variable is used: how often and in which parts of the tree. Tree-based variable importances generally use a node impurity measure (gini, entropy). This measure can be biased toward variables with many categories. A better approach is to use a permutation method like feature impact. Tree-based variable importance is only available for tree-based models, but it is available immediately after the model is finished. You do not have to wait for it as you do with feature impact, and this can be a big benefit if you have many features, as feature impact can sometimes take overnight to finish running.
How is tree-based variable importance different from feature impact?
FI is an approach that can be applied to any model and uses permutation importance. It is a direct measure of a feature’s impact on predictions, but it does require computation after model building that can take some time. Tree-based variable importance is only available for tree-based models, and it is a proxy for the impact a variable has on predictions, but it is available immediately when the models are finished fitting.
How do I make predictions once I’ve added a new deployment?
You either make predictions via a POST request (API call) or via the batch scoring script. (Note, if you’re making predictions from R or Python and you’re doing it USING OUR PACKAGES, then you are not hitting the prediction server, you’re using modeling workers.)
How does DataRobot perform Cross Validation?
DR by default uses a 20% holdout and 5-fold CV with stratified sampling. There are several different methods that allow you to separate your training data into different roles while maintaining awareness of different ‘groups’ in your data.
How does DataRobot interact with SAS or what can I do with my SAS models?
How does DataRobot select the metric to optimize for as well as the candidate models to run?
At a high level, if it is a binary classification problem we always optimize logloss. If it's regression we start with RMSE, unless the data is very skewed, in which case we lean toward Poisson or Gamma. We use Tweedie if the distribution is zero-inflated.
How is DataRobot determining which models to train on 16/32/64/80 percents of data?
DataRobot trains on 16/32/64 as part of its autopilot, but it will start higher than 16 with smaller datasets. The model recommended for deployment (the most accurate non-blender) is then retrained at 80%.
How to see which exact data (after preprocessing) DR passed into the model?
Since DR does data preprocessing and feature engineering, the data that is fed into the models will be derived from, but different than, the data the user uploaded. While you cannot access this derived dataset currently (May 2019), you can see which variables were used by the models. In the Insights tab, both the Tree-Based Variable Importance and Variable Effects sections show the derived variables used by the models.
How does DataRobot determine which threshold to use for binary classification problem?
There are two thresholds on the ROC tab:
(a) Threshold: this is interactive and by default set to the threshold that maximizes the F1 score. Note that this does not impact predictions; it is solely for analysis in the GUI.
(b) Threshold used for predictions: this defaults to 0.5 and can be set by the user. This is the threshold used when DR returns predictions. (DR predictions consist of both probabilities and a y/n classification, and it’s this classification that uses this threshold.)
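A sketch of finding the F1-maximizing display threshold in (a), using scikit-learn's precision-recall utilities:

```python
# Pick the probability threshold that maximizes F1 on held-out predictions.
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_maximizing_threshold(y_true, y_prob):
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]  # the last P/R point has no threshold
```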