Modeling Process Flashcards

1
Q

How are the DR Reduced Features Created?

A

DR takes the best performing non-blender model from the leaderboard, and creates a feature list using the Feature Impact Score.

Feature Impact is calculated using permutation importances, and DR uses the set of features that provides 95% of the accumulated impact for the model. If that set contains more than 100 features, only the top 100 are used.

In the case of redundant features, DR automatically removes them.

2
Q

What are informative features?

A

These are determined using ACE scores during EDA2, along with 'reasonableness' checks from EDA1.

DR automatically removes features that are redundant, have too few values, or are unique identifiers.

DR also looks for and removes features that may present target leakage.

3
Q

What are Raw Features?

A

These are all the features/columns from the user-inputted dataset. These exclude user-derived features, but include those features that were deemed to be non-informative by DR.

4
Q

What are Univariate Selections/Feature Importances?

A

These are calculated during EDA2, which is after the user clicks on the start modeling button.

They make use of ACE (Alternating Conditional Expectations), which detects non-linear correlation with the target variable. A feature has to meet a certain threshold (0.005) to be deemed informative. During EDA2, DR calculates the feature importance for all variables in the informative feature list against the target and displays them on the project Data page.

5
Q

What is EDA1 & when does it happen?

A

EDA1 occurs after the user imports their dataset into the DR platform for modeling.

If the dataset is larger than 500MB, DR takes a sample and performs the subsequent calculations on it.

The steps that happen during this phase include:

  1. Inferring the feature schema type (categorical, numeric, text, etc.)
  2. Calculating summary statistics for numeric features
  3. Frequency distributions for the top 50 items
  4. Column validity checks (duplicate columns, too many unique values, etc.)

For date features, DR automatically performs the date/time feature transformations (e.g., day of week, month of year).

6
Q

What are the size limits for EDA1/EDA2?

A

DR takes a sample of up to 500MB of data for EDA1/EDA2.

7
Q

What is EDA2 & when does it happen?

A

EDA2 happens after the user presses the Start button. DR uses the data sampled in EDA1, but excludes the data that will be in the holdout set (to prevent data snooping).

It performs many calculations:

  1. Re-calculation of the numerical stats done in EDA1
  2. Feature correlation with the target (the feature importance / ACE calculation)
8
Q

What are the four types of modeling modes in DR?

A
  1. Quick
  2. AutoPilot
  3. Manual
  4. Comprehensive
9
Q

What are the types of models that are supported in DR?

A
  1. Regression
  2. Time Series
  3. Binary Classification
  4. Multi-class Classification
  5. Anomaly Detection
10
Q

How does Quick AutoPilot Mode work?

A

DR selects a subset of models for its blueprints, and runs only one round of elimination.

It uses 32% of the training data in the first round, and advances the top four models from that round to the second round. The top two winning models are then blended together.

11
Q

How does the AutoPilot Mode work?

A

DR selects a candidate set of blueprints after looking at the target variable and the schema types of the input features.

Autopilot by default runs on the informative feature list, which is calculated during the EDA1/EDA2 process.

It runs through three rounds of elimination in the leaderboard, first starting out with 16% of the data, and selecting the top 16 models to go to the next round.

In the second round, it feeds 32% of the training data to the models, and chooses the top 8.

In the last round, the top 8 models are fed all of the training data (64% of the total), and the final results are calculated.

Blenders are created from the top models of the final round.

Models are initially capped at 500MB worth of data; this can be changed either from the Repository or, after a model has completed, from the Leaderboard.

12
Q

What are the benefits of using a ‘survival of the fittest’ approach?

A

Beyond being a marketing gimmick,

  1. This increases the number of model types you can try out quickly
  2. You can visualize a learning curve that shows how your loss metric improves as more data is used, and whether it would be worth investing in more data
  3. Faster run time, as the initial models are capped to 500MB worth of data
13
Q

How does the Manual modeling mode work?

A

This gives the user full control over which model to execute. You can choose from the repository which models you want to try out.

14
Q

How does Comprehensive modeling mode work?

A

This runs all repository blueprints on the max Autopilot sample size (64%). This will result in extended build times.

15
Q

When does DR use H2O or SparkML models?

A

These are specific to Hadoop installations and have to be specified using the Advanced options.

16
Q

What are workers, and how are they used in the modeling process?

A

Workers are computational units that process the modeling workflow.

Workers responsible for EDA and uploading data are shared in an org.

Modeling workers are assigned by the admin to a specific user.

17
Q

What are feature associations, and when are they calculated?

A

Feature associations are an output of EDA2, on the features that are deemed to be informative for modeling purposes from EDA1.

They give information about the correlation between features, using metrics like Cramér's V and mutual information.

18
Q

How are missing values handled?

A

Models like XGBoost handle missing values natively.

For linear models, DR handles missing values based upon the case:

  1. Median imputation of the non-missing values
  2. Adding a missing-value flag, enabling the model to recognize patterns in structurally missing data

For tree-based models, DR imputes with an arbitrary value (e.g., -9999), which is algorithmically faster but gives just as accurate results.

For missing categorical variables, DR treats missing as another level in the categories.
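A minimal pandas sketch of those strategies on a toy dataset (the column names and the -9999 sentinel are illustrative, not DR's internal preprocessing):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 52, 41],
    "income": [50000, 62000, np.nan, 58000],
    "state": ["MA", None, "NY", "MA"],
})

# Tree-model style: impute with an arbitrary sentinel value such as -9999
df_tree = df.copy()
df_tree[["age", "income"]] = df_tree[["age", "income"]].fillna(-9999)

# Linear-model style: median imputation plus a missing-value indicator flag
df_linear = df.copy()
for col in ["age", "income"]:
    df_linear[f"{col}_missing"] = df_linear[col].isna().astype(int)  # flag missing rows
    df_linear[col] = df_linear[col].fillna(df_linear[col].median())

# Categorical: treat missing as just another level
for frame in (df_tree, df_linear):
    frame["state"] = frame["state"].fillna("MISSING")
```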

19
Q

My customer wants to do clustering, what should I tell them?

A

Try to figure out what the underlying business problem is that is driving the perceived need for clustering. For example, if a customer in marketing wants to do clustering and then choose certain clusters to market to, they will get much better results with a propensity model that directly predicts who will respond to a marketing effort. These models can give a huge lift over the naive clustering approach. In other words, you can find clusters like "clients likely to buy, clients not likely to buy" with supervised learning. Often, cluster analysis has been used because better models were too hard to use, but DataRobot lets them use better models easily.

20
Q

What is the difference between ‘ordinal encoder’ and ‘ordinal variable’?

A

Ordinal encoding refers to encoding categorical features as numbers, an alternative to one-hot encoding. The phrase "ordinal variable" describes a categorical variable in which the values have an order, for example "good", "better", "best".

21
Q

What does DR do if there are missing values in my target?

A

Records with missing values in the target are ignored in EDA2 and modeling. This provides a nice hack: to train on a subset of data after it's imported, you can derive a new feature which is set to a missing value if certain criteria are met (the criteria that define the records you want to omit). Now set that variable as the target and DR will drop the records where that variable is missing. The easiest way to produce a missing value in DR is log(0), as follows: where({num_medications}<10,{readmitted},log(0))
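For illustration, the same idea expressed in pandas against a hypothetical dataframe (the column names mirror the example above; this is not DR's transform syntax):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num_medications": [5, 12, 8, 20],
    "readmitted": [1, 0, 0, 1],
})

# Same idea as where({num_medications}<10, {readmitted}, log(0)):
# keep the target only for the rows you want to train on, missing elsewhere
df["readmitted_subset"] = np.where(df["num_medications"] < 10,
                                   df["readmitted"], np.nan)

# Rows where the derived target is missing would be dropped from training
train = df.dropna(subset=["readmitted_subset"])
```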

22
Q

What are the pros/cons of univariate selections?

A

Univariate selections are done quickly, and they capture non-linear relationships between each variable and the target, however they do not capture the importance of a variable in the presence of an interaction.

23
Q

What data partition is used in the histograms on the Data tab?

A

Histograms produced via EDA1 use all of the data, up to 500MB. For datasets larger than 500MB, a random 500MB sample is used. For histograms produced by EDA2, all the data except for the holdout (if any) and rows with missing target values are used.

24
Q

What do the histograms on the Data tab represent the sum of?

A

Row count or sum of exposures.

25
Q

What is Fast EDA vs EDA1 vs EDA2?

A

EDA1 happens when the data is initially ingested. This is done on the full dataset (or a 500MB sample if the dataset is over 500MB). EDA1 determines feature type, summary statistics, frequency distributions for the top 50 items, and identifies informative features. EDA2 is done on the same dataset as EDA1 but excludes the holdout and any rows where the target is missing. EDA2 recalculates summary stats and calculates ACE scores. Fast EDA applies to datasets over 5MB with <10k columns; it simply shows preliminary EDA results based on the uploaded subset of data. When the upload is complete, EDA1 calculates normally and all EDA results reflect the full EDA1 process.

26
Q

What is the default partitioning used in DataRobot?

A

By default, DataRobot creates a 20% holdout and five-fold cross-validation.
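A rough scikit-learn analogue of that default scheme, just to make the mechanics concrete (the synthetic data and model are placeholders; DR's internal partitioning logic is its own):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 20% holdout, kept aside until the very end
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Five-fold cross-validation on the remaining 80%
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train,
                         cv=cv, scoring="neg_log_loss")
print(scores.mean())
```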

27
Q

What is an “accuracy-optimized metablueprint” and how is it run?

A

It runs XGBoost models with a lower learning rate and more trees, as well as an XGBoost forest blueprint.

28
Q

What does a weight do?

A

Weights are used to control how much influence each record has in model fitting.

29
Q

What do the green “importance” bars represent on the Data page?

A

ACE scores, or "Alternating Conditional Expectations" scores, measure a feature's correlation with the target. ACE scores are capable of detecting non-linear relationships, but as they are univariate, they are unable to detect interaction effects.

30
Q

What are the pros/cons of the green “importance” bars on the Data page?

A

ACE scores, or "Alternating Conditional Expectations" scores, measure a feature's correlation with the target. ACE scores are capable of detecting non-linear relationships, but as they are univariate, they are unable to detect interaction effects.

31
Q

What do asterisks on the leaderboard metrics mean?

A

The asterisks essentially mean that the scores are evaluated on in-sample training data.

32
Q

My data has no missing values, why does feature fit (and feature effects) show a missing category?

A

FE and FF show a missing value for numeric variables so that you can see the effect of scoring a record with a missing value in that field. For categorical variables, the mode is used when missing values are present, which is the value with the biggest bar in the histogram.

33
Q

What data partition is used to calculate feature impact?

A

Feature Impact uses up to 2500 rows selected from the training partition via smart sampling. Smart sampling tries to make the distribution for unbalanced binary targets closer to 50/50 and adjusts the sample weights used for scoring, similar to smart downsampling.

34
Q

What does DR do with the model that is recommended for deployment?

A

DR identifies the most accurate non-blender model and prepares it for deployment in three steps. First, DR calculates feature impact. Second, DR retrains the model (on the same partition the last model was trained on) on a reduced feature list. Third, for non-time-aware models, DR takes the better of the two models (original model and reduced-feature-list model) and retrains it on data including the validation partition (if that doesn't exceed the autopilot size threshold). For time-aware models, DR retrains on the most recent data.

35
Q

What is the difference between “word cloud” and “text mining”?

A

These both show the same information in different formats. Text mining shows coefficients in a bar-graph format. The word cloud shows a normalized version of those coefficients in a more creative format.

36
Q

What is a “stop word” and what does “filter stop words” mean?

A

Stop words are very common words that often have no value in text modeling, i.e., words like "the", "at", "and", "of", etc. Filtering stop words removes them from the word cloud.
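As a generic illustration of the technique (not DR's internal text preprocessing), scikit-learn can drop English stop words during vectorization:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the claim was filed at the office",
        "claim denied because of missing forms"]

# stop_words="english" drops common words such as "the", "at", "of"
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
```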

37
Q

What is CodeGen and DR Prime? compare/contrast.

A

Both are downloadable scoring code. CodeGen is not available for all models, but for those it is available for, it allows you to download Java code that will match API predictions exactly. DR Prime is a model that is run to approximate another model; DR Prime allows you to download Python or Java scoring code, but as this model is an approximation of another model, the predictions returned won't match exactly. DR Prime is a good option when the model you want to deploy doesn't support CodeGen but you need scoring code. Neither CodeGen nor DR Prime gives prediction explanations.

38
Q

What are the pros and cons of the 4 DR deployment options?

A

(1) GUI - simple, but uses modeling workers. (2) Dedicated prediction server via API - fast and supports prediction explanations, but requires some coding for either the API call or the batch scoring script. (3) Scoring code - fast, but no prediction explanations. (4) Hadoop in-place scoring - brings the models to the data rather than moving the data to the models, but prediction explanations are not available (confirm).

39
Q

What happens when we change the number of rules on a completed DR Prime model? Why would we want to change the number of rules?

A

When you change the number of rules, DR refits the rulefit classifier using the number of rules that you choose. You might want to change the number of rules if, for example, decreasing the number of rules leads to a simpler and easier to understand model while only suffering a minor decrease in accuracy.

40
Q

What happens when I click “Add New Deployment”? What is the purpose of this?

A

This button allows you to upload prediction data and optionally training data for a model built outside of DR. This allows you to assess model performance via DR model management capabilities.

41
Q

What do I do if I want to use a different model in deployment? Are there any requirements to do this?

A

This is done within the deployments section, see the docs. And if the replacement model differs from the current model—because of either features with different names or features with the same name but different data types—DataRobot issues a warning.

42
Q

What does service health track?

A

Service health tracks basic functionality of the API pipeline, it does not evaluate model performance in any way. It tracks things like errors, latency, volume, etc.

43
Q

What are data errors, system errors?

A

We capture the percentage of requests that returned a prediction request error (4xx) or that returned a server error (5xx).

44
Q

What is cache hit rate?

A

The percentage of requests that used a cached model (the model was recently used by other predictions). If not cached, DataRobot has to look the model up, which can cause delays. The prediction server cache holds 16 models by default, dropping the least-used model when the limit is reached.

45
Q

What is model management all about and what is a “deployment” in DataRobot?

A

Model management is about monitoring your models once deployed for any evidence of problems. Those problems could be technical, in terms of latency or errors, or they could be that the data you're scoring is very different from the data used to train the model. The latter doesn't necessarily mean there is a problem, but it is something you want to look into and be aware of. The Deployment object was introduced in 2018 to make it easier to change models used in deployment. Prior to this change, deploying a model via API required the API call to be embedded in the customer's systems, and the API pointed to a model in DataRobot. If you later wanted to change the model used to power those predictions, you had to change the parameters of the API call, which usually would mean getting IT resources involved. With the introduction of the Deployment object, the API still requires IT resources for initial setup, but the API points to a deployment, not to a model. The Deployment then points to the model. This means that if you want to change the model used in deployment, you now do it by pointing the Deployment at a different model, which you do from within DataRobot, and you do not have to involve IT as no modifications to the API are needed.

46
Q

What format does the data need to be in that I submit via API?

A

The data needs to be a CSV or JSON file

47
Q

What are the thresholds for the red/yellow/green indicators on the dashboard? Can I change these?

A

The color coding on the main deployments dashboard gives an overview of all models' performance and is not modifiable.

48
Q

What are the thresholds for the red/yellow/green indicators on the Feature Drift? Can I change these?

A

The drift threshold defaults to 0.15. The Y-axis scales from 0 to the higher of 0.25 and the highest observed drift value. If you are the project owner, you can click the gear icon in the upper-right corner of the chart to change these values.

49
Q

What is a Feature List and How Should it be Used?

A

A feature list is analogous to a playlist from a larger music library. It is a list of features, and like a playlist, you can have as many feature lists as you want as long as those features exist in the larger 'library'. The primary purpose of feature lists is to tell DR which features to use for modeling. But feature lists are also used to specify features that need to be monotonically increasing or decreasing.

50
Q

What is a Frozen Run?

A

A frozen run is created when you retrain a model on an increased sample of data but you leave the hyperparameters frozen from the prior run. Hyperparameters control how rigid or flexible the model is when it is fitting to data. One of the reasons you might do this is to save time fitting the model, particularly for larger datasets. Another is for regulation. You may need to ensure the same approved hyperparameters are applied when retraining on a larger set of data.

51
Q

What is a rating table? What type of models generate rating tables?

A

Rating tables are generated by Generalized Additive Models. They look and feel very much like the output of a GLM: an intercept along with multiplicative coefficients. These were added to DR to support the insurance industry, as this is the format traditionally used for pricing plans.

52
Q

What is step 6 of Analyzing Features actually doing?

A

"Creating CV and Holdout partitions": this step is actually partitioning the data into the different folds for model evaluation and scoring.

53
Q

What does DataRobot do with “Length” type features? (feet, inches, etc)

A

Currently, DataRobot will recognize a feet/inches length (such as 15' 9") and convert it automatically to inches, treating it as a numeric (so 15' 9" becomes 189). This is the only time you will see a "Length" type feature.
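A tiny illustrative parser for that kind of conversion (the function and regex are hypothetical, not DR's implementation):

```python
import re

def feet_inches_to_inches(value: str) -> int:
    """Convert a string like 15' 9" into total inches (15*12 + 9 = 189)."""
    match = re.match(r"""\s*(\d+)\s*'\s*(\d+)\s*"?\s*$""", value)
    if not match:
        raise ValueError(f"not a feet/inches length: {value!r}")
    feet, inches = map(int, match.groups())
    return feet * 12 + inches

print(feet_inches_to_inches("15' 9\""))  # 189
```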

54
Q

What happens when I select AUC as an optimization metric?

A

If you select AUC, the models on the Leaderboard will be sorted by AUC, grid search will be done via AUC, and feature impact will be via logloss. Also, the models themselves will use their own optimization metric (e.g. gini or entropy for a RF, logloss for elastic net)

55
Q

What happens to my deployment if I delete a model that it is using?

A

DataRobot will not allow you to delete a model that is deployed.

56
Q

What does snowflake mean near a model in the leaderboard tab?

A

This indicates a frozen run, which means the model is a retrained version of another model, where hyperparameters from the other model are frozen and the model is simply retrained on more observations.

57
Q

Is it possible to provide a user-defined list of stop words to use in the word cloud?

A

Currently (May 2019) this functionality does not exist in DR.

58
Q

What is Data Drift? How is this different than model drift?

A

Data Drift refers to changes in the distribution of prediction data vs. training data. If you see Data Drift alerts, it's telling you that the data you're making predictions on looks different from the data the model was trained on. DR uses PSI, or "Population Stability Index", to measure this. (This is an alert that you want to look into; perhaps you need to retrain your model to better align with the new population.) Models themselves cannot drift; once they are fit, they are static. However, some might use the term model drift to refer to drift in the predictions, which would simply be an indication that the average predicted value is changing over time.
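A minimal sketch of a PSI calculation for one feature, comparing a training sample to a scoring sample (the binning choices are illustrative; DR's exact implementation may differ):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a baseline and a new sample."""
    # Bin edges come from the baseline (training) distribution
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    # Clip the new sample into the baseline's range so every value lands in a bin
    actual = np.clip(actual, edges[0], edges[-1])
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Small floor avoids division by zero / log(0)
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
scoring_feature = rng.normal(0.3, 1.0, 10_000)  # shifted distribution in production
print(psi(train_feature, scoring_feature))
```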

59
Q

What does the hue of the color (light or dark red/blue) represent in the Hotspot plot?

A

The color of the rule indicates the relative difference between the average target value for the group defined by the rule and the overall population.

60
Q

What is the difference between feature impact / feature importance / ace

A

First, ACE scores are a univariate measure of correlation between each feature and the response. These are not related to any models. They capture non-linearities, but as they are univariate, they do not measure predictive impact in interactions.

Feature Impact is calculated AFTER a model is done fitting. It perturbs the dataset, then uses the model to make predictions and measures the overall impact on accuracy from each perturbation. This method directly measures each feature's complete predictive power and can be applied to any model. While this is the best direct measure of a feature's predictive power, it can take hours if the dataset is large or has many variables.

Tree-based variable importance is available for trees only, much like coefficients are available for linear models. It measures a variable's importance indirectly by looking at how the variable is used in the tree (often vs. infrequently, etc.). While this doesn't measure each feature's predictive power directly, it is close, and it's available immediately when the model is complete, unlike Feature Impact, which can take hours to run.

61
Q

What are the pros & cons for downsampling and weighting

A

The pros are that it speeds up runtime. Because we apply a weight after downsampling, the resulting downsampled dataset essentially retains the same class balance when modeled. When you downsample, you randomly choose 1 record to represent n records. The assumption is that the 1 you kept is representative of the n-1 you discarded, or said differently, that the sample you kept is representative of the population in your dataset. This is a safe assumption for large datasets, but the more features you have, the more complex the features, or the noisier the target, the smaller your 1:n ratio should be.
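A small pandas sketch of downsampling the majority class and adding a compensating weight (the 2% positive rate and 10% keep rate are arbitrary illustrations):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"target": (rng.random(100_000) < 0.02).astype(int)})
df["weight"] = 1.0

keep_rate = 0.10  # keep 10% of the majority class
majority = df[df["target"] == 0].sample(frac=keep_rate, random_state=0).copy()
majority["weight"] = 1.0 / keep_rate          # each kept row stands in for 10 rows
downsampled = pd.concat([df[df["target"] == 1], majority])

# The weighted class balance matches the original dataset
print(df["target"].mean(),
      np.average(downsampled["target"], weights=downsampled["weight"]))
```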

62
Q

Is it possible for users to change pre-processing method?

A

Users cannot modify blueprints, which is where pre-processing is found. However, often a desired preprocessing step not found in one blueprint may be found in another. If the user has preprocessing they want done that they don’t find in a blueprint, they should do this preprocessing outside of DR and load the processed dataset in. This becomes seamless when using the R/Python clients.

63
Q

What happens if I run feature impact on a model before autopilot is done, will I have to wait until autopilot finishes before the feature impact calculation starts?

A

You will not have to wait. As of May 2019, the queue logic was modified so that feature impact calculations are highest priority and thus get processed by the first available workers.

64
Q

What does the diagonal gray line in the ROC Curve represent?

A

This represents the result you’d see theoretically if your model were randomly guessing with each prediction.

65
Q

What insights do the Learning Curves provide? How would you discuss their interpretation with a client?

A

The phrase "learning curve" is used to describe a few different things, but they all relate to how a model improves. Learning curves are often used to show model accuracy (both in-sample and out-of-sample) when tuning hyperparameters. Another variant on learning curves, and this is the one shown in DR, shows how a model's performance improves as the model is trained on an increasing number of observations. This is useful to know because often the question will come up as to whether it would be worth training the model on more data or not.

66
Q

What does the small ‘i’ symbol in the feature list signify?

A

This indicates that the feature has been derived, either by DR or by the user. DR automatically derives date-related features from dates, e.g. day of week, month of year, etc., and these are indicated with the 'i' icon. If a user creates a new feature via the "var type transform" functionality, or via the "create f(x) transform", the icon identifies these features as well.

67
Q

What are some custom feature transformations that you may recommend to a user with numeric features?

A

This very much depends on the problem. Any feature transforms that can be thought of outside of the context of a specific problem are likely already built into DR. A few things to mention: (1) if a numeric variable has a partial dependence (from a top model) that looks like a step function, it could be worth creating a new categorical feature which maps the numeric to each of the steps. The idea is that while the top model found that (non-linear) pattern, other models may not have, so by giving the other models that variable as a binned categorical, you give the other models a chance to detect that relationship. (2) sometimes customers ask for interactions beyond pairwise. GA2M models show meaningful 2-way interactions, so if you encode these 2-way interactions in the dataset and then run it through DR in a new project, the GA2M models can now interact those 2-way variables with other variables, and detect 3- and 4-way interactions.

68
Q

What data types are automatically detected when uploading data? (e.g. text vs. categorical, numeric vs. categorical)

A

DataRobot automatically detects numeric, categorical, date, percentage, currency, length, and unstructured text types. An easy way to convert a numeric to a categorical is to add a letter.

69
Q

How does DR handle outliers in my target? and what does “calculate outliers” do?

A

By default, DR excludes outliers from the histograms. Pressing this button gives you a toggle that lets you switch between the histogram with and without outliers. Note the histogram bins will likely change as you toggle.

70
Q

What are stacked predictions?

A

Stacked predictions are essentially a safe way of making predictions on training data. We cannot use the final model to make predictions on the training partition because the data would be in-sample and would lead to overly optimistic predictions. Stacked predictions come from the model’s internal cross-validation.
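In scikit-learn terms, out-of-fold ("stacked") predictions can be produced with cross_val_predict, where each row is scored by a model that never trained on it (a sketch of the concept, not DR's implementation):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

# Each row's prediction comes from the fold model that did not see that row
stacked = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                            cv=5, method="predict_proba")[:, 1]
print(stacked[:5])
```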

71
Q

How do I get help?

A

Contact support@datarobot.com, your CFDS, FE, AISM, or AE. Alternatively, use the “blowhorn” icon in the top right of the app.

72
Q

How do I report a bug?

A

Contact your CFDS or FE or use the “blowhorn” icon in the top right of the app. Alternatively, can email AISM, AE, or support@datarobot.com

73
Q

How do I suggest a feature?

A

Contact your CFDS or FE or use the “blowhorn” icon in the top right of the app. Alternatively, can email AISM, AE, or support@datarobot.com

74
Q

I want to start my project over from the beginning, do I have to upload the data again?

A

No, from the manage projects page, click the ‘hamburger’ symbol and select ‘copy project’. This will make a new instance of the project, but will bring you back to the setup screens. Any feature lists created in the original project persist in the cloned project.

75
Q

In the manage projects screen, I click a model but nothing happens, how do I open it?

A

When you click the model, the project is now in that model! Go to the data or models screen to view that project.

76
Q

How do I specify a weight in DR?

A

Weights are specified in Advanced Options. One column in your dataset will contain the weight that you want models to put on each record. Weights are used to control how much influence each record has in model fitting. This is not to be confused with optimization metrics, which also change the amount of influence each record has in the model fit. For example, changing from RMSE to gamma deviance in a regression model will cause records with very large response values to have less influence in the model fit, but for different reasons: the gamma deviance metric is built on the assumption that larger response values are associated with larger variance, i.e., more noise, and models fit with gamma deviance therefore put a premium on fitting smaller values, not larger ones, whereas models fit using RMSE try to fit all values equally. Contrast this with weights; for example, putting a weight of 2 on a record has the same effect as having that record in the dataset twice.

77
Q

How do I set exposures and offsets?

A

Offsets and exposures are commonly used for insurance loss modeling. See the links for more information, but call in a SME to help with this.

78
Q

How many explanations can I get for each prediction?

A

DR will give you up to ten

79
Q

Is there a dataset size limit on GUI drag and drop predictions?

A

1GB, as stated in the GUI on the batch predictions tab.

80
Q

Is it possible to do feature transformation in DataRobot?

A

Yes - you can manually transform individual predictors using a number of built-in or user-defined mathematical functions.

81
Q

How do I delete a project?

A

From the manage projects screen

82
Q

How does DR handle highly imbalanced data?

A

Class imbalance is an issue if we evaluate the models using simple metrics like accuracy %. However, DataRobot directly optimizes the models for objectives that are both aligned with the project metric and robust to imbalanced targets (such as logloss).

83
Q

How do I force a feature to have a monotonic relationship with the target?

A

The high-level workflow is to create a feature list containing the features you want to be monotonically increasing (and another list for decreasing). In Advanced Options, you give DR these feature lists in the monotonicity constraints section. These feature lists will be a subset of the feature list you use for modeling.

https://app.datarobot.com/docs/modeling/analyze-models/describe/monotonic.html

84
Q

I chose an optimization metric before pressing Start, so how can I choose it again on the Leaderboard?

A

The metric you set in Advanced Options prior to modeling is optimized, however once the models are fit, those models can be evaluated using any metric. When you change the metric on the leaderboard, you are asking DR to evaluate each model with that metric, which is very different from telling DR to fit the model to optimize the metric.

85
Q

How is the lift chart calculated when I choose CV?

A

k models are built, each validated on a different CV fold. To score fold k, we use model k, which was built on data that excluded fold k. This means that multiple models are being used to create the lift chart on CV data.

86
Q

How is the actual, predicted and partial dependance calculated for feature fit (and feature effects)?

A

Actual is the average actual response; predicted is the average predicted response. Partial dependence is computed for a particular feature by setting all rows to the same value for that feature and computing the average prediction, then iteratively doing the same for each possible value of the variable. This shows what happens to the average prediction as the value of that variable changes.
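A minimal sketch of that partial dependence computation for one numeric feature (the model and synthetic data are placeholders):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

feature = 0
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)

partial_dependence = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature] = value                                # set every row to the same value
    partial_dependence.append(model.predict(X_mod).mean())   # average prediction at that value

print(list(zip(grid[:3], partial_dependence[:3])))
```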

87
Q

How is feature impact calculated?

A

Feature impact is calculated with a technique sometimes called "permutation importance". It is calculated AFTER a model is built, and it can be applied to any modeling algorithm. The idea is to take the dataset and 'destroy the information' in each column, one at a time (by randomly shuffling the contents of the feature across the dataset), make predictions on all the resulting records, and calculate the overall model performance. The permuted variable that had the largest impact on model performance is the most impactful feature, and so forth.
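A bare-bones sketch of permutation importance on a fitted model (illustrative only; as noted elsewhere, DR also applies smart sampling before this step):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

baseline = log_loss(y, model.predict_proba(X)[:, 1])
rng = np.random.default_rng(0)

impacts = {}
for j in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, j])                      # destroy the information in column j
    permuted = log_loss(y, model.predict_proba(X_perm)[:, 1])
    impacts[f"feature_{j}"] = permuted - baseline  # bigger accuracy hit = more impact

print(sorted(impacts.items(), key=lambda kv: -kv[1]))
```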

88
Q

How should I determine how long a realtime prediction will take to score?

A

The best way to determine this is to test it in your environment.

89
Q

How does DR decide which model to recommend for deployment?

A

DR first identifies the most accurate non-blender model and then prepares it for deployment; the resulting prepared model is labeled "Recommended for Deployment". The rationale is that non-blenders are faster to score than blenders. Prediction latency isn't a concern for some applications, so the user should understand this and choose accordingly.

90
Q

How is tree-based variable importance calculated?

A

Tree-based variable importance, at a high level, considers how and where a variable is used: how often and in which parts of the tree. Tree-based variable importances generally use a node impurity measure (Gini, entropy). This measure can be biased towards variables with a lot of categories. A better approach is to use a permutation method like feature impact. Tree-based variable importance is only available for tree-based models, but it is available immediately after the model is finished. You do not have to wait for it as you do with feature impact, which can be a big benefit if you have many features, as feature impact can sometimes take overnight to finish running.

91
Q

How is tree-based variable importance different from feature impact?

A

FI is an approach that can be applied to any model and uses permutation importance. It is a direct measure of a feature’s impact on predictions, but it does require computation after model building that can take some time. Tree-based variable importance is only available for tree-based models, and it is a proxy for the impact a variable has on predictions, but it is available immediately when the models are finished fitting.

92
Q

How do I make predictions once I’ve added a new deployment?

A

You either make predictions via a POST request (API call) or via the batch scoring script. (Note, if you’re making predictions from R or Python and you’re doing it USING OUR PACKAGES, then you are not hitting the prediction server, you’re using modeling workers.)

93
Q

How does DataRobot perform Cross Validation?

A

DR by default uses a 20% holdout and 5-fold CV with stratified sampling. There are several different methods that allow you to separate your training data into different roles while maintaining awareness of different ‘groups’ in your data.

94
Q

How does DataRobot interact with SAS or what can I do with my SAS models?

A

How does DataRobot interact with SAS or what can I do with my SAS models?

95
Q

How does DataRobot select the metric to optimize for as well as the candidate models to run?

A

On a high level, if it is a binary classification problem we always optimize LogLoss. If it's regression, we start with RMSE unless the data is very skewed, in which case we lean toward Poisson or Gamma. We use Tweedie if the distribution is zero-inflated.

96
Q

How is DataRobot determining which models to train on 16/32/64/80 percents of data?

A

DataRobot trains on 16/32/64% as part of its autopilot, but it will start higher than 16% with smaller datasets. The model recommended for deployment (the most accurate non-blender) is then retrained at 80%.

97
Q

How to see which exact data (after preprocessing) DR passed into the model?

A

Since DR does data preprocessing and feature engineering, the data that is fed into the models will be derived from but different than the data the user uploaded. While you cannot access this derived dataset currently (May 2019), you can see which variables were used by the models. In the insights tab, both Tree-Based Variable Importance as well as the Variable Effects sections show the derived variables used by the models.

98
Q

How does DataRobot determine which threshold to use for binary classification problem?

A

There are two thresholds on the ROC tab:

(a) Threshold - This is interactive and by default set to the threshold that maximizes the F1 score. Note that this does not impact predictions; it is solely for analysis in the GUI.
(b) Threshold used for predictions - This defaults to 0.5; to use a different value, the user must set it. This is the threshold used when DR returns predictions. (DR predictions consist of both probabilities and a yes/no classification, and it is this classification that uses the threshold.)
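For the display threshold, a quick sketch of finding an F1-maximizing cutoff from predicted probabilities (illustrative, not DR's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # the last precision/recall point has no threshold
print(f"F1-maximizing display threshold: {best:.3f}")
```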

99
Q

Do I have to use the GUI or can I interact programmatically?

A

You can do almost everything via our R and Python clients that you can do with the GUI.

100
Q

Can I share my project for others to view? for others to work on?

A

Yes, when you share a project you can make the person an owner, a user, or an observer. Observers can only observe. Users can do everything the owner can do except delete the project or unlock the holdout.

101
Q

DR is suggesting regression, can I force it to do classification?

A

In the window where you specify the target, simply click “switch to classification”. This option will only be enabled if the numeric feature has no more than 100 unique values.

102
Q

Do I have to run one at a time from the repository, or can I run several at once?

A

You can launch a batch run.

103
Q

Can I download the lift chart via the GUI?

A

Yes, via the ‘export’ button

104
Q

Can I view the lift chart in more granularity than deciles?

A

Yes, 10, 12, 15, 20, 30 or 60 bins

105
Q

Can I view the lift chart on training data?

A

No, this is intentionally omitted.

106
Q

Does DR provide a ROC curve for all models?

A

No, it is only for binary classification.

107
Q

How do I change the prediction threshold?

A

You will see two different thresholds displayed on the ROC tab. You can change the threshold to experiment and look at different confusion matrices on the ROC tab, but doing so does NOT change the threshold used when predictions are made. That threshold can be set in the 'Threshold set for Prediction Output' section of the ROC tab (as well as in Deployments).

108
Q

Can I export feature fit (or feature effects) via the GUI?

A

Yes, via the ‘export’ button

109
Q

Can I tune model hyperparameters?

A

Yes, you can tune model hyperparameters in Advanced Tuning which is found on the Evaluate menu within a particular model. The recommendation is that often it’s better to spend your time doing feature engineering than tuning hyperparameters.

110
Q

How can I see which features are most important?

A

To see which are most strongly correlated with the target on a univariate, ie non-modeling basis, look at ACE scores. To see which features are most important according to a particular model, look at feature impact.

111
Q

Can I see the reasons why a model made a certain prediction?

A

After you build models, you can use Prediction Explanations to help understand the reasons DataRobot generated individual predictions.

112
Q

How can I change the metric used on the vertical axis of the learning curve?

A

The Speed vs Accuracy display is based on the validation score, using the currently selected metric

113
Q

How can I compare performance of my models?

A

There are many things you might want to compare between two or more models, but a good first place to look is the model comparison exhibit.

114
Q

Can I download coefficients for all variables?

A

Yes, simply press the export button

115
Q

How can I get predictions from DataRobot?

A

(1) Batch predictions via the GUI or R/Python clients, (2) Predictions via the API, either (a) real time or (b) in batch with our batch scoring script, (3) downloadable scoring code either (a) codegen or (b) DR prime, and (4) in-place distributed scoring on Hadoop.

116
Q

Can I download predictions along with the entire training dataset, ie all rows and all columns?

A

Via the GUI, you can download up to 5 additional columns with predictions, and you can get predictions for every row in the training dataset. “Stacked predictions” are returned for records that were in the training partition.

117
Q

Can I download all downloadables at once instead of poking all around the UI?

A

Yes, for any individual model, go to the predict tab and the downloads section, then there is an option to download all charts for that model.

118
Q

Does DataRobot have ETL Capabilities?

A

Yes, DataRobot has various options for data preparation. There are some data preparation abilities built directly into the core application. With the release of 5.3, AI catalog has also added substantial ETL capabilities. Additionally, in late 2019 DataRobot acquired Paxata to bring a robust set of automated ETL functionality to the DataRobot platform.

119
Q

Do I have to fix class imbalance on the dataset before loading into DR?

A

No. Class imbalance is an issue if we evaluate the models using simple metrics like accuracy %. However, DataRobot directly optimizes the models for objectives that are both aligned with the project metric and robust to imbalanced targets (such as logloss). If the project metric is different (e.g. AUC), it is used afterwards to fine-tune the hyperparameters of the models.

120
Q

Can you explain OTV?

A

Out-of-time validation, or date/time partitioning, is an alternative to TVH. With TVH, you randomly choose some records to train on and some records to test on. With OTV, you choose to train on records from earlier periods in time and then validate on records from later periods in time.

121
Q

Can you explain the concept of model lift?

A

Technically, "Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with and without the predictive model." This is derived from the Cumulative Gain chart, not the Lift Chart. The Cumulative Gain chart, though it looks like an ROC curve and is based on the same underlying data, should not be confused with or compared to the ROC curve; they show very different things. Lift is the ratio of points on the Cumulative Gains plot to points on the 45-degree (identity) line. The ratios of these points create the cumulative lift chart.
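A short sketch computing cumulative gain and lift from scored records (a generic calculation, not DR's chart code):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, 10_000)            # actual outcomes
scores = y * 0.3 + rng.random(10_000)       # predictions loosely related to the outcome

order = np.argsort(-scores)                       # rank records by predicted score
cum_gain = np.cumsum(y[order]) / y.sum()          # cumulative gain
pct_targeted = np.arange(1, len(y) + 1) / len(y)  # fraction of population targeted
lift = cum_gain / pct_targeted                    # ratio vs. the 45-degree baseline

for pct in (0.1, 0.2, 0.5):
    i = int(pct * len(y)) - 1
    print(f"top {pct:.0%}: gain={cum_gain[i]:.2f}, lift={lift[i]:.2f}")
```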

122
Q

Can I specify which value of the target DR uses as the positive class?

A

In Advanced Options -> Additional -> Positive Class Assignment

123
Q

Feature constraints can only be applied to which algorithms?

A

Only Extreme Gradient Boosted Trees, Generalized Additive Models and Frequency/Severity (both frequency and severity model based on XGBoost) and Frequency/Cost model (both frequency and cost model based on XGBoost) support training with monotonic constraints.

124
Q

Can you define the optimizations metric yourself?

A

No, but DataRobot chooses from a comprehensive set of metrics and recommends one well suited for the given data and target. Users can change the metric from the range of choices provided. Users can also calculate accuracy on training data using any user-defined metric post-optimization. If customers request a particular metric, this can be shared with our product team, who will then evaluate and prioritize the request. This recently resulted in the KS metric being added to DataRobot.

125
Q

DR shows the matthews correlation coefficient, can it show cohens kappa or others?

A

There are many metrics for assessing binary classification models, not all of them are available inside of DR. Often these can be calculated by downloading the data from DR, for example Cohen’s kappa can be calculated using the exported ROC curve data.

126
Q

Can I get feature fit (and feature effects) for all features?

A

FE and FF are available for the top 500 features (using ACE scores in the feature fit exhibit, using feature impact for feature effects.)

127
Q

Can I get feature impact for all features?

A

The graph shows the top 30, but the top 1000 features are available via export as csv.

128
Q

Describe the three metrics shown in the hotspots table.

A

Mean target is the mean value of the target for records that satisfy the given rule. Mean Relative to Target is simply Mean target divided by the average target value for all records. % observations is the % of observations that satisfy the rule.

129
Q

Describe the “galaxy plot” on the hotspots page

A

The size of the spot indicates the number of observations that follow the rule. The color of the rule indicates the relative difference between the average target value for the group defined by the rule and the overall population

130
Q

How can I keep my predictions from getting caught in a modeling queue?

A

This is the purpose of the dedicated prediction server. When you submit a dataset to DR through either the GUI or via the R or Python client, you are submitting to modeling workers. If those workers are busy building models, your prediction job will be queued. Since the prediction server only makes predictions and is sized according to customer needs, it will virtually never have a queue. You have to either make a POST request (ie not use the DR clients), or use the batch scoring script, which is a wrapper around the API call.

131
Q

Does model management track data scored with batch scoring script?

A

Yes. Prediction requests larger than 5 MB will not be included in data drift statistics, but will be included in service health statistics. When using the batch scoring script, keep the batch size below 5 MB to ensure data drift statistics are captured

132
Q

Can I see a log of the changes that have been made to a deployment or the underlying models?

A

Yes, this is displayed in the overview section for a specific deployment

133
Q

Does the response time in service health include time to traverse the network?

A

The report does not include time due to network latency.

134
Q

Can I see stats on my deployment month by month?

A

Drag the blue slider at the top of the service health screen to narrow down a time frame.

135
Q

How can I tell if the accuracy of my deployed model is deteriorating?

A

To monitor accuracy, predicted outcomes must be compared to actual outcomes. DR currently (release 5.0) has the ability to track accuracy for external models.

136
Q

Explain the feature drift vs feature importance plot and its value.

A

This graph plots the top 10 features (minus any text/percentage/currency features) and the target so that you can see if the distribution of impactful features in the scoring data is significantly different from the distribution of the feature in the training data.

137
Q

Can I compare feature drift for my newly deployed model vs the prior model deployed?

A

Yes, by toggling the version selector you can see the drift experienced with different models (i.e., over different periods of time).

138
Q

Can I see feature drift for all variables in my model?

A

No, DR monitors drift for the ten most impactful features

139
Q

Docs say that DR tracks drift on the top 10 features, but I only see 9.

A

This graph plots the top 10 features minus any text/percentage/currency features, so you can end up with fewer than ten.

140
Q

Can I see a sample of the API call to make predictions?

A

You can find this in the integrations section of the deployment.

141
Q

Can customers use the prediction app for their deployment instead of coding via API?

A

Yes, the prediction app can be launched from an existing deployment. There are several other types of applications (in beta or GA) to suit different needs.

142
Q

Does DataRobot detect reference IDs in small datasets? in large datasets?

A

No. With a dataset that is too small (<2000 records), attempting to automatically identify reference ID columns gives too many false positives. With small datasets, the user needs to manually pre-process the data to remove reference IDs or create a feature list that excludes them.

143
Q

Does DR do any upsampling like SMOTE?

A

No, DR models handle class imbalances without the need for upsampling. Class imbalance is an issue if we evaluate the models using simple metrics like accuracy %. However, DataRobot directly optimizes the models for objectives that are both aligned with the project metric and robust to imbalanced targets (such as log-loss). If the project metric is different (e.g. AUC), it is used afterwards to fine-tune the hyperparameters of the models.

144
Q

How can I make a categorical variable an ordered categorical (or ordinal) variable?

A

To create an ordinal variable, the levels of the variable must be mapped to numbers indicating the order of the levels. Note that doing so imposes not just an order but also a distance between levels that may not exist. For example, ordering 'good', 'better', 'best' as 1, 2, 3 might make sense. But does 'poor', 'fair', 'average', 'good', 'excellent' deserve 1, 2, 3, 4, 5, or would 1, 3, 4, 5, 7 make more sense? That is, someone may move a rating from 'fair' to 'average' with little thought, but may require really compelling evidence to make the leap from 'good' to 'excellent'. Thus the 'distance' between these levels is not uniform. Assigning numbers gives both order and distance.
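A tiny pandas illustration of mapping ordered levels to numbers (the particular mapping is a judgment call, as discussed above):

```python
import pandas as pd

df = pd.DataFrame({"rating": ["poor", "good", "average", "excellent", "fair"]})

# The chosen numbers impose both an order and a distance between levels
order = {"poor": 1, "fair": 3, "average": 4, "good": 5, "excellent": 7}
df["rating_ordinal"] = df["rating"].map(order)
print(df)
```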

145
Q

Can you control how to group or partition your data for validation / cross-validation?

A

The default partitioning is random for regression and stratified for classification, but other appropriate partitioning methods are possible. For time dependent data, users can select Date/Time partitioning aka OTP (Out of Time Partitioning.) Column-based partitioning (Partition Feature) or Group partitioning (Group) can be used to create a more deterministic partitioning method.

146
Q

Can DataRobot accurately select the right blueprint for the given problem?

A

DataRobot doesn't "select" blueprints; it creates them dynamically based on the dataset that you give it and the target you specify! This process is where most of DR's data scientists have spent 7+ years embedding their data science knowledge and best practices. Roughly several dozen blueprints are created for each project, because although various approaches are viable, you don't know which ones will actually perform best, so DR lets them compete with one another on the leaderboard.

147
Q

How are there partial dependence values for “missing” values when there are no “missing” in my dataset?

A

You may not have missing values in your modeling dataset, but you may get missing values in scoring data at prediction time, so we show you this to show you the effect those will have. We calculate this the same as the other values, that is, we set all values equal to missing and calculate the average prediction.

148
Q

Are observations dropped or shuffled during computation of Feature Impact?

A

Feature impact uses a (smart) sample of 2500 records and then applies the permutation importance approach, ie shuffles contents of columns one at a time.

149
Q

For a categorical variable with N levels, how many indicator variables does DataRobot’s one-hot encoding create?

A

DR fully encodes categoricals when it one-hot encodes, so in this case it will produce N indicator or dummy variables.
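A quick pandas illustration of full (not N-1) encoding; the column name is made up:

    import pandas as pd

    colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
    dummies = pd.get_dummies(colors["color"])  # drop_first=False is the default
    print(list(dummies.columns))  # ['blue', 'green', 'red'] -> 3 indicators for 3 levels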

150
Q

Can you use Smart Downsampling on a regression problem? If so, what are the constraints for regression problems?

A

Yes, smart downsampling can be used on a regression problem in which the response is zero-inflated, aka Tweedie distributed, which means a large proportion of the records have a response value of zero.

151
Q

A client’s binary classification problem has a target of either ‘Red’ or ‘Green’. Which class would default to the positive class (1) in DataRobot? How can you change the positive class?

A

By default, DR will assign the second value when sorted in alphabetical order as the positive class. If you load a dataset with a target of {1, 0}, {Yes, No}, or {True, False}, the positive class in each case is 1, Yes, and True respectively; these happen to be second when sorted in alphabetical order. So if your target has {a, b}, b will be used as the positive class by default. This can be changed in advanced settings.
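A one-liner makes the “second value in sorted order” rule concrete for the examples above:

    for labels in [["Red", "Green"], ["Yes", "No"], ["True", "False"], ["1", "0"], ["a", "b"]]:
        print(sorted(labels)[1], "is the default positive class for", labels)
    # Red, Yes, True, 1, b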

152
Q

Explain the difference between starting in Autopilot, Quick or Manual mode, and provide transparent reasoning for why it may be best to start in Autopilot, and in what instances it makes sense to use the others.

A

(See link for description of the three modes.) Modeling projects often require iteration. Autopilot takes longer to run but it is the most powerful, while Quick and Manual take less time. I recommend users start DR with Autopilot on the first iteration. They make observations and get ideas to make improvements, ie feature engineering, joining new features, etc. For the next few iterations, I recommend running Manual mode and refitting only those blueprints that performed the best on the initial Autopilot run. After a few iterations, ie after the dataset has been modified and enriched fairly significantly, I’d re-run Autopilot, as now other algorithms might do better than they did initially. I’d repeat this process (Autopilot every 5th run, Manual in between) to help speed up the modeling process. (Unless time is no concern, then just run Autopilot every time!)

153
Q

Can you do residual modeling and/or combine multiple models?

A

You can set up a workflow to link together multiple models and their output by using the modeling API. Combining the predictions generated by these models allows you to avoid assumptions. For example in a frequency x severity model, you can avoid the correlation between frequency and severity that are necessary when using a single model based on Tweedie variance. Exporting the predictions for a dataset from within DataRobot lets you look at the residuals and perform analyses.

154
Q

Can DR connect to our database?

A

Yes, but only on-prem. (Connecting to our public cloud would require making the database accessible to anyone.)

155
Q

Can I change the variable type?

A

Yes

156
Q

Are there any DataRobot educational videos or tutorials?

A

The DataRobot community content is being actively created (Q1 2020) to support self-service customers. New content is constantly being created and posted. There is a CFDS squad (Sustainable Success) dedicated to this effort, along with many others in the company. DataRobot offers in-depth hands-on training through DataRobot University. Plus, our CFDSs are more than happy to provide hands-on support via Webex or in person.

157
Q

Background of the Data Science Team

A

As of Jan 2020, we have over 200 data scientists and machine learning engineers at the company, in addition to over 400 software engineers. We have a wide variety of backgrounds in these groups, but over 20 of our employees (including our founders) were Kaggle Masters and Grandmasters at one point.

158
Q

I just want to know which features are most important in my business problem, but I am getting different feature impacts from different models, what am I supposed to make of this?

A

It’s important to remember that the real-world situation you are modeling is infinitely complex, and any model we build is an approximation to that complex system. Each model has its strengths and weaknesses, and different models are able to capture varying degrees of that underlying complexity. For example, a model that is not capable of detecting non-linear relationships or interactions will use the variables in a certain way, while a model that can detect these relationships will use the variables differently, and so you will get different feature impacts from different models. Feature impact shouldn’t be drastically different, however, so while the exact ordering will change, the overall inference is often not impacted. Collinearity can also impact this. If two variables are highly correlated, a regularized linear model will tend to use only one of them, while a tree-based method will tend to use both at different splits. So with the linear model, one of these variables will show up high in feature importance and the other will be low, while with the tree-based model, both will be closer to the middle.

159
Q

In the speed vs accuracy chart, what exactly does speed measure?

A

Speed shows the time it takes for the model to score 2000 records, in milliseconds. Most importantly, it is NOT measuring the round-trip API call, i.e. network latency. That may also be of interest, but it must be tested in the customer’s systems.

160
Q

I want to cluster the records in my training data, can DR do that?

A

Hotspots can be used this way, but it depends on what the customer wants the clustering to tell them or what they plan to use it for. Also, be aware that there are many overlapping ‘clusters’ and you’d probably only want to use disjoint clusters.

161
Q

I had a word cloud for my top model, but the “recommended for deployment” version of that model does not have a word cloud, why?

A

Word Cloud insights, whether accessed via the Leaderboard or the Insights tab, are not available when a text mining model is trained into the validation set or at 100%.

162
Q

I downloaded training data and saw negative values in the partition field, what does this mean?

A

A partition value of “-2” means the target value was missing.

163
Q

I downloaded training data with the partition field. I see holdout labeled, but which of the CV folds was used for validation?

A

The validation partition, although unmarked, is the largest partition, by number of rows, that isn’t the holdout partition. In the case of a tie, DataRobot chooses a partition randomly from those that were largest.
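If you have the downloaded file, a hedged pandas sketch (assuming the partition column is literally named "Partition" and holdout rows are labeled "Holdout"; adjust to the actual labels in your export) finds the unlabeled validation fold:

    import pandas as pd

    df = pd.read_csv("training_data_with_partitions.csv")
    folds = df.loc[df["Partition"] != "Holdout", "Partition"].value_counts()
    print("Validation fold (largest non-holdout partition):", folds.idxmax())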

164
Q

If I want to change a model in production, what do I do to my API code?

A

Nothing. The API calls the deployment in DR. The deployment points to a model, so you only need to make the change within the deployment inside of DR. Since your API will continue to reference the same deployment, no changes are needed.
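A hedged sketch of such an API call using the requests library; the host, endpoint path, header names, and IDs below are placeholders following a typical integration pattern and should be replaced with the snippet shown for your own deployment:

    import requests

    API_TOKEN = "YOUR_API_TOKEN"          # placeholder
    DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"  # stays the same even if the underlying model is swapped
    URL = f"https://your-prediction-server.example.com/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictions"

    with open("scoring_data.csv", "rb") as f:
        resp = requests.post(
            URL,
            data=f,
            headers={
                "Content-Type": "text/csv; charset=UTF-8",
                "Authorization": f"Bearer {API_TOKEN}",
                "DataRobot-Key": "YOUR_DATAROBOT_KEY",  # placeholder; required on some installs
            },
        )
    print(resp.json())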

165
Q

If you use 3 different learning rates in the hyperparameters, how many models do you end up with on the leaderboard?

A

You will end up with three models. Each learning rate will produce a different model in the leaderboard that you can compare directly.

166
Q

I have a large dataset I need scored, what are my options?

A

If your data is on a Hadoop cluster, you should score it in place. If it is not on a Hadoop cluster, then use our batch scoring script, which will send the data to the dedicated prediction server.

167
Q

In Autopilot, what % of the training data is used for training in Rounds 1, 2, and 3, respectively?

A

The default is 16%/32%/64%; however, the final stage is always capped at 500MB. For datasets larger than that cap allows, Round 3 will therefore run on 500MB/{dataset size} of the data, Round 2 on 250MB/{dataset size}, and Round 1 on 125MB/{dataset size}.
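A small worked example of that arithmetic for a hypothetical 2GB (2000MB) dataset:

    DATASET_MB = 2000  # hypothetical dataset size
    CAP_MB = 500       # cap on the final autopilot stage
    for rnd, default_pct in [(1, 16), (2, 32), (3, 64)]:
        stage_cap_mb = CAP_MB / 2 ** (3 - rnd)          # 125MB, 250MB, 500MB
        pct = min(default_pct, 100 * stage_cap_mb / DATASET_MB)
        print(f"Round {rnd}: {pct:.2f}% of the data (~{pct / 100 * DATASET_MB:.0f}MB)")
    # Round 1: 6.25% (~125MB), Round 2: 12.50% (~250MB), Round 3: 25.00% (~500MB)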

168
Q

In which case can we not calculate feature importance?

A

Feature importance is not calculated for multiclass classification projects.

169
Q

How do I delete a model?

A

From the leaderboard

170
Q

How do I delete a feature list?

A

From the feature lists screen

171
Q

Where can I see my advanced settings after I press start?

A

In the top right, click the “folder” and then “show more”.

172
Q

Where is the list of all DR blueprints?

A

There is no such list. DR creates all blueprints dynamically for the specific dataset you give it.

173
Q

Why did my models not start training on 16%?

A

Autopilot starts at 64%, 32%, or 16%, depending on whether there are <2k rows, 2k-4k rows, or 4k+ rows.

174
Q

I just want to know which features are most important in my business problem, but I am getting different feature impacts from different models, what am I supposed to make of this?

A

It’s important to remember that the real-world situation you are modeling is infinitely complex, and any model we build is an approximation to that complex system. Each model has its strengths and weaknesses, and different models are able to capture varying degrees of that underlying complexity. For example, a model that is not capable of detecting non-linear relationships or interactions will use the variables in a certain way, while a model that can detect these relationships will use the variables differently, and so you will get different feature impacts from different models. Feature impact shouldn’t be drastically different, however, so while the exact ordering will change, the overall inference is often not impacted. Collinearity can also impact this. If two variables are highly correlated, a regularized linear model will tend to use only one of them, while a tree-based method will tend to use both at different splits. So with the linear model, one of these variables will show up high in feature importance and the other will be low, while with the tree-based model, both will be closer to the middle.

175
Q

In the speed vs accuracy chart, what exactly does speed measure?

A

Speed shows the time it takes for the model to score 2000 records, in milliseconds. Most importantly, it is NOT measuring the round-trip API call, i.e. network latency. That may also be of interest, but it must be tested in the customer’s systems.

176
Q

Why doesn’t DR retrain the recommended for deployment model on the full dataset?

A

Model management uses the holdout to make predictions which it can then compare against predictions on new data, to track changes in average prediction value.

177
Q

Why would I not just always use the most accurate model?

A

There could be several reasons, but the two most common are (1) prediction latency and (2) organizational readiness. To clarify the latter, some organizations very much favor linear models and/or decision trees for perceived interpretability reasons.

178
Q

What partition of data is used to calculate tree-based variable importance?

A

Tree-based variable importance is based on the model fit on the training data.

179
Q

Why is tree based variable importance telling me something different than feature impact is?

A

Feature impact is a direct measure of a feature’s impact on predictions, while tree-based variable importance is a proxy for the impact a variable has on predictions, thus they can have different results.

180
Q

I want to cluster the records in my training data, can DR do that?

A

Hotspots can be used this way, but it depends on what the customer wants the clustering to tell them or what they plan to use it for. Also, be aware that there are many overlapping ‘clusters’ and you’d probably only want to use disjoint clusters.

181
Q

What partition of data is used to calculate hotspots?

A

Training Data

182
Q

What partition of data is used to calculate coefficients?

A

These are a byproduct of the model fitting process, thus training data creates coefficients.

183
Q

Why do I see variables in the coefficients table and tree-based variable importance that are not in my dataset or in feature impact?

A

DR does a lot of data preprocessing, e.g. transforms, derived features such as ratios & differences, etc. When linear models are fit, they are fit to these derived/processed variables, not the raw features that were in the dataset. Thus it is these processed variables that get coefficients, not the raw variables, and this is what you see in the coefficient tables.

184
Q

I had a word cloud for my top model, but the “recommended for deployment” version of that model does not have a word cloud, why?

A

Word Cloud insights, whether accessed via the Leaderboard or the Insights tab, are not available when a text mining model is trained into the validation set or at 100%.

185
Q

What partition of data is used to calculate word cloud and text mining?

A

The word cloud is based on coefficients which are determined when the model is fit, which means the training partition is used to create the word cloud.

186
Q

I downloaded training data and saw negative values in the partition field, what does this mean?

A

A partition value of “-2” means the target value was missing.

187
Q

I downloaded training data with the partition field. I see holdout labeled, but which of the CV folds was used for validation?

A

The validation partition, although unmarked, is the largest partition, by number of rows, that isn’t the holdout partition. In the case of a tie, DataRobot chooses a partition randomly from those that were largest.

188
Q

What languages can I get downloadable scoring code? What are the drawbacks?

A

With CodeGen you get Java, either source or binary. With DR Prime you get either Java or Python. CodeGen produces exact predictions for a model but isn’t available for all models. DR Prime is an approximation to a model, but it can be used to approximate any model.

189
Q

When I add a deployment via a leaderboard model, what two choices do I have to make? Can I change them later?

A

You have to choose (1) whether to set/change the prediction threshold and (2) whether you want DR to track data drift, which requires DR to store scoring data. (1) cannot be changed later; (2) can.

190
Q

What is the difference between response time and execution time?

A

When an API call is made, the time it takes to get a response back could be called “total time”. It is not possible for us to track this. Once the request is received by DR servers, we measure two things. Response time is how long DR spent processing a prediction request (from receiving the request to returning a response). Execution time is the time DR spent scoring the prediction request.

191
Q

Where can I see the prediction volume for my deployment?

A

Service health captures several metrics related to prediction volume.

192
Q

If I want to change a model in production, what do I do to my API code?

A

Nothing. The API calls the deployment in DR. The deployment points to a model, so you only need to make the change within the deployment inside of DR. Since your API will continue to reference the same deployment, no changes are needed.

193
Q

Why do I see accuracy tracked for some deployments but not others?

A

As of release 5.0, accuracy can only be tracked for external deployments, and it will only be tracked for external deployments that had actual values included in the uploaded scoring data.

194
Q

Why didn’t Cross-Validation automatically run on my DataSet?

A

The cutoff for cross-validation is a hard cutoff at 50,000 rows. If you require automatic cross-validation, use a dataset with 49,999 or fewer rows. You can also manually run cross-validation.

195
Q

Why doesn’t the confusion matrix on the ROC page for a model match my manual calculations based on the threshold I selected?

A

(Not customer-facing): To save time communicating with the front-end, DataRobot only calculates 100 thresholds for the confusion matrix. The threshold the user selects on the UI actually gets rounded down to the closest percentile, which results in a slightly different confusion matrix than what is expected with exact calculations using the desired threshold.

196
Q

Why isn’t (variable x) showing up on the feature fit exhibit?

A

Feature Fit is computationally intensive, especially for datasets with many variables. We populate Feature Fit with variables in the order they appear on the data tab. This measure of importance is done using univariate ACE scores, and therefore won’t match the “variable importance” tab for a given model.

If your dataset has hundreds of columns, and the variable you are interested in is close to the bottom of the “data” tab, when sorted by importance, you may need to wait a few hours for feature fit to calculate for that variable, for a given model.

Also, we cap model x-ray at 500 variables max, so if a variable is not in the top 500 variables by ACE score, it will never show up in feature fit. Text features and the target will never show up in model x-ray.

197
Q

What techniques does DataRobot use to blend ensemble models?

A

For a full list, see the docs, but a few of the blender types are average, median, ENET, and some tree-based and NN blenders.

198
Q

What is the maximum autopilot sample size?

A

In every environment (on-prem and in the cloud), autopilot runs in 3 stages. The third, or maximum, stage is always <= 500MB. You can manually run models on larger samples than that (up to 10GB on-prem, 5.0GB in the cloud).

199
Q

If you use 3 different learning rates in the hyperparameters, how many models do you end up with on the leaderboard?

A

You will end up with three models. Each learning rate will produce a different model in the leaderboard that you can compare directly.

200
Q

Why isn’t MAPE available on my project?

A

MAPE is only available when the response contains no zeros or negative values. (The metric divides each error by the actual value, so a zero-valued target makes it undefined.)
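A minimal sketch of the zero problem (the function and data are illustrative):

    import numpy as np

    def mape(actual, predicted):
        actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
        return 100 * np.mean(np.abs((actual - predicted) / actual))

    print(mape([10, 20, 40], [12, 18, 44]))  # ~13.3, well defined
    print(mape([10, 0, 40], [12, 1, 44]))    # division by zero -> inf (with a RuntimeWarning)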

201
Q

Why are reason code predictions different than those on the training data from the predict tab?

A

Prediction explanations from the prediction explanations tab are in-sample, so do not use them for anything important. Predictions on training data downloaded from the Predict tab are actually stacked predictions and are therefore out-of-sample; use those for anything important.

202
Q

When calculating CV scores, are models tuned again?

A

Yes - we don’t reuse the hyperparameter values found in the first split; we run the whole tuning again for each split.

203
Q

When calculating predictions on a model through CV - which model do we use?

A

We use the model trained on the first CV split

204
Q

What is the maximum size a text field can be?

A

There is no hard upper bound; customers have used datasets with millions of characters and hundreds of thousands of words in a single field.

205
Q

I have a large dataset I need scored, what are my options?

A

If your data is on a Hadoop cluster, you should score it in place. If it is not on a Hadoop cluster, then use our batch scoring script, which will send the data to the dedicated prediction server.

206
Q

What is the difference between Random Forest and Extra Trees Regressor Models?

A

The “ExtraTrees” model is a refinement of Random Forests, with more randomness: the splits considered for each variable are also random. This decreases the variance of the model but potentially increases its bias. The ExtraTrees models has an additional advantage in that it is computationally very efficient: no sorting of the input data is required to find the splits, because they are random.
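A hedged scikit-learn comparison on synthetic data illustrating the two approaches (these are analogous open-source estimators, not DataRobot’s internal implementations):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

    for name, est in [("RandomForest", RandomForestRegressor(random_state=0)),
                      ("ExtraTrees", ExtraTreesRegressor(random_state=0))]:
        # ExtraTrees draws split thresholds at random instead of searching for the best one
        score = cross_val_score(est, X, y, cv=3, scoring="r2").mean()
        print(f"{name}: mean R^2 = {score:.3f}")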

207
Q

What is the difference between prediction and modeling servers?

A

Essentially, modeling workers are powering all the analysis you do from the GUI and from R/Python clients. These resources are typically used to build models, hence they are called “modeling workers”. If you have a model deployed and it’s calling for predictions in real time, if you were to send those prediction requests to the same resources used for modeling at the same time someone else is consuming those resources building models, then your prediction request would get stuck in a queue. For this reason, we have stand-alone resources called prediction servers which are used solely for making predictions.

208
Q

In Autopilot, what % of the training data is used for training in Rounds 1, 2, and 3, respectively?

A

The default is 16%/32%/64%; however, the final stage is always capped at 500MB. For datasets larger than that cap allows, Round 3 will therefore run on 500MB/{dataset size} of the data, Round 2 on 250MB/{dataset size}, and Round 1 on 125MB/{dataset size}.

209
Q

In which case can we not calculate feature importance?

A

Feature importance is not calculated for multiclass classification projects.

210
Q

What is the difference between Time Series and Out of Time Validation usage?

A

Out-of-time validation trains on earlier data, validates on later data, and extrapolates into new, unseen time values. Time series does all this, plus it takes panel data, detects and handles trends and seasonality, and derives numerous lagged features.

211
Q

What is the relationship between the prediction distributions, the confusion matrix, and the two thresholds you set in the “Prediction Distribution” interface of the ROC Curve?

A

The prediction distributions form the foundation for the rest of the exhibits on the ROC page. The prediction distributions show the output of the models, which are giving probabilities. The confusion matrix is a summary of the two distributions for a given threshold. (Positive distribution, above threshold and below, along with negative distribution, above threshold and below; these give the 4 quadrants of the confusion matrix.) The ROC curve is a summary of the true positive and false positive rates off of each confusion matrix. The Cumulative Gain chart then tells you, as you move the threshold from, say, right to left, what your true positive rate is for the records above the threshold. This essentially measures how well the model has aggregated the positive records at one end of the sort order, so it is a ranking measure, similar to AUC or Gini.
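A compact sketch of how one threshold turns the two prediction distributions into a confusion matrix and one point on the ROC curve (the data here is synthetic):

    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)                                 # actual classes
    y_prob = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, 1000), 0, 1)     # model probabilities

    threshold = 0.5
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
    print(f"TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")           # one ROC point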

212
Q

What programming languages can I use in DataRobot?

A

DataRobot allows you to import and train your own R and Python models by using Jupyter Notebook directly in the application. You can access those development environments by clicking on the “Jupyter” tab at the top of the screen. Using those, you can also utilize R and Python standard libraries to manipulate and visualize the dataset or to interact with the leaderboard.

213
Q

Where can DR be installed/hosted?

A

DR is installed on our AWS cloud, an environment we maintain. DR can also be installed on your hardware: either a Linux server, a Hadoop cluster, or in your virtual private cloud, such as AWS, Azure, etc. Defer deeper questions to support and/or FE.

214
Q

What sources can DR ingest data from?

A

Local machine, URL, Hadoop, or JDBC

215
Q

What file formats can DR ingest?

A

The list is on the DataRobot data ingest page: .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .geojson, .bz2, .gz, .zip, .tgz. To summarize: text, Excel, SAS, or various compressed files.

216
Q

What are DR’s file size limitations? What if the file is zipped?

A

Up to 5.0GB in the cloud and on-prem (non-Hadoop)
Up to 10GB on-prem for all models (ONLY Hadoop, or AWS_VPC with S3 backend)
Up to 100GB on Hadoop for ScaleOut models, with non-ScaleOut models downsampled to 10GB

The dataset must have <20k columns. There is no maximum number of rows. The minimum number of rows depends on whether the project is regression, classification, etc.; for classification, 100 is the minimum.

217
Q

What is the customer support portal?

A

DataRobot Support posts up to date information here on product releases, new features, and FAQs as well as known information on bugs or outages. Past product release notes are here as well.

218
Q

Why doesn’t the manage projects screen have folders?

A

The manage projects screen uses tags. The disadvantage of folders is that a file can only be in one folder at a time, while you can have many tags on a file. For example, with folders you might create one folder per customer, but then finding all regression projects quickly would be a pain, as you’d have to go through each folder. With tags, however, you can have customer tags, model type tags, etc., which makes it easy to search and filter by various criteria.

219
Q

What types of machine learning does DR do?

A

Regression, classification (2-100 classes), time series regression, time series binary classification and anomaly detection

220
Q

What are raw features, all features, informative features, univariate selections, DR reduced features, and ‘informative features - leakage removed’?

A

Raw features are the features that existed in the dataset you uploaded. All features is raw plus features derived from date variables, ie day of week, month of year, etc. Informative features is all features minus features that are trivially uninformative, ie those that have only one value. Univariate selections appear after pressing the start button; they are the features that are most strongly correlated with the target on a univariate basis (using ACE scores). DR reduced features is a feature list created after modeling, which DR builds from the most impactful non-redundant features of the top model. ‘Informative features - leakage removed’ is the informative features list with any features that have been identified as target leakage removed.

221
Q

What do the “few values”, “duplicate” etc prefixes on feature names mean?

A

Non-informative features can be identified by these grey prefixes, which describe the reason why the feature is uninformative.

222
Q

We spend 80% of our time preparing data, so you’ve only automated 20% of my work?

A

(A) This question ignores an organization’s deployment costs, and (B) the “80%” data prep consists of two steps: (1) a big SQL join to merge several datasets into one and (2) making that dataset model ready by encoding variables, cleaning up missing values, transforming features, searching for interactions, non-linearities, etc. DR does all of the second step for you.

223
Q

Why are there models in the repository that didn’t get run?

A

Autopilot runs the models that will give a good balance of accuracy and runtime. Models that offer possibility of improvement at significant increases in runtime are held in the repository. It is a good practice to run autopilot, identify the algorithm that performed best on the data, then run all variants of that algorithm in the repository.

224
Q

What data is used to generate the lift chart?

A

The data will be out-of-sample, so either validation, CV, or holdout (unless the model is trained into the validation or holdout sets).

225
Q

Why are there two prediction thresholds shown on the ROC Curve page?

A

You will see two different thresholds displayed on the ROC tab. You can change the display threshold to experiment and look at different confusion matrices on the ROC tab, but doing so does NOT change the threshold used when predictions are made. That threshold can also be set in the ‘Threshold set for Prediction Output’ section of the ROC tab (as well as in Deployments).

226
Q

What is the difference between density and frequency on the ROC Curve page?

A

The density chart displays an equal area underneath both the positive and negative curves. The area underneath each frequency curve varies and is determined by the number of observations in each class.

227
Q

What data is used in the ROC curve?

A

The data will be out-of-sample, so either validation, CV, or holdout (unless the model is trained into the validation or holdout sets).

228
Q

What is the difference between feature fit and feature effects?

A

These two exhibits show the same thing, with two minor differences. First, the variable order shown on the left is determined by ACE scores in the feature fit exhibit but by feature impact in the feature effects exhibit. Second, by default partial dependence is turned off in feature fit (though it can be turned on by the user), while actual and predicted are turned off by default in feature effects (though they too can be turned on). Feature fit is computed using pre-modeling metrics (the importance score), whereas feature effects uses the feature impact score, which is determined after the model is run.

229
Q

Why are my text variables not showing up in feature fit (or feature effects)?

A

Because there are so many unique words and ngrams in free-form text, they cannot be shown in a graph the way other variables can. Even the top few words often show up in a very small percentage of the rows, so there would be very little data if we were to show the top few words the way we do with categoricals.

230
Q

Records from what data partition are returned on the “prediction explanations” page?

A

These are records from the validation partition.

231
Q

What does the “ID” represent on the prediction explanations page?

A

The number in the ID column is the row number ID from the imported dataset.

232
Q

Will my models improve if I add more observations to my training data?

A

Learning curves are designed to answer this question. As more observations are added, a model’s performance will improve initially and then begin to level off. This is important for anyone who says they have more data than DR can handle. Second, it’s important to distinguish what happens if you add more columns vs more rows to your training dataset. Often we get the question about “more data”, but the answer very much depends on whether the additional data is features or observations.

233
Q

Why don’t I see all models on the learning curves?

A

Not all models show three sample sizes. As DataRobot re-runs data with a larger sample size, the software only applies it to the highest scoring models from the previous run. (Note, if you re-order the Leaderboard display, for example to sort by cross-validation score, the Speed vs Accuracy graph continues to plot the top 10 validation score models.)

234
Q

Where can I see how DR used the text variables in my training data?

A

Word cloud and text mining in the insights tab. You will probably want to see how the models performed that used the text (not all blueprints use text). To find them, search the leaderboard for ‘text’.

235
Q

Where can I see coefficients?

A

Any model that produces coefficients can be identified on the leaderboard with a “beta” tag. Alternatively, you can find coefficients collected into one place on the insights page.

236
Q

What is greybox?

A

Greybox, or censored blueprints, is a way for us to protect our IP. It determines whether our blueprints show the detail of all preprocessing steps or not.

237
Q

Is there documentation for the hyperparameters?

A

Yes - All hyperparameters for each specific algorithm are documented in the “DataRobot Model Docs.” The link to any specific model is accessible by clicking the model box in the blueprint.

238
Q

What’s the minimum dataset size required for modeling with DataRobot?

A

For regression projects, 20 data rows plus header rows. For classification projects, 100 data rows plus header rows.

239
Q

What is the name of our company? Data Robot, Data robot, DataRobot, Datarobot, or DATAROBOT!!! ?

A

DataRobot is the correct spelling. It is important not to use other spellings for branding purposes, but using “Data Robot” is particularly bad because this is viewed differently from “DataRobot” in online searches, etc.

240
Q

When downloading batch predictions of the training data, how can you see the original correct labels in the prediction download (csv file)?

A

You can add up to five features to the predictions when you download, so simply add the original target.

241
Q

Outline a streamlined workflow as if for a client, in seven clear bullets or less.

A

Data collection, setup, modeling, evaluate, interpret, deploy, monitor. Of course this is an iterative process.

242
Q

Provide three examples of “guardrails” that DataRobot provides to guide the user and ensure the usability of the model produced.

A
  1. Highlighting of Data Quality Issues
  2. Automatic handling of missing values
  3. Suggesting problem type (regression vs classification)
243
Q

What does DataRobot mean by blueprint versus model, and why is this distinction significant to the overall concept of our product?

A

A modeling algorithm fits a model to data, which is one part of a blueprint. The blueprint also consists of data preprocessing. This is a vital difference; often customers at first glance will say “It looks like I still have to prepare my data for modeling, I spend 80% of my time doing that today, you’ve only automated the other 20%”. This is not accurate because the 80% the customer is referring to consists of two parts: (a) a big SQL join to create the flat file (typically a fairly easy process) and (b) making that flat file model ready, ie encoding categoricals, transforming numerics, imputing missing values, parsing words/ngrams, etc., and this process of making the data model ready also depends on which algorithm you’re going to use. With DR, you still create the flat file, but you do not need to make it “model ready”; DR does that for you, and it does it for each different algorithm, often with multiple approaches for one algorithm. (Not to mention that this 80/20 argument also completely neglects deployment costs to an organization.)

244
Q

Why might a client prefer to be on-prem or on the cloud?

A

Clients often prefer on-prem when they want to control security, or if they have a Hadoop cluster. The security concerns could be their own corporate governance or regulatory requirements, such as HIPAA. With an on-premise install, the customer maintains the environment (though our support team is great at helping with this even though it’s normally not our responsibility), whereas with our public cloud, we maintain the environment, which is less overhead for the customer to deal with.

245
Q

What is the benefit of having many model workers?

A

With more modeling workers, you can build more models in parallel. If you are building models in 2 or more projects, you can allocate workers between them. For example, if you have 10 modeling workers at your disposal, you can crank both projects up to ten, but then they are competing with each other for the next available worker. You could instead allocate 5 workers to one project and 5 to the other, ensuring that each project always has workers.