Modeling Process Flashcards
(245 cards)
How are the DR Reduced Features Created?
DR takes the best-performing non-blender model on the leaderboard and creates a feature list using its Feature Impact scores.
Feature Impact is calculated using permutation importance, and DR keeps the set of features that accounts for 95% of the accumulated impact for the model. If that set contains more than 100 features, only the top 100 are used.
DR also automatically removes redundant features.
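A minimal sketch of that selection rule (hypothetical impact scores; not DataRobot's internal code):

```python
# Hypothetical impact scores: keep features covering 95% of cumulative impact, capped at 100.
impact = {"age": 0.40, "income": 0.30, "region": 0.20, "tenure": 0.08, "id_hash": 0.02}

ranked = sorted(impact.items(), key=lambda kv: kv[1], reverse=True)
total = sum(score for _, score in ranked)

selected, cumulative = [], 0.0
for name, score in ranked:
    selected.append(name)
    cumulative += score
    if cumulative >= 0.95 * total:
        break

selected = selected[:100]  # cap at the top 100 features
print(selected)  # ['age', 'income', 'region', 'tenure']
```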
What are informative features?
These are identified using ACE during EDA2, along with ‘reasonableness’ checks from EDA1.
DR automatically excludes features that are redundant, have too few values, or are unique identifiers.
DR also looks for and removes features that may present target leakage.
What are Raw Features?
These are all the features/columns from the user-uploaded dataset. They exclude user-derived features, but include features that DR deemed non-informative.
What are Univariate Selections/Feature Importances?
These are calculated during EDA2, which runs after the user clicks the start modeling button.
They make use of ACE (alternating conditional expectations), which detects non-linear correlation with the target variable; a feature has to meet a certain threshold (0.005) to be deemed informative. During EDA2, DR calculates the feature importance of every variable in the informative feature list against the target and displays it on the project data page.
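ACE itself is not readily available in common Python libraries; as a hedged stand-in, mutual information also measures non-linear association with the target, and the sketch below mirrors the idea of an informativeness cutoff around 0.005:

```python
# Stand-in for ACE: mutual information also detects non-linear association
# with the target. The 0.005 cutoff mirrors the threshold mentioned above.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)  # non-linear in feature 0 only

scores = mutual_info_regression(X, y, random_state=0)
informative = [i for i, s in enumerate(scores) if s > 0.005]
print(dict(enumerate(scores)))
print(informative)  # feature 0 scores far above the cutoff
```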
What is EDA1 & when does it happen?
EDA1 occurs after the user imports their dataset into the DR platform for modeling.
If the dataset is larger than 500 MB, DR takes a sample and performs the subsequent calculations on it.
The steps that happen during this phase include:
- Inferring the feature schema type (categorical, numeric, text, etc.)
- Calculating summary statistics for numeric features
- Computing value distributions (top 50 values)
- Checking column validity (duplicate columns, too many unique values, etc.)
For date features, DR automatically performs date/time feature transformations. A rough sketch of this kind of per-column profiling follows below.
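This is purely illustrative (pandas, made-up columns), not DataRobot's implementation:

```python
# Illustrative per-column profiling: infer a schema type, then compute
# summary statistics, value distributions, or date ranges accordingly.
import pandas as pd

df = pd.DataFrame({
    "amount": [10.5, 22.0, 13.7, None],
    "state": ["CA", "NY", "CA", "TX"],
    "signup": pd.to_datetime(["2021-01-02", "2021-02-03", "2021-03-04", "2021-04-05"]),
})

for col in df.columns:
    s = df[col]
    if pd.api.types.is_numeric_dtype(s):
        kind = "numeric"
        summary = s.describe()[["mean", "std", "min", "max"]].to_dict()
    elif pd.api.types.is_datetime64_any_dtype(s):
        kind = "date"
        summary = {"min": s.min(), "max": s.max()}
    else:
        kind = "categorical"
        summary = s.value_counts().head(50).to_dict()  # top values only
    print(col, kind, summary)
```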
What are the size limits for EDA1/EDA2?
DR takes a sample of up to 500 MB of data for EDA1/EDA2.
What is EDA2 & when does it happen?
EDA2 happens after the user presses the start button. DR selects the data profiled in EDA1, but excludes the rows that will be in the holdout set (to prevent data snooping).
It performs many calculations:
- Re-calculation of the numeric statistics computed in EDA1
- Feature correlation with the target (the feature importance calculation); a minimal sketch follows below
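A minimal sketch of that idea, assuming a hypothetical `partition` column marking the holdout rows:

```python
# Illustrative: recompute stats and target correlation on everything except the holdout.
import pandas as pd

df = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "target":  [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
    "partition": ["train"] * 8 + ["holdout"] * 2,  # hypothetical partition labels
})

eda2 = df[df["partition"] != "holdout"]       # holdout excluded to avoid data snooping
print(eda2["feature"].describe())             # re-computed numeric statistics
print(eda2["feature"].corr(eda2["target"]))   # simple linear stand-in for the importance calc
```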
What are the four types of modeling modes in DR?
- Quick
- AutoPilot
- Manual
- Comprehensive
What are the types of models that are supported in DR?
- Regression
- Time Series
- Binary Classification
- Multi-class Classification
- Anomaly Detection
How does Quick AutoPilot Mode work?
DR selects a subset of models for its blueprints and runs only one round of elimination.
It uses 32% of the training data in the first round, and chooses the top four models from that round to move to the second round. The top two winning models are then blended together.
How does the AutoPilot Mode work?
DR selects a candidate set of blueprints after looking at the target variable and the schema types of the input variables.
Autopilot by default runs on the informative feature list, which is calculated during the EDA1/EDA2 process.
It runs through three rounds of elimination on the leaderboard, starting with 16% of the data and selecting the top 16 models to go to the next round.
In the second round, it feeds 32% of the training data to the models and chooses the top 8.
In the last round, the top 8 models are fed all of the training data (64% of the total), and the top results are calculated.
Blenders are created from the top models of the final round.
Models are initially capped to 500 MB of data, but this can be changed either from the repository or, after the run has completed, from the leaderboard.
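A hedged sketch of those elimination rounds; `score_model` is a hypothetical helper that trains a blueprint on the given sample fraction and returns a validation score (higher is better):

```python
# Illustrative survival-of-the-fittest rounds: 16% -> top 16, 32% -> top 8, 64% -> final ranking.
def run_autopilot(blueprints, score_model):
    rounds = [(0.16, 16), (0.32, 8), (0.64, None)]  # (sample fraction, survivors to keep)
    candidates = list(blueprints)
    for sample_pct, keep in rounds:
        scored = sorted(candidates, key=lambda bp: score_model(bp, sample_pct), reverse=True)
        candidates = scored[:keep] if keep is not None else scored
    return candidates  # blenders are then built from the top of this final ranking
```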
What are the benefits of using a ‘survival of the fittest’ approach?
Beyond being a marketing gimmick:
- It increases the number of model types you can try out quickly
- You can visualize a learning curve that shows how your loss metric improves as the sample size grows, and whether it would be worth investing in more data
- Faster run time, since the initial models are capped to 500 MB of data
How does the Manual modeling mode work?
This gives the user full control over which model to execute. You can choose from the repository which models you want to try out.
How does Comprehensive modeling mode work?
This runs all repository blueprints on the max Autopilot sample size (64%). This will result in extended build times.
When does DR use H2O or SparkML models?
These are specific to Hadoop installations and have to be specified using the Advanced options.
What are workers, and how are they used in the modeling process?
Workers are computational units that process the modeling workflow.
Workers responsible for EDA and uploading data are shared across an org.
Modeling workers are assigned by the admin to a specific user.
What are feature associations, and when are they calculated?
Feature associations are an output of EDA2, computed on the features deemed informative for modeling purposes in EDA1.
They give information about the correlation between features, using metrics like Cramér's V and mutual information.
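As an illustration (not DataRobot's code), Cramér's V between two categorical features can be computed from a contingency table:

```python
# Illustrative: Cramér's V between two categorical features.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "green", "green"],
    "size":  ["S",   "S",   "L",    "L",    "M",     "M"],
})

table = pd.crosstab(df["color"], df["size"])
chi2 = chi2_contingency(table)[0]
n = table.to_numpy().sum()
r, k = table.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, k) - 1)))
print(cramers_v)  # 1.0 here: the two features are perfectly associated
```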
How are missing values handled?
Models like XGBoost handle missing values natively.
For linear models, DR handles missing values as follows:
- Median imputation using the non-missing values
- Adding a missing-value flag, enabling the model to recognize patterns in structurally missing data
For tree-based models, DR imputes with an arbitrary value (e.g., -9999), which is algorithmically faster but gives just as accurate results.
For missing categorical values, DR treats the missing value as another level of the category.
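A minimal pandas sketch of the three strategies described above (illustrative column names, not DataRobot's implementation):

```python
# Illustrative missing-value handling.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50_000, np.nan, 75_000], "region": ["east", None, "west"]})

# Linear models: median imputation plus a missing-value indicator flag.
df["income_missing"] = df["income"].isna().astype(int)
df["income_linear"] = df["income"].fillna(df["income"].median())

# Tree-based models: impute with an arbitrary sentinel value such as -9999.
df["income_tree"] = df["income"].fillna(-9999)

# Categoricals: treat missing as its own level.
df["region"] = df["region"].fillna("==Missing==")
print(df)
```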
My customer wants to do clustering, what should I tell them?
Try to figure out what the underlying business problem is that is driving the perceived need for clustering. For example, if a customer in marketing wants to do clustering and then choose certain clusters to market to, they will get much better results with a propensity model that directly predicts who will respond to a marketing effort. These models can give a huge lift over the naive clustering approach. In other words, you can find clusters like “clients likely to buy, clients not likely to buy” with supervised learning. Oftentimes cluster analysis has been used because better models were too hard to use, but DataRobot lets them use better models easily.
What is the difference between ‘ordinal encoder’ and ‘ordinal variable’?
Ordinal encoding refers to coding categorical features as numbers, an alternative to one-hot encoding. The phrase “ordinal variable” describes a categorical variable in which the values have an order, for example “good”, “better”, “best”.
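A quick illustration of the contrast (pandas, made-up data):

```python
# Illustrative: ordinal encoding vs. one-hot encoding of the same categorical column.
import pandas as pd

s = pd.Series(["good", "better", "best", "good"])

# Ordinal encoding: each level becomes a single integer code.
ordinal = s.map({"good": 0, "better": 1, "best": 2})

# One-hot encoding: each level becomes its own indicator column.
one_hot = pd.get_dummies(s)

print(ordinal.tolist())  # [0, 1, 2, 0]
print(one_hot)
```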
What does DR do if there are missing values in my target?
Records with missing values in the target are ignored in EDA2 and modeling. This provides a nice hack: to train on a subset of the data after it’s imported, you can derive a new feature that is set to a missing value when certain criteria are met (the criteria that define the records you want to omit). Now set that variable as the target and DR will drop the records where it is missing. The easiest way to produce a missing value in DR is log(0), as follows: where({num_medications}<10,{readmitted},log(0))
What are the pros/cons of univariate selections?
Univariate selections are done quickly, and they capture non-linear relationships between each variable and the target; however, they do not capture the importance of a variable in the presence of interactions.
What data partition is used in the histograms on the Data tab?
Histograms produced via EDA1 use all of the data, up to 500MB. For datasets larger than 500MB, a random 500MB sample is used. For histograms produced by EDA2, all the data except for the holdout (if any) and rows with missing target values are used.
What do the histograms on the Data tab represent the sum of?
Row count or sum of exposures.