Einstein Discovery Data Prep and Create Stories Flashcards
You can remedy data issues in 2 ways:
Fix the issue in the CRM Analytics dataset using data prep tools.
Correct the issue in the story using story settings. Story fixes don’t affect the data in the dataset.
Data Prep Terminology
Variables - Category of data. Columns in dataset.
Observations - Row Value in Dataset
Data Type - Numerical, Categorical (text), Date
Max # Observations in dataset
20M
What is a Story
A story contains answers, explanations, predictions, and suggested actions that are arranged into an organized presentation with logical flow and related sections. The story is filled with insights about your data as they relate to the outcome you’re interested in. Einstein Discovery walks you through what has happened and why, what has changed, what is likely to happen, and what you can do about it.
Two Types of Stories
Insights Only - only descriptive
Insights and Predictions - all insight types.
2 ways to create a story
Dataset or Template
Ways to create a Story from a Dataset
- Create -> Create from dataset.
- While viewing a lens.
- From dataset dropdown.
Stories and Security Predicates
All users who access the story can see the results of the story. They don’t need the same row-level access as the story creator.
What data in a dataset is a story based on?
A snapshot of the data. Initial data snapshot taken when story is created. If data has changed in source dataset, users with sufficient privileges can refresh story based on most recent data. Otherwise, subsequent changes to the story do not affect the snapshot, and subsequent changes to the dataset are ignored.
Occurrences
Performs an extensive query analysis of dataset values by calculating the number of times a value occurs in a column, including interactions with other columns. For example, the color red occurs 30% of the time in an Automobile dataset, and of those rows the most frequent body type is coupe.
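A minimal sketch of the occurrence idea, assuming a tiny made-up Automobile dataset (the rows and both helper names are hypothetical, not part of Einstein Discovery): it computes how often a value occurs in one column and, within those rows, the most frequent value of another column.

```python
from collections import Counter

# Hypothetical sample rows: (color, body_type)
rows = [
    ("red", "coupe"), ("red", "coupe"), ("red", "sedan"),
    ("blue", "sedan"), ("blue", "suv"),
    ("white", "sedan"), ("white", "suv"),
    ("black", "coupe"), ("black", "sedan"),
    ("silver", "suv"),
]

def value_share(rows, column, value):
    """Fraction of rows where the given column equals the value."""
    matches = sum(1 for row in rows if row[column] == value)
    return matches / len(rows)

def most_frequent_within(rows, column, value, other_column):
    """Most common value of other_column among rows where column == value."""
    counts = Counter(row[other_column] for row in rows if row[column] == value)
    return counts.most_common(1)[0][0]

red_share = value_share(rows, 0, "red")             # 3 of 10 rows are red
top_body = most_frequent_within(rows, 0, "red", 1)  # body type seen most among red rows
```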
What does template overview provide?
Description, List of Supported Objects, Sample Insights
Issue: Story concurrency limits exceeded
No more than two stories can be created concurrently
Dataflow run limits exceeded
During app creation, a story template runs a dataflow twice: once to create the dataset used to train the predictive model, and once to use the predictive model to generate prediction scores and write them back to CRM.
App creation fails if you exceed the maximum number of dataflow runs in your org in a 24-hour period.
Data Sync-related limits exceeded
Story templates can add objects to Data Sync. If the org has already reached the maximum number of Data Sync objects, app creation fails.
Data Sync-related errors
If app creation triggers data sync-related errors, address them in data manager before trying to create the template again.
Elements of a story interface
- Story Headline - Name of story, goal, most recent version
- Story toolbar
- Variables Panel - List of explanatory variables and their correlation to the outcome
- Story Version summary - summary of insights, version comparison
- Insight Summary Panels - List of variables, ordered by correlation, that positively or negatively impact a story
What does a story headline contain?
The basis of the story: Story Name, Version Update, Story Goal, Story Version
What does the story version summary contain?
Goal
Row Count (# obs in analysis)
# Change in Row Count from previous version
Outcome Avg
% Change in outcome avg from previous version
What changed between versions
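As a rough illustration of the change figures in the version summary, the sketch below computes the row-count change and the percent change in the outcome average between two versions. The function name, dictionary keys, and sample numbers are hypothetical; Einstein Discovery calculates these for you.

```python
def version_summary(prev_rows, curr_rows, prev_avg, curr_avg):
    """Compare two story versions (all inputs are made-up illustration values)."""
    row_change = curr_rows - prev_rows
    # Percent change is undefined when the previous average is zero.
    pct_change = None if prev_avg == 0 else (curr_avg - prev_avg) / prev_avg * 100
    return {
        "row_count": curr_rows,
        "row_count_change": row_change,
        "outcome_avg": curr_avg,
        "outcome_avg_pct_change": pct_change,
    }

summary = version_summary(prev_rows=10_000, curr_rows=12_000,
                          prev_avg=200.0, curr_avg=210.0)
```

With these sample inputs the row count grows by 2,000 and the outcome average rises by about 5%.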
How to Edit Story
Open Story
click Edit Story
Can change columns, update story to latest dataset change
Use the Correlation column to see how much each field contributes to the outcome. Remove columns that have little to no impact.
What column contains fields that you can improve, such as fields with outliers or duplicates?
Data Alert Column
What can you edit in the general settings tab?
Analysis Type (insights or insights & preds)
Algorithm (GLM, GBM, XGBoost, Random Forest)
- select Model Tournament to have ED run all algorithms and show the results of algorithm that performed best
Validation Type -
- Training/Validation Ratio
- Validation Dataset (can specify a CRM Analytics dataset). You only see datasets that match the schema of your story’s dataset.
- None (default) - uses only k-fold validation.
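To make the difference between a training/validation ratio and k-fold validation concrete, here is a minimal sketch in plain Python. It assumes no shuffling and an arbitrary k of 5; the number of folds Einstein Discovery uses is not stated in these cards, and both function names are hypothetical.

```python
def ratio_split(rows, training_ratio):
    """Hold out the tail of the data: e.g. 0.8 keeps 80% for training."""
    cut = int(len(rows) * training_ratio)
    return rows[:cut], rows[cut:]

def k_fold_splits(rows, k):
    """Yield (training, validation) pairs; each row is validated exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, validation

rows = list(range(10))                      # stand-in for 10 observations
train, validation = ratio_split(rows, 0.8)  # 8 training rows, 2 validation rows
splits = list(k_fold_splits(rows, k=5))     # 5 rounds, 2 validation rows each
```

A ratio split evaluates the model once on a fixed holdout, while k-fold rotates the holdout so every observation is used for validation once.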
Configure Number Variables
change settings for individual numbers in your story.
On Story settings, click number field. Can:
analyze for bias (select to exclude a variable from the model. A SHIELD icon will appear next to the title of the insight to remind you it’s a sensitive variable)
Transform - Replace missing values, projected predictions
Bucket Values by (count, width, manual)
Number of buckets - specify number of buckets to show in charts
Include only – adds min and max values to starting values and ending value fields
Preview - Graph shows number of values that occur across the range of number ranges.
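The count and width bucketing options can be sketched as follows. This is an illustration under simple assumptions (equal-width edges between min and max, and equal-sized groups of sorted values); the function names and sample values are hypothetical, and the exact bucketing rules in Einstein Discovery may differ.

```python
def bucket_by_width(values, n_buckets):
    """Return n_buckets + 1 edges that split [min, max] into equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    edges = [lo + i * width for i in range(n_buckets + 1)]
    edges[-1] = hi  # guard against floating-point drift on the last edge
    return edges

def bucket_by_count(values, n_buckets):
    """Split the sorted values into buckets holding roughly the same number of rows."""
    ordered = sorted(values)
    size = len(ordered) / n_buckets
    return [ordered[round(i * size):round((i + 1) * size)] for i in range(n_buckets)]

values = [1, 2, 3, 4, 10, 20, 50, 100]     # made-up, skewed numeric variable
width_edges = bucket_by_width(values, 4)   # equal ranges, uneven row counts
count_buckets = bucket_by_count(values, 4) # uneven ranges, equal row counts
```

On skewed data like this, width bucketing puts most rows in the first range, while count bucketing keeps two rows in every bucket.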
What are projected predictions?
Providing trending data for numeric variables that factor into your predictions to make them more accurate
Configure Projected Predictions
Provide dataset that contains trend data.
Tell story data about the dataset:
Unique columns identifier
Variable column (maps to selected variable in story)
time interval column
time interval number of intervals to project ahead
seasonality (auto or none or number)
What is Fuzzy Matching?
Adds uniformity to spelling variations in variables.
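A minimal sketch of the idea using Python's standard `difflib` module. This is not how Einstein Discovery implements fuzzy matching (the cards don't say); the canonical list, the sample values, and the 0.75 similarity cutoff are all assumptions for illustration.

```python
from difflib import get_close_matches

def normalize(values, canonical, cutoff=0.75):
    """Map each value to its closest canonical spelling, or keep it unchanged."""
    cleaned = []
    for value in values:
        match = get_close_matches(value.lower(), canonical, n=1, cutoff=cutoff)
        cleaned.append(match[0] if match else value)
    return cleaned

canonical = ["california", "new york"]  # hypothetical reference spellings
raw = ["California", "Califronia", "calfornia", "New Yrok", "Texas"]
cleaned = normalize(raw, canonical)
```

Values with no sufficiently similar canonical spelling, such as "Texas" here, pass through unchanged.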
How to track story versions?
ED keeps previous versions of stories so you can track your progress.
View - Version History on story toolbar
cancel a version - Cancel story before submitting it
compare versions - ‘what changed’ button
Disparate Impact alert
significant discrepancy in the way different classes are being treated
Proxy Variable alert
one or more variables highly correlated to a sensitive variable
Outliers Alert
uncommonly large or small numbers
strongest predictors alert
Variable that is so highly correlated with the outcome that it should be examined for possible data leakage.
Multicollinearity alert
Two or more variables are highly correlated with each other, so they would have a duplicate impact on the outcome variable.
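One common way to spot highly correlated variables is the Pearson correlation coefficient, sketched below. The measure and the 0.9 threshold are assumptions for illustration (the cards don't state what Einstein Discovery uses), and the column data is made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

THRESHOLD = 0.9  # assumed cutoff for "highly correlated"

price_usd = [10, 20, 30, 40]
price_eur = [9, 18, 27, 36]  # same information as price_usd, different unit
quantity = [5, 1, 4, 2]

duplicate_pair = abs(pearson(price_usd, price_eur)) > THRESHOLD
unrelated_pair = abs(pearson(price_usd, quantity)) > THRESHOLD
```

Here the two price columns carry the same information, so keeping both would double-count their effect on the outcome.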
High cardinality Alert
variable contains more than 100 unique values
Missing values alert
variable is missing a high percentage of values
Identical values alert
all values of variable are identical
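The high cardinality, missing values, and identical values checks can be sketched for a single column as follows. The 100-unique-value limit comes from the card above; the 50% missing-value threshold is an assumption (the card says only "a high percentage"), and the function name is hypothetical.

```python
def column_alerts(values, missing_threshold=0.5, cardinality_limit=100):
    """Return the alert names that apply to one column (None marks a missing value)."""
    alerts = []
    present = [v for v in values if v is not None]
    if values and (len(values) - len(present)) / len(values) > missing_threshold:
        alerts.append("missing values")
    distinct = set(present)
    if len(distinct) > cardinality_limit:
        alerts.append("high cardinality")
    if present and len(distinct) == 1:
        alerts.append("identical values")
    return alerts

ids = [str(i) for i in range(101)]      # 101 unique values
constant = ["a"] * 5                    # every value identical
sparse = [None, None, None, "a", "b"]   # 60% missing
```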
recommended buckets alert
Indicates that, for a numerical value, Einstein Discovery devised an alternative set of buckets (grouping of data points based on ranges).
dominant values alert
Indicates that most values in a variable are in the same category, which can limit its contribution to the analysis.
No Correlation Alert
Indicates that this variable explains no variation in the outcome and has no statistical significance.
Imbalanced Distribution Alert
Indicates a disproportionate ratio of observations in each class in training data.
Potential data Leakage Alert
Indicates that a value in an explanatory variable always results in the same outcome value, which may indicate data leakage. Data leakage occurs when your training data contains the information that you’re trying to predict. Leakage results in models that score optimistically high in training but perform less accurately on live data. To produce more realistic models that perform better at predicting outcomes, investigate and remove leaky predictor variables from your model.
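A minimal sketch of the check described above: find values of an explanatory variable that always co-occur with a single outcome value. The opportunity rows, field names, and the `min_rows` guard are hypothetical; this illustrates the idea rather than Einstein Discovery's actual detection logic.

```python
from collections import defaultdict

def leaky_values(rows, variable, outcome, min_rows=2):
    """Values of `variable` whose rows all share one outcome value."""
    outcomes_by_value = defaultdict(list)
    for row in rows:
        outcomes_by_value[row[variable]].append(row[outcome])
    return sorted(
        value
        for value, outcomes in outcomes_by_value.items()
        if len(outcomes) >= min_rows and len(set(outcomes)) == 1
    )

# Made-up opportunity data: "stage" already encodes whether the deal was won.
rows = [
    {"stage": "Closed Won", "won": True},
    {"stage": "Closed Won", "won": True},
    {"stage": "Closed Lost", "won": False},
    {"stage": "Closed Lost", "won": False},
    {"stage": "Negotiation", "won": True},
    {"stage": "Negotiation", "won": False},
]
suspects = leaky_values(rows, "stage", "won")
```

In this sample, the closed stages are only known after the outcome is decided, which is why they predict it perfectly and should be investigated.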
Area Under the curve quality alert
Indicates that this binary classification model’s AUC metric is so high or low that it warrants further examination.
R2 Quality Alert
Indicates that this model’s R2 metric is so high or low that it warrants further examination.
Cross-Validation Failure Alert
Indicates that cross-validation failed for this model.
Rename or move a story
Edit Story Toolbar -> Properties
For App, can change app
For story name, changes name
save
How to delete a story
Edit Story menu -> Delete. Once you delete a story, it can’t be recovered.