Part 1: The Tech Lead Flashcards
What is meant by data sparsity?
Some types of data may be underrepresented even when the total sample size is large.
For example, in front-facing camera data aggregated for autonomous driving, there may be very few samples of yellow traffic lights at intersections simply because they appear less often.
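A minimal sketch of how sparsity might be checked: count the samples per category and flag any category whose share of the data falls below a threshold. The label names, counts, and the 1% threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter

def sparse_categories(labels, min_share=0.01):
    """Return the categories whose share of all samples is below min_share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n for cat, n in counts.items() if n / total < min_share}

# Hypothetical traffic-light labels: a large data set with few yellow samples.
labels = ["green"] * 6000 + ["red"] * 3950 + ["yellow"] * 50
sparse = sparse_categories(labels)
print(sparse)
```

The total sample size here is large, yet one category is still too thin to model reliably, which is the situation the card describes.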
What are outliers?
Outliers are points that differ significantly from other observations; they can shift data distributions substantially. For example, web crawlers that follow every link on a page can skew web behavior data when accidentally included in user behavior analysis.
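A small sketch of the crawler example, assuming made-up page-view counts: one crawler session inflates the mean page views per session, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical page views per session for ordinary users.
human_sessions = [3, 4, 5, 4, 6, 5, 3, 4]
# A hypothetical crawler session that followed every link on the site.
crawler_session = 500

with_crawler = human_sessions + [crawler_session]

mean_without = mean(human_sessions)
mean_with = mean(with_crawler)
median_without = median(human_sessions)
median_with = median(with_crawler)

print(f"mean:   {mean_without:.2f} -> {mean_with:.2f}")
print(f"median: {median_without} -> {median_with}")
```

Comparing a robust statistic such as the median against the mean is one simple way to notice that a few extreme points are driving a summary.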
What are the types and targets of data science projects?
Batch or live.
Diagnostic or predictive.
In the data science project taxonomy, what does hindsight refer to?
Batch x Diagnostic
“What has happened?”
Historical
One standard DS practice is to run an A/B test with a hold-out set composed of customers who do not see an email marketing campaign. We can run the campaign for a few months and assess whether the lack of a marketing campaign impacts long-term engagement.
This kind of hindsight with a long experimentation cycle is not efficient in driving improvements in operations.
In the data science project taxonomy, what does insight refer to?
Real time x Diagnostic
“What is happening?”
Near term.
Another practice is to produce a real-time dashboard illustrating trends in long-term engagement. We can follow the decay of long-term engagement across user vintages to detect early signs of success or trouble. These trends allow the organization to make business decisions in real time with insights from the dashboards.
In the data science project taxonomy, what does *foresight* refer to?
Batch x Predictive
“What may happen?”
Inferential
Given historical data, we can also build a model predicting long-term engagement using detectable short-term engagement characteristics, such as open rate, click-through rate (CTR), unsubscribe rate, landing page session length, and session frequency. A prediction model can anticipate long-term effects from short-term observations, so we gain the foresight to adjust our email marketing strategies week-to-week.
In the data science project taxonomy, what does *intelligence* refer to?
Real time x Predictive
“Make it happen”
Influential
Yet more powerful approaches can include real-time analytics on channels such as email to learn the customer segments. We can then prepare sequences of touches on the next best actions (NBAs) to drive long-term engagement for specific segments of users. When we can adapt the content of the next touches based on individual responses in real time, we are beginning to see the intelligence in driving long-term engagements.
What are four data characteristics to consider in a data science project?
- Unit of decisioning.
- Sample size/sparsity/outliers.
- Sample distribution/imbalance.
- Data types.
What is the data characteristic *unit of decisioning*?
The granularity for modeling or analysis. For example, are we interested in decisions *per employee*, *per business function*, etc.?
What is the data characteristic *sample imbalance*?
The difference in sample counts between class labels, which can span orders of magnitude. It can be addressed by oversampling, undersampling, or synthetic sample generation.
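As an illustration of the oversampling option, here is a minimal sketch of random oversampling: minority-class samples are duplicated at random until every class matches the largest one. The function name `random_oversample`, the class names, and the 990-to-10 split are assumptions made for this example.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels

# Hypothetical data set with two orders of magnitude between classes.
samples = list(range(1000))
labels = ["negative"] * 990 + ["positive"] * 10
balanced_samples, balanced_labels = random_oversample(samples, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Oversampling is usually applied to the training split only, so that duplicated samples do not leak into evaluation data.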
What is the data characteristic *outliers*?
Extreme values that can shift a data distribution.
What are the data characteristics *data types*?
- Tabular, image, text, video etc.
- Time sequenced/series data.
- Graph data.
What is the benefit of feature engineering?
Feature engineering allows us to summarize a vast amount of data meaningfully.
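A rough sketch of what such a summary can look like, assuming a hypothetical email event log: many raw events per user are reduced to two features, open rate and click-through rate. The event names and the choice to divide clicks by emails sent are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical raw event log: (user_id, event_type).
events = [
    ("u1", "sent"), ("u1", "open"), ("u1", "click"),
    ("u1", "sent"), ("u1", "open"),
    ("u2", "sent"), ("u2", "sent"), ("u2", "open"),
    ("u2", "sent"), ("u2", "sent"),
]

def engagement_features(events):
    """Summarize raw events into per-user open rate and click-through rate."""
    counts = defaultdict(lambda: defaultdict(int))
    for user_id, event_type in events:
        counts[user_id][event_type] += 1
    features = {}
    for user_id, c in counts.items():
        sent = c["sent"]
        features[user_id] = {
            # Assumption: both rates are measured per email sent.
            "open_rate": c["open"] / sent if sent else 0.0,
            "ctr": c["click"] / sent if sent else 0.0,
        }
    return features

features = engagement_features(events)
print(features)
```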
What is a momentum-based modeling strategy expected to achieve?
The model is expected to:
- Capture trends in the environments.
- Abstract away fundamental factors that are not expected to change in a certain time window.
- Predict what would happen if those trends continue.
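One very simple way to sketch these three points is a linear trend fitted to a recent window and extended forward. This is only an illustration under stated assumptions: the source does not prescribe a specific model, and the weekly engagement numbers and the helper names `fit_trend` and `extrapolate` are hypothetical.

```python
def fit_trend(values):
    """Least-squares slope and intercept of values against index 0..n-1.
    Requires at least two values."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def extrapolate(values, steps_ahead):
    """Predict the next steps_ahead values if the fitted trend continues."""
    slope, intercept = fit_trend(values)
    last_index = len(values) - 1
    return [intercept + slope * (last_index + k) for k in range(1, steps_ahead + 1)]

# Hypothetical weekly engagement metric over a recent window.
weekly_engagement = [100.0, 104.0, 108.0, 112.0, 116.0]
forecast = extrapolate(weekly_engagement, 3)
print(forecast)
```

The model only sees the recent trend. Whatever fundamentals produced it are treated as fixed within the window, so the forecast is only as good as the assumption that the trend continues.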
What is required for a foundational modeling strategy?
Clear causal mechanisms drive the predictability of the outcome.
What are the major stumbling blocks in project execution according to Gartner?
Specifying projects from vague requirements and prioritizing them.
Planning and managing a DS project for success.
Striking a balance among hard trade-offs.
What does RICE stand for in the priority refinement framework?
Reach, Impact, Confidence, Effort.
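These notes list the four factors without a formula. A commonly used formulation scores a project as Reach x Impact x Confidence / Effort, and the sketch below assumes that formulation. The project names, units, and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    reach: float       # e.g. users affected per quarter (assumed unit)
    impact: float      # anticipated lift per reached user, on a relative scale
    confidence: float  # likelihood of producing impact, between 0 and 1
    effort: float      # e.g. person-months (assumed unit)

    def rice_score(self) -> float:
        # Assumed formulation: benefit terms multiplied, divided by effort.
        return self.reach * self.impact * self.confidence / self.effort

projects = [
    Project("churn model", reach=10000, impact=2.0, confidence=0.5, effort=4.0),
    Project("email NBA", reach=4000, impact=3.0, confidence=0.8, effort=2.0),
]

# Rank candidate projects from highest to lowest score.
ranked = sorted(projects, key=lambda p: p.rice_score(), reverse=True)
for p in ranked:
    print(f"{p.name}: {p.rice_score():.0f}")
```

In this made-up comparison, the project with the smaller reach ranks first because its higher confidence and lower effort outweigh the difference in reach.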
What does Reach refer to in RICE?
Reach refers to how large and how specific a population a data science project can affect. There are tradeoffs to consider when assessing the reach of a data science project, such as the data available on populations of interest and the size of these populations.
What does Impact refer to in RICE?
Impact is the anticipated lift to key operating metrics for the reachable population.
What does Confidence refer to in RICE?
Confidence refers to the likelihood that the project will produce business impact.