Kohavi and Thomke (2017): The Surprising Power of Online Experiments Flashcards
Key Lessons from Online Experiments
The Value of Controlled Experiments:
- Definition: Online experiments (like A/B testing) allow businesses to assess ideas by
comparing a control (current state) with a treatment (proposed change). This scientific method
ensures decisions are evidence-based rather than intuitive.
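A minimal sketch of how a control/treatment comparison might be evaluated, assuming the metric is a simple conversion rate and a pooled two-proportion z-test is acceptable. The function name and the counts are hypothetical illustrations, not figures from the article.

```python
import math

def two_proportion_z_test(conv_control, n_control, conv_treatment, n_treatment):
    """Return (absolute lift, z statistic, two-sided p-value)."""
    p_c = conv_control / n_control
    p_t = conv_treatment / n_treatment
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_control + conv_treatment) / (n_control + n_treatment)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_t - p_c, z, p_value

# Hypothetical counts: 20,000 users per arm.
lift, z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}")
```

A small p-value here would suggest the treatment differs from the control; it does not by itself say the change is worth shipping.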
A/B Testing for Large Companies:
- Lets them test many ideas concurrently at a low cost per test.
Tiny Changes Can Have a Big Impact:
- Contrary to popular belief, progress often comes from implementing numerous small
improvements rather than disruptive changes.
The Role of Infrastructure:
- Large-scale experimentation requires:
- Instrumentation: Collecting data on clicks, interactions, and behaviors.
- Data pipelines: For real-time and batch analysis.
- Teams of data scientists: To ensure rigor and reliability.
Challenges with Experimentation:
- Failure Rates: At companies like Google and Bing, only 10%-20% of experiments yield
positive results. This underscores the need for numerous tests to identify breakthroughs.
- Complexity and Bugs: Introducing multiple features simultaneously increases the likelihood of
errors. Example: If each new feature independently has a 10% chance of failure, a release with
7 features has a >50% probability that at least one of them fails.
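The arithmetic behind the 7-feature example can be checked with a short sketch. It assumes failures are independent across features, which the notes do not state explicitly; the function name is hypothetical.

```python
def prob_any_failure(p_fail, n_features):
    """Probability that at least one of n independent features fails."""
    # P(no failure) = (1 - p)^n, so P(at least one failure) = 1 - (1 - p)^n.
    return 1 - (1 - p_fail) ** n_features

for n in (1, 3, 7):
    print(n, round(prob_any_failure(0.10, n), 3))
```

With a 10% per-feature failure chance, the probability crosses 50% at the seventh feature (1 - 0.9^7 is about 0.52).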
Importance of Data Quality:
- Rigorous Validation:
- A/A Tests: Testing a feature against itself ensures systems detect no differences when none
exist.
- Identify and exclude outliers (e.g., bots or outlier accounts like libraries on Amazon).
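One way to picture an A/A check is a simulation: both arms draw from the same conversion rate, so any "significant" result is a false positive. This is a sketch under assumed parameters (a 10% base rate, a 5% significance threshold, a pooled z-test), not a description of any company's platform.

```python
import math
import random

def aa_false_positive_rate(n_tests=1000, n_users=1000, base_rate=0.10,
                           alpha=0.05, seed=0):
    """Run repeated A/A tests and return the share flagged as significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_tests):
        # Both arms receive the identical experience, so the true effect is zero.
        a = sum(rng.random() < base_rate for _ in range(n_users))
        b = sum(rng.random() < base_rate for _ in range(n_users))
        pooled = (a + b) / (2 * n_users)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_users)
        if se == 0:
            continue  # Degenerate sample with no variation; nothing to test.
        z = (b - a) / n_users / se
        p_value = math.erfc(abs(z) / math.sqrt(2))
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_tests

rate = aa_false_positive_rate()
print(f"share of A/A tests flagged significant: {rate:.3f}")
```

A healthy setup should flag roughly `alpha` of the A/A tests; a rate far above that would point to a problem in assignment, logging, or the statistical test.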
Importance of Data Quality:
- Twyman’s Law: “Any figure that looks interesting is usually wrong.” Surprising results should
be replicated to ensure accuracy.
Importance of Data Quality:
- Segment Variability: Some user segments may react differently to experiments, skewing
overall results. For example, a bug in Internet Explorer 7 significantly distorted Bing’s test
results.
Avoiding Assumptions About Causality:
- Correlation ≠ Causation:
- Example: Observational studies in Microsoft Office falsely suggested advanced features
reduced attrition. In reality, heavy users (who use advanced features) naturally have lower
attrition rates.
- Controlled Testing Is Essential: Observational studies may misrepresent the impact of
changes.
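The confounding in the Office example can be illustrated with a toy simulation. All probabilities below are made up for illustration: attrition depends only on whether a user is a heavy user, and the feature has no causal effect at all. The observational comparison still shows a large gap, while random assignment shows roughly none.

```python
import random

def attrition_gaps(n_users=50_000, seed=1):
    """Return (observational gap, randomized gap) in attrition rates."""
    rng = random.Random(seed)
    obs = {True: [0, 0], False: [0, 0]}  # uses feature -> [churned, total]
    rct = {True: [0, 0], False: [0, 0]}  # assigned feature -> [churned, total]
    for _ in range(n_users):
        heavy = rng.random() < 0.3
        # Attrition depends only on usage intensity (hypothetical rates).
        churned = rng.random() < (0.05 if heavy else 0.30)
        # Observational world: heavy users adopt the feature far more often.
        uses_feature = rng.random() < (0.8 if heavy else 0.1)
        # Experimental world: exposure is decided by a coin flip.
        assigned = rng.random() < 0.5
        for table, key in ((obs, uses_feature), (rct, assigned)):
            table[key][0] += churned
            table[key][1] += 1

    def gap(table):
        # Attrition rate with the feature minus attrition rate without it.
        return table[True][0] / table[True][1] - table[False][0] / table[False][1]

    return gap(obs), gap(rct)

obs_gap, rct_gap = attrition_gaps()
print(f"observational gap: {obs_gap:+.3f}, randomized gap: {rct_gap:+.3f}")
```

The observational gap reflects who chooses the feature, not what the feature does; randomization breaks that link.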
Defining Success with Metrics:
- Overall Evaluation Criteria (OEC):
- Composite metrics should align with long-term strategic goals (e.g., revenue, engagement).
- Example: Bing tracks metrics like tasks completed per session to gauge user satisfaction.
- Continuous Refinement:
- Successful experiments often result from understanding short- and long-term metric
trade-offs.
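As a rough illustration of a composite OEC, the sketch below combines per-metric relative changes (treatment vs. control) into one weighted score. The metric names, weights, and changes are hypothetical; the notes do not say how any company actually weights its metrics.

```python
def oec_score(relative_changes, weights):
    """Weighted sum of per-metric relative changes (treatment vs. control)."""
    missing = set(weights) - set(relative_changes)
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    return sum(weights[m] * relative_changes[m] for m in weights)

# Hypothetical weights reflecting long-term priorities (they sum to 1).
weights = {
    "sessions_per_user": 0.5,
    "tasks_per_session": 0.3,
    "revenue_per_user": 0.2,
}
# Hypothetical experiment result: engagement up, short-term revenue slightly down.
changes = {
    "sessions_per_user": 0.010,
    "tasks_per_session": 0.020,
    "revenue_per_user": -0.005,
}

score = oec_score(changes, weights)
print(f"OEC score: {score:+.4f}")
```

A single score like this makes the short- versus long-term trade-off explicit: a small revenue dip can be outweighed by engagement gains if the weights say so.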