Ad Click Event Aggregation Flashcards
What are the functional requirements ?
- Aggregate the number of clicks of ad_id in the last M minutes.
- Return the top 100 most clicked ad_id every minute.
- Support aggregation filtering by different attributes.
- Dataset volume is at Facebook or Google scale (see the back-of-envelope estimation section below for detailed system scale requirements).
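The first three requirements can be illustrated with a small in-memory sketch. The event fields (`ad_id`, `ts` in seconds, `country`) and the helper names are assumptions made for illustration only; a system at the scale described would not scan a list like this.

```python
from collections import Counter

def in_window(event, now, m_minutes):
    # True if the click falls in the half-open interval (now - M minutes, now].
    return now - m_minutes * 60 < event["ts"] <= now

def matches(event, filters):
    # True if every requested attribute has the requested value.
    return all(event.get(key) == value for key, value in filters.items())

def count_clicks(events, ad_id, now, m_minutes, filters=None):
    # Number of clicks on ad_id in the last M minutes, optionally filtered.
    filters = filters or {}
    return sum(
        1
        for e in events
        if e["ad_id"] == ad_id and in_window(e, now, m_minutes) and matches(e, filters)
    )

def top_ads(events, now, n=100, m_minutes=1):
    # The n most clicked ad_ids in the last M minutes, as (ad_id, count) pairs.
    counts = Counter(e["ad_id"] for e in events if in_window(e, now, m_minutes))
    return counts.most_common(n)

# Hypothetical sample events; field names are illustrative.
EVENTS = [
    {"ad_id": "ad1", "ts": 10, "country": "US"},
    {"ad_id": "ad1", "ts": 100, "country": "US"},
    {"ad_id": "ad1", "ts": 130, "country": "DE"},
    {"ad_id": "ad2", "ts": 140, "country": "US"},
]
```

For example, `count_clicks(EVENTS, "ad1", now=150, m_minutes=1)` counts only the clicks at `ts` 100 and 130, and passing `filters={"country": "US"}` narrows that to one.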
What is Real-Time Bidding (RTB) ?
- Process in which digital advertising inventory is bought and sold.
- The speed of the RTB process is important as it usually occurs in less than a second.
- Data accuracy is also very important. Ad click event aggregation plays a critical role in measuring the effectiveness of online advertising, which directly affects how much money advertisers pay. Based on the click aggregation results, campaign managers can control the budget or adjust bidding strategies, such as changing targeted audience groups, keywords, etc. The key metrics used in online advertising, including click-through rate (CTR) and conversion rate (CVR), depend on aggregated ad click data.
What are the non-functional requirements ?
Correctness of the aggregation result is important as the data is used for RTB and ads billing.
Properly handle delayed or duplicate events.
Robustness. The system should be resilient to partial failures.
Latency requirement. End-to-end latency should be a few minutes, at most.
What are the two types of data in the system ?
- Raw Data
- Aggregated Data
What are “raw data” advantages and disadvantages ?
Pros:
* Full data set
* Support data filter and recalculation
Cons:
* Huge data storage
* Slow query
What are “aggregated data” advantages and disadvantages ?
Pros:
* Smaller data set
* Fast query
Cons:
* Data loss. Aggregated data is derived data, so detail is lost: for example, 10 raw entries might be aggregated into 1 entry.
What DBMS are optimized for write and time-range queries ?
- Cassandra
- InfluxDB
- Amazon S3 with a columnar data format such as ORC or Parquet. Avro is also commonly used, although it is a row-oriented format rather than a columnar one.
What is MapReduce ?
MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware).
What is DAG model ?
The DAG model represents the well-known MapReduce paradigm. It is designed to take big data and use parallel distributed computing to turn big data into little- or regular-sized data.
In the DAG model, intermediate data can be stored in memory and different nodes communicate with each other through either TCP (nodes running in different processes) or shared memory (nodes running in different threads).
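A minimal sketch of how such a DAG might be wired for click counting, with intermediate data kept in memory. The node names (`map_node`, `aggregate_node`, `reduce_node`) and the partitioning by `ad_id` are assumptions for illustration, and plain function calls stand in for the TCP or shared-memory communication between nodes.

```python
import heapq
import zlib
from collections import Counter, defaultdict

def map_node(ad_ids, num_partitions):
    # Route each click by ad_id so that one aggregator sees all clicks of an ad.
    partitions = defaultdict(list)
    for ad_id in ad_ids:
        partitions[zlib.crc32(ad_id.encode()) % num_partitions].append(ad_id)
    return partitions

def aggregate_node(ad_ids):
    # Count clicks per ad_id within one partition (kept in memory).
    return Counter(ad_ids)

def reduce_node(partial_counts, n):
    # Merge the partial counts and keep the n most clicked ads.
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])

def run_dag(ad_ids, num_partitions=3, n=2):
    # Chain the three stages: map -> aggregate (per partition) -> reduce.
    partitions = map_node(ad_ids, num_partitions)
    partials = [aggregate_node(p) for p in partitions.values()]
    return reduce_node(partials, n)
```

Partitioning by `ad_id` means each ad is counted by exactly one aggregator, so the reduce step only has to merge and rank the partial results.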
What is lambda architecture ?
Lambda architecture is a system that contains two processing paths (batch and streaming) simultaneously. A disadvantage of lambda architecture is that you have two processing paths, meaning there are two codebases to maintain.
What is kappa architecture ?
Kappa architecture combines the batch and streaming in one processing path. The key idea is to handle both real-time data processing and continuous data reprocessing using a single stream processing engine.
What timestamp to use to process aggregation ?
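Event time vs. processing time. Event time is when the ad click happens on the client; processing time is the system time of the aggregation server that processes the event. Event time gives more accurate results but has to cope with delayed events and client-generated timestamps. Processing time is simpler and more reliable as a clock, but it miscounts events that arrive late.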
What is “watermark” ?
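A watermark is an extension of an aggregation window: the window stays open for a short extra period so that slightly delayed events are still counted. A longer watermark improves accuracy but adds latency.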
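As a rough illustration, the check below decides whether an event is counted in a window. It assumes event time selects the window and arrival time (processing time) is compared against the window end plus the watermark; the function name and the use of plain seconds are hypothetical.

```python
def accept_event(event_time, arrival_time, window_start, window_end, watermark):
    # The event belongs to the window if its event time is in [start, end).
    in_window = window_start <= event_time < window_end
    # It is still counted if it arrives before the extended deadline.
    on_time = arrival_time < window_end + watermark
    return in_window and on_time
```

With a 60-second window and a 15-second watermark, a click at second 58 that arrives at second 70 is still counted; without the watermark it would be dropped.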
What are the types of window function ?
tumbling (also called fixed) window
hopping window
sliding window
session window.
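A small sketch of how timestamps might be assigned to three of these window types. It assumes integer timestamps in seconds; the function names and parameters (`size`, `hop`, `gap`) are illustrative rather than taken from any particular stream processing engine, and the sliding window is omitted.

```python
def tumbling_window(ts, size):
    # Fixed, non-overlapping windows: each timestamp belongs to exactly one.
    start = ts - ts % size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    # Windows of length `size` that start every `hop` seconds; they overlap
    # when hop < size, so one timestamp can belong to several windows.
    last_start = ts - ts % hop
    starts = range(last_start, ts - size, -hop)
    return sorted((s, s + size) for s in starts)

def session_windows(timestamps, gap):
    # Group timestamps into sessions; a new session starts after more than
    # `gap` seconds of inactivity. Returns (first, last) per session.
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)
        else:
            sessions.append((ts, ts))
    return sessions
```

A click at second 70 falls into the single tumbling window `(60, 120)` for `size=60`, but into two hopping windows when `size=60` and `hop=30`.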
How to handle data deduplication ?
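Use external file storage, such as HDFS or S3, to record the offset of the last processed event, so that after a failure the aggregator can resume from that offset instead of consuming the same events again.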
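The idea can be sketched as follows. `OffsetStore` is a hypothetical in-memory stand-in for the external storage, the log is a plain list of ad_ids, and the counts are assumed to survive the restart (as if they had already been delivered downstream). A real system also has to make saving the offset and delivering the result consistent with each other, which this sketch does not attempt.

```python
from collections import Counter

class OffsetStore:
    # In-memory stand-in for external storage such as HDFS or S3 (hypothetical).
    def __init__(self):
        self._offset = 0

    def load(self):
        return self._offset

    def save(self, offset):
        self._offset = offset

def consume(log, store, counts, max_events=None):
    # Resume from the saved offset so already-counted events are skipped.
    offset = store.load()
    end = len(log) if max_events is None else min(len(log), offset + max_events)
    for i in range(offset, end):
        counts[log[i]] += 1
        store.save(i + 1)  # record progress after each processed event
    return counts
```

Calling `consume` with `max_events=2` and then calling it again without a limit simulates a crash and restart: the second call continues at offset 2, so no click is counted twice.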