System Design - Ad Click Aggregator Flashcards
What is the API or System Interface?
Input: click data from users. The browser makes an API call to the server to record the click. The server validates the click (the impression ID is present and correctly signed).
Outputs:
Ad Click data for Advertisers.
A database table (SQL) where they can query campaign data at these levels of granularity:
minute, hour, day, week, month
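A minimal sketch of what these two interfaces could look like; the field names, types, and shapes below are assumptions for illustration, not part of the design itself:

```java
// Hypothetical request/response shapes for the two interfaces.

// POST /click -- body carries the ad ID plus a signed impression ID; the response is a 302 redirect.
record ClickRequest(String adId, String impressionId, String signature) {}

// GET /metrics -- advertisers query aggregated click counts at a chosen granularity.
enum Granularity { MINUTE, HOUR, DAY, WEEK, MONTH }

record MetricsQuery(String advertiserId, String adId, Granularity granularity,
                    long startEpochMillis, long endEpochMillis) {}

// One row per (ad, time bucket) in the response.
record MetricsRow(String adId, long bucketStartEpochMillis, long clickCount) {}
```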
What is the Data Flow?
- User clicks on an ad on a website.
- The click is tracked and stored in the system.
- The user is redirected to the advertiser’s website.
- Advertisers query the system for aggregated click metrics.
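A rough sketch of that click path in Java, assuming a Kafka topic named ad-clicks and a placeholder validation helper (both hypothetical); the handler validates the click, enqueues it, and immediately returns the redirect target:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickHandler {
    private final KafkaProducer<String, String> producer;

    public ClickHandler(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    /** Returns the URL to redirect the user to (the HTTP layer sends the 302). */
    public String handleClick(String adId, String impressionId, String signature, String redirectUrl) {
        // 1. Validate the signed impression ID (see the click-fraud card below).
        if (!isValidImpression(impressionId, signature)) {
            return redirectUrl; // still redirect the user, just don't count the click
        }
        // 2. Enqueue the click, keyed by ad ID so clicks for one ad land on the same partition.
        producer.send(new ProducerRecord<>("ad-clicks", adId, impressionId));
        // 3. Redirect the user to the advertiser's site without waiting for aggregation.
        return redirectUrl;
    }

    private boolean isValidImpression(String impressionId, String signature) {
        // Placeholder -- real signature verification is sketched in the click-fraud card.
        return signature != null && !signature.isEmpty();
    }
}
```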
What are the core requirements?
- Users can click on an ad and be redirected to the advertiser’s website
- Advertisers can query ad click metrics over time with a minimum granularity of 1 minute
Out of scope requirements?
Ad targeting
Ad serving
Cross device tracking
Integration with offline marketing channels
Non-Functional Requirements?
Before we jump into the non-functional requirements, it's important to ask the interviewer about the scale of the system.
Core non-functional requirements:
- Peak of 10k clicks per second
- Low latency analytics queries for advertisers (sub-second response time)
- Fault tolerant and accurate data collection. We should not lose any click data.
- As realtime as possible. Advertisers should be able to query data as soon as possible after the click.
- Idempotent click tracking. We should not count the same click multiple times.
High Level Design
See below
HLD - things to show
- Browser click, redirect 302
- Kafka as the real-time event stream, with Flink as the stream processor. Flink keeps a running count of click totals in memory and updates them as new events come in. When the time window ends, it flushes the aggregated data to the OLAP database.
When the click processor service writes an event to the Kafka stream, Flink reads the event from the stream and aggregates it in real time. Flink's aggregation window configuration is straightforward, and it can flush on minute boundaries (a minimal sketch follows below).
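A minimal Flink sketch of that aggregation, assuming the click events have already been read from Kafka into a stream of ad IDs with timestamps/watermarks assigned upstream (all names are illustrative):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ClickAggregation {
    /** clicks: a stream of ad IDs, one element per click event read from Kafka. */
    public static DataStream<Tuple2<String, Long>> aggregate(DataStream<String> clicks) {
        return clicks
                .map(adId -> Tuple2.of(adId, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(value -> value.f0)                               // group by ad ID
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))   // flush on minute boundaries
                .sum(1);                                                // running count per window
        // The resulting (adId, count) pairs are then written to the OLAP database by a sink.
    }
}
```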
HLD - how do we scale the components
1. Click service
2. Kafka Stream
3. Stream processor
4. OLAP database
- Click service: add more servers (scale horizontally) with a load balancer and/or API gateway in front.
- Kafka stream: Kafka can handle a high event volume, on the order of 1 MB/s or 1,000 records/sec per shard. Partition (shard) by Ad ID, and potentially Ad ID:number to deal with popular ads.
- Stream processor: scaled horizontally by adding more tasks.
- OLAP database: scaled horizontally by adding more read replicas (read-only followers). Cache common queries in Redis. Monitor query latency against the SLAs.
How to ensure that we don’t lose any click data
Kafka is already highly available thanks to internal replication, and we can configure a retention period (e.g. 7 days) so events can be replayed.
Flink also supports checkpointing; however, since our windows are short, if Flink were to go down we could simply re-read from the stream and re-aggregate.
Reconciliation to ensure our click data is truly correct (double-checking the ad counts):
Have a periodic job that double-checks the click counts.
At the end of the stream, dump the raw click data to a data lake such as S3; Flink has a filesystem connector for this.
Run a batch job that reads all of the raw click data and re-aggregates it, then compare the batch results with the stream processor's results in the OLAP database. If there are differences, investigate and then correct the OLAP DB (a sketch of the comparison is below).
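A sketch of the comparison step, assuming both the batch job and the OLAP database can be reduced to per-(adId, minute) count maps; the key format and method names are hypothetical:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Reconciliation {
    /** Keys look like "adId|minuteBucket"; values are click counts. */
    public static Set<String> findMismatches(Map<String, Long> batchCounts,
                                             Map<String, Long> olapCounts) {
        Set<String> mismatched = new HashSet<>();
        Set<String> allKeys = new HashSet<>(batchCounts.keySet());
        allKeys.addAll(olapCounts.keySet());
        for (String key : allKeys) {
            long fromBatch = batchCounts.getOrDefault(key, 0L);
            long fromOlap = olapCounts.getOrDefault(key, 0L);
            if (fromBatch != fromOlap) {
                mismatched.add(key); // investigate these, then correct the OLAP row
            }
        }
        return mismatched;
    }
}
```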
How do we prevent click fraud and avoid counting the same click multiple times?
Have a unique impression ID; it is sent in the /click API, and we can de-dup clicks based on it.
Have the impression ID digitally signed with a secret key. The click processor verifies the signature, then checks whether the impression ID is already in the cache. If it is not, this is the first click for that impression, so we count it and record the ID with an expiration (see the sketch below).
Cache sizing: 16 bytes × 100 million impression IDs ≈ 1.6 GB.
The cache data is small, so it fits comfortably in a distributed cache like Redis Cluster. If part of the cache goes down, the worst case is that a few clicks for the affected impression IDs are double-counted until it recovers.
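A sketch of both checks, using HMAC-SHA256 for the signature and a Redis SET ... NX EX for the de-dup; the key prefix, 24-hour TTL, and URL-safe Base64 encoding are assumptions:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class ImpressionValidator {
    /** Verify that the impression ID was signed with our secret key (assumes URL-safe Base64 signature). */
    public static boolean isValidSignature(String impressionId, String signature, byte[] secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] expected = mac.doFinal(impressionId.getBytes(StandardCharsets.UTF_8));
        byte[] provided = Base64.getUrlDecoder().decode(signature);
        return MessageDigest.isEqual(expected, provided); // constant-time compare
    }

    /** Returns true only the first time we see this impression ID. */
    public static boolean isFirstClick(Jedis redis, String impressionId) {
        // SET key 1 NX EX 86400 -- succeeds only if the key did not already exist,
        // and expires after 24 hours so the cache stays small.
        String reply = redis.set("click:" + impressionId, "1",
                SetParams.setParams().nx().ex(86400));
        return "OK".equals(reply);
    }
}
```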
How to make metrics queries fast for advertisers?
Store data in OLAP database with aggregate amounts pre-computed.
Have a nightly cronjob that aggregates the data and places it in a new table (a sketch is below).
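A sketch of such a rollup over JDBC, assuming Postgres-style SQL and hypothetical table/column names (ad_clicks_minutely, ad_clicks_hourly):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class NightlyRollup {
    public static void rollupHourly(String jdbcUrl) throws SQLException {
        // Re-aggregate yesterday's minute-level rows into an hourly table
        // so coarser-granularity queries don't have to scan minute-level data.
        String sql =
            "INSERT INTO ad_clicks_hourly (ad_id, hour_bucket, click_count) " +
            "SELECT ad_id, date_trunc('hour', minute_bucket), SUM(click_count) " +
            "FROM ad_clicks_minutely " +
            "WHERE minute_bucket >= date_trunc('day', now() - interval '1 day') " +
            "  AND minute_bucket <  date_trunc('day', now()) " +
            "GROUP BY ad_id, date_trunc('hour', minute_bucket)";
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(sql);
        }
    }
}
```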
What topics should we use for the Kafka message queue?
Can have a topic (or partition key) for each AdID.
If there is an extremely popular ad, we could use AdID:N so the hot ad's traffic is spread across several topics/partitions (sketch below).
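If we use a single topic keyed by AdID rather than a topic per ad, the hot-ad splitting on the producer side could look like this sketch (the fan-out constant and the isHotAd flag are assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickPublisher {
    private static final int HOT_AD_SPLITS = 8; // assumed fan-out for very popular ads

    private final KafkaProducer<String, String> producer;

    public ClickPublisher(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void publish(String adId, String clickJson, boolean isHotAd) {
        // Normal ads use the plain ad ID as the key; hot ads get a random suffix
        // (AdID:0 .. AdID:N-1) so their clicks spread across partitions.
        String key = isHotAd
                ? adId + ":" + ThreadLocalRandom.current().nextInt(HOT_AD_SPLITS)
                : adId;
        producer.send(new ProducerRecord<>("ad-clicks", key, clickJson));
    }
}
```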
How to shard the OLAP database
Shard by Advertiser ID, so that all of their data is on the same node. Could also shard by AdID.