System Design - Ad Click Aggregator Flashcards
What is the API or System Interface?
Input: click data from users. The browser makes an API call to the server to record the click. The server validates the click (the impression ID is present and correctly signed).
Outputs:
Ad Click data for Advertisers.
A database table (SQL) where they can query campaign data at these levels of granularity:
minute, hour, day, week, month
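A minimal sketch of what these two interfaces could look like; the field names, types, and shapes below are assumptions for illustration, not part of the design itself:

```java
// Hypothetical request/response shapes for the two interfaces.

// POST /click -- body carries the ad ID plus a signed impression ID; the response is a 302 redirect.
record ClickRequest(String adId, String impressionId, String signature) {}

// GET /metrics -- advertisers query aggregated click counts at a chosen granularity.
enum Granularity { MINUTE, HOUR, DAY, WEEK, MONTH }

record MetricsQuery(String advertiserId, String adId, Granularity granularity,
                    long startEpochMillis, long endEpochMillis) {}

// One row per (ad, time bucket) in the response.
record MetricsRow(String adId, long bucketStartEpochMillis, long clickCount) {}
```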
What is the Data Flow?
- User clicks on an ad on a website.
- The click is tracked and stored in the system.
- The user is redirected to the advertiser’s website.
- Advertisers query the system for aggregated click metrics.
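A rough sketch of that click path in Java, assuming a Kafka topic named ad-clicks and a placeholder validation helper (both hypothetical); the handler validates the click, enqueues it, and immediately returns the redirect target:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickHandler {
    private final KafkaProducer<String, String> producer;

    public ClickHandler(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    /** Returns the URL to redirect the user to (the HTTP layer sends the 302). */
    public String handleClick(String adId, String impressionId, String signature, String redirectUrl) {
        // 1. Validate the signed impression ID (see the click-fraud card below).
        if (!isValidImpression(impressionId, signature)) {
            return redirectUrl; // still redirect the user, just don't count the click
        }
        // 2. Enqueue the click, keyed by ad ID so clicks for one ad land on the same partition.
        producer.send(new ProducerRecord<>("ad-clicks", adId, impressionId));
        // 3. Redirect the user to the advertiser's site without waiting for aggregation.
        return redirectUrl;
    }

    private boolean isValidImpression(String impressionId, String signature) {
        // Placeholder -- real signature verification is sketched in the click-fraud card.
        return signature != null && !signature.isEmpty();
    }
}
```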
What are the core requirements?
- Users can click on an ad and be redirected to the advertiser’s website
- Advertisers can query ad click metrics over time with a minimum granularity of 1 minute
Out of scope requirements?
Ad targeting
Ad serving
Cross device tracking
Integration with offline marketing channels
Non-Functional Requirements?
Before we jump into the non-functional requirements, it's important to ask the interviewer about the scale of the system.
Core non-functional requirements:
- Peak of 10k clicks per second
- Low latency analytics queries for advertisers (sub-second response time)
- Fault tolerant and accurate data collection. We should not lose any click data.
- As realtime as possible. Advertisers should be able to query data as soon as possible after the click.
- Idempotent click tracking. We should not count the same click multiple times.
High Level Design
See below
HLD - things to show
- Browser click, redirect 302
- Kafka as the real-time event stream, with Flink as the stream processor. Flink keeps a running count of click totals in memory and updates them as new events come in. When the time window ends, it flushes the aggregated data to the OLAP database.
When the click processor service writes an event to the Kafka stream, Flink reads the event from the stream and aggregates it in real time. Flink's aggregation window configuration is straightforward, and it can flush on minute boundaries (a minimal sketch follows below).
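A minimal Flink sketch of that aggregation, assuming the click events have already been read from Kafka into a stream of ad IDs with timestamps/watermarks assigned upstream (all names are illustrative):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ClickAggregation {
    /** clicks: a stream of ad IDs, one element per click event read from Kafka. */
    public static DataStream<Tuple2<String, Long>> aggregate(DataStream<String> clicks) {
        return clicks
                .map(adId -> Tuple2.of(adId, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(value -> value.f0)                               // group by ad ID
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))   // flush on minute boundaries
                .sum(1);                                                // running count per window
        // The resulting (adId, count) pairs are then written to the OLAP database by a sink.
    }
}
```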
HLD - how do we scale the components
1. Click service
2. Kafka Stream
3. Stream processor
4. OLAP database
- Click service: add more servers (scale horizontally) with a load balancer and/or API gateway in front.
- Kafka stream: Kafka can handle a high event volume, on the order of 1 MB/s or 1,000 records/sec per shard. Partition (shard) by Ad ID, and potentially Ad ID:number to deal with popular ads.
- Stream processor: scaled horizontally by adding more tasks.
- OLAP database: scaled horizontally by adding more read replicas (read-only followers). Cache common queries in Redis. Monitor query latency against the SLAs.
How to ensure that we don’t lose any click data
Kafka is already highly available thanks to internal replication, and we can configure a retention period (e.g. 7 days) so events can be replayed.
Flink also supports checkpointing; however, since our windows are short, if Flink were to go down we could simply re-read from the stream and re-aggregate.
Reconciliation to ensure our click data is truly correct (double-checking the ad counts):
Have a periodic job that double-checks the click counts.
At the end of the stream, dump the raw click data to a data lake such as S3; Flink has a filesystem connector for this.
Run a batch job that reads all of the raw click data and re-aggregates it, then compare the batch results with the stream processor's results in the OLAP database. If there are differences, investigate and then correct the OLAP DB (a sketch of the comparison is below).
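A sketch of the comparison step, assuming both the batch job and the OLAP database can be reduced to per-(adId, minute) count maps; the key format and method names are hypothetical:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Reconciliation {
    /** Keys look like "adId|minuteBucket"; values are click counts. */
    public static Set<String> findMismatches(Map<String, Long> batchCounts,
                                             Map<String, Long> olapCounts) {
        Set<String> mismatched = new HashSet<>();
        Set<String> allKeys = new HashSet<>(batchCounts.keySet());
        allKeys.addAll(olapCounts.keySet());
        for (String key : allKeys) {
            long fromBatch = batchCounts.getOrDefault(key, 0L);
            long fromOlap = olapCounts.getOrDefault(key, 0L);
            if (fromBatch != fromOlap) {
                mismatched.add(key); // investigate these, then correct the OLAP row
            }
        }
        return mismatched;
    }
}
```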
How do we prevent click fraud and avoid counting the same click multiple times?
Have a unique impression ID; it is sent in the /click API, and we can de-dup clicks based on it.
Have the impression ID digitally signed with a secret key. The click processor verifies the signature, then checks whether the impression ID is already in the cache. If it is not, this is the first click for that impression, so we count it and record the ID with an expiration (see the sketch below).
Cache sizing: 16 bytes × 100 million impression IDs ≈ 1.6 GB.
The cache data is small, so it fits comfortably in a distributed cache like Redis Cluster. If part of the cache goes down, the worst case is that a few clicks for the affected impression IDs are double-counted until it recovers.
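A sketch of both checks, using HMAC-SHA256 for the signature and a Redis SET ... NX EX for the de-dup; the key prefix, 24-hour TTL, and URL-safe Base64 encoding are assumptions:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class ImpressionValidator {
    /** Verify that the impression ID was signed with our secret key (assumes URL-safe Base64 signature). */
    public static boolean isValidSignature(String impressionId, String signature, byte[] secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] expected = mac.doFinal(impressionId.getBytes(StandardCharsets.UTF_8));
        byte[] provided = Base64.getUrlDecoder().decode(signature);
        return MessageDigest.isEqual(expected, provided); // constant-time compare
    }

    /** Returns true only the first time we see this impression ID. */
    public static boolean isFirstClick(Jedis redis, String impressionId) {
        // SET key 1 NX EX 86400 -- succeeds only if the key did not already exist,
        // and expires after 24 hours so the cache stays small.
        String reply = redis.set("click:" + impressionId, "1",
                SetParams.setParams().nx().ex(86400));
        return "OK".equals(reply);
    }
}
```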
How to make metrics queries fast for advertisers?
Store data in OLAP database with aggregate amounts pre-computed.
Have a nightly cronjob that aggregates the data and places it in a new table (a sketch is below).
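A sketch of such a rollup over JDBC, assuming Postgres-style SQL and hypothetical table/column names (ad_clicks_minutely, ad_clicks_hourly):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class NightlyRollup {
    public static void rollupHourly(String jdbcUrl) throws SQLException {
        // Re-aggregate yesterday's minute-level rows into an hourly table
        // so coarser-granularity queries don't have to scan minute-level data.
        String sql =
            "INSERT INTO ad_clicks_hourly (ad_id, hour_bucket, click_count) " +
            "SELECT ad_id, date_trunc('hour', minute_bucket), SUM(click_count) " +
            "FROM ad_clicks_minutely " +
            "WHERE minute_bucket >= date_trunc('day', now() - interval '1 day') " +
            "  AND minute_bucket <  date_trunc('day', now()) " +
            "GROUP BY ad_id, date_trunc('hour', minute_bucket)";
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(sql);
        }
    }
}
```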
What topics should we use for the Kafka message queue?
Can have a topic (or partition key) for each AdID.
If there is an extremely popular ad, we could use AdID:N so the hot ad's traffic is spread across several topics/partitions (sketch below).
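If we use a single topic keyed by AdID rather than a topic per ad, the hot-ad splitting on the producer side could look like this sketch (the fan-out constant and the isHotAd flag are assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickPublisher {
    private static final int HOT_AD_SPLITS = 8; // assumed fan-out for very popular ads

    private final KafkaProducer<String, String> producer;

    public ClickPublisher(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void publish(String adId, String clickJson, boolean isHotAd) {
        // Normal ads use the plain ad ID as the key; hot ads get a random suffix
        // (AdID:0 .. AdID:N-1) so their clicks spread across partitions.
        String key = isHotAd
                ? adId + ":" + ThreadLocalRandom.current().nextInt(HOT_AD_SPLITS)
                : adId;
        producer.send(new ProducerRecord<>("ad-clicks", key, clickJson));
    }
}
```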
How to shard the OLAP database
Shard by Advertiser ID, so that all of their data is on the same node. Could also shard by AdID.