System Design - Ad Click Aggregator Flashcards

1
Q

What is the API or System Interface?

A

Input: click data from users. The browser makes an API call to the server to record the click, and the server validates it (the impression ID is valid and correctly signed).

Outputs:
Ad click data for advertisers.
A database table (SQL) they can query for campaign data at these levels of granularity:
minute, hour, day, week, month
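
A minimal sketch of the click endpoint, assuming a Flask-style server. The helper names (click_is_valid, enqueue_click, landing_url) and the landing URL are hypothetical stubs, not part of the original design:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    def click_is_valid(impression_id: str, signature: str) -> bool:
        # Stub: verify the signature on the impression ID (see the click-fraud card).
        return True

    def enqueue_click(impression_id: str) -> None:
        # Stub: in the real design this writes the click event to the Kafka stream.
        pass

    def landing_url(impression_id: str) -> str:
        # Stub: look up the advertiser's destination URL for this impression.
        return "https://advertiser.example.com/landing"

    @app.route("/click")
    def click():
        impression_id = request.args["impression_id"]
        signature = request.args["signature"]
        if not click_is_valid(impression_id, signature):
            return "invalid impression", 400
        enqueue_click(impression_id)
        return redirect(landing_url(impression_id), code=302)  # send the user on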

2
Q

What is the Data Flow?

A
  1. User clicks on an ad on a website.
  2. The click is tracked and stored in the system.
  3. The user is redirected to the advertiser’s website.
  4. Advertisers query the system for aggregated click metrics.
3
Q

What are the core requirements?

A
  1. Users can click on an ad and be redirected to the advertiser’s website
  2. Advertisers can query ad click metrics over time with a minimum granularity of 1 minute
4
Q

Out of scope requirements?

A

Ad targeting
Ad serving
Cross-device tracking
Integration with offline marketing channels

5
Q

Non-Functional Requirements?

A

Before we jump into the non-functional requirements, it's important to ask your interviewer about the scale of the system.

  1. Peak of 10k clicks per second
  2. Low-latency analytics queries for advertisers (sub-second response time)
  3. Fault-tolerant and accurate data collection. We should not lose any click data.
  4. As real-time as possible. Advertisers should be able to query data as soon as possible after the click.
  5. Idempotent click tracking. We should not count the same click multiple times.
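
A quick back-of-envelope check on that peak (the ~100-byte event size is an assumption, not from the original notes):

    peak_clicks_per_sec = 10_000
    bytes_per_event = 100  # assumed average size of a click event
    print(peak_clicks_per_sec * bytes_per_event / 1e6, "MB/s")  # ~1.0 MB/s at peak

Around 1 MB/s of peak ingest is modest for Kafka, so the harder constraints here are the sub-second queries and the no-data-loss guarantee.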
6
Q

High Level Design

A

See below

7
Q

HLD - things to show

A
  1. Browser click hits the click processor service, which records the event and returns a 302 redirect to the advertiser's site.
  2. Kafka as the event stream, with Flink as the real-time stream processor. Flink keeps a running count of click totals in memory and updates them as new events come in. When a time window ends, it flushes the aggregated data to the OLAP database.

When the click processor service writes an event to the Kafka stream, the stream processor reads it and aggregates in real time. Flink's aggregation window configuration is straightforward, and it can flush on minute boundaries (see the sketch below).
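
A toy, pure-Python stand-in for what the Flink job does: a minimal sketch that assumes in-order events (real Flink handles late or out-of-order events with watermarks):

    from collections import defaultdict

    WINDOW_SECONDS = 60  # flush on minute boundaries, like a tumbling window

    class MinuteAggregator:
        """Toy stand-in for a Flink tumbling-window job: clicks per ad per minute."""

        def __init__(self):
            self.current_window = None      # start timestamp of the open window
            self.counts = defaultdict(int)  # ad_id -> clicks in the open window

        def on_click(self, ad_id: str, event_ts: int):
            window = event_ts - event_ts % WINDOW_SECONDS  # align to minute boundary
            if self.current_window is not None and window != self.current_window:
                self.flush()  # the previous minute is complete
            self.current_window = window
            self.counts[ad_id] += 1

        def flush(self):
            # In the real design this writes one row per (ad_id, minute) to OLAP.
            for ad_id, n in self.counts.items():
                print(f"OLAP upsert: ad={ad_id} minute={self.current_window} clicks={n}")
            self.counts.clear()

    agg = MinuteAggregator()
    for ad_id, ts in [("ad1", 0), ("ad1", 30), ("ad2", 59), ("ad1", 61)]:
        agg.on_click(ad_id, ts)
    agg.flush()

Flushing exactly on the window boundary is what keeps the OLAP rows aligned to the 1-minute query granularity from the requirements.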

8
Q

HLD - how do we scale the components
1. Click service
2. Kafka Stream
3. Stream processor
4. OLAP database

A
  1. Add more servers (scale horizontally). Put a load balancer and/or API gateway in front.
  2. Kafka can handle a very high event rate. A conservative planning figure is 1 MB/s or 1,000 records/sec per shard (those are Kinesis's published per-shard limits; Kafka partitions generally sustain more). Partition by AdID, and for very popular ads salt the key as AdID:N (see the sketch below).
  3. Stream processor: scale horizontally by adding more tasks.
  4. OLAP database: scale horizontally by adding read-only replicas. Cache common queries in Redis. Monitor query latency against SLAs.
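
A sketch of the hot-key salting for the partition key, assuming the confluent-kafka Python client; the topic name and hot-ad set are hypothetical:

    import random

    from confluent_kafka import Producer

    HOT_ADS = {"ad42"}   # hypothetical: ads known to be extremely popular
    SALT_BUCKETS = 8     # the N in AdID:N

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def partition_key(ad_id: str) -> str:
        # Normal ads keep their plain AdID; hot ads are spread over N salted keys
        # so no single partition has to absorb the whole ad's traffic.
        if ad_id in HOT_ADS:
            return f"{ad_id}:{random.randrange(SALT_BUCKETS)}"
        return ad_id

    def record_click(ad_id: str, payload: bytes) -> None:
        producer.produce("ad-clicks", key=partition_key(ad_id), value=payload)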
9
Q

How to ensure that we don’t lose any click data

A

Kafka is highly available out of the box: topics are replicated internally across brokers, and we can configure a retention period of 7 days so events survive downstream failures.

Flink also supports checkpointing (periodic snapshots of operator state). But since our aggregation windows are short, if Flink were to go down we could simply re-read the retained events from the stream and re-aggregate.
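
A sketch of durability-oriented producer settings, again assuming the confluent-kafka client (the broker address is a placeholder):

    from confluent_kafka import Producer

    # acks=all means a write is only acknowledged once replicated to the
    # in-sync replicas; idempotence stops retries from creating duplicates.
    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "acks": "all",
        "enable.idempotence": True,
    })
    # The topic itself would be created with retention.ms=604800000 (7 days),
    # so events stay replayable if the stream processor falls over.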

10
Q

Reconciliation to ensure our click data is truly correct. (Double checking the ad counts).

A

Have a periodic job that double-checks the click counts.

At the end of the stream, also dump the raw click data to a data lake such as S3 (Flink has a filesystem connector for this).

Run a batch job that reads all the raw click data and re-aggregates it, then compare the batch results with the stream processor's results in the OLAP database. If there are differences, investigate and then correct the OLAP DB.
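
A minimal sketch of the batch re-aggregation and comparison, in pure Python over toy in-memory data; the real job would read from S3 and the OLAP store:

    from collections import Counter

    def reaggregate(raw_clicks):
        """Batch re-aggregation over the raw events in the data lake."""
        counts = Counter()
        for ad_id, event_ts in raw_clicks:
            minute = event_ts - event_ts % 60
            counts[(ad_id, minute)] += 1
        return counts

    def reconcile(raw_clicks, olap_counts):
        """Yield every (ad_id, minute) bucket where batch and stream disagree."""
        batch = reaggregate(raw_clicks)
        for key in batch.keys() | olap_counts.keys():
            if batch.get(key, 0) != olap_counts.get(key, 0):
                yield key, olap_counts.get(key, 0), batch.get(key, 0)

    # Example: OLAP overcounted ad1 in minute 0.
    raw = [("ad1", 5), ("ad1", 20), ("ad2", 70)]
    olap = {("ad1", 0): 3, ("ad2", 60): 1}
    for key, streamed, batched in reconcile(raw, olap):
        print(f"mismatch at {key}: stream={streamed} batch={batched}")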

11
Q

How to prevent click fraud or users clicking multiple times?

A

Give every ad impression a unique impression ID. This ID is sent in the /click API call, and we can de-dup clicks based on it.

Have the impression ID be digitally signed with a secret key so users can't forge new ones. The click processor verifies the signature, then checks whether the impression ID is already in the cache. If not, this is the first click for that impression: count it and cache the ID, with an expiration.

The cache data stays small: 16 bytes per impression ID x 100 million impressions = 1.6 GB. Use a distributed cache like Redis Cluster. If part of the cache goes down, a few duplicate clicks could slip through until it recovers; the reconciliation job from the previous card would catch the discrepancy.
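
A sketch of the verify-then-dedup check, assuming the redis-py client and HMAC-signed impression IDs; the secret, key prefix, and TTL are assumptions:

    import hashlib
    import hmac

    import redis

    SECRET_KEY = b"server-side-secret"   # hypothetical; never shipped to the browser
    DEDUP_TTL_SECONDS = 24 * 3600        # expire dedup entries after a day

    r = redis.Redis()

    def sign(impression_id: str) -> str:
        return hmac.new(SECRET_KEY, impression_id.encode(), hashlib.sha256).hexdigest()

    def accept_click(impression_id: str, signature: str) -> bool:
        """True only for a validly signed impression ID we haven't seen before."""
        if not hmac.compare_digest(sign(impression_id), signature):
            return False  # forged or corrupted impression ID
        # SET NX is atomic: exactly one concurrent request wins the first-click slot.
        return r.set(f"imp:{impression_id}", 1, nx=True, ex=DEDUP_TTL_SECONDS) is not None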

12
Q

How to make metrics queries fast for advertisers?

A

Store data in the OLAP database with the aggregate amounts pre-computed.
Have a nightly cron job that rolls the fine-grained rows up into coarser granularities (e.g., minutes into hours and days) and places them in a new table.
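
A toy sketch of that nightly rollup in pure Python; the real job would run as SQL inside the OLAP store:

    from collections import Counter

    def roll_up_hours(minute_rows):
        """Collapse (ad_id, minute_ts, clicks) rows into (ad_id, hour_ts, clicks)."""
        hourly = Counter()
        for ad_id, minute_ts, clicks in minute_rows:
            hour_ts = minute_ts - minute_ts % 3600  # align to the hour boundary
            hourly[(ad_id, hour_ts)] += clicks
        return [(ad_id, hour_ts, n) for (ad_id, hour_ts), n in hourly.items()]

    rows = [("ad1", 0, 5), ("ad1", 60, 7), ("ad1", 3600, 2)]
    print(roll_up_hours(rows))  # [('ad1', 0, 12), ('ad1', 3600, 2)]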

13
Q

What topics should we use for the Kafka message queue?

A

Use a single click-events topic and key (partition) the messages by AdID, so each ad's events land on the same partition.
If there is an extremely popular ad, we can salt the key as AdID:N to spread it across N partitions (see the salting sketch in the scaling card).

14
Q

How to shard the OLAP database

A

Shard by AdvertiserID, so all of an advertiser's data is on the same node and their queries hit a single shard. Could also shard by AdID.
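
A minimal sketch of the routing function, assuming a fixed shard count; hash-mod is the simplest scheme, and consistent hashing would ease resharding:

    import hashlib

    NUM_SHARDS = 16  # hypothetical shard count

    def shard_for(advertiser_id: str) -> int:
        # All rows and queries for one advertiser route to the same shard.
        digest = hashlib.md5(advertiser_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    print(shard_for("advertiser-123"))  # stable value in [0, 16)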

15
Q

Ask ChatGPT about designing the real-time system

A

See your Evernote

Also, having the partition key of AdID:N causes aggregation issues further downstream, since you'll have to read from multiple partitions. There is a presentation from Yelp about this.

From Evan:
Good callout: you can configure a Flink job to read from multiple partitions and still aggregate data effectively across those partitions.
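
A tiny sketch of the downstream fix; in Flink terms this is the key-by after consuming all partitions, shown here in pure Python:

    def true_ad_id(partition_key: str) -> str:
        # Strip the :N salt so aggregation groups by the real AdID,
        # regardless of which salted partition the event came from.
        return partition_key.split(":", 1)[0]

    assert true_ad_id("ad42:3") == "ad42"
    assert true_ad_id("ad7") == "ad7"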