Common Designs Flashcards
Web Crawler - Initial Design (high level)
Backend Crawler:
From Grokking the System Design Interview:
- Pick a URL from unvisited list.
- Resolve its IP address (DNS lookup)
- Connect to the IP and download the document
- Parse the document contents to look for new URLs
- Add the new URLs to the unvisited list
- Process the downloaded document (index it)
- Go back to step 1
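A minimal single-threaded sketch of that loop (the fetch/extract helpers below are illustrative, not a real crawler API):

```python
# Minimal sketch of the crawl loop above.
import re
import urllib.request

def fetch(url):
    """Resolve the host and download the document (steps 2-3)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(html):
    """Very naive link extraction (step 4); a real crawler would use an HTML parser."""
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(seed_urls, index_fn, max_pages=100):
    unvisited = list(seed_urls)
    visited = set()
    while unvisited and len(visited) < max_pages:
        url = unvisited.pop(0)           # step 1: pick a URL
        if url in visited:
            continue
        try:
            doc = fetch(url)             # steps 2-3: resolve and download
        except OSError:
            continue                     # skip unreachable pages
        visited.add(url)
        for link in extract_links(doc):  # step 4: find new URLs
            if link not in visited:
                unvisited.append(link)   # step 5: add to unvisited list
        index_fn(url, doc)               # step 6: process/index the document
```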
Web Crawler - Code / Abstractions
- Crawler class, and how it interacts with the list of URLs in the DB and with page data
Crawler
PagesDataStore
Page
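One possible shape for these abstractions; the method names below are illustrative, not from the source:

```python
# Illustrative interfaces only; the real design would back these with a DB and queues.
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    contents: str = ""
    child_urls: list = field(default_factory=list)

class PagesDataStore:
    """Wraps the DB holding the unvisited URL list and crawled page data."""
    def __init__(self):
        self._unvisited = []
        self._crawled = {}

    def add_link_to_crawl(self, url):
        if url not in self._crawled:
            self._unvisited.append(url)

    def next_link(self):
        return self._unvisited.pop(0) if self._unvisited else None

    def save_page(self, page):
        self._crawled[page.url] = page

class Crawler:
    """Drives the loop: pull a URL, download, extract links, store results."""
    def __init__(self, store, downloader, parser):
        self.store, self.downloader, self.parser = store, downloader, parser

    def crawl_one(self):
        url = self.store.next_link()
        if url is None:
            return None
        page = Page(url, contents=self.downloader(url))
        page.child_urls = self.parser(page.contents)
        for child in page.child_urls:
            self.store.add_link_to_crawl(child)
        self.store.save_page(page)
        return page
```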
Dropbox (File Hosting)
Functional:
Clients specify a workspace/directory; all files in it get synced.
Clients can view the history of updates to a file.
Non-Functional:
- Huge read/write volumes with a roughly 50/50 read/write ratio
Strategy:
Divide files into chunks (see the chunking sketch after this list).
- Only failed chunks are retried, only updated chunks are transferred, don’t store duplicate chunks across file versions.
Track chunks with a metadata file that indexes each chunk of the file.
A block service uploads/downloads file chunks to/from cloud storage.
A metadata service keeps file metadata in SQL/NoSQL.
A synchronization service notifies all clients when file changes need to be downloaded.
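A rough sketch of the chunk-and-hash idea; the 4 MB chunk size and SHA-256 hashing are assumptions, not decisions from the design:

```python
# Split a file into fixed-size chunks and hash each one; only chunks whose
# hash is not already known need to be uploaded, and only failed chunks retried.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MB; tuned per the component-design notes

def chunk_file(path):
    """Yield (index, sha256_hex, bytes) for each chunk of the file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield index, hashlib.sha256(data).hexdigest(), data
            index += 1

def changed_chunks(path, known_hashes):
    """Return only the chunks whose hashes are not in the metadata file."""
    return [(i, h, data) for i, h, data in chunk_file(path) if h not in known_hashes]
```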
Component Design:
File chunk size: determined by studying average file size, network bandwidth, and cloud block-storage optimization
Client:
- Has: Internal db, Chunker, Indexer, Watcher.
- Watcher tracks files in the workspace and detects changes.
- Chunker splits files into standard chunk sizes and calculates a hash for each chunk.
- Indexer compares hashes against the internal DB and uploads only the changed chunks.
- Overall flow: detect file changes in the workspace, update the internal DB, upload to the block service, notify the sync service.
- Uses WebSocket or long polling with the Sync service to learn when other clients push updates, then fetches the new chunks from the block service.
Metadata db:
- File names, chunks, users, devices, workspaces
- SQL or NoSQL; either way we need ACID-like guarantees so concurrent updates to the same file's metadata don't clobber each other
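One hedged way to get that isolation in application logic is optimistic concurrency with a per-file version number (illustrative only):

```python
# Compare-and-set style update: the write only succeeds if the version we read
# is still current, so two clients cannot silently overwrite each other's chunk lists.
class ConflictError(Exception):
    pass

class MetadataStore:
    def __init__(self):
        self._rows = {}  # file_id -> {"version": int, "chunks": [hashes]}

    def read(self, file_id):
        row = self._rows.setdefault(file_id, {"version": 0, "chunks": []})
        return row["version"], list(row["chunks"])

    def write(self, file_id, expected_version, chunks):
        row = self._rows.setdefault(file_id, {"version": 0, "chunks": []})
        if row["version"] != expected_version:
            raise ConflictError("metadata changed underneath us; re-read and retry")
        row["version"] += 1
        row["chunks"] = list(chunks)
```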
Sync Service:
- Hashes file chunks to detect differences.
Queueing:
- Because clients are smart and store data locally, we can do most work asynchronously, using highly available, reliable, scalable queues between clients and backend services.
- There is a global request queue for all clients, and individual response queues.
Caching:
- Caches for block and metadata storage.
- We can store the hottest chunks in an LRU cache.
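A tiny LRU sketch using OrderedDict; the capacity and eviction policy are assumptions:

```python
# Keep only the N hottest chunks; reading a chunk moves it to the "most recent" end,
# and inserting past capacity evicts the least recently used chunk.
from collections import OrderedDict

class ChunkCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, chunk_hash):
        if chunk_hash not in self._data:
            return None
        self._data.move_to_end(chunk_hash)
        return self._data[chunk_hash]

    def put(self, chunk_hash, chunk_bytes):
        self._data[chunk_hash] = chunk_bytes
        self._data.move_to_end(chunk_hash)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```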
Metadata DB scaling:
- Dropbox scaled MySQL into something called EdgeStore
- If we use NoSQL instead of an RDBMS, consistency is still important, so the application logic must enforce serializable isolation for metadata updates
- Use consistent hashing on file ID to shard the chunk data.
Facebook Messenger (instant messaging)
Requirements:
- 1:1 conversations
- Online/offline status
- Chat history
- Group chats
- Push notifications
- Real-time latency
- High consistency across devices (C > A)
High-level design:
Clients send chats to chat server
Chat server notifies other users and stores chat in DB
Clients / chat server acknowledge receipt / delivery
Component Design:
Chat server:
- WebSockets, so the chat server can immediately push to clients.
- (1:1 chats) The chat server keeps a key-value map from userID to that user's active WebSocket connection.
- If user2 went offline, the key-value entry will be missing; the chat server should retry for a few seconds in case they reconnect, otherwise notify user1 of the delivery failure.
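A sketch of the userID-to-connection map and the offline retry; the connection object and its send() method are simplified stand-ins, not a real WebSocket API:

```python
# Map userID -> active connection; if the recipient is offline, retry briefly,
# then report the failure back to the sender.
import time

class ChatServer:
    def __init__(self, retry_seconds=5):
        self.connections = {}       # userID -> connection object with .send(msg)
        self.retry_seconds = retry_seconds

    def register(self, user_id, conn):
        self.connections[user_id] = conn

    def unregister(self, user_id):
        self.connections.pop(user_id, None)

    def deliver(self, sender_id, recipient_id, message):
        deadline = time.time() + self.retry_seconds
        while time.time() < deadline:
            conn = self.connections.get(recipient_id)
            if conn is not None:
                conn.send(message)
                return True
            time.sleep(0.5)          # recipient offline; wait for a reconnect
        sender = self.connections.get(sender_id)
        if sender is not None:
            sender.send(f"delivery to {recipient_id} failed")
        return False
```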
Many chat servers:
- Sessions: Load balancer maps userIDs to their own server.
- This means when a message is received, the chat server must look up which other chat server holds the recipient's connection and forward it there (how? one option is sketched after this list)
- Fault tolerance: if a chat server goes down, connections are lost for those users. We can just have clients auto-retry to create new connections.
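One possible answer to the "how?" above: a shared userID-to-server registry that every chat server updates and consults before forwarding. The registry itself would live in something like Redis or ZooKeeper; all names here are illustrative:

```python
# Shared session registry: every chat server records which users it holds,
# and consults the registry to forward messages for users connected elsewhere.
class SessionRegistry:
    """Stand-in for a shared store mapping userID -> chat server."""
    def __init__(self):
        self._owner = {}

    def set_owner(self, user_id, server_id):
        self._owner[user_id] = server_id

    def get_owner(self, user_id):
        return self._owner.get(user_id)

def route_message(registry, local_server_id, recipient_id, deliver_local, forward):
    owner = registry.get_owner(recipient_id)
    if owner is None:
        return "recipient offline"
    if owner == local_server_id:
        deliver_local(recipient_id)      # recipient is connected to this server
    else:
        forward(owner, recipient_id)     # hand off to the owning chat server
    return f"routed via {owner}"
```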
DB:
- Chat service must write asynchronously.
- Huge write load of small updates.
- Writing 1 row per message is challenging even for NoSQL.
- Consider HBase / Cassandra, which use LSM trees and buffer the newest writes in memory to support high write loads. They also support fast range-based (e.g. time) queries.
Clients:
- Paginate requests, from newer messages to older.
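A sketch of newest-to-oldest pagination, assuming messages carry a chat_id and timestamp and that the cursor is simply the oldest timestamp seen so far:

```python
# Fetch one page of messages, newest first, and return a cursor for the next page.
def fetch_page(messages, chat_id, page_size=50, before_ts=None):
    """messages: iterable of dicts with 'chat_id', 'ts', 'body'."""
    rows = [m for m in messages if m["chat_id"] == chat_id
            and (before_ts is None or m["ts"] < before_ts)]
    rows.sort(key=lambda m: m["ts"], reverse=True)   # newest first
    page = rows[:page_size]
    next_cursor = page[-1]["ts"] if len(page) == page_size else None
    return page, next_cursor
```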
Handling online-offline statuses:
- When a user goes offline, their connection to the server ends; the server uses this to push the status change to other users.
- When a user comes online, they create a connection, and the server pushes this update to everyone in their friend list.
Data Partitioning:
- Shard messages by consistent hashing on the userID or chatID, NOT the messageID; otherwise the messages for a single conversation / user would have to be fetched from multiple shards.
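A minimal consistent-hash ring over chat IDs; the virtual-node count and MD5 hashing are assumptions:

```python
# Consistent hashing: each shard gets several points on a ring; a chat ID maps to
# the first shard point clockwise from its own hash, so all messages of one
# conversation land on a single shard.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        self._ring = []          # sorted list of (hash_value, shard)
        for shard in shards:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{v}"), shard))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, chat_id):
        idx = bisect.bisect(self._keys, self._hash(str(chat_id))) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("chat:42"))   # the same chat always maps to the same shard
```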
Caching:
In each shard, cache most recent messages for most recent conversations for most active users.
Important numbers:
Assume 1 chat server can handle 50k concurrent connections
Group chat:
- Use a GroupChatID and a dedicated server for it.
Push notifications:
- If users give permission, the chat server can pass messages to a notification service when the recipient is offline.
Design Uber
Functional:
- Customers use app, see map, see drivers
- Driver locations are updated
- Customers can request a ride
- Can see driver location until ride is finished
- Drivers say if they are available
- Get notified if nearby customer requests ride
- Can see customer location until ride is finished
- Drivers can mark journey complete
Non-Functional:
- Driver location updates every 3 seconds from their app
- Ride request sends to drivers in real-time
- ???
Strategy:
- Use a Quadtree to create location grids and track driver locations by grid (see the Quadtree sketch after this list)
- Use a cache to receive driver updates every 3 seconds; an async process updates the Quadtree every 15 seconds
- The first time a customer opens the app, the request goes to the Quadtree, which returns the drivers within a radius.
- Customer then gets ‘subscribed’ to the 3-second location updates for those drivers.
- Use a pull/push model to keep customer subscribed to the right drivers as drivers appear in radius or move away.
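A compact Quadtree sketch for bucketing driver locations and answering a bounding-box query (the node capacity, max depth, and rectangle-based "radius" are assumptions):

```python
# Each node covers a lat/lng box and splits into four children once it holds more
# than `capacity` drivers; a range query only descends into boxes that intersect
# the customer's search rectangle.
class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.box = (x0, y0, x1, y1)
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points = []          # (x, y, driver_id)
        self.children = None

    def insert(self, x, y, driver_id):
        x0, y0, x1, y1 = self.box
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((x, y, driver_id))
                return True
            self._split()
        return any(c.insert(x, y, driver_id) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = dict(capacity=self.capacity, depth=self.depth + 1, max_depth=self.max_depth)
        self.children = [QuadTree(x0, y0, mx, my, **args), QuadTree(mx, y0, x1, my, **args),
                         QuadTree(x0, my, mx, y1, **args), QuadTree(mx, my, x1, y1, **args)]
        for p in self.points:
            for c in self.children:
                if c.insert(*p):
                    break
        self.points = []

    def query(self, qx0, qy0, qx1, qy1, found=None):
        """Return driver IDs whose points fall inside the query rectangle."""
        found = [] if found is None else found
        x0, y0, x1, y1 = self.box
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return found               # no overlap with this node's box
        for (x, y, d) in self.points:
            if qx0 <= x <= qx1 and qy0 <= y <= qy1:
                found.append(d)
        if self.children:
            for c in self.children:
                c.query(qx0, qy0, qx1, qy1, found)
        return found
```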
API:
Components:
DriverLocationCache:
- Small record per driver: driver ID, lat, lng. Even 1M drivers can easily fit in one cache.
- For availability, scalability, fault tolerance, use a distributed cache sharded by driver id.
- Periodic backup to storage in case it crashes
DriverLocationServer:
- Wrapper around the cache; receives the location updates and also sends updates to the Notification svc for subscribed customers
- (does it track customer subs or is that in a separate cache?)
Notification Svc:
- Push model via webhooks to clients with open app
- When a push attempt detects a dead connection, remove it
- Alternatively, if the customer doesn’t need location updates every 3 seconds, use a pull model: the client polls the cache every ~5 seconds
How does RequestRide work?
- Customer opens app, requests locations, gets subscribed, receives location updates.
- Customer requests a ride; the backend looks at nearby drivers, selects the top X by rating, and sends them the request. The first to accept gets the ride, and a ride session is created with the customer.
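A hedged sketch of that matching step; ask_driver stands in for the real notification round-trip, and offering sequentially is a simplification:

```python
# Offer the ride to the top X nearby drivers by rating; the first driver whose
# accept callback returns True gets a ride session with the customer.
def request_ride(customer_id, nearby_drivers, ask_driver, top_x=3):
    """nearby_drivers: list of dicts with 'id' and 'rating'.
    ask_driver(driver_id) -> True if that driver accepts the request."""
    candidates = sorted(nearby_drivers, key=lambda d: d["rating"], reverse=True)[:top_x]
    for driver in candidates:
        if ask_driver(driver["id"]):
            return {"customer": customer_id, "driver": driver["id"], "state": "matched"}
    return None   # nobody accepted; the caller can widen the radius or retry
```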
Scaling:
- How is Quadtree stored?
Bonus:
- How do we handle slow/disconnecting clients?
- What happens if client is disconnected during the ride?
Typeahead Suggestions
Functional:
Non-Functional:
Strategy:
API:
Components:
Scaling:
Bonus:
Web Crawler - More Detailed Design
From Donne Martin's System Design Primer:
We have a crawler that keeps working with two lists:
links_to_crawl
A list of links to crawl. We assume this was seeded for us or built up over time.
- This list also has the pages ranked by popularity / priority.
- It also has a ‘signature’ for each page, based on its URL and contents. We will use these signatures to detect duplicate content, with a similarity measure like the Jaccard index.
links_crawled
We also track “links_crawled”. It includes the link and the page ‘signature’, plus a “last_crawled” timestamp.
The crawler iteratively takes items from links_to_crawl. If a similar signature is already in links_crawled (is there a way to do this in O(1)?) and has a recent timestamp, it skips the page and moves it down in priority (because it was already crawled recently).
Otherwise, it visits the link and:
- queues the page contents to two services.
- finds all child urls within the page and adds them to links_to_crawl.
(The child URLs create a risk of duplicate pages / infinite loops. The signature check, sketched below, will catch some of these. We should also have async processes that continually map-reduce links_to_crawl to remove duplicates.)
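A sketch of the signature idea using word shingles and the Jaccard index (the shingle size and 0.9 threshold are assumptions; the O(1) lookup question could be tackled with MinHash/LSH, which is not shown):

```python
# Build a signature as the set of word 3-grams ("shingles") in a page; two pages
# are near-duplicates when the Jaccard index of their shingle sets is high.
def shingles(text, size=3):
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def jaccard(sig_a, sig_b):
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)

def is_duplicate(new_page_text, known_signatures, threshold=0.9):
    sig = shingles(new_page_text)
    return any(jaccard(sig, known) >= threshold for known in known_signatures)
```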
The Reverse Index Service generates a word -> page id index (figures out which words / phrases matter).
The Document Service generates a title / snippet for each page id.
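A toy version of both services, mapping word -> page IDs and page ID -> (title, snippet); the tokenization and snippet length are assumptions:

```python
# Reverse Index Service: word -> set of page IDs containing it.
# Document Service: page ID -> (title, snippet).
from collections import defaultdict

def build_reverse_index(pages):
    """pages: dict of page_id -> text."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in set(text.lower().split()):
            index[word].add(page_id)
    return index

def build_snippets(pages, snippet_len=80):
    return {page_id: (text.split("\n", 1)[0][:60], text[:snippet_len])
            for page_id, text in pages.items()}

pages = {1: "Quadtrees index spatial data", 2: "Consistent hashing shards data"}
index = build_reverse_index(pages)
print(index["data"])   # page IDs containing the word "data"
```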
Frontend:
Client types in a series of words as a search term.
The request goes to a Query Service.
It splits the search into words, lowercases them, and fixes spelling mistakes.
It passes these words to a Reverse-Index Service.
The Reverse-Index Service gets related pages and limits/sorts them by relevance and popularity.
The Document Service returns the title and snippet for each result.
Components:
The two lists are quite simple, and huge, with high read/write load, so we store them in NoSQL.
links_to_crawl could also be cached in Redis as a sorted set (see the sketch after this list).
There are two queues to the two services.
Clients can use a public REST API
Internal services could use RPC
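If links_to_crawl lives in Redis as a sorted set, priority becomes the score; a sketch using redis-py, where the key name and score convention are assumptions and a local Redis server is assumed to be running:

```python
# links_to_crawl as a Redis sorted set: score = priority/popularity;
# ZPOPMAX atomically hands the crawler the highest-priority link and removes it.
import redis

r = redis.Redis()  # assumes a local Redis instance; the key name below is illustrative

def add_link(url, priority):
    r.zadd("links_to_crawl", {url: priority})

def next_link():
    popped = r.zpopmax("links_to_crawl")          # [(member_bytes, score)] or []
    return popped[0][0].decode() if popped else None

add_link("https://example.com", 10.0)
print(next_link())
```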