Common Designs Flashcards
Web Crawler - Initial Design (high level)
Backend Crawler:
From Grokking the System Design Interview:
- Pick a URL from unvisited list.
- Resolve its IP address (DNS lookup)
- Connect to the IP and download the document
- Parse the document contents to look for new URLs
- Add the new URLs to the unvisited list
- Process the downloaded document (index it)
- Go back to step 1
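A minimal single-threaded sketch of that loop (the fetch/extract helpers below are illustrative, not a real crawler API):

```python
# Minimal sketch of the crawl loop above.
import re
import urllib.request

def fetch(url):
    """Resolve the host and download the document (steps 2-3)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(html):
    """Very naive link extraction (step 4); a real crawler would use an HTML parser."""
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(seed_urls, index_fn, max_pages=100):
    unvisited = list(seed_urls)
    visited = set()
    while unvisited and len(visited) < max_pages:
        url = unvisited.pop(0)           # step 1: pick a URL
        if url in visited:
            continue
        try:
            doc = fetch(url)             # steps 2-3: resolve and download
        except OSError:
            continue                     # skip unreachable pages
        visited.add(url)
        for link in extract_links(doc):  # step 4: find new URLs
            if link not in visited:
                unvisited.append(link)   # step 5: add to unvisited list
        index_fn(url, doc)               # step 6: process/index the document
```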
Web Crawler - Code / Abstractions
- Crawler class, and how it interacts with the list of URLs in the DB and with page data
Crawler
PagesDataStore
Page
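One possible shape for these abstractions; the method names below are illustrative, not from the source:

```python
# Illustrative interfaces only; the real design would back these with a DB and queues.
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    contents: str = ""
    child_urls: list = field(default_factory=list)

class PagesDataStore:
    """Wraps the DB holding the unvisited URL list and crawled page data."""
    def __init__(self):
        self._unvisited = []
        self._crawled = {}

    def add_link_to_crawl(self, url):
        if url not in self._crawled:
            self._unvisited.append(url)

    def next_link(self):
        return self._unvisited.pop(0) if self._unvisited else None

    def save_page(self, page):
        self._crawled[page.url] = page

class Crawler:
    """Drives the loop: pull a URL, download, extract links, store results."""
    def __init__(self, store, downloader, parser):
        self.store, self.downloader, self.parser = store, downloader, parser

    def crawl_one(self):
        url = self.store.next_link()
        if url is None:
            return None
        page = Page(url, contents=self.downloader(url))
        page.child_urls = self.parser(page.contents)
        for child in page.child_urls:
            self.store.add_link_to_crawl(child)
        self.store.save_page(page)
        return page
```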
Dropbox (File Hosting)
Functional:
Clients specify a workspace/directory; all files in it get synced.
Clients can view the history of updates to a file.
Non-Functional:
- Huge read/write volumes with a roughly 50/50 read/write ratio
Strategy:
Divide files into chunks (see the chunking sketch after this list).
- Only failed chunks are retried, only updated chunks are transferred, don’t store duplicate chunks across file versions.
Track chunks with a metadata file that indexes each chunk of the file.
A block service uploads/downloads file chunks to/from cloud storage.
A metadata service keeps file metadata in SQL/NoSQL.
A synchronization service notifies all clients when file changes need to be downloaded.
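A rough sketch of the chunk-and-hash idea; the 4 MB chunk size and SHA-256 hashing are assumptions, not decisions from the design:

```python
# Split a file into fixed-size chunks and hash each one; only chunks whose
# hash is not already known need to be uploaded, and only failed chunks retried.
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed 4 MB; tuned per the component-design notes

def chunk_file(path):
    """Yield (index, sha256_hex, bytes) for each chunk of the file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield index, hashlib.sha256(data).hexdigest(), data
            index += 1

def changed_chunks(path, known_hashes):
    """Return only the chunks whose hashes are not in the metadata file."""
    return [(i, h, data) for i, h, data in chunk_file(path) if h not in known_hashes]
```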
Component Design:
File chunk size: determined by studying average file size, network bandwidth, and cloud block-storage optimization
Client:
- Has: Internal db, Chunker, Indexer, Watcher.
- Watcher tracks files in the workspace and detects changes.
- Chunker splits files into standard chunk sizes and calculates a hash for each chunk.
- Indexer compares hashes against the internal DB and uploads only the changed chunks.
- Overall flow: detect file changes in the workspace, update the internal DB, upload to the block service, notify the sync service.
- Uses WebSocket or long polling with the Sync service to learn when other clients push updates, then fetches the new chunks from the block service.
Metadata db:
- File names, chunks, users, devices, workspaces
- SQL or NoSQL; either way we need ACID-like guarantees so concurrent updates to the same file's metadata don't clobber each other
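One hedged way to get that isolation in application logic is optimistic concurrency with a per-file version number (illustrative only):

```python
# Compare-and-set style update: the write only succeeds if the version we read
# is still current, so two clients cannot silently overwrite each other's chunk lists.
class ConflictError(Exception):
    pass

class MetadataStore:
    def __init__(self):
        self._rows = {}  # file_id -> {"version": int, "chunks": [hashes]}

    def read(self, file_id):
        row = self._rows.setdefault(file_id, {"version": 0, "chunks": []})
        return row["version"], list(row["chunks"])

    def write(self, file_id, expected_version, chunks):
        row = self._rows.setdefault(file_id, {"version": 0, "chunks": []})
        if row["version"] != expected_version:
            raise ConflictError("metadata changed underneath us; re-read and retry")
        row["version"] += 1
        row["chunks"] = list(chunks)
```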
Sync Service:
- Hashes file chunks to detect differences.
Queueing:
- Because clients are smart and store data locally, we can do most work asynchronously, using highly available, reliable, scalable queues between clients and backend services.
- There is a global request queue for all clients, and individual response queues.
Caching:
- Caches for block and metadata storage.
- We can store the hottest chunks in an LRU cache.
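A tiny LRU sketch using OrderedDict; the capacity and eviction policy are assumptions:

```python
# Keep only the N hottest chunks; reading a chunk moves it to the "most recent" end,
# and inserting past capacity evicts the least recently used chunk.
from collections import OrderedDict

class ChunkCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, chunk_hash):
        if chunk_hash not in self._data:
            return None
        self._data.move_to_end(chunk_hash)
        return self._data[chunk_hash]

    def put(self, chunk_hash, chunk_bytes):
        self._data[chunk_hash] = chunk_bytes
        self._data.move_to_end(chunk_hash)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```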
Metadata DB scaling:
- Dropbox scaled MySQL into something called EdgeStore
- If we use NoSQL instead of an RDBMS, consistency is still important, so the application logic must enforce serializable isolation for metadata updates
- Use consistent hashing on file ID to shard the chunk data.
Facebook Messenger (instant messaging)
Requirements:
- 1:1 conversations
- Online/offline status
- Chat history
- Group chats
- Push notifications
- Real-time latency
- High consistency across devices (C > A)
High-level design:
Clients send chats to chat server
Chat server notifies other users and stores chat in DB
Clients / chat server acknowledge receipt / delivery
Component Design:
Chat server:
- WebSockets, so the chat server can immediately push to clients.
- (1:1 chats) The chat server keeps a key-value map from userID to that user's active WebSocket connection.
- If user2 went offline, the key-value entry will be missing; the chat server should retry for a few seconds in case they reconnect, otherwise notify user1 of the delivery failure.
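A sketch of the userID-to-connection map and the offline retry; the connection object and its send() method are simplified stand-ins, not a real WebSocket API:

```python
# Map userID -> active connection; if the recipient is offline, retry briefly,
# then report the failure back to the sender.
import time

class ChatServer:
    def __init__(self, retry_seconds=5):
        self.connections = {}       # userID -> connection object with .send(msg)
        self.retry_seconds = retry_seconds

    def register(self, user_id, conn):
        self.connections[user_id] = conn

    def unregister(self, user_id):
        self.connections.pop(user_id, None)

    def deliver(self, sender_id, recipient_id, message):
        deadline = time.time() + self.retry_seconds
        while time.time() < deadline:
            conn = self.connections.get(recipient_id)
            if conn is not None:
                conn.send(message)
                return True
            time.sleep(0.5)          # recipient offline; wait for a reconnect
        sender = self.connections.get(sender_id)
        if sender is not None:
            sender.send(f"delivery to {recipient_id} failed")
        return False
```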
Many chat servers:
- Sessions: Load balancer maps userIDs to their own server.
- This means when a message is received, the chat server must look up which other chat server holds the recipient's connection and forward it there (how? one option is sketched after this list)
- Fault tolerance: if a chat server goes down, connections are lost for those users. We can just have clients auto-retry to create new connections.
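One possible answer to the "how?" above: a shared userID-to-server registry that every chat server updates and consults before forwarding. The registry itself would live in something like Redis or ZooKeeper; all names here are illustrative:

```python
# Shared session registry: every chat server records which users it holds,
# and consults the registry to forward messages for users connected elsewhere.
class SessionRegistry:
    """Stand-in for a shared store mapping userID -> chat server."""
    def __init__(self):
        self._owner = {}

    def set_owner(self, user_id, server_id):
        self._owner[user_id] = server_id

    def get_owner(self, user_id):
        return self._owner.get(user_id)

def route_message(registry, local_server_id, recipient_id, deliver_local, forward):
    owner = registry.get_owner(recipient_id)
    if owner is None:
        return "recipient offline"
    if owner == local_server_id:
        deliver_local(recipient_id)      # recipient is connected to this server
    else:
        forward(owner, recipient_id)     # hand off to the owning chat server
    return f"routed via {owner}"
```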
DB:
- Chat service must write asynchronously.
- Huge write load of small updates.
- Writing 1 row per message is challenging even for NoSQL.
- Consider HBase / Cassandra, which use LSM trees and buffer the newest writes in memory to support high write loads. They also support fast range-based (e.g. time) queries.
Clients:
- Paginate requests, from newer messages to older.
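A sketch of newest-to-oldest pagination, assuming messages carry a chat_id and timestamp and that the cursor is simply the oldest timestamp seen so far:

```python
# Fetch one page of messages, newest first, and return a cursor for the next page.
def fetch_page(messages, chat_id, page_size=50, before_ts=None):
    """messages: iterable of dicts with 'chat_id', 'ts', 'body'."""
    rows = [m for m in messages if m["chat_id"] == chat_id
            and (before_ts is None or m["ts"] < before_ts)]
    rows.sort(key=lambda m: m["ts"], reverse=True)   # newest first
    page = rows[:page_size]
    next_cursor = page[-1]["ts"] if len(page) == page_size else None
    return page, next_cursor
```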
Handling online-offline statuses:
- When a user goes offline, their connection to the server ends; the server uses this to push the status change to other users.
- When a user comes online, they create a connection, and the server pushes this update to everyone in their friend list.
Data Partitioning:
- Shard messages by consistent hashing on the userID or chatID, NOT the messageID; otherwise the messages for a single conversation / user would have to be fetched from multiple shards.
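A minimal consistent-hash ring over chat IDs; the virtual-node count and MD5 hashing are assumptions:

```python
# Consistent hashing: each shard gets several points on a ring; a chat ID maps to
# the first shard point clockwise from its own hash, so all messages of one
# conversation land on a single shard.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        self._ring = []          # sorted list of (hash_value, shard)
        for shard in shards:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{v}"), shard))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, chat_id):
        idx = bisect.bisect(self._keys, self._hash(str(chat_id))) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.shard_for("chat:42"))   # the same chat always maps to the same shard
```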
Caching:
In each shard, cache most recent messages for most recent conversations for most active users.
Important numbers:
Assume 1 chat server can handle 50k concurrent connections
Group chat:
- Use a GroupChatID and a dedicated server for it.
Push notifications:
- If users give permission, the chat server can pass messages to a notification service when the recipient is offline.
Design Uber
Functional:
- Customers use app, see map, see drivers
- Driver locations are updated
- Customers can request a ride
- Can see driver location until ride is finished
- Drivers say if they are available
- Get notified if nearby customer requests ride
- Can see customer location until ride is finished
- Drivers can mark journey complete
Non-Functional:
- Driver location updates every 3 seconds from their app
- Ride request sends to drivers in real-time
- ???
Strategy:
- Use a Quadtree to create location grids and track driver locations by grid (see the Quadtree sketch after this list)
- Use a cache to receive driver updates every 3 seconds; an async process updates the Quadtree every 15 seconds
- The first time a customer opens the app, the request goes to the Quadtree, which returns the drivers within a radius.
- Customer then gets ‘subscribed’ to the 3-second location updates for those drivers.
- Use a pull/push model to keep customer subscribed to the right drivers as drivers appear in radius or move away.
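A compact Quadtree sketch for bucketing driver locations and answering a bounding-box query (the node capacity, max depth, and rectangle-based "radius" are assumptions):

```python
# Each node covers a lat/lng box and splits into four children once it holds more
# than `capacity` drivers; a range query only descends into boxes that intersect
# the customer's search rectangle.
class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=16):
        self.box = (x0, y0, x1, y1)
        self.capacity, self.depth, self.max_depth = capacity, depth, max_depth
        self.points = []          # (x, y, driver_id)
        self.children = None

    def insert(self, x, y, driver_id):
        x0, y0, x1, y1 = self.box
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((x, y, driver_id))
                return True
            self._split()
        return any(c.insert(x, y, driver_id) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        args = dict(capacity=self.capacity, depth=self.depth + 1, max_depth=self.max_depth)
        self.children = [QuadTree(x0, y0, mx, my, **args), QuadTree(mx, y0, x1, my, **args),
                         QuadTree(x0, my, mx, y1, **args), QuadTree(mx, my, x1, y1, **args)]
        for p in self.points:
            for c in self.children:
                if c.insert(*p):
                    break
        self.points = []

    def query(self, qx0, qy0, qx1, qy1, found=None):
        """Return driver IDs whose points fall inside the query rectangle."""
        found = [] if found is None else found
        x0, y0, x1, y1 = self.box
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return found               # no overlap with this node's box
        for (x, y, d) in self.points:
            if qx0 <= x <= qx1 and qy0 <= y <= qy1:
                found.append(d)
        if self.children:
            for c in self.children:
                c.query(qx0, qy0, qx1, qy1, found)
        return found
```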
API:
Components:
DriverLocationCache:
- Small record per driver: driver ID, lat, lng. Even 1M drivers can easily fit in one cache.
- For availability, scalability, fault tolerance, use a distributed cache sharded by driver id.
- Periodic backup to storage in case it crashes
DriverLocationServer:
- Wrapper around the cache; receives the location updates and also sends updates to the Notification svc for subscribed customers
- (does it track customer subs or is that in a separate cache?)
Notification Svc:
- Push model via webhooks to clients with open app
- When a push attempt detects a dead connection, remove it
- Alternatively, if the customer doesn’t need location updates every 3 seconds, use a pull model: the client polls the cache every ~5 seconds
How does RequestRide work?
- Customer opens app, requests locations, gets subscribed, receives location updates.
- Customer requests a ride; the backend looks at nearby drivers, selects the top X by rating, and sends them the request. The first to accept gets the ride, and a ride session is created with the customer.
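A hedged sketch of that matching step; ask_driver stands in for the real notification round-trip, and offering sequentially is a simplification:

```python
# Offer the ride to the top X nearby drivers by rating; the first driver whose
# accept callback returns True gets a ride session with the customer.
def request_ride(customer_id, nearby_drivers, ask_driver, top_x=3):
    """nearby_drivers: list of dicts with 'id' and 'rating'.
    ask_driver(driver_id) -> True if that driver accepts the request."""
    candidates = sorted(nearby_drivers, key=lambda d: d["rating"], reverse=True)[:top_x]
    for driver in candidates:
        if ask_driver(driver["id"]):
            return {"customer": customer_id, "driver": driver["id"], "state": "matched"}
    return None   # nobody accepted; the caller can widen the radius or retry
```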
Scaling:
- How is Quadtree stored?
Bonus:
- How do we handle slow/disconnecting clients?
- What happens if client is disconnected during the ride?
Typeahead Suggestions
Functional:
Non-Functional:
Strategy:
API:
Components:
Scaling:
Bonus:
Web Crawler - More Detailed Design
From Donne Martin's System Design Primer:
We have a crawler that keeps working with two lists:
links_to_crawl
A list of links to crawl. We assume this was seeded for us or built up over time.
- This list also has the pages ranked by popularity / priority.
- It also has a ‘signature’ for each page, based on its URL and contents. We will use these signatures to detect duplicate content, with a similarity measure like the Jaccard index.
links_crawled
We also track “links_crawled”. It includes the link and the page ‘signature’, plus a “last_crawled” timestamp.
The crawler iteratively takes items from links_to_crawl. If a similar signature is already in links_crawled (is there a way to do this in O(1)?) and has a recent timestamp, it skips the page and moves it down in priority (because it was already crawled recently).
Otherwise, it visits the link and:
- queues the page contents to two services.
- finds all child urls within the page and adds them to links_to_crawl.
(The child URLs create a risk of duplicate pages / infinite loops. The signature check, sketched below, will catch some of these. We should also have async processes that continually map-reduce links_to_crawl to remove duplicates.)
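A sketch of the signature idea using word shingles and the Jaccard index (the shingle size and 0.9 threshold are assumptions; the O(1) lookup question could be tackled with MinHash/LSH, which is not shown):

```python
# Build a signature as the set of word 3-grams ("shingles") in a page; two pages
# are near-duplicates when the Jaccard index of their shingle sets is high.
def shingles(text, size=3):
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def jaccard(sig_a, sig_b):
    if not sig_a and not sig_b:
        return 1.0
    return len(sig_a & sig_b) / len(sig_a | sig_b)

def is_duplicate(new_page_text, known_signatures, threshold=0.9):
    sig = shingles(new_page_text)
    return any(jaccard(sig, known) >= threshold for known in known_signatures)
```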
The Reverse Index Service generates a word -> page id index (figures out which words / phrases matter).
The Document Service generates a title / snippet for each page id.
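A toy version of both services, mapping word -> page IDs and page ID -> (title, snippet); the tokenization and snippet length are assumptions:

```python
# Reverse Index Service: word -> set of page IDs containing it.
# Document Service: page ID -> (title, snippet).
from collections import defaultdict

def build_reverse_index(pages):
    """pages: dict of page_id -> text."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in set(text.lower().split()):
            index[word].add(page_id)
    return index

def build_snippets(pages, snippet_len=80):
    return {page_id: (text.split("\n", 1)[0][:60], text[:snippet_len])
            for page_id, text in pages.items()}

pages = {1: "Quadtrees index spatial data", 2: "Consistent hashing shards data"}
index = build_reverse_index(pages)
print(index["data"])   # page IDs containing the word "data"
```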
Frontend:
Client types in a series of words as a search term.
The request goes to a Query Service.
It splits the search into words, lowercases them, and fixes spelling mistakes.
It passes these words to a Reverse-Index Service.
The Reverse-Index Service gets related pages and limits/sorts them by relevance and popularity.
The Document Service returns the title and snippet for each result.
Components:
The two lists are quite simple, and huge, with high read/write load, so we store them in NoSQL.
links_to_crawl could also be cached in Redis as a sorted set (see the sketch after this list).
There are two queues to the two services.
Clients can use a public REST API
Internal services could use RPC
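If links_to_crawl lives in Redis as a sorted set, priority becomes the score; a sketch using redis-py, where the key name and score convention are assumptions and a local Redis server is assumed to be running:

```python
# links_to_crawl as a Redis sorted set: score = priority/popularity;
# ZPOPMAX atomically hands the crawler the highest-priority link and removes it.
import redis

r = redis.Redis()  # assumes a local Redis instance; the key name below is illustrative

def add_link(url, priority):
    r.zadd("links_to_crawl", {url: priority})

def next_link():
    popped = r.zpopmax("links_to_crawl")          # [(member_bytes, score)] or []
    return popped[0][0].decode() if popped else None

add_link("https://example.com", 10.0)
print(next_link())
```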