High-Level Overviews, Storage, and Tables Flashcards
Code-Deployment System high level overview
Our system divides cleanly into 2 subsystems:
- Build System that builds code into binaries
- Deployment System that deploys binaries to our machines across the world
Code-Deployment System storage
We’ll use blob storage (Google Cloud Storage or S3) to store our code binaries. Blob storage makes sense here, because binaries are literally blobs of data.
Code-Deployment System table
- jobs table
- id: pk, auto-inc integer
- created_at: timestamp
- commit_sha: string
- name: string, the pointer to the job’s eventual binary in blob storage
- status: string, QUEUED, RUNNING, SUCCEEDED, FAILED
- table for replication status of blobs
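A minimal sketch of these two tables as SQLite DDL in Python. The jobs columns follow the card; the shape of the replication-status table is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at  TIMESTAMP NOT NULL,
    commit_sha  TEXT NOT NULL,
    name        TEXT NOT NULL,  -- pointer to the job's eventual binary in blob storage
    status      TEXT NOT NULL CHECK (status IN ('QUEUED','RUNNING','SUCCEEDED','FAILED'))
);

-- Assumed shape: one row per (binary, region) pair tracking replication.
CREATE TABLE blob_replication_status (
    blob_name   TEXT NOT NULL,   -- same name as the jobs.name pointer
    region      TEXT NOT NULL,
    replicated  BOOLEAN NOT NULL DEFAULT 0,
    PRIMARY KEY (blob_name, region)
);
""")
```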
AlgoExpert high level overview
We can divide our system into 3 core components:
- Static UI content
- Accessing and interacting with questions (question completion status, saving solutions, etc.)
- Ability to run code
AlgoExpert storage
For the UI static content, we can put public assets like images and JS bundles in a blob store: S3 or GCS. Since we’re catering to a global audience and we care about having a responsive website, we want to use a CDN to serve that content. This is esp important for mobile users, who are often on slower connections.
Static API content, like the list of questions and all solutions, also goes in a blob store for simplicity.
AlgoExpert table
Since this data will have to be queried a lot, a SQL db like Postgres or MySQL seems like a good choice.
Table 1. question_completion_status
- id: pk, auto-inc integer
- user_id
- question_id
- completion_status (enum)
Table 2. user_solutions
- id: pk, auto-inc integer
- user_id
- question_id
- language
- solution
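A rough sketch of the two row shapes as Python dataclasses; the enum values and field types are assumptions, not from the card.

```python
from dataclasses import dataclass
from enum import Enum

class CompletionStatus(Enum):  # assumed values
    NOT_STARTED = "NOT_STARTED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"

@dataclass
class QuestionCompletionStatus:
    id: int            # pk, auto-incremented by the db
    user_id: str
    question_id: str
    completion_status: CompletionStatus

@dataclass
class UserSolution:
    id: int            # pk, auto-incremented by the db
    user_id: str
    question_id: str
    language: str      # e.g. "python", "javascript"
    solution: str      # raw solution text
```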
Stockbroker high level overview
- the PlaceTrade API call that clients will make
- the API server(s) handling client API calls
- the system in charge of executing orders for each customer
Stockbroker table
- for trades
- id
- customer_id
- stockTicker
- type: string, either BUY or SELL
- quantity: integer (no fractional shares)
- status: string, the status of the trade; starts as PLACED
- reason: string, the human readable justification of the trade’s status
- created_at: timestamp, the time when the trade was created
- for balances
- id, customer_id, amount, last_modified
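A sketch of both tables as SQLite DDL in Python; the column types (and storing amounts as integer cents) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trades (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  TEXT NOT NULL,
    stock_ticker TEXT NOT NULL,
    type         TEXT NOT NULL CHECK (type IN ('BUY','SELL')),
    quantity     INTEGER NOT NULL,          -- no fractional shares
    status       TEXT NOT NULL DEFAULT 'PLACED',
    reason       TEXT,                      -- human-readable justification of the status
    created_at   TIMESTAMP NOT NULL
);

CREATE TABLE balances (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id   TEXT NOT NULL UNIQUE,
    amount        INTEGER NOT NULL,         -- assumed: cents, to avoid float issues
    last_modified TIMESTAMP NOT NULL
);
""")
```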
Amazon high level overview
There’s a USER side and a WAREHOUSE side.
Within a region, user and warehouse requests will get round-robin-load-balanced to respective sets of API servers, and data will be written to and read from a SQL database for that region.
We’ll go with a SQL db because all of the data is, by nature, structured and lends itself well to a relational model.
Amazon table
6 SQL tables
1. items (name, description, price, etc)
2. carts
3. orders
4. aggregated stock (all of the item stocks on Amazon that are relevant to users)
5. warehouse orders
6. warehouse stock (must have physicalStock and availableStock)
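A hedged sketch of how physicalStock and availableStock might relate; the function names and the exact semantics here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class WarehouseStock:
    warehouse_id: str
    item_id: str
    physical_stock: int   # units physically on shelves in the warehouse
    available_stock: int  # units not yet assigned to a warehouse order

def assign_to_order(stock: WarehouseStock, quantity: int) -> None:
    # Reserving stock for a warehouse order reduces availability immediately...
    if quantity > stock.available_stock:
        raise ValueError("not enough available stock")
    stock.available_stock -= quantity

def ship_order(stock: WarehouseStock, quantity: int) -> None:
    # ...but the physical count only drops once the items actually leave.
    stock.physical_stock -= quantity
```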
FB news feed high level overview
- 2 API calls, CreatePost and GetNewsFeed
- feed creation and storage strategy, then tie everything together
FB news feed storage
We can have one main relational database to store most of our system’s data, including posts and users. This database will have very large tables.
Google Drive high level overview
- we’ll need to support the following operations:
- files: upload, download, delete, rename, move
- folders: create, get, rename, delete, move
- design a storage solution for
- entity (files and folders) metadata
- file content
Google Drive storage
To store entity info, we use K-V stores. Since we need high availability and data replication, we’ll use something like Etcd, ZooKeeper, or Google Cloud Spanner (as a K-V store), which gives us both of those guarantees as well as strong consistency (as opposed to DynamoDB, for instance, which would give us only eventual consistency).
To store file chunks, GCS.
To store blob reference counts, SQL table.
Google Drive table
For files and folders.
Both have: id, is_folder (t/f), name, owner_id, parent_id.
Difference: files have blobs (an array of blob hashes), while folders have children (an array of child IDs).
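A rough sketch of these entity-metadata values as Python dataclasses; the field names follow the card, the types are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    id: str
    is_folder: bool
    name: str
    owner_id: str
    parent_id: str            # id of the containing folder

@dataclass
class FileEntity(Entity):
    blobs: list = field(default_factory=list)     # hashes of the file's content chunks

@dataclass
class FolderEntity(Entity):
    children: list = field(default_factory=list)  # ids of child entities
```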
Netflix high level overview
- Storage (Video Content, Static Content, and User Metadata)
- General Client-Server Interaction (i.e., the life of a query)
- Video Content Delivery
- User-Activity Data Processing
Netflix storage
- Since we’re only dealing with a few hundred terabytes of video content, we can use a simple blob storage solution like S3 or GCS.
- Static content (titles, cast lists, descriptions) in a relational db or even in a document store, and we can cache most of it in our API servers.
- User metadata in a classic relational db like Postgres.
- User activity logs in HDFS
Tinder high level overview
- Overview
- Profile Creation
- Deck Generation
- Swiping
- Maybe super-liking and undoing
Tinder storage
Most of the data that we expect to store (profiles, decks, swipes, matches) is structured, so we’ll use SQL. All of this data will be stored in regional dbs, located based on user hot spots. We’ll have asynchronous replication btwn the regional dbs.
The only exception is users’ profile pics, which go in a global blob store and will be served via CDN.
Tinder table
all SQL.
1. profiles (each row is a profile)
2. decks (each row is one user’s deck)
3. swipes
4. matches
Slack high level overview
2 main sections:
- Handling what happens when a Slack app loads.
- Handling real-time messaging as well as cross-device synchronization.
Slack storage
Since our tables will be very large, esp the messages table, we must shard. We can use a “smart” sharding service: a strongly consistent key-value store like Etcd or ZooKeeper that maps orgIds to shards.
Slack table
SQL db since we’re dealing w/ structured data that’s queried frequently. tables for:
1. channels
2. channel members (each row is a channel-member pair)
3. Historical Messages
4. Latest Channel Timestamps
5. Channel Read Receipts (lastSeen is updated whenever a user opens a channel)
6. number of unread user mentions
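A toy sketch of how tables 4 and 5 might combine to decide which channels show as unread when the app loads; the in-memory dicts stand in for the real tables and the helper names are assumptions.

```python
from datetime import datetime

latest_channel_timestamps: dict = {}   # table 4: channelId -> time of the latest message
read_receipts: dict = {}               # table 5: (userId, channelId) -> lastSeen

def open_channel(user_id: str, channel_id: str) -> None:
    # lastSeen is updated whenever a user opens a channel.
    read_receipts[(user_id, channel_id)] = datetime.now()

def unread_channels(user_id: str, channel_ids: list) -> list:
    # A channel is unread if its latest message is newer than the user's lastSeen.
    unread = []
    for channel_id in channel_ids:
        last_seen = read_receipts.get((user_id, channel_id), datetime.min)
        latest = latest_channel_timestamps.get(channel_id, datetime.min)
        if latest > last_seen:
            unread.append(channel_id)
    return unread
```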
AirBNB high level overview
HOST side and RENTER side. We further divide the renter side:
- Browsing listings.
- Getting a single listing.
- Reserving a listing.
AirBNB storage
listings and reservations in SQL tables (src of truth)
since we care about the latency of browsing listings on Airbnb, and since this browsing requires querying listings based on their location, we can store our listings in a region quadtree.
Since we’ll be storing our quadtree in memory, we must ensure that a single machine failure doesn’t bring down the entire browsing functionality. So we can set up a cluster of machines, each holding an instance of our quadtree in memory, and these machines can use leader election.
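A toy in-memory region quadtree over (lat, lng) to make the browsing index concrete; the node capacity, the Listing shape, and the coordinate handling are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Listing:
    id: str
    lat: float
    lng: float

@dataclass
class Quadtree:
    # Bounding box of this node: [min_lat, max_lat) x [min_lng, max_lng).
    min_lat: float
    min_lng: float
    max_lat: float
    max_lng: float
    capacity: int = 4                              # listings per node before splitting
    listings: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def contains(self, lat, lng):
        return self.min_lat <= lat < self.max_lat and self.min_lng <= lng < self.max_lng

    def insert(self, listing):
        if not self.contains(listing.lat, listing.lng):
            return False
        if not self.children:
            if len(self.listings) < self.capacity:
                self.listings.append(listing)
                return True
            self._subdivide()
        return any(child.insert(listing) for child in self.children)

    def _subdivide(self):
        # Split the box into 4 quadrants; existing listings stay at this node.
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lng = (self.min_lng + self.max_lng) / 2
        self.children = [
            Quadtree(self.min_lat, self.min_lng, mid_lat, mid_lng, self.capacity),
            Quadtree(self.min_lat, mid_lng, mid_lat, self.max_lng, self.capacity),
            Quadtree(mid_lat, self.min_lng, self.max_lat, mid_lng, self.capacity),
            Quadtree(mid_lat, mid_lng, self.max_lat, self.max_lng, self.capacity),
        ]

    def query(self, min_lat, min_lng, max_lat, max_lng):
        # Skip nodes whose box doesn't overlap the query box.
        if (max_lat < self.min_lat or min_lat >= self.max_lat
                or max_lng < self.min_lng or min_lng >= self.max_lng):
            return []
        results = [l for l in self.listings
                   if min_lat <= l.lat <= max_lat and min_lng <= l.lng <= max_lng]
        for child in self.children:
            results.extend(child.query(min_lat, min_lng, max_lat, max_lng))
        return results

# Usage: index listings for a region, then fetch everything in a map viewport.
tree = Quadtree(min_lat=-90, min_lng=-180, max_lat=90, max_lng=180)
tree.insert(Listing(id="l1", lat=48.86, lng=2.35))
nearby = tree.query(48.0, 2.0, 49.0, 3.0)
```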
AirBNB table
SQL tables for
- listings
- reservations
AlgoExpert load balancing
we can have 2 primary clusters of backend servers in the 2 important regions: U.S. and India.
We can have some DNS load balancing to route API requests to the cluster closest to the user, and within a region, we can have path-based load balancing to separate our services (payments, authentication, code execution, etc.), esp since the code execution platform will probably need to run on different kinds of servers compared to those of the rest of the API.
AlgoExpert caching
We can implement 2 layers of caching for static API content: client-side (users only need to load questions once per session, which also reduces load on the backend servers) and server-side.
Stockbroker API call
PlaceTrade(customerId, stockTicker, type (BUY/SELL), quantity)
=> (tradeId, stockTicker, type, quantity, createdAt, status (always PLACED))
We can imagine that a GetTrade API call could return the statuses IN_PROGRESS, FILLED, or REJECTED, along with a reason.
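A hedged sketch of the PlaceTrade request/response shapes; the uuid-based tradeId, the validation, and the in-memory “table” are assumptions for illustration (the API’s “type” is renamed trade_type to avoid shadowing the builtin).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import uuid4

@dataclass
class Trade:
    trade_id: str
    stock_ticker: str
    trade_type: str        # "BUY" or "SELL"
    quantity: int
    created_at: datetime
    status: str            # always "PLACED" in the PlaceTrade response

trades_table: list = []    # stands in for the SQL trades table

def place_trade(customer_id: str, stock_ticker: str, trade_type: str, quantity: int) -> Trade:
    if trade_type not in ("BUY", "SELL"):
        raise ValueError("type must be BUY or SELL")
    trade = Trade(
        trade_id=str(uuid4()),
        stock_ticker=stock_ticker,
        trade_type=trade_type,
        quantity=quantity,
        created_at=datetime.now(timezone.utc),
        status="PLACED",
    )
    # The real system would persist this row keyed by customer_id and hand it
    # off to the execution system.
    trades_table.append(trade)
    return trade
```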
Stockbroker load balancing
We’ll need multiple API servers to handle all of the incoming requests. Since we don’t need any caching or server stickiness when placing trades, we can just use round-robin load balancing to distribute incoming requests among our API servers.
atomicity
Atomicity is a feature of database systems dictating that a transaction must be all-or-nothing.
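A toy demonstration of all-or-nothing semantics using SQLite; the account data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (customer_id TEXT PRIMARY KEY, amount INTEGER CHECK (amount >= 0))")
conn.executemany("INSERT INTO balances VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("UPDATE balances SET amount = amount + 200 WHERE customer_id = 'bob'")
        conn.execute("UPDATE balances SET amount = amount - 200 WHERE customer_id = 'alice'")  # violates CHECK
except sqlite3.IntegrityError:
    pass  # alice can't go negative, so bob's credit is rolled back too

print(conn.execute("SELECT * FROM balances ORDER BY customer_id").fetchall())
# [('alice', 100), ('bob', 0)] -- neither update was applied
```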
consistency
Only valid data will be written; a write that would violate the database’s own rules for valid data is rejected.
Amazon caching
users call the GetItemCatalog(search) endpoint when they’re searching for items. The request is routed by API servers to the smart search-results service, which interacts directly with the items table, caches popular item searches, and returns the results.
FB news feed API call
CreatePost(user_id, post)
GetNewsFeed(user_id, pageSize, nextPageToken) =>
(posts: [], nextPageToken)
supports pagination
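A rough sketch of the pagination contract; using the list index as the page token is an assumption (a real token would more likely encode a timestamp or post id).

```python
def get_news_feed(feed: list, page_size: int, next_page_token: str | None = None) -> dict:
    start = int(next_page_token) if next_page_token else 0
    posts = feed[start:start + page_size]
    has_more = start + page_size < len(feed)
    return {
        "posts": posts,
        "nextPageToken": str(start + page_size) if has_more else None,
    }

# Usage: keep passing the returned token back until it comes back as None.
feed = [{"post_id": i} for i in range(25)]
page = get_news_feed(feed, page_size=10)
while page["nextPageToken"] is not None:
    page = get_news_feed(feed, page_size=10, next_page_token=page["nextPageToken"])
```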
FB news feed sharding
based on the user id, NOT the post id; otherwise, news feed generation would require cross-shard joins
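A minimal sketch of the routing rule, assuming a fixed shard count; hashing the user id keeps a user’s posts and feed on one shard.

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count

def shard_for_user(user_id: str) -> int:
    # Hash the user id (rather than taking the raw id modulo NUM_SHARDS) so
    # rows spread evenly even if ids are assigned sequentially.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# A user's posts and precomputed feed both live on shard_for_user(user_id),
# so GetNewsFeed(user_id, ...) hits exactly one shard.
```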
G Drive sharding
We must shard our entity metadata across multiple K-V store clusters. Sharding on entityID would mean losing the ability to perform batch operations, which these K-V stores give us out of the box and which we’ll need when we move entities around.
We should shard based on ownerID, which means we can edit the metadata of multiple entities ATOMICALLY with a transaction.
G Drive LB
Given the traffic that this website needs to serve, we can have a layer of proxies for entity information, load balanced on a hash of the ownerID. The proxies could also cache entity information.
Netflix sharding
we can split our user-metadata db into a handful of shards, based on user id, each managing anywhere between 1 and 10 TB of indexed data. This will maintain very quick reads and writes for a given user.
write-through
a caching strategy in which data is written to the cache and the corresponding backing store (e.g., main memory or the database) at the same time
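A toy write-through cache to make the definition concrete; the dict-backed “store” stands in for the database or main memory.

```python
class WriteThroughCache:
    def __init__(self, backing_store: dict):
        self.cache: dict = {}
        self.backing_store = backing_store  # stands in for the underlying data store

    def write(self, key, value):
        self.backing_store[key] = value  # write to the source of truth...
        self.cache[key] = value          # ...and to the cache at the same time

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.backing_store[key]  # miss: fall back to the store
        return self.cache[key]
```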
Netflix LB
We can use round-robin load balancing to distribute user network requests across our API servers, which can then load-balance db requests according to userId.
Netflix caching
we can cache static content in our API servers, periodically updating it when new movies and shows are released, and we can also cache user metadata there, using write-through.
Slack sharding
Since our tables are very large, esp msgs, we must shard.
Natural approach: shard based on org size. Biggest orgs in their own shards; smaller organizations grouped together in other shards.
BUT we’ll have problems when an org’s size increases dramatically or when activity surges within an org. Hotspots mean latency goes up.
So we add a service that’ll asynchronously measure org activity and “rebalance” shards accordingly. This service can be a strongly consistent K-V store (Etcd / ZooKeeper), mapping orgIds to shards. Our API servers communicate w/ this service to know which shard to route requests to.
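A toy stand-in for that mapping to show why it beats a fixed hash: routing is a lookup in mutable state that the rebalancer can update. In production the mapping would live in the strongly consistent K-V store; the naive initial assignment here is an assumption.

```python
NUM_SHARDS = 3                      # assumed shard count
org_to_shard: dict = {}             # what the strongly consistent K-V store holds

def shard_for_org(org_id: str) -> int:
    # API servers look the org up on every request instead of hashing, so the
    # mapping can change over time.
    if org_id not in org_to_shard:
        org_to_shard[org_id] = len(org_to_shard) % NUM_SHARDS  # naive initial assignment
    return org_to_shard[org_id]

def rebalance(org_id: str, new_shard: int) -> None:
    # Called by the async service that measures org activity: a hot org can be
    # moved to a less loaded (or dedicated) shard just by updating the mapping.
    org_to_shard[org_id] = new_shard
```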
idempotent
a property of some operations such that no matter how many times you execute them, you achieve the same result
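A quick illustration, contrasting an idempotent status update with a non-idempotent counter bump; the trade dict is made up.

```python
trade = {"id": 1, "status": "PLACED", "retry_count": 0}

def mark_filled(t: dict) -> None:
    t["status"] = "FILLED"       # idempotent: applying it twice leaves the same result

def bump_retry_count(t: dict) -> None:
    t["retry_count"] += 1        # not idempotent: each execution changes the result

mark_filled(trade); mark_filled(trade)            # status is "FILLED" either way
bump_retry_count(trade); bump_retry_count(trade)  # retry_count is now 2, not 1
```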
Slack LB
We’ll want a load balancer btwn the clients and the API servers that subscribe to Kafka topics; the load balancer will also talk to the “smart” sharding service to match clients with the right API servers.
Airbnb LB
Host side - requests to create and delete listings are LBed across a set of API servers using round-robin. The API servers are in charge of writing to the SQL db.
Renter side - LB requests to list, get, and reserve listings across a set of API servers using an API-path-based server-selection strategy.
Note: NO caching at our API servers; we’d quickly serve stale data as new reservations and listings come in.