High-level overviews, storage, tables Flashcards

1
Q

Code-Deployment System high-level overview

A

Our system can be divided very simply into 2 clear subsystems:
- Build System that builds code into binaries
- Deployment System that deploys binaries to our machines across the world

2
Q

Code-Deployment System storage

A

We’ll use blob storage (Google Cloud Storage or S3) to store our code binaries. Blob storage makes sense here, because binaries are literally blobs of data.

3
Q

Code-Deployment System table

A
  1. jobs table
  • id: pk, auto-inc integer
  • created_at: timestamp
  • commit_sha: string
  • name: string, the pointer to the job’s eventual binary in blob storage
  • status: string, QUEUED, RUNNING, SUCCEEDED, FAILED
  2. table for replication status of blobs
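A rough sketch of the jobs table as DDL, run here against an in-memory SQLite stand-in (column names follow the card; the exact types and constraints are assumptions, and a real system would use Postgres/MySQL):

```python
import sqlite3

# Illustrative DDL for the jobs table described above; names come from the card,
# the types and the CHECK constraint are assumptions.
JOBS_DDL = """
CREATE TABLE jobs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TIMESTAMP,
    commit_sha TEXT,
    name       TEXT,  -- pointer to the job's eventual binary in blob storage
    status     TEXT CHECK (status IN ('QUEUED', 'RUNNING', 'SUCCEEDED', 'FAILED'))
);
"""

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real SQL database
conn.executescript(JOBS_DDL)
```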
4
Q

AlgoExpert high-level overview

A

We can divide our system into 3 core components:

  • Static UI content
  • Accessing and interacting with questions (question completion status, saving solutions, etc.)
  • Ability to run code
5
Q

AlgoExpert storage

A

For the UI static content, we can put public assets like images and JS bundles in a blob store: S3 or GCS. Since we’re catering to a global audience and we care about having a responsive website, we want to use a CDN to serve that content. This is especially important for mobile because of the slow connections that phones use.

Static API content, like the list of questions and all solutions, also goes in a blob store for simplicity.

6
Q

AlgoExpert table

A

Since this data will have to be queried a lot, a SQL db like Postgres or MySQL seems like a good choice.

Table 1. question_completion_status
- id: pk, auto-inc integer
- user_id
- question_id
- completion_status (enum)

Table 2. user_solutions
- id: pk, auto-inc integer
- user_id
- question_id
- language
- solution
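A rough sketch of both tables as DDL, using an in-memory SQLite stand-in (column names follow the card; the types, the UNIQUE constraints, and the example enum values are assumptions):

```python
import sqlite3

DDL = """
CREATE TABLE question_completion_status (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id           TEXT,
    question_id       TEXT,
    completion_status TEXT,  -- enum-like, e.g. 'NOT_COMPLETED' / 'COMPLETED' (assumed values)
    UNIQUE (user_id, question_id)
);

CREATE TABLE user_solutions (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id     TEXT,
    question_id TEXT,
    language    TEXT,
    solution    TEXT,
    UNIQUE (user_id, question_id, language)
);
"""

sqlite3.connect(":memory:").executescript(DDL)
```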

7
Q

Stockbroker high-level overview

A
  • the PlaceTrade API call that clients will make
  • the API server(s) handling client API calls
  • the system in charge of executing orders for each customer
8
Q

Stockbroker table

A
  1. for trades
    - id
    - customer_id
    - stockTicker
    - type: string, either BUY or SELL
    - quantity: integer (no fractional shares)
    - status: string, the status of the trade; starts as PLACED
    - reason: string, the human readable justification of the trade’s status
    - created_at: timestamp, the time when the trade was created
  2. for balances
    - id, customer_id, amount, last_modified
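A rough sketch of the two tables as DDL, SQLite stand-in (names follow the card; types and constraints are assumptions):

```python
import sqlite3

DDL = """
CREATE TABLE trades (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  TEXT,
    stock_ticker TEXT,
    type         TEXT CHECK (type IN ('BUY', 'SELL')),
    quantity     INTEGER,            -- no fractional shares
    status       TEXT DEFAULT 'PLACED',
    reason       TEXT,               -- human-readable justification of the status
    created_at   TIMESTAMP
);

CREATE TABLE balances (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id   TEXT,
    amount        REAL,
    last_modified TIMESTAMP
);
"""

sqlite3.connect(":memory:").executescript(DDL)
```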
9
Q

Amazon high-level overview

A

There’s a USER side and a WAREHOUSE side.

Within a region, user and warehouse requests will get round-robin-load-balanced to respective sets of API servers, and data will be written to and read from a SQL database for that region.

We’ll go with a SQL db because all of the data is, by nature, structured and lends itself well to a relational model.

10
Q

Amazon table

A

6 SQL tables
1. items (name, description, price, etc)
2. carts
3. orders
4. aggregated stock (all of the item stocks on Amazon that are relevant to users)
5. warehouse orders
6. warehouse stock (must have physicalStock and availableStock)
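A sketch of just the warehouse stock table, since the physicalStock/availableStock split is the interesting part (all columns other than those two are assumptions):

```python
import sqlite3

DDL = """
CREATE TABLE warehouse_stock (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    warehouse_id    TEXT,     -- assumed column
    item_id         TEXT,     -- assumed column
    physical_stock  INTEGER,  -- what is physically sitting in the warehouse
    available_stock INTEGER   -- presumably excludes stock already assigned to pending orders
);
"""

sqlite3.connect(":memory:").executescript(DDL)
```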

11
Q

FB news feed high-level overview

A
  1. 2 API calls, CreatePost and GetNewsFeed
  2. feed creation and storage strategy, then tying everything together
12
Q

FB news feed storage

A

We can have one main relational database to store most of our system’s data, including posts and users. This database will have very large tables.

13
Q

Google Drive high-level overview

A
  1. we’ll need to support the following operations:
    - files: upload, download, delete, rename, move
    - folders: create, get, rename, delete, move
  2. design storage solution for
    - entity (files and folders) metadata
    - file content
14
Q

Google Drive storage

A

To store entity info, we use K-V stores. Since we need high availability and data replication, we need to use something like Etcd, Zookeeper, or Google Cloud Spanner (as a K-V store) that gives us both of those guarantees as well as consistency (as opposed to DynamoDB, for instance, which would give us only eventual consistency).

To store file chunks, GCS.

To store blob reference counts, SQL table.

15
Q

Google Drive table

A

For files and folders.
Both have:
id, is_folder (true/false), name, owner_id, parent_id

Difference: files have blobs (an array of blob hashes); folders have children (an array of child IDs)
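A minimal sketch of what a file entity and a folder entity might look like as records in the K-V store (field names follow the card; the IDs and hashes are made up):

```python
# Hypothetical entity records; every ID and hash below is a made-up example.
file_entity = {
    "id": "entity-123",
    "is_folder": False,
    "name": "resume.pdf",
    "owner_id": "user-42",
    "parent_id": "entity-999",          # the containing folder
    "blobs": ["hash-aaa", "hash-bbb"],  # hashes of the file's content chunks in blob storage
}

folder_entity = {
    "id": "entity-999",
    "is_folder": True,
    "name": "Documents",
    "owner_id": "user-42",
    "parent_id": "root",
    "children": ["entity-123"],         # IDs of child entities
}
```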

16
Q

Netflix high-level overview

A
  • Storage (Video Content, Static Content, and User Metadata)
  • General Client-Server Interaction (i.e., the life of a query)
  • Video Content Delivery
  • User-Activity Data Processing
17
Q

Netflix storage

A
  • Since we’re only dealing with a few hundred terabytes of video content, we can use a simple blob storage solution like S3 or GCS.
  • Static content (titles, cast lists, descriptions) in a relational db or even in a document store, and we can cache most of it in our API servers.
  • User metadata in a classic relational db like Postgres.
  • User activity logs in HDFS
18
Q

Tinder high-level overview

A
  • Overview
  • Profile Creation
  • Deck Generation
  • Swiping
    maybe super-liking and undoing
19
Q

Tinder storage

A

Most of the data that we expect to store (profiles, decks, swipes, matches) is structured, so we’ll use SQL. All of this data will be stored in regional dbs, located based on user hot spots. We’ll have asynchronous replication between the regional dbs.

The only exception is users’ profile pics, which go in a global blob store and will be served via CDN.

20
Q

Tinder table

A

All SQL.
1. profiles (each row is a profile)
2. users’ decks (each row is one user’s deck)
3. swipes
4. matches

21
Q

Slack high-level overview

A

2 main sections:
- Handling what happens when a Slack app loads.
- Handling real-time messaging as well as cross-device synchronization.

22
Q

Slack storage

A

Since our tables will be very large, especially the messages table, we must shard. We can use a “smart” sharding solution; this service can be a strongly consistent key-value store like Etcd or ZooKeeper, mapping orgIds to shards.

23
Q

Slack table

A

SQL db since we’re dealing with structured data that’s queried frequently. Tables for:
1. channels
2. channel members (each row is a channel-member pair)
3. historical messages
4. latest channel timestamps
5. channel read receipts (lastSeen is updated whenever a user opens a channel)
6. number of unread user mentions
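A rough sketch of two of these tables, channel members and channel read receipts, as DDL (the card only names the tables, so most columns here are assumptions):

```python
import sqlite3

DDL = """
CREATE TABLE channel_members (
    channel_id TEXT,
    user_id    TEXT,
    PRIMARY KEY (channel_id, user_id)  -- each row is a channel-member pair
);

CREATE TABLE channel_read_receipts (
    channel_id TEXT,
    user_id    TEXT,
    last_seen  TIMESTAMP,              -- updated whenever the user opens the channel
    PRIMARY KEY (channel_id, user_id)
);
"""

sqlite3.connect(":memory:").executescript(DDL)
```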

24
Q

AirBNB high-level overview

A

HOST side and RENTER side. We further divide the renter side:
- Browsing listings.
- Getting a single listing.
- Reserving a listing.

25
Q

AirBNB storage

A

Listings and reservations in SQL tables (source of truth).

Since we care about the latency of browsing listings on Airbnb, and since this browsing requires querying listings based on their location, we can store our listings in a region quadtree.
Since we’ll be storing our quadtree in memory, we must ensure that a single machine failure doesn’t bring down the entire browsing functionality. So we can set up a cluster of machines, each holding an instance of our quadtree in memory, and these machines can use leader election.

26
Q

AirBNB table

A

SQL tables for
- listings
- reservations

27
Q

AlgoExpert load balancing

A

We can have 2 primary clusters of backend servers in the 2 important regions: the U.S. and India.

We can have some DNS load balancing to route API requests to the cluster closest to the user, and within a region, we can have path-based load balancing to separate our services (payments, authentication, code execution, etc.), especially since the code execution platform will probably need to run on different kinds of servers compared to those of the rest of the API.

28
Q

AlgoExpert caching

A

We can implement 2 layers of caching for static API content: client-side (so users only need to load questions once per session, and the load on our backend servers is reduced) and server-side.

29
Q

Stockbroker API call

A

PlaceTrade(customerId, stockTicker, type (BUY/SELL), quantity)
=>
(tradeId, stockTicker, type: string, quantity, createdAt, status (always PLACED))

We can imagine that a GetTrade API call could return the statuses IP, FILLED, and REJECTED, along with a reason.
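A sketch of the call in code; the shapes mirror the card, but place_trade and the Trade dataclass are illustrative, not a real API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import uuid4

@dataclass
class Trade:
    trade_id: str
    stock_ticker: str
    type: str        # "BUY" or "SELL"
    quantity: int
    created_at: datetime
    status: str      # always "PLACED" in the PlaceTrade response

def place_trade(customer_id: str, stock_ticker: str, type: str, quantity: int) -> Trade:
    # A real implementation would write a row to the trades table and hand the
    # trade off for execution; this stub only returns the response shape.
    return Trade(
        trade_id=str(uuid4()),
        stock_ticker=stock_ticker,
        type=type,
        quantity=quantity,
        created_at=datetime.now(timezone.utc),
        status="PLACED",
    )
```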

30
Q

Stockbroker load balancing

A

We’ll need multiple API servers to handle all of the incoming requests. Since we don’t need any caching when making trades (don’t care about server stickiness), we can just use some round-robin load balancing to distribute incoming requests among our API servers.

31
Q

atomicity

A

Atomicity is a feature of database systems dictating that a transaction must be all-or-nothing.

32
Q

consistency

A

Only valid data will be written; data cannot be written that would violate the database’s own rules for valid data.

33
Q

Amazon caching

A

Users call the GetItemCatalog(search) endpoint when they’re searching for items. The request is routed by API servers to the smart search-results service, which interacts directly with the items table, caches popular item searches, and returns the results.

34
Q

FB news feed API call

A

CreatePost(user_id, post)

GetNewsFeed(user_id, pageSize, nextPageToken) =>
(posts: [], nextPageToken)

supports pagination
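A small sketch of how a client might page through the feed; get_news_feed here is a hypothetical stand-in for the GetNewsFeed endpoint, backed by a fake in-memory feed:

```python
from typing import Optional

FAKE_FEED = [f"post-{i}" for i in range(125)]  # made-up data

def get_news_feed(user_id: str, page_size: int, page_token: Optional[int]):
    # Stand-in for the real endpoint (user_id is ignored by this fake):
    # returns (posts, next_page_token), where the token is None once the feed is exhausted.
    start = page_token or 0
    next_token = start + page_size if start + page_size < len(FAKE_FEED) else None
    return FAKE_FEED[start:start + page_size], next_token

def fetch_entire_feed(user_id: str, page_size: int = 50) -> list:
    posts, token = [], None
    while True:
        page, token = get_news_feed(user_id, page_size, token)
        posts.extend(page)
        if token is None:
            return posts

assert len(fetch_entire_feed("user-1")) == 125
```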

35
Q

FB news feed sharding

A

Shard based on the user id, NOT the post id, because otherwise news feed generation would require cross-shard joins.
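A tiny sketch of the idea: the shard is a pure, stable function of the user id, so all of a user’s posts live on the same shard (the hash-mod scheme here is illustrative; a real system might use consistent hashing or a lookup service):

```python
import hashlib

NUM_SHARDS = 16  # illustrative

def shard_for_user(user_id: str) -> int:
    # Stable hash so every API server maps the same user to the same shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```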

36
Q

G Drive sharding

A

We must shard our entity metadata across multiple clusters of K-V stores. Sharding on entityID means we’ll lose the ability to perform batch operations, which these K-V stores give us out of the box and which we’ll need when we move entities around.

We should shard based on ownerID, which means we can edit the metadata of multiple entities ATOMICALLY with a transaction.

37
Q

G Drive LB

A

Given the traffic that this website needs to serve, we can have a layer of proxies for entity information, load-balanced on a hash of the ownerID. The proxies could also handle some caching.

38
Q

Netflix sharding

A

We can split our user-metadata db into a handful of shards, based on user id, each managing anywhere between 1 and 10 TB of indexed data. This will maintain very quick reads and writes for a given user.

39
Q

write-through

A

A storage method in which data is written to the cache and to the corresponding backing store (e.g., main memory or the database) at the same time.
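A toy sketch of write-through, with a plain dict standing in for the backing store (main memory or a database):

```python
class WriteThroughCache:
    """Toy write-through cache: every write goes to the cache and to the
    backing store at the same time, so the two never disagree."""

    def __init__(self, backing_store: dict):
        self.cache = {}
        self.backing_store = backing_store

    def write(self, key, value):
        self.cache[key] = value          # write to the cache...
        self.backing_store[key] = value  # ...and to the backing store, synchronously

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.backing_store[key]  # cache miss: fall back to the backing store
        self.cache[key] = value
        return value
```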

40
Q

Netflix LB

A

We can use round-robin load balancing to distribute user network requests across our API servers, which can then load-balance db requests according to userId.

41
Q

Netflix caching

A

We can cache static content in our API servers, periodically updating it when new movies and shows are released, and we can also cache user metadata there, using write-through.

42
Q

Slack sharding

A

Since our tables are very large, especially the messages table, we must shard.

Natural approach: shard based on org size. Biggest orgs in their own shards; smaller organizations grouped together in other shards.

BUT we’ll have problems when an org’s size increases dramatically or when activity surges within an org. Hotspots mean latency goes up.

So we add a service that’ll asynchronously measure org activity and “rebalance” shards accordingly. This service can be a strongly consistent K-V store (Etcd / ZooKeeper), mapping orgIds to shards. Our API servers communicate with this service to know which shard to route requests to.

43
Q

idempotent

A

A property of some operations such that no matter how many times you execute them, you achieve the same result.
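A tiny illustration: setting a value is idempotent, incrementing it is not:

```python
balance = {"amount": 0}

def set_amount(value):   # idempotent: retrying it leaves the same state
    balance["amount"] = value

def add_amount(value):   # not idempotent: every retry changes the result
    balance["amount"] += value

set_amount(10); set_amount(10)
assert balance["amount"] == 10
add_amount(10); add_amount(10)
assert balance["amount"] == 30
```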

44
Q

Slack LB

A

We’ll want a load balancer between the clients and the API servers that subscribe to Kafka topics; this load balancer will also talk to the “smart” sharding service to match clients with the right API servers.

45
Q

Airbnb LB

A

Host side - requests to create and delete listings are load-balanced across a set of API servers using round-robin. The API servers are in charge of writing to the SQL db.

Renter side - requests to list, get, and reserve listings are load-balanced across a set of API servers using an API-path-based server-selection strategy.

Note: NO caching at our API servers; we’d run into stale data as reservations and listings come in.