SDI Flashcards

1
Q

tinyurl: capacity estimation

A

read-heavy. There will be many more redirection requests than new URL shortenings, roughly 100:1.

traffic: ask for the number of new URL shortenings per month (500M)

storage: each stored object is ~500 bytes. Ask how long we want to store each shortening.
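
A quick back-of-the-envelope sketch of these numbers (the 5-year retention period here is an assumption for illustration, not something stated on the card):

```python
# Rough capacity math for the tinyurl estimates above.
NEW_URLS_PER_MONTH = 500_000_000
READ_WRITE_RATIO = 100          # ~100 redirects per new shortening
OBJECT_SIZE_BYTES = 500
RETENTION_YEARS = 5             # assumed retention period

SECONDS_PER_MONTH = 30 * 24 * 3600

writes_per_sec = NEW_URLS_PER_MONTH / SECONDS_PER_MONTH       # ~200 QPS
reads_per_sec = writes_per_sec * READ_WRITE_RATIO             # ~20K QPS

total_objects = NEW_URLS_PER_MONTH * 12 * RETENTION_YEARS     # 30 billion
total_storage_tb = total_objects * OBJECT_SIZE_BYTES / 1e12   # ~15 TB

print(f"writes/s: {writes_per_sec:.0f}, reads/s: {reads_per_sec:.0f}, storage: {total_storage_tb:.0f} TB")
```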

2
Q

tinyurl: APIs

A

createURL(api_dev_key, original_url, custom_alias=None, user_name=None,
expire_date=None)
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to, among other
things, throttle users based on their allocated quota.
original_url (string): Original URL to be shortened.
custom_alias (string): Optional custom key for the URL.
user_name (string): Optional user name to be used in encoding.
expire_date (string): Optional expiration date for the shortened URL.

A successful creation returns the shortened URL; otherwise, it returns an error code.

deleteURL(api_dev_key, url_key)
Where “url_key” is a string representing the shortened URL to be deleted. A successful deletion
returns ‘URL Removed’.
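
A minimal Python sketch of these two calls, assuming an in-memory `URL_STORE` dict as a stand-in for the real key-value store and a placeholder key generator (quota throttling via api_dev_key is omitted):

```python
import secrets
from datetime import datetime

URL_STORE = {}  # stand-in for the real key-value store


def create_url(api_dev_key, original_url, custom_alias=None,
               user_name=None, expire_date=None):
    """Returns the shortened URL on success, otherwise an error string."""
    key = custom_alias or secrets.token_urlsafe(4)[:6]  # placeholder for KGS
    if key in URL_STORE:
        return "error: alias already taken"
    URL_STORE[key] = {
        "original_url": original_url,
        "user_name": user_name,
        "expire_date": expire_date,
        "created_at": datetime.utcnow(),
    }
    return f"https://tiny.url/{key}"


def delete_url(api_dev_key, url_key):
    return "URL Removed" if URL_STORE.pop(url_key, None) else "error: key not found"
```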

3
Q

tinyurl DB schema

A

2 tables: one for URL mappings and one for users’ data.

for URL -
Hash: varchar(16)
OriginalURL: varchar(512)
CreationDate: datetime
ExpirationDate: datetime

for User -
UserID: int
Name: varchar(20)
Email: varchar(32)
CreationDate: datetime
LastLogin: datetime

4
Q

tinyurl What kind of database should we use?

A

Since we anticipate storing billions of rows and we don’t need relationships between objects, a NoSQL key-value store like DynamoDB or Cassandra is a better choice. A NoSQL store would also be easier to scale.

5
Q

KGS

A
  • generates unique, random six-letter strings beforehand and stores them in a database
  • 2 tables: one for used keys, one for unused keys

KGS can keep some keys in memory so it can quickly provide them whenever a server needs them. For simplicity, as soon as KGS loads some keys into memory, it moves them to the used-keys table.
This ensures each server gets unique keys.

If KGS dies before assigning all the keys it has loaded, losing them is acceptable given the huge number of keys we have.
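
A rough in-memory sketch of how KGS could hand out pre-generated keys; the table names, key alphabet, and batch size are illustrative assumptions:

```python
import random
import string


class KeyGenerationService:
    """Pre-generates random six-character keys and hands them out in batches.

    A batch is marked 'used' as soon as it is loaded into memory, so no two
    app servers can ever receive the same key; keys lost in a crash are
    simply wasted, which is acceptable given the huge key space.
    """

    def __init__(self, pregenerate=100_000, batch_size=1_000):
        alphabet = string.ascii_letters + string.digits
        self.unused = {  # stand-in for the unused-keys table
            "".join(random.choices(alphabet, k=6)) for _ in range(pregenerate)
        }
        self.used = set()        # stand-in for the used-keys table
        self.in_memory = []      # keys loaded for fast hand-out
        self.batch_size = batch_size

    def _load_batch(self):
        n = min(self.batch_size, len(self.unused))
        batch = [self.unused.pop() for _ in range(n)]
        self.used.update(batch)  # moved to 'used' immediately
        self.in_memory.extend(batch)

    def get_key(self):
        if not self.in_memory:
            self._load_batch()
        return self.in_memory.pop()
```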

6
Q

Isn’t KGS a single point of failure?

A

Yes, it is. To solve this, we can have a standby replica of KGS.
Whenever the primary server dies, standby takes over

7
Q

how to make sure KGS doesn’t give the same key to multiple servers?

A

it must synchronize
(or get a lock on) the data structure holding the keys before removing keys from it.
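
A minimal sketch of that synchronization using a `threading.Lock`; `in_memory_keys` is an illustrative stand-in for the key buffer KGS holds in memory:

```python
import threading

in_memory_keys = ["abc123", "def456", "ghi789"]  # illustrative key buffer
keys_lock = threading.Lock()


def get_unique_key():
    """Only one server request can remove a key at a time."""
    with keys_lock:
        if not in_memory_keys:
            raise RuntimeError("buffer empty: load a fresh batch from the key DB")
        return in_memory_keys.pop()
```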

8
Q

tinyurl - key lookup process

A

We can look up the key in our database or key-value store to
get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the browser, passing the
stored URL in the “Location” field of the response. If not, issue an
“HTTP 404 Not Found” status or redirect the user back to the homepage.
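
A stdlib-only sketch of that redirect path; `URL_STORE`, the port, and the sample mapping are placeholders for illustration:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

URL_STORE = {"abc123": "https://example.com/some/very/long/path"}  # placeholder


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.lstrip("/")
        original = URL_STORE.get(key)
        if original:
            self.send_response(302)                 # HTTP 302 Redirect
            self.send_header("Location", original)  # stored URL in Location
            self.end_headers()
        else:
            self.send_error(404, "Not Found")       # or redirect to homepage


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), RedirectHandler).serve_forever()
```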

9
Q

Range Based Partitioning:

A

We can store URLs in separate partitions based on the first letter of the URL.
We can even combine certain less frequently occurring letters into one database partition.

But this can lead to unbalanced servers. For example, if we put all URLs starting with the letter ‘E’ into one DB partition, too many URLs may start with ‘E’ and overload it.
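
A tiny sketch of that mapping; the particular letter-to-partition split (and lumping rare letters into a shared partition) is illustrative:

```python
# First letter -> partition; rarely occurring leading letters share partition 3.
PARTITIONS = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}  # illustrative split


def partition_for(url: str) -> int:
    return PARTITIONS.get(url[0].lower(), 3)


print(partition_for("elephant.com"))  # 2 -- the potentially overloaded 'E' partition
```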

10
Q

Hash-Based Partitioning:

A

Take a hash of either the whole URL or the shortened URL. Our hashing function
can always map any key to a number in [1…256], and this number represents the partition
in which we store our object.

This approach can still lead to overloaded partitions, which can be solved by using Consistent Hashing.
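
A sketch of the modulo mapping; `md5` is just one stable hash choice for illustration:

```python
import hashlib

NUM_PARTITIONS = 256


def partition_for(key: str) -> int:
    """Map any URL or short key to a partition number in [1..256]."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS + 1


print(partition_for("https://example.com/some/article"))
```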

11
Q

tinyurl caching

A

We can use some off-the-shelf solution like
Memcache, which can store full URLs with their respective hashes.

estimate cache size.

LRU eviction policy.

We can use cache replicas to distribute load. Whenever there is a cache miss, our servers hit the
backend database; whenever this happens, we can update the cache and pass the new entry to all replicas. Each replica updates its cache by adding the new entry. If a replica already
has that entry, it can simply ignore it.
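
A minimal LRU cache sketch using `OrderedDict`; a real deployment would use Memcache as the card says, and the capacity would come from the size estimate above:

```python
from collections import OrderedDict


class LRUCache:
    """Maps short-URL keys to full URLs, evicting the least recently used."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                    # cache miss: fall back to the DB
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, url):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = url
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```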

12
Q

round robin LB

A

distributes incoming requests equally
among backend servers. This LB is simple to implement and does not introduce any overhead. Another
benefit of this approach is that if a server is dead, LB will take it out of the rotation and will stop
sending any traffic to it.
A problem with Round Robin LB is that if a server is
overloaded or slow, the LB will not stop sending new requests to that server.
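
A bare-bones round-robin sketch; the health checking that marks servers dead is assumed to happen elsewhere:

```python
class RoundRobinLB:
    """Hands requests to backends in turn, skipping servers marked dead."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.alive = set(self.servers)  # maintained by a health checker
        self._i = 0

    def mark_dead(self, server):
        self.alive.discard(server)      # taken out of the rotation

    def next_server(self):
        for _ in range(len(self.servers)):
            server = self.servers[self._i]
            self._i = (self._i + 1) % len(self.servers)
            if server in self.alive:
                return server
        raise RuntimeError("no healthy backends")


lb = RoundRobinLB(["app-1", "app-2", "app-3"])
lb.mark_dead("app-2")
print([lb.next_server() for _ in range(4)])  # ['app-1', 'app-3', 'app-1', 'app-3']
```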

13
Q

lazy cleanup

A

An active search for expired links to remove them would put a lot of pressure on our database, so we do lazy cleanup instead. Only expired links get deleted; some expired links may live longer, but they will never be returned to users.
* Whenever a user tries to access an expired link, we can delete the link and return an error to the
user.
* A separate Cleanup service removes expired links from our storage and cache. It is lightweight and scheduled to run only during low user traffic.
* After removing an expired link, we can put the key back in the key-DB to be reused.
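
A sketch of the lazy-expiration check on the read path; `URL_STORE` and `unused_keys` are the same kind of illustrative in-memory stand-ins used earlier:

```python
from datetime import datetime

URL_STORE = {}       # key -> {"original_url": ..., "expire_date": datetime or None}
unused_keys = set()  # stand-in for the key-DB's pool of reusable keys


def resolve(key):
    entry = URL_STORE.get(key)
    if entry is None:
        return None                                       # never existed -> 404
    expires = entry["expire_date"]
    if expires is not None and expires < datetime.utcnow():
        del URL_STORE[key]                                # lazy delete on access
        unused_keys.add(key)                              # key returns to the key-DB
        return None                                       # treat as expired -> error
    return entry["original_url"]
```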

14
Q

pastebin
capacity estimation

A

5:1 read:write ratio.
1M new pastes daily.
Each metadata object we are storing would be small (< 100 bytes).
Each paste object we are storing can be of medium size (up to a few MB).
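
A quick back-of-the-envelope pass on these numbers; the 10 KB average paste size and 10-year retention are assumptions for illustration, not stated on the card:

```python
NEW_PASTES_PER_DAY = 1_000_000
READ_WRITE_RATIO = 5
AVG_PASTE_BYTES = 10 * 1024   # assumed average paste size
RETENTION_YEARS = 10          # assumed retention period

writes_per_sec = NEW_PASTES_PER_DAY / 86_400                          # ~12 QPS
reads_per_sec = writes_per_sec * READ_WRITE_RATIO                     # ~58 QPS
storage_per_day_gb = NEW_PASTES_PER_DAY * AVG_PASTE_BYTES / 1e9       # ~10 GB/day
total_storage_tb = storage_per_day_gb * 365 * RETENTION_YEARS / 1e3   # ~37 TB

print(f"{writes_per_sec:.0f} writes/s, {reads_per_sec:.0f} reads/s, ~{total_storage_tb:.0f} TB total")
```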

15
Q

pastebin APIs

A

addPaste(api_dev_key, paste_data, custom_url=None, user_name=None, paste_name=None,
expire_date=None)

getPaste(api_dev_key, api_paste_key)

deletePaste(api_dev_key, api_paste_key)
A successful deletion returns ‘true’, otherwise returns ‘false’.

16
Q

pastebin db schema

A

There are no relationships between records, except if we want to store which user created which paste.
We would need two tables, one for storing information about the pastes and the other for users’ data.

Paste table (key: URLHash)
- PasteKey, ExpirationDate, CreationDate

User table (key: UserID)
- Name, Email, CreationDate, LastLogin

17
Q

pastebin component design

A
  • application layer: uses a KGS to generate paste keys
  • datastore layer: We can divide our datastore layer into two:
  1. Metadata database: We can use a relational database like MySQL or a distributed key-value
    store like Dynamo or Cassandra.
  2. Object storage: We can store our contents in an object store like Amazon S3. Whenever we get close to full capacity on content storage, we can easily increase it by adding more servers.
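
A hedged sketch of that split, assuming boto3 is installed and AWS credentials are configured; the bucket name and the dict standing in for the metadata database are placeholders:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "pastebin-contents"   # placeholder bucket name
metadata_db = {}               # stand-in for MySQL / Dynamo / Cassandra


def add_paste(paste_key, paste_data):
    # 1. Paste contents go to object storage.
    s3.put_object(Bucket=BUCKET, Key=paste_key, Body=paste_data.encode())
    # 2. Only the small metadata row goes to the metadata database.
    metadata_db[paste_key] = {"content_key": paste_key, "size": len(paste_data)}


def get_paste(paste_key):
    obj = s3.get_object(Bucket=BUCKET, Key=paste_key)
    return obj["Body"].read().decode()
```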
18
Q

insta design considerations

A
  1. Practically, users can upload as many photos as they like. Efficient management of storage
    should be a crucial factor while designing this system.
  2. Low latency is expected while viewing photos.
  3. Data should be 100% reliable. If a user uploads a photo, the system will guarantee that it will
    never be lost.
19
Q

insta db schema

A
  1. Photo table: PK is PhotoID.
    - UserID, PhotoPath, PhotoLatitude, PhotoLongitude, UserLatitude, UserLongitude, CreationDate
  2. User table: PK is UserID.
    - Name, Email, DOB, CreationDate, LastLogin
  3. UserPhoto table: key is UserID.
    - the ‘value’ is the list of ‘PhotoIDs’ the user owns, stored in different columns
  4. UserFollow table: similar to 3.
20
Q

insta storage method

A
  • store photos in a distributed file storage like HDFS or S3.
  • the Photo and User metadata tables can be stored in a distributed key-value store to enjoy the benefits offered by NoSQL.
  • the UserPhoto and UserFollow tables go in a wide-column datastore like Cassandra.
21
Q

cassandra

A

Cassandra, or key-value stores in general, always maintain a certain number of replicas to offer
reliability. Also, in such data stores, deletes are not applied instantly; data is retained for a certain number of days
(to support undeleting) before being removed from the system permanently.

22
Q

insta component design

A
  • Photo uploads (writes) can be slow as they have to go to disk; reads will be faster, especially if served from cache.
  • Uploading users can consume all the available connections, as uploading is a slow process. This means that ‘reads’ cannot be served if the system gets busy with all the write requests. Keep in mind that web servers have a connection limit when designing the system. If we assume that a web server
    can have a maximum of 500 connections at any time, then it can’t have more than 500 concurrent uploads or reads.
  • To handle this bottleneck we can split reads and writes into separate services. We will have dedicated servers for reads and different servers for writes to ensure that uploads don’t hog the
    system.
23
Q

insta losing files

A

Losing files is not an option for our service. Therefore, we will store multiple copies of each file so that
if one storage server dies we can retrieve the photo from the other copy present on a different storage
server.
This same principle also applies to other components of the system. If we want high availability
of the system, we need multiple replicas of services running, so that if a few instances die the system still remains available and running.

Redundancy removes the single point of failure in the system.

24
Q

Partitioning based on UserID

A
  • keep all photos of a user on the same shard
  • find the shard number by UserID % (number of shards) and then store the data there.
  • Each DB shard can have its own auto-increment sequence for
    PhotoIDs; append the ShardID to each PhotoID to get the final photo ID (see the sketch after this list).

problems:
- hot users
- users with a lot of photos, making storage distribution non-uniform
- If we distribute photos of a user onto multiple shards, will it cause higher latencies?
- Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s
data if that shard is down, or higher latency if the shard is serving a high load
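
A sketch of the UserID-based sharding and the ShardID-in-PhotoID trick; the shard count and the "last two digits hold the shard" layout are illustrative assumptions:

```python
NUM_SHARDS = 10  # illustrative shard count


def shard_for_user(user_id: int) -> int:
    return user_id % NUM_SHARDS


def final_photo_id(shard_id: int, local_photo_id: int) -> int:
    """Append the shard ID to the shard-local auto-increment ID."""
    return local_photo_id * 100 + shard_id  # last two digits encode the shard


def shard_for_photo(photo_id: int) -> int:
    return photo_id % 100                   # recover the shard on reads


shard = shard_for_user(1234)                         # -> shard 4
print(final_photo_id(shard, local_photo_id=987654))  # -> 98765404
```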

25
Q

reasonable size for 1 db shard

A

1 TB

26
Q

Partitioning based on PhotoID

A

Generate unique PhotoIDs first using KGS, and then find the shard
number through “PhotoID % 10”.

27
Q

insta caching

A
  • cache 20% of the daily read volume of photos and metadata
  • we need a massive-scale photo delivery system to serve globally distributed users. We should push content closer to the user using a large number of geographically
    distributed photo cache servers and use CDNs
  • We can introduce a cache for metadata servers to cache hot database rows. We can use Memcache to cache the data. Application servers, before hitting the database, can quickly check if the cache has the desired rows.
  • LRU eviction policy