sdi Flashcards
tinyurl: capacity estimation
read-heavy: there will be many more redirection requests than new URL shortenings. Assume 100:1.
traffic: ask for the number of new URL shortenings per month (500M)
storage: each stored object is ~500 bytes. Ask how long we want to store each shortening request.
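A quick arithmetic sketch of these estimates; the 5-year retention period is an assumption for illustration:

```python
# Back-of-the-envelope numbers from the notes above: 500M new URLs/month,
# 100:1 read:write ratio, 500-byte objects. 5-year retention is assumed.
SECONDS_PER_MONTH = 30 * 24 * 3600

new_urls_per_month = 500_000_000
read_write_ratio = 100
object_size_bytes = 500
retention_years = 5  # assumed retention period

write_qps = new_urls_per_month / SECONDS_PER_MONTH        # ~193 new URLs/s
read_qps = write_qps * read_write_ratio                   # ~19K redirects/s
total_objects = new_urls_per_month * 12 * retention_years # 30 billion objects
storage_tb = total_objects * object_size_bytes / 1e12     # 15 TB
```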
tinyurl: APIs
createURL(api_dev_key, original_url, custom_alias=None, user_name=None,
expire_date=None)
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to, among other
things, throttle users based on their allocated quota.
original_url (string): Original URL to be shortened.
custom_alias (string): Optional custom key for the URL.
user_name (string): Optional user name to be used in encoding.
expire_date (string): Optional expiration date for the shortened URL.
A successful creation returns the shortened URL; otherwise, it returns an error code.
deleteURL(api_dev_key, url_key)
Where “url_key” is a string representing the shortened URL to be deleted. A successful deletion
returns ‘URL Removed’.
tinyurl DB schema
2 tables: one for URL mappings, and one for the user’s data.
for URL -
Hash: varchar(16)
OriginalURL: varchar(512)
CreationDate: datetime
ExpirationDate: datetime
for User -
UserID: int
Name: varchar(20)
Email: varchar(32)
CreationDate: datetime
LastLogin: datetime
tinyurl What kind of database should we use?
Since we anticipate storing billions of rows, and we don’t
need relationships between objects – a NoSQL key-value store like DynamoDB or Cassandra is a better choice. A NoSQL store would also be easier to scale.
KGS
- generates unique, random six-letter strings beforehand and stores them in a database
- 2 tables: one for used keys, one for unused keys
KGS can keep some keys in memory so it can quickly provide them whenever a server needs them. For simplicity, as soon as KGS loads some keys into memory, it moves them to the used-keys table.
This ensures each server gets unique keys.
if KGS dies before assigning all keys, it’s acceptable, given the huge number of keys we have.
Isn’t KGS a single point of failure?
Yes, it is. To solve this, we can have a standby replica of KGS.
Whenever the primary server dies, standby takes over
how to make sure KGS doesn’t give the same key to multiple servers?
it must synchronize
(or get a lock on) the data structure holding the keys before removing keys from it
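A minimal sketch of the KGS idea: keys are pre-generated, and handing them out happens under a lock so two app servers can never receive the same key. The class name, batch size, and in-memory sets stand in for the real tables and are assumptions for illustration:

```python
import secrets
import string
import threading

ALPHABET = string.ascii_letters + string.digits  # key alphabet (assumed base-62)

class KGS:
    """Sketch of a Key Generation Service: pre-generates random six-letter
    keys and hands them out under a lock so no two servers get the same key."""

    def __init__(self, batch=1000, key_len=6):
        self.lock = threading.Lock()
        self.used = set()    # stands in for the "used keys" table
        self.unused = set()  # stands in for the "unused keys" table
        while len(self.unused) < batch:
            self.unused.add("".join(secrets.choice(ALPHABET) for _ in range(key_len)))

    def get_keys(self, n):
        # Synchronize on the key structure before removing keys (see note above).
        with self.lock:
            keys = [self.unused.pop() for _ in range(n)]
            # Moved to "used" as soon as they are handed out, so a KGS crash
            # only wastes a few keys instead of double-assigning them.
            self.used.update(keys)
            return keys
```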
tinyurl - key lookup process
We can look up the key in our database or key-value store to
get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the browser, passing the
stored URL in the “Location” field of the response. If not, issue an
“HTTP 404 Not Found” status or redirect the user back to the homepage.
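The lookup flow above can be sketched as a small handler; `db` here is any hypothetical mapping from short key to original URL:

```python
# Sketch of the key-lookup process; `db` is a stand-in for the real store.
def handle_redirect(db, key):
    """Return an (HTTP status, headers) pair for a shortened-URL lookup."""
    url = db.get(key)
    if url is not None:
        # 302 (temporary) means browsers keep hitting us on every click,
        # which preserves analytics; a 301 would let them cache the mapping.
        return 302, {"Location": url}
    return 404, {}
```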
Range-Based Partitioning:
We can store URLs in separate partitions based on the first letter of the hash key.
We can even combine certain less frequently occurring letters into one database partition.
But this can lead to unbalanced servers. Ex: we put all keys starting with the letter ‘E’ into one DB partition, but too many
keys turn out to start with ‘E’.
Hash-Based Partitioning:
take a hash of the object we are storing (e.g., the shortened URL key). The hash function
can map any key to a number in [1…256], and this number represents the partition
in which we store our object.
This approach can still lead to overloaded partitions, which can be solved by using Consistent Hashing.
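A sketch of hash-based partitioning with the [1…256] range from the notes (the choice of MD5 is an assumption; any stable hash works):

```python
import hashlib

NUM_PARTITIONS = 256  # matches the [1...256] range above

def partition_for(key: str) -> int:
    """Map a shortened-URL key to a partition number in [1, 256]."""
    # A stable hash (not Python's randomized hash()) so every server
    # computes the same partition for the same key.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS + 1
```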
tinyurl caching
We can use some off-the-shelf solution like
Memcache, which can store full URLs with their respective hashes.
estimate cache size.
LRU eviction policy.
we can use cache replicas to distribute load. Whenever there is a cache miss, our servers would be
hitting a backend database. Whenever this happens, we can update the cache and pass the new entry to all replicas. Each replica can update their cache by adding the new entry. If a replica already
has that entry, it can simply ignore it.
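A minimal in-process sketch of the LRU eviction policy mentioned above (Memcached handles this for real; the class here is just to show the mechanics):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: on overflow, evict the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # insertion order == recency order

    def get(self, key):
        if key not in self.items:
            return None                  # cache miss -> caller hits the DB
        self.items.move_to_end(key)      # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
```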
round robin LB
distributes incoming requests equally
among backend servers. This LB is simple to implement and does not introduce any overhead. Another
benefit of this approach is that if a server is dead, LB will take it out of the rotation and will stop
sending any traffic to it.
A problem with Round Robin LB is that if a server is
overloaded or slow, the LB will not stop sending new requests to that server.
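The rotation logic can be sketched as follows; the class and method names are illustrative, and the health-check that would call `remove` is assumed to exist elsewhere:

```python
class RoundRobinLB:
    """Sketch of a round-robin load balancer: cycle through servers in order;
    a server reported dead is taken out of the rotation."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.i = 0  # position in the rotation

    def pick(self):
        server = self.servers[self.i % len(self.servers)]
        self.i += 1
        return server

    def remove(self, server):
        # Called when a (hypothetical) health check marks a server dead.
        self.servers.remove(server)
```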
lazy cleanup
Actively searching for expired links to remove them would put a lot of pressure on our database; do lazy cleanup instead. Only expired links that get accessed are deleted immediately; some expired links may live longer, but they are never returned to users.
* Whenever a user tries to access an expired link, we can delete the link and return an error to the
user.
* A separate Cleanup service removes expired links from our storage and cache. lightweight and scheduled to run only when low user traffic
* After removing an expired link, we can put the key back in the key-DB to be reused.
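The delete-on-access path can be sketched as below; `db` maps a key to a hypothetical `(url, expiry)` pair and `key_db` stands in for the KGS unused-keys store:

```python
from datetime import datetime, timezone

def lookup(db, key_db, key, now=None):
    """Lazy-cleanup lookup: an expired link is deleted on access, its key is
    returned to the key-DB for reuse, and the user gets nothing back."""
    now = now or datetime.now(timezone.utc)
    entry = db.get(key)
    if entry is None:
        return None
    url, expires_at = entry
    if expires_at is not None and expires_at <= now:
        del db[key]       # delete the expired link on access
        key_db.add(key)   # put the key back in the key-DB for reuse
        return None       # caller returns an error / 404 to the user
    return url
```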
pastebin
capacity estimation
5:1 r:w
1M new pastes daily.
Each metadata object we are storing would be small (< 100 bytes).
Each paste object we are storing can be of medium size (it can be a few MB).
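A quick arithmetic sketch for pastebin; the 10 KB average paste size is an assumption (the notes only bound pastes at "a few MB"):

```python
# Pastebin back-of-the-envelope: 1M new pastes/day, 5:1 read:write.
# The 10 KB average paste size is assumed for illustration.
SECONDS_PER_DAY = 24 * 3600

new_pastes_per_day = 1_000_000
read_write_ratio = 5
avg_paste_kb = 10  # assumed average paste size

write_qps = new_pastes_per_day / SECONDS_PER_DAY              # ~12 pastes/s
read_qps = write_qps * read_write_ratio                       # ~58 reads/s
storage_gb_per_day = new_pastes_per_day * avg_paste_kb / 1e6  # 10 GB/day
```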
pastebin APIs
addPaste(api_dev_key, paste_data, custom_url=None, user_name=None, paste_name=None,
expire_date=None)
getPaste(api_dev_key, api_paste_key)
deletePaste(api_dev_key, api_paste_key)
A successful deletion returns ‘true’, otherwise returns ‘false’.