System Design Flashcards
Acronym for system design checklist
SAD Ninja FAR Read/ write S [from Cracking the coding interview] ----------------------------------------------------- Scalability Asynchronous operations or not? Database Network metrics Failures Availability Reliability Read/ write heavy Security
Also (SEARS) ----------------------------------------------------- Scalability Efficiency Availability Reliability Serviceability or Manageability
Design a web crawler
To crawler a single web page, all we need is to issue a HTTP GET request to the corresponding URL and parse the response data, which is kind of the core of a crawler. With that in mind, a basic web crawler can work like this:
Start with a URL pool that contains all the websites we want to crawl.
For each URL, issue a HTTP GET request to fetch the web page content.
Parse the content (usually HTML) and extract potential URLs that we want to crawl.
Add new URLs to the pool and keep crawling.
How to ensure that the same url is not added in the pool in a distributed system?
One common approach is to use Bloom Filter. In a nutshell, a bloom filter is a space-efficient system that allows you to test if an element is in a set. However, it may have false positive. In other words, a bloom filter can tell you that , either a URL is definitely not in the pool or it is probably in the pool.
To briefly explain how bloom filter works, an empty bloom filter is a bit array of m bits (all 0). There are also k hash functions that map each element to one of the m bits. So when we add a new element (URL) into the bloom filter, we will get k bits from the hash functions and set all of them to 1. Thus, when we check the existence of an element, we first get the k bits for it and if any of them is not 1, we know immediately that the element doesn’t exist. However, if all of the k bits are 1, this can come from the combination of several other elements.
Bloom filter is a very commonly used technique and it’s the perfect solution for deduping URLs in a web crawler.
Advantages and Disadvantages of SQL databases
Advantages:
- Faster Query Processing
- No Coding Skills
- Standardized Language
- Portable
- Easy to learn
Disadvantages
- Interface can get complex
- can be costly
- Need to have relationships between data
Advantages and Disadvantages of No -SQL databases
Advantages
- It stores as a key value system
- Flexibiltiy to add/remove columns
- In Sql ,you may need joins sometimes , to retrieve data from different tables , but here all your required data is contained in one block. So it’s little easier to insert and retrieve data from NoSql.
- These DB’s are built for finding metrics and getting intelligent data
Disadvantages
- Relations and Joins are hard to apply
- Reads can be slow
- In NoSQL two same id’s may have totally different data which makes updates difficult
How can you do async processing?
RabbitMQ. RabbitMQ is one of many systems which help to implement async processing.
No SQL DB types
Key-Value Stores: Data is stored in an array of key-value pairs. Well-known key-value stores include Redis, Voldemort, and Dynamo. Used for rapidly changing data(like in memory cache) as it gives better performance
Document Databases: In these databases, data is stored in documents and these documents are grouped together in collections. Its a key-value store with documents stored as values. Each document can have an entirely different structure. CouchDB and MongoDB. provide high flexibility and are often used for working with occasionally changing data.
Wide-Column Databases: Instead of ‘tables,’ we have column families, which are containers for rows. Unlike relational databases, we don’t need to know all the columns up front and each row doesn’t have to have the same number of columns. Columnar databases are best suited for analyzing large datasets - big names include Cassandra and HBase.
Graph Databases: These databases are used to store data whose relations are best represented in a graph. Data is saved in graph structures with nodes (entities), properties (information about the entities), and lines (connections between the entities). Examples of graph database include Neo4J and InfiniteGraph. used to model complex relations
80-20 rule in system design
it means 20% of the requests generates 80% traffic
How many bits does each base64 encoding digit represent
Each non-final Base64 digit represents exactly 6 bits of data
Data Partitioning and Replication
- Range Based Partitioning(DB Sharding): We can store URLs in separate partitions based on the hash key’s first letter. Hence we save all the URLs starting with the letter ‘A’ (and ‘a’) in one partition, save those that start with the letter ‘B’ in another partition, and so on. It should be a static partitioning scheme to always store/find a URL in a predictable manner.
The main problem with this approach is that it can lead to unbalanced partitions. For example, we decide to put all URLs starting with the letter ‘E’ into a DB partition, but later we realize that we have too many URLs that start with the letter ‘E.’
- Hash-Based Partitioning: In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash. In our case, we can take the hash of the ‘key’ or the short link to determine the partition in which we store the data object.
Our hashing function will randomly distribute URLs into different partitions This approach can still lead to overloaded partitions
How much memory do modern day servers have
modern-day server can have 256GB memory,
XMPP
XMPP is the Extensible Messaging and Presence Protocol, a set of open technologies for instant messaging, presence, multi-party chat, voice and video calls, collaboration, lightweight middleware, content syndication, and generalized routing of XML data.
Traffic can be represented as ?
URL’s/s. Like how many redirect url’s come in for TinyURL
What would be Queries Per Second (QPS) for our system? New URLs shortenings per second: assuming we get 500 Million new url’s generated/month
500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s
Bandwidth can be represented as ?
this is KB/s.
For read requests, since every second we expect ~20K URLs redirections, total outgoing data for our service would be 10MB per second:
20K * 500 bytes = ~10 MB/s
DNS Servers vs ISP Servers
Internet service providers (ISPs) bring the bandwidth , from where it’s produced, to wherever you happen to be sitting, wanting to use it. They’re the freight trucking companies of the Internet. If they do a poor job, the bandwidth is slow, packet delivery is unreliable, and performance is unpredictable. Eg like comcast, AT&T
The Domain Name System (DNS) is like the telephone directory of the Internet, which your computer or mobile device use to look up the addresses of the web sites you want to visit