System Design Week 3 - Web Crawler Flashcards
Problem 1) Web Crawler
Problem statement: Design a web crawler that supports the following use cases -
Data indexing - search and indexing
Web archiving - collect and store data for future use
Web mining - extract and use knowledge
Web monitoring - monitor certain activities, such as price changes on competitors' websites, copyright violations, etc.
Clarifying questions:
Question 1: Is the data constantly changing, or is it static?
Question 2: Clarify the purpose - indexing, archiving, or something else?
Question 3: Ask for an example of a query and the expected output.
Question 1) List down the functional requirements
- Crawl HTML pages (photos, images, videos, etc. can be excluded).
- Avoid duplicate content.
- Store crawled pages for 5 years.
- Should not send too many requests to any single website (politeness).
- Don't crawl links disallowed by robots.txt (sketched below).
- Monthly refresh.
- Avoid spam, malicious content, and infinite loops (crawler traps).
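A minimal sketch of the robots.txt check using Python's standard-library robotparser; the example.com URLs and the "MyCrawler" user-agent name are placeholders, not part of the original notes:

```python
from urllib import robotparser

# Parse the site's robots.txt; in practice, cache one parser per host.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# "MyCrawler" is a hypothetical user-agent name for this sketch.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```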
Question 2) List down the non-functional requirements
- Total pages ~= 3 billion
- Operations per second ~= 3 * 10^9 pages / 30 days / ~100,000 s per day ~= 1,000 pages/s
- Average web page size ~= 100 KB
- Storage requirement = 3 * 10^9 pages * 10^5 bytes * 12 * 5 (monthly snapshots for 5 years) ~= 18 PB
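A quick sanity check of these back-of-envelope numbers; the inputs are exactly the assumptions stated above:

```python
PAGES = 3 * 10**9          # total pages to crawl
PAGE_SIZE = 100 * 10**3    # average page size in bytes (~100 KB)
SECONDS_PER_DAY = 100_000  # rounded from 86,400 for easy mental math

# Throughput: one full crawl per month.
qps = PAGES / 30 / SECONDS_PER_DAY
print(f"~{qps:,.0f} pages/s")               # ~1,000 pages/s

# Storage: one snapshot per month, retained for 5 years.
storage_bytes = PAGES * PAGE_SIZE * 12 * 5
print(f"~{storage_bytes / 10**15:.0f} PB")  # ~18 PB
```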
Question 3) Microservices
- Seed URLs - starting point of the crawl process
- URL Frontier - holds the URLs that are yet to be downloaded
- HTML Downloader - downloads the web pages
- Content Parser - parses and validates web pages for indexing
- Content Seen service - avoids processing duplicate page content
- URL Extractor - extracts URLs from pages and converts relative paths to absolute ones
- URL Filter - removes blocked URLs and error links
- URL Seen service - excludes already-visited URLs (the sketch below wires these components together)
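A minimal single-threaded sketch of this pipeline; function bodies are illustrative, and a real crawler would distribute each stage and add politeness, retries, and a proper HTML parser:

```python
import hashlib
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

seen_urls = set()       # URL Seen service
seen_content = set()    # Content Seen service (hashes of page bodies)
frontier = deque(["https://en.wikipedia.org"])  # Seed URLs -> URL Frontier
seen_urls.add("https://en.wikipedia.org")

def download(url):
    # HTML Downloader: fetch the raw page (no retries or politeness here).
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def extract_urls(base_url, html):
    # URL Extractor: a real parser belongs here; this placeholder just
    # illustrates relative -> absolute conversion with urljoin.
    hrefs = []  # imagine these came from parsing <a href=...> tags
    return [urljoin(base_url, h) for h in hrefs]

def url_filter(url):
    # URL Filter: keep only http(s) links; real filters also check
    # robots.txt, blocklists, and file extensions.
    return urlparse(url).scheme in ("http", "https")

while frontier:
    url = frontier.popleft()
    html = download(url)
    digest = hashlib.sha256(html).hexdigest()
    if digest in seen_content:   # Content Seen: skip duplicate content
        continue
    seen_content.add(digest)
    for link in extract_urls(url, html):
        if url_filter(link) and link not in seen_urls:
            seen_urls.add(link)  # URL Seen: record before enqueueing
            frontier.append(link)
```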
Question 4) Logical architecture
https://drive.google.com/file/d/1WVd4LPGphr1urmDb_7rZ_EzY7L6W2pWS/view?usp=sharing
Question 5) Schema design
- A list of URLs to crawl.
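A minimal sketch of what that URL table could look like; every column beyond the URL itself is an assumption added for illustration:

```python
import sqlite3

# Illustrative schema only; a production crawler would likely use a
# distributed store rather than SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE urls (
        url             TEXT PRIMARY KEY,  -- absolute URL
        status          TEXT,              -- e.g. 'pending' | 'crawled' | 'failed'
        last_crawled_at TEXT,              -- timestamp of the last successful fetch
        content_hash    TEXT               -- used by the Content Seen service
    )
""")
conn.execute("INSERT INTO urls (url, status) VALUES (?, ?)",
             ("https://wikipedia.org", "pending"))
```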
Question 6) API Design
- GET - fetch a set of URLs
- POST - add seed URLs
- All CRUD operations on the URL table
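A sketch of the two main endpoints using Flask; the route paths, the `limit` parameter, and the in-memory store are all illustrative assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
seed_urls: list[str] = []  # stand-in for the URL table

@app.get("/urls")
def fetch_urls():
    # GET: return a batch of URLs for the frontier.
    # 'limit' is a hypothetical query parameter.
    limit = int(request.args.get("limit", 100))
    return jsonify(seed_urls[:limit])

@app.post("/seed-urls")
def add_seed_urls():
    # POST: accept a JSON body like {"urls": ["https://wikipedia.org"]}.
    urls = request.get_json().get("urls", [])
    seed_urls.extend(urls)
    return jsonify({"added": len(urls)}), 201
```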
Question 7) Business Logic
- Fetch a set of URLs from the DB and pass them to the URL Frontier.
- Seed URLs should let the crawler find and traverse as many links as possible (e.g., yahoo.com, bbc.com, wikipedia.org).
- Selection of starting URLs could be based on topic, country, etc.
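Building on the illustrative URL table sketched earlier, fetching a batch for the frontier might look like this; the status values and batch size are assumptions:

```python
def fetch_batch_for_frontier(conn, batch_size=100):
    # Pull pending URLs and mark them in-flight so other workers skip them.
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending' LIMIT ?",
        (batch_size,),
    ).fetchall()
    urls = [r[0] for r in rows]
    conn.executemany(
        "UPDATE urls SET status = 'in_flight' WHERE url = ?",
        [(u,) for u in urls],
    )
    return urls
```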
Question 8) Design Considerations
CAP theorem
AP system; must explain why AP and not CP - the crawler can tolerate stale or slightly inconsistent URL state, but it must stay available to keep crawling.
Scaling
Must discuss whichever of the reasons below apply -
- Scale for storage
- Scale for throughput
- Scale for API parallelization
- Need to remove hotspots
- Availability and geo-distribution
Here the main reasons are storage (~18 PB) and throughput (~1,000 pages/s); geo-distribution of data is not needed (see below).
Sharding
- Explain why (or why not) sharding is required here.
- Is vertical or horizontal sharding required?
- What will be the partition key?
- Are a fixed number of shards or dynamic shard servers required?
- Consistent hashing must be mentioned along with a dynamic number of shards.
Sharding this DB is not that important, though vertical sharding can be included. Since a seed list of URLs is maintained, it can be statically sharded with a simple hash function.
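Since consistent hashing comes up for the dynamic-shard case, here is a toy illustration; shard names are placeholders, and a real ring would add virtual nodes for balance:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring; real systems add virtual nodes."""

    def __init__(self, nodes):
        # Place each node on the ring at its hash position.
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first node at or after the key's hash.
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("https://wikipedia.org"))  # maps the URL to a shard
```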
Replication
Required. Must explain the reason, e.g., for availability as well as throughput.
Caching
Not required here; must explain why - each page is typically fetched once per crawl cycle, so there is little repeated-read locality to cache.
API Parallelisation
Must explain that API parallelization is required only when APIs are bulky; that is not the case here.
Geo-distribution
Geo-distribution of data is not required here. Always call out whether it is required and why.
Load Balancing
Can be included.
Purging/Cleanup
Purge data older than 5 years, as per the retention requirement.
Question 9) Architecture