System Design Week 3 - Web Crawler Flashcards
Problem 1) Web Crawler
Problem statement: Design a web crawler that supports the following use cases -
Data indexing - search and indexing
Web archiving - collect and store data for future use
Web mining - extract and use knowledge
Web monitoring - monitor certain activities, such as price changes on competitors' websites, copyright violations, etc.
Clarifying questions:
Question 1: Is the data constantly changing, or is it static?
Question 2: Clarify the purpose - indexing, archiving, or something else?
Question 3: Ask for an example of a query and the expected output.
Question 1) List down the functional requirements
- Crawl HTML pages (photos, images, videos, etc. can be excluded).
- Avoid duplicate content.
- Store crawled pages for 5 years.
- Should not send too many requests to any single website (politeness).
- Don't crawl links disallowed by robots.txt (sketched below).
- Monthly refresh.
- Avoid spam, malicious content, and infinite loops (crawler traps).
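A minimal sketch of the robots.txt check using Python's standard-library robotparser; the example.com URLs and the "MyCrawler" user-agent name are placeholders, not part of the original notes:

```python
from urllib import robotparser

# Parse the site's robots.txt; in practice, cache one parser per host.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# "MyCrawler" is a hypothetical user-agent name for this sketch.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```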
Question 2) List down the non-functional requirements
- Total pages ~= 3 billion
- Operations per second ~= 3 * 10^9 pages / 30 days / ~100,000 s per day ~= 1,000 pages/s
- Average web page size ~= 100 KB
- Storage requirement = 3 * 10^9 pages * 10^5 bytes * 12 * 5 (monthly snapshots for 5 years) ~= 18 PB
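A quick sanity check of these back-of-envelope numbers; the inputs are exactly the assumptions stated above:

```python
PAGES = 3 * 10**9          # total pages to crawl
PAGE_SIZE = 100 * 10**3    # average page size in bytes (~100 KB)
SECONDS_PER_DAY = 100_000  # rounded from 86,400 for easy mental math

# Throughput: one full crawl per month.
qps = PAGES / 30 / SECONDS_PER_DAY
print(f"~{qps:,.0f} pages/s")               # ~1,000 pages/s

# Storage: one snapshot per month, retained for 5 years.
storage_bytes = PAGES * PAGE_SIZE * 12 * 5
print(f"~{storage_bytes / 10**15:.0f} PB")  # ~18 PB
```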
Question 3) Microservices
- Seed URLs - starting point of the crawl process
- URL Frontier - holds the URLs that are yet to be downloaded
- HTML Downloader - downloads the web pages
- Content Parser - parses and validates web pages for indexing
- Content Seen service - avoids processing duplicate page content
- URL Extractor - extracts URLs from pages and converts relative paths to absolute ones
- URL Filter - removes blocked URLs and error links
- URL Seen service - excludes already-visited URLs (the sketch below wires these components together)
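A minimal single-threaded sketch of this pipeline; function bodies are illustrative, and a real crawler would distribute each stage and add politeness, retries, and a proper HTML parser:

```python
import hashlib
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

seen_urls = set()       # URL Seen service
seen_content = set()    # Content Seen service (hashes of page bodies)
frontier = deque(["https://en.wikipedia.org"])  # Seed URLs -> URL Frontier
seen_urls.add("https://en.wikipedia.org")

def download(url):
    # HTML Downloader: fetch the raw page (no retries or politeness here).
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def extract_urls(base_url, html):
    # URL Extractor: a real parser belongs here; this placeholder just
    # illustrates relative -> absolute conversion with urljoin.
    hrefs = []  # imagine these came from parsing <a href=...> tags
    return [urljoin(base_url, h) for h in hrefs]

def url_filter(url):
    # URL Filter: keep only http(s) links; real filters also check
    # robots.txt, blocklists, and file extensions.
    return urlparse(url).scheme in ("http", "https")

while frontier:
    url = frontier.popleft()
    html = download(url)
    digest = hashlib.sha256(html).hexdigest()
    if digest in seen_content:   # Content Seen: skip duplicate content
        continue
    seen_content.add(digest)
    for link in extract_urls(url, html):
        if url_filter(link) and link not in seen_urls:
            seen_urls.add(link)  # URL Seen: record before enqueueing
            frontier.append(link)
```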
Question 4) Logical architecture
https://drive.google.com/file/d/1WVd4LPGphr1urmDb_7rZ_EzY7L6W2pWS/view?usp=sharing
Question 5) Schema design
- A list of URLs to crawl.
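A minimal sketch of what that URL table could look like; every column beyond the URL itself is an assumption added for illustration:

```python
import sqlite3

# Illustrative schema only; a production crawler would likely use a
# distributed store rather than SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE urls (
        url             TEXT PRIMARY KEY,  -- absolute URL
        status          TEXT,              -- e.g. 'pending' | 'crawled' | 'failed'
        last_crawled_at TEXT,              -- timestamp of the last successful fetch
        content_hash    TEXT               -- used by the Content Seen service
    )
""")
conn.execute("INSERT INTO urls (url, status) VALUES (?, ?)",
             ("https://wikipedia.org", "pending"))
```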
Question 6) API Design
- GET - fetch a set of URLs
- POST - add seed URLs
- All CRUD operations on the URL table
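A sketch of the two main endpoints using Flask; the route paths, the `limit` parameter, and the in-memory store are all illustrative assumptions:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
seed_urls: list[str] = []  # stand-in for the URL table

@app.get("/urls")
def fetch_urls():
    # GET: return a batch of URLs for the frontier.
    # 'limit' is a hypothetical query parameter.
    limit = int(request.args.get("limit", 100))
    return jsonify(seed_urls[:limit])

@app.post("/seed-urls")
def add_seed_urls():
    # POST: accept a JSON body like {"urls": ["https://wikipedia.org"]}.
    urls = request.get_json().get("urls", [])
    seed_urls.extend(urls)
    return jsonify({"added": len(urls)}), 201
```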
Question 7) Business Logic
- Fetch a set of URLs from the DB and pass them to the URL Frontier.
- Seed URLs should let the crawler find and traverse as many links as possible (e.g., yahoo.com, bbc.com, wikipedia.org).
- Selection of starting URLs could be based on topic, country, etc.
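Building on the illustrative URL table sketched earlier, fetching a batch for the frontier might look like this; the status values and batch size are assumptions:

```python
def fetch_batch_for_frontier(conn, batch_size=100):
    # Pull pending URLs and mark them in-flight so other workers skip them.
    rows = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending' LIMIT ?",
        (batch_size,),
    ).fetchall()
    urls = [r[0] for r in rows]
    conn.executemany(
        "UPDATE urls SET status = 'in_flight' WHERE url = ?",
        [(u,) for u in urls],
    )
    return urls
```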
Question 8) Design Considerations
CAP theorem
AP system; must explain why AP and not CP - the crawler can tolerate stale or slightly inconsistent URL state, but it must stay available to keep crawling.
Scaling
Must discuss whichever of the reasons below apply -
- Scale for storage
- Scale for throughput
- Scale for API parallelization
- Need to remove hotspots
- Availability and geo-distribution
Here the main reasons are storage (~18 PB) and throughput (~1,000 pages/s); geo-distribution of data is not needed (see below).
Sharding
- Explain why (or why not) sharding is required here.
- Is vertical or horizontal sharding required?
- What will be the partition key?
- Are a fixed number of shards or dynamic shard servers required?
- Consistent hashing must be mentioned along with a dynamic number of shards.
Sharding this DB is not that important, though vertical sharding can be included. Since a seed list of URLs is maintained, it can be statically sharded with a simple hash function.
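Since consistent hashing comes up for the dynamic-shard case, here is a toy illustration; shard names are placeholders, and a real ring would add virtual nodes for balance:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring; real systems add virtual nodes."""

    def __init__(self, nodes):
        # Place each node on the ring at its hash position.
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first node at or after the key's hash.
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
print(ring.node_for("https://wikipedia.org"))  # maps the URL to a shard
```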
Replication
Required. Must explain the reason, e.g., for availability as well as throughput.
Caching
Not required here; must explain why - each page is typically fetched once per crawl cycle, so there is little repeated-read locality to cache.
API Parallelisation
Must explain that API parallelization is required only when APIs are bulky; that is not the case here.
Geo-distribution
Geo-distribution of data is not required here. Always call out whether it is required and why.
Load Balancing
Can be included.
Purging/Cleanup
Purge data older than 5 years, as per the retention requirement.
Question 9) Architecture