System Design - Web Crawler Flashcards
Functional Requirements?
Crawl web starting from a seed list of URLs
Extract text data from each webpage, store for later processing
New URLs discovered in fetched pages are added to the crawl list
Non-functional requirements
Assume 10 B webpages
with average size of 2 MB per webpage
Assume the training pipeline needs the fetched pages within 5 days
- Fault tolerance: gracefully resume crawling if any node fails
- Politeness: follow the rules of robots.txt
- Efficiency: complete the crawl in under 5 days
- Scalability: handle 10B pages
API / System Interface
Inputs: URLs to crawl
Outputs: text extracted from webpages
Data Flow
1. Take a seed URL from the frontier and request its IP from DNS
2. Fetch the HTML from the external server using that IP
3. Extract the text data from the HTML
4. Store the text data in a database
5. Extract any linked URLs from the web page and add them to the list of URLs to crawl
6. Repeat steps 1-5 until all URLs have been crawled (a minimal sketch of this loop follows)
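A minimal sketch of this loop in Python, assuming a simple in-memory frontier and hypothetical helpers store_text, extract_text, and extract_links (a real parser would use an HTML library):

```python
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_urls, store_text, extract_text, extract_links):
    """Single-threaded sketch of the crawl loop; the real system spreads
    this work across many crawler workers and a durable frontier queue."""
    frontier = deque(seed_urls)                    # frontier of URLs to crawl
    seen = set(seed_urls)
    while frontier:
        url = frontier.popleft()                   # step 1 (requests also performs the DNS lookup)
        html = requests.get(url, timeout=10).text  # step 2: fetch HTML from the external server
        store_text(url, extract_text(html))        # steps 3-4: extract and store the text data
        for link in extract_links(html):           # step 5: queue newly discovered URLs
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
```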
Tips:
Ask interviewer about how we get the initial list of URLs
High Level Design
Frontier Queue: The queue of URLs we need to crawl. We will start with a set of seed URLs and add new URLs as we crawl the web. The technology used could be something like Kafka, Redis, or SQS. We’ll decide on the technology later.
Crawler: Fetches web pages, extracts text data, and extracts new URLs to add to the frontier queue. In the next section we’ll talk about how to scale this component to handle the 10B pages we need to crawl.
DNS: Resolves domain names to IP addresses so that the crawler can fetch the web pages. There are interesting discussions to be had about how to cache DNS lookups, handle DNS failures, and ensure that we are not overloading DNS servers. Again, more on this later.
Webpage: The external server that hosts the web pages we are crawling. We’ll fetch the HTML from these servers and extract the text data.
S3 Text Data: This is where we’ll store the text data we extract from the web pages. We choose S3 as our blob storage because it is highly scalable and durable. It is designed to store large amounts of data cheaply. Other hosted storage solutions like Google Cloud Storage or Azure Blob Storage could also be used.
Deep Dive 1
How can we ensure we are fault tolerant and don’t lose progress?
Break the crawler into pipelined stages that can be retried if any of them fail. Stages:
1) URL fetching, saving the raw HTML to blob storage
2) Text and URL extraction
The crawler (consumer) reads a URL from the message and tries to fetch it. If successful, it saves the HTML to S3 and places the S3 URL on the parser queue (yet another message queue).
It also updates a row in the Metadata DB to record that the URL has been downloaded; that row can store the S3 URL as well. This is good for auditing: we might want to compare the number of URLs crawled, or see how long the entire process took.
We need to retry when the crawler is unable to download a URL’s contents: exponential backoff with a max of 5 retries. SQS supports this and it is configurable.
The parser reads the HTML from the S3 URL and extracts the text and new URLs. It stores the extracted text in S3 and adds the new URLs to the frontier queue.
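A sketch of the fetch stage as an SQS consumer using boto3; the queue URLs and bucket name are made up, and the Metadata DB update is left as a comment:

```python
import hashlib

import boto3
import requests

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

FRONTIER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123/frontier"  # hypothetical
PARSER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123/parser"      # hypothetical
BUCKET = "raw-html-bucket"                                               # hypothetical

def run_fetch_stage():
    while True:
        resp = sqs.receive_message(QueueUrl=FRONTIER_QUEUE_URL,
                                   MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            url = msg["Body"]
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                # Leave the message unacked; SQS redelivers it after the
                # visibility timeout, and a redrive policy caps retries.
                continue
            key = hashlib.sha256(url.encode()).hexdigest() + ".html"
            s3.put_object(Bucket=BUCKET, Key=key, Body=html.encode())     # save raw HTML
            sqs.send_message(QueueUrl=PARSER_QUEUE_URL,                   # hand off to parser stage
                             MessageBody=f"s3://{BUCKET}/{key}")
            # Update the Metadata DB row for this URL here (omitted).
            sqs.delete_message(QueueUrl=FRONTIER_QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```

In this sketch a failed fetch simply leaves the message unacked, so SQS redelivers it after the visibility timeout; a redrive policy with a max receive count plays the role of the 5-retry cap.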
DD2 - How to make the crawler polite?
Fetch the robots.txt file and make the crawler respect its rules (see the sketch after the steps below):
1. Fetch the robots.txt file for the domain.
2. Parse the robots.txt file and store its rules in the Metadata DB.
3. When we pull a URL off the queue, check the rules stored in the Metadata DB for that domain.
4. If the URL is disallowed, ack the message and move on to the next URL.
5. If the URL is allowed, check the Crawl-delay directive.
6. If the Crawl-delay time has not passed since the last crawl, put the URL back on the queue with the appropriate delay.
7. If the Crawl-delay time has passed, crawl the page and update the last crawl time for the domain.
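A minimal sketch of these checks using Python's built-in urllib.robotparser; the in-memory dicts stand in for the robots rules and last-crawl times the design above keeps in the Metadata DB:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # hypothetical user agent string
_robots_cache = {}         # stand-in for robots rules stored in the Metadata DB
_last_crawl_time = {}      # stand-in for per-domain last-crawl timestamps

def allowed_to_fetch(url):
    """Return (allowed, delay_seconds); delay_seconds > 0 means
    'put the URL back on the queue with this delay'."""
    parsed = urlparse(url)
    domain = f"{parsed.scheme}://{parsed.netloc}"
    rp = _robots_cache.get(domain)
    if rp is None:
        rp = RobotFileParser(domain + "/robots.txt")
        rp.read()                                  # fetch and parse robots.txt
        _robots_cache[domain] = rp
    if not rp.can_fetch(USER_AGENT, url):
        return False, 0                            # disallowed: ack and skip
    crawl_delay = rp.crawl_delay(USER_AGENT) or 0  # Crawl-delay directive, if any
    elapsed = time.time() - _last_crawl_time.get(domain, 0)
    if elapsed < crawl_delay:
        return True, crawl_delay - elapsed         # too soon: re-queue with delay
    _last_crawl_time[domain] = time.time()
    return True, 0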
DD3 - How to fetch 10B pages in under 5 days?
Do some math first:
400 Gbps / 8 bits/byte / 2MB/page = 25,000 pages/second
Assume 30% actual utilization
25,000 pages/second * 30% = 7,500 pages/second
10,000,000,000 pages / 7,500 pages/second = 1,333,333 seconds = 15.4 days for a single machine
15.4 days / 4 machines = 3.85 days. This is under our 5-day requirement, so we’re good to go
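The same back-of-envelope math as a short script, using the 400 Gbps, 2 MB, 30% utilization, and 4-machine figures assumed above:

```python
# Back-of-envelope throughput estimate for the crawl.
bandwidth_bps = 400e9     # 400 Gbps per machine (assumption from above)
page_size_bytes = 2e6     # 2 MB average page size
utilization = 0.30        # assume only 30% of bandwidth is actually usable
total_pages = 10e9        # 10B pages
machines = 4

pages_per_sec = bandwidth_bps / 8 / page_size_bytes * utilization  # 7,500 pages/s
seconds_single = total_pages / pages_per_sec                       # ~1.33M seconds
days_single = seconds_single / 86_400                              # ~15.4 days on one machine
days_cluster = days_single / machines                              # ~3.85 days on 4 machines
print(f"{pages_per_sec:,.0f} pages/s, {days_single:.1f} days alone, "
      f"{days_cluster:.2f} days with {machines} machines")
```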
To scale the Parser workers:
Scale them up and down dynamically based on the number of messages waiting in the parser queue (the further-processing queue); see the sketch below.
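A sketch of a scaling check that reads the queue backlog from SQS (the queue URL and per-worker throughput are assumptions; in practice this would likely be a CloudWatch-driven autoscaling policy rather than a hand-rolled loop):

```python
import boto3

sqs = boto3.client("sqs")
PARSER_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123/parser"  # hypothetical

def desired_parser_workers(messages_per_worker=1000, min_workers=2, max_workers=200):
    """Size the parser fleet in proportion to the queue backlog."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=PARSER_QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    desired = backlog // messages_per_worker + 1
    return max(min_workers, min(max_workers, desired))
```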
How to deal with DNS lookups?
DNS caching: We can cache DNS lookups in our crawlers to reduce the number of DNS requests we need to make. This way all URLs on the same domain reuse the same DNS lookup (see the sketch after this list).
Multiple DNS providers: We can use multiple DNS providers and round-robin between them. This can help distribute the load across multiple providers and reduce the risk of hitting rate limits.
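A sketch of the DNS caching idea above, with an assumed fixed TTL (a production crawler would respect the record's real TTL):

```python
import socket
import time

_dns_cache = {}  # hostname -> (ip, expiry_timestamp)

def resolve(hostname, ttl_seconds=300):
    """Resolve a hostname, reusing a cached answer for up to ttl_seconds."""
    cached = _dns_cache.get(hostname)
    now = time.time()
    if cached and cached[1] > now:
        return cached[0]                  # reuse the cached lookup
    ip = socket.gethostbyname(hostname)   # one real DNS request per domain per TTL
    _dns_cache[hostname] = (ip, now + ttl_seconds)
    return ip
```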
How to check if we’ve crawled a URL before?
What if two webpages from different domains have the exact same content?
Check if the URL has been crawled before by looking it up in the Metadata DB. There could be a secondary index on the URL.
Before parsing a page, hash its content (MD5, for example) and compare the hash against an index of previously seen content hashes; if it matches, skip the page (sketch below).
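A minimal sketch of both checks, with url_already_crawled and content_hash_seen as hypothetical lookups against the Metadata DB indexes:

```python
import hashlib

def should_parse(url, html, url_already_crawled, content_hash_seen):
    """Skip URLs we've fetched before and pages whose content we've already seen."""
    if url_already_crawled(url):                     # lookup via secondary index on URL
        return False
    digest = hashlib.md5(html.encode()).hexdigest()  # content fingerprint
    if content_hash_seen(digest):                    # lookup via index on content hashes
        return False
    return True
```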
Crawler traps. How to avoid them?
We can add a depth field to our URL table in the Metadata DB and increment it each time we follow a link. If the depth exceeds a certain threshold, we can stop crawling the page (see the sketch below).
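A sketch of the depth check; MAX_DEPTH and the enqueue callback are assumptions:

```python
MAX_DEPTH = 20  # assumed threshold; tune based on observed link depth

def enqueue_links(parent_depth, links, enqueue):
    """Follow links only while the chain stays under MAX_DEPTH."""
    child_depth = parent_depth + 1   # stored alongside each URL in the Metadata DB
    if child_depth > MAX_DEPTH:
        return                       # likely a crawler trap; stop following this chain
    for link in links:
        enqueue(link, child_depth)
```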
Other deep dives
How to handle dynamic content: Many websites are built with JavaScript frameworks like React or Angular. This means that the content of the page is not in the HTML that is returned by the server, but is instead loaded dynamically by the JavaScript. To handle this, we’ll need to use a headless browser like Puppeteer to render the page and extract the content.
How to monitor the health of the system: We’ll want to monitor the health of the system to ensure that everything is running smoothly. We can use a monitoring service like Datadog or New Relic to monitor the performance of the crawlers and parser workers and to alert us if anything goes wrong.
How to handle large files: Some websites have very large files that we may not want to download. We can use the Content-Length header to determine the size of the file before downloading it and skip files that are too large (see the sketch at the end of this section).
How to handle continual updates: While our requirements are for a one-time crawl, we may want to consider how we would handle continual updates to the data. This could be because we plan to re-train the model every month, or because our crawler feeds a search engine that needs to be updated regularly. I’d suggest adding a new component, a “URL Scheduler”, that is responsible for scheduling URLs to be crawled. So rather than putting URLs on the queue directly, the parser workers would write URLs to the Metadata DB, and the URL Scheduler would decide when to enqueue them based on last crawl time, popularity, etc.
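As one example from the list above, a sketch of the large-file check that uses a HEAD request to read Content-Length before downloading (the 10 MB cap is an assumption):

```python
import requests

MAX_BYTES = 10 * 1024 * 1024  # assumed 10 MB cap on pages we bother to download

def small_enough_to_fetch(url):
    """Use the Content-Length header (when present) to skip very large files."""
    head = requests.head(url, allow_redirects=True, timeout=10)
    length = head.headers.get("Content-Length")
    if length is None:
        return True          # unknown size: fall through and fetch (or stream with a cap)
    return int(length) <= MAX_BYTES
```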