Web Crawling Flashcards
What is the cycle of the web crawling process?
- The crawler downloads a set of seed pages that are parsed and scanned for new links
- The links that have not yet been downloaded are added to a central queue for later download
- The crawler selects a new page for download and the process repeats until a stop criterion is met
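A minimal sketch of this cycle in Python, assuming the third-party `requests` and `beautifulsoup4` packages are available; the seed list, timeout, and page-budget stop criterion are illustrative choices:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    """Repeat the download -> parse -> enqueue cycle until a stop criterion is met."""
    frontier = deque(seeds)        # central queue of URLs awaiting download
    seen = set(seeds)              # URLs already queued or downloaded
    while frontier and max_pages > 0:
        url = frontier.popleft()   # select the next page to download
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue               # skip pages that fail to download
        max_pages -= 1
        # Parse the downloaded page and scan it for new links.
        soup = BeautifulSoup(resp.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:   # only enqueue links not yet seen
                seen.add(link)
                frontier.append(link)
        yield url, resp.text

# Usage: for url, html in crawl(["https://example.com/"]): ...
```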
What 2 things must a web crawler be?
- Be robust: immune to traps and malicious behaviour from web servers
- Be polite: respect implicit and explicit politeness considerations
What is the difference between explicit and implicit politeness?
Explicit politeness means following the webmaster's specifications of which portions of the site may be crawled
Implicit politeness means following conventions that are not explicitly specified, such as not hitting a site too often
What is robots.txt?
A file that lives on a web server specifying access restrictions to tell crawlers what can and cannot be crawled
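Python's standard library ships a robots.txt parser, so a polite crawler can check each URL before fetching it; the user-agent string and URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()   # fetch and parse the file from the server

# Ask whether a given user agent may crawl a given URL.
if rp.can_fetch("MyCrawler", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")

# Sites may also declare a Crawl-delay directive (None if absent).
print(rp.crawl_delay("MyCrawler"))
```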
What things should a web crawler do?
- Be capable of running on distributed machines
- Be scalable: adding more machines should increase the crawl rate
- Fully use available processing and network resources for efficiency
- Fetch pages of higher quality first
- Continue fetching fresh copies of previously fetched pages
- Be able to adapt to new data formats and protocols
What is the difference between a general web search and a vertical web search?
General is the type done by large search engines. Vertical is when a set of target pages is delineated by a topic, country, or language
What is the coverage and quality balance?
Crawlers must balance having many pages that could be used to answer many queries with the fact that pages should be of high quality
What is a vertical web crawler?
A web crawler focused on a particular subset of the web, which may be defined geographically, linguistically, topically, or by data type
How can a web crawler be used to analyze a web site?
- Scanning for broken links (see the sketch after this list)
- Validating the code of pages
- Analyzing directories by finding sites that are no longer available
- Looking for security issues in structure, code, or scripts
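For the broken-link scan mentioned above, a hedged sketch using `requests`; treating any HTTP status of 400 or above (or a failed request) as "broken" is a simplification:

```python
import requests

def find_broken_links(links):
    """Report links that fail to resolve or that return an error status."""
    broken = []
    for url in links:
        try:
            # HEAD avoids downloading the full body; some servers reject HEAD,
            # in which case a GET fallback would be needed.
            resp = requests.head(url, allow_redirects=True, timeout=5)
            if resp.status_code >= 400:
                broken.append((url, resp.status_code))
        except requests.RequestException as exc:
            broken.append((url, str(exc)))
    return broken
```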
What are the 3 axes by which we classify crawlers?
- Quality: trustworthiness, veracity
- Freshness: volatility and newness of info
- Volume: Amount of data being analyzed
What are the 3 main modules of the web crawler?
- Scheduler: maintains the queue of URLs to visit
- Downloader: downloads pages
- Storage: indexes pages and provides the scheduler with metadata about the pages received
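One way these three modules might fit together, sketched with illustrative class names (the regex link extraction and `requests` usage are simplifying assumptions):

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

class Scheduler:
    """Maintains the queue (frontier) of URLs still to visit."""
    def __init__(self, seeds):
        self.frontier = deque(seeds)
        self.seen = set(seeds)
    def next_url(self):
        return self.frontier.popleft() if self.frontier else None
    def enqueue(self, urls):
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.append(url)

class Downloader:
    """Downloads pages over HTTP."""
    def fetch(self, url):
        return requests.get(url, timeout=5).text

class Storage:
    """Indexes pages and hands metadata (here, the outlinks) back to the scheduler."""
    def __init__(self):
        self.index = {}
    def store(self, url, html):
        self.index[url] = html
        hrefs = re.findall(r'href="([^"]+)"', html)
        return [urljoin(url, h) for h in hrefs]

def crawl_step(scheduler, downloader, storage):
    """One pass through the scheduler -> downloader -> storage loop."""
    url = scheduler.next_url()
    if url is None:
        return False
    scheduler.enqueue(storage.store(url, downloader.fetch(url)))
    return True
```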
What is the difference between long-term and short-term scheduling?
Long-term scheduling decides which pages to visit next
Short-term scheduling rearranges the order in which pages are downloaded to satisfy politeness requirements
How is short term scheduling implemented?
The scheduler maintains several queues (one per site), each listing the pages to download from that site, so that the same site is not revisited too often
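A sketch of that idea, with the per-host delay value and data structures as illustrative assumptions:

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteScheduler:
    """Short-term scheduling: one queue per site, with a minimum delay
    between successive requests to the same host."""
    def __init__(self, delay_seconds=10.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)   # host -> queue of URLs for that site
        self.next_allowed = {}             # host -> earliest time of next request

    def enqueue(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL from any host whose politeness delay has elapsed."""
        now = time.monotonic()
        for host, queue in self.queues.items():
            if queue and now >= self.next_allowed.get(host, 0.0):
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None   # every host with pending URLs is still inside its delay window
```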
How do we handle the fact that there are infinitely many dynamic web pages and we can’t expect our crawler to download them?
We choose a maximum depth of links to follow
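A small sketch of enforcing that cutoff, assuming the frontier stores `(url, depth)` pairs and a hypothetical `MAX_DEPTH` constant:

```python
MAX_DEPTH = 5   # illustrative cutoff on link depth from the seed pages

def enqueue_outlinks(frontier, seen, outlinks, parent_depth):
    """Only follow links up to MAX_DEPTH hops from a seed page."""
    if parent_depth + 1 > MAX_DEPTH:
        return   # dynamically generated pages can spawn links forever; stop descending
    for url in outlinks:
        if url not in seen:
            seen.add(url)
            frontier.append((url, parent_depth + 1))
```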
What is the difference between intentional and unintentional page duplication?
Intentional duplication is the deliberate mirroring of pages, e.g. for load balancing or to provide other languages
Unintentional duplication arises from how a site is built, e.g. the same page reachable under different query parameters
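One common way to catch duplicates of either kind is to fingerprint the downloaded content rather than trust the URL; the exact-hash approach below is an illustrative assumption (near-duplicate techniques such as shingling or simhash go further):

```python
import hashlib

seen_fingerprints = set()

def is_duplicate(html: str) -> bool:
    """Detect pages whose content has already been stored under another URL."""
    fingerprint = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True             # same content reached through a different URL
    seen_fingerprints.add(fingerprint)
    return False
```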