01 Web search crawler Flashcards
what is web crawling
the process of locating, fetching, and storing pages available on the web
computer programs that perform this task are referred to as:
- crawler
- spider
- harvester
- robot
what is a web crawler repository
- a cache for online content
- provides quick access to physical copies of pages
- speeds up the indexing process
what is the fundamental assumption
- the web is well linked
crawlers exploit the hyperlink structure
basic web crawling process (a sketch follows below)
- initialise the URL download queue (URL frontier) with some seed URLs
- repeat:
  - fetch the content of a URL from the queue
  - store the fetched content in the repository
  - extract hyperlinks from the content
  - add new links to the download queue
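a minimal Python sketch of this loop, assuming a breadth-first frontier and a crude regex-based link extractor (a real crawler would use a proper HTML parser, a persistent repository, and politeness controls):

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)          # URL frontier (download queue)
    seen = set(seed_urls)                # URLs already discovered
    repository = {}                      # fetched content keyed by URL
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()         # next URL from the queue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip pages that fail to download
        repository[url] = html           # store fetched content
        # extract hyperlinks (a regex is a simplification for this sketch)
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)           # remember the discovered link
                frontier.append(link)    # add new links to the queue
    return repository
```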
what are the crawler requirements
scalability
- distribute and increase crawl rate by adding machines
robustness
- avoid spam and spider traps
selection
- cannot index everything; need to select which pages to crawl
duplicates
- integrate duplication detection
politeness
- avoid overloading crawled sites
freshness
- refresh crawled content
what are the crawling challenges
- how to distribute crawling
- how to make best use of resources
- how deep should the site be crawled
- how often should we crawl
during a crawl, URLs can be divided into 3 sets
- downloaded
- discovered (but not yet downloaded)
- undiscovered
what is robots.txt
explicit politeness: advises web crawlers on which parts of the site are accessible (a sketch follows below)
implicit politeness
even without a specification, avoid hitting a site too often
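a short sketch of the explicit-politeness check using Python's standard-library robots.txt parser; the user-agent string and URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                    # download and parse robots.txt

# may this crawler fetch the given page?
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))
# Crawl-delay directive for this agent, or None if the site does not set one
print(rp.crawl_delay("MyCrawler"))
```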
spider traps
crawlers need to avoid
- ill-formed HTML
- misleading / hostile sites
- spam
solutions to spider traps
- no automatic technique is foolproof
- check URL length
- trap guards (a guard sketch follows this list)
- prepare crawl statistics
- add a blacklist to the guard module
- eliminate URLs with non-textual data types
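a sketch of a simple guard module combining these checks; the length limit, blacklist, and extension list are illustrative assumptions:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 256                                    # assumed limit for suspiciously long URLs
BLACKLIST = {"spammy-site.example"}                     # hypothetical hosts flagged by crawl statistics
NON_TEXTUAL = (".jpg", ".png", ".gif", ".pdf", ".zip")  # non-textual data types to skip

def should_crawl(url):
    parsed = urlparse(url)
    if len(url) > MAX_URL_LENGTH:                       # overly long URLs often indicate a trap
        return False
    if parsed.hostname in BLACKLIST:                    # blacklisted site
        return False
    if parsed.path.lower().endswith(NON_TEXTUAL):       # non-textual content
        return False
    return True

print(should_crawl("https://example.com/article.html"))  # True
print(should_crawl("https://example.com/photo.jpg"))     # False
```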
duplication detection
if a page is already in the index, avoid wasting resources on it
exact duplicates:
easy to eliminate using hashing
near duplicates:
difficult to eliminate
identified using document fingerprints or shingles (a sketch follows below)
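a minimal sketch of both checks: a content hash for exact duplicates and word shingles with Jaccard resemblance for near duplicates (the shingle size and any flagging threshold are assumptions):

```python
import hashlib

def content_hash(text):
    """exact duplicates: identical pages produce identical hashes"""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=4):
    """near duplicates: the set of k-word shingles of a page"""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """resemblance of two shingle sets; 1.0 means identical shingle sets"""
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
print(content_hash(d1) == content_hash(d2))   # False: not exact duplicates
print(jaccard(shingles(d1), shingles(d2)))    # resemblance; flag pairs above a chosen threshold
```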
key crawling components
- URL frontier: queue of URLs to be crawled
- seen URLs: set of URLs already discovered (a normalisation sketch follows this list)
- fetcher: downloads URLs
- parser: extracts outgoing links
- URL filter: removes unwanted URLs, e.g. links to images
- content-seen filter: eliminates duplicate pages
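a sketch of normalising URLs before adding them to the seen set, so trivially different forms of the same link are not crawled twice; the specific rules (lowercase scheme and host, drop fragment, drop default port) are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit

def normalise(url):
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if parts.scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                       # drop the default port
    path = parts.path or "/"                       # empty path and "/" are the same page
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))  # drop #fragment

seen = set()
seen.add(normalise("http://Example.com:80/index.html#top"))
seen.add(normalise("http://example.com/index.html"))
print(len(seen))   # 1: both forms map to the same canonical URL
```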
URL prioritisation
2 queues of URLs (a refresh-queue sketch follows this list)
1. discovery queue, prioritised by:
- random
- breadth first
- in-degree (more links pointing to the page)
- PageRank
2. refreshing queue, prioritised by:
- random
- age (older pages first)
- PageRank
- user feedback
- longevity (how often the page is updated)
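a sketch of an age-ordered refreshing queue using a min-heap, so the page crawled longest ago is refreshed first; the URLs and timestamps are illustrative:

```python
import heapq
import time

pages = {
    "https://example.com/news": time.time() - 7 * 86400,    # last crawled a week ago
    "https://example.com/about": time.time() - 3600,         # last crawled an hour ago
}

refresh_queue = []
for url, last_crawled in pages.items():
    heapq.heappush(refresh_queue, (last_crawled, url))        # min-heap: oldest timestamp pops first

while refresh_queue:
    last_crawled, url = heapq.heappop(refresh_queue)
    age_hours = (time.time() - last_crawled) / 3600
    print(f"refresh {url} (age {age_hours:.1f} h)")
```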
breadth first search
append new URLs to the end of the queue
FIFO
requires keeping all nodes of the previous level in memory
depth first search
append new URLs to the start of the queue
LIFO
memory of only depth × branching factor, but the crawl may go too deep
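a tiny illustration of the two queue disciplines with Python's deque; the URL names are placeholders:

```python
from collections import deque

frontier = deque(["b", "c"])         # URLs still waiting from earlier pages
new_links = ["d", "e"]               # links just extracted from the current page

# breadth first (FIFO): append new URLs to the end of the queue
bfs = deque(frontier)
bfs.extend(new_links)
print(list(bfs))                     # ['b', 'c', 'd', 'e']  -> older URLs crawled first

# depth first (LIFO): push new URLs onto the front of the queue
dfs = deque(frontier)
dfs.extendleft(reversed(new_links))
print(list(dfs))                     # ['d', 'e', 'b', 'c']  -> the crawl dives deeper first
```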