01 Web search crawler Flashcards
what is web crawling
the process of locating, fetching, and storing pages available on the web
computer programs that perform this task are referred to as:
- crawler
- spider
- harvester
- robot
what is a web crawler repository
- a cache for online content
- provides quick access to local copies of pages
- speeds up the indexing process
what is the fundamental assumption
- the web is well linked
- crawlers exploit the hyperlink structure
basic web crawling process
- initialise the URL download queue (URL frontier) with seed URLs
- repeat:
  - fetch the content of a URL from the queue
  - store the fetched content in the repository
  - extract hyperlinks from the content
  - add new links to the download queue
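A minimal sketch of this loop in Python, assuming the `requests` and `beautifulsoup4` packages are available; the frontier is a plain FIFO queue and the repository is an in-memory dict, both simplifications of real crawler components.

```python
# Minimal crawl loop: FIFO frontier, in-memory repository (illustrative only).
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)      # URL frontier (download queue)
    seen = set(seed_urls)            # URLs already queued or fetched
    repository = {}                  # url -> page content

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                 # skip unreachable URLs
        repository[url] = response.text

        # extract hyperlinks and add unseen ones to the frontier
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```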
what are the crawler requirements
scalability
- distribute and increase crawl rate by adding machines
robustness
- avoid spam and spider traps
selection
- cannot index everything, how to select
duplicates
- integrate duplication detection
politeness
- avoid overloading crawled sites
freshness
- refresh crawled content
what are the crawling challenges
- how to distribute crawling
- how to make best use of resources
- how deep should the site be crawled
- how often should we crawl
during a crawl, pages can be divided into 3 sets
- downloaded
- discovered
- undiscovered
what is robots.txt
explicit politeness: advises web crawlers which parts of the site are accessible
implicit politeness
even without a specification, avoid hitting a site too often
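A small sketch of an explicit politeness check using Python's standard `urllib.robotparser`; the crawler name and URLs are placeholders.

```python
# Check robots.txt before fetching a URL (explicit politeness).
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # placeholder site
rp.read()                                      # fetch and parse robots.txt

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")

# robots.txt may also suggest a delay between requests (implicit politeness hint)
print(rp.crawl_delay("MyCrawler"))             # None if no Crawl-delay directive
```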
spider traps
crawlers need to avoid
- ill-formed HTML
- misleading or hostile sites
- spam
solutions to spider traps
- no automatic technique can be foolproof
- check URL length
- trap guards
- prepare crawl statistics
- add a blacklist to the guard module
- eliminate URLs with non-textual data types
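A hedged sketch of a simple URL guard combining these heuristics; the length limit, blacklist entries, and extension list are illustrative values, not prescribed ones.

```python
# Simple trap-guard heuristics: reject overly long, blacklisted, or non-textual URLs.
from urllib.parse import urlparse

MAX_URL_LENGTH = 256                              # illustrative limit
BLACKLISTED_HOSTS = {"spam.example.com"}          # filled in from crawl statistics
NON_TEXTUAL_EXTENSIONS = (".jpg", ".png", ".gif", ".zip", ".exe", ".mp4")

def is_crawlable(url: str) -> bool:
    if len(url) > MAX_URL_LENGTH:                 # suspiciously long URLs often signal traps
        return False
    parsed = urlparse(url)
    if parsed.netloc in BLACKLISTED_HOSTS:        # known misleading/hostile sites
        return False
    if parsed.path.lower().endswith(NON_TEXTUAL_EXTENSIONS):
        return False                              # non-textual content
    return True
```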
duplication detection
if a page is already in the index, avoid wasting resources on it
exact duplicates:
easy to eliminate using hashing
near duplicates:
difficult to eliminate
identified using document fingerprints or shingles
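A rough illustration of both ideas, assuming word-level shingles and Jaccard similarity as the near-duplicate measure; a production crawler would typically fingerprint the shingles (e.g. with minhash) rather than compare raw sets.

```python
# Exact duplicates: compare content hashes. Near duplicates: compare shingle sets.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 4) -> set:
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

page1 = "the quick brown fox jumps over the lazy dog"
page2 = "the quick brown fox leaps over the lazy dog"

print(content_hash(page1) == content_hash(page2))   # False: not exact duplicates
print(jaccard(shingles(page1), shingles(page2)))    # shingle overlap score (closer to 1.0 means near-duplicate)
```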
key crawling components
- URL frontier: queue of URLs to be crawled
- seen URLs: set of URLs already encountered
- fetcher: downloads the URL's content
- parser: extracts outgoing links
- URL filtering: filters out URLs that point to images and other unwanted content
- content-seen filtering: eliminates duplicate pages
URL prioritisation
2 queues for URLs
1. discovery queue
   - random
   - breadth first
   - in-degree (pages with more incoming links first)
   - PageRank
2. refreshing queue
   - random
   - age (older pages first)
   - PageRank
   - user feedback
   - longevity (how often the page is updated)
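A small sketch of a prioritised frontier using Python's `heapq`, where the score could be in-degree, age, or an estimated PageRank; the scores passed in below are placeholder values.

```python
# Prioritised URL frontier: pop the URL with the highest priority score first.
import heapq
import itertools

class PriorityFrontier:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker so equal scores stay orderable

    def push(self, url: str, score: float) -> None:
        # heapq is a min-heap, so negate the score to pop high scores first
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

frontier = PriorityFrontier()
frontier.push("https://example.com/a", score=3)    # e.g. in-degree of 3
frontier.push("https://example.com/b", score=10)   # e.g. in-degree of 10
print(frontier.pop())                              # /b is crawled first
```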
breadth first search
append new URLs to the end of the queue
FIFO
requires keeping all nodes of the previous level in memory
depth first search
append new URLs to the start of the queue
LIFO
requires memory of only depth × branching factor, but may go too deep
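A tiny illustration of the difference using `collections.deque` as the frontier: appending to the right gives FIFO (breadth-first) order, appending to the left gives LIFO (depth-first) order.

```python
# BFS vs DFS ordering of the frontier, using the same deque type.
from collections import deque

links_found = ["a", "b", "c"]

# breadth-first: new URLs go to the end, oldest URL is fetched next (FIFO)
bfs_frontier = deque(["seed"])
for link in links_found:
    bfs_frontier.append(link)
print(bfs_frontier.popleft())   # -> "seed"

# depth-first: new URLs go to the front, newest URL is fetched next (LIFO)
dfs_frontier = deque(["seed"])
for link in links_found:
    dfs_frontier.appendleft(link)
print(dfs_frontier.popleft())   # -> "c", the most recently discovered link
```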
what are the crawling metrics
Quality metrics
1. coverage
2. freshness
3. page importance
Performance metrics
1. throughput: content download rate
mirror sites
replicas of existing sites
can lead to redundant crawling
can be detected using:
- url similarity
- link structure
- content similarity
geographically distributed web crawling
- higher crawling throughput due to proximity (lower crawling latency)
- improved politeness
- less overhead on routers because of fewer hops
- better coverage
- increased availability
why are data structures important
efficiency of the web crawler
- the seen-URL table keeps growing as more URLs are discovered
- high space requirements
- frequently seen URLs are cached in memory
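A sketch of one way to keep the seen-URL table compact, assuming URLs are stored as fixed-size digests and an in-memory LRU cache fronts a larger on-disk store (simulated here with a plain set); the cache size is an illustrative value.

```python
# Seen-URL table: store fixed-size digests instead of full URLs to save space,
# and keep recently/frequently seen digests in a small in-memory cache.
import hashlib
from collections import OrderedDict

class SeenUrls:
    def __init__(self, cache_size: int = 10_000):
        self._disk = set()            # stand-in for an on-disk store
        self._cache = OrderedDict()   # LRU cache of hot URL digests
        self._cache_size = cache_size

    @staticmethod
    def _digest(url: str) -> bytes:
        return hashlib.sha1(url.encode("utf-8")).digest()   # 20 bytes per URL

    def add(self, url: str) -> None:
        d = self._digest(url)
        self._disk.add(d)
        self._cache[d] = True
        self._cache.move_to_end(d)
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)                  # evict least recently used

    def __contains__(self, url: str) -> bool:
        d = self._digest(url)
        if d in self._cache:                                 # fast path: memory hit
            self._cache.move_to_end(d)
            return True
        return d in self._disk                               # slow path: "disk" lookup
```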