Lecture 8 - Information Retrieval - Webcrawling Flashcards
What are some complications with web crawling?
Webcrawling is not always feasible with one machine
Malicious pages
- Spam pages
- Spider traps (incl. dynamically generated ones)
Even non-malicious pages pose challenges
- Latency/bandwidth to remote servers vary
- Webmasters' stipulations
- How "deep" should you crawl a site's URL hierarchy?
- Site mirrors and duplicate pages
Politeness - Don’t hit a server too often
What must any crawler do?
Be robust: Be immune to spider traps and other malicious behaviour from web servers
Be polite: Respect implicit and explicit politeness considerations
What is the difference between explicit and implicit politeness in web crawling?
Explicit: Respect specifications from robots.txt
Implicit: Even with no specification, avoid hitting any site too often
What is meant by robots.txt?
Robots.txt is a file at the root of a website that specifies which portions of the site may (and may not) be crawled
Do you need to repeatedly fetch robots.txt?
No. Once you have fetched it, cache it instead of fetching it again.
Fetching it repeatedly burns bandwidth and puts unnecessary load on the web server
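As an illustration of explicit politeness, here is a minimal sketch using Python's standard urllib.robotparser to fetch robots.txt once per host, cache it, and consult the cache for every URL. The function name, cache name, and example.com URL are made up for this sketch.

```python
from urllib import robotparser
from urllib.parse import urlsplit

# Cache one parsed robots.txt per host so it is fetched only once.
_robots_cache = {}

def allowed(url, user_agent="*"):
    """Return True if robots.txt permits crawling this URL."""
    host = "{0.scheme}://{0.netloc}".format(urlsplit(url))
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser(host + "/robots.txt")
        rp.read()                      # fetch robots.txt once per host
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)

# Hypothetical usage:
# if allowed("https://example.com/some/page"):
#     fetch_page("https://example.com/some/page")
```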
What should any crawler do?
Be capable of distributed operation: Designed to run on multiple machines
Be scalable: Designed to increase the crawl rate by adding more machines
Performance/Efficiency: Permit full use of available processing and network resources
Fetch pages of “higher quality” first
Continuous operation: Continue fetching fresh copies of a previously fetched page
Extensible: Adapt to new data formats, protocols
What is meant by a “Crawl Frontier”?
A crawl frontier is one of the components that make up the architecture of a web crawler. It holds the URLs that are yet to be fetched, together with the logic and policies the crawler follows when deciding which URL to visit next
Name the processing steps in crawling
- Pick a URL from the frontier
- Fetch the document at the URL
- Parse the fetched document
- Extract links from it to other docs (URLs)
- Check whether the fetched content has already been seen (content-seen test)
- If not, add to indexes
- For each extracted URL (a sketch of this loop follows below):
- 1. Ensure it passes certain URL filter tests
- 2. Check if it is already in the frontier (duplicate URL elimination)
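A minimal sketch of this processing loop in Python, assuming hypothetical helpers fetch, parse_links, url_filter and index (none of these names come from the lecture) and using a simple hash as the content fingerprint:

```python
from collections import deque

def crawl(seeds, fetch, parse_links, url_filter, index):
    """Minimal sketch of the crawl loop above. fetch, parse_links,
    url_filter and index are placeholder helpers, not from the lecture."""
    frontier = deque(seeds)          # URLs waiting to be crawled
    seen_urls = set(seeds)           # for duplicate URL elimination
    seen_content = set()             # fingerprints for the content-seen test

    while frontier:
        url = frontier.popleft()             # 1. pick a URL from the frontier
        doc = fetch(url)                     # 2. fetch the document at the URL
        if doc is None:
            continue
        links = parse_links(doc, url)        # 3. parse it and extract outgoing links
        fingerprint = hash(doc)              # 4. content-seen test (exact match)
        if fingerprint not in seen_content:
            seen_content.add(fingerprint)
            index(url, doc)                  #    if unseen, add to the indexes
        for link in links:                   # 5. for each extracted URL:
            if not url_filter(link):         #    - URL filter tests
                continue
            if link in seen_urls:            #    - duplicate URL elimination
                continue
            seen_urls.add(link)
            frontier.append(link)
```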
What two goals may collide when configuring the crawl frontier?
Politeness - i.e. don't hit a web server too frequently
Freshness - Crawl some pages more often than others
What is the Mercator URL frontier?
The Mercator frontier is a URL frontier design in which URLs enter at the top into a set of front queues and flow down into a set of back queues before being crawled.
The Front queues manage prioritization
The Back queues enforce politeness
Each queue is FIFO
Only one connection is open at any time to any host
A waiting time of a few seconds occurs between successive requests to a host
High priority pages are crawled preferentially
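A much-simplified sketch of this two-tier design, assuming three front queues and a two-second per-host delay (both arbitrary choices); real Mercator refills back queues lazily, with a bias toward higher-priority front queues:

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

class MercatorFrontier:
    """Simplified Mercator-style URL frontier: front queues handle
    prioritization; back queues (one per host) plus a heap of
    next-allowed-fetch times enforce politeness."""

    def __init__(self, num_front_queues=3, delay=2.0):
        self.front = [deque() for _ in range(num_front_queues)]  # index 0 = highest priority
        self.back = {}      # host -> FIFO queue of URLs waiting for that host
        self.heap = []      # (earliest allowed fetch time, host)
        self.delay = delay  # seconds to wait between requests to the same host

    def add(self, url, priority=0):
        """The prioritizer assigns each incoming URL to one of the front queues."""
        self.front[min(priority, len(self.front) - 1)].append(url)

    def _refill_back_queues(self):
        # Simplification: drain front queues in priority order into
        # per-host back queues.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (time.time(), host))
                self.back[host].append(url)

    def next_url(self):
        """Return the next URL to fetch, sleeping if its host was hit too recently."""
        self._refill_back_queues()
        if not self.heap:
            return None
        ready_at, host = heapq.heappop(self.heap)
        wait = ready_at - time.time()
        if wait > 0:
            time.sleep(wait)  # politeness gap between successive requests to a host
        url = self.back[host].popleft()
        if self.back[host]:
            # This host may be contacted again only after the politeness delay.
            heapq.heappush(self.heap, (time.time() + self.delay, host))
        else:
            del self.back[host]
        return url
```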
What is the difference between Duplication and Near-Duplication? And how can you detect both?
Duplication: Exact match - Can be detected with fingerprints (request fingerprints)
Near-duplication: Approximate match - can be detected by computing syntactic similarity with an edit-distance measure, then applying a similarity threshold (e.g. if similarity is over 80%, the documents are near-duplicates)
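A small sketch of that near-duplicate test, using Levenshtein edit distance normalised into a similarity score and the 80% threshold from the example:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """Syntactic similarity = 1 - normalised edit distance; compare to threshold."""
    if not doc_a and not doc_b:
        return True
    distance = edit_distance(doc_a, doc_b)
    similarity = 1 - distance / max(len(doc_a), len(doc_b))
    return similarity >= threshold
```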
In our paper, do we do any duplicate detection?
Scrapy has a built-in dupefilter class, but it only checks request fingerprints (i.e. exact duplicate requests)
What is meant by shingles?
A document's shingles are its set of unique contiguous n-grams (sequences of n consecutive tokens)
What does the Jaccard Coefficient measure?
The similarity between two sets A and B: J(A, B) = |A ∩ B| / |A ∪ B|, i.e. the size of the intersection divided by the size of the union
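Putting the last two cards together, a minimal sketch that shingles two documents into word n-grams and compares the shingle sets with the Jaccard coefficient (the shingle size k=4 and the example sentences are arbitrary):

```python
def shingles(text, k=4):
    """Return the set of unique word k-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Example: treat the pair as near-duplicates if similarity exceeds a threshold.
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))   # one changed word breaks several shingles
```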
Explain the simple iterative logic for figuring out whether a node is good or bad.
Assumption: good nodes won't point to bad nodes; all other combinations are plausible.
Therefore:
- Good nodes only point to good nodes.
- If you point to a bad node, then you are bad (a good node would never do that).
- If a good node points to you, you are also good (good nodes only point to good nodes).
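A toy sketch of this iterative logic over a small made-up link graph; when the two rules would conflict, this sketch lets "bad" win, which is an assumption rather than part of the lecture:

```python
def propagate_labels(graph, good_seeds, bad_seeds):
    """graph: dict mapping node -> set of nodes it points to.
    Apply the two rules until nothing changes:
      - if you point to a bad node, you are bad
      - if a good node points to you, you are good"""
    good, bad = set(good_seeds), set(bad_seeds)
    changed = True
    while changed:
        changed = False
        for node, targets in graph.items():
            # Rule 1: pointing at any bad node makes you bad.
            if node not in bad and targets & bad:
                bad.add(node)
                good.discard(node)   # assumption: "bad" overrides "good" on conflict
                changed = True
            # Rule 2: being pointed at by a good node makes you good.
            for target in targets:
                if node in good and target not in good and target not in bad:
                    good.add(target)
                    changed = True
    return good, bad

# Tiny made-up example: A -> B -> C, D -> C; A is known good, D is known bad.
graph = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"C"}}
print(propagate_labels(graph, good_seeds={"A"}, bad_seeds={"D"}))
```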