Lecture 8 - Information Retrieval - Web Crawling Flashcards
What are some complications with web crawling?
Web crawling is not always feasible with one machine
Malicious pages:
- Spam pages
- Spider traps, including dynamically generated ones
Even non-malicious pages pose challenges:
- Latency/bandwidth to remote servers vary
- Webmasters' stipulations
- How "deep" should you crawl a site's URL hierarchy?
- Site mirrors and duplicate pages
Politeness: don't hit a server too often
What must any crawler do?
Be robust: Be immune to spider traps and other malicious behaviour from web servers
Be polite: Respect implicit and explicit politeness considerations
What is the difference between explicit and implicit politeness in web crawling?
Explicit: Respect specifications from robots.txt
Implicit: Even with no specification, avoid hitting any site too often
What is meant by robots.txt?
Robots.txt is a file on a web server that specifies which portions of the site may (and may not) be crawled, and by which crawlers
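For illustration, a minimal sketch of checking robots.txt with Python's standard urllib.robotparser; the user agent name "MyCrawler" and the example URLs are made up.

```python
import urllib.robotparser

# A robots.txt file might contain, e.g.:
#   User-agent: *
#   Disallow: /private/
#   Crawl-delay: 5

# Hypothetical example: parse a site's robots.txt and check whether
# our crawler (assumed user agent "MyCrawler") may fetch a given URL.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse robots.txt once, then reuse the parsed rules

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")
```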
Do you need to repeatedly fetch robots.txt?
No. Once you have fetched it, cache it and reuse it; fetching it again for every request burns bandwidth and hits the web server unnecessarily.
What should any crawler do?
Be capable of distributed operation: Designed to run on multiple machines
Be scalable: Designed to increase the crawl rate by adding more machines
Performance/Efficiency: Permit full use of available processing and network resources
Fetch pages of “higher quality” first
Continuous operation: Continue fetching fresh copies of a previously fetched page
Extensible: Adapt to new data formats, protocols
What is meant by a “Crawl Frontier”?
A crawl frontier is one of the components that make up the architecture of a web crawler. It holds the URLs that are yet to be fetched and contains the logic and policies the crawler follows when visiting websites (e.g., prioritization and politeness)
Name the processing steps in crawling
- Pick a URL from the frontier
- Fetch the document at the URL
- Parse the fetched document
- Extract links from it to other docs (URLs)
- Check if the page's content has already been seen
- If not, add it to the indexes
- For each extracted URL
- 1 Ensure it passes certain URL filter tests
- 2 Check if it is already in the frontier (duplicate URL elimination)
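These steps can be written as a simple single-threaded loop. This is a minimal sketch, not any specific crawler: fetch, parse_links, content_fingerprint and url_filter are assumed helper callables, and politeness delays, error handling and distribution are omitted.

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, content_fingerprint, url_filter):
    """Minimal sketch of the crawl loop; the helper callables are assumed."""
    frontier = deque(seed_urls)        # crawl frontier: URLs waiting to be fetched
    seen_urls = set(seed_urls)         # for duplicate URL elimination
    seen_content = set()               # fingerprints of content already seen
    index = []                         # stand-in for the document index

    while frontier:
        url = frontier.popleft()       # 1. pick a URL from the frontier
        doc = fetch(url)               # 2. fetch the document at the URL
        if doc is None:
            continue
        links = parse_links(doc)       # 3. parse the document, extract links

        fp = content_fingerprint(doc)  # 4. check if the content was already seen
        if fp not in seen_content:
            seen_content.add(fp)
            index.append((url, doc))   # 5. if not, add it to the index

        for link in links:             # 6. for each extracted URL...
            if not url_filter(link):   #    ...ensure it passes URL filter tests
                continue
            if link in seen_urls:      #    ...duplicate URL elimination
                continue
            seen_urls.add(link)
            frontier.append(link)
    return index
```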
What two goals may collide when configuring the crawl frontier?
Politeness - i.e., don't hit a web server too frequently
Freshness - Crawl some pages more often than others
What is the Mercator URL frontier?
The Mercator frontier is a URL frontier design: URLs flow in at the top and pass through two tiers of queues before being handed out for fetching.
The Front queues manage prioritization
The Back queues enforce politeness
Each queue is FIFO
Only one connection is open at any time to any host
A waiting time of a few seconds occurs between successive requests to a host
High priority pages are crawled preferentially
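A much-simplified sketch of the front-queue/back-queue idea, assuming a fixed number of priority levels and a single politeness delay; the class and parameter names are made up and many details of the real Mercator design are collapsed.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class SimpleMercatorFrontier:
    """Toy Mercator-style frontier: front queues handle prioritization,
    one back queue per host plus a heap of next-allowed fetch times
    handles politeness."""

    def __init__(self, num_priorities=3, politeness_delay=2.0):
        self.front = [deque() for _ in range(num_priorities)]  # front queues (0 = highest priority)
        self.back = {}            # host -> FIFO back queue of URLs for that host
        self.heap = []            # (earliest allowed fetch time, host)
        self.next_allowed = {}    # host -> earliest time we may hit it again
        self.delay = politeness_delay

    def add(self, url, priority=0):
        """Prioritizer: place the URL into the front queue for its priority."""
        self.front[priority].append(url)

    def _move_one_to_back(self):
        """Move one URL from the highest non-empty front queue to its host's back queue."""
        for q in self.front:
            if q:
                url = q.popleft()
                host = urlparse(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (self.next_allowed.get(host, 0.0), host))
                self.back[host].append(url)
                return True
        return False

    def next_url(self):
        """Hand out the next URL, keeping successive requests to the same host
        at least `politeness_delay` seconds apart."""
        while self._move_one_to_back():   # in this toy version, drain the front queues
            pass
        while self.heap:
            ready_at, host = heapq.heappop(self.heap)
            queue = self.back[host]
            if not queue:
                del self.back[host]       # host has no pending URLs any more
                continue
            wait = ready_at - time.time()
            if wait > 0:
                time.sleep(wait)          # politeness: wait before hitting this host again
            url = queue.popleft()
            self.next_allowed[host] = time.time() + self.delay
            heapq.heappush(self.heap, (self.next_allowed[host], host))
            return url
        return None                       # frontier is empty
```

In this toy version the front queues only decide the order in which URLs flow into the per-host back queues; roughly speaking, the real Mercator design keeps more back queues than crawler threads and refills a back queue from the front queues only when it empties.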
What is the difference between Duplication and Near-Duplication? And how can you detect both?
Duplication: exact match - can be detected with fingerprints (e.g., request or document fingerprints)
Near-duplication: approximate match - can be detected by computing syntactic similarity with an edit-distance measure and then applying a similarity threshold (e.g., if similarity is over 80%, the document is a near-duplicate)
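A minimal sketch of this thresholding, using Python's difflib similarity ratio as a stand-in syntactic similarity measure; the 0.8 threshold mirrors the 80% example above, and the sample texts are made up.

```python
from difflib import SequenceMatcher

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Flag doc_b as a near-duplicate of doc_a if their syntactic similarity
    exceeds the threshold. SequenceMatcher.ratio() returns a value in [0, 1]."""
    similarity = SequenceMatcher(None, doc_a, doc_b).ratio()
    return similarity >= threshold

# Hypothetical example: two pages that differ only in the final characters.
a = "The quick brown fox jumps over the lazy dog near the river bank."
b = "The quick brown fox jumps over the lazy dog near the riverbank!"
print(is_near_duplicate(a, b))  # True: the texts are almost identical
```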
In our paper, do we do any duplicate detection?
Scrapy has a built-in dupefilter class. It only checks request fingerprints, though (i.e., exact duplicates).
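For reference, a sketch of the relevant Scrapy settings: DUPEFILTER_CLASS names the filter to use, and the value below is Scrapy's default request-fingerprint filter, which only eliminates exact request duplicates.

```python
# settings.py (Scrapy project) -- sketch showing the default dupefilter.
# RFPDupeFilter drops requests whose fingerprint has already been seen,
# i.e. it catches exact request duplicates, not near-duplicate content.
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

# Optional: log every filtered duplicate request instead of only the first one.
DUPEFILTER_DEBUG = True
```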
What is meant by shingles?
The set of unique n-grams (contiguous sequences of n words or characters) in a document
What does the Jaccard Coefficient measure?
The similarity between two sets A and B: J(A, B) = |A ∩ B| / |A ∪ B|
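A small sketch combining the two ideas: build word k-shingles for two documents and compare the shingle sets with the Jaccard coefficient. The shingle size k = 3 and the sample texts are arbitrary choices for illustration.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of unique word k-grams (shingles) of the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| (0 if both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

doc1 = "a rose is a rose is a rose"
doc2 = "a rose is a flower which is a rose"
s1, s2 = shingles(doc1), shingles(doc2)
print(s1)               # e.g. {'a rose is', 'rose is a', 'is a rose'}
print(jaccard(s1, s2))  # similarity between the two shingle sets
```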
Explain the simple iterative logic for figuring out whether a node is good or bad.
Good nodes won't point to bad nodes; all other combinations are plausible.
Therefore:
- Good nodes only point to good nodes.
- If you point to a bad node, then you are bad.
- If a good node points to you, you are also good.
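A toy sketch of applying these rules iteratively on a small link graph, assuming a few trusted seed labels are known in advance; the graph and the seeds are made up for illustration.

```python
def propagate_labels(out_links, seeds, iterations=10):
    """Iteratively apply the rules above on a graph given as
    {node: [nodes it points to]} with seed labels {node: 'good' | 'bad'}."""
    labels = dict(seeds)  # known labels; every other node starts unknown
    for _ in range(iterations):
        changed = False
        for node, targets in out_links.items():
            # Rule: if you point to a bad node, then you are bad.
            if labels.get(node) is None and any(labels.get(t) == "bad" for t in targets):
                labels[node] = "bad"
                changed = True
            # Rule: if a good node points to you, you are also good.
            if labels.get(node) == "good":
                for t in targets:
                    if labels.get(t) is None:
                        labels[t] = "good"
                        changed = True
        if not changed:
            break
    return labels

# Hypothetical graph: A -> B -> C and D -> E, with A known-good and E known-bad.
graph = {"A": ["B"], "B": ["C"], "C": [], "D": ["E"], "E": []}
print(propagate_labels(graph, seeds={"A": "good", "E": "bad"}))
# -> A, B, C labelled good; D labelled bad; E stays bad
```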
What is PageRank scoring based on?
PageRank is a scoring measure based on the idea of citation analysis -> bibliographic coupling frequency
(Articles that co-cite the same articles are related)
Explain how PageRank scoring works
Imagine a user doing a random walk on web pages.
- Start at a random page
- At each step, go out of the current page along one of the links on that page, equiprobably
- “in the long run” each page has a long-term visit rate
- Use this as the page’s score
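A quick simulation of this random walk on a tiny made-up graph with no dead ends; the empirical visit fractions approximate each page's long-term visit rate.

```python
import random

def simulate_walk(out_links, steps=100_000, start=None):
    """Simulate the random surfer and count how often each page is visited."""
    nodes = list(out_links)
    visits = {u: 0 for u in nodes}
    page = start or random.choice(nodes)       # start at a random page
    for _ in range(steps):
        visits[page] += 1
        page = random.choice(out_links[page])  # follow one out-link equiprobably
    return {u: count / steps for u, count in visits.items()}

# Hypothetical graph with no dead ends: every page has at least one out-link.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(simulate_walk(graph))  # long-run visit rates, roughly A ≈ C ≈ 0.4, B ≈ 0.2
```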
Why is PageRank scoring insufficient?
Because the web is full of dead ends (pages with no out-links).
-> Random walks can get stuck in dead ends
How can we deal with dead ends?
Teleporting!
- At any non-dead end, with probability 10%, jump to a random web page
- With remaining 90%, go out on a random link
- 10% is a parameter, i.e., it can be modified
Result of teleporting:
- Cannot be stuck locally
- There is a long-term rate at which any page is visited (not obvious, will show this)
With teleporting, how can we compute the visit rate?
Markov Chains
What do markov chains do?
A Markov chain models the probabilities of transitioning between states over time, where the next state depends only on the current state. For PageRank, the states are web pages, the transition probabilities come from the out-links plus teleporting, and the long-term visit rate is the chain's steady-state (stationary) distribution.
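A minimal sketch of computing the long-term visit rates by power iteration on a tiny made-up graph, with 10% teleporting as above; here a dead end is handled by teleporting uniformly from it, which is one common treatment.

```python
def pagerank(out_links, teleport=0.10, iterations=100):
    """Power iteration: repeatedly apply the random-surfer transition
    (90% follow a random out-link, 10% teleport to a random page)."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}          # start from a uniform distribution

    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for u in nodes:
            targets = out_links[u]
            if targets:
                # With prob (1 - teleport), follow one of u's links equiprobably...
                for v in targets:
                    new_rank[v] += (1 - teleport) * rank[u] / len(targets)
                # ...and with prob teleport, jump to a random page.
                for v in nodes:
                    new_rank[v] += teleport * rank[u] / n
            else:
                # Dead end: jump to a random page with probability 1.
                for v in nodes:
                    new_rank[v] += rank[u] / n
        rank = new_rank
    return rank

# Hypothetical 3-page web: A <-> B, B -> C, and C is a dead end.
graph = {"A": ["B"], "B": ["A", "C"], "C": []}
print(pagerank(graph))  # long-term visit rate (PageRank score) for each page
```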
HITS is an acronym for…
Hyperlink-Induced Topic Search algorithm
Explain the HITS algorithm
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of interrelated pages:
- Hub pages are good lists of links on a subject
- e.g., "Bob's list of cancer-related links"
- Authority pages occur recurrently on good hubs for the subject
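A minimal sketch of the hub/authority iteration on a tiny made-up subgraph: a page's authority score sums the hub scores of pages linking to it, a page's hub score sums the authority scores of pages it links to, and the scores are normalized each round (normalizing by the sum is a simplification).

```python
def hits(out_links, iterations=50):
    """Iteratively compute hub and authority scores for the given link graph."""
    nodes = list(out_links)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}

    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages pointing to the node.
        auth = {v: sum(hub[u] for u in nodes if v in out_links[u]) for v in nodes}
        # Hub update: sum of authority scores of the pages the node points to.
        hub = {u: sum(auth[v] for v in out_links[u]) for u in nodes}
        # Normalize so the scores do not blow up.
        a_norm = sum(auth.values()) or 1.0
        h_norm = sum(hub.values()) or 1.0
        auth = {v: s / a_norm for v, s in auth.items()}
        hub = {u: s / h_norm for u, s in hub.items()}
    return hub, auth

# Hypothetical subgraph: H1 and H2 are link lists pointing at pages P1-P3.
graph = {"H1": ["P1", "P2"], "H2": ["P2", "P3"], "P1": [], "P2": [], "P3": []}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))  # P2 should rank as the top authority
```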
Is HITS best suited for Broad topic queries or page-finding queries?
Broad topic queries
What are some issues with the HITS algorithm?
Topic drift:
- Off topic pages can cause off-topic “authorities” to be returned
- E.g., the neighborhood graph can be about a "super topic" such as diseases rather than leukemia
Mutually Reinforcing Affiliates:
- Affiliated pages/sites can boost each other's scores
- Linkage between affiliated pages is not a useful signal