Lecture 8 - Information Retrieval - Web Crawling Flashcards
What are some complications with web crawling?
Web crawling is not always feasible with one machine
Malicious pages:
- Spam pages
- Spider traps, including dynamically generated ones
Even non-malicious pages pose challenges:
- Latency/bandwidth to remote servers vary
- Webmasters' stipulations
- How "deep" should you crawl a site's URL hierarchy?
- Site mirrors and duplicate pages
Politeness: don't hit a server too often
What must any crawler do?
Be robust: Be immune to spider traps and other malicious behaviour from web servers
Be polite: Respect implicit and explicit politeness considerations
What is the difference between explicit and implicit politeness in web crawling?
Explicit: Respect specifications from robots.txt
Implicit: Even with no specification, avoid hitting any site too often
What is meant by robots.txt?
Robots.txt is a file on a web server that specifies which portions of the site may (and may not) be crawled, and by which crawlers
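For illustration, a minimal sketch of checking robots.txt with Python's standard urllib.robotparser; the user agent name "MyCrawler" and the example URLs are made up.

```python
import urllib.robotparser

# A robots.txt file might contain, e.g.:
#   User-agent: *
#   Disallow: /private/
#   Crawl-delay: 5

# Hypothetical example: parse a site's robots.txt and check whether
# our crawler (assumed user agent "MyCrawler") may fetch a given URL.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse robots.txt once, then reuse the parsed rules

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")
```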
Do you need to repeatedly fetch robots.txt?
No. Once you have fetched it, cache it and reuse it; fetching it again for every request burns bandwidth and hits the web server unnecessarily.
What should any crawler do?
Be capable of distributed operation: Designed to run on multiple machines
Be scalable: Designed to increase the crawl rate by adding more machines
Performance/Efficiency: Permit full use of available processing and network resources
Fetch pages of “higher quality” first
Continuous operation: Continue fetching fresh copies of a previously fetched page
Extensible: Adapt to new data formats, protocols
What is meant by a “Crawl Frontier”?
A crawl frontier is one of the components that make up the architecture of a web crawler. It holds the URLs that are yet to be fetched and contains the logic and policies the crawler follows when visiting websites (e.g., prioritization and politeness)
Name the processing steps in crawling
- Pick a URL from the frontier
- Fetch the document at the URL
- Parse the fetched document
- Extract links from it to other docs (URLs)
- Check if the page's content has already been seen
- If not, add it to the indexes
- For each extracted URL
- 1 Ensure it passes certain URL filter tests
- 2 Check if it is already in the frontier (duplicate URL elimination)
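These steps can be written as a simple single-threaded loop. This is a minimal sketch, not any specific crawler: fetch, parse_links, content_fingerprint and url_filter are assumed helper callables, and politeness delays, error handling and distribution are omitted.

```python
from collections import deque

def crawl(seed_urls, fetch, parse_links, content_fingerprint, url_filter):
    """Minimal sketch of the crawl loop; the helper callables are assumed."""
    frontier = deque(seed_urls)        # crawl frontier: URLs waiting to be fetched
    seen_urls = set(seed_urls)         # for duplicate URL elimination
    seen_content = set()               # fingerprints of content already seen
    index = []                         # stand-in for the document index

    while frontier:
        url = frontier.popleft()       # 1. pick a URL from the frontier
        doc = fetch(url)               # 2. fetch the document at the URL
        if doc is None:
            continue
        links = parse_links(doc)       # 3. parse the document, extract links

        fp = content_fingerprint(doc)  # 4. check if the content was already seen
        if fp not in seen_content:
            seen_content.add(fp)
            index.append((url, doc))   # 5. if not, add it to the index

        for link in links:             # 6. for each extracted URL...
            if not url_filter(link):   #    ...ensure it passes URL filter tests
                continue
            if link in seen_urls:      #    ...duplicate URL elimination
                continue
            seen_urls.add(link)
            frontier.append(link)
    return index
```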
What two goals may collide when configuring the crawl frontier?
Politeness - i.e., don't hit a web server too frequently
Freshness - Crawl some pages more often than others
What is the Mercator URL frontier?
The Mercator frontier is a URL frontier design: URLs flow in at the top and pass through two tiers of queues before being handed out for fetching.
The Front queues manage prioritization
The Back queues enforce politeness
Each queue is FIFO
Only one connection is open at any time to any host
A waiting time of a few seconds occurs between successive requests to a host
High priority pages are crawled preferentially
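A much-simplified sketch of the front-queue/back-queue idea, assuming a fixed number of priority levels and a single politeness delay; the class and parameter names are made up and many details of the real Mercator design are collapsed.

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class SimpleMercatorFrontier:
    """Toy Mercator-style frontier: front queues handle prioritization,
    one back queue per host plus a heap of next-allowed fetch times
    handles politeness."""

    def __init__(self, num_priorities=3, politeness_delay=2.0):
        self.front = [deque() for _ in range(num_priorities)]  # front queues (0 = highest priority)
        self.back = {}            # host -> FIFO back queue of URLs for that host
        self.heap = []            # (earliest allowed fetch time, host)
        self.next_allowed = {}    # host -> earliest time we may hit it again
        self.delay = politeness_delay

    def add(self, url, priority=0):
        """Prioritizer: place the URL into the front queue for its priority."""
        self.front[priority].append(url)

    def _move_one_to_back(self):
        """Move one URL from the highest non-empty front queue to its host's back queue."""
        for q in self.front:
            if q:
                url = q.popleft()
                host = urlparse(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (self.next_allowed.get(host, 0.0), host))
                self.back[host].append(url)
                return True
        return False

    def next_url(self):
        """Hand out the next URL, keeping successive requests to the same host
        at least `politeness_delay` seconds apart."""
        while self._move_one_to_back():   # in this toy version, drain the front queues
            pass
        while self.heap:
            ready_at, host = heapq.heappop(self.heap)
            queue = self.back[host]
            if not queue:
                del self.back[host]       # host has no pending URLs any more
                continue
            wait = ready_at - time.time()
            if wait > 0:
                time.sleep(wait)          # politeness: wait before hitting this host again
            url = queue.popleft()
            self.next_allowed[host] = time.time() + self.delay
            heapq.heappush(self.heap, (self.next_allowed[host], host))
            return url
        return None                       # frontier is empty
```

In this toy version the front queues only decide the order in which URLs flow into the per-host back queues; roughly speaking, the real Mercator design keeps more back queues than crawler threads and refills a back queue from the front queues only when it empties.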
What is the difference between Duplication and Near-Duplication? And how can you detect both?
Duplication: exact match - can be detected with fingerprints (e.g., request or document fingerprints)
Near-duplication: approximate match - can be detected by computing syntactic similarity with an edit-distance measure and then applying a similarity threshold (e.g., if similarity is over 80%, the document is a near-duplicate)
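A minimal sketch of this thresholding, using Python's difflib similarity ratio as a stand-in syntactic similarity measure; the 0.8 threshold mirrors the 80% example above, and the sample texts are made up.

```python
from difflib import SequenceMatcher

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Flag doc_b as a near-duplicate of doc_a if their syntactic similarity
    exceeds the threshold. SequenceMatcher.ratio() returns a value in [0, 1]."""
    similarity = SequenceMatcher(None, doc_a, doc_b).ratio()
    return similarity >= threshold

# Hypothetical example: two pages that differ only in the final characters.
a = "The quick brown fox jumps over the lazy dog near the river bank."
b = "The quick brown fox jumps over the lazy dog near the riverbank!"
print(is_near_duplicate(a, b))  # True: the texts are almost identical
```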
In our paper, do we do any duplicate detection?
Scrapy has a built-in dupefilter class. It only checks request fingerprints, though (i.e., exact duplicates).
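For reference, a sketch of the relevant Scrapy settings: DUPEFILTER_CLASS names the filter to use, and the value below is Scrapy's default request-fingerprint filter, which only eliminates exact request duplicates.

```python
# settings.py (Scrapy project) -- sketch showing the default dupefilter.
# RFPDupeFilter drops requests whose fingerprint has already been seen,
# i.e. it catches exact request duplicates, not near-duplicate content.
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

# Optional: log every filtered duplicate request instead of only the first one.
DUPEFILTER_DEBUG = True
```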
What is meant by shingles?
The set of unique n-grams (contiguous sequences of n words or characters) in a document
What does the Jaccard Coefficient measure?
The similarity between two sets A and B: J(A, B) = |A ∩ B| / |A ∪ B|
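A small sketch combining the two ideas: build word k-shingles for two documents and compare the shingle sets with the Jaccard coefficient. The shingle size k = 3 and the sample texts are arbitrary choices for illustration.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of unique word k-grams (shingles) of the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| (0 if both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

doc1 = "a rose is a rose is a rose"
doc2 = "a rose is a flower which is a rose"
s1, s2 = shingles(doc1), shingles(doc2)
print(s1)               # e.g. {'a rose is', 'rose is a', 'is a rose'}
print(jaccard(s1, s2))  # similarity between the two shingle sets
```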
Explain the simple iterative logic for figuring out whether a node is good or bad.
Good nodes won't point to bad nodes; all other combinations are plausible.
Therefore:
- Good nodes only point to good nodes.
- If you point to a bad node, then you are bad.
- If a good node points to you, you are also good.
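A toy sketch of applying these rules iteratively on a small link graph, assuming a few trusted seed labels are known in advance; the graph and the seeds are made up for illustration.

```python
def propagate_labels(out_links, seeds, iterations=10):
    """Iteratively apply the rules above on a graph given as
    {node: [nodes it points to]} with seed labels {node: 'good' | 'bad'}."""
    labels = dict(seeds)  # known labels; every other node starts unknown
    for _ in range(iterations):
        changed = False
        for node, targets in out_links.items():
            # Rule: if you point to a bad node, then you are bad.
            if labels.get(node) is None and any(labels.get(t) == "bad" for t in targets):
                labels[node] = "bad"
                changed = True
            # Rule: if a good node points to you, you are also good.
            if labels.get(node) == "good":
                for t in targets:
                    if labels.get(t) is None:
                        labels[t] = "good"
                        changed = True
        if not changed:
            break
    return labels

# Hypothetical graph: A -> B -> C and D -> E, with A known-good and E known-bad.
graph = {"A": ["B"], "B": ["C"], "C": [], "D": ["E"], "E": []}
print(propagate_labels(graph, seeds={"A": "good", "E": "bad"}))
# -> A, B, C labelled good; D labelled bad; E stays bad
```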
What is PageRank scoring based on?
PageRank is a scoring measure based on the idea of citation analysis -> bibliographic coupling frequency
(Articles that co-cite the same articles are related)
Explain how PageRank scoring works
Imagine a user doing a random walk on web pages.
- Start at a random page
- At each step, go out of the current page along one of the links on that page, equiprobably
- “in the long run” each page has a long-term visit rate
- Use this as the page’s score
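A quick simulation of this random walk on a tiny made-up graph with no dead ends; the empirical visit fractions approximate each page's long-term visit rate.

```python
import random

def simulate_walk(out_links, steps=100_000, start=None):
    """Simulate the random surfer and count how often each page is visited."""
    nodes = list(out_links)
    visits = {u: 0 for u in nodes}
    page = start or random.choice(nodes)       # start at a random page
    for _ in range(steps):
        visits[page] += 1
        page = random.choice(out_links[page])  # follow one out-link equiprobably
    return {u: count / steps for u, count in visits.items()}

# Hypothetical graph with no dead ends: every page has at least one out-link.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(simulate_walk(graph))  # long-run visit rates, roughly A ≈ C ≈ 0.4, B ≈ 0.2
```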
Why is PageRank scoring insufficient?
Because the web is full of dead ends (pages with no out-links).
-> Random walks can get stuck in dead ends
How can we deal with dead ends?
Teleporting!
- At any non-dead end, with probability 10%, jump to a random web page
- With remaining 90%, go out on a random link
- 10% is a parameter, i.e., it can be modified
Result of teleporting:
- Cannot be stuck locally
- There is a long-term rate at which any page is visited (not obvious, will show this)
With teleporting, how can we compute the visit rate?
Markov Chains
What do markov chains do?
A Markov chain models the probabilities of transitioning between states over time, where the next state depends only on the current state. For PageRank, the states are web pages, the transition probabilities come from the out-links plus teleporting, and the long-term visit rate is the chain's steady-state (stationary) distribution.
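A minimal sketch of computing the long-term visit rates by power iteration on a tiny made-up graph, with 10% teleporting as above; here a dead end is handled by teleporting uniformly from it, which is one common treatment.

```python
def pagerank(out_links, teleport=0.10, iterations=100):
    """Power iteration: repeatedly apply the random-surfer transition
    (90% follow a random out-link, 10% teleport to a random page)."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}          # start from a uniform distribution

    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for u in nodes:
            targets = out_links[u]
            if targets:
                # With prob (1 - teleport), follow one of u's links equiprobably...
                for v in targets:
                    new_rank[v] += (1 - teleport) * rank[u] / len(targets)
                # ...and with prob teleport, jump to a random page.
                for v in nodes:
                    new_rank[v] += teleport * rank[u] / n
            else:
                # Dead end: jump to a random page with probability 1.
                for v in nodes:
                    new_rank[v] += rank[u] / n
        rank = new_rank
    return rank

# Hypothetical 3-page web: A <-> B, B -> C, and C is a dead end.
graph = {"A": ["B"], "B": ["A", "C"], "C": []}
print(pagerank(graph))  # long-term visit rate (PageRank score) for each page
```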
HITS is an acronym for…
Hyperlink-Induced Topic Search algorithm
Explain the HITS algorithm
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of interrelated pages:
- Hub pages are good lists of links on a subject
- e.g., "Bob's list of cancer-related links"
- Authority pages occur recurrently on good hubs for the subject
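A minimal sketch of the hub/authority iteration on a tiny made-up subgraph: a page's authority score sums the hub scores of pages linking to it, a page's hub score sums the authority scores of pages it links to, and the scores are normalized each round (normalizing by the sum is a simplification).

```python
def hits(out_links, iterations=50):
    """Iteratively compute hub and authority scores for the given link graph."""
    nodes = list(out_links)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}

    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages pointing to the node.
        auth = {v: sum(hub[u] for u in nodes if v in out_links[u]) for v in nodes}
        # Hub update: sum of authority scores of the pages the node points to.
        hub = {u: sum(auth[v] for v in out_links[u]) for u in nodes}
        # Normalize so the scores do not blow up.
        a_norm = sum(auth.values()) or 1.0
        h_norm = sum(hub.values()) or 1.0
        auth = {v: s / a_norm for v, s in auth.items()}
        hub = {u: s / h_norm for u, s in hub.items()}
    return hub, auth

# Hypothetical subgraph: H1 and H2 are link lists pointing at pages P1-P3.
graph = {"H1": ["P1", "P2"], "H2": ["P2", "P3"], "P1": [], "P2": [], "P3": []}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))  # P2 should rank as the top authority
```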
Is HITS best suited for Broad topic queries or page-finding queries?
Broad topic queries
What are some issues with the HITS algorithm?
Topic drift:
- Off topic pages can cause off-topic “authorities” to be returned
- E.g., the neighborhood graph can be about a "super topic" such as diseases rather than leukemia
Mutually Reinforcing Affiliates:
- Affiliated pages/sites can boost each other's scores
- Linkage between affiliated pages is not a useful signal