Lecture 8 - Information Retrieval - Webcrawling Flashcards

1
Q

What are some complications with web crawling?

A

Web crawling is not always feasible with one machine

Malicious pages

- Spam pages
- Spider traps - including dynamically generated ones

Even non-malicious pages pose challenges

- Latency/bandwidth to remote servers vary
- Webmaster's stipulations
    - How "deep" should you crawl a site's URL hierarchy?
- Site mirrors and duplicate pages

Politeness - Don’t hit a server too often

2
Q

What must any crawler do?

A

Be robust: Be immune to spider traps and other malicious behaviour from web servers

Be polite: Respect implicit and explicit politeness considerations

3
Q

What is the difference between explicit and implicit politeness in web crawling?

A

Explicit: Respect specifications from robots.txt

Implicit: Even with no specification, avoid hitting any site too often

4
Q

What is meant by robots.txt?

A

robots.txt is a file placed at the root of a website that specifies which portions of the site may (and may not) be crawled
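
A minimal sketch of honouring such specifications with Python's standard urllib.robotparser; the robots.txt content, user-agent name and URLs below are made-up examples:

```python
# Minimal sketch of obeying robots.txt with Python's standard library.
# The robots.txt content, user-agent and URLs are made-up examples.
from urllib import robotparser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS_TXT.splitlines())  # in practice: rp.set_url(".../robots.txt"); rp.read()

print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))           # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/secret.html"))  # False
print(rp.crawl_delay("MyCrawler"))  # 5 -> wait at least 5 seconds between requests
```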

5
Q

Do you need to repeatedly fetch robots.txt?

A

No. Once you have fetched it, cache it rather than fetching it again;
repeated fetches burn bandwidth and put unnecessary load on the web server

6
Q

What should any crawler do?

A

Be capable of distributed operation: Designed to run on multiple machines

Be scalable: Designed to increase the crawl rate by adding more machines

Performance/Efficiency: Permit full use of available processing and network resources

Fetch pages of “higher quality” first

Continuous operation: Continue fetching fresh copies of a previously fetched page

Extensible: Adapt to new data formats and protocols

7
Q

What is meant by a “Crawl Frontier”?

A

The crawl frontier is one of the components that make up the architecture of a web crawler. It holds the URLs still to be fetched, together with the logic and policies the crawler follows when deciding which URL to visit next

8
Q

Name the processing steps in crawling

A
  1. Pick a URL from the frontier
  2. Fetch the document at the URL
  3. Parse the fetched document
    • Extract links from it to other docs (URLs)
  4. Check if the URL's content has already been seen
    • If not, add to indexes
  5. For each extracted URL
    1. Ensure it passes certain URL filter tests
    2. Check if it is already in the frontier (duplicate URL elimination)
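
A toy single-threaded sketch of these steps; fetch, extract_links, passes_url_filters and index are hypothetical callables supplied by the caller, and a real crawler adds politeness, robots.txt handling, error handling and distribution:

```python
# Toy sketch of the crawl loop above. fetch(), extract_links(),
# passes_url_filters() and index() are hypothetical helpers.
from collections import deque
import hashlib

def crawl(seed_urls, fetch, extract_links, passes_url_filters, index):
    frontier = deque(seed_urls)   # URLs still to be crawled
    seen_urls = set(seed_urls)    # for duplicate-URL elimination
    seen_content = set()          # fingerprints of content already indexed

    while frontier:
        url = frontier.popleft()                 # 1. pick a URL from the frontier
        document = fetch(url)                    # 2. fetch the document at the URL
        links = extract_links(document)          # 3. parse it, extracting out-links

        fingerprint = hashlib.sha1(document.encode()).hexdigest()
        if fingerprint not in seen_content:      # 4. content not already seen?
            seen_content.add(fingerprint)
            index(url, document)                 #    -> add to the indexes

        for link in links:                       # 5. for each extracted URL
            if passes_url_filters(link) and link not in seen_urls:
                seen_urls.add(link)              #    duplicate URL elimination
                frontier.append(link)
```
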
9
Q

What two goals may collide when configuring the crawl frontier?

A

Politeness - i.e. don't hit a web server too frequently

Freshness - Crawl some pages more often than others

10
Q

What is the Mercator URL frontier?

A

The Mercator URL frontier is a frontier design in which URLs flow in at the top and are handed out at the bottom to be crawled.

The front queues manage prioritization
The back queues enforce politeness

Each queue is FIFO

Only one connection is open at any time to any host
A waiting time of a few seconds occurs between successive requests to a host

High priority pages are crawled preferentially
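
A rough, simplified sketch of this design; the class name, the two-second delay and the eager refill strategy are illustrative assumptions rather than Mercator's exact algorithm:

```python
# Simplified sketch of a Mercator-style URL frontier: front queues for
# prioritization, per-host back queues plus a heap for politeness.
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class MercatorFrontier:
    def __init__(self, num_front_queues=3, politeness_delay=2.0):
        self.front = [deque() for _ in range(num_front_queues)]  # one FIFO per priority
        self.back = {}     # host -> FIFO of URLs for that host
        self.heap = []     # (earliest next-fetch time, host), one entry per back queue
        self.delay = politeness_delay

    def add(self, url, priority=0):
        # Priority 0 is highest here; a prioritizer would choose this value.
        self.front[priority].append(url)

    def _refill_back_queues(self):
        # Move URLs from the front queues (highest priority first) into the
        # per-host back queues. Real Mercator refills lazily; this is eager.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlparse(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (time.monotonic(), host))
                self.back[host].append(url)

    def next_url(self):
        self._refill_back_queues()
        while self.heap:
            next_time, host = heapq.heappop(self.heap)
            if not self.back[host]:
                del self.back[host]          # this back queue is exhausted
                continue
            time.sleep(max(0.0, next_time - time.monotonic()))  # politeness wait
            url = self.back[host].popleft()
            # This host may only be contacted again after the politeness delay.
            heapq.heappush(self.heap, (time.monotonic() + self.delay, host))
            return url
        return None

frontier = MercatorFrontier()
frontier.add("https://example.com/important", priority=0)
frontier.add("https://example.com/less-important", priority=1)
print(frontier.next_url())  # .../important first (higher priority)
print(frontier.next_url())  # waits ~2 s (same host), then .../less-important
```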

11
Q

What is the difference between Duplication and Near-Duplication? And how can you detect both?

A

Duplication: Exact match - can be detected with fingerprints (e.g. request fingerprints)

Near Duplication: Approximate match - can be detected by computing syntactic similarity with an edit-distance measure and then applying a similarity threshold (e.g. if similarity is over 80%, the document is a near-duplicate)
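
A small illustrative sketch of both checks, assuming SHA-1 content fingerprints and difflib's ratio() as a stand-in for an edit-distance-style similarity; the 0.8 threshold mirrors the 80% example:

```python
# Sketch of exact vs. near-duplicate detection on document text.
import hashlib
from difflib import SequenceMatcher

def fingerprint(doc: str) -> str:
    # Exact duplication: identical fingerprints (here SHA-1 of the content).
    return hashlib.sha1(doc.encode("utf-8")).hexdigest()

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    # Near-duplication: syntactic similarity above a threshold.
    similarity = SequenceMatcher(None, doc_a, doc_b).ratio()
    return similarity >= threshold

a = "Web crawling is not always feasible with one machine."
b = "Web crawling is not always feasible with a single machine."
print(fingerprint(a) == fingerprint(a))  # True: exact duplicate of itself
print(fingerprint(a) == fingerprint(b))  # False: not an exact duplicate
print(is_near_duplicate(a, b))           # True: similarity is well over 80%
```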

12
Q

In our paper, do we do any duplicate detection?

A

Scrapy has a built-in dupefilter class. However, this only checks request fingerprints (i.e. exact duplicates)

13
Q

What is meant by shingles?

A

A set of unique n-grams

14
Q

What does the Jaccard Coefficient measure?

A

The similarity between two sets
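
A minimal sketch tying this and the previous card together: build word-level shingles and compare two documents with the Jaccard coefficient |A ∩ B| / |A ∪ B| (the example documents are made up):

```python
# Word-level shingles (n-grams) plus the Jaccard coefficient of two shingle sets.
def shingles(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|; 1.0 means identical sets, 0.0 means disjoint sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(doc1), shingles(doc2)))  # 0.4: many shared 3-word shingles
```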

15
Q

Explain the simple iterative logic for figuring out whether a node is good or bad.

A

Good nodes won’t point to bad nodes
All other combinations are plausible

Therefore ->

Good nodes only point to good nodes.
If you point to a bad node, then you are bad.
If a good node points to you, you are also good.
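
A tiny sketch of this propagation; the link graph and the initially known bad node are made-up examples:

```python
# Iteratively mark as bad every node that points to a bad node.
def propagate_badness(out_links: dict, known_bad: set) -> set:
    bad = set(known_bad)
    changed = True
    while changed:
        changed = False
        for node, targets in out_links.items():
            # "If you point to a bad node, then you are bad."
            if node not in bad and any(t in bad for t in targets):
                bad.add(node)
                changed = True
    return bad

graph = {"A": {"B"}, "B": {"spam"}, "C": {"A"}, "spam": set()}
print(propagate_badness(graph, {"spam"}))  # A, B and C all end up marked bad
```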

16
Q

What is Pagerank Scoring based on?

A

PageRank is a scoring measure based on the idea of citation analysis (bibliographic coupling frequency)

(Articles that co-cite the same articles are related)

17
Q

Explain how Pagerank scoring works

A

Imagine a user doing a random walk on web pages.

  • Start at a random page
  • At each step, go out of the current page along one of the links on that page, equiprobably
  • “in the long run” each page has a long-term visit rate
    • Use this as the page’s score
18
Q

Why is Pagerank scoring insufficient?

A

Because the web is full of dead ends.

-> Random walks can get stuck in dead-ends

19
Q

How can we deal with dead ends?

A

Teleporting!

  • At a dead end, jump to a random web page
  • At any non-dead end, with probability 10%, jump to a random web page
    • With the remaining 90%, go out on a random link
    • 10% is a parameter, i.e. it can be modified

Result of teleporting:

  • Cannot be stuck locally
  • There is a long-term rate at which any page is visited (not obvious, but it can be shown)
20
Q

With teleporting, how can we compute the visit rate?

A

Markov Chains

21
Q

What do markov chains do?

A

A Markov chain models transitions between states, where the probability of the next state depends only on the current state.

Here the states are web pages and the transition probabilities come from the links plus teleporting; the long-term visit rate of each page is the chain's steady-state (stationary) distribution, which PageRank uses as the score.
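
A minimal power-iteration sketch of this idea, using the 10% teleport probability from the earlier card; the four-page graph is made up, and dead ends are handled by treating them as if they linked to every page:

```python
# PageRank as the long-term visit rate of the teleporting random walk,
# approximated by power iteration.
def pagerank(out_links: dict, teleport: float = 0.10, iterations: int = 100) -> dict:
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start from a uniform distribution

    for _ in range(iterations):
        new_rank = {p: teleport / n for p in pages}       # mass from teleporting
        for page, targets in out_links.items():
            if targets:
                share = (1.0 - teleport) * rank[page] / len(targets)
                for target in targets:                    # follow a random out-link
                    new_rank[target] += share
            else:
                # Dead end: treat it as linking to every page (rank spreads uniformly).
                for target in pages:
                    new_rank[target] += (1.0 - teleport) * rank[page] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
print(pagerank(graph))  # long-term visit rates; the values sum to 1
```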

22
Q

HITS is an acronym for…

A

Hyperlink-Induced Topic Search algorithm

23
Q

Explain the HITS algorithm

A

In response to a query, instead of an ordered list of pages each meeting the query, find two sets of interrelated pages:

  • Hub pages are good lists of links on a subject
    • e.g., “Bob’s list of cancer-related links.”
  • Authority pages occur recurrently on good hubs for the subject
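
A small sketch of the hub/authority iteration behind HITS, on a made-up link graph (HITS normally runs on a query-specific neighbourhood graph):

```python
# Hub score = sum of authority scores of the pages you point to;
# authority score = sum of hub scores of the pages pointing to you; normalize each round.
import math

def hits(out_links: dict, iterations: int = 50):
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        hub = {p: sum(auth[t] for t in out_links[p]) for p in pages}
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

graph = {"hub1": ["authA", "authB"], "hub2": ["authA"], "authA": [], "authB": []}
hub, auth = hits(graph)
print(auth)  # authA gets the highest authority score: both hubs link to it
```
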
24
Q

Is HITS best suited for Broad topic queries or page-finding queries?

A

Broad topic queries

25
Q

What are some issues with the HITS algorithm?

A

Topic drift:

  • Off-topic pages can cause off-topic “authorities” to be returned
    • E.g., the neighborhood graph can be about a “super topic” (diseases vs. leukemia)

Mutually Reinforcing Affiliates:

  • Affiliated pages/sites can boost each other's scores
    • Linkage between affiliated pages is not a useful signal