Web Crawling 2 Flashcards
Why is breadth-first search a good way to crawl?
- Increases website coverage
- Follows the politeness policy by not requesting many pages from the same site in a row (see the sketch below)
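A minimal sketch of a breadth-first crawl frontier; fetch() and extract_links() are hypothetical stubs standing in for a real downloader and HTML parser:

```python
from collections import deque

def fetch(url):
    """Hypothetical downloader; a real crawler would issue an HTTP GET."""
    return ""

def extract_links(page):
    """Hypothetical link extractor; a real crawler would parse the HTML."""
    return []

def bfs_crawl(seeds, max_pages=1000):
    """FIFO frontier = breadth-first order over the link graph."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # never enqueue the same URL twice
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()          # oldest URL first: breadth-first
        page = fetch(url)
        crawled.append(url)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```

Because the FIFO queue interleaves links from many different hosts discovered at the same depth, consecutive requests rarely hit the same site, which is why BFS is naturally polite.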
What are the 3 distinct policies that make up a crawling algorithm?
- Selection policy: Visit the best quality pages first
- Re-visit policy: Update the index when pages change
- Politeness policy: Avoid overloading websites
What are the 2 types of restrictions that define the selection policy?
- Off-line limits that are set beforehand
- On-line selection that is computed as the crawl progresses
What are some examples of off-line selection restrictions?
- Maximum number of hosts to crawl
- Maximum depth of crawl
- Maximum number of pages in collection
- Maximum number of pages or bytes downloaded from each server
- A list of accepted MIME types for downloading (see the example configuration below)
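These limits could be collected into a simple configuration object; the field names and values below are illustrative, not taken from any particular crawler:

```python
from dataclasses import dataclass

@dataclass
class CrawlLimits:
    """Off-line selection limits, fixed before the crawl starts."""
    max_hosts: int = 10_000                 # maximum number of hosts to crawl
    max_depth: int = 5                      # maximum link depth from the seeds
    max_pages_total: int = 1_000_000        # maximum pages in the collection
    max_pages_per_server: int = 500         # per-server page cap
    max_bytes_per_server: int = 50_000_000  # per-server byte cap
    accepted_mime_types: tuple = ("text/html", "application/xhtml+xml")
```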
What is on-line selection?
Prioritizing web pages based on their importance during the crawl.
Importance is based on intrinsic quality, popularity, the URL itself, and other information (see the priority-queue sketch below)
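A minimal sketch of a best-first frontier built on a priority queue; the importance score is assumed to come from some external estimator combining the signals above:

```python
import heapq

class PriorityFrontier:
    """Best-first frontier: always pop the highest-importance URL next."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker so equal-priority URLs pop in FIFO order

    def push(self, url, importance):
        # heapq is a min-heap, so negate importance to get max-first order
        heapq.heappush(self._heap, (-importance, self._count, url))
        self._count += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url
```

As the crawl progresses, newly discovered URLs are pushed with their current importance estimate, so the visit order adapts on-line.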
What are the 3 events that can occur during a crawl according to the revisit policy?
- Creations: A new page is created and can be crawled
- Updates: Can be minor (a paragraph or sentence changes) or major (existing references to the content are no longer valid)
- Deletions: Undetected deletions are more damaging for a search engine than updates
What are the 2 cost functions associated with page updates?
- Freshness: Binary measure that indicates if the local copy is up to date
- Age: Measure that indicates how outdated the local copy is
How do we measure freshness of a page?
Freshness is 1 if the local copy is identical to the live page at time t, and 0 otherwise
How do we measure the age of a page?
- Age is 0 if the page has not been modified since it was last crawled
- Otherwise, age is t - lu(p), where lu(p) is the time of the page's last update (both measures are written as formulas below)
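Written as formulas, following the two cards above (F_p(t) and A_p(t) are conventional names for the freshness and age of page p at time t):

```latex
F_p(t) =
\begin{cases}
  1 & \text{if the local copy of } p \text{ is up to date at time } t \\
  0 & \text{otherwise}
\end{cases}
\qquad
A_p(t) =
\begin{cases}
  0         & \text{if } p \text{ has not changed since it was last crawled} \\
  t - lu(p) & \text{otherwise}
\end{cases}
```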
What are the 2 possible goals for the crawler when it comes to page updates?
- Keep average freshness as high as possible
- Keep the average age of pages as low as possible
What is the difference between uniform policy and proportional policy when it comes to page updates?
- Uniform revisits all pages with the same frequency, while proportional revisits frequently changing pages more often
- In terms of average freshness, uniform is better than proportional (see the sketch below)
- Under proportional, the crawler wastes time re-crawling pages that change too often to ever stay fresh
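A small numeric sketch of why uniform beats proportional on average freshness. It assumes pages change as Poisson processes (an assumption, not stated on the card); under that model, a page with change rate lam revisited every I time units has average freshness (1 - e^(-lam*I)) / (lam*I):

```python
import math

def avg_freshness(lam, interval):
    """Expected freshness of a page with Poisson change rate `lam`
    when it is re-crawled every `interval` time units."""
    x = lam * interval
    return (1 - math.exp(-x)) / x

rates = [0.1, 1.0, 10.0]  # change rates (changes/day) for three pages
budget = len(rates)       # total revisits per day: one per page on average

# Uniform: every page is revisited once per day.
uniform = [avg_freshness(lam, 1.0) for lam in rates]

# Proportional: revisit frequency proportional to each page's change rate.
intervals = [sum(rates) / (budget * lam) for lam in rates]
proportional = [avg_freshness(lam, i) for lam, i in zip(rates, intervals)]

print(f"uniform:      {sum(uniform) / len(rates):.3f}")       # ~0.561
print(f"proportional: {sum(proportional) / len(rates):.3f}")  # ~0.264
```

The proportional policy spends most of its budget on the fast-changing page, whose copy goes stale almost immediately anyway, which is exactly the waste the card describes.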
How do search engines use multiple queues for scheduling page re-visits?
Use 2 or 3 queues, each with a different turnaround time.
Ex: one queue for news sites that is refreshed several times per day, and a daily or weekly queue for popular or relevant sites (a minimal scheduler sketch follows below)
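A minimal sketch of such a scheduler; the queue names and turnaround intervals are illustrative:

```python
import time
from collections import deque

# One queue per refresh tier; intervals are in seconds.
QUEUES = {
    "news":   {"interval": 4 * 3600,      "urls": deque()},  # several times a day
    "daily":  {"interval": 24 * 3600,     "urls": deque()},  # popular/relevant sites
    "weekly": {"interval": 7 * 24 * 3600, "urls": deque()},  # everything else
}

def due_urls(last_crawled, now=None):
    """Yield URLs whose queue's revisit interval has elapsed.
    `last_crawled` maps each URL to its last crawl timestamp."""
    if now is None:
        now = time.time()
    for queue in QUEUES.values():
        for url in queue["urls"]:
            if now - last_crawled.get(url, 0.0) >= queue["interval"]:
                yield url
```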
What are the 3 basic rules for a web crawler in terms of politeness?
- A web crawler must identify itself as such and not pretend to be a regular web user
- A web crawler must obey the robots exclusion protocol (robots.txt)
- A web crawler must keep its bandwidth usage low on any given website
How can a web crawler identify itself as such to a web server?
Using the HTTP User-Agent header field, which identifies who is issuing the request.
For a web crawler, the field should include the address of a page with information about the crawler and contact info (see the example below)
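For example, with Python's standard library (the bot name, info URL, and contact address are made up):

```python
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={
        # Identify the crawler and point to a page about it, with contact info.
        "User-Agent": "ExampleBot/1.0 (+https://example.com/bot-info; admin@example.com)"
    },
)
with urllib.request.urlopen(req) as response:
    html = response.read()
```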
What are the 3 types of exclusion involved in the robots exclusion protocol?
- Server-wide exclusion (the robots.txt file)
- Page-wise exclusion (robots meta tags in the HTML)
- Cache exclusion (the noarchive directive); see the sketch below
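The server-wide case can be checked with Python's standard library; the bot name and URLs below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Server-wide exclusion: download and obey robots.txt before crawling.
# Page-wise exclusion is expressed with <meta name="robots"> tags inside
# each page, and cache exclusion with the "noarchive" directive.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("ExampleBot", "https://example.com/private/page.html"):
    pass  # allowed by robots.txt: safe to download this page
```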