Web Crawling 2 Flashcards

1
Q

Why is breadth-first search a good way to crawl?

A
  1. Increases website coverage
  2. Follows the politeness policy by not requesting many pages from the same site in a row (a minimal sketch follows)
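A minimal sketch of a breadth-first crawler, assuming naive regex-based link extraction; the function name, parameters, and limits are illustrative, not from the cards:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

def bfs_crawl(seeds, max_pages=100):
    """Download pages in FIFO (breadth-first) order starting from seed URLs."""
    frontier = deque(seeds)   # FIFO queue: pop oldest first = breadth-first
    seen = set(seeds)         # avoid re-queueing the same URL
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue          # unreachable page: skip it
        pages[url] = html
        # naive link extraction; a real crawler would use an HTML parser
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Because the queue is FIFO, links discovered on many different sites interleave, which naturally spreads consecutive requests across hosts.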
2
Q

What are the 3 distinct policies that make up a crawling algorithm?

A
  1. Selection policy: Visit the best quality pages first
  2. Re-visit policy: Update the index when pages change
  3. Politeness policy: Avoid overloading websites
3
Q

What are the 2 types of restrictions that define the selection policy?

A
  1. Off-line limits that are set beforehand
  2. On-line selection that is computed as the crawl progresses
4
Q

What are some examples of off-line selection restrictions?

A
  1. Maximum number of hosts to crawl
  2. Maximum depth of crawl
  3. Maximum number of pages in collection
  4. Maximum number of pages or bytes downloaded from each server
  5. A list of accepted MIME types for downloading (see the config sketch below)
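These limits would typically live in a crawler configuration fixed before the crawl starts; a hypothetical Python sketch (all names and values are made up):

```python
# Hypothetical off-line selection limits, fixed before the crawl starts.
CRAWL_LIMITS = {
    "max_hosts": 1_000,                  # maximum number of hosts to crawl
    "max_depth": 5,                      # maximum link depth from the seeds
    "max_pages_total": 100_000,          # maximum pages in the collection
    "max_pages_per_server": 500,         # cap on pages from each server
    "max_bytes_per_server": 50 * 2**20,  # cap on bytes per server (50 MiB)
    "accepted_mime_types": {"text/html", "application/xhtml+xml"},
}
```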
5
Q

What is on-line selection?

A

Prioritizing web pages based on their estimated importance during the crawl.
Importance is based on intrinsic quality, popularity, the URL itself, and other information.
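One common implementation is a priority queue over the frontier that always pops the highest-scoring URL next; a minimal sketch, where the importance scores stand in for whatever estimate the crawler computes:

```python
import heapq
import itertools

# Priority frontier: always pop the URL with the highest importance score.
_tiebreak = itertools.count()  # so heapq never has to compare URL strings
frontier = []

def push(url, importance):
    # heapq is a min-heap, so negate the score to pop the best page first
    heapq.heappush(frontier, (-importance, next(_tiebreak), url))

def pop_best():
    _, _, url = heapq.heappop(frontier)
    return url

push("http://example.com/a", importance=0.9)
push("http://example.com/b", importance=0.2)
print(pop_best())  # http://example.com/a
```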

6
Q

What are the 3 events that can occur during a crawl according to the revisit policy?

A
  1. Creations: A new page is created and can be crawled
  2. Updates: Can be minor (a paragraph or sentence changed; references to the content are still valid) or major (references to the content are no longer valid)
  3. Deletions: Undetected deletions are more damaging for a search engine than undetected updates
7
Q

What are the 2 cost functions associated with page updates?

A
  1. Freshness: Binary measure that indicates if the local copy is up to date
  2. Age: Measure that indicates how outdated the local copy is
8
Q

How do we measure freshness of a page?

A

Freshness of a page p is 1 if the local copy is identical to the live page at time t, and 0 otherwise
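Written as a formula (the notation F_p(t) is an assumption, chosen to match the age formula on the next card):

```latex
F_p(t) =
\begin{cases}
1 & \text{if the local copy of } p \text{ equals the live page at time } t \\
0 & \text{otherwise}
\end{cases}
```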

9
Q

How do we measure the age of a page?

A

Age is 0 if the page has not been modified since it was downloaded;
otherwise it is t - lu(p), where lu(p) is the page's last update time
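As a formula (notation A_p(t) and lu(p) assumed, matching the freshness definition above):

```latex
A_p(t) =
\begin{cases}
0 & \text{if } p \text{ has not been modified by time } t \\
t - lu(p) & \text{otherwise}
\end{cases}
```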

10
Q

What are the 2 possible goals for the crawler when it comes to page updates?

A
  1. Keep average freshness as high as possible
  2. Keep the average age of pages as low as possible
11
Q

What is the difference between uniform policy and proportional policy when it comes to page updates?

A
  1. Uniform revisits all pages with the same frequency, whereas proportional revisits the pages that change frequently more often
  2. In terms of average freshness, uniform is better than proportional
  3. Under proportional, the crawler wastes time re-crawling pages that change too often to ever be kept fresh
12
Q

How do search engines use multiple queues for scheduling page re-visits?

A

Use 2 or 3 queues, each with a different turnaround time.
Ex: One queue for news sites that is refreshed several times per day, and a daily or weekly queue for popular or relevant sites
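A minimal sketch of such a scheduler, assuming two hypothetical queues with hard-coded revisit intervals (queue names, URLs, and intervals are made up):

```python
import heapq
import time

# Hypothetical re-visit queues: (revisit interval in seconds, seed URLs).
QUEUES = {
    "news":   (6 * 3600,  ["http://news.example.com/"]),     # ~4x per day
    "weekly": (7 * 86400, ["http://popular.example.com/"]),  # once a week
}

due = []  # heap of (next_visit_time, interval, url)
now = time.time()
for interval, urls in QUEUES.values():
    for url in urls:
        heapq.heappush(due, (now, interval, url))

def next_revisit():
    """Pop the most overdue page and reschedule it on its own cadence."""
    when, interval, url = heapq.heappop(due)
    heapq.heappush(due, (when + interval, interval, url))
    return url

print(next_revisit())  # http://news.example.com/ (due first)
```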

13
Q

What are the 3 basic rules for a web crawler in terms of politeness?

A
  1. A web crawler must identify itself as such and not pretend to be a regular web user
  2. A web crawler must obey the robots exclusion protocol (robots.txt)
  3. A web crawler must keep its bandwidth usage low on any given website
14
Q

How can a web crawler identify itself as such to a web server?

A

Using the HTTP User-Agent header field, which identifies who is issuing a request.
For a web crawler, the field should include the address of a page with information about the crawler and contact info
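A minimal sketch using only Python's standard library; the bot name and info URL are fictional:

```python
from urllib.request import Request, urlopen

# Identify the crawler and point to a page describing it (a common
# convention is "BotName/version (+info-URL)"; this bot is made up).
req = Request(
    "http://example.com/",
    headers={"User-Agent": "ExampleBot/1.0 (+http://example.com/bot.html)"},
)
with urlopen(req, timeout=5) as resp:
    body = resp.read()
```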

15
Q

What are the 3 types of exclusion involved in the robot exclusion protocol?

A
  1. Server-wide
  2. Page-wise
  3. Cache
16
Q

What is server-wide exclusion?

A

Instructs a crawler about directories of the site that should not be crawled. Performed using a single robots.txt file at the root of the site
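For example, a crawler can check a site's rules with Python's standard-library parser; the robots.txt contents and bot name here are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A small robots.txt as it might appear at http://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("ExampleBot", "http://example.com/private/a.html"))  # False
print(rp.can_fetch("ExampleBot", "http://example.com/index.html"))      # True
```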

17
Q

What is page-wise exclusion?

A

Including meta tags on web pages to instruct a robot not to index a particular page or follow its links

<meta name="robots" content="noindex, nofollow">

18
Q

What is cache exclusion?

A

Allowing web crawlers to index a page but not to show a locally cached copy of it

<meta name="robots" content="noarchive">

19
Q

How can we calculate the time to complete a web crawl in optimal settings?

A

T* = (sum of page sizes) / B*
where B* is the maximum bandwidth of the downloader and T* is the resulting optimal crawl time
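A worked example with made-up numbers (a 1 TB collection and a 100 MB/s downloader):

```latex
T^* = \frac{\sum_p \mathrm{size}(p)}{B^*}
    = \frac{10^{12}\ \text{bytes}}{10^{8}\ \text{bytes/s}}
    = 10^{4}\ \text{s} \approx 2.8\ \text{hours}
```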

20
Q

How can we measure bandwidth loss when we have non-optimal conditions?

A

B* × (T − T*)
where T* is the ideal (optimal) crawl time and T is the actual time the crawl took
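Continuing the made-up numbers from the previous card: if the crawl actually takes T = 1.5 × 10^4 s against an optimal T* = 10^4 s:

```latex
\text{loss} = B^*\,(T - T^*)
            = 10^{8}\ \text{bytes/s} \times (1.5\times 10^{4} - 10^{4})\ \text{s}
            = 5\times 10^{11}\ \text{bytes} \approx 500\ \text{GB}
```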