Web Crawling 2 Flashcards
Why is breadth-first search a good way to crawl?
- Increases website coverage
- Follows the politeness policy by not requesting many pages from the same site in a row (see the sketch below)
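A minimal sketch of a breadth-first crawl frontier; fetch() and extract_links() are hypothetical stubs standing in for a real downloader and HTML parser:

```python
from collections import deque

def fetch(url):
    """Hypothetical downloader; a real crawler would issue an HTTP GET."""
    return ""

def extract_links(page):
    """Hypothetical link extractor; a real crawler would parse the HTML."""
    return []

def bfs_crawl(seeds, max_pages=1000):
    """FIFO frontier = breadth-first order over the link graph."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # never enqueue the same URL twice
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()          # oldest URL first: breadth-first
        page = fetch(url)
        crawled.append(url)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```

Because the FIFO queue interleaves links from many different hosts discovered at the same depth, consecutive requests rarely hit the same site, which is why BFS is naturally polite.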
What are the 3 distinct policies that make up a crawling algorithm?
- Selection policy: Visit the best quality pages first
- Re-visit policy: Update the index when pages change
- Politeness policy: Avoid overloading websites
What are the 2 types of restrictions that define the selection policy?
- Off-line limits that are set beforehand
- On-line selection that is computed as the crawl progresses
What are some examples of off-line selection restrictions?
- Maximum number of hosts to crawl
- Maximum depth of crawl
- Maximum number of pages in collection
- Maximum number of pages or bytes downloaded from each server
- A list of accepted MIME types for downloading (see the example configuration below)
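These limits could be collected into a simple configuration object; the field names and values below are illustrative, not taken from any particular crawler:

```python
from dataclasses import dataclass

@dataclass
class CrawlLimits:
    """Off-line selection limits, fixed before the crawl starts."""
    max_hosts: int = 10_000                 # maximum number of hosts to crawl
    max_depth: int = 5                      # maximum link depth from the seeds
    max_pages_total: int = 1_000_000        # maximum pages in the collection
    max_pages_per_server: int = 500         # per-server page cap
    max_bytes_per_server: int = 50_000_000  # per-server byte cap
    accepted_mime_types: tuple = ("text/html", "application/xhtml+xml")
```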
What is on-line selection?
Prioritizing web pages based on their importance during the crawl.
Importance is based on intrinsic quality, popularity, the URL itself, and other information (see the priority-queue sketch below)
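A minimal sketch of a best-first frontier built on a priority queue; the importance score is assumed to come from some external estimator combining the signals above:

```python
import heapq

class PriorityFrontier:
    """Best-first frontier: always pop the highest-importance URL next."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker so equal-priority URLs pop in FIFO order

    def push(self, url, importance):
        # heapq is a min-heap, so negate importance to get max-first order
        heapq.heappush(self._heap, (-importance, self._count, url))
        self._count += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url
```

As the crawl progresses, newly discovered URLs are pushed with their current importance estimate, so the visit order adapts on-line.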
What are the 3 events that can occur during a crawl according to the revisit policy?
- Creations: A new page is created and can be crawled
- Updates: Can be minor (a paragraph or sentence changes) or major (existing references to the content are no longer valid)
- Deletions: Undetected deletions are more damaging for a search engine than updates
What are the 2 cost functions associated with page updates?
- Freshness: Binary measure that indicates if the local copy is up to date
- Age: Measure that indicates how outdated the local copy is
How do we measure freshness of a page?
Freshness is 1 if the local copy is identical to the live page at time t, and 0 otherwise
How do we measure the age of a page?
- Age is 0 if the page has not been modified since it was last crawled
- Otherwise, age is t - lu(p), where lu(p) is the time of the page's last update (both measures are written as formulas below)
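Written as formulas, following the two cards above (F_p(t) and A_p(t) are conventional names for the freshness and age of page p at time t):

```latex
F_p(t) =
\begin{cases}
  1 & \text{if the local copy of } p \text{ is up to date at time } t \\
  0 & \text{otherwise}
\end{cases}
\qquad
A_p(t) =
\begin{cases}
  0         & \text{if } p \text{ has not changed since it was last crawled} \\
  t - lu(p) & \text{otherwise}
\end{cases}
```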
What are the 2 possible goals for the crawler when it comes to page updates?
- Keep average freshness as high as possible
- Keep the average age of pages as low as possible
What is the difference between uniform policy and proportional policy when it comes to page updates?
- Uniform revisits all pages with the same frequency, while proportional revisits frequently changing pages more often
- In terms of average freshness, uniform is better than proportional (see the sketch below)
- Under proportional, the crawler wastes time re-crawling pages that change too often to ever stay fresh
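A small numeric sketch of why uniform beats proportional on average freshness. It assumes pages change as Poisson processes (an assumption, not stated on the card); under that model, a page with change rate lam revisited every I time units has average freshness (1 - e^(-lam*I)) / (lam*I):

```python
import math

def avg_freshness(lam, interval):
    """Expected freshness of a page with Poisson change rate `lam`
    when it is re-crawled every `interval` time units."""
    x = lam * interval
    return (1 - math.exp(-x)) / x

rates = [0.1, 1.0, 10.0]  # change rates (changes/day) for three pages
budget = len(rates)       # total revisits per day: one per page on average

# Uniform: every page is revisited once per day.
uniform = [avg_freshness(lam, 1.0) for lam in rates]

# Proportional: revisit frequency proportional to each page's change rate.
intervals = [sum(rates) / (budget * lam) for lam in rates]
proportional = [avg_freshness(lam, i) for lam, i in zip(rates, intervals)]

print(f"uniform:      {sum(uniform) / len(rates):.3f}")       # ~0.561
print(f"proportional: {sum(proportional) / len(rates):.3f}")  # ~0.264
```

The proportional policy spends most of its budget on the fast-changing page, whose copy goes stale almost immediately anyway, which is exactly the waste the card describes.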
How do search engines use multiple queues for scheduling page re-visits?
Use 2 or 3 queues, each with a different turnaround time.
Ex: one queue for news sites that is refreshed several times per day, and a daily or weekly queue for popular or relevant sites (a minimal scheduler sketch follows below)
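A minimal sketch of such a scheduler; the queue names and turnaround intervals are illustrative:

```python
import time
from collections import deque

# One queue per refresh tier; intervals are in seconds.
QUEUES = {
    "news":   {"interval": 4 * 3600,      "urls": deque()},  # several times a day
    "daily":  {"interval": 24 * 3600,     "urls": deque()},  # popular/relevant sites
    "weekly": {"interval": 7 * 24 * 3600, "urls": deque()},  # everything else
}

def due_urls(last_crawled, now=None):
    """Yield URLs whose queue's revisit interval has elapsed.
    `last_crawled` maps each URL to its last crawl timestamp."""
    if now is None:
        now = time.time()
    for queue in QUEUES.values():
        for url in queue["urls"]:
            if now - last_crawled.get(url, 0.0) >= queue["interval"]:
                yield url
```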
What are the 3 basic rules for a web crawler in terms of politeness?
- A web crawler must identify itself as such and not pretend to be a regular web user
- A web crawler must obey the robots exclusion protocol (robots.txt)
- A web crawler must keep its bandwidth usage low on any given website
How can a web crawler identify itself as such to a web server?
Using the HTTP User-Agent header field, which identifies who is issuing the request.
For a web crawler, the field should include the address of a page with information about the crawler and contact info (see the example below)
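For example, with Python's standard library (the bot name, info URL, and contact address are made up):

```python
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={
        # Identify the crawler and point to a page about it, with contact info.
        "User-Agent": "ExampleBot/1.0 (+https://example.com/bot-info; admin@example.com)"
    },
)
with urllib.request.urlopen(req) as response:
    html = response.read()
```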
What are the 3 types of exclusion involved in the robots exclusion protocol?
- Server-wide exclusion (the robots.txt file)
- Page-wise exclusion (robots meta tags in the HTML)
- Cache exclusion (the noarchive directive); see the sketch below
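The server-wide case can be checked with Python's standard library; the bot name and URLs below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Server-wide exclusion: download and obey robots.txt before crawling.
# Page-wise exclusion is expressed with <meta name="robots"> tags inside
# each page, and cache exclusion with the "noarchive" directive.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

if rp.can_fetch("ExampleBot", "https://example.com/private/page.html"):
    pass  # allowed by robots.txt: safe to download this page
```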