Web Crawling Flashcards

1
Q

What is the cycle of the web crawling process?

A
  1. The crawler downloads a set of seed pages, which are parsed and scanned for new links
  2. Links that have not yet been downloaded are added to a central queue for later download
  3. The crawler selects a new page to download, and the process repeats until a stop criterion is met
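The cycle above can be sketched as a small simulation; the `WEB` dict and its URLs are made up, standing in for real HTTP fetching and HTML parsing:

```python
from collections import deque

# A tiny in-memory "web": page URL -> list of outgoing links.
# This stands in for real downloading and link extraction.
WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["seed"],
}

def crawl(seeds, max_pages=100):
    """Download seeds, scan for links, enqueue unseen URLs, repeat."""
    queue = deque(seeds)          # central queue of URLs to download
    seen = set(seeds)             # URLs already queued or downloaded
    downloaded = []
    while queue and len(downloaded) < max_pages:  # stop criterion
        url = queue.popleft()                     # select the next page
        links = WEB.get(url, [])                  # "download and parse" it
        downloaded.append(url)
        for link in links:                        # scan for new links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded

print(crawl(["seed"]))  # → ['seed', 'a', 'b', 'c']
```

Here the stop criterion is a page budget; real crawlers also stop on time limits or queue exhaustion.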
2
Q

What 2 things must a web crawler be?

A
  1. Be robust: immune to traps and malicious behaviour from web servers
  2. Be polite: respect implicit and explicit politeness considerations
3
Q

What is the difference between explicit and implicit politeness?

A

Explicit politeness is following specifications from webmasters (e.g. robots.txt) on what portions of a site can be crawled

Implicit politeness is following unstated conventions, such as not hitting a site too often

4
Q

What is robots.txt?

A

A file that lives on a web server specifying access restrictions to tell crawlers what can and cannot be crawled
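Python's standard library can interpret such a file; the robots.txt contents, crawler name, and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; a real crawler fetches it from
# http://<host>/robots.txt before crawling that host.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # → True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # → False
```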

5
Q

What things should a web crawler do?

A
  1. Be capable of running on distributed machines
  2. Be scalable: adding more machines should increase the crawl rate
  3. Fully use available processing and network resources for efficiency
  4. Fetch pages of higher quality first
  5. Keep fetching fresh copies of previously fetched pages
  6. Be able to adapt to new data formats and protocols
6
Q

What is the difference between a general web search and a vertical web search?

A

General web search is the kind done by large search engines over the whole web. Vertical web search is when the set of target pages is delineated by a topic, a country, or a language

7
Q

What is the coverage and quality balance?

A

Crawlers must balance coverage (having many pages, so that many queries can be answered) against quality (preferring pages that are of high quality)

8
Q

What is a vertical web crawler?

A

A web crawler focused on a particular subset of the web, which may be defined geographically, linguistically, topically, or by data type

9
Q

How can a web crawler be used to analyze a web site?

A
  1. Scanning for broken links
  2. Validating code of pages
  3. Analyzing directories by finding sites that are no longer available
  4. Looking for security issues in structure, code, or scripts
10
Q

What are the 3 axes by which we classify crawlers?

A
  1. Quality: trustworthiness, veracity
  2. Freshness: volatility and newness of info
  3. Volume: amount of data being analyzed
11
Q

What are the 3 main modules of the web crawler?

A
  1. Scheduler: maintains the queue of URLs to visit
  2. Downloader: downloads pages
  3. Storage: indexes pages and provides the scheduler with metadata about the pages received
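The three modules might be wired together as in this minimal skeleton; all class and method names are illustrative, not from any particular crawler:

```python
class Scheduler:
    """Maintains the queue of URLs to visit."""
    def __init__(self, seeds):
        self.queue = list(seeds)
    def next_url(self):
        return self.queue.pop(0) if self.queue else None
    def enqueue(self, url):
        self.queue.append(url)

class Downloader:
    """Downloads pages (faked here; real code would issue HTTP requests)."""
    def fetch(self, url):
        return f"<html>contents of {url}</html>"

class Storage:
    """Indexes pages and hands metadata back to the scheduler."""
    def __init__(self):
        self.index = {}                          # URL -> page body
    def store(self, url, body):
        self.index[url] = body
        return {"url": url, "size": len(body)}   # metadata for the scheduler

scheduler = Scheduler(["http://example.com/"])
downloader, storage = Downloader(), Storage()
url = scheduler.next_url()
meta = storage.store(url, downloader.fetch(url))
print(meta["url"])  # → http://example.com/
```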
12
Q

What is the difference between long-term and short-term scheduling?

A

Long-term scheduling decides which pages to visit next
Short-term scheduling rearranges the queued pages to satisfy politeness requirements

13
Q

How is short term scheduling implemented?

A

The scheduler maintains several queues (one per site), each listing the pages to download from that site, to ensure the same site is not visited too often
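A sketch of such a short-term scheduler, assuming a fixed per-host politeness delay (the 30-second interval and host names are arbitrary choices for illustration):

```python
import heapq
from collections import deque

def schedule(urls, delay=30.0):
    """Interleave downloads so each host is hit at most once per `delay`
    seconds. Takes (host, path) pairs; returns (time, url) pairs in
    download order."""
    queues = {}                       # one queue per site
    for host, path in urls:
        queues.setdefault(host, deque()).append(path)
    # Min-heap of (next allowed download time, host).
    heap = [(0.0, host) for host in queues]
    heapq.heapify(heap)
    plan = []
    while heap:
        t, host = heapq.heappop(heap)
        path = queues[host].popleft()
        plan.append((t, host + path))
        if queues[host]:              # re-arm the host after the delay
            heapq.heappush(heap, (t + delay, host))
    return plan

plan = schedule([("a.com", "/1"), ("a.com", "/2"), ("b.com", "/1")])
print(plan)  # → [(0.0, 'a.com/1'), (0.0, 'b.com/1'), (30.0, 'a.com/2')]
```

Note how the second `a.com` page is pushed back 30 seconds while `b.com` is fetched in between.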

14
Q

How do we handle the fact that there are infinitely many dynamic web pages and we can’t expect our crawler to download them?

A

We choose a maximum depth of links to follow
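A depth limit can be enforced by tracking how many hops each URL is from the seed. The page graph below is fabricated, with a "gen" chain mimicking endlessly generated dynamic pages:

```python
from collections import deque

# In-memory link graph standing in for the web; the genN chain would
# continue indefinitely on a real dynamically generated site.
PAGES = {
    "home": ["gen1"],
    "gen1": ["gen2"],
    "gen2": ["gen3"],
    "gen3": ["gen4"],
}

def crawl(seed, max_depth=2):
    """Follow links only up to `max_depth` hops from the seed."""
    queue = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:        # stop descending past the limit
            continue
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl("home"))  # → ['home', 'gen1', 'gen2']
```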

15
Q

What is the difference between intentional and unintentional page duplication?

A

Intentional duplication is the mirroring of other pages, e.g. for load balancing or for providing other languages

Unintentional duplication arises from the way a site is built, such as the same page being reachable through URLs with different query parameters
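One common defense against unintentional duplicates is canonicalizing URLs before queueing them. The ignored-parameter list and URLs below are made-up examples:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical query parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "sessionid"}

def canonicalize(url):
    """Normalize a URL so trivially different variants compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in IGNORED_PARAMS]
    query.sort()                      # parameter order is irrelevant
    path = parts.path or "/"
    base = f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"
    return base + (f"?{urlencode(query)}" if query else "")

a = canonicalize("HTTP://Example.com/news?id=7&utm_source=mail")
b = canonicalize("http://example.com/news?id=7")
print(a == b)  # → True
```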

16
Q

What is an assignment function?

A

A function that decides which process in a distributed crawler should download a given URL

17
Q

What are the 3 properties of an effective assignment function according to Boldi et al?

A
  1. Balancing property: Each crawling process should have around the same number of hosts
  2. Contra-variance property: If the number of crawling processes grows, the number of hosts assigned to each process must shrink
  3. Dynamic property: Assignment must be able to add and remove crawling processes dynamically
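A minimal assignment function can hash the host name, which roughly satisfies the balancing and contra-variance properties on average; a plain modulo handles the dynamic property poorly, since changing the number of processes reassigns most hosts (consistent hashing is the usual remedy). The sketch below is illustrative only:

```python
import hashlib
from urllib.parse import urlsplit

def assign(url, num_processes):
    """Map a URL's host to one of `num_processes` crawling processes.
    Hashing the host (not the full URL) keeps each site on one process,
    which also simplifies politeness enforcement."""
    host = urlsplit(url).netloc
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % num_processes

# The same host always lands on the same process...
print(assign("http://example.com/a", 4) == assign("http://example.com/b", 4))  # → True
```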