Web Crawling Flashcards

1
Q

What is the cycle of the web crawling process?

A
  1. The crawler downloads a set of seed pages, which are parsed and scanned for new links
  2. Links that have not yet been downloaded are added to a central queue for later download
  3. The crawler selects a new page to download, and the process repeats until a stop criterion is met
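The cycle above can be sketched as a small simulation; the `WEB` dict and its URLs are made up, standing in for real HTTP fetching and HTML parsing:

```python
from collections import deque

# A tiny in-memory "web": page URL -> list of outgoing links.
# This stands in for real downloading and link extraction.
WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["seed"],
}

def crawl(seeds, max_pages=100):
    """Download seeds, scan for links, enqueue unseen URLs, repeat."""
    queue = deque(seeds)          # central queue of URLs to download
    seen = set(seeds)             # URLs already queued or downloaded
    downloaded = []
    while queue and len(downloaded) < max_pages:  # stop criterion
        url = queue.popleft()                     # select the next page
        links = WEB.get(url, [])                  # "download and parse" it
        downloaded.append(url)
        for link in links:                        # scan for new links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded

print(crawl(["seed"]))  # → ['seed', 'a', 'b', 'c']
```

Here the stop criterion is a page budget; real crawlers also stop on time limits or queue exhaustion.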
2
Q

What 2 things must a web crawler be?

A
  1. Be robust: immune to traps and malicious behaviour from web servers
  2. Be polite: respect implicit and explicit politeness considerations
3
Q

What is the difference between explicit and implicit politeness?

A

Explicit politeness is following specifications from webmasters (e.g. robots.txt) on what portions of a site can be crawled

Implicit politeness is following unstated conventions, such as not hitting a site too often

4
Q

What is robots.txt?

A

A file that lives on a web server specifying access restrictions to tell crawlers what can and cannot be crawled
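Python's standard library can interpret such a file; the robots.txt contents, crawler name, and URLs below are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; a real crawler fetches it from
# http://<host>/robots.txt before crawling that host.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # → True
print(rp.can_fetch("MyCrawler", "http://example.com/private/x"))   # → False
```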

5
Q

What things should a web crawler do?

A
  1. Be capable of running on distributed machines
  2. Be scalable: adding more machines should increase the crawl rate
  3. Fully use available processing and network resources for efficiency
  4. Fetch pages of higher quality first
  5. Keep fetching fresh copies of previously fetched pages
  6. Be able to adapt to new data formats and protocols
6
Q

What is the difference between a general web search and a vertical web search?

A

General web search is the kind done by large search engines over the whole web. Vertical web search is when the set of target pages is delineated by a topic, a country, or a language

7
Q

What is the coverage and quality balance?

A

Crawlers must balance coverage (having many pages, so that many queries can be answered) against quality (preferring pages that are of high quality)

8
Q

What is a vertical web crawler?

A

A web crawler focused on a particular subset of the web, which may be defined geographically, linguistically, topically, or by data type

9
Q

How can a web crawler be used to analyze a web site?

A
  1. Scanning for broken links
  2. Validating code of pages
  3. Analyzing directories by finding sites that are no longer available
  4. Looking for security issues in structure, code, or scripts
10
Q

What are the 3 axes by which we classify crawlers?

A
  1. Quality: trustworthiness, veracity
  2. Freshness: volatility and newness of info
  3. Volume: amount of data being analyzed
11
Q

What are the 3 main modules of the web crawler?

A
  1. Scheduler: maintains the queue of URLs to visit
  2. Downloader: downloads pages
  3. Storage: indexes pages and provides the scheduler with metadata about the pages received
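The three modules might be wired together as in this minimal skeleton; all class and method names are illustrative, not from any particular crawler:

```python
class Scheduler:
    """Maintains the queue of URLs to visit."""
    def __init__(self, seeds):
        self.queue = list(seeds)
    def next_url(self):
        return self.queue.pop(0) if self.queue else None
    def enqueue(self, url):
        self.queue.append(url)

class Downloader:
    """Downloads pages (faked here; real code would issue HTTP requests)."""
    def fetch(self, url):
        return f"<html>contents of {url}</html>"

class Storage:
    """Indexes pages and hands metadata back to the scheduler."""
    def __init__(self):
        self.index = {}                          # URL -> page body
    def store(self, url, body):
        self.index[url] = body
        return {"url": url, "size": len(body)}   # metadata for the scheduler

scheduler = Scheduler(["http://example.com/"])
downloader, storage = Downloader(), Storage()
url = scheduler.next_url()
meta = storage.store(url, downloader.fetch(url))
print(meta["url"])  # → http://example.com/
```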
12
Q

What is the difference between long-term and short-term scheduling?

A

Long-term scheduling decides which pages to visit next
Short-term scheduling rearranges the queued pages to satisfy politeness requirements

13
Q

How is short term scheduling implemented?

A

The scheduler maintains several queues (one per site), each listing the pages to download from that site, to ensure the same site is not visited too often
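A sketch of such a short-term scheduler, assuming a fixed per-host politeness delay (the 30-second interval and host names are arbitrary choices for illustration):

```python
import heapq
from collections import deque

def schedule(urls, delay=30.0):
    """Interleave downloads so each host is hit at most once per `delay`
    seconds. Takes (host, path) pairs; returns (time, url) pairs in
    download order."""
    queues = {}                       # one queue per site
    for host, path in urls:
        queues.setdefault(host, deque()).append(path)
    # Min-heap of (next allowed download time, host).
    heap = [(0.0, host) for host in queues]
    heapq.heapify(heap)
    plan = []
    while heap:
        t, host = heapq.heappop(heap)
        path = queues[host].popleft()
        plan.append((t, host + path))
        if queues[host]:              # re-arm the host after the delay
            heapq.heappush(heap, (t + delay, host))
    return plan

plan = schedule([("a.com", "/1"), ("a.com", "/2"), ("b.com", "/1")])
print(plan)  # → [(0.0, 'a.com/1'), (0.0, 'b.com/1'), (30.0, 'a.com/2')]
```

Note how the second `a.com` page is pushed back 30 seconds while `b.com` is fetched in between.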

14
Q

How do we handle the fact that there are infinitely many dynamic web pages and we can’t expect our crawler to download them?

A

We choose a maximum depth of links to follow
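A depth limit can be enforced by tracking how many hops each URL is from the seed. The page graph below is fabricated, with a "gen" chain mimicking endlessly generated dynamic pages:

```python
from collections import deque

# In-memory link graph standing in for the web; the genN chain would
# continue indefinitely on a real dynamically generated site.
PAGES = {
    "home": ["gen1"],
    "gen1": ["gen2"],
    "gen2": ["gen3"],
    "gen3": ["gen4"],
}

def crawl(seed, max_depth=2):
    """Follow links only up to `max_depth` hops from the seed."""
    queue = deque([(seed, 0)])
    seen = {seed}
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:        # stop descending past the limit
            continue
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl("home"))  # → ['home', 'gen1', 'gen2']
```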

15
Q

What is the difference between intentional and unintentional page duplication?

A

Intentional duplication is the mirroring of other pages, e.g. for load balancing or for providing other languages

Unintentional duplication arises from the way a site is built, such as the same page being reachable through URLs with different query parameters
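One common defense against unintentional duplicates is canonicalizing URLs before queueing them. The ignored-parameter list and URLs below are made-up examples:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical query parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "sessionid"}

def canonicalize(url):
    """Normalize a URL so trivially different variants compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in IGNORED_PARAMS]
    query.sort()                      # parameter order is irrelevant
    path = parts.path or "/"
    base = f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"
    return base + (f"?{urlencode(query)}" if query else "")

a = canonicalize("HTTP://Example.com/news?id=7&utm_source=mail")
b = canonicalize("http://example.com/news?id=7")
print(a == b)  # → True
```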

16
Q

What is an assignment function?

A

A function that decides which process in a distributed crawler should download a given URL

17
Q

What are the 3 properties of an effective assignment function according to Boldi et al?

A
  1. Balancing property: Each crawling process should have around the same number of hosts
  2. Contra-variance property: If the number of crawling processes grows, the number of hosts assigned to each process must shrink
  3. Dynamic property: Assignment must be able to add and remove crawling processes dynamically
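A minimal assignment function can hash the host name, which roughly satisfies the balancing and contra-variance properties on average; a plain modulo handles the dynamic property poorly, since changing the number of processes reassigns most hosts (consistent hashing is the usual remedy). The sketch below is illustrative only:

```python
import hashlib
from urllib.parse import urlsplit

def assign(url, num_processes):
    """Map a URL's host to one of `num_processes` crawling processes.
    Hashing the host (not the full URL) keeps each site on one process,
    which also simplifies politeness enforcement."""
    host = urlsplit(url).netloc
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % num_processes

# The same host always lands on the same process...
print(assign("http://example.com/a", 4) == assign("http://example.com/b", 4))  # → True
```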