Lecture 5: Web Crawling Flashcards

1
Q

What is a web crawler?

A

An autonomous data collection script

2
Q

Why are web crawlers used?

A
  • Discover and index pages
  • Perform market/sentiment analysis
  • Gather intelligence
3
Q

What is web scraping?

A

Typically performed over a set of known pages

Used to extract data into a condensed format
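
As an illustration of extracting data from known pages into a condensed format, here is a minimal sketch. The page contents, the URLs, the `name` and `price` class names, and the helpers `ProductParser` and `scrape` are all hypothetical; the pages are held in a dictionary rather than fetched over HTTP so the example runs on its own.

```python
from html.parser import HTMLParser

# Hypothetical known pages that all share one known layout.
KNOWN_PAGES = {
    "http://shop.test/item/1": '<h1 class="name">Kettle</h1><span class="price">19.99</span>',
    "http://shop.test/item/2": '<h1 class="name">Toaster</h1><span class="price">24.50</span>',
}


class ProductParser(HTMLParser):
    """Collects the text of elements whose class is 'name' or 'price'."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.record[self._field] = data.strip()
            self._field = None


def scrape(pages):
    """Turn each known page into one condensed row."""
    rows = []
    for url, html in pages.items():
        parser = ProductParser()
        parser.feed(html)
        rows.append({"url": url, **parser.record})
    return rows


rows = scrape(KNOWN_PAGES)
```

Because the layout is known in advance, the scraper only looks for the fields it expects and ignores everything else on the page.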

4
Q

What is the difference between web crawlers and web scrapers?

A

A crawler is interested in finding new pages

A scraper is interested in extracting data from existing, known-format pages

5
Q

What is the web crawling loop?

A
  1. Fetch the contents of a web page at a URL
  2. Comb through the contents and extract additional URLs
  3. Add the extracted URLs to a list of URLs to be crawled (to-crawl)
  4. Add the current URL to the list of completed URLs
  5. Pick a new URL from the to-crawl list and return to step 1
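
The five steps above can be sketched in a few lines of Python. This is only one possible shape for the loop, under stated assumptions: `PAGES` is a hypothetical in-memory stand-in for the web, `fetch` is a hypothetical helper that reads from it instead of making HTTP requests, and `max_pages` is an added safety limit that is not part of the original steps. Picking the oldest URL first is also just one choice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML (stands in for real HTTP fetches).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Hypothetical fetch: look the page up instead of using the network."""
    return PAGES.get(url, "")


def crawl(start_url, max_pages=100):
    to_crawl = [start_url]
    completed = []
    while to_crawl and len(completed) < max_pages:
        url = to_crawl.pop(0)                 # step 5: pick the next URL
        if url in completed:
            continue
        html = fetch(url)                     # step 1: fetch the page
        parser = LinkExtractor()
        parser.feed(html)                     # step 2: extract URLs
        for href in parser.links:
            absolute = urljoin(url, href)     # resolve relative links
            if absolute not in completed and absolute not in to_crawl:
                to_crawl.append(absolute)     # step 3: add to to-crawl
        completed.append(url)                 # step 4: mark as completed
    return completed


visited = crawl("http://example.test/")
```

Checking both lists before adding a URL keeps the crawler from revisiting pages that link back to each other.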
6
Q

What are the methodologies for picking the next URL to crawl?

A
  • Breadth first: Crawl every link found on the current page before moving on to links found on those pages
  • Depth first: Always crawl the most recently discovered link first, rapidly getting further from your start point
  • Alcohol first: Drunkenly stagger around from URL to URL with no discernible pattern (effectively random selection)
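
One way to see the difference between these strategies is that they differ only in which end of the to-crawl list the next URL is taken from. The sketch below assumes a hypothetical link graph `LINKS` and a hypothetical helper `crawl_order`; the strategy names `"breadth"`, `"depth"` and `"random"` are labels chosen for this example, with `"random"` standing in for "alcohol first".

```python
import random
from collections import deque

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def crawl_order(start, strategy, rng=None):
    """Return the order in which pages are visited under a strategy."""
    rng = rng or random.Random()
    frontier = deque([start])   # the to-crawl list
    seen = {start}
    order = []
    while frontier:
        if strategy == "breadth":
            url = frontier.popleft()          # oldest URL first (queue)
        elif strategy == "depth":
            url = frontier.pop()              # newest URL first (stack)
        elif strategy == "random":
            index = rng.randrange(len(frontier))
            url = frontier[index]             # any URL, no pattern
            del frontier[index]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


breadth = crawl_order("A", "breadth")
depth = crawl_order("A", "depth")
```

With this graph, breadth first visits both of A's links before going deeper, while depth first follows one branch to its end before returning to the other.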