Lecture 5: Web Crawling Flashcards

Question 1

Q

What is a web crawler?

Answer

A

An autonomous data collection script

Question 2

Q

Why are web crawlers used?

Answer

A

Question 3

Q

What is web scraping?

Answer

A

Typically performed over a set of known pages

Used to extract data into a condensed format

Question 4

Q

What is the difference between a web crawler and a web scraper?

Answer

A

A crawler is interested in finding new pages

A scraper is interested in extracting data from existing, known-format pages
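
A minimal sketch of the distinction, using only the Python standard library. The page content, the class names (LinkFinder, ProductScraper), and the "title"/"price" markup are hypothetical and exist only for illustration. The crawler-style pass collects links to pages it has not seen yet, while the scraper-style pass pulls fixed fields out of a layout it already knows.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real crawler or scraper would fetch this over HTTP.
PAGE = """
<html><body>
  <h1 class="title">Blue Widget</h1>
  <span class="price">9.99</span>
  <a href="/widgets/red">Red Widget</a>
  <a href="/widgets/green">Green Widget</a>
</body></html>
"""


class LinkFinder(HTMLParser):
    """Crawler-style pass: collect outgoing links, ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class ProductScraper(HTMLParser):
    """Scraper-style pass: pull known fields out of a known page layout."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # name of the field whose text we expect next

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.record[self._field] = data.strip()
            self._field = None


def find_links(html):
    finder = LinkFinder()
    finder.feed(html)
    return finder.links


def scrape_product(html):
    scraper = ProductScraper()
    scraper.feed(html)
    return scraper.record


print(find_links(PAGE))      # ['/widgets/red', '/widgets/green']
print(scrape_product(PAGE))  # {'title': 'Blue Widget', 'price': '9.99'}
```

The scraper only works because the page format is known in advance. The link finder makes no such assumption, which is why it suits discovering new pages.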

Question 5

Q

What is the web crawling loop?

Answer

A

Question 6

Q

What are the methodologies for picking the next URL to crawl?

Answer

A

Breadth first: Crawl every link found on the current page before following the links found on those pages
Depth first: Always crawl the most recently discovered link next, rapidly getting further from your start point
Alcohol first: Drunkenly stagger from URL to URL with no discernible pattern (in other words, pick the next URL at random)
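
The three strategies above differ only in which URL is taken from the frontier (the list of discovered but not yet crawled URLs). The following is a rough sketch, assuming a small hypothetical link graph (LINKS) in place of real HTTP fetches. The function name crawl and the strategy labels are illustrative, not from the lecture.

```python
import random
from collections import deque

# Hypothetical link graph standing in for the web: page -> links found on it.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def crawl(start, strategy, rng=None):
    """Return the order in which pages are visited under the given strategy."""
    rng = rng or random.Random()
    frontier = deque([start])  # discovered but not yet crawled
    seen = {start}             # avoid queueing the same URL twice
    order = []

    while frontier:
        if strategy == "breadth":
            url = frontier.popleft()  # oldest discovery first (queue)
        elif strategy == "depth":
            url = frontier.pop()      # newest discovery first (stack)
        elif strategy == "random":
            i = rng.randrange(len(frontier))  # the card's "alcohol first"
            url = frontier[i]
            del frontier[i]
        else:
            raise ValueError(f"unknown strategy: {strategy}")

        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

    return order


print(crawl("A", "breadth"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(crawl("A", "depth"))    # ['A', 'C', 'F', 'B', 'E', 'D']
print(crawl("A", "random"))   # order varies from run to run
```

Breadth first finishes everything one link away from A before going deeper, while depth first reaches F (two links away) before it ever visits B. Because the stack pops the newest entry, this depth-first variant follows the last link on each page first.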

(6 cards)