Lecture 5: Web Crawling Flashcards
1
Q
What is a web crawler?
A
An autonomous data collection script
2
Q
Why are web crawlers used?
A
- Discover and index pages
- Perform market/sentiment analysis
- Gather intelligence
3
Q
What is web scraping?
A
- Typically performed over a set of known pages
- Used to extract data into a condensed format
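As a rough illustration of extracting data from known pages into a condensed format, here is a minimal sketch using only the Python standard library. The page contents, the `name` and `price` class names, and the `shop.test` URLs are all hypothetical and stand in for whatever fixed layout a real set of known pages would have.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Captures the text of elements whose class is 'name' or 'price'."""

    FIELDS = {"name", "price"}  # hypothetical class names of a known page layout

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field whose text we expect next, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.record[self._current] = data.strip()
            self._current = None


def scrape(pages):
    """Turn {url: html} for known-format pages into a list of flat records."""
    rows = []
    for url, html in pages.items():
        parser = ProductParser()
        parser.feed(html)
        rows.append({"url": url, **parser.record})
    return rows


# Hypothetical stand-in for pages that would normally be fetched over HTTP.
KNOWN_PAGES = {
    "http://shop.test/1": '<h1 class="name">Kettle</h1><span class="price">19.99</span>',
    "http://shop.test/2": '<h1 class="name">Toaster</h1><span class="price">24.50</span>',
}

rows = scrape(KNOWN_PAGES)
```

The scraper never looks for new links: it only works because the layout of the pages is known in advance.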
4
Q
What is the difference between web crawlers and web scraping?
A
A crawler is interested in finding new pages
A scraper is interested in extracting data from existing, known-format pages
5
Q
What is the web crawling loop?
A
- Fetch the contents of a web page at a URL
- Comb through the contents and extract additional URLs
- Add the extracted URLs to a list of URLs to be crawled (to-crawl)
- Add the current URL to the list of completed URLs
- Pick a new URL from the to-crawl list and return to step 1
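The loop above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a production crawler: the `fetch` callable, the `max_pages` limit, the `seen` set used to avoid re-queuing URLs, and the in-memory `FAKE_WEB` pages are all illustrative additions. A real crawler would fetch over HTTP and respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Run the crawling loop; `fetch` maps a URL to its HTML text."""
    to_crawl = deque([start_url])
    completed = []
    seen = {start_url}  # assumption: each URL is queued at most once
    while to_crawl and len(completed) < max_pages:
        url = to_crawl.popleft()            # pick a URL from the to-crawl list
        html = fetch(url)                   # fetch the page contents
        parser = LinkExtractor()
        parser.feed(html)                   # comb through for additional URLs
        for href in parser.links:
            link = urljoin(url, href)       # resolve relative links
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)       # add to the to-crawl list
        completed.append(url)               # add to the completed list
    return completed


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
    "http://example.test/c": "",
}


def fake_fetch(url):
    return FAKE_WEB.get(url, "")


visited = crawl("http://example.test/", fake_fetch)
```

Here `visited` ends up listing the four fake pages in the order they were crawled, starting from the home page.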
6
Q
What are the methodologies for picking the next URL to crawl?
A
- Breadth first: Crawl every link found on the current page before crawling any links found on those linked pages
- Depth first: Always crawl the most recently discovered link first, rapidly getting further from your start point
- Alcohol first: Drunkenly stagger around from URL-to-URL with no discernible pattern
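One way to see the difference between these methodologies is that they share the same loop and differ only in which end of the to-crawl list the next URL is taken from. The sketch below assumes a tiny hypothetical link graph (`LINKS`) in place of real pages, and models the "alcohol first" approach as a uniformly random pick. The function name `crawl_order` and the strategy labels are illustrative.

```python
import random
from collections import deque

# Hypothetical link graph: each key is a page, each value the links on that page.
LINKS = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def crawl_order(start, links, strategy, rng=None):
    """Return the order in which pages are crawled under a given strategy."""
    rng = rng or random.Random()
    frontier = deque([start])  # the to-crawl list
    seen = {start}
    order = []
    while frontier:
        if strategy == "breadth":
            url = frontier.popleft()          # oldest discovered link first (queue)
        elif strategy == "depth":
            url = frontier.pop()              # newest discovered link first (stack)
        elif strategy == "random":
            idx = rng.randrange(len(frontier))
            url = frontier[idx]               # no discernible pattern
            del frontier[idx]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


breadth = crawl_order("start", LINKS, "breadth")
depth = crawl_order("start", LINKS, "depth")
```

With this graph, `breadth` is `['start', 'a', 'b', 'a1', 'a2', 'b1']`, while `depth` is `['start', 'b', 'b1', 'a', 'a2', 'a1']`: the depth-first run dives into `b` and its child before ever returning to `a`.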