Web Scraping/Crawling Flashcards
What type of request is used to fetch the content of a web page from a web server?
GET
What standard Python library contains functions for requesting data across the web, handling cookies, and even changing request metadata such as headers and the user agent?
urllib
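Example (a minimal sketch, not the course's code): the two cards above can be tied together by building a GET request with urllib.request. The URL, the user-agent string, and the helper names build_request and fetch are assumptions made up for illustration. Building the request needs no network access; calling fetch does.

```python
import urllib.request


def build_request(url, user_agent="flashcard-example/0.1"):
    """Build a request object with a custom User-Agent header (hypothetical value)."""
    # With no data argument, urllib sends this as a GET request.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Send the GET request and return the decoded page body (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Inspect the request without sending it.
request = build_request("https://example.com/")
print(request.get_method())
```

Calling fetch("https://example.com/") would return the page's HTML as a string, which can then be handed to a parser.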
What import statement was used in our Lab and Group Project to import BeautifulSoup?
from bs4 import BeautifulSoup
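Example (a small sketch, assuming the third-party beautifulsoup4 package is installed): once imported, BeautifulSoup can parse an HTML string and look up tags by name or attribute. The HTML below is invented for illustration and is not from the lab.

```python
from bs4 import BeautifulSoup

# Made-up HTML; in practice this string would come from a fetched page.
html = (
    "<html><head><title>Sample Page</title></head>"
    "<body><h1>Hello</h1><p class='intro'>First paragraph.</p></body></html>"
)

# "html.parser" is Python's built-in parser, so no extra parser install is needed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()
intro_text = soup.find("p", class_="intro").get_text()

print(title_text)
print(intro_text)
```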
What Wikipedia page did you webscrape in Lab 8?
1999-2000 FA Premier League - Wikipedia
What two fields (columns) required you to parse the text within the <a>…</a> tags?
Manager & Captain
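Example (a hedged sketch, not the Lab 8 solution): when a table cell wraps its value in a link, the visible text sits inside the <a> tag. The table below uses placeholder names rather than the real Wikipedia data, and link_text is a hypothetical helper. It assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Placeholder table; the team, manager, and captain names are invented.
html = """
<table>
  <tr><th>Team</th><th>Manager</th><th>Captain</th></tr>
  <tr>
    <td>Team A</td>
    <td><a href="/wiki/Manager_A">Manager A</a></td>
    <td><a href="/wiki/Captain_A">Captain A</a></td>
  </tr>
  <tr>
    <td>Team B</td>
    <td><a href="/wiki/Manager_B">Manager B</a></td>
    <td><a href="/wiki/Captain_B">Captain B</a></td>
  </tr>
</table>
"""


def link_text(cell):
    """Return the text of the cell's first <a> tag, or the cell's own text if it has no link."""
    link = cell.find("a")
    source = link if link is not None else cell
    return source.get_text(strip=True)


soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    cells = tr.find_all("td")
    rows.append({
        "team": cells[0].get_text(strip=True),
        "manager": link_text(cells[1]),
        "captain": link_text(cells[2]),
    })

for row in rows:
    print(row)
```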
When dealing with HTML elements mapped out as a tree, which elements are exactly one tag below a parent tag?
children
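Example (a sketch on invented HTML, assuming beautifulsoup4 is installed): BeautifulSoup exposes direct children through .children and everything nested below a tag through .descendants. Comparing the two shows what "exactly one tag below" means.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Invented snippet: the <b> tag is nested two levels below <ul>.
html = "<ul id='menu'><li>One</li><li>Two <b>bold</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("ul", id="menu")

# .children yields only nodes directly under <ul>; keep the tags, skip text nodes.
child_names = [node.name for node in menu.children if isinstance(node, Tag)]

# .descendants walks every nested level.
descendant_names = [node.name for node in menu.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```

The <b> tag shows up only in the descendants list because it is a child of the second <li>, not of the <ul>.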
What store was the subject of the web scraping program reviewed during this lesson?
Family Dollar
What is the latest version of HTML?
HTML5 (now maintained as the HTML Living Standard)
What is it called when code accesses a URL, examines that page for other URLs, and then retrieves those pages in a recursive process?
web crawling
What is web scraping?
An automated process of gathering large amounts of data from the Internet
Other names for web scraping:
- screen scraping
- data mining
- web harvesting
Why use web scraping?
- useful data is available on the web, but isn’t available via downloads or APIs
- price comparison info
- social media scraping (what’s trending?)
- research (stats, weather data, etc.)
Difference between web scraping vs web crawling:
Web scraping: generally inspects a single web page or two to extract the data of interest.
Web crawling: “crawls across the web,” following links from web page to web page recursively.
Applications for web crawling:
- generating a site map
- gathering data about a specific topic from a large number of websites
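Example (a minimal sketch of the recursive link-following described in these cards, used here to build a simple site map): to keep it runnable offline, the "website" is a hypothetical in-memory dictionary named PAGES, and LinkCollector and crawl are illustrative names. A real crawler would fetch each page over HTTP, resolve relative URLs, and respect the site's crawling rules.

```python
from html.parser import HTMLParser

# Hypothetical in-memory site: path -> HTML. Stands in for pages fetched over HTTP.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "/about": '<a href="/">Home</a>',
    "/products": '<a href="/products/item1">Item 1</a> <a href="/">Home</a>',
    "/products/item1": '<a href="/products">Back</a>',
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(path, site_map=None):
    """Recursively follow links from path, recording each page's outgoing links."""
    if site_map is None:
        site_map = {}
    # Skip pages already visited (prevents infinite loops) and unknown pages.
    if path in site_map or path not in PAGES:
        return site_map
    collector = LinkCollector()
    collector.feed(PAGES[path])
    site_map[path] = collector.links
    for link in collector.links:
        crawl(link, site_map)
    return site_map


site_map = crawl("/")
for page in sorted(site_map):
    print(page, "->", site_map[page])
```

Recording visited pages in site_map is what keeps the recursion from looping forever when pages link back to each other.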