Session 2 Flashcards
Is web scraping legal?
Web scraping falls in a legal gray area in both the EU and the USA, so the answer is: it depends
Almost any data source can be used in the context of a Data Mining project
- Data Mining is an exploratory process with uncertain outcomes
- Being able to collect data from different systems allows for fast prototyping and adjustments as needed
Data Mining is…
… an exploratory process with uncertain outcomes
A proper engineering solution should be deployed once…
the prototype demonstrates its merits
Four major ways of collecting data from online sources
1 Manually browsing a web site (copy & paste)
2 Manually downloading a file
3 Pretending you are a human browsing a web site (web scraping)
4 Using an Application Programming Interface (API)
Web scraping can be done using
- A modern programming language, which offers complete flexibility but requires more effort to implement; or
- Specialized tools, which allow faster implementation but provide less flexibility and make it harder to replicate data collection
Web scraping using a programming language
Many languages provide functionality for reading and writing data from web sites, just like a regular web browser
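As a minimal sketch of this idea in Python, using only the standard library: fetch_html requests a page much as a browser would, and extract_links pulls the link targets out of the returned HTML. The function names, the user-agent string, and the sample HTML are assumptions made for illustration, and fetch_html is defined but not called here so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url, user_agent="example-course-bot/0.1"):
    """Download a page with an HTTP GET request (hypothetical user agent)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up HTML standing in for a downloaded page.
sample = '<html><body><a href="/movies/1">One</a> <a href="/movies/2">Two</a></body></html>'
print(extract_links(sample))  # ['/movies/1', '/movies/2']
```

In a real project the string passed to extract_links would come from fetch_html(url), subject to the site's terms and its robots.txt.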
webscraper.io
is a more sophisticated tool that allows the user to select which elements of a web site are important and which links should be followed in order to gather more information
Potential issues with web scraping
- Many sites do not allow gathering information automatically
- Not all data are publicly available online
Ways to detect whether the site is being viewed by a human
- Detection of frequent requests
- Cookies
- Other trackers
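To make the first item concrete, here is a rough sketch of how a site might flag frequent requests with a sliding time window. The class name, the thresholds, and the timestamps are assumptions chosen for illustration, not a description of how any particular site works.

```python
from collections import deque


class RequestRateMonitor:
    """Flags a client that sends more than max_requests within window_seconds."""

    def __init__(self, max_requests=5, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one request; return True if the client now looks automated."""
        self._timestamps.append(timestamp)
        # Forget requests that have fallen out of the sliding window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_requests


# A client sending one request per second trips a limit of 3 per 10 seconds.
bot = RequestRateMonitor(max_requests=3, window_seconds=10.0)
bot_flags = [bot.record(t) for t in [0.0, 1.0, 2.0, 3.0]]
print(bot_flags)  # [False, False, False, True]

# A client pausing 20 seconds between requests is never flagged.
human = RequestRateMonitor(max_requests=3, window_seconds=10.0)
human_flags = [human.record(t) for t in [0.0, 20.0, 40.0, 60.0]]
print(human_flags)  # [False, False, False, False]
```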
Robots Exclusion Protocol
The robots exclusion protocol (or robots.txt protocol) is a way for a web site to tell crawlers and other web bots which parts of the site they may scrape automatically
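A short sketch of checking such instructions before scraping, using Python's standard urllib.robotparser module. The robots.txt content, the bot name, and the example.com URLs are made up for illustration; a real script would load the site's actual file, for example with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content: every bot is asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_allowed(url, user_agent="example-course-bot"):
    """Return True if the parsed rules permit this bot to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/movies/popular"))  # True
print(is_allowed("https://example.com/private/data"))    # False
```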
Potential issue with web scraping: Not all data are publicly available online
Workaround
Workarounds include using authentication to access the protected information or using API access
Application Programming Interfaces (APIs)
are protocols for interacting with specific web sites; they can be used by any registered user
Three steps for API access
- Get an API key
- Query an API endpoint using the API key
- Process the response
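The three steps above can be sketched as follows. The base URL, the movies/search endpoint, the api_key parameter name, and the JSON layout are all hypothetical; a real API documents its own. The response body is a hard-coded sample so the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Step 1: get an API key (placeholder; a real key comes from the provider).
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.example.com/v1"  # hypothetical API


def build_request_url(endpoint, api_key, **params):
    """Step 2: build the URL that queries an endpoint using the API key."""
    query = urlencode({"api_key": api_key, **params})
    return f"{BASE_URL}/{endpoint}?{query}"


def process_response(body):
    """Step 3: parse the JSON body and keep only the movie titles."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("results", [])]


url = build_request_url("movies/search", API_KEY, query="alien")
print(url)
# https://api.example.com/v1/movies/search?api_key=YOUR_API_KEY&query=alien

# Made-up response body in an assumed format.
sample_body = '{"results": [{"title": "Alien"}, {"title": "Aliens"}]}'
titles = process_response(sample_body)
print(titles)  # ['Alien', 'Aliens']
```

Many real APIs expect the key in a request header rather than in the query string; the query parameter is used here only to keep the sketch short.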
API key is…
like a valet key for the web
- Provides access to a limited set of functions
- Can be revoked by the issuer at any time
An API usually provides multiple endpoints or functions. Examples:
most recent movies, most popular movies, search movies by keyword
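One way to picture multiple endpoints is as different paths under a single base URL, one per function. The base URL, the paths, and the keyword parameter below are invented for illustration and do not belong to any real movie API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/v1"  # hypothetical API

# One path per function the API offers (all paths are assumptions).
ENDPOINTS = {
    "most_recent": "/movies/recent",
    "most_popular": "/movies/popular",
    "search": "/movies/search",
}


def endpoint_url(name, **params):
    """Return the full URL for a named endpoint, with optional query parameters."""
    url = BASE_URL + ENDPOINTS[name]
    if params:
        url += "?" + urlencode(params)
    return url


print(endpoint_url("most_popular"))
# https://api.example.com/v1/movies/popular
print(endpoint_url("search", keyword="alien"))
# https://api.example.com/v1/movies/search?keyword=alien
```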