Session 2 Flashcards

Question 1

Q

Is web scraping legal?

Answer

A

Web scraping falls into a legal gray area in both the EU and the USA.

So, it depends.

Question 2

Q

Almost any data source can be used in the context of a Data Mining project

Answer

A

Data Mining is an exploratory process with uncertain outcomes
Being able to collect data from different systems allows for fast prototyping and adjustments as needed

Question 3

Q

Data Mining is…

Answer

A

… an exploratory process with uncertain outcomes

Question 4

Q

A proper engineering solution should be deployed once…

Answer

A

the prototype demonstrates its merits

Question 5

Q

Four major ways of collecting data from online sources

Answer

A

1. Manually browsing a web site (copy & paste)
2. Manually downloading a file
3. Pretending you are a human browsing a web site (web scraping)
4. Using an Application Programming Interface (API)

Question 6

Q

Web scraping can be done using

Answer

A

A modern programming language, which offers complete flexibility but requires more effort to implement; or
Specialized tools, which allow faster implementation but provide less flexibility and make it harder to replicate data collection

Question 7

Q

Web scraping using a programming language

Answer

A

Many languages provide functionality for reading and writing data from web sites, just like a regular web browser
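
As an illustration, the sketch below uses only the Python standard library to read a page and pull out its links, much as a browser would when rendering it. The class name `LinkCollector`, the helper names, the `User-Agent` string, and the sample HTML are assumptions made up for this example; `fetch_html` needs network access and is not called here.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def fetch_html(url):
    """Download a page over HTTP (hypothetical helper, requires network access)."""
    request = Request(url, headers={"User-Agent": "flashcard-example/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Made-up page content, used instead of a live download
sample_page = '<html><body><a href="/movies">Movies</a> <a href="/about">About</a></body></html>'
print(extract_links(sample_page))  # ['/movies', '/about']
```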

Question 8

Q

webscraper.io

Answer

A

is a more sophisticated tool that allows the user to select which elements of a web site are important and which links should be followed in order to gather more information

Question 9

Q

Potential issues with web scraping

Answer

A

Many sites do not allow gathering information automatically
Not all data are publicly available online

Question 10

Q

Ways to detect whether the site is being viewed by a human

Answer

A

Detection of frequent requests
Cookies
Other trackers
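
Detection of frequent requests could, for example, be done by counting how many requests each client sends within a short time window. The sketch below is only one possible approach, not a description of how any particular site works; the class name `RequestRateMonitor`, the threshold, and the window length are assumptions chosen for illustration.

```python
from collections import deque


class RequestRateMonitor:
    """Flags a client that sends more than max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = {}  # client id -> deque of recent request times

    def looks_automated(self, client_id, now):
        """Record a request at time `now` (seconds) and report whether the rate is suspicious."""
        window = self._timestamps.setdefault(client_id, deque())
        window.append(now)
        # Drop requests that fall outside the sliding window
        while window and now - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_requests


monitor = RequestRateMonitor(max_requests=3, window_seconds=1.0)
flags = [monitor.looks_automated("client-a", t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]
print(flags)  # [False, False, False, True, False]
```

A human reader rarely requests four pages within a third of a second, so the fourth request is flagged; after a pause the counter resets.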

Question 11

Q

Robots Exclusion Protocol

Answer

A

The Robots Exclusion Protocol (or robots.txt protocol) is a way for a web site to tell crawlers and other web bots which parts of the site they may scrape automatically
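
A scraper can check these instructions before requesting a page. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt file; the rules, the crawler name `my-crawler`, and the `example.com` URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: everything is allowed except /private/
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules permit the request
print(parser.can_fetch("my-crawler", "https://example.com/movies"))        # True
print(parser.can_fetch("my-crawler", "https://example.com/private/data"))  # False
```

For a live site, `set_url()` followed by `read()` would download the real robots.txt instead of parsing a string.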

Question 12

Q

Potential issue with web scraping: Not all data are publicly available online

Workaround

Answer

A

Workarounds include using authentication to access the protected information or using API access

Question 13

Q

Application Programming Interfaces (APIs)

Answer

A

are protocols to interact with specific web sites that can be used by any registered user

Question 14

Q

Three steps for API access

Answer

A

Get an API key
Query an API endpoint using the API key
Process the response
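
The three steps can be sketched in Python as follows. Everything specific here is an assumption for illustration: the base URL `api.example.com`, the `api_key` and `query` parameter names, and the shape of the JSON response are invented, and real APIs differ in how they expect the key to be sent. `query_endpoint` needs network access and is not called in this sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Step 1: get an API key (placeholder value; issued by the API provider)
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.example.com/movies/search"  # hypothetical endpoint


def build_query_url(base_url, api_key, **params):
    """Step 2a: attach the API key and query parameters to the endpoint URL."""
    query = urlencode({"api_key": api_key, **params})
    return f"{base_url}?{query}"


def query_endpoint(url):
    """Step 2b: send the request and return the raw response body (requires network)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def process_response(body):
    """Step 3: parse the JSON body and keep only the movie titles."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("results", [])]


url = build_query_url(BASE_URL, API_KEY, query="matrix")
print(url)

# Made-up response body standing in for a live API call
sample_body = '{"results": [{"title": "The Matrix"}, {"title": "The Matrix Reloaded"}]}'
print(process_response(sample_body))  # ['The Matrix', 'The Matrix Reloaded']
```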

Question 15

Q

API key is…

Answer

A

like a valet key for the web

Provides access to a limited set of functions
Can be revoked by the issuer at any time

Question 16

Q

An API usually provides multiple endpoints or functions. Examples:

Answer

A

e.g., most recent movies, most popular movies, search movies by keyword

(16 cards)