Session 2 Flashcards

1
Q

Is web scraping legal?

A

Web scraping falls into a legal gray area in both the EU and the USA.

So, it depends.

2
Q

Almost any data source can be used in the context of a Data Mining project

A
  • Data Mining is an exploratory process with uncertain outcomes
  • Being able to collect data from different systems allows for fast prototyping and adjustments as needed
3
Q

Data Mining is…

A

… an exploratory process with uncertain outcomes

4
Q

A proper engineering solution should be deployed once…

A

the prototype demonstrates its merits

5
Q

Four major ways of collecting data from online sources

A

  1. Manually browsing a web site (copy & paste)
  2. Manually downloading a file
  3. Pretending you are a human browsing a web site (web scraping)
  4. Using an Application Programming Interface (API)

6
Q

Web scraping can be done using

A
  • A modern programming language, which offers complete flexibility but requires more effort to implement; or
  • Specialized tools, which allow faster implementation but provide less flexibility and make it harder to replicate the data collection
7
Q

Web scraping using a programming language

A

Many languages provide functionality for reading data from web sites and sending data to them, just like a regular web browser
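
As a minimal sketch of what this can look like, the Python snippet below pulls the link targets out of a page using only the standard library. The page content, the LinkCollector class, and the extract_links helper are hypothetical names chosen for illustration; a real script would first download the page, for example with urllib.request.urlopen.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
  <a href="/movies/1">First movie</a>
  <a href="/movies/2">Second movie</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE)
print(links)
```

Following the collected links and repeating the extraction on each page is the basic loop behind most scrapers.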

8
Q

webscraper.io

A

is a more sophisticated tool that allows the user to select which elements of a web site are important and which links should be followed in order to gather more information

9
Q

Potential issues with web scraping

A
  1. Many sites do not allow gathering information automatically
  2. Not all data are publicly available online
10
Q

Ways to detect whether the site is being viewed by a human

A
  • Detection of frequent requests
  • Cookies
  • Other trackers
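
To illustrate the first item, here is a rough sketch of how a site might flag frequent requests with a sliding time window. The RequestRateMonitor class, its thresholds, and the client identifier are assumptions made for this example; real sites use their own, usually more elaborate, rules.

```python
from collections import deque


class RequestRateMonitor:
    """Flag clients that send more than max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = {}  # client id -> deque of recent request times

    def looks_automated(self, client_id, now):
        """Record a request at time `now` (in seconds) and report whether
        the client has exceeded the allowed rate."""
        recent = self._timestamps.setdefault(client_id, deque())
        recent.append(now)
        # Drop requests that have fallen out of the time window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests


# Hypothetical limit: at most 3 requests per 10 seconds.
monitor = RequestRateMonitor(max_requests=3, window_seconds=10.0)
flags = [monitor.looks_automated("client-a", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(flags)
```

A scraper that pauses between requests stays below such a limit, which is one reason to throttle automated data collection.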
11
Q

Robots Exclusion Protocol

A

The robots exclusion protocol, or robots.txt protocol, is a way for a web site to tell crawlers and other web bots which parts of the site they may or may not scrape automatically
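
A short sketch of checking such instructions with Python's standard urllib.robotparser module follows. The rules, the bot name "MyBot", and the example.com URLs are made up for illustration; in practice the rules would be read from the site's own /robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot is kept out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines.
parser.parse(ROBOTS_TXT.splitlines())

private_ok = parser.can_fetch("MyBot", "https://example.com/private/data.html")
public_ok = parser.can_fetch("MyBot", "https://example.com/public/index.html")
print(private_ok, public_ok)
```

The protocol is advisory: it tells a bot what the site owner permits, but it is up to the scraper to check and respect it.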

12
Q

Potential issue with web scraping: Not all data are publicly available online

Workaround

A

Workarounds include using authentication to access the protected information or using API access

13
Q

Application Programming Interfaces (APIs)

A

are protocols for interacting with specific web sites; they can be used by any registered user

14
Q

Three steps for API access

A
  1. Get an API key
  2. Query an API endpoint using the API key
  3. Process the response
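
The three steps above can be sketched in Python roughly as follows. The base URL, the "api_key" parameter name, the endpoint path, and the shape of the JSON response are all assumptions for illustration; each real API documents its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Step 1: the key is issued by the provider; this value is a placeholder.
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.example.com"  # hypothetical API host


def build_request_url(endpoint, api_key, **params):
    """Step 2a: attach the key and any parameters to the endpoint URL."""
    query = urlencode({"api_key": api_key, **params})
    return f"{BASE_URL}{endpoint}?{query}"


def fetch(url):
    """Step 2b: send the request and return the raw response text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_titles(response_text):
    """Step 3: process the response, assumed here to be JSON containing
    a 'results' list whose items each have a 'title' field."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload.get("results", [])]


url = build_request_url("/movies/search", API_KEY, query="star wars")
print(url)

# Hypothetical response body, used instead of a live call to fetch(url).
sample_response = '{"results": [{"title": "Movie A"}, {"title": "Movie B"}]}'
print(extract_titles(sample_response))
```

Keeping the request building and the response processing in separate functions makes it possible to test both without network access.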
15
Q

API key is…

A

like a valet key for the web

  • Provides access to a limited set of functions
  • Can be revoked by the issuer at any time
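
The two properties can be pictured from the issuer's side with a small sketch. The ApiKeyRegistry class and the function names below are hypothetical and only illustrate the idea of limited, revocable access; they do not describe any particular provider.

```python
class ApiKeyRegistry:
    """Toy model of an issuer that tracks what each key is allowed to do."""

    def __init__(self):
        self._allowed = {}  # key -> set of permitted function names

    def issue(self, key, allowed_functions):
        """Register a key together with the functions it may call."""
        self._allowed[key] = set(allowed_functions)

    def revoke(self, key):
        """Withdraw a key; later checks for it will fail."""
        self._allowed.pop(key, None)

    def is_allowed(self, key, function_name):
        """Return True only if the key exists and covers the function."""
        return function_name in self._allowed.get(key, set())


registry = ApiKeyRegistry()
registry.issue("key-123", ["search_movies"])

can_search = registry.is_allowed("key-123", "search_movies")
can_delete = registry.is_allowed("key-123", "delete_movie")
registry.revoke("key-123")
after_revoke = registry.is_allowed("key-123", "search_movies")
print(can_search, can_delete, after_revoke)
```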
16
Q

An API usually provides multiple endpoints or functions. Examples:

A

e.g., most recent movies, most popular movies, search movies by keyword
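
As a sketch, such endpoints often differ only in their path and parameters. The host and paths below are invented for illustration and do not belong to a real movie API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # hypothetical API host

# Hypothetical paths for the three example endpoints.
ENDPOINTS = {
    "most_recent": "/movies/recent",
    "most_popular": "/movies/popular",
    "search": "/movies/search",
}


def endpoint_url(name, **params):
    """Build the URL for a named endpoint, adding query parameters if any."""
    url = BASE_URL + ENDPOINTS[name]
    if params:
        url += "?" + urlencode(params)
    return url


print(endpoint_url("most_popular"))
print(endpoint_url("search", keyword="alien"))
```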