Week 2 - Web Data Collection II Flashcards

1
Q

What are the limitations of APIs?

A

API stands for Application Programming Interface.

The limitations are as follows:
1. Lack of API - not all platforms provide an API
2. Availability of data - not all APIs share all their data
3. Freshness of data - new data might not be available immediately
4. Rate limits - restrictions on how many calls you can make, the time between calls, the amount of data returned per call, etc.

2
Q

What is Web Scraping?

A

Web Scraping is the process of automatically extracting unstructured data from a webpage and transforming it into a structured dataset that can be analysed.

3
Q

What are the two main steps for Web Scraping?

A

Two main steps:
- Fetch (download) the HTML pages (source code) that contain the data
- Extract the relevant data from the HTML pages

4
Q

Label each of the following HTML elements

<!DOCTYPE HTML>

<html>

<head>
<style>

   h1 { color: blue } 
</style>
<title>
web scraping
</title>
</head>

<body>
...
</body>

</html>
A

<!DOCTYPE html> - declaration of the document type

<html> - the root element which is parent to all other HTML elements

<head> - contains metadata about the HTML document (e.g., document title, character set, styles, links, scripts)

<body> - contains all the content of an HTML document (headings, paragraphs, tables, lists, images, hyperlinks, etc)

5
Q

HTML elements generally have a start tag, content, and end tag. List them all out.

A

Start tags: html, head, style, title, body, h1, p, ul, li
End tags: the same names preceded by a slash, e.g. /html, /head, /style, /title, /body, /h1, /p, /ul, /li
Content: inserted in between the start and end tags

6
Q

HTML is often coupled with CSS; what are the attributes used by CSS?

A

id - an attribute used to give a unique id to an element; an id can be used by only one HTML element

class - an attribute used to group elements; multiple HTML elements can belong to the same class
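
A hypothetical HTML fragment illustrating both attributes (the names match the selector examples in the next card):

<h1 id="myHeader">Programming languages</h1>
<p class="language">R</p>
<p class="language">Python</p>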

7
Q

Why is CSS useful for us?

A

We will be identifying elements via CSS selector notation

For example:
selecting by id: #myHeader
selecting by class: .language
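
A minimal sketch of using these selectors with rvest, assuming a hypothetical page that contains them:

library(rvest)

page   <- read_html("https://example.com")       # placeholder URL
header <- page %>% html_element("#myHeader")     # select by id
langs  <- page %>% html_elements(".language")    # select by class
html_text2(header)
html_text2(langs)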

8
Q

Describe all the selectors below:

  1. p
  2. p.intro
  3. .title
  4. #contact
  5. div p
  6. p:first-child
A
  1. selects all <p> elements
  2. selects all <p> elements with class “intro”
  3. selects all elements with class “title”
  4. selects the element with the id attribute “contact”
  5. selects all <p> elements inside <div> elements
  6. selects every <p> element that is the first child of its parent
9
Q

What is the rvest package?

A

It is an R package that helps scrape data from web pages

10
Q

What are the steps to web scraping with rvest?

A
  1. examine the webpage
  2. decide the data you want to scrape from the webpage
  3. identify the CSS selectors (use inspect element in the browser and other tools)
  4. write a program using the rvest package
  5. install and load the rvest library
  6. read a webpage into R by specifying the URL of the web page you would like to scrape using the function read_html()
  7. extract specified elements out of the HTML document using the functions (html_element(), html_elements()) and CSS selectors
  8. extract the components of elements using functions (see the sketch below)
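
A minimal sketch of steps 5-8, assuming a hypothetical URL and CSS selector:

# install.packages("rvest")                      # step 5: install once, then load
library(rvest)

page <- read_html("https://example.com/books")   # step 6: read the webpage into R
titles <- page %>%
  html_elements(".book-title") %>%               # step 7: select elements via a CSS selector
  html_text2()                                   # step 8: extract the text component
titles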
11
Q

What is the Selector Gadget?

A

Selector Gadget is a browser tool (a bookmarklet/extension) used to identify relevant CSS selectors by pointing and clicking on page elements

12
Q

What are the functions used to extract components?

A

html_text(): extracts the raw underlying text
html_text2(): simulates how the text looks in a browser
html_table(): parses an HTML table into a data frame
html_attr() and html_attrs(): get element attributes
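
A minimal sketch contrasting these functions (the page and selectors are hypothetical):

library(rvest)

page  <- read_html("https://example.com/books")   # placeholder URL
links <- page %>% html_elements("a.book-title")

links %>% html_text2()                            # text as it would render in a browser
links %>% html_attr("href")                       # one attribute's value per element
links %>% html_attrs()                            # all attributes of each element
page %>% html_element("table") %>% html_table()   # parse an HTML table into a data frame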

13
Q

What is an attribute?

A

Attributes are special words used within a start tag to provide additional information about HTML elements, e.g. the href attribute in <a href="https://example.com"> specifies the link target

14
Q

What is the purpose of the pipe operator %>%?

A

%>% simplifies the code by applying a series of functions to an object in sequence. It can be read aloud as "and then"
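
A minimal sketch of the same operation written with nested calls and with the pipe (the URL is a placeholder):

library(rvest)
url <- "https://example.com"   # placeholder URL

# nested calls read inside-out
html_text2(html_elements(read_html(url), "h1"))

# piped calls read left to right: read the page, and then
# select the h1 elements, and then extract their text
read_html(url) %>% html_elements("h1") %>% html_text2()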

15
Q

What if content requires some interaction to load?

A

Examples of such interaction include scrolling down for new content, filling in forms, clicking on buttons, making a search, animations, embedded media, etc.

16
Q

What do we need in order to scrape dynamic content?

A

We need a way to automate a browser to simulate human-user interaction, which requires more advanced tools than rvest

17
Q

What is RSelenium?

A

RSelenium provides a range of tools and libraries for automating web browsers. The purpose of Selenium is to emulate human user interactions with browsers, such as typing, clicking, selecting, scrolling, etc.

Selenium can replicate nearly all browser actions that can be performed manually

18
Q

List down the following actions for the functions below:

  1. navigate()
  2. close()
  3. clickElement()
  4. sendKeysToElement()
  5. goBack() / goForward()
  6. refresh()
  7. getPageSource()
A
  1. navigate() - navigate the browser to a given URL
  2. close() - close a browser
  3. clickElement() - click on an element
  4. sendKeysToElement() - enter values
  5. goBack() / goForward() - go to previous/next page
  6. refresh() - refresh the page
  7. getPageSource() - get all the HTML that is currently displayed
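
A minimal sketch of these functions in use, assuming a working local Selenium driver (the URL and selector are placeholders):

library(RSelenium)

rD    <- rsDriver(browser = "firefox", port = 4545L)   # start server and browser
remDr <- rD$client

remDr$navigate("https://example.com")                  # go to a page
box <- remDr$findElement(using = "css selector", "input#search")
box$sendKeysToElement(list("web scraping", key = "enter"))  # type and submit
Sys.sleep(2)                                           # wait for results to load
html <- remDr$getPageSource()[[1]]                     # HTML currently displayed

remDr$close()                                          # close the browser
rD$server$stop()                                       # stop the Selenium server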
19
Q

How do you avoid timeouts or bans when Web Scraping pages?

A
  1. controlling the rate of scraping is important
  2. avoid overloading the server with tens of requests per second - don't disrupt or harm the activity of the website
  3. fast-paced requests coming from the same IP address are likely to get banned
  4. gather data during the off-peak hours of the website
  5. mimic the normal behaviour of a human user (see the sketch below)
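
A minimal sketch of rate-limited fetching (the URLs are placeholders):

library(rvest)

urls  <- c("https://example.com/page1", "https://example.com/page2")
pages <- list()
for (u in urls) {
  pages[[u]] <- read_html(u)              # fetch one page
  Sys.sleep(runif(1, min = 2, max = 5))   # random 2-5 s pause mimics a human user
}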
20
Q

What are the best practices for Web Scraping?

A
  1. Understand and exploit the structure of web pages
  2. Check the terms and conditions of the website for the legality of scraping
  3. Use robots.txt to check whether you are allowed to scrape and what you are allowed to scrape
  4. Observe the rate limits - don't disrupt the operation of the website
  5. Only scrape data if you see value in it - problem first, then the method
  6. Check and confirm ethics approval
21
Q

What is robots.txt?

A

Robots.txt is a file found at the root of a domain describing which sections of the website robots can access and under what conditions (e.g., the delay between calls)

22
Q

What is a user-agent in terms of robotstxt?

A

User-agent is the name of the web robot or scraper. It is used to target rules at a specific robot (* matches all robots).

23
Q

What is allowed and not allowed in robotstxt?

A

Allow - scraping is okay for the given page or directory (directories are written with slashes, e.g. /path/)

Disallow - scraping is not okay for the given page or directory

24
Q

What is crawl-delay?

A

Crawl-delay: N sets the minimum waiting time, N seconds, between each request to the website.

You can usually access the file by taking the domain of the website and appending "/robots.txt"
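
Putting the last few cards together, a hypothetical robots.txt might look like this:

User-agent: *
Crawl-delay: 10
Allow: /public/
Disallow: /private/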

25
Q

How to use robotstxt from R?

A

install.packages("robotstxt")
library(robotstxt)

target_url <- "..."
get_robotstxt(target_url)
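
The package also provides paths_allowed() to check a specific path directly (the domain and path here are placeholders):

library(robotstxt)
paths_allowed(paths = "/public/", domain = "example.com")   # returns TRUE or FALSE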