Week 2 - Web Data Collection II Flashcards
What is the limitation of APIs?
API stands for Application Programming Interface.
The limitations are as follows:
1. Lack of API - not all platforms provide API
2. Availability of data - not all APIs share all their data
3. Freshness of data - new data might not be available immediately
4. Rate limits - restrictions on how much data you can get per call, how many calls you can make, the required time between calls, etc.
What is Web Scraping?
Web Scraping is the process of automatically extracting unstructured data from a webpage and transforming it into a structured dataset that can be analysed.
What are the two main steps for Web Scraping?
Two main steps:
- Fetch (download) the HTML pages (source code) that contain the data (a small fetch sketch follows this list)
- Extract the relevant data from the HTML pages
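A minimal sketch of the fetch step in base R; "https://example.com" is a placeholder URL, and the extraction step is shown later with rvest.

# Step 1: fetch (download) the raw HTML source of a page (URL is a placeholder)
html_lines <- readLines("https://example.com", warn = FALSE)

# The result is unstructured text; step 2 (extraction) turns it into structured data
head(html_lines)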
Label each of the following HTML elements
<!DOCTYPE html>
<html>
  <head>
    <style>
      h1 { color: blue }
    </style>
    <title>
      web scraping
    </title>
  </head>
  <body>
    ...
  </body>
</html>
<!DOCTYPE html> - declaration of the document type
<html> - the root element which is parent to all other HTML elements
<head> - contains metadata about the HTML document (e.g., document title, character set, styles, links, scripts)
<body> - contains all the content of an HTML document (headings, paragraphs, tables, lists, images, hyperlinks, etc)
HTML elements generally have a start tag, content, and end tag. List them all out.
Start tags: <html>, <head>, <style>, <title>, <body>, <h1>, <p>, <ul>, <li>
End tags: the same names prefixed with a slash, e.g., </html>, </head>, </style>, </title>, </body>, </h1>, </p>, </ul>, </li>
Content: inserted in between the start and end tags
HTML is often coupled with CSS; which attributes does CSS use to identify elements?
id - the id attribute gives a unique id to an element; each id can be used by only one HTML element
class - multiple HTML elements can belong to the same class
Why is CSS useful for us?
We will be identifying elements via CSS selector notation
For example:
selecting by id: #myHeader
selecting by class: .language
Describe all the selectors below:
- element (e.g., p) - selects all <p> elements
- element.class (e.g., p.intro) - selects all <p> elements with class "intro"
- .class (e.g., .title) - selects all elements with class "title"
- #id (e.g., #contact) - selects the element with the id attribute "contact"
- element element (e.g., div p) - selects all <p> elements inside <div> elements
- :first-child (e.g., p:first-child) - selects every <p> element that is the first child of its parent
What is the rvest package?
It is an R package that helps scrape data from web pages
What are the steps to web scraping with rvest?
- examine the webpage
- decide the data you want to scrape from the webpage
- identify the CSS selectors (use inspect element in the browser and other tools)
- write a program using the rvest package
- install and load the rvest library
- read a webpage into R by specifying the URL of the web page you would like to scrape, using the function read_html()
- extract specified elements out of HTML documents using the functions (html_element(), html_elements()) and CSS selectors
- extract the components of elements using functions such as html_text() and html_attr() (a minimal sketch of this workflow follows this list)
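A minimal rvest sketch of this workflow, assuming a placeholder URL and a hypothetical ".title" selector; replace both with the page and selector identified for your own task.

install.packages("rvest")   # run once
library(rvest)

# Step 1: read the webpage into R (URL is a placeholder)
page <- read_html("https://example.com")

# Step 2: select elements via a CSS selector (".title" is assumed) and extract their text
titles <- page %>%
  html_elements(".title") %>%
  html_text2()

titles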
What is SelectorGadget?
SelectorGadget is a browser tool that helps identify the relevant CSS selectors for the elements you want to scrape
What are the functions used to extract components?
html_text(): raw underlying text
html_text2(): simulates how text looks in a browser
html_table(): parse an html table into a data frame
html_attr() and html_attrs(): get element attributes (a short sketch of these functions follows)
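A short sketch of these component extractors, assuming the (placeholder) page contains links and at least one table.

library(rvest)
page <- read_html("https://example.com")   # placeholder URL

# Text of every link, as it would look in a browser
page %>% html_elements("a") %>% html_text2()

# The href attribute of every link
page %>% html_elements("a") %>% html_attr("href")

# Parse the first HTML table into a data frame (assumes a table exists)
page %>% html_element("table") %>% html_table()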
What is an attribute?
Attributes are special words used within a tag to provide additional information about HTML elements (e.g., the href attribute of an <a> tag gives the link destination)
What is the purpose of the pipe operator %>%?
%>% simplifies code by applying a series of functions to an object in sequence. It can be read as "and then" in plain English
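A small illustration of the pipe with rvest-style calls; the two versions are equivalent, and the URL and "p" selector are just examples.

library(rvest)
page <- read_html("https://example.com")   # placeholder URL

# Nested calls, read inside-out
html_text2(html_elements(page, "p"))

# The same with the pipe, read left to right: take page, and then select <p> elements, and then get their text
page %>% html_elements("p") %>% html_text2()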
What kinds of interaction might a page require before its content loads?
Scrolling down for new content, filling in forms, clicking on buttons, making a search, animations, embedded media etc.
What do we need to scrape such dynamic content?
A way to automate a browser and simulate human-user interaction; this requires more advanced tools than rvest
What is RSelenium?
RSelenium provides a range of tools and libraries for automating web browsers. The purpose of Selenium is to emulates human user interactions with browsers such as type, click, select, scrolling, etc
Selenium can replicate nearly all browser actions that can be performed manually
List the actions performed by the functions below:
- navigate()
- close()
- clickElement()
- sendKeysToElement()
- goBack() / goForward()
- refresh()
- getPageSource()
- navigate() - open a given URL in the browser
- close() - close a browser
- clickElement() - click on an element
- sendKeysToElement() - enter values
- goBack() / goForward() - go to previous/next page
- refresh() - refresh the page
- getPageSource() - get all the HTML that is currently displayed (a usage sketch follows)
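A hedged sketch of these functions in use; it assumes a working Selenium setup started via rsDriver(), and the port, browser choice, URL, and "#search" selector are all placeholders.

library(RSelenium)

# Start a Selenium server and browser (assumes the required driver binaries are available)
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://example.com")   # open a given URL (placeholder)

# Find an element by CSS selector (the selector is assumed) and interact with it
elem <- remDr$findElement(using = "css selector", value = "#search")
elem$sendKeysToElement(list("web scraping", key = "enter"))

remDr$goBack()
remDr$refresh()

# Get all the HTML that is currently displayed and hand it to rvest
html <- remDr$getPageSource()[[1]]

remDr$close()        # close the browser
rD$server$stop()     # stop the Selenium server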
How do you avoid timeouts and bans when Web Scraping many pages?
- controlling the rate of scraping is important (e.g., pause between requests, as sketched after this list)
- avoid overloading the server with tens of requests per second - don’t disrupt / harm the activity of the website
- fast-paced requests coming from the same IP address are likely to get banned
- gather data during the off-peak hours of the website
- mimic the normal behaviour of a human user
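A minimal rate-limiting sketch with rvest; the URLs and the 5-second pause are illustrative values only.

library(rvest)

urls <- c("https://example.com/page1", "https://example.com/page2")   # placeholder URLs
pages <- vector("list", length(urls))

for (i in seq_along(urls)) {
  pages[[i]] <- read_html(urls[i])
  Sys.sleep(5)   # pause between requests so the server is not overloaded
}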
What are the best practices for Web Scraping?
- Understand and exploit the structure of web pages
- Check the website's terms and conditions for the legality of scraping
- Use robots.txt to check whether you are allowed to scrape and what you are allowed to scrape
- Observe the rate limits - don't disrupt the operation of a website
- Only scrape data if you see value in it - problem first, then the method
- Check and confirm ethics approval
What is robotstxt?
Robots.txt is a file found at the root of a domain; it describes which sections of the website robots can access and under what conditions (e.g., the delay between calls)
What is a user-agent in terms of robotstxt?
User-agent is the name of the web robot or scraper; in robots.txt it is used to specify which robot the rules apply to
What do Allow and Disallow mean in robots.txt?
Allow - scraping is okay for the given page or directory (directories are written with slashes, e.g., /path/)
Disallow - scraping is not okay for the given page or directory
What is crawl-delay?
Crawl-delay: N specifies the minimum waiting time (typically N seconds) between consecutive requests to the website.
You can usually access the robots.txt file by taking the domain of the website and adding "/robots.txt"
How to use robotstxt from R?
install.packages("robotstxt")
library(robotstxt)
target_url <- "…"
get_robotstxt(target_url)
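The robotstxt package also provides paths_allowed() to check specific paths directly; a small sketch with a placeholder domain and path.

library(robotstxt)

# TRUE/FALSE: is scraping allowed for this path on this domain? (both are placeholders)
paths_allowed(paths = "/some/path/", domain = "example.com")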