Lecture 11 Revision Flashcards
Recap: What is a protocol?
A protocol is a set of rules, conventions, and data formats that
govern how data is transmitted, received, and interpreted by
interconnected devices or programs.
What protocol defines how messages are sent between computers via the internet?
TCP/IP
Transmission Control Protocol / Internet Protocol.
Recap: What is the Internet?
The Internet is a global network of interconnected computers that communicate using standardised protocols (such as TCP/IP) to exchange data and enable services.
Recap: What is the World Wide Web (www)?
The WWW is all the public websites and pages held on remote computers (servers) that users can access on their local computers (clients) via the Internet.
The WWW relies on:
- standardised protocols that allow the browser application on the client to communicate with the server software on the server computer
- standardised file formats and rendering rules that allow clients to consistently display the web pages to the user
The server computer stores web pages. The browser downloads webpages and renders web pages on the screen.
The two key standards relied on are HTTP (a protocol for downloading web pages) and HTML (a format specifying the structure of the files containing web pages).
Recap: www Protocols. What is HTTP?
The HyperText Transfer Protocol (HTTP) defines how a browser and a server communicate with each other on the World Wide Web.
Browsers send requests to servers.
Servers send responses to browsers.
There are two main types of requests:
A GET request is sent to retrieve a resource from a web server.
A POST request is sent to submit data (for example, the contents of a form) to a web server.
Hypertext Transfer Protocol Secure (HTTPS) is the encrypted
version of HTTP.
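As a sketch of the two request types, Python's built-in urllib can construct a GET and a POST request without sending them; the URLs and data here are just illustrative:

```python
from urllib.request import Request

# A GET request: just a URL, no body.
get_req = Request("https://www.ucd.ie/about-ucd/index.html")
print(get_req.get_method())   # GET

# A POST request: attaching a data payload makes urllib use POST.
post_req = Request("https://www.ucd.ie/", data=b"name=Jane")
print(post_req.get_method())  # POST
```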
Recap: www Protocols. What is HTML?
HyperText Markup Language (HTML) defines the format of a web file and how its content should be rendered on screen.
An HTML file is an ASCII (text) file.
An HTML file contains embedded elements.
An element is a building block of an HTML document. An element contains content and information about the organisation of the page.
Typically, an element consists of:
an opening tag
content
a closing tag
Tags are enclosed in angle brackets <> and the letters inside them indicate the type of element (e.g. <p> is a paragraph).
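For example, a minimal HTML file using the paragraph element described above might look like this (a sketch, not taken from the lecture):

```html
<html>
  <body>
    <p>This is a paragraph.</p>
  </body>
</html>
```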
A web resource is uniquely identified on the WWW by its Uniform Resource Locator (URL). A URL syntax is:
<protocol>://<domain>/<path>/<filename>
For example:
https://www.ucd.ie/about-ucd/index.html
The browser normally shows the URL of the current web page.
Often, the HTML filename is not displayed. The default is index.html.
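The URL components can be inspected programmatically with Python's standard urllib.parse module, using the example URL above:

```python
from urllib.parse import urlparse

# Split the example URL into its components.
url = urlparse("https://www.ucd.ie/about-ucd/index.html")
print(url.scheme)  # https
print(url.netloc)  # www.ucd.ie
print(url.path)    # /about-ucd/index.html
```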
What is web scraping?
Web scraping is the process of automatically extracting data from websites.
Web scraping involves:
- fetching the web page(s)
- parsing the HTML to extract specific pieces of information
Scraping is useful when you want to gather data from web pages that don’t provide an Application Programming Interface (API) for direct access.
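As a minimal sketch of the parsing step, the following uses Python's built-in html.parser on a hard-coded page (rather than one fetched over the network) to extract the text of every <p> element:

```python
from html.parser import HTMLParser

# A tiny hard-coded page standing in for a fetched one.
page = "<html><body><p>First paragraph.</p><p>Second.</p></body></html>"

class ParagraphScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs.append(data)

scraper = ParagraphScraper()
scraper.feed(page)
print(scraper.paragraphs)  # ['First paragraph.', 'Second.']
```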
What is a web crawler?
A web crawler (a.k.a. web spider, web robot or bot) is a program that systematically browses and collects information by visiting web pages.
What is robots.txt?
Websites often have a robots.txt file that specifies which pages or sections of the site can be crawled or scraped. Most legitimate bots follow the rules set in robots.txt, but there is no way to enforce them against malicious bots.
Structure of a robots.txt file:
Following is an example of a robots.txt:
User-agent: *
Disallow: /private/
Allow: /public/
User-agent specifies the bots that the next rules apply to. The wildcard (*) indicates that the rules apply to all bots.
Disallow specifies a path which the bot should not visit.
Allow specifies a path which the bot can visit, even if a broader Disallow directive is given.
The Disallow and Allow rules apply to all of the files and directories under the specified path.
In a robots.txt file, a hash symbol (#) can be used as a start of a line comment marker.
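These directives can be checked in Python with the standard urllib.robotparser module (which appears again later in this lecture); here the example rules above are parsed directly rather than fetched from a site:

```python
from urllib.robotparser import RobotFileParser

# Parse the example robots.txt rules directly (no network needed).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
])
print(rp.can_fetch("*", "/private/secret.html"))  # False
print(rp.can_fetch("*", "/public/index.html"))    # True
```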
What does this robots.txt mean?
User-agent: *
Disallow:
ALL web crawlers (User-agent: *) are allowed access to ALL content, because nothing is listed after Disallow:.
What does this robots.txt mean?
User-agent: *
Disallow: /
ALL web crawlers are disallowed ALL access. The / means DISALLOW ALL.
What are the ethical considerations when web scraping?
- Respect robots.txt
- Avoid Overloading Servers: Bots can generate many requests very quickly, and scraping too frequently puts unnecessary load on the server. A programmer should put appropriate delays between requests.
- Legal Considerations: Always check the site’s terms of service to ensure scraping is allowed. Some websites explicitly prohibit scraping. E.g. https://www.ucd.ie/disclaimer/
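The advice about delays can be sketched with time.sleep; the URLs and the delay value here are illustrative, not from the lecture:

```python
import time

# A sketch of polite crawling: pause between requests so the
# server is not overloaded. URLs and delay are placeholders.
urls = ["/page1.html", "/page2.html", "/page3.html"]
DELAY_SECONDS = 0.2  # an assumed delay; real crawlers often wait longer

start = time.monotonic()
for url in urls:
    # ... fetch and parse the page here ...
    time.sleep(DELAY_SECONDS)
elapsed = time.monotonic() - start
print(f"fetched {len(urls)} pages in {elapsed:.1f}s")
```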
What is a sitemap?
Some robots.txt files contain a sitemap link.
This is a link to a sitemap, which is an XML (eXtensible Markup Language) file.
The sitemap contains a structured list of the URLs of the resources on the website, plus some data about them. The sitemap provides useful information about those resources to bots crawling for search engines.
Web scraping in Python requires the use of third-party libraries. What is a LIBRARY?
A library is a collection of packages that provides reusable functionality in a particular area.
What is a MODULE?
A module is a single Python file (.py) containing reusable code. A module can contain classes, functions and/or variables.
What is a PACKAGE?
A package is a collection of modules grouped together in a directory structure. A sub-package can be contained within a package.
The words PACKAGE and LIBRARY are often used interchangeably.
How is a Python SCRIPT file different to a MODULE?
A script file (.py) is a little different from a module. The code in a script is intended to be run directly for a single purpose, whereas the code in a module is intended to be imported and used by several scripts (or other modules). Hence, the code in a module is reused.
What is the hierarchy structure of LIBRARIES, PACKAGES and MODULES?
The hierarchy is:
Top: LIBRARY which contains one or more:
next: PACKAGE(S) which contains one or more:
next: MODULE(S) which contains one or more:
bottom: CLASS(ES), VARIABLE(S) and / or FUNCTION(S)
So if a LIBRARY contains a function that you want to use in your script, you first need to install that LIBRARY (unless it is a built-in Python library, in which case no installation is needed; third-party libraries must be installed).
Then you need to add code in your script to import the MODULE containing the relevant function.
Then your script can call the relevant function.
Import example:
Let’s say that a module greetings.py contains two useful functions:
def hello(name):
    print("Hello,", name)

def goodbye(name):
    print("Goodbye,", name)
You can run this code from your own script welcome.py by importing the module and calling the function:
import greetings
greetings.hello('Jane')
Including the module name as a prefix of the function
(<module>.<function>) is required.
You can then run your script from the terminal as:
$ python welcome.py
Hello, Jane
Note: The module and the script must be in the same directory, and the terminal must be running in that directory.
How do you call a function contained in a module (what is the syntax)?
module.function()
e.g. random.randint()
How can you create / use an alias for a MODULE?
import greetings as gr
gr.hello('Jill')
How can you find out more information about a module? e.g if module is called greetings
print(dir(greetings))
will display more information about the module.
You can add print statements to the script to see information about the module's name and file path:
print(greetings.__name__)
print(greetings.__file__)
How can you simplify code if you only want to use a specific function from a module?
Instead of importing all the code from a module by using import greetings
you can import one function from the module by:
from greetings import goodbye
then you don't need to refer to the module name when using the function. You can just use goodbye('Jill') instead of greetings.goodbye('Jill').
Info about Modules full path
Sometimes scripts must include the package name as a prefix of the module name (<package>.<module>).
This avoids any confusion about which module is being used.
For example, the following script code utilises the robotparser module from the urllib package.
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
Alternatively, you can restrict the import to just the one class you need:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()