Lecture 11 Revision Flashcards
Recap: What is a protocol?
A protocol is a set of rules, conventions, and data formats that
govern how data is transmitted, received, and interpreted by
interconnected devices or programs.
What protocol defines how messages are sent between computers via the internet?
TCP/IP
Transmission Control Protocol / Internet Protocol.
Recap: What is the Internet?
The Internet is a global network of interconnected computers that communicate using standardised protocols (such as TCP/IP) to exchange data and enable services.
Recap: What is the World Wide Web (www)?
The WWW is all the public websites and pages held on remote computers (servers) that users can access on their local computers (clients) via the Internet.
The WWW relies on:
- standardised protocols that allow the browser application on the client to communicate with the server software on the server computer
- standardised file formats and rendering rules that allow clients to consistently display the web pages to the user
The server computer stores web pages. The browser downloads webpages and renders web pages on the screen.
The two key standards relied on are HTTP (a protocol for downloading web pages) and HTML (a format specifying the structure of the files containing web pages).
Recap: www Protocols. What is HTTP?
The HyperText Transfer Protocol (HTTP) defines how a browser and a server communicate with each other on the World Wide Web.
Browsers send requests to servers.
Servers send responses to browsers.
There are two main types of requests:
A GET request is sent to retrieve a resource from a web server.
A POST request is sent to submit data (for example, the contents of a form) to a web server.
Hypertext Transfer Protocol Secure (HTTPS) is the encrypted
version of HTTP.
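As a sketch of the two request types, Python's built-in urllib can construct a GET and a POST request without sending them; the URLs and data here are just illustrative:

```python
from urllib.request import Request

# A GET request: just a URL, no body.
get_req = Request("https://www.ucd.ie/about-ucd/index.html")
print(get_req.get_method())   # GET

# A POST request: attaching a data payload makes urllib use POST.
post_req = Request("https://www.ucd.ie/", data=b"name=Jane")
print(post_req.get_method())  # POST
```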
Recap: www Protocols. What is HTML?
HyperText Markup Language (HTML) defines the format of a web file and how its content should be rendered on screen.
An HTML file is an ASCII (text) file.
An HTML file contains embedded elements.
An element is a building block of an HTML document. An element contains content and information about the organisation of the page.
Typically, an element consists of:
an opening tag
content
a closing tag
Tags are enclosed in angle brackets <> and the letters inside them indicate the type of element (e.g. <p> is a paragraph).
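For example, a minimal HTML file using the paragraph element described above might look like this (a sketch, not taken from the lecture):

```html
<html>
  <body>
    <p>This is a paragraph.</p>
  </body>
</html>
```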
A web resource is uniquely identified on the WWW by its Uniform Resource Locator (URL). A URL syntax is:
<protocol>://<domain>/<path>/<filename>
For example:
https://www.ucd.ie/about-ucd/index.html
The browser normally shows the URL of the current web page.
Often, the HTML filename is not displayed. The default is index.html.
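The URL components can be inspected programmatically with Python's standard urllib.parse module, using the example URL above:

```python
from urllib.parse import urlparse

# Split the example URL into its components.
url = urlparse("https://www.ucd.ie/about-ucd/index.html")
print(url.scheme)  # https
print(url.netloc)  # www.ucd.ie
print(url.path)    # /about-ucd/index.html
```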
What is web scraping?
Web scraping is the process of automatically extracting data from websites.
Web scraping involves:
- fetching the web page(s)
- parsing the HTML to extract specific pieces of information
Scraping is useful when you want to gather data from web pages that don’t provide an Application Programming Interface (API) for direct access.
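As a minimal sketch of the parsing step, the following uses Python's built-in html.parser on a hard-coded page (rather than one fetched over the network) to extract the text of every <p> element:

```python
from html.parser import HTMLParser

# A tiny hard-coded page standing in for a fetched one.
page = "<html><body><p>First paragraph.</p><p>Second.</p></body></html>"

class ParagraphScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs.append(data)

scraper = ParagraphScraper()
scraper.feed(page)
print(scraper.paragraphs)  # ['First paragraph.', 'Second.']
```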
What is a web crawler?
A web crawler (a.k.a. web spider, web robot or bot) is a program that systematically browses and collects information by visiting web pages.
What is robots.txt?
Websites often have a robots.txt file that specifies which pages or sections of the site can be crawled or scraped. Most legitimate bots follow the rules set in robots.txt, but there is no way to enforce them against malicious bots.
Structure of a robots.txt file:
Following is an example of a robots.txt:
User-agent: *
Disallow: /private/
Allow: /public/
User-agent specifies the bots that the next rules apply to. The wildcard (*) indicates that the rules apply to all bots.
Disallow specifies a path which the bot should not visit.
Allow specifies a path which the bot can visit, even if a broader Disallow directive is given.
The Disallow and Allow rules apply to all of the files and directories under the specified path.
In a robots.txt file, a hash symbol (#) can be used as a start of a line comment marker.
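These directives can be checked in Python with the standard urllib.robotparser module (which appears again later in this lecture); here the example rules above are parsed directly rather than fetched from a site:

```python
from urllib.robotparser import RobotFileParser

# Parse the example robots.txt rules directly (no network needed).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
])
print(rp.can_fetch("*", "/private/secret.html"))  # False
print(rp.can_fetch("*", "/public/index.html"))    # True
```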
What does this robots.txt mean?
User-agent: *
Disallow:
ALL web crawlers (User-agent: *) are allowed access to ALL content, because nothing is listed after Disallow:.
What does this robots.txt mean?
User-agent: *
Disallow: /
ALL web crawlers are disallowed ALL access. The / means DISALLOW ALL.
What are the ethical considerations when web scraping?
- Respect robots.txt
- Avoid Overloading Servers: Bots can generate many requests very quickly, and scraping too frequently puts unnecessary load on the server. A programmer should put appropriate delays between requests.
- Legal Considerations: Always check the site’s terms of service to ensure scraping is allowed. Some websites explicitly prohibit scraping. E.g. https://www.ucd.ie/disclaimer/
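The advice about delays can be sketched with time.sleep; the URLs and the delay value here are illustrative, not from the lecture:

```python
import time

# A sketch of polite crawling: pause between requests so the
# server is not overloaded. URLs and delay are placeholders.
urls = ["/page1.html", "/page2.html", "/page3.html"]
DELAY_SECONDS = 0.2  # an assumed delay; real crawlers often wait longer

start = time.monotonic()
for url in urls:
    # ... fetch and parse the page here ...
    time.sleep(DELAY_SECONDS)
elapsed = time.monotonic() - start
print(f"fetched {len(urls)} pages in {elapsed:.1f}s")
```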
What is a sitemap?
Some robots.txt files contain a sitemap link.
This is a link to a sitemap, which is an XML (eXtensible Markup Language) file.
The sitemap contains a structured list of the URLs of the resources on the website, plus some data about them. The sitemap provides useful information about those resources to bots crawling for search engines.
Web scraping in Python requires the use of third-party libraries. What is a LIBRARY?
A library is a collection of packages that provides reusable functionality in a particular area.
What is a MODULE?
A module is a single Python file (.py) containing reusable code. A module can contain classes, functions and/or variables.
What is a PACKAGE?
A package is a collection of modules grouped together in a directory structure. A sub-package can be contained within a package.
The words PACKAGE and LIBRARY are often used interchangeably.
How is a Python SCRIPT file different to a MODULE?
A script file (.py) is a little different from a module. The code in a script is intended to be run directly for a single purpose, whereas the code in a module is intended to be imported and used by several scripts (or other modules). Hence, the code in a module is reused.
What is the hierarchy structure of LIBRARIES, PACKAGES and MODULES?
The hierarchy is:
Top: LIBRARY which contains one or more:
next: PACKAGE(S) which contains one or more:
next: MODULE(S) which contains one or more:
bottom: CLASS(ES), VARIABLE(S) and / or FUNCTION(S)
So if a LIBRARY contains a function that you want to use in your script, you first need to install that LIBRARY (unless it is a built-in Python library, in which case no installation is needed; third-party libraries must be installed).
Then you need to add code in your script to import the MODULE containing the relevant function.
Then your script can call the relevant function.
Import example:
Let’s say that a module greetings.py contains two useful functions:
def hello(name):
    print("Hello,", name)

def goodbye(name):
    print("Goodbye,", name)
You can run this code from your own script welcome.py by importing the module and calling the function:
import greetings
greetings.hello('Jane')
Including the module name as a prefix of the function
(<module>.<function>) is required.
You can then run your script from the terminal as:
$ python welcome.py
Hello, Jane
Note: The module and the script must be in the same directory, and the terminal must be running in that directory.
How do you call a function contained in a module (what is the syntax)?
module.function()
e.g. random.randint()
How can you create / use an alias for a MODULE?
import greetings as gr
gr.hello('Jill')
How can you find out more information about a module? e.g if module is called greetings
print(dir(greetings))
will display more information about the module.
You can add print statements to the script to see information about the module's name and file path:
print(greetings.__name__)
print(greetings.__file__)
How can you simplify code if you only want to use a specific function from a module?
Instead of importing all the code from a module by using import greetings
you can import one function from the module by:
from greetings import goodbye
then you don't need to refer to the module name when using the function. You can just use goodbye('Jill') instead of greetings.goodbye('Jill').
Info about Modules full path
Sometimes scripts must include the package name as a prefix of the module name (<package>.<module>).
This avoids any confusion about which module is being used.
For example, the following script code utilises the robotparser module from the urllib package.
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
Alternatively, you can restrict the import to just the one class you need:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()