Chapters 12&13: Networked Programmes and Using Web Services Flashcards
socket
like a file, but a 2-way connection between two programmes. Can read from and write to either end.
http protocol
set of rules determining which end of a socket goes first, what it does, the response to that message, who sends the next message, etc.
simple HTTP protocol programme which:
- connects to port 80 of data.pr4e.org
- contains a loop which receives data from the socket in 512-character chunks and prints it until there is no data left
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()
\r\n
\r\n\r\n
EOL (end of line)
signifies nothing between 2 EOL sequences, equiv to blank line
output of above programme
the server first sends headers, one per line, with info eg date, server, last-modified, content-length, content-type (eg text/plain)
Then a blank line, then the actual data from romeo.txt.
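roughly what the output looks like (illustrative only; exact header values will differ, and are elided here):
HTTP/1.1 200 OK
Date: ...
Server: ...
Last-Modified: ...
Content-Length: ...
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
...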
encode/decode meaning, and alternative
the HTTP protocol requires data to be sent and received as bytes objects, not strings; hence these convert from string to bytes and back again.
'x'.encode() is equiv to b'x'
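a quick round-trip example (the string is arbitrary):
msg = 'Hello'.encode()      # str -> bytes, equivalent to b'Hello'
print(msg)                  # b'Hello'
print(msg.decode())         # bytes -> str: Hello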
programme to retrieve an image:
body type
number of characters received per call
http://www.py4e.com/code3/urljpeg.py
- rather than copying the data to the screen, we accumulate the data in a string, trim off the headers, and save the image data to a file
image/jpeg
3200-5120 - depends on how much data the server has sent before the call, ie network speed; the last call can return fewer than 3200 if less than that is left.
how to slow down successive recv() calls to more consistently get specified number of characters (5120 above):
time.sleep(0.25) inside the receive loop, before each recv() call (see the sketch below)
This gives a 0.25s delay between calls, allowing more consistent retrieval of 5120 chars unless the connection speed is very poor. It may make the programme take longer, obviously.
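a minimal sketch in the spirit of the urljpeg.py programme linked above (exact details may differ from that file); it accumulates the response as bytes, trims the headers, and shows where the delay goes:
import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

picture = b''                      # accumulate the whole response here
while True:
    time.sleep(0.25)               # slow down successive recv() calls
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    picture = picture + data
mysock.close()

pos = picture.find(b'\r\n\r\n')    # blank line marks the end of the headers
print(picture[:pos].decode())      # print the headers
picture = picture[pos + 4:]        # trim headers, keep only the image data

fhand = open('cover3.jpg', 'wb')
fhand.write(picture)
fhand.close()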
easier, less manual way of retrieving data using urllib (similar to getting a file):
import urllib.request
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
find frequency of words in a web page
same as above to open fhand, then:
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
retrieve binary file (eg image or video) and save as a file locally
import urllib.request, urllib.parse, urllib.error
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()
wb argument for open opens a binary file for writing only.
if the file is too big (it would use too much memory and take too long) - break it into chunks and write it bit by bit (100000 characters at a time):
# ..... (imports and img = urllib.request.urlopen(...) as above, but without .read())
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
print(size, 'characters copied.')
fhand.close()
web scraping
writing a programme that pretends to be a web browser, retrieves pages, and then looks for patterns in those pages' data. This is what Google does - it looks at word frequency, links to a page, etc. to determine the importance of millions of pages and suggest the best ones.
parse HTML with a regular expression to find links, where links take the form: href="http://www.blah.. (and you know the rough structure of the link)
# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())
ssl library
how you import it
allows a programme to access websites which strictly enforce HTTPS.
import ssl
read method use here:
returns the HTML source code as a bytes object, rather than returning an HTTPResponse object
risk of using REs:
solution (vague)
there are lots of broken HTML pages out there - using only REs may miss some links or retrieve bad data.
solved by using a robust HTML parsing library
HTML parsing libraries in python
there are many libraries to help parse HTML, since XML parsers can reject an HTML page as improperly formatted. Each has its own strengths and weaknesses.
BeautifulSoup (BS)
HTML parsing library which tolerates highly flawed HTML and lets you easily extract the data you need. Requires downloading/installing before use.
pip
tool to install Python packages. Once installed, you can run: pip install <package> at the command prompt.
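for example, BeautifulSoup is published on PyPI as beautifulsoup4, so a typical install command is:
pip install beautifulsoup4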
use urllib to read a page, then BeautifulSoup to extract href attributes from the anchor (a) tags
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
above programme explained
prompts for a web address, opens page, reads data, passes data to BS parser, retrieves all anchor tags, prints out href attribute for each tag.
Output: includes HTML anchor tags whose href is a relative path (eg ending in .html) or an in-page reference (eg #). Unlike our RE programme, this doesn't select only those starting with http:// etc (see the sketch below for filtering them out).
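a minimal sketch (assuming tags = soup('a') as in the programme above) of filtering the output down to absolute http/https links, matching the RE programme's behaviour:
for tag in tags:
    href = tag.get('href', None)
    if href is not None and href.startswith(('http://', 'https://')):
        print(href)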
use BS to pull out various parts of each tag (* denotes a line is different to programme above):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()   #*
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)                     #* and all the rest!
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)