Chapters 12 & 13: Networked Programmes and Using Web Services Flashcards
socket
like a file, but a two-way connection between two programmes; either end can both read from and write to it.
http protocol
set of rules determining which end of the socket goes first, what it sends, how the other end responds to that message, who sends the next message, and so on.
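As a rough sketch (not part of the original cards), the exchange looks like this: the client sends a request line followed by a blank line, and the server replies with headers, a blank line, then the document.
# Illustrative only; the URL matches the example in the next card.
request = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'
# Typical reply (abridged):
#   HTTP/1.1 200 OK
#   Content-Type: text/plain
#   (blank line)
#   But soft what light through yonder window breaks...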
simple HTTP protocol programme which:
- connects to port 80 of data.pr4e.org
- contains a loop which receives data from the socket in 512-character chunks and prints it until there is no data left
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()
\r\n
EOL (end of line) sequence
\r\n\r\n
two EOL sequences with nothing between them, equivalent to a blank line (this is what separates the headers from the body)
output of above programme
The server first sends header lines, one per line, with info such as date, server, last-modified, content-length, and content-type (eg text/plain).
Then a blank line, then the actual data from romeo.txt.
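A minimal sketch (my own example, with an abridged made-up response) of how that blank line, i.e. \r\n\r\n, splits the headers from the body:
response = b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nBut soft what light through yonder window breaks'
pos = response.find(b'\r\n\r\n')     # position of the blank line
print(response[:pos].decode())       # the headers
print(response[pos+4:].decode())     # the body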
encode/decode meaning, and alternative
The HTTP protocol requires data to be sent and received as bytes objects, not strings, hence these convert from string to bytes and back again.
'x'.encode() is equivalent to b'x'
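A quick round trip (my own example) showing the conversion in both directions:
s = 'Hello world'
data = s.encode()              # str -> bytes: b'Hello world'
print(data == b'Hello world')  # True
print(data.decode())           # bytes -> str: 'Hello world'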
programme to retrieve an image:
body type
number of characters received per call
http://www.py4e.com/code3/urljpeg.py
- rather than copying the data to the screen, we accumulate the data in a string, trim off the headers, and save the image data to a file
image/jpeg
roughly 3200-5120; it depends on how much data the server has already sent before the call (ie network speed). The last call can return fewer than 3200 characters if less than that is left.
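A sketch in the spirit of urljpeg.py (the output filename 'stuff.jpg' is illustrative): accumulate the whole response as bytes, find the blank line, strip the headers, and write the rest to a file.
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
mysock.send('GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n'.encode())
picture = b''
while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    picture = picture + data
mysock.close()
pos = picture.find(b'\r\n\r\n')   # end of the headers (the blank line)
print('Header length', pos)
print(picture[:pos].decode())
picture = picture[pos+4:]         # keep only the image bytes
fhand = open('stuff.jpg', 'wb')
fhand.write(picture)
fhand.close()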
how to slow down successive recv() calls to more consistently get specified number of characters (5120 above):
time.sleep(0.25), called inside the receive loop immediately before each recv()
This gives a 0.25 s delay between calls, allowing more consistent retrieval of 5120 characters unless the connection is very slow; it may make the transfer take longer overall.
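Sketch of where the delay goes (same receive loop as the sketch above, with the pause added before each recv() call; the 0.25 s value is the one from this card):
import socket
import time
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
mysock.send('GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n'.encode())
picture = b''
while True:
    time.sleep(0.25)          # pause so the server can stay ahead of us
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    picture = picture + data
mysock.close()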
easier, less manual way of retrieving data using urllib (similar to getting a file):
import urllib.request
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
find frequency of words in a web page
same as above, with the print loop replaced by:
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
retrieve a binary file (eg an image or video) and save it as a file locally
import urllib.request, urllib.parse, urllib.error
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()
The 'wb' argument to open() opens the file for writing in binary mode.
if the file is too big to read in one go (it would take too long or use too much memory), break it into chunks and write it bit by bit (100000 characters at a time):
import urllib.request, urllib.parse, urllib.error
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
print(size, 'characters copied.')
fhand.close()
web scraping
write a programme that pretends to be a web browser, retrieves pages, and then looks for patterns in those pages' data. This is what Google does: it looks at things like word frequency and the links pointing to a page to rank the importance of millions of pages and suggest the best ones.
parse an HTML page with a regular expression to find links, where links take the form href="http://www.blah... (and you know the rough structure of the link)
# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())
ssl library
how you import it
allows a programme to access websites which strictly enforce HTTPS.
import ssl