Chapters 12&13: Networked Programmes and Using Web Services Flashcards
socket
like a file, but a 2-way connection between two programmes. Can read from and write to either end.
http protocol
set of rules determining which end of a socket goes first, what it does, the response to that message, who sends the next message, etc.
simple HTTP protocol programme which:
- connects to port 80 of data.pr4e.org
- contains a loop which receives data from the socket in 512-character chunks and prints it until there is no data left
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()
\r\n
\r\n\r\n
EOL (end of line)
signifies nothing between 2 EOL sequences, equiv to blank line
output of above programme
the server first sends headers, one per line, with info eg date, server, last-modified, content-length, content-type (eg text/plain)
Then a blank line, then the actual data from romeo.txt.
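roughly what the output looks like (illustrative only; exact header values will differ, and are elided here):
HTTP/1.1 200 OK
Date: ...
Server: ...
Last-Modified: ...
Content-Length: ...
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
...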
encode/decode meaning, and alternative
the HTTP protocol requires data to be sent and received as bytes objects, not strings; hence these convert from string to bytes and back again.
'x'.encode() is equiv to b'x'
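a quick round-trip example (the string is arbitrary):
msg = 'Hello'.encode()      # str -> bytes, equivalent to b'Hello'
print(msg)                  # b'Hello'
print(msg.decode())         # bytes -> str: Hello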
programme to retrieve an image:
body type
number of characters received per call
http://www.py4e.com/code3/urljpeg.py
- rather than copying the data to the screen, we accumulate the data in a string, trim off the headers, and save the image data to a file
image/jpeg
3200-5120 - depends on how much data the server has sent before the call, ie network speed; the last call can return fewer than 3200 if less than that is left.
how to slow down successive recv() calls to more consistently get specified number of characters (5120 above):
time.sleep(0.25) inside the receive loop, before each recv() call (see the sketch below)
This gives a 0.25s delay between calls, allowing more consistent retrieval of 5120 chars unless the connection speed is very poor. It may make the programme take longer, obviously.
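a minimal sketch in the spirit of the urljpeg.py programme linked above (exact details may differ from that file); it accumulates the response as bytes, trims the headers, and shows where the delay goes:
import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

picture = b''                      # accumulate the whole response here
while True:
    time.sleep(0.25)               # slow down successive recv() calls
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    picture = picture + data
mysock.close()

pos = picture.find(b'\r\n\r\n')    # blank line marks the end of the headers
print(picture[:pos].decode())      # print the headers
picture = picture[pos + 4:]        # trim headers, keep only the image data

fhand = open('cover3.jpg', 'wb')
fhand.write(picture)
fhand.close()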
easier, less manual way of retrieving data using urllib (similar to getting a file):
import urllib.request
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
find frequency of words in a web page
same as above to open fhand, then:
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
retrieve binary file (eg image or video) and save as a file locally
import urllib.request, urllib.parse, urllib.error
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()
wb argument for open opens a binary file for writing only.
if the file is too big (it would use too much memory and take too long) - break it into chunks and write it bit by bit (100000 characters at a time):
# ..... (imports and img = urllib.request.urlopen(...) as above, but without .read())
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
print(size, 'characters copied.')
fhand.close()
web scraping
writing a programme that pretends to be a web browser, retrieves pages, and then looks for patterns in those pages' data. This is what Google does - it looks at word frequency, links to a page, etc. to determine the importance of millions of pages and suggest the best ones.
parse HTML with a regular expression to find links, where links take the form: href="http://www.blah.. (and you know the rough structure of the link)
# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())
ssl library
how you import it
allows a programme to access websites which strictly enforce HTTPS.
import ssl
read method use here:
returns the HTML source code as a bytes object, rather than returning an HTTPResponse object
risk of using REs:
solution (vague)
there are lots of broken HTML pages out there - using only REs may miss some links or retrieve bad data.
solved by using a robust HTML parsing library
HTML parsing libraries in python
there are many libraries to help parse HTML, since XML parsers can reject an HTML page as improperly formatted. Each has its own strengths and weaknesses.
BeautifulSoup (BS)
HTML parsing library which tolerates highly flawed HTML and lets you easily extract the data you need. Requires downloading/installing before use.
pip
tool to install Python packages. Once installed, you can run: pip install <package> at the command prompt.
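for example, BeautifulSoup is published on PyPI as beautifulsoup4, so a typical install command is:
pip install beautifulsoup4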
use urllib to read a page, then BeautifulSoup to extract href attributes from the anchor (a) tags
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
above programme explained
prompts for a web address, opens page, reads data, passes data to BS parser, retrieves all anchor tags, prints out href attribute for each tag.
Output: includes HTML anchor tags whose href is a relative path (eg ending in .html) or an in-page reference (eg #). Unlike our RE programme, this doesn't select only those starting with http:// etc (see the sketch below for filtering them out).
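a minimal sketch (assuming tags = soup('a') as in the programme above) of filtering the output down to absolute http/https links, matching the RE programme's behaviour:
for tag in tags:
    href = tag.get('href', None)
    if href is not None and href.startswith(('http://', 'https://')):
        print(href)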
use BS to pull out various parts of each tag (* denotes a line is different to programme above):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()   #*
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)                     #* and all the rest!
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)