Chapters 12 & 13: Networked Programmes and Using Web Services Flashcards

1
Q

socket

A

like a file, but a two-way connection between two programmes; you can write to and read from either end.

2
Q

http protocol

A

set of rules determining which end of the socket sends first, what it sends, how the other end responds to that message, who sends the next message, and so on.

3
Q

simple HTTP protocol programme which:

  • connects to port 80 of data.pr4e.org and requests romeo.txt
  • contains a loop which receives data from the socket in 512-character chunks and prints it until there is no data left
A

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()

4
Q

\r\n

\r\n\r\n

A

end-of-line (EOL) sequence used by HTTP

two EOL sequences with nothing between them, equivalent to a blank line; this marks the end of the request (or of the response headers)
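
For example, splitting the request string from the programme above on '\r\n' (a small illustration, not from the book) shows the request line followed by the empty strings produced by the final '\r\n\r\n':

cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'
# split on the EOL sequence: the request line, then two empty strings from the
# back-to-back EOLs that form the blank line ending the request
print(cmd.split('\r\n'))   # ['GET http://data.pr4e.org/romeo.txt HTTP/1.0', '', '']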

5
Q

output of above programme

A

the server first sends back headers, one per line, with info e.g. date, server, last modified, content-length, content-type (e.g. text/plain)

Then a blank line, then the actual data from romeo.txt.
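
Illustrative output (the header values here are invented for the example; real values depend on the server):

HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 18:52:55 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
Content-Length: 167
Content-Type: text/plain

But soft what light through yonder window breaks
...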

6
Q

encode/decode meaning, and alternative

A

Sockets (and hence HTTP) send and receive data as bytes objects, not strings, so encode() converts a string to bytes before sending and decode() converts received bytes back to a string.
'x'.encode() is equivalent to b'x' (for plain ASCII/UTF-8 text).
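
A quick round trip (sketch):

# encode() converts str -> bytes before sending; decode() converts received bytes back to str
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
print(type(cmd))           # <class 'bytes'>
print(cmd[:3] == b'GET')   # True: 'GET'.encode() gives the same bytes as b'GET'
print(cmd.decode()[:3])    # GET -- back to a string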

7
Q

programme to retrieve an image:

body type
number of characters received per call

A

http://www.py4e.com/code3/urljpeg.py
- rather than copying the data to the screen, we accumulate it in a bytes variable, trim off the headers, and save the image data to a file (see the sketch below)
image/jpeg
roughly 3200-5120 characters per call - it depends on how much data the server has delivered before the call, i.e. on network speed; the last call can return fewer than 3200 if less than that remains.
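
A sketch along the lines of the book's urljpeg.py (the exact code, variable names, and output filename in the real file may differ):

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
mysock.send('GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n'.encode())

picture = b""
while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    picture = picture + data
mysock.close()

# the headers end at the first blank line (\r\n\r\n); everything after it is the image
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# skip past the headers and the 4 bytes of \r\n\r\n, then save the body as a binary file
picture = picture[pos + 4:]
fhand = open('stuff.jpg', 'wb')
fhand.write(picture)
fhand.close()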

8
Q

how to slow down successive recv() calls to more consistently get specified number of characters (5120 above):

A

add import time at the top and call time.sleep(0.25) inside the receive loop (after the recv()/length check, before accumulating the data).
This gives a 0.25 s pause between successive recv() calls, letting the server get ahead of us so each call more consistently returns the full 5120 characters unless the connection is very slow; it obviously makes the programme take longer overall.

9
Q

easier, less manual way of retrieving data using urllib (similar to getting a file):

A

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

10
Q

find frequency of words in a web page

A
same setup as above (import urllib.request and urlopen the page), plus an empty dictionary:

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
11
Q

retrieve a binary file (e.g. image or video) and save it as a file locally

A

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()
The 'wb' argument to open() opens the file for writing in binary mode.

12
Q

if file too big (will take too long) - break into chunks and write bit by bit (100000 chars at a time):

A
# same imports as above; this time keep the response object rather than calling .read() immediately
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()

13
Q

web scraping

A

write a programme that pretends to be a web browser, retrieves pages, and then looks for patterns in those pages' data. This is essentially what Google does: it looks at word frequency, the links pointing to a page, and so on to rank the importance of millions of pages and suggest the best ones.

14
Q

parse HTML with a regular expression to find links, where links take the form href="http://www.blah.. (and you know the rough structure of the link)

A
# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())

15
Q

ssl library

how you import it

A

allows a programme to access websites which strictly enforce HTTPS.
import ssl

16
Q

read method used here:

A

returns the HTML source code as a bytes object rather than returning an HTTPResponse object

17
Q

risk of using REs:

solution (vague)

A

there are lots of broken HTML pages out there - using only REs may miss some links or retrieve bad data.
solved by using a robust HTML parsing library

18
Q

HTML parsing libraries in python

A

many libraries exist to help parse HTML (whereas an XML parser might reject an HTML page as improperly formatted); each has its own strengths and weaknesses.

19
Q

BeautifulSoup (BS)

A

HTML parsing library which tolerates highly flawed HTML and lets you easily extract the data you need. It is not part of the Python standard library, so it must be downloaded and installed first.

20
Q

pip

A
tool to install Python packages. Once pip is installed, run:
pip install <package> at the command prompt (e.g. pip install beautifulsoup4 for BeautifulSoup).
21
Q

use urllib to read a page, then BeautifulSoup to extract href attributes from the anchor (a) tags

A

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
22
Q

above programme explained

A

prompts for a web address, opens the page, reads the data, passes the data to the BS parser, retrieves all anchor tags, and prints out the href attribute of each tag.
Output: includes anchor tags whose href values are relative paths (e.g. ending in .html) or in-page references (e.g. starting with #). Unlike our RE programme, this doesn't restrict itself to links starting with http:// or https://.

23
Q

use BS to pull out various parts of each tag (* denotes a line that differs from the programme above):

A

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urlopen(url, context=ctx).read() #*
soup = BeautifulSoup(html, "html.parser")
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag) #* and all the rest!
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)