strings Flashcards
chr() and ord()
ord(b'A')
returns 65
chr(65)
returns 'A', a standard string. It doesn't return b'A'.
A standard string (the output of chr()) doesn't behave like a byte string, so chr() can't turn numbers back into bytes.
For this, there is the struct module.
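A small sketch of the str/bytes difference, using only built-ins. The variable names are made up for illustration.

```python
# ord() accepts a one-character str or a length-1 bytes object.
code_from_bytes = ord(b'A')
code_from_str = ord('A')

# chr() always returns a str, never bytes.
char = chr(65)

# To turn an integer into a single byte, build a bytes object directly.
byte = bytes([65])

# str and bytes never compare equal, even when they look alike.
same = (char == byte)
```

Here char is 'A', byte is b'A', and same is False.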
struct module
Performs complex conversions
chr() and ord() only handle one character or byte at a time. A single byte limits values to the range 0 to 255.
struct.pack() writes values out as byte strings.
struct.unpack() reads those values back into Python.
>>> import struct
>>> struct.pack(b'B', 65)
b'A'
>>> struct.pack(b'B', 33)
b'!'
>>> struct.pack(b'BBBBBBB', 69, 120, 97, 109, etc....)
b'Example'
If the input is signed, there are still 256 values, but they range from -128 to 127.
>>> struct.pack(b'bb', 65, -23)
b'A\xe9'
Lowercase format codes assume signed values.
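To see what signed versus unsigned means in practice, here is a minimal sketch that reads the same byte both ways. The names are illustrative only.

```python
import struct

raw = b'\xe9'

# Read as unsigned ('B'), the byte is a value from 0 to 255.
(unsigned_value,) = struct.unpack('B', raw)

# Read as signed ('b'), the same bits are a value from -128 to 127.
(signed_value,) = struct.unpack('b', raw)

# The two readings always differ by exactly 256 for bytes above 127.
difference = unsigned_value - signed_value
```

The byte is 233 when unsigned and -23 when signed.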
For two-byte numbers, use H and h.
There are 65,536 possible values.
unpack
>>> struct.unpack(b'H', b'\x00*')
(10752,)
>>> struct.unpack(b'H', b'*\x00')
(42,)
(Results from a little-endian machine. b'*' is the byte with value 42.)
pack and unpack are true inverses.
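A quick round-trip sketch of that inverse relationship. The values and the '<BbH' format are arbitrary choices for the example; '<' fixes the byte order so the bytes come out the same on any machine.

```python
import struct

values = (42, -7, 65535)

# One unsigned byte, one signed byte, one unsigned two-byte number.
packed = struct.pack('<BbH', *values)

# Unpacking with the same format string returns the original values.
restored = struct.unpack('<BbH', packed)
```

unpack() always returns a tuple, even for a single value, which is why the earlier results look like (42,).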
Four-byte numbers use I and i; eight-byte numbers use Q and q.
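The sizes and ranges of these codes can be checked with struct.calcsize(). This is only a sketch; the dictionary names are invented for the example.

```python
import struct

# '<' selects standard sizes, so the numbers don't depend on the platform.
sizes = {code: struct.calcsize('<' + code) for code in 'BHIQ'}

# An unsigned code of n bytes holds values from 0 to 2**(8*n) - 1.
unsigned_max = {code: 2 ** (8 * size) - 1 for code, size in sizes.items()}

# Values outside the range raise struct.error instead of wrapping around.
try:
    struct.pack('<H', unsigned_max['H'] + 1)
    overflow_rejected = False
except struct.error:
    overflow_rejected = True
```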
Endianness
The term for how the bytes of a multi-byte value are ordered.
- Big-endian: the byte that provides the largest part of the number gets stored first.
- Little-endian: the byte that provides the smallest part of the number gets stored first.
'<' before the format codes ('H', 'h', 'B', or 'b') signifies little-endian.
'>' signifies big-endian.
Little endian is typically used on modern systems
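A short sketch of the two byte orders side by side, packing the number 42 as a two-byte value. sys.byteorder reports which order the current machine uses.

```python
import struct
import sys

little = struct.pack('<H', 42)   # smallest byte first
big = struct.pack('>H', 42)      # largest byte first

# With no prefix, struct uses the machine's native order.
native = struct.pack('H', 42)
native_matches = native == (little if sys.byteorder == 'little' else big)

# Reading bytes with the wrong order gives a very different number.
(misread,) = struct.unpack('<H', big)
```

little is b'*\x00', big is b'\x00*', and misread is 10752.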
Converting strings
Not terribly interesting.
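Consider:
>>> first_name = 'Marty'
>>> last_name = 'Alchin'
>>> age = 28
>>> # The 's' code only accepts bytes, so the names are encoded first.
>>> data = struct.pack(b'10s10sB', last_name.encode('ascii'), first_name.encode('ascii'), age)
>>> data
b'Alchin\x00\x00\x00\x00Marty\x00\x00\x00\x00\x00\x1c'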
sorta handy, I guess.
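Going the other way is where it gets a bit more useful. A sketch of reading such a fixed-width record back, assuming the null padding should be stripped and the text is ASCII:

```python
import struct

record_format = '10s10sB'
data = struct.pack(record_format, b'Alchin', b'Marty', 28)

# Each '10s' field comes back as exactly ten bytes, padding included.
last_raw, first_raw, age = struct.unpack(record_format, data)

# Strip the null padding and decode to get ordinary strings again.
last_name = last_raw.rstrip(b'\x00').decode('ascii')
first_name = first_raw.rstrip(b'\x00').decode('ascii')
```

Every record is the same length (21 bytes here), which makes it easy to seek to the nth record in a file.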
Text
History:
ASCII - American Standard Code for Information Interchange
128 characters, 95 of them printable
Only covered 7 bits of each byte, but even the other 128 values an eighth bit allows weren't enough for languages other than English.
Unicode - the standard in Python 3, where every str is Unicode. The 'u' prefix was dropped in Python 3.0 and has been accepted again since 3.3, purely for compatibility. The 'b' prefix signifies bytes.
Encoding - text usually doesn't need the full range of Unicode, so there are encodings that map it to bytes. 'Python string'.encode('ascii') returns the same text as a bytes object, one byte per character.
UTF-8 is the most common
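A minimal sketch of encoding and decoding, including what happens when the text doesn't fit the chosen encoding. The sample strings are arbitrary.

```python
text = 'Python string'

# encode() turns a str into bytes; decode() reverses it.
encoded = text.encode('ascii')
decoded = encoded.decode('ascii')

# ASCII can't represent accented characters, so this raises an error.
try:
    'caf\u00e9'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 handles the same text without trouble.
utf8_bytes = 'caf\u00e9'.encode('utf-8')
```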
UTF-8
Characters within a certain range take a single byte. Others take two, three, or even four bytes.
It is desirable because:
- It can represent any Unicode code point. This isn't unique to UTF-8, but it's an improvement over ASCII.
- The more common the character, the less space its code point takes up.
- The single-byte range coincides exactly with ASCII, so UTF-8 is backward compatible with ASCII.
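The variable width is easy to check. A sketch with four sample characters, written as escapes so the example doesn't depend on the file's own encoding:

```python
# 'A', e with acute accent, the euro sign, and an emoji.
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

# Number of bytes each character needs in UTF-8.
lengths = [len(ch.encode('utf-8')) for ch in samples]

# For pure ASCII text, the UTF-8 and ASCII encodings are byte-for-byte identical.
ascii_compatible = 'plain text'.encode('utf-8') == 'plain text'.encode('ascii')
```

lengths comes out as [1, 2, 3, 4].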
string formatting codes
%s = str
%r = repr
plenty more
Values can be inserted by keyword as well, using a dictionary:
def log(*args):
    for i, arg in enumerate(args):
        print('Argument %(i)s: %(arg)r' % {'i': i, 'arg': arg})
>>> log('test', 'ing')
Argument 0: 'test'
Argument 1: 'ing'
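A few of the "plenty more" codes, next to %s and %r, as a rough sketch. The values are arbitrary.

```python
value = 'spam'

as_str = '%s' % value          # uses str()
as_repr = '%r' % value         # uses repr(), so the quotes are included

padded = '%05d' % 42           # integer, zero-padded to five digits
rounded = '%.2f' % 3.14159     # float, two decimal places

# Keyword-style codes take their values from a dictionary.
by_name = '%(name)s is %(age)d' % {'name': 'Marty', 'age': 28}
```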
new formatting
New, more robust method:
>>> 'This is argument 0: {0}'.format('test')
'This is argument 0: test'
>>> 'This is argument key: {key}'.format(key='value')
'This is argument key: value'
Because it’s a method call rather than an operator, you can mix positional and keyword arguments, referencing them in any order.
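A sketch of mixing the two styles in one template. The template text and argument names are invented for the example.

```python
template = '{greeting}, {0}! You have {1} new {kind}.'

# Positional arguments come first in the call, keywords after,
# but the template can reference them in any order.
message = template.format('Marty', 3, greeting='Hello', kind='messages')

# Positional fields can also be reordered or repeated.
reordered = '{1} before {0}, then {1} again'.format('first', 'second')
```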
Looking up values within objects
An example explains it best:
>>> import datetime
>>> def format_time(time):
...     return '{0.minute} past {0.hour}'.format(time)
...
>>> format_time(datetime.time(8, 10))
'10 past 8'
>>> '{0[spam]}'.format({'spam': 'eggs'})
'eggs'
Distinguishing types of strings
Immediately after the field reference (index or keyword), add a '!' followed by 's' or 'r' to force str() or repr() on the value.
exact_match is a simple function from a previous exercise:
>>> validate_test = exact_match('test', 'Expected {1!r}, got {0!r}')
>>> validate_test('invalid')
Traceback...
ValueError: Expected 'test', got 'invalid'
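The original exact_match isn't shown in these notes, so the version below is a hypothetical reconstruction that merely fits the usage above: it assumes the received value is passed to format() as index 0 and the expected value as index 1. The date lines just show the difference between !s and !r.

```python
import datetime

def exact_match(expected, error):
    """Hypothetical stand-in: return a validator that demands `expected`."""
    def validator(value):
        if value != expected:
            # Index 0 is the value received, index 1 the value expected.
            raise ValueError(error.format(value, expected))
    return validator

validate_test = exact_match('test', 'Expected {1!r}, got {0!r}')
try:
    validate_test('invalid')
    error_text = None
except ValueError as exc:
    error_text = str(exc)

# !s and !r on the same object give str() and repr() respectively.
day = datetime.date(2020, 1, 2)
as_str = '{0!s}'.format(day)
as_repr = '{0!r}'.format(day)
```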
Standard format specification
After the field reference, you can include a colon followed by a string that controls the formatting.
'{0:>20}{1}' translates to "right-align index 0 in a field 20 characters wide, then insert index 1."
'{0:=^40}'.format(text)
translates to "center text in a field of 40 characters, padded with equal signs."
If the text is longer than 40 characters, format() widens the field rather than truncating, so there would be no '=' on either side.
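Both specifications, plus the too-long case, as a quick sketch with made-up sample strings:

```python
# Right-align 'name' in 20 characters, then append the second argument.
right = '{0:>20}{1}'.format('name', '!')

# Center 'text' in 40 characters, filling with '='.
centered = '{0:=^40}'.format('text')

# Text longer than the width is left alone: no truncation, no fill.
long_text = 'x' * 50
unchanged = '{0:=^40}'.format(long_text)
```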
custom format specification
pretty dumb,
format() isn’t in control of the formatting syntax described so far.
It instead delegates that control to __format__() defined on the object.
class Verb:
    def __init__(self, present, past=None):
        self.present = present
        self.past = past
    def __format__(self, tense):
        if tense == 'past':
            return self.past
        else:
            return self.present
>>> format = Verb('format', past='formatted')
>>> message = '{0:present} strings with {0:past} objects.'
>>> message.format(format)
'format strings with formatted objects.'
especially since ‘formatted’, here, is an adjective.
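A possibly less silly use of __format__(): accept a couple of custom keywords and hand anything else to the standard float formatting. The Temperature class and its 'celsius'/'fahrenheit' keywords are invented for this sketch.

```python
class Temperature:
    """Hypothetical example class; stores a temperature in Celsius."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        if spec == 'fahrenheit':
            return '{0:.1f} F'.format(self.celsius * 9 / 5 + 32)
        if spec in ('', 'celsius'):
            return '{0:.1f} C'.format(self.celsius)
        # Anything else is treated as an ordinary float specification.
        return format(self.celsius, spec)

boiling = Temperature(100)
custom = '{0:celsius} is {0:fahrenheit}'.format(boiling)
standard = '{0:>8.2f}'.format(boiling)
```

custom is '100.0 C is 212.0 F', while standard falls through to the normal float rules and gives '  100.00'.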