Why is working with Word and PDF docs more complex?
Because the data type is binary!
In addition to text, they store lots of font, color, and layout information.
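A minimal sketch of what "binary" means in practice, using only the standard library. It relies on the fact that a PDF file starts with the bytes %PDF-. The helper name looks_like_pdf and the stand-in file are made up for illustration.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature b'%PDF-'."""
    # 'rb' opens the file in binary mode, so read() returns bytes, not str.
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'


# Build a tiny stand-in file so the sketch runs without a real PDF.
# It only carries a PDF-style header; it is not a valid PDF document.
fd, sample_path = tempfile.mkstemp(suffix='.pdf')
with os.fdopen(fd, 'wb') as f:
    f.write(b'%PDF-1.4\n')

is_pdf = looks_like_pdf(sample_path)
os.remove(sample_path)
print(is_pdf)
```

Everything after that header (fonts, colors, layout) is just as opaque to a plain text editor, which is why a library such as PyPDF2 is needed.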
Given a PDF file, how do you extract the text of a page?
PyPDF2 uses a 0-based page index, so getPage(0) is the first page and getPage(1) the second.
>>> import PyPDF2
>>> pdfFileObj = open('name.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
>>> pageObj = pdfReader.getPage(1)
>>> pageObj.extractText()
>>> pdfFileObj.close()
What is a problem with PyPDF2?
2. Or it may not display content correctly
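The page-by-page extraction shown earlier could be wrapped in a loop over all pages. This is only a sketch: extract_all_text and read_pdf are hypothetical helper names, and the camelCase calls (PdfFileReader, numPages, getPage, extractText) assume a PyPDF2 version older than 3.0, where these names were replaced by PdfReader, reader.pages, and extract_text().

```python
def extract_all_text(pdf_reader):
    """Collect the text of every page of an already opened PdfFileReader."""
    pages = []
    # Page indexes are 0-based: 0 is the first page, numPages - 1 the last.
    for page_num in range(pdf_reader.numPages):
        pages.append(pdf_reader.getPage(page_num).extractText())
    return pages


def read_pdf(path):
    """Open a PDF and return a list of page texts (assumes PyPDF2 < 3.0)."""
    import PyPDF2  # imported here so extract_all_text has no dependency
    # The file has to stay open while the pages are being read.
    with open(path, 'rb') as pdf_file:
        return extract_all_text(PyPDF2.PdfFileReader(pdf_file))
```

Usage would be something like read_pdf('name.pdf')[0] for the text of the first page. The result can still be empty or garbled for PDFs that PyPDF2 does not handle well.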
How can you decrypt a PDF? What do you need to consider with PyPDF2?
What happens if you don't consider it?
>>> import PyPDF2
>>> pdfReader = PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb'))
>>> pdfReader.decrypt('password')
>>> pageObj = pdfReader.getPage(0)
NOTE:
If getPage() was already called on the encrypted reader, re-create the pdfReader = … variable first, then call decrypt() before any further getPage() call.
Otherwise: IndexError "list index out of range"
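One way to avoid touching pages too early is to put the decrypt step in front of every page access. A minimal sketch, assuming the PyPDF2 1.x reader API (isEncrypted, and decrypt() returning 0 on a wrong password); open_page is a hypothetical helper name.

```python
def open_page(pdf_reader, password, page_num=0):
    """Decrypt the reader if needed, then return the requested page."""
    if pdf_reader.isEncrypted:
        # decrypt() returns 0 when the password does not match.
        if pdf_reader.decrypt(password) == 0:
            raise ValueError('wrong password')
    # Pages are only accessed after a successful decrypt().
    return pdf_reader.getPage(page_num)
```

Called on a fresh reader, e.g. open_page(PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb')), 'password'), this never calls getPage() before decrypt().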
Which 2 functions allow you to merge PDF Files?
*.getPage(i) and *.addPage()
pageObj = pdfReader.getPage(pageNum)
pdfWriter.addPage(pageObj)
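Put together, the two calls could be used to merge several files into one. This is a sketch under assumptions: append_pages and merge_pdfs are hypothetical helper names, and the API is that of PyPDF2 older than 3.0.

```python
def append_pages(pdf_reader, pdf_writer):
    """Copy every page of pdf_reader to the end of pdf_writer."""
    for page_num in range(pdf_reader.numPages):
        pdf_writer.addPage(pdf_reader.getPage(page_num))
    return pdf_writer


def merge_pdfs(input_paths, output_path):
    """Write all pages of all input PDFs, in order, to output_path."""
    import PyPDF2  # assumes PyPDF2 < 3.0
    writer = PyPDF2.PdfFileWriter()
    open_files = []
    try:
        for path in input_paths:
            pdf_file = open(path, 'rb')
            open_files.append(pdf_file)
            append_pages(PyPDF2.PdfFileReader(pdf_file), writer)
        # The source files must stay open until the writer has written.
        with open(output_path, 'wb') as result:
            writer.write(result)
    finally:
        for pdf_file in open_files:
            pdf_file.close()
```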
*How can you rotate a PDF page?
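A possible answer, sketched with the page methods rotateClockwise() and rotateCounterClockwise() from PyPDF2 older than 3.0; both take an angle that is a multiple of 90. The wrapper rotate_page is a hypothetical helper, not part of the library.

```python
def rotate_page(page_obj, degrees):
    """Rotate a page clockwise (positive) or counterclockwise (negative)."""
    if degrees % 90 != 0:
        raise ValueError('rotation must be a multiple of 90 degrees')
    if degrees >= 0:
        page_obj.rotateClockwise(degrees)
    else:
        page_obj.rotateCounterClockwise(-degrees)
    return page_obj
```

The rotated page object still has to be added to a pdfWriter with *.addPage() and written to a new file.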
*How can you overlay (merge) PDF pages?
1.
Get a page object from each of the two PDF files
2.
Fuse both objects together
>>> pageObj1.mergePage(pageObj2)
3.
Write a new PDF file and use *.addPage(pageObj1)
_____________
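import PyPDF2
# 1. Get the first page of each file
minutesFile = open('meetingminutes.pdf', 'rb')
pdf1Reader = PyPDF2.PdfFileReader(minutesFile)
minutesFirstPage = pdf1Reader.getPage(0)
waterFile = open('watermark.pdf', 'rb')  # second file; this file name is an assumption
pdf2Reader = PyPDF2.PdfFileReader(waterFile)
waterFirstPage = pdf2Reader.getPage(0)
# 2. Overlay the watermark page onto the minutes page
minutesFirstPage.mergePage(waterFirstPage)
# 3. Write the merged page to a new PDF
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(minutesFirstPage)
resultPdfFile = open('watermarkedCover.pdf', 'wb')
pdfWriter.write(resultPdfFile)
minutesFile.close()
waterFile.close()
resultPdfFile.close()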
How can you encrypt PDF Files?
Call *.encrypt('password') on the pdfWriter object before writing the file
____
>>> import PyPDF2
>>> pdfFile = open('meetingminutes.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFile)
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> for pageNum in range(pdfReader.numPages):
...     pdfWriter.addPage(pdfReader.getPage(pageNum))
...
>>> pdfWriter.encrypt('swordfish')
>>> resultPdf = open('encryptedminutes.pdf', 'wb')
>>> pdfWriter.write(resultPdf)
>>> resultPdf.close()
What is the structure of binary Word Documents?
They have 3 layers: a Document object contains Paragraph objects, and each Paragraph contains Run objects.
How can you read a Word document?
>>> import docx
>>> doc = docx.Document('doc.docx')
How can you get the full Text from a file?
E.g. by creating a function which can be called later:
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for p in doc.paragraphs:
        fullText.append(p.text)
    return '\n'.join(fullText)
Save as readDocx.py, then call it:
import readDocx
print(readDocx.getText('d.docx'))
How can you style paragraphs and runs?
Name 3x styles
Assign a style name to the *.style attribute, e.g. one of these 3:
'Body Text'
'Heading 1' (up to 9)
'List Bullet'
Runs have more attributes with boolean values, e.g.:
*.bold = True
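A small sketch of how such a boolean run attribute might be used across a whole document. It only relies on doc.paragraphs, paragraph.runs, run.text, and run.bold; the function name bold_runs_containing is made up for illustration.

```python
def bold_runs_containing(doc, word):
    """Set bold = True on every run whose text contains word.

    Returns the number of runs that were changed.
    """
    changed = 0
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            if word in run.text:
                run.bold = True
                changed += 1
    return changed
```

With python-docx this would be called on a document opened via docx.Document('doc.docx'), followed by doc.save() to keep the change. A word that is split across two runs would not be found by this simple check.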
How can you write paragraphs to a Word document and later add text to an earlier paragraph?
>>> import docx
>>> doc = docx.Document()
>>> doc.add_paragraph('Hello world!')
>>> paraObj1 = doc.add_paragraph('This is a second paragraph.')
>>> paraObj2 = doc.add_paragraph('This is a yet another paragraph.')
>>> paraObj1.add_run(' This text is being added to the second paragraph.')
>>> doc.save('multipleParagraphs.docx')
How do you add a heading to a Word doc?
>>> doc.add_heading('Header', 0)
Heading levels 0 to 4, where 0 is the Title style (python-docx itself accepts levels up to 9)
Add a line and a page break to text in a word doc
Line break: use add_break() on run object
Page break:
>>> doc = docx.Document()
>>> doc.add_paragraph('This is on the first page!')
>>> doc.paragraphs[0].runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)
How do you add pictures?
>>> doc.add_picture('zophie.png', width=docx.shared.Cm(1),
...                 height=docx.shared.Cm(4))
You can give sizes in centimeters (docx.shared.Cm) or inches (docx.shared.Inches)
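Both units end up as the same internal length: python-docx stores sizes in English Metric Units (EMU), with 914400 EMU per inch and 360000 EMU per centimeter. The arithmetic can be sketched without the library; the function names here are illustrative and not part of python-docx.

```python
EMU_PER_INCH = 914400
EMU_PER_CM = 360000


def cm_to_emu(cm):
    """Convert centimeters to whole EMU (rounded to the nearest unit)."""
    return round(cm * EMU_PER_CM)


def inches_to_emu(inches):
    """Convert inches to whole EMU (rounded to the nearest unit)."""
    return round(inches * EMU_PER_INCH)


# 1 inch is 2.54 cm, so both conversions give the same length.
print(cm_to_emu(2.54) == inches_to_emu(1))
```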