15 Word & PDF Files Flashcards
Why is working with Word and PDF docs more complex?
Because they are binary files, not plain text.
In addition to the text, they store a lot of font, color, and layout information.
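A small sketch of what "binary" means in practice: both formats begin with a fixed byte signature rather than readable text. The helper name `sniff_format` is hypothetical, and the check is only a rough heuristic (a .docx file is a ZIP archive, so any ZIP-based file matches the second branch).

```python
def sniff_format(first_bytes):
    """Guess a file type from its leading bytes (a heuristic, not a validator)."""
    if first_bytes.startswith(b"%PDF-"):
        return "pdf"
    if first_bytes.startswith(b"PK\x03\x04"):
        # .docx files are ZIP archives; other ZIP-based formats match too.
        return "zip (possibly .docx)"
    return "unknown"
```

With a real file, the first bytes could be read with `open('name.pdf', 'rb').read(8)` and passed in.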
Given a PDF File, how do you
- Find out how many pages it has
- Get the text from Page 2?
PyPDF2 uses a 0-based index, so page 2 is getPage(1).
>>> import PyPDF2
>>> pdfFileObj = open('name.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
>>> pageObj = pdfReader.getPage(1)
>>> pageObj.extractText()
>>> pdfFileObj.close()
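The off-by-one between page numbers and indexes is easy to get wrong, so it can be wrapped in a helper. This is a minimal sketch, assuming a reader object with the `numPages`, `getPage()` and `extractText()` names used on this card; `page_text` itself is a hypothetical name.

```python
def page_text(reader, page_number):
    """Return the text of a page given its 1-based page number."""
    if not 1 <= page_number <= reader.numPages:
        raise ValueError(f"page {page_number} is outside 1..{reader.numPages}")
    # PyPDF2 counts pages from 0, so page 2 lives at index 1.
    return reader.getPage(page_number - 1).extractText()
```

With a real file, `reader` would be the `pdfReader` object created above, and `page_text(pdfReader, 2)` would fetch page 2.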
What is a problem with PyPDF2?
- It might be unable to open some PDF files
- It might not extract the content correctly
How can you decrypt a PDF? What do you need to consider with PyPDF2?
What happens if you don't consider it?
>>> import PyPDF2
>>> pdfReader = PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb'))
>>> pdfReader.decrypt('password')
>>> pageObj = pdfReader.getPage(0)
NOTE:
Call decrypt() before the first getPage(). If getPage() was already called on the still-encrypted reader, re-create the reader (pdfReader = ...) before decrypting and creating a pageObj.
Otherwise: IndexError "list index out of range"
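The decrypt step can be guarded so that a wrong password fails loudly instead of surfacing later as a confusing error. This is a sketch under the assumption that the reader behaves like a PyPDF2 1.x `PdfFileReader`, where `isEncrypted` reports encryption and `decrypt()` returns 0 for a rejected password; the helper name `unlock` is hypothetical.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if needed; raise if the password is rejected."""
    if not reader.isEncrypted:
        return reader
    # Assumption: decrypt() returns 0 when the password does not match.
    if reader.decrypt(password) == 0:
        raise ValueError("wrong password")
    return reader
```

Calling `unlock()` right after creating the reader, before any `getPage()` call, follows the order the note describes.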
Which 2 functions allow you to merge PDF Files?
*.getPage(i) and *.addPage()
pageObj = pdfReader.getPage(pageNum)
pdfWriter.addPage(pageObj)
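Merging whole files is the same two calls in a loop. A minimal sketch, assuming reader objects with `numPages` and `getPage()` and a writer with `addPage()` as on this card; `append_all_pages` is a hypothetical helper name.

```python
def append_all_pages(writer, readers):
    """Copy every page of each reader, in order, into writer; return the count."""
    copied = 0
    for reader in readers:
        for page_num in range(reader.numPages):
            writer.addPage(reader.getPage(page_num))
            copied += 1
    return copied
```

The merged result still has to be saved by opening an output file in 'wb' mode and passing it to the writer's `write()` method.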
How can you rotate a PDF page?
- Create a page object
>>> page = pdfReader.getPage(0)
- degree = 90 | 180 | 270
>>> page.rotateClockwise(degree)
- Write a new PDF
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> pdfWriter.addPage(page)
>>> resultPdfFile = open('rotatedPage.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> resultPdfFile.close()
How can you overlay (merge) PDF pages?
1. Open both files in 'rb' mode, read them, and create a page object for each
2. Fuse both page objects together
>>> pageObj1.mergePage(pageObj2)
3. Write a new PDF file and use *.addPage(pageObj1)
_____________
# 1. (repeat these steps for the second file to get waterFirstPage)
minutesFile = open('meetingminutes.pdf', 'rb')
pdf1Reader = PyPDF2.PdfFileReader(minutesFile)
minutesFirstPage = pdf1Reader.getPage(0)
# 2.
minutesFirstPage.mergePage(waterFirstPage)
# 3.
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(minutesFirstPage)
resultPdfFile = open('watermarkedCover.pdf', 'wb')
pdfWriter.write(resultPdfFile)
minutesFile.close()
resultPdfFile.close()
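The example above stamps only the first page. A sketch of applying the same overlay to every page, assuming the `getPage()`, `mergePage()` and `addPage()` methods used on this card; `watermark_all` is a hypothetical helper name.

```python
def watermark_all(reader, watermark_page, writer):
    """Merge watermark_page into every page of reader and add it to writer."""
    for page_num in range(reader.numPages):
        page = reader.getPage(page_num)
        # mergePage() adds the watermark page's content to this page.
        page.mergePage(watermark_page)
        writer.addPage(page)
    return writer
```

Both input files need to stay open until the writer has written the result, as in the example above.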
How can you encrypt PDF Files?
Call *.encrypt('password') on the pdfWriter object before writing the file
____
>>> import PyPDF2
>>> pdfFile = open('meetingminutes.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFile)
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> for pageNum in range(pdfReader.numPages):
...     pdfWriter.addPage(pdfReader.getPage(pageNum))
>>> pdfWriter.encrypt('swordfish')
>>> resultPdf = open('encryptedminutes.pdf', 'wb')
>>> pdfWriter.write(resultPdf)
>>> resultPdf.close()
What is the structure of binary Word Documents?
They have 3 layers:
- Document Object (whole doc)
- Paragraphs (a new one starts whenever the user presses Enter)
- Runs (each paragraph contains one or more; a run is a stretch of text with the same style)
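The three layers can be walked with two nested loops. A minimal sketch, assuming a document object that exposes `paragraphs`, each with `text` and `runs`, as python-docx's `Document` does; `outline` is a hypothetical helper name.

```python
def outline(doc):
    """Return [(paragraph_text, [run_text, ...]), ...] for a Document-like object."""
    return [(p.text, [r.text for r in p.runs]) for p in doc.paragraphs]
```

With a real file, `doc` would come from `docx.Document('doc.docx')`.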
How can you read:
- how many paragraphs a Word doc has
- the content of the doc's first paragraph
- how many runs the second paragraph holds
- the content of the second paragraph's first run
>>> import docx
>>> doc = docx.Document('doc.docx')
- how many paragraphs a Word doc has
>>> len(doc.paragraphs)
- the content of the doc's first paragraph
>>> doc.paragraphs[0].text
- how many runs the second paragraph holds
>>> len(doc.paragraphs[1].runs)
- the content of the second paragraph's first run
>>> doc.paragraphs[1].runs[0].text
How can you get the full text from a file?
E.g. by creating a function which can be called later:
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for p in doc.paragraphs:
        fullText.append(p.text)
    return '\n'.join(fullText)
Save as readDocx.py, then call:
import readDocx
print(readDocx.getText('d.docx'))
How can you style:
- Paragraphs
- Runs
Name 3x style attributes
- Paragraphs: *.style = 'attribute'
>>> doc.paragraphs[0].style = 'Normal'
- Runs: *.style = 'attributeChar'
>>> doc.paragraphs[1].runs[0].style = 'QuoteChar'
3x Attributes:
'Body Text'
'Heading 1' (up to 9)
'List Bullet'
Runs have more attributes with boolean values, e.g.:
*.bold = True
How can you write several paragraphs to a Word document and later add text to one of them?
>>> import docx
>>> doc = docx.Document()
>>> doc.add_paragraph('Hello world!')
>>> paraObj1 = doc.add_paragraph('This is a second paragraph.')
>>> paraObj2 = doc.add_paragraph('This is a yet another paragraph.')
>>> paraObj1.add_run(' This text is being added to the second paragraph.')
>>> doc.save('multipleParagraphs.docx')
How do you add a heading to a Word doc?
>>> doc.add_heading('Header', 0)
Heading levels 0 to 4: 0 is the Title style, 1 is the main heading, 2 to 4 are subheadings
Add a line break and a page break to text in a Word doc
Line break: call add_break() on the run object
Page break: pass docx.enum.text.WD_BREAK.PAGE to add_break()
>>> doc = docx.Document()
>>> doc.add_paragraph('This is on the first page!')
>>> doc.paragraphs[0].runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)