15 Word & PDF Files Flashcards

Question 1

Q

Why is working with Word and PDF docs more complex?

Answer

A

Because they are binary files, not plain text!

In addition to text, they store lots of font, color, and layout information.
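
One way to see the binary nature of both formats is to read the first few bytes in 'rb' mode. The sketch below assumes the usual file signatures: a PDF starts with %PDF- and a .docx file is a ZIP archive, which starts with PK. The helper name sniff_format is made up for this card.

```python
def sniff_format(path):
    """Guess a file's format from its first bytes (hypothetical helper)."""
    with open(path, 'rb') as f:   # 'rb' = read binary, no text decoding
        head = f.read(5)
    if head.startswith(b'%PDF-'):
        return 'pdf'
    if head.startswith(b'PK'):    # ZIP signature; a .docx is a ZIP container
        return 'zip (possibly docx)'
    return 'unknown'
```

A result of 'zip (possibly docx)' only says the container is a ZIP archive; other formats such as .xlsx share the same signature.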

Question 2

Q

Given a PDF File, how do you

Find out how many pages it has
Get the text from Page 2?

Answer

A

PyPDF2 uses a 0-based index, so page 2 is getPage(1):

>>> import PyPDF2
>>> pdfFileObj = open('name.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
>>> pageObj = pdfReader.getPage(1)
>>> pageObj.extractText()
>>> pdfFileObj.close()
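
The same steps can be wrapped in a small function that takes a 1-based page number and does the index shift itself. This is a minimal sketch, assuming the pre-3.0 PyPDF2 names used on these cards (PdfFileReader, numPages, getPage, extractText); the function name page_text is hypothetical.

```python
def page_text(pdf_path, page_number):
    """Return the text of a page given its 1-based number (hypothetical helper)."""
    if page_number < 1:
        raise ValueError('page numbers start at 1')
    import PyPDF2  # assumption: PyPDF2 < 3.0, the API used on these cards
    with open(pdf_path, 'rb') as pdfFileObj:   # closed automatically afterwards
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        if page_number > pdfReader.numPages:
            raise ValueError('the PDF only has %d pages' % pdfReader.numPages)
        # PyPDF2 counts from 0, so page 2 is index 1.
        return pdfReader.getPage(page_number - 1).extractText()
```

Calling page_text('name.pdf', 2) would then correspond to getPage(1) above.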

Question 3

Q

What is a problem with PyPDF2?

Answer

A

1. It might be unable to read some PDF files

2. Or it might not extract their content correctly

Question 4

Q

How can you decrypt a PDF? What do you need to consider with PyPDF2?
What happens if you don't consider it?

Answer

A

>>> import PyPDF2
>>> pdfReader = PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb'))
>>> pdfReader.decrypt('password')
>>> pageObj = pdfReader.getPage(0)

NOTE:
Call decrypt() before the first getPage(). If getPage() was already called on the still-encrypted reader, re-create the reader (pdfReader = …) before creating a pageObj.
Otherwise: IndexError "list index out of range"

Question 5

Q

Which 2 functions allow you to merge PDF Files?

Answer

A

*.getPage(i) and *.addPage()

pageObj = pdfReader.getPage(pageNum)
pdfWriter.addPage(pageObj)
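
Put together, merging means looping over every page of every input file and adding it to one writer. The sketch below assumes the pre-3.0 PyPDF2 API used on these cards; the function name merge_pdfs and its parameters are hypothetical.

```python
from contextlib import ExitStack

def merge_pdfs(input_paths, output_path):
    """Append all pages of each input PDF to one output PDF (hypothetical helper)."""
    if not input_paths:
        raise ValueError('need at least one input PDF')
    import PyPDF2  # assumption: PyPDF2 < 3.0, the API used on these cards
    pdfWriter = PyPDF2.PdfFileWriter()
    with ExitStack() as stack:
        for path in input_paths:
            # Keep each input open until write(): page data is read from it lazily.
            pdfFile = stack.enter_context(open(path, 'rb'))
            pdfReader = PyPDF2.PdfFileReader(pdfFile)
            for pageNum in range(pdfReader.numPages):
                pdfWriter.addPage(pdfReader.getPage(pageNum))
        with open(output_path, 'wb') as resultPdf:
            pdfWriter.write(resultPdf)
```

The ExitStack closes all input files only after the merged file has been written.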

Question 6

Q

How can you rotate a PDF page?

Answer

A

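1. Create a page object
>>> page = pdfReader.getPage(0)
2. Rotate it (degree = 90, 180, or 270)
>>> page.rotateClockwise(degree)
3. Add the page to a writer and write a new PDF
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> pdfWriter.addPage(page)
>>> resultPdfFile = open('rotatedPage.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> resultPdfFile.close()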

Question 7

Q

How can you overlay (merge) PDF pages?

Answer

A

1.
Open both files in 'rb' mode
Read the files
Create a page object for each

2.
Fuse both page objects together
>>> pageObj1.mergePage(pageObj2)

3.
Write a new PDF file and use *.addPage(pageObj1)
_____________
# 1. (repeat these steps for the second file to get waterFirstPage)
minutesFile = open('meetingminutes.pdf', 'rb')
pdf1Reader = PyPDF2.PdfFileReader(minutesFile)
minutesFirstPage = pdf1Reader.getPage(0)

# 2.
minutesFirstPage.mergePage(waterFirstPage)

# 3.
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(minutesFirstPage)

resultPdfFile = open('watermarkedCover.pdf', 'wb')
pdfWriter.write(resultPdfFile)
minutesFile.close()
resultPdfFile.close()

Question 8

Q

How can you encrypt PDF Files?

Answer

A

Call *.encrypt('password') on the pdfWriter object before writing the file

____
>>> import PyPDF2
>>> pdfFile = open('meetingminutes.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFile)
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> for pageNum in range(pdfReader.numPages):
...     pdfWriter.addPage(pdfReader.getPage(pageNum))

>>> pdfWriter.encrypt('swordfish')
>>> resultPdf = open('encryptedminutes.pdf', 'wb')
>>> pdfWriter.write(resultPdf)
>>> resultPdf.close()

Question 9

Q

What is the structure of binary Word Documents?

Answer

A

They have 3 layers:

Document object (the whole doc)
Paragraph objects (a new one begins whenever the user presses Enter)
Run objects (each paragraph contains one or more; a run is a stretch of text with the same style)
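
The same three layers are visible in the XML inside a .docx file, where paragraphs are w:p elements and runs are w:r elements. The sketch below parses a hand-written fragment shaped like word/document.xml using only the standard library; the sample text and the outline() helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}

# Hand-written fragment in the shape of word/document.xml (illustrative only).
SAMPLE = f'''<w:document xmlns:w="{W}"><w:body>
  <w:p><w:r><w:t>Hello world!</w:t></w:r></w:p>
  <w:p>
    <w:r><w:t xml:space="preserve">A plain run, then </w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t>a bold run</w:t></w:r>
  </w:p>
</w:body></w:document>'''

def outline(xml_text):
    """Return one list of run texts per paragraph."""
    root = ET.fromstring(xml_text)          # document layer
    return [
        [''.join(t.text or '' for t in r.findall('w:t', NS))
         for r in p.findall('w:r', NS)]     # run layer
        for p in root.iter(f'{{{W}}}p')     # paragraph layer
    ]

print(outline(SAMPLE))
```

The second paragraph yields two runs because its style changes from plain to bold partway through.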

Question 10

Q

How can you read:

how many paragraphs a Word doc has
the content of the doc's first paragraph
how many runs the second paragraph holds
the content of the second paragraph's first run

Answer

A

>>> import docx
>>> doc = docx.Document('doc.docx')

how many paragraphs a Word doc has
>>> len(doc.paragraphs)
the content of the doc's first paragraph
>>> doc.paragraphs[0].text
how many runs the second paragraph holds
>>> len(doc.paragraphs[1].runs)
the content of the second paragraph's first run
>>> doc.paragraphs[1].runs[0].text

Question 11

Q

How can you get the full text from a .docx file?

Answer

A

E.g. by creating a function which can be called later:

import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for p in doc.paragraphs:
        fullText.append(p.text)
    return '\n'.join(fullText)

Save as readDocx.py, then call it:
import readDocx
print(readDocx.getText('d.docx'))

Question 12

Q

How can you style:

Paragraphs
Runs

Name 3 styles

Answer

A

Paragraphs: *.style = 'style name'
>>> doc.paragraphs[0].style = 'Normal'
Runs: *.style = 'style name' + 'Char'
>>> doc.paragraphs[1].runs[0].style = 'QuoteChar'

3 styles:
'Body Text'
'Heading 1' (up to 9)
'List Bullet'

Runs also have text attributes that take boolean values, e.g.:
*.bold = True
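
Both kinds of styling can be combined when building a document from scratch. This is a minimal sketch, assuming the python-docx package and the style names of its default template; the function name build_styled_doc and the sample text are hypothetical.

```python
def build_styled_doc(path):
    """Write a small .docx using paragraph styles and run attributes (sketch)."""
    import docx  # third-party package python-docx
    doc = docx.Document()                              # default template
    doc.add_paragraph('Chapter', style='Heading 1')    # paragraph style
    doc.add_paragraph('An item', style='List Bullet')  # paragraph style
    para = doc.add_paragraph('Plain text, then ')      # first run, unstyled
    run = para.add_run('bold and italic')              # second run
    run.bold = True                                    # boolean run attributes
    run.italic = True
    doc.save(path)
```

The last paragraph ends up with two runs because the bold and italic text differs in style from the text before it.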

Question 13

Q

How can you write several paragraphs to a Word document and later add text to one of them?

Answer

A

>>> import docx
>>> doc = docx.Document()
>>> doc.add_paragraph('Hello world!')

>>> paraObj1 = doc.add_paragraph('This is a second paragraph.')
>>> paraObj2 = doc.add_paragraph('This is yet another paragraph.')
>>> paraObj1.add_run(' This text is being added to the second paragraph.')

>>> doc.save('multipleParagraphs.docx')

Question 14

Q

How do you add a heading to a Word doc?

Answer

A

>>> doc.add_heading('Header', 0)

Heading levels 0 to 4 (0 is the Title style; python-docx accepts levels up to 9)

Question 15

Q

Add a line and a page break to text in a word doc

Answer

A

Line break:
call add_break() on a run object

Page break:
>>> doc = docx.Document()
>>> doc.add_paragraph('This is on the first page!')
>>> doc.paragraphs[0].runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)

Question 16

Q

How do you add pictures?

Answer


A

>>> doc.add_picture('zophie.png', width=docx.shared.Cm(1),
...                 height=docx.shared.Cm(4))

You can give sizes in centimeters (docx.shared.Cm) or inches (docx.shared.Inches)


(16 cards)