15 Word & PDF Files Flashcards

1
Q

Why is working with Word and PDF docs more complex?

A

Because they are binary files.

In addition to text, they store a lot of font, color, and layout information.
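
The "binary" point can be seen by looking at the first bytes of a file. This is a minimal sketch using only the standard library; the helper name `sniff_format` is made up for illustration. It relies on two facts: PDF files start with `%PDF-`, and .docx files are ZIP archives, which start with `PK`.

```python
def sniff_format(data: bytes) -> str:
    """Guess a file format from its leading bytes (illustrative helper)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP local file header; a .docx file is a ZIP archive of XML parts.
        return "zip (possibly .docx)"
    return "unknown"


# Files must be opened in 'rb' mode so Python returns raw bytes, e.g.:
# with open('name.pdf', 'rb') as f:
#     print(sniff_format(f.read(8)))
print(sniff_format(b"%PDF-1.7\n"))
```

This only identifies the container; reading the actual content still needs a library such as PyPDF2 or python-docx.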

2
Q

Given a PDF File, how do you

  1. Find out how many pages it has
  2. Get the text from Page 2?
A

PyPDF2 uses a 0-based page index, so page 2 is getPage(1).

>>> import PyPDF2
>>> pdfFileObj = open('name.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
>>> pageObj = pdfReader.getPage(1)
>>> pageObj.extractText()
>>> pdfFileObj.close()
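
The 0-based index is easy to get wrong, so it can help to hide it behind a small helper. This is a sketch, assuming the legacy PyPDF2 interface used on this card (`numPages`, `getPage()`, `extractText()`); the function name `page_text` is hypothetical.

```python
def page_text(pdf_reader, page_number):
    """Return the text of a page given its 1-based page number.

    pdf_reader is assumed to be a legacy PyPDF2.PdfFileReader, or any
    object with the same numPages / getPage() interface.
    """
    if not 1 <= page_number <= pdf_reader.numPages:
        raise ValueError(f"page_number must be between 1 and {pdf_reader.numPages}")
    # PyPDF2 counts pages from 0, so page 2 lives at index 1.
    return pdf_reader.getPage(page_number - 1).extractText()


# Usage with the objects from the card above:
# with open('name.pdf', 'rb') as pdfFileObj:
#     print(page_text(PyPDF2.PdfFileReader(pdfFileObj), 2))
```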
3
Q

What are the problems with PyPDF2?

A
  1. It might be unable to read some PDF files
  2. It might not extract the content correctly

4
Q

How can you decrypt a PDF? What do you need to consider with PyPDF2?
What happens if you don't consider it?

A

>>> import PyPDF2
>>> pdfReader = PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb'))
>>> pdfReader.decrypt('password')
>>> pageObj = pdfReader.getPage(0)

NOTE:
Call decrypt() before the first getPage(). If getPage() was already called on the reader while it was still encrypted, re-open the reader (pdfReader = …) and decrypt it again before creating a pageObj.
Otherwise: IndexError "index out of range"
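
The decrypt-first rule can be wrapped in a small guard. This is a sketch, assuming the legacy PyPDF2.PdfFileReader interface, where `isEncrypted` is an attribute and `decrypt()` returns 0 when the password does not match; the helper name `decrypt_reader` is hypothetical.

```python
def decrypt_reader(pdf_reader, password):
    """Decrypt pdf_reader in place if needed and return it.

    Assumes the legacy PyPDF2.PdfFileReader interface: an isEncrypted
    attribute and a decrypt() method that returns 0 on failure.
    """
    if pdf_reader.isEncrypted:
        # decrypt() must run before any getPage() call on this reader.
        if pdf_reader.decrypt(password) == 0:
            raise ValueError("wrong password")
    return pdf_reader


# Usage with a freshly opened reader:
# pdfReader = decrypt_reader(
#     PyPDF2.PdfFileReader(open('encrypted.pdf', 'rb')), 'password')
# pageObj = pdfReader.getPage(0)
```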

5
Q

Which 2 functions allow you to merge PDF Files?

A

*.getPage(i) and *.addPage()

pageObj = pdfReader.getPage(pageNum)
pdfWriter.addPage(pageObj)
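
Merging whole files is the two calls above repeated in a loop. This is a sketch, assuming the legacy PyPDF2 reader and writer objects used on these cards; the helper name `append_all_pages` and the file names in the usage comment are hypothetical.

```python
def append_all_pages(pdf_reader, pdf_writer):
    """Copy every page of pdf_reader into pdf_writer; return the page count."""
    for page_num in range(pdf_reader.numPages):
        pdf_writer.addPage(pdf_reader.getPage(page_num))
    return pdf_reader.numPages


# Usage (source files must stay open until the writer has written its output):
# pdfWriter = PyPDF2.PdfFileWriter()
# sources = [open(name, 'rb') for name in ('a.pdf', 'b.pdf')]
# for source in sources:
#     append_all_pages(PyPDF2.PdfFileReader(source), pdfWriter)
# with open('merged.pdf', 'wb') as resultPdfFile:
#     pdfWriter.write(resultPdfFile)
# for source in sources:
#     source.close()
```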

6
Q

How can you rotate a PDF page?

A
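  1. Create a page object
    >>> page = pdfReader.getPage(0)
  2. Rotate it; degree = 90 | 180 | 270
    >>> page.rotateClockwise(degree)
  3. Add the page to a writer and write a new PDF
    >>> pdfWriter = PyPDF2.PdfFileWriter()
    >>> pdfWriter.addPage(page)
    >>> resultPdfFile = open('rotatedPage.pdf', 'wb')
    >>> pdfWriter.write(resultPdfFile)
    >>> resultPdfFile.close()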
7
Q

How can you overlay (merge) PDF pages?

A
  1. Open both files in 'rb' mode, read them, and create a page object from each
  2. Fuse both page objects together
    >>> pageObj1.mergePage(pageObj2)
  3. Write a new PDF file and use *.addPage(pageObj1)
_____________
# 1. Open, read, get the page (repeat these steps for the second file to get waterFirstPage)
minutesFile = open('meetingminutes.pdf', 'rb')
pdf1Reader = PyPDF2.PdfFileReader(minutesFile)
minutesFirstPage = pdf1Reader.getPage(0)

# 2. Overlay the second page onto the first
minutesFirstPage.mergePage(waterFirstPage)
# 3. Write the merged page to a new file
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(minutesFirstPage)

resultPdfFile = open('watermarkedCover.pdf', 'wb')
pdfWriter.write(resultPdfFile)
minutesFile.close()
resultPdfFile.close()

8
Q

How can you encrypt PDF Files?

A

Call *.encrypt('password') on the pdfWriter before writing the file.

____
>>> import PyPDF2
>>> pdfFile = open('meetingminutes.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFile)
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> for pageNum in range(pdfReader.numPages):
...     pdfWriter.addPage(pdfReader.getPage(pageNum))
...
>>> pdfWriter.encrypt('swordfish')
>>> resultPdf = open('encryptedminutes.pdf', 'wb')
>>> pdfWriter.write(resultPdf)
>>> resultPdf.close()

9
Q

What is the structure of binary Word Documents?

A

They have 3 layers:

  1. Document object (the whole doc)
  2. Paragraph objects (a new paragraph begins whenever the user presses Enter)
  3. Run objects (each paragraph contains one or more; a run is a stretch of text with the same style)
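
The three layers map onto the XML stored inside a .docx file: the document part holds `w:p` (paragraph) elements, and each of those holds `w:r` (run) elements. This is a sketch using only the standard library; the sample XML is a hand-written stand-in for a real `word/document.xml`, and the function name `runs_per_paragraph` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML, the XML vocabulary used inside .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample: two paragraphs, the first split into three runs
# because the word "bold" carries different formatting.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t>A plain paragraph with </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
      <w:r><w:t> text.</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
""".strip()


def runs_per_paragraph(xml_text):
    """Return a list holding the number of runs in each paragraph."""
    root = ET.fromstring(xml_text)
    return [len(p.findall(f"{{{W}}}r")) for p in root.iter(f"{{{W}}}p")]


print(runs_per_paragraph(SAMPLE))  # prints [3, 1]
```

A new run starts wherever the formatting changes, which is why the first paragraph has three runs even though it is a single sentence.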
10
Q

How can you read:

  1. how many paragraphs a Word doc has
  2. the content of the doc's first paragraph
  3. how many runs the second paragraph holds
  4. the content of the second paragraph's first run
A

>>> import docx
>>> doc = docx.Document('doc.docx')

  1. how many paragraphs a Word doc has
    >>> len(doc.paragraphs)
  2. the content of the doc's first paragraph
    >>> doc.paragraphs[0].text
  3. how many runs the second paragraph holds
    >>> len(doc.paragraphs[1].runs)
  4. the content of the second paragraph's first run
    >>> doc.paragraphs[1].runs[0].text
11
Q

How can you get the full text from a .docx file?

A

E.g. by creating a function which can be called later:

import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for p in doc.paragraphs:
        fullText.append(p.text)
    return '\n'.join(fullText)

Save as readDocx.py, then call:
import readDocx
print(readDocx.getText('d.docx'))

12
Q

How can you style:

  1. Paragraphs
  2. Runs

Name 3 style names

A
  1. Paragraphs: *.style = 'StyleName'
    >>> doc.paragraphs[0].style = 'Normal'
  2. Runs: *.style = 'StyleNameChar' (the style name with 'Char' appended)
    >>> doc.paragraphs[1].runs[0].style = 'QuoteChar'

3 style names:
'Body Text'
'Heading 1' (up to 9)
'List Bullet'

Runs also have text attributes that take boolean values, e.g.:
*.bold = True
13
Q

How can you write paragraphs to a Word document and later add text to one of them?

A

>>> import docx
>>> doc = docx.Document()
>>> doc.add_paragraph('Hello world!')

>>> paraObj1 = doc.add_paragraph('This is a second paragraph.')
>>> paraObj2 = doc.add_paragraph('This is a yet another paragraph.')
>>> paraObj1.add_run(' This text is being added to the second paragraph.')

>>> doc.save('multipleParagraphs.docx')

14
Q

How do you add a heading to a Word doc?

A

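>>> doc.add_heading('Header', 0)

Heading levels 0 to 4: 0 is the Title style, 1 to 4 are progressively lower-level headings (python-docx accepts levels up to 9).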

15
Q

How do you add a line break and a page break to text in a Word doc?

A
Line break:
call add_break() with no argument on a Run object

Page break:
>>> import docx
>>> doc = docx.Document()
>>> doc.add_paragraph('This is on the first page!')
>>> doc.paragraphs[0].runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)

16
Q

How do you add pictures?

A

>>> doc.add_picture('zophie.png', width=docx.shared.Cm(1),
...                 height=docx.shared.Cm(4))

Sizes can be given in centimeters (docx.shared.Cm) or inches (docx.shared.Inches).