Image Processing and OCR Flashcards
(25 cards)
What does OCR stand for?
Optical Character Recognition
What does OCR convert?
Image-based content into machine-readable text
What are the three (3) OCR engines that come with all Grooper installs?
Tesseract OCR
Transym 4 OCR
Transym 5 OCR
Matrix matching and feature recognition are part of what phase of an OCR engine’s operation?
Character Recognition
Breaking up pixels into lines, words, and characters is part of what phase of an OCR engine’s operation?
Segmenting
Many OCR engines spell check OCR results to improve
their accuracy. This is part of what phase of an OCR
engine’s operation?
Post-Processing
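As a rough illustration of spell-check style post-processing, the sketch below snaps each OCR'd word to the closest entry in a small word list. The lexicon, the 0.7 cutoff, and the `correct_word` helper are assumptions made for this example; this is not how any particular OCR engine implements the phase.

```python
import difflib

# Hypothetical lexicon; a real engine would use a full dictionary.
LEXICON = ["invoice", "number", "total", "amount", "date"]


def correct_word(word, lexicon=LEXICON, cutoff=0.7):
    """Replace a word with its closest lexicon entry, if one is close enough."""
    lowered = word.lower()
    if lowered in lexicon:
        return lowered
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    # Leave the word untouched when nothing in the lexicon is similar enough.
    return matches[0] if matches else word


raw = "lnvoice nurnber 10042"
cleaned = " ".join(correct_word(w) for w in raw.split())
print(cleaned)
```

Words with no close lexicon entry, such as the number `10042`, pass through unchanged.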
In your own words, describe the Segmenting phase of an
OCR engine.
This is when the pixels are broken up into lines, individual words, and
characters.
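One simple way to picture segmenting is a projection profile: sum the ink in each row to find text lines, then sum the ink in each column of a line to find character gaps. The sketch below assumes a tiny binary image where 1 means ink; the function names are hypothetical, and real engines use more robust methods.

```python
def split_runs(profile):
    """Return (start, end) index pairs for each run of non-zero values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins here
        elif not value and start is not None:
            runs.append((start, i))        # the run ended just before index i
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a binary image (1 = ink) into text lines using row sums."""
    return split_runs([sum(row) for row in image])


def segment_columns(image, top, bottom):
    """Split one text line into ink runs using column sums."""
    width = len(image[0])
    profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return split_runs(profile)


page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
lines = segment_lines(page)
first_line_chars = segment_columns(page, *lines[0])
print(lines, first_line_chars)
```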
OCR engines that obtain results by comparing a grid of
pixels on an image to a grid of pixels of examples of
characters are performing….
Matrix Matching
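A minimal sketch of matrix matching, assuming 3x3 glyphs and a made-up set of three templates: each candidate glyph is compared pixel by pixel against every stored example, and the template with the highest agreement wins.

```python
# Hypothetical 3x3 character templates ("1" = ink, "0" = background).
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}


def match_score(glyph, template):
    """Fraction of pixels that agree between two same-sized grids."""
    total = agree = 0
    for g_row, t_row in zip(glyph, template):
        for g, t in zip(g_row, t_row):
            total += 1
            agree += (g == t)
    return agree / total


def recognize(glyph):
    """Return the best-matching template label and its score."""
    scored = ((label, match_score(glyph, tpl)) for label, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


noisy_l = ["100",
           "100",
           "101"]   # an "L" with one pixel missing
label, score = recognize(noisy_l)
print(label, round(score, 3))
```

The noisy glyph still resolves to "L" because 8 of its 9 pixels agree with that template.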
The Grooper activity that performs OCR is….
Recognize
What image processing operation is required for an OCR
engine to obtain results, either through Grooper’s image
processing suite or via the OCR engine itself?
Thresholding (or binarizing) the image
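For reference, thresholding in its simplest form maps every grayscale pixel to black or white by comparing it to a cutoff. The sketch below assumes 0-255 grayscale values and a fixed cutoff of 128; production tools often pick the cutoff adaptively.

```python
def binarize(gray, threshold=128):
    """Map each 0-255 pixel to 1 (ink) if darker than the threshold, else 0."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


gray = [
    [250, 240,  30, 245],
    [248,  20,  25, 251],
    [130, 127, 200,  10],
]
binary = binarize(gray)
for row in binary:
    print(row)
```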
Image processing in Grooper serves one (1) of three (3)
basic purposes. What are they?
Archival Adjustments, OCR Cleanup, and Layout Data Collection. (Archival
Adjustments pertain ONLY to permanent image processing via the Image
Processing activity.)
The Grooper activity that performs permanent image
processing is….
Image Processing
In your own words, what is the benefit of performing
temporary image processing? How do you perform
temporary image processing in Grooper?
Temporary image processing does not make permanent changes to the
document itself. You assign a temporary IP Profile on the OCR Profile,
which is then executed by the Recognize activity.
List three (3) common IP Commands used during permanent image processing.
Auto Deskew, Auto Border Crop, Rotate
List three (3) common IP Commands used during temporary image processing.
Line Removal, Speck Removal, Negative Region Removal
Grooper’s set of properties that pre-process and reprocess
the OCR engine’s results are called…
Synthesis
There are five (5) operations that comprise this
synthesis functionality. List them.
Font Pitch Detection, Bound Region Processing, Iterative Processing, Cell
Validation, Segment Reprocessing
Where are the synthesis properties enabled and configured?
On an OCR Profile
In your own words, what is “fuzzy regular expression”?
How does “fuzzy regular expression” improve Grooper’s
ability to extract data from poorly OCR’d pages?
It lets an expression match text that is close to, but not exactly, what you
are trying to find, by setting a percent match for how similar the text must
be. This helps eliminate errors when extracting data from poorly OCR'd pages.
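The percent-match idea can be sketched with plain edit distance. This is a simplification that compares a literal string against each word rather than a full regular expression, and the helper names and 0.8 threshold are assumptions for illustration, not Grooper's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def similarity(expected, found):
    """Similarity from 0.0 to 1.0, where 1.0 is an exact match."""
    longest = max(len(expected), len(found)) or 1
    return 1 - edit_distance(expected, found) / longest


def fuzzy_find(expected, text, min_similarity=0.8):
    """Return the words in text that are similar enough to expected."""
    return [w for w in text.split() if similarity(expected, w) >= min_similarity]


ocr_text = "lnvoice Nurnber 10042"
hits = fuzzy_find("Invoice", ocr_text)
print(hits)
```

Here the misread "lnvoice" differs from "Invoice" by one character out of seven, so it still clears an 80% threshold.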
How do you alter the normal cost to swap characters
when using fuzzy regular expression?
Fuzzy Match Weightings
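To show what altering the swap cost means, the sketch below extends edit distance with a per-pair substitution cost, so look-alike characters are cheaper to swap than unrelated ones. The cost table and function names are hypothetical and only illustrate the concept of weighting.

```python
# Hypothetical weightings: look-alike pairs cost less than the default 1.0.
SWAP_COSTS = {("O", "0"): 0.2, ("l", "1"): 0.2, ("S", "5"): 0.3}


def swap_cost(a, b):
    """Cost of substituting a for b; identical characters are free."""
    if a == b:
        return 0.0
    return SWAP_COSTS.get((a, b), SWAP_COSTS.get((b, a), 1.0))


def weighted_distance(a, b):
    """Edit distance where substitutions use swap_cost instead of a flat 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                     # delete ca
                curr[j - 1] + 1.0,                 # insert cb
                prev[j - 1] + swap_cost(ca, cb),   # weighted substitution
            ))
        prev = curr
    return prev[-1]


lookalike = weighted_distance("1005", "lOO5")   # three cheap look-alike swaps
unrelated = weighted_distance("1005", "7885")   # three full-cost swaps
print(lookalike, unrelated)
```

With these weights, an OCR result full of look-alike misreads stays much closer to the target than one with unrelated characters.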
How do you force a portion of a fuzzy regular expression
to match normally (or non-fuzzily)?
Required Mode
Non-text information obtained via permanent or
temporary image processing such as line locations,
checkbox locations and states, barcode values, and
detection of trained shapes is referred to as…
Layout Data
Once non-text information obtained via permanent or
temporary image processing (such as line locations,
checkbox locations and states, barcode values, and
detection of trained shapes) is collected, where is it stored in Grooper?
In the LayoutData.json file. It is stored on each page object on which a Layout Data IP Command locates non-text data.
What “tab marking” property will insert a tab character
(“\t”) between the highlighted values in the table below,
without adjusting the width of a tabbed space?
Detect Lines