Settings->Text Extraction
The options on this tab determine how text extracted from
printable documents is saved.
If a source document contains textual elements, Zan Image Printer can extract and save the text
to a separate text file in addition to creating the image files.
Text extraction is not optical character recognition (OCR) based;
only plain text in the source document is extracted. In other words, Zan Image Printer
cannot extract text from scanned documents.
Zan Image Printer is able to extract text from most documents, but the text extracted from some
documents (for example, PDF) may be just a string of question marks (?) or even blank lines
instead of the actual text in the document. This can be caused by the printing application
rasterizing the text within the document into a bitmap prior to sending it to the printer.
To reliably extract text from PDF documents, you can use pdftotext from the xpdf package.
pdftotext is a command line program that can convert entire PDF documents or individual pages
to plain text while maintaining layout. The extracted text can be converted to a wide choice of
standard encodings including UTF-8 and ASCII.
To install pdftotext:
1. Make a folder on your computer, for example c:\xpdf
2. Go to http://www.foolabs.com/xpdf/download.html and
scroll down to "Precompiled binaries". In the paragraph starting with "x86, DOS/Win32", click
the Win32 download link, download the file to c:\xpdf, then unzip all the files.
Examples (read the accompanying pdftotext.txt file for more usage information):
pdftotext filename.pdf
Produces a text file (filename.txt) with the same base name as the input file.
pdftotext -layout filename.pdf layout.txt
Produces the text file layout.txt and maintains the original physical layout.
pdftotext -enc UTF-8 filename.pdf
Produces a text file in UTF-8 encoding.
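If pdftotext is called from a script rather than typed by hand, the command lines above could be assembled roughly as follows. This is only a sketch: the helper names build_pdftotext_command and extract_text are hypothetical, and it assumes pdftotext is on the PATH (for example, because c:\xpdf was added to it). Only the -layout and -enc options mentioned above are used.

```python
import subprocess


def build_pdftotext_command(pdf_path, text_path=None, layout=False, encoding=None):
    """Return the argument list for a pdftotext call (options first, then files)."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")       # maintain the original physical layout
    if encoding:
        command += ["-enc", encoding]   # for example "UTF-8"
    command.append(pdf_path)
    if text_path:
        command.append(text_path)       # if omitted, pdftotext writes <name>.txt
    return command


def extract_text(pdf_path, **options):
    """Run pdftotext; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_pdftotext_command(pdf_path, **options), check=True)


# Show the command that would be run, without running it.
print(" ".join(build_pdftotext_command("filename.pdf", "layout.txt", layout=True)))
```

Passing the arguments as a list avoids quoting problems when the file name contains spaces.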
Extract text to file
Enables or disables the text extraction feature.
Suppress the creation of image files
If checked, Zan Image Printer will operate in text extraction only mode and no image
files will be generated.
Tip: In text extraction only mode, we recommend that you use a low DPI
such as 100 x 100 and print in Black & White mode.
These settings can greatly speed up the printing process without affecting the quality of the extracted text.
File Encoding
This option specifies the encoding of the generated text file.
ASCII is a 7-bit encoding and supports only the first 128 Unicode characters.
All characters outside that range will be displayed as unknown symbols.
Unicode is a worldwide character-encoding standard published by the Unicode Consortium
and provides a unique number for every character.
UTF is the acronym for Unicode Transformation Format. UTF uses bit-shifting techniques
to encode Unicode characters. UTF-8 encodes each Unicode character as a variable
number of bytes (one to four). UTF-16 encodes each Unicode character as two bytes,
or as four bytes for characters outside the Basic Multilingual Plane.
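To see how many bytes a character actually occupies in each encoding, a short Python check such as the following may help. It is an illustration only and not part of Zan Image Printer; the helper name encoded_lengths and the sample characters are arbitrary choices.

```python
def encoded_lengths(character):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(character.encode("utf-8")), len(character.encode("utf-16-le"))


# A Latin letter, an accented letter, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for character in samples:
    utf8_len, utf16_len = encoded_lengths(character)
    print(f"U+{ord(character):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")

# ASCII cannot represent characters beyond the first 128 code points.
# With errors="replace", each such character becomes a question mark.
print("caf\u00e9".encode("ascii", errors="replace"))
```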
Big-endian and little-endian are terms that describe the order in which a sequence
of bytes is stored in memory. Big-endian is an order in
which the "big end" (most significant value in the sequence) is stored first.
Little-endian is an order in which the "little end" (least significant value in
the sequence) is stored first.
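The difference between the two byte orders can be made visible by encoding the same character both ways. The following Python sketch is illustrative only; utf16_hex is a hypothetical helper name.

```python
def utf16_hex(text, byte_order):
    """Return the UTF-16 bytes of text as a hex string; byte_order is "le" or "be"."""
    return text.encode("utf-16-" + byte_order).hex()


# "A" is U+0041: the significant byte is 0x41, the other byte is 0x00.
print(utf16_hex("A", "le"))  # little-endian: least significant byte first
print(utf16_hex("A", "be"))  # big-endian: most significant byte first
```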
Valid file encoding options:
ANSI
Unicode
Also known as UTF-16LE, Unicode little endian, or UTF-16.
Unicode big endian
Also known as UTF-16BE.
UTF-8
Insert Line Breaks
This option controls the line ending convention of the generated text file.
A line break in computer text is represented by one or more invisible characters. On Unix systems, the break is a single line feed, "\n". Windows uses two characters together: a carriage return and a line feed, "\r\n".
Windows CR LF
Use "\r\n" as the line ending.
UNIX LF
Use "\n" as the line ending.
None
No line ending.
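The effect of the three choices can be sketched in Python as follows. This is not the program's own code. It assumes that "None" simply joins the lines with nothing between them, and the names LINE_ENDINGS and join_lines are hypothetical.

```python
# Map each option name from this tab to the characters it writes after a line.
LINE_ENDINGS = {"Windows CR LF": "\r\n", "UNIX LF": "\n", "None": ""}


def join_lines(lines, option):
    """Join lines using the line ending selected by option."""
    ending = LINE_ENDINGS[option]
    return "".join(line + ending for line in lines)


lines = ["first line", "second line"]
for option in LINE_ENDINGS:
    # repr() makes the otherwise invisible characters visible.
    print(option, repr(join_lines(lines, option)))
```

When writing such text to a file from Python, open the file with newline="" so that the line endings are not translated again.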
Format
This option specifies how to format the extracted text.
Formatted text
With this option, Zan Image Printer generates a text file
that closely matches the original document, retaining
the document layout as closely as possible. However, the extracted text may
not have exactly the same formatting as the source document.
Plain text
Save as plain text (without formatting).
Append text to existing file
If this option is unchecked, Zan Image Printer overwrites the text file if it already exists.
If checked, the new text is appended to the existing file.
Formfeed after every page
Select this option to insert a form feed character (ASCII 0x0C) after every page.
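A program that later reads the text file can use the form feed characters to recover the individual pages. A minimal Python sketch, assuming one form feed follows every page as described above (split_pages is a hypothetical helper name):

```python
FORM_FEED = "\x0c"  # ASCII 0x0C


def split_pages(text):
    """Split extracted text into pages at each form feed character."""
    pages = text.split(FORM_FEED)
    # A form feed after the last page leaves an empty trailing entry; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


print(split_pages("page one" + FORM_FEED + "page two" + FORM_FEED))
```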
Prefix Byte Order Mark (BOM) to file
Specifies whether to insert the BOM (Byte Order Mark) at the beginning of the generated text file.
The BOM is an encoding signature for the file: a particular sequence of bytes at the beginning of
the file that indicates the encoding and byte order, so that an application can use the BOM to
determine the text file's encoding. Without a BOM, the application has to analyze the text and guess
which encoding the file actually uses.
UTF-8 - EF BB BF
Unicode little endian - FF FE
Unicode big endian - FE FF
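An application that reads the generated file could check for these signatures along the following lines. This Python sketch handles only the three BOMs listed above; detect_bom is a hypothetical helper name, and the constants come from Python's standard codecs module.

```python
import codecs

# The three signatures listed above, paired with Python codec names.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, bom_length = detect_bom(sample)
print(encoding, sample[bom_length:].decode(encoding))
```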
Browse
Enter the text file name into the edit box or click the Browse button to
select a text file. You can also manually enter save macro commands
into the edit box to form the text file name. Enter a fixed name if you want to save all text
from the whole document to a single file; use the page number macro if you want to save each page
as a separate file.
If the edit box is left blank, the text file will be created in the same folder
and have the same name (except for the extension) as the image file.
Help
Loads the help file and displays the Settings->Text Extraction topic.