Settings->Text Extraction



The options on this tab determine how the text extracted from printed documents is saved.
If a source document contains textual elements, Zan Image Printer can extract and save the text to a separate text file in addition to creating the image files.

Text extraction is not optical character recognition (OCR) based; only plain text in the source document is extracted. In other words, Zan Image Printer cannot extract text from scanned documents.

Zan Image Printer is able to extract text from most documents, but the text extracted from some documents (for example, PDF) may be just a string of question marks (?) or even blank lines instead of the actual text in the document. This can be caused by the printing application rasterizing the text within the document into a bitmap prior to sending it to the printer.

To reliably extract text from PDF documents, you can use pdftotext from the xpdf package. pdftotext is a command line program that can convert entire PDF documents or individual pages to plain text while maintaining layout. The extracted text can be converted to a wide choice of standard encodings including UTF-8 and ASCII.
To install pdftotext:
1. Make a folder on your computer, for example c:\xpdf
2. Go to http://www.foolabs.com/xpdf/download.html and scroll down to "Precompiled binaries". In the paragraph starting with "x86, DOS/Win32", click the Win32 download link, save the file to c:\xpdf, and then unzip all the files there.

Examples (read the accompanying pdftotext.txt file for more usage information):
pdftotext filename.pdf
Produces a text file (filename.txt) with the same base name as the input file.
pdftotext -layout filename.pdf layout.txt
Produces the text file layout.txt and maintains the original physical layout.
pdftotext -enc UTF-8 filename.pdf
Produces a text file in UTF-8 encoding.
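
The command lines above can also be assembled from a script, for example to convert many PDF files in one run. The following is a minimal Python sketch and is not part of Zan Image Printer. The function names and parameters are hypothetical. It assumes that pdftotext can be found on the PATH, or that you pass its full path as the program argument. It uses only the -layout and -enc options described above and places the options before the file names.

```python
import subprocess

def build_pdftotext_command(pdf_path, text_path=None, layout=False,
                            encoding=None, program="pdftotext"):
    """Return the argument list for one pdftotext call (hypothetical helper)."""
    command = [program]
    if layout:
        command.append("-layout")            # maintain the original physical layout
    if encoding:
        command.extend(["-enc", encoding])   # for example "UTF-8"
    command.append(pdf_path)
    if text_path:
        command.append(text_path)            # if omitted, pdftotext writes <name>.txt
    return command

def extract_text(pdf_path, **options):
    """Run pdftotext and raise CalledProcessError if it reports a failure."""
    subprocess.run(build_pdftotext_command(pdf_path, **options), check=True)
```

For example, extract_text("filename.pdf", text_path="layout.txt", layout=True) runs the second command line shown above.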

Extract text to file
Enables or disables the text extraction feature.

Suppress the creation of image files
If checked, Zan Image Printer will operate in text extraction only mode and no image files will be generated.

Tip: In text extraction only mode, we recommend that you use a low DPI such as 100 x 100 and print in Black & White mode. These settings can speed up the printing process considerably without affecting the quality of the extracted text.

File Encoding
This option specifies the encoding of the generated text file.

ASCII is a 7-bit encoding that covers only the first 128 Unicode characters. Characters outside that range cannot be represented and will be displayed as unknown symbols.

Unicode is a worldwide character-encoding standard published by the Unicode Consortium and provides a unique number for every character.

UTF stands for Unicode Transformation Format. Each UTF defines how Unicode characters are encoded as a sequence of bytes. UTF-8 encodes each character as one to four bytes (one byte for ASCII characters). UTF-16 encodes most characters as two bytes and the remaining characters, those outside the Basic Multilingual Plane, as four bytes (a surrogate pair).

Big-endian and little-endian are terms that describe the order in which the bytes of a multi-byte value are stored. Big-endian is an order in which the "big end" (the most significant byte) is stored first. Little-endian is an order in which the "little end" (the least significant byte) is stored first.
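
The byte counts and byte orders described above can be checked with a short script. The following Python sketch is only an illustration and is not part of Zan Image Printer. The sample characters and the helper name encoded_sizes are chosen for this example. It reports how many bytes each character needs in UTF-8 and UTF-16, and shows the same character stored in little-endian and big-endian order.

```python
# Sample characters: U+0041 (A), U+00E9, U+20AC (euro sign), U+1F600 (an emoji).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_sizes(character):
    """Return the number of bytes one character needs in UTF-8 and in UTF-16."""
    return (len(character.encode("utf-8")),
            len(character.encode("utf-16-le")))

for character in SAMPLES:
    utf8_size, utf16_size = encoded_sizes(character)
    print(f"U+{ord(character):04X}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")

# The euro sign (U+20AC) in both byte orders.
little = "\u20ac".encode("utf-16-le")   # least significant byte first
big = "\u20ac".encode("utf-16-be")      # most significant byte first
print(little.hex(), big.hex())
```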

Valid file encoding options:
ANSI

Unicode
Also called UTF-16LE or Unicode little endian; often labeled simply UTF-16.

Unicode big endian
Also called UTF-16BE.

UTF-8


Insert Line Breaks
This option controls the line ending convention of the generated text file.
A line break in a text file is represented by one or two invisible control characters. Unix systems use a single line feed ("\n"). Windows uses two characters together: a carriage return followed by a line feed ("\r\n").
Windows CR LF
Use "\r\n" as the line ending.
UNIX LF
Use "\n" as the line ending.
None
No line ending.
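
If a text file was saved with one convention and another is needed later, the line endings can be converted afterwards. The following Python sketch is only an illustration under assumptions. The function name and the style keys are hypothetical, and the "none" style is interpreted here as removing the line breaks entirely, which may differ from the exact output of Zan Image Printer.

```python
# Hypothetical style names mapped to the characters written for each line break.
LINE_ENDINGS = {"windows": "\r\n", "unix": "\n", "none": ""}

def convert_line_endings(text, style):
    """Rewrite every line break in text using the chosen convention."""
    ending = LINE_ENDINGS[style]
    # First normalize all existing breaks to "\n", then substitute the target.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", ending)
```

When reading or writing such files in Python, open them with newline="" so that Python does not translate the line endings on its own.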

Format
This option specifies how to format the extracted text.
Formatted text
With this option, Zan Image Printer generates a text file that retains the layout of the original document as closely as possible. The result may still not match the formatting of the source document exactly.
Plain text
Save as plain text (without formatting).

Append text to existing file
If this option is unchecked, Zan Image Printer overwrites the text file if it already exists.
If checked, the new text is appended to the existing file.


Formfeed after every page
Select this option to insert a form feed character (ASCII 0x0C) after every page.
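
A form feed after every page makes it easy for other programs to split the text file back into pages. The following Python sketch shows one way to do this. The function name split_pages is hypothetical, and the sketch assumes the file content has already been read into a string.

```python
FORM_FEED = "\x0c"   # ASCII 0x0C

def split_pages(text):
    """Split extracted text into a list of pages at each form feed character."""
    pages = text.split(FORM_FEED)
    # A form feed after the last page leaves an empty trailing item; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```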

Prefix Byte-Order Mark (BOM) to file
Specifies whether to write the BOM (byte order mark) at the beginning of the generated text file.
The BOM is an encoding signature: a particular sequence of bytes at the beginning of the file that indicates the encoding and byte order, so that an application can use it to determine the text file's encoding. Without a BOM, the application has to analyze the text and guess which encoding the file uses. The BOM for each encoding is:
UTF-8 - EF BB BF
Unicode little endian - FF FE
Unicode big endian - FE FF
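
An application that reads the generated text file can use these byte sequences to pick the right decoder. The following Python sketch is only an illustration. The function names are hypothetical, it checks only the three signatures listed above, and the fallback encoding used when no BOM is present is an assumption.

```python
import codecs

# The three signatures listed above, paired with Python codec names.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def detect_bom(data):
    """Return (codec name, BOM length) for the bytes, or (None, 0) if no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data, fallback="utf-8"):
    """Decode bytes using the BOM if present, otherwise the fallback codec."""
    name, length = detect_bom(data)
    return data[length:].decode(name or fallback)
```

To use it, read the file in binary mode, for example open(path, "rb").read(), and pass the bytes to decode_with_bom.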

Browse
Enter the text file name in the edit box, or click the Browse button to select a text file. You can also enter save macro commands in the edit box to build the text file name. Enter a fixed name to save the text of the whole document to a single file, or use the page number macro to save each page to a separate file.
If the edit box is left blank, the text file will be created in the same folder and have the same name (except for the extension) as the image file.


Help
Loads the help file, and displays the Settings->Text Extraction topic.