692772 – After generation of PDF/A text no more searchable

Status:	NOTIFIED WONTFIX

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	9.04
Hardware:	All All

Importance:	P4 normal
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2012-01-05 10:02 UTC by artifex
Modified:	2012-04-16 19:17 UTC
CC List:	0 users

See Also:
Customer:	870
Word Size:	---

Attachments
DIN.pdf (119.54 KB, application/pdf) 2012-01-05 10:02 UTC, artifex

Description artifex 2012-01-05 10:02:04 UTC

Created attachment 8249
DIN.pdf

When the attached PDF file DIN.pdf is converted to PDF/A, the text is no longer searchable.

GS-call:

gswin32c.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -o pdfa.pdf -dUseCIEColor -sProcessColorModel=DeviceCMYK -dPDFA PDFA_def.ps DIN.pdf

Comment 1 Ken Sharp 2012-01-05 10:32:51 UTC

The original file includes symbolic TrueType fonts with /Encoding and /Differences arrays; this is contrary to the recommendations of the PDF specification.

There is no other information on the glyphs in the file, no ToUnicode CMap, and the fonts are encoded in a non-standard fashion.

When we create a PDF/A output file we may *NOT* include an Encoding with a symbolic TrueType font, as the specification is quite specific that this is disallowed (see section 6.3.7 of the specification), and various PDF/A validators *will* reject such a file as invalid. In fact, the file you have sent claims to be a PDF/A file but fails validation with Acrobat's preflight tool for this reason (amongst others).
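
The rule described above can be sketched as a small check. This is only an illustration, not Ghostscript's implementation: the function name and its arguments are hypothetical, and it assumes the caller has already extracted the font's /Subtype, the /Flags value from its font descriptor, and whether the font dictionary has an /Encoding entry. The one fact it relies on is that the Symbolic flag is bit 3 of /Flags (value 4) and Nonsymbolic is bit 6 (value 32).

```python
# Font descriptor /Flags bits as defined in the PDF reference.
SYMBOLIC = 1 << 2      # bit 3, value 4
NONSYMBOLIC = 1 << 5   # bit 6, value 32


def violates_symbolic_truetype_rule(subtype: str, flags: int, has_encoding: bool) -> bool:
    """Return True if a font would break the PDF/A restriction discussed above.

    Hypothetical helper: the inputs are assumed to have been read from the
    font dictionary and its font descriptor by some other tool.
    """
    is_truetype = subtype == "TrueType"
    is_symbolic = bool(flags & SYMBOLIC)
    # A symbolic TrueType font must not carry an /Encoding entry.
    return is_truetype and is_symbolic and has_encoding


if __name__ == "__main__":
    # A symbolic TrueType font that still has an /Encoding entry.
    print(violates_symbolic_truetype_rule("TrueType", 4, True))
    # The same font marked nonsymbolic is not covered by this rule.
    print(violates_symbolic_truetype_rule("TrueType", 32, True))
```

A font that trips this check is the kind a PDF/A validator is expected to flag, which is why the /Encoding is dropped on output.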

It's true that in the past we did permit this, but precisely because it causes problems we no longer do so.

In the absence of proper glyph information there is no way we can embed a ToUnicode CMap, and since the fonts are encoded in a non-standard way, there is no information for Acrobat to use in order to perform searches.
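
For anyone wanting a quick look at whether a given file carries any of this information, a rough heuristic is to count the relevant name tokens in the raw bytes. This is a sketch under clear assumptions: the function name is hypothetical, it only sees uncompressed dictionaries, and a non-zero /ObjStm count means font dictionaries may be hidden inside compressed object streams, so the other counts cannot be trusted. It does not parse the PDF or validate anything.

```python
import re


def scan_pdf_markers(data: bytes) -> dict:
    """Count selected PDF name tokens in raw (uncompressed) file bytes.

    Heuristic only: names inside compressed streams are not seen.
    """
    names = (b"/ToUnicode", b"/Differences", b"/ObjStm")
    counts = {}
    for name in names:
        # Require that the name is not a prefix of a longer name.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    # Synthetic font dictionary resembling the case described above:
    # an /Encoding with /Differences and no /ToUnicode entry.
    sample = (
        b"1 0 obj << /Type /Font /Subtype /TrueType "
        b"/Encoding << /Differences [32 /space] >> >> endobj"
    )
    print(scan_pdf_markers(sample))
```

To try it on a real file, read the bytes first, for example `scan_pdf_markers(open("pdfa.pdf", "rb").read())`. A zero /ToUnicode count with no object streams is consistent with the situation described above, where a viewer has nothing to map character codes to Unicode for searching.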