Bug 690768 - Invalid PDF/A with Umlaut, parentheses, special characters or date in PDFA_def.ps document properties
Status: RESOLVED WORKSFORME
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 8.71
Hardware: PC Windows XP
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2009-09-17 07:23 UTC by T. Fischer
Modified: 2010-09-21 15:56 UTC

Attachments
PDFA_def.ps (with Umlaut in Title, Subject, Keywords) (2.97 KB, application/postscript)
2009-10-06 06:27 UTC, T. Fischer
aPDFAtest.zip (4.20 MB, application/binary)
2010-05-19 07:12 UTC, T. Fischer
aPDF2test.pdfa_report2.pdf (1.60 MB, application/x-pdf)
2010-05-19 07:38 UTC, T. Fischer

Description T. Fischer 2009-09-17 07:23:03 UTC
Acrobat 9.1.3 Pro Preflight reports the XMP metadata as inconsistent with the
document info (document properties) in two cases:
1.) Umlauts, parentheses or other special characters (e.g. brackets) in the
pdfmark entries (Title, Subject, ...)
2.) ModDate and CreationDate set via pdfmark

see also:
[gs-devel] PDF/A xmp/document properties escaping
http://www.ghostscript.com/pipermail/gs-devel/2009-May/008375.html
When the Title or Subject contains parentheses '(' or ')' they get escaped as
'\(' and '\)' respectively.
They display correctly unescaped in the document properties/Additional
Metadata/Extended tree, but when executing preflight validation of the document,
it complains about inconsistent document properties and XMP data in the Title
and Subject fields.


PDFA_def.ps template:
http://www.ghostscript.com/pipermail/gs-devel/2009-January/008083.html
Extract:
% Define entries to the document Info dictionary :
/ICCProfile (ISOcoated_v2_300_eci.icc)   % Customize.
def
[ /Title (Test Title bzw. Titel ©(äöüßÄÖÜ))
% The title as shown in Preflight-Metadata: "Test Title bzw. Titel \251\(\344\366\374\337\304\326\334\)"
% BTW: The title shows up alright in the Acrobat-Dialog:
% File->Properties...->Additional Metadata...->Advanced->XMP Core-Properties
% Datei->Eigenschaften...->Zusätzliche Metadaten...->Erweitert->XMP Core-Eigenschaften
  /Subject (Test Subject bzw. Thema bzw. IPTC Inhalt Beschreibung)
  /Author (Test Author bzw. Verfasser Mr. X (c), © Copyright Symbol)
%Verfasser(optional)
  /Producer (Test Producer bzw. erzeugt mit demo-software)
  /Keywords (Test Keywords bzw. Stichwörter, comma, separated)
  /Creator (Test Creator bzw. erstellt mit (©) GPL Ghostscript 8.70 PDF Writer)
% 4 errors with Acrobat 9.1.3 Pro Preflight (German Acrobat version, messages translated):
% Creator differs between document properties and XMP metadata
% Keywords inconsistent between document info and XMP metadata
% Author inconsistent between document properties and XMP metadata
% Title inconsistent between document properties and XMP metadata
%  /CreationDate (D:20090917) %no time allowed (problem with timezones), e.g. D:20090917152755+0100 or D:20090917152755Z
%  /ModDate (D:200808080808Z) % all three date formats are not validated with Acrobat 9.1.3 Pro Preflight:
%  Error message: inconsistent XMP-Metadata with document info, respectively document properties (German Acrobat version, messages translated)
%   - Creation date is inconsistent between document properties and XMP metadata
%   - Last modification date is inconsistent between document info and XMP metadata
  /DOCINFO pdfmark


Command used to create PDF/A:
gswin32c.exe -dPDFA -dBATCH -dNOPAUSE -dNOOUTERSAVE -sFONTPATH=C:\WINDOWS\Fonts
-dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite
-sOutputFile=out.pdf PDFA_def.ps a-Eff.pdf

uses ISOcoated_v2_300_eci.icc from ECI_Offset_2009:
http://www.eci.org/doku.php?id=de:downloads
Comment 1 T. Fischer 2009-10-05 05:52:22 UTC
Attached demo PDF to reproduce the PDF/A-errors to Bug 690803.
Comment 2 Ken Sharp 2009-10-06 01:33:51 UTC
This is, or may be, expected behaviour. Please see gs/doc/ps2pdf.htm under the
-sDSCEncoding switch.

If you do not specify a value for this then there are two problems. Firstly the
parentheses remain escaped (which is required for PostScript strings), secondly
the data is copied directly to the output, including any octal escapement which
is required for PostScript.

The first issue is resolved with revision 10142:

http://ghostscript.com/pipermail/gs-cvs/2009-October/009866.html

Which 'unescapes' data. Note that octal escapes will be converted into single
byte 'binary' data. This may or may not work with the other characters you
describe (Umlauts), you haven't attached a PDFA_def file to work with and I do
not trust cut and paste from HTML, so I can't be sure.

If it does not work then you might like to try setting DSCEncoding to
PDFDocEncoding which will convert the characters into Unicode, using the
PDFDocEncoding to decide which characters are represented by the binary values.

I do not see a problem with the ModDate or CreationDate when Acrobat preflight
is applied.

Since the parentheses and escapement issue is dealt with under bug #690471, the
other 'special' characters are probably dealt with by DSCEncoding, and
I don't see a problem with the dates, I am closing this as 'worksforme'.

If you continue to see a problem please reopen the issue and attach an example
file and an example PDFA_def.ps file.
Comment 3 T. Fischer 2009-10-06 06:27:45 UTC
Created attachment 5454
PDFA_def.ps (with Umlaut in Title, Subject, Keywords)
Comment 4 T. Fischer 2009-10-06 06:36:46 UTC
Thank you! No more problems with the brackets using the parameter
"-sDSCEncoding=PDFDocEncoding".

Cannot check the umlauts (äöüßÄÖÜ) since my version 8.70 (2009-07-31) predates
October 2009. Could not find a nightly or developer build.

Ghostscript command used:

gswin32c.exe -dPDFA -dBATCH -dNOPAUSE -dNOOUTERSAVE -sFONTPATH=C:\WINDOWS\Fonts
-dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite
-dPDFACompatibilityPolicy=1 -sDSCEncoding=PDFDocEncoding -sOutputFile=aOut.pdf
PDFA_def.ps a.pdf
Comment 5 Ray Johnston 2009-10-06 06:53:32 UTC
Since the only 'build' (executable) we ever distribute is for windows, this
comment assumes that you need help building on Windows.

The current sources are available from our repository at svn.ghostscript.com.
The web page at that host has the overview. Check out the source using:

svn co http://svn.ghostscript.com/ghostscript/trunk/gs

The TortoiseSVN svn client is used by several Artifex staff on Windows.

Then build it from a MS-DOS prompt window by changing to the top 'gs' directory
and using:

nmake -f psi/msvc32.mak

The resulting binary .exe and .dll will be in 'bin'

Note that the free 'Visual Studio C++ Express' from Microsoft is all you need
to build Ghostscript (if you don't already have MSVC).
Comment 6 Ken Sharp 2009-10-06 08:19:24 UTC
If you don't select an Encoding using DSCEncoding, then the data written to the
XMP section is incorrect, even with the escapement changes. 

Although the characters are no longer escaped, they are required to be in
Unicode for the XMP section, and unless the string is already Unicode it will be
written incorrectly, because it is written without conversion.

Even if it is Unicode, the PDF Metadata (/Title etc) is not; it's encoded using
PDFDocEncoding. So it's probably best to set -sDSCEncoding=PDFDocEncoding
whenever you use any characters outside the regular 7-bit ASCII set.
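
To make the difference between the two metadata stores concrete, the following Python sketch checks whether a string leaves 7-bit ASCII and converts single-byte data to the UTF-8 form used in XMP. It is an illustration only, not Ghostscript's implementation: both function names are hypothetical, and Latin-1 is used as a stand-in for PDFDocEncoding. That stand-in is an assumption that holds for the characters in this report (©, ä, ö, ü, ß, Ä, Ö, Ü) but not for every PDFDocEncoding code point.

```python
def needs_dsc_encoding(raw):
    """Return True if any byte lies outside the 7-bit ASCII range."""
    return any(byte > 0x7F for byte in raw)


def to_xmp_utf8(raw):
    """Decode single-byte metadata and re-encode it as UTF-8 for XMP.

    ASSUMPTION: Latin-1 stands in for PDFDocEncoding here. The two agree
    for the characters in this report, but differ elsewhere.
    """
    return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    info_title = b"Titel \xa9(\xe4\xf6\xfc\xdf\xc4\xd6\xdc)"  # Info dictionary bytes
    print(needs_dsc_encoding(info_title))
    print(to_xmp_utf8(info_title))
    # Copying the bytes unchanged would not give the same result:
    print(to_xmp_utf8(info_title) == info_title)
```

For pure ASCII strings both forms are byte-identical, which is why the mismatch only shows up once umlauts or the copyright sign appear in the metadata.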

Comment 7 T. Fischer 2010-05-19 07:12:12 UTC
Created attachment 6297
aPDFAtest.zip
Comment 8 T. Fischer 2010-05-19 07:23:46 UTC
The attached file aPDFAtest.zip reproduces the Preflight-error described in comment 0 (and has the additional two Preflight-errors from Bug 691319, too).

The ZIP archive contains:
- PDFA_defUmlaut.ps (like the attached PDFA_def.ps, only the gs version number changed in the text)
- aPDFtest.bat (the gs-command used)
- ISOcoated_v2_300_eci.icc (for completeness)
- aPDF2test.pdf (just a PDF file, taken from Bug 690803)
- aPDF2test.pdfa.pdf (the PDF/A produced with Ghostscript V.8.71 / aPDFtest.bat)
- aPDF2test.pdfa_report.pdf (the Preflight V.9.2 error report)
Comment 9 T. Fischer 2010-05-19 07:37:30 UTC
If you uncomment the following two lines in PDFA_defUmlaut.ps:
  /CreationDate (D:200808080808Z)
  /ModDate (D:200808080808Z)

then Preflight reports two additional errors:
ModDate and CreationDate: inconsistent XMP-Metadata with document info (respectively document properties)

see attachment aPDF2test.pdfa_report2.pdf
Comment 10 T. Fischer 2010-05-19 07:38:24 UTC
Created attachment 6298
aPDF2test.pdfa_report2.pdf
Comment 11 Ken Sharp 2010-09-21 15:45:25 UTC
(In reply to comment #9)
> if you uncomment the two following lines in PDFA_defUmlaut.ps:
>   /CreationDate (D:200808080808Z)
>   /ModDate (D:200808080808Z)
> 
> then Preflight reports two additional errors:
> ModDate and CreationDate: inconsistent XMP-Metadata with document info
> (respectively document properties)

This is a really *bad* idea. The CreationDate and ModDate are normally filled in at the time the document is created. You can override them in PostScript, but you can't override the XMP metadata creation that way.

As a result the two will not match if you do that. Creating a false CreationDate and/or ModDate by using PostScript is not an intended capability.
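
When comparing the two dates by hand, it helps to know that the Info dictionary uses the PDF date-string form (D:YYYYMMDDHHmmSS plus an optional timezone) while XMP uses ISO 8601. The Python sketch below converts one to the other so the values can be lined up. It is an illustration, not how Ghostscript writes its metadata; the function name is hypothetical, and accepting a timezone without apostrophes (as in the +0100 example from the description) is a leniency added here.

```python
import re

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm'.
# Apostrophes are optional in this sketch.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def pdf_date_to_iso8601(value):
    """Convert a PDF date string to an ISO 8601 timestamp (hypothetical helper)."""
    m = _PDF_DATE.match(value)
    if m is None:
        raise ValueError("not a PDF date string: {!r}".format(value))
    year, month, day, hour, minute, second = m.group(1, 2, 3, 4, 5, 6)
    # Missing fields take the defaults given in the PDF date definition.
    stamp = "{}-{}-{}T{}:{}:{}".format(
        year, month or "01", day or "01",
        hour or "00", minute or "00", second or "00",
    )
    if m.group(7):
        return stamp + "Z"
    if m.group(8):
        return stamp + "{}{}:{}".format(m.group(8), m.group(9), m.group(10) or "00")
    return stamp  # no timezone given: relationship to UTC is unknown


if __name__ == "__main__":
    for text in ("D:200808080808Z", "D:20090917152755+0100", "D:20090917"):
        print(text, "->", pdf_date_to_iso8601(text))
```

A converted Info-dictionary date that differs from the xmp:CreateDate or xmp:ModifyDate value in the XMP packet is the kind of mismatch Preflight reports.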
Comment 12 Ken Sharp 2010-09-21 15:56:00 UTC
(In reply to comment #8)
> The attached file aPDFAtest.zip reproduces the Preflight-error described in
> comment 0 (and has the additional two Preflight-errors from Bug 691319, too).

Executing the aPDFtest.bat (with suitable alteration to the path for GS), using current source code, the file passes Acrobat 9 pre-flight without error (except for the TT issue noted below).

I've checked the Title, Subject and Author fields in both the Info dictionary and the XMP metadata. The characters with umlauts, the parentheses and the copyright symbol are all present in both sets of metadata, and appear to match. Given that the preflight doesn't complain I'm inclined to believe they do match.

The issue with Encodings being applied to Symbolic TrueType fonts already has several bug reports against it (#690744, #691036, #691319).

Closing the issue as worksforme: the original issue was fixed by rev 10142 together with setting DSCEncoding to PDFDocEncoding, the TrueType issue is being tracked separately, and the attempt to modify dates is not supported.