690820 – File Identifiers are written as literal string where hexadecimal representation is expected

Status:	RESOLVED INVALID

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	8.70
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Ken Sharp

Reported:	2009-10-15 05:09 UTC by Gerhard Wanderer
Modified:	2009-10-16 00:41 UTC
CC List:	0 users

Description Gerhard Wanderer 2009-10-15 05:09:13 UTC

When creating PDF/X compatible PDF files from PostScript files, sometimes the
File Identifier is written as a string like '(text)', most of the time it is
written as a bytestring '<812FDA7524C3467318D2A0B185604AD0>'.

According to ISO 32000-1:2008, section 7.5.5:

(Required if an Encrypt entry is present; optional otherwise; PDF 1.1) An array
of two byte-strings constituting a file identifier (see 14.4, "File
Identifiers") for the file. If there is an Encrypt entry this array and the two
byte-strings shall be direct objects and shall be unencrypted.

So some programs expect the bytestring, even when the file is not encrypted. The
problem is in a signature-server, which is used to sign the pdf. There is no
problem with viewers, like Acrobat Reader.

At the moment I have modified the Ghostscript source to ensure hexadecimal output
of the file identifier:


diff -E -w -r ghostscript-8.70/base/gdevpdf.c ghostscript-8.70,org/base/gdevpdf.c
1346,1347c1346,1347
<     psdf_write_string(pdev->strm, pdev->fileID, sizeof(pdev->fileID), PRINT_HEX);
<     psdf_write_string(pdev->strm, pdev->fileID, sizeof(pdev->fileID), PRINT_HEX);
---
>     psdf_write_string(pdev->strm, pdev->fileID, sizeof(pdev->fileID), 0);
>     psdf_write_string(pdev->strm, pdev->fileID, sizeof(pdev->fileID), 0);

diff -E -w -r ghostscript-8.70/base/spsdf.c ghostscript-8.70,org/base/spsdf.c
78,79c78
<     if ((print_ok != PRINT_HEX) &&
<               (added < size || (print_ok & PRINT_HEX_NOT_OK))) {
---
>     if (added < size || (print_ok & PRINT_HEX_NOT_OK)) {

diff -E -w -r ghostscript-8.70/base/spsdf.h ghostscript-8.70,org/base/spsdf.h
34d33
< #define PRINT_HEX        8
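
For readers unfamiliar with the two notations, the following is a minimal, standalone sketch of what "hexadecimal output" means for a 16-byte file identifier. It does not use Ghostscript's stream API or psdf_write_string; the helper name format_hex_string is hypothetical and exists only for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of Ghostscript: format `len` raw bytes as a
 * PDF hexadecimal string "<...>" into `out`. Returns the number of characters
 * written (excluding the terminating NUL), or 0 if `out` is too small. */
static size_t format_hex_string(const unsigned char *data, size_t len,
                                char *out, size_t out_size)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t need = 2 * len + 3;   /* '<' + two digits per byte + '>' + NUL */
    size_t i, pos = 0;

    if (out_size < need)
        return 0;
    out[pos++] = '<';
    for (i = 0; i < len; i++) {
        out[pos++] = digits[data[i] >> 4];      /* high nibble */
        out[pos++] = digits[data[i] & 0x0F];    /* low nibble */
    }
    out[pos++] = '>';
    out[pos] = '\0';
    return pos;
}
```

Called with the four bytes 0x81 0x2F 0xDA 0x75 and a sufficiently large buffer, this produces the text <812FDA75>; a real file identifier would simply be longer.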

Comment 1 Ken Sharp 2009-10-15 06:30:21 UTC

>When creating PDF/X compatible PDF files from PostScript files, sometimes the
>File Identifier is written as a string like '(text)', most of the time it is
>written as a bytestring '<812FDA7524C3467318D2A0B185604AD0>'.

That is not a 'byte string' as defined in the PDF Reference; that is a
hexadecimal string. I don't have a copy of the ISO spec, so I'm working from the
PDF Reference 1.7, and the quotes are from that document.

On p155, Table 3.31 PDF Data Types:

"byte string A series of 8-bit bytes that represent characters or other binary
data. If such a type represents characters, the encoding is not identified."

This is detailed further on p157:

"byte string (PDF 1.7) Used for binary data represented as a series of 8-bit
bytes, where each byte can be any value representable in 8 bits. The string may
represent characters or glyphs but the encoding is not known. The bytes of the
string may not represent characters. This type is used for data such as MD5 hash
values, signature certificates, and Web Capture identification values. "

So a byte string is a binary sequence; it is not a hexadecimal-encoded sequence.

The File Identifier is documented on p847:

"File identifiers are defined by the optional ID entry in a PDF file’s trailer
dictionary (see Section 3.4.4, “File Trailer”; see also implementation note 162
in Appendix H). The value of this entry is an array of two byte strings."

So it seems from this that an ID which is hexadecimal encoded would actually be
*incorrect*.

>So some programs expect the bytestring, even when the file is not encrypted.
>The problem is in a signature-server, which is used to sign the pdf. There is
>no problem with viewers, like Acrobat Reader.

It's not clear to me what the problem is. You seem to be saying that an
application doesn't like these strings unless they are hex encoded, which seems
to be incorrect. You also seem to imply that the strings are hex encoded when
the file is encrypted, but not otherwise.

I'm doubtful this is the case. I suspect that a stream or other content object,
when encrypted, might be hex encoded, and that the object contains an array of
strings which are not hex encoded when decrypted. When not encrypted, the array
is an array of byte strings, as expected.

Currently it seems to me that the application complaining about the ID is
incorrect, but without examples it is very difficult to be sure. Note that
Acrobat Preflight, amongst other tools, is known to happily validate GS output
as PDF/X conforming.

Note that the presence or absence of encryption does not affect whether the
strings are byte strings or not; they are *always* byte strings. The presence of
encryption simply means that these strings are mandatory, not optional.


If you still believe there is a problem please attach some examples which will
allow us to reproduce your problem. We will need an example input file (or
files) and example command line specifications.

Comment 2 Gerhard Wanderer 2009-10-15 09:16:59 UTC

Thank you for your fast reply.

I now think it's neither a bug nor a weak interpretation of the spec in
Ghostscript. In the few sentences of the spec excerpt that I have, there
is no statement that the string has to be written in a hexadecimal representation.

I was misled by 'direct object', but that has nothing to do with the kind
of string output.

Sincerely yours

> That is not a 'byte string' as defined in the PDF Reference, that is a
> Hexadecimal string. I don't have a copy of the ISO spec, so I'm working
> from the PDF Reference 1.7, quotes are from that document.
>
The word 'bytestring' was my name for a hexadecimal-encoded string; sorry for
the confusion.

In the case where the PDF is not accepted, the created trailer looks like this:

trailer
<< /Size 19 /Root 1 0 R /Info 2 0 R
/ID [([\246\347;J\243RU~\375Nw,B`c)([\246\347;J\243RU~\375Nw,B`c)]
>> >>
startxref
28887
%%EOF

When it is accepted, it looks like this:
trailer
<< /Size 40 /Root 1 0 R /Info 2 0 R
/ID [<BEBD81F87396DC4BB86CF445CA694474><BEBD81F87396DC4BB86CF445CA694474>]
>>
startxref
492248
%%EOF
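
To compare the two trailers above byte for byte, one could decode the parenthesised form and print it in hex. The following is a rough sketch in C, assuming the backslash escapes listed for literal strings in the PDF Reference (\n, \r, \t, \b, \f, \(, \), \\ and one to three octal digits). The function name decode_literal_body is hypothetical, and the sketch ignores backslash-newline line continuation.

```c
#include <stddef.h>

/* Hypothetical helper: decode the body of a PDF literal string (the text
 * between the outer parentheses) into raw bytes. Returns the number of bytes
 * produced, or (size_t)-1 if `out` is too small. A lone trailing backslash
 * is copied through unchanged in this sketch. */
static size_t decode_literal_body(const char *s, unsigned char *out,
                                  size_t out_size)
{
    size_t n = 0;

    while (*s) {
        unsigned int c = (unsigned char)*s++;

        if (c == '\\' && *s) {
            char e = *s++;
            if (e >= '0' && e <= '7') {
                /* Octal escape: up to three digits, high-order overflow dropped. */
                int ndigits = 1;
                c = (unsigned int)(e - '0');
                while (ndigits < 3 && *s >= '0' && *s <= '7') {
                    c = c * 8 + (unsigned int)(*s++ - '0');
                    ndigits++;
                }
                c &= 0xFF;
            } else {
                switch (e) {
                case 'n': c = '\n'; break;
                case 'r': c = '\r'; break;
                case 't': c = '\t'; break;
                case 'b': c = '\b'; break;
                case 'f': c = '\f'; break;
                default:  c = (unsigned char)e; break; /* \( \) \\ and others */
                }
            }
        }
        if (n >= out_size)
            return (size_t)-1;
        out[n++] = (unsigned char)c;
    }
    return n;
}
```

Applied to the body of the first ID string exactly as it appears in the rejected trailer, this yields 16 bytes, which in hexadecimal notation would read <5BA6E73B4AA352557EFD4E772C426063>. That is the same length as the hex-encoded ID in the accepted trailer.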

Comment 3 Ken Sharp 2009-10-16 00:41:40 UTC

OK, when a file is encrypted, all strings and streams in the PDF file (but not
other objects such as integers and booleans) are encrypted.

What you have is an unencrypted dictionary, with an unencrypted array, which
contains encrypted strings. 

The spec doesn't actually explicitly say so (unless the ISO spec says something
different) but the encryption in effect trumps the byte string requirement. This
is because decryption must take place before any other use of the string. So
after the strings have been decrypted they will in fact be binary strings
again, and therefore acceptable as File IDs.