690722 – Problem with the font embedding logic

Summary: Problem with the font embedding logic

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Writer
Version:	8.70
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Ken Sharp

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-08-20 01:43 UTC by Luigi Scarso
Modified:	2009-08-25 01:19 UTC
CC List:	1 user

See Also:
Customer:
Word Size:	---

Attachments
gs870.zip (27.81 KB, application/zip) 2009-08-20 01:45 UTC, Luigi Scarso

Description Luigi Scarso 2009-08-20 01:43:06 UTC

With Ghostscript 8.70,
processing the attached test.pdf with
this script
#
GS=/opt/ghostscript/ghostscript-8.70/bin/gs
pdf_src=$1
dest=`basename $pdf_src .pdf`_gs870.pdf
$GS -sDEVICE=pdfwrite -dNOPAUSE -dCompressFonts=false \
  -dCompressPages=false -dUseFlateCompression=false -sOutputFile=$dest \
  $pdf_src -c quit
#
produces test_gs870.pdf without errors.

But if I view test_gs870.pdf with xpdf, I see many of these "errors" on the console:
Error: Illegal entry in bfrange block in ToUnicode CMap

Both PDFs are displayed correctly.
I am currently running Ubuntu Linux 8.04 LTS.

Comment 1 Luigi Scarso 2009-08-20 01:45:42 UTC

Created attachment 5316
gs870.zip

distill_870.sh test_gs870.pdf  test.pdf

Comment 2 Ken Sharp 2009-08-20 03:29:39 UTC

This appears to be an error with xpdf, not Ghostscript. The original file
includes a ToUnicode CMap which uses the ~bfchar operators:

0 beginbfrange
endbfrange
5 beginbfchar
<0004> <0021>
<002B> <0048>
<0048> <0065>
<004F> <006C>
<0052> <006F>
endbfchar

The output from pdfwrite encodes the same information but uses the ~bfrange
operators instead:

5 beginbfrange
<04><04><0021>
<2b><2b><0048>
<48><48><0065>
<4f><4f><006c>
<52><52><006f>
endbfrange

Now this is nominally slightly less efficient, since each entry requires both a
start and an end code, but the spec (Technical Note 5014, Adobe CMap and CIDFont
Files Specification, p72 and 73) does not say that these cannot be the same.

In fact it appears that xpdf is assuming that the start and end codes will be 4
hex digits (2 bytes) long. If I modify the entries thus:

5 beginbfrange
<0004><0004><0021>
<002b><002b><0048>
<0048><0048><0065>
<004f><004f><006c>
<0052><0052><006f>
endbfrange

The problem disappears. The spec says:

"Values for srcCodeLo and srcCodeHi must be in hexadecimal notation. "

There is no apparent requirement for null padding, so this requirement by xpdf
would seem to be incorrect. Further, Adobe Acrobat is capable of opening both
files *and* searching for the text in both cases. Since Acrobat requires the
ToUnicode table to search for text in a CIDFont it would seem that Adobe do not
require this either.
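
The equivalence of the two bfrange blocks can be checked mechanically. The following is a minimal Python sketch, not code from Ghostscript or xpdf; the helper names `parse_bfrange` and `as_numbers` are hypothetical. It assumes only the simple `<lo><hi><dst>` entry form and shows that both blocks describe the same mapping when the source codes are read as numbers, while differing in how many bytes each source code occupies.

```python
import re

def parse_bfrange(block):
    """Return a list of (lo, hi, dst) byte strings from bfrange entries.

    Only the simple <lo><hi><dst> form is handled; the array form
    <lo><hi>[<d1> <d2> ...] is deliberately left out of this sketch.
    """
    entries = []
    for line in block.strip().splitlines():
        hexes = re.findall(r"<([0-9A-Fa-f]+)>", line)
        if len(hexes) == 3:
            entries.append(tuple(bytes.fromhex(h) for h in hexes))
    return entries

def as_numbers(entries):
    """Read the source codes as plain integers, keeping dst as bytes."""
    return [(int.from_bytes(lo, "big"), int.from_bytes(hi, "big"), dst)
            for lo, hi, dst in entries]

# Entries as written by pdfwrite in 8.70 (one-byte source codes).
narrow = parse_bfrange("""
<04><04><0021>
<2b><2b><0048>
<48><48><0065>
<4f><4f><006c>
<52><52><006f>
""")

# The hand-modified entries that xpdf accepts (two-byte source codes).
wide = parse_bfrange("""
<0004><0004><0021>
<002b><002b><0048>
<0048><0048><0065>
<004f><004f><006c>
<0052><0052><006f>
""")

print(as_numbers(narrow) == as_numbers(wide))   # True
print({len(lo) for lo, _, _ in narrow})         # {1}
print({len(lo) for lo, _, _ in wide})           # {2}
```

So the only difference a consumer can observe is the byte length of the source codes, which is exactly what xpdf appears to be sensitive to.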

Comment 3 Luigi Scarso 2009-08-20 04:47:55 UTC

The PDF 1.7 spec, at page 472, section 5.9.2, says:
It must use the ...beginbfrange, and endbfrange operators to define the mapping
from character codes to Unicode character sequences expressed in UTF-16BE
encoding (also see Adobe TechNote 5411). Is this important?

Comment 4 Ken Sharp 2009-08-20 06:04:38 UTC

>It must use the ...beginbfrange, and endbfrange operators to define the mapping
>from character codes to Unicode character sequences expressed in UTF-16BE
>encoding (also see Adobe TechNote 5411). Is this important?

Well, it does do this. The character codes are expressed as single bytes, but the
Unicode code points are expressed as 16 bits. E.g.:

<04><04><0021>

Starting code = 0x04, end code = 0x04, Unicode = 0x0021. So the final (Unicode)
value is expressed as a UTF-16BE value. Character codes are just numbers and
there's no indication that they need to be a particular number of bits (in
general in PostScript or PDF they are not). It's worth noting that CMaps may
encode character codes of arbitrary size; the largest I've heard of is 5 bytes
(a Chinese CMap), so requiring two-byte character codes would be unhelpful.

Also, as I said, Acrobat is happy with the CMap (and the PDF file), and Acrobat
is generally regarded as the standard.
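
As an illustration of how such an entry is read, here is a small Python sketch; the function name `expand_bfrange` is hypothetical. It expands one `<lo><hi><dst>` entry into individual mappings and decodes each destination as UTF-16BE. As a simplification it increments the destination as a whole integer, which is adequate for short ranges like these but is not a full implementation of the CMap rules.

```python
def expand_bfrange(lo_hex, hi_hex, dst_hex):
    """Expand one <lo><hi><dst> entry into {code bytes: Unicode string}."""
    lo = bytes.fromhex(lo_hex)
    hi = bytes.fromhex(hi_hex)
    if len(lo) != len(hi):
        raise ValueError("start and end codes must have the same length")
    dst = bytes.fromhex(dst_hex)
    start = int.from_bytes(lo, "big")
    end = int.from_bytes(hi, "big")
    base = int.from_bytes(dst, "big")

    mapping = {}
    for offset in range(end - start + 1):
        # The source code keeps the byte width it was written with.
        code = (start + offset).to_bytes(len(lo), "big")
        # Consecutive codes map to consecutive destination values.
        value = (base + offset).to_bytes(len(dst), "big")
        mapping[code] = value.decode("utf-16-be")
    return mapping

print(expand_bfrange("04", "04", "0021"))      # {b'\x04': '!'}
print(expand_bfrange("0041", "0043", "0061"))  # three entries: a, b, c
```

The source code is one byte wide in the first call and two bytes wide in the second, yet in both cases the destination is a 16-bit UTF-16BE value.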

Comment 5 Luigi Scarso 2009-08-20 06:23:53 UTC

Thank you, Ken, for your patience. Please forgive me, this is all a bit new to me.
I'm not convinced by
1 begincodespacerange
<00><ff>
endcodespacerange
because 5411 says 

1 begincodespacerange
<0000><FFFF>
endcodespacerange

So I fetched http://www.pragma-ade.com/general/manuals/mk.pdf with wget
and processed it with
GS=/opt/ghostscript/ghostscript-8.70/bin/gs
pdf_src=$1
dest=`basename $pdf_src .pdf`_gs870.pdf
$GS -sDEVICE=pdfwrite -dNOPAUSE -dCompressFonts=false -dCompressPages=false \
  -dFirstPage=102 -dUseFlateCompression=false -sOutputFile=$dest $pdf_src -c quit

(note -dFirstPage=102)

I see this:

1 begincodespacerange
<00><ff>
endcodespacerange
8 beginbfrange
<05><07><dcca>
<09><09><0046>
<18><18><0055>
<83><83><0061>
<85><8b><dd56>
<8e><92><dd5f>
<94><96><dd65>
<98><9a><dd69>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
endstream
endobj
1049 0 obj
<</BaseFont/HCNWLM+CambriaMath-2-Identity-H/ToUnicode 1286 0 R/Type/Font
/Encoding /Identity-H/DescendantFonts[1050 0 R]/Subtype/Type0>>

Is it OK?

If I process it with
GS=/opt/ghostscript/ghostscript-8.70/bin/gs
pdf_src=$1
dest=`basename $pdf_src .pdf`_gs870.pdf
$GS -sDEVICE=pdfwrite -dNOPAUSE -dCompressFonts=false -dCompressPages=false \
  -dUseFlateCompression=false -sOutputFile=$dest $pdf_src -c quit

I get an error at (or after) page 11, but this may be another story.

Operand stack:
   --dict:8/17(L)--   MplSh1   --dict:8/9(ro)(L)--   --dict:8/8(ro)(L)--  
--nostringval--   --dict:8/9(ro)(L)--
Execution stack:
   %interp_exit   .runexec2   --nostringval--   --nostringval--  
--nostringval--   2   %stopped_push   --nostringval--   --nostringval--  
--nostringval--   false   1   %stopped_push   1862   1   3   %oparray_pop   1861
  1   3   %oparray_pop   1845   1   3   %oparray_pop   --nostringval--  
--nostringval--   12   1   316   --nostringval--   %for_pos_int_continue  
--nostringval--   --nostringval--   false   1   %stopped_push   --nostringval--
  --nostringval--   --nostringval--   %array_continue   --nostringval--   false
  1   %stopped_push   --nostringval--   %loop_continue   --nostringval--  
--nostringval--   --nostringval--
Dictionary stack:
   --dict:1155/1684(ro)(G)--   --dict:1/20(G)--   --dict:76/200(L)--  
--dict:76/200(L)--   --dict:106/127(ro)(G)--   --dict:285/300(ro)(G)--  
--dict:22/25(L)--   --dict:4/6(L)--   --dict:25/40(L)--   --dict:1/1(ro)(G)--  
--dict:8/15(L)--   --dict:3/5(L)--
Current allocation mode is local
Last OS error: 2
GPL Ghostscript 8.70: Unrecoverable error, exit code 1

Comment 6 Luigi Scarso 2009-08-20 06:26:40 UTC

Acroread 9 under Linux has no problem with mk.pdf.

Comment 7 Ken Sharp 2009-08-20 06:42:28 UTC

>I'm not convinced by
>1 begincodespacerange
><00><ff>
>endcodespacerange
>because 5411 says
>
>1 begincodespacerange
><0000><FFFF>
>endcodespacerange

I'm not sure what you mean by 5411; is this another tech note? The ~codespacerange
operators just declare the maximum and minimum values for the character codes.
Since we aren't using values outside 00->ff, this is enough for us. Note that these
are the *input* codes, not the output values.

Multiple bytes in the code space range have a different meaning to what you
might expect. Each byte is considered separately, not as forming a single
value. See p48 of tech note 5014 for more on this. In the case of 0000 ffff this is
a full complement of 2-byte character codes, but we really don't need to
specify that.
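
The per-byte reading of a codespace range can be sketched in Python as follows. The function `in_codespace` is hypothetical, and the range `<8140><9FFC>` is used purely as an illustrative two-byte range; neither is taken from the files in this report. The example picks a code that lies inside the range numerically but outside it when each byte is tested separately.

```python
def in_codespace(code, lo, hi):
    """Per-byte codespace test.

    The code must have the same length as the range bounds, and each of
    its bytes must lie between the corresponding bytes of lo and hi.
    """
    if not (len(code) == len(lo) == len(hi)):
        return False
    return all(low <= byte <= high for byte, low, high in zip(code, lo, hi))

lo, hi = bytes.fromhex("8140"), bytes.fromhex("9FFC")
code = bytes.fromhex("8520")

# Treated as one number, 0x8520 lies between 0x8140 and 0x9FFC ...
numeric = int.from_bytes(lo, "big") <= int.from_bytes(code, "big") <= int.from_bytes(hi, "big")
print(numeric)                       # True

# ... but its second byte 0x20 is below 0x40, so it is outside the range.
print(in_codespace(code, lo, hi))    # False

# The range written by pdfwrite in 8.70, <00><ff>, covers every one-byte code.
print(in_codespace(bytes.fromhex("04"), bytes.fromhex("00"), bytes.fromhex("ff")))  # True
```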


Now as for mk.pdf, I don't see a problem with the CMap you quote, but I would
need to see the whole file to be more sure. I can't really comment on your
other error.

We've drifted rather far away from your xpdf problem. If you think there is a
problem with a different file using GS you should open a new issue; if you think
there is still a problem with xpdf then you should reopen this issue.

Comment 8 Luigi Scarso 2009-08-20 07:08:59 UTC

>I'm not convinced by
>1 begincodespacerange
><00><ff>
>endcodespacerange
>because 5411 says
>
>1 begincodespacerange
><0000><FFFF>
>endcodespacerange
>
>Not sure what you mean by 5411, is this another tech note?

Yes:
ToUnicode Mapping File Tutorial,
Technical Note #5411.
Please see also

PDF 1.7 spec, page 472, subsection 5.9.2

Anyway, for mk.pdf I may open another issue.

Many thanks

Comment 9 Ken Sharp 2009-08-20 07:26:18 UTC

>Please see also
>
>PDF 1.7 spec, page 472, subsection 5.9.2

I've looked at both of these and I do not see a specific problem. It's true that the
examples all use 2-byte codespace ranges, and as a result all use 2-byte
bfchar/bfrange operators. However, there is nothing in the spec which says that
2-byte ranges are a requirement. In fact the spec makes it reasonably clear (under
begincodespacerange) that single-byte ranges are permissible.

Given that we do use 2-byte codes (but always between 00 and ff) it's possibly
arguable that we should use 2-byte ranges but, as I said, Acrobat seems happy
with what we currently output.

Comment 10 Luigi Scarso 2009-08-20 08:12:11 UTC

On page 3 of Adobe Technical Note #5411 there is
the following text:

"codespacerange" definition, without exception, shall
 always be used:
1 begincodespacerange
<0000><FFFF>
endcodespacerange

See also the PDF 1.7 spec, which at page 472, section 5.9.2 says:
"Additional guidance regarding the CMap defined in this entry is provided
in Adobe Technical Note #5411, ToUnicode Mapping File Tutorial"

But this is not the case for test_gs870.pdf,
hence I am reopening the issue.

Many thanks to Ken for his patience.

Comment 11 SaGS 2009-08-21 02:56:29 UTC

In my opinion, the paragraph that actually clarifies this debate is the 
following, taken from PDF32000_2008.pdf section 9.10.3 "ToUnicode CMaps":

   "The CMap file shall contain begincodespacerange and endcodespacerange 
    operators that are consistent with the encoding that the font uses. In 
    particular, for a simple font, the codespace shall be one byte long."

Almost the same text can be found in the PDF Reference, section 
5.9.2 "ToUnicode CMaps", 3rd bullet. I consider this implies that fonts using 
2-byte charcodes should use 2-byte ‘character codes’ in the ToUnicode CMap.

This is also consistent with the following, found in section 9.7.6.2 "CMap 
Mapping", and which refers to ‘regular’ CMaps:

   "The code extracted from the string shall be looked up in the character 
    code mappings for codes of that length. ... Failing that, it shall be 
    looked up in the notdef mappings, ..."

(Note the "of that length".) It's clear from this that the length of the code 
matters a lot. If there are no ranges of the correct length, the 
respective ‘character code’ is taken as undefined; there’s no ‘delete leading 
zeroes and retry’.

To put it another way, in a Type 0 font ‘character codes’ are conceptually not 
numbers, for which one could talk about ‘0-filling the value’, but sequences of 1 or 
more bytes. For example, <00FF> starts not as the number 255, but as the 2 
bytes 0 and 255; these are extracted from a show string and used in a certain 
way (maybe to derive a CID, which IS a number) to locate the corresponding 
character. Also note that the ToUnicode CMap is attached to the Type 0 composite 
font, not to its CIDFont descendant (the latter does identify characters 
by numbers and not by sequences of bytes).
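
A short Python sketch of this "sequence of bytes, not number" view follows. The dictionaries and the `to_unicode` helper are hypothetical and stand in for a strict consumer; they are not taken from any real PDF viewer. The lookup is keyed on the exact byte sequence, so a two-byte code from a show string does not match a one-byte entry even though both have the numeric value 4.

```python
# ToUnicode entries keyed by the exact byte sequence of each character code.
one_byte_map = {bytes.fromhex("04"): "!", bytes.fromhex("2b"): "H"}
two_byte_map = {bytes.fromhex("0004"): "!", bytes.fromhex("002b"): "H"}

def to_unicode(mapping, code):
    """Strict lookup: no stripping or adding of leading zero bytes."""
    return mapping.get(code)

shown = bytes.fromhex("0004")  # two bytes taken from a show string

print(to_unicode(two_byte_map, shown))  # !
print(to_unicode(one_byte_map, shown))  # None
```

A consumer that chooses to be lenient would have to add its own step that pads or strips zero bytes before the lookup; nothing in the strict reading provides one.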

Comment 12 Ken Sharp 2009-08-21 03:16:35 UTC

>Almost the same text can be found in the PDF Reference, section 
>5.9.2 "ToUnicode CMaps", 3rd bullet. I consider this implies that fonts using 
>2-byte charcodes should use 2-byte ‘character codes’ in the ToUnicode CMap.

I agree with this completely; in fact the point about codespace ranges is that
they are not linear. If the range contains two bytes then the first byte of the
character code is compared to the first byte of the start code and the first
byte of the end code. Similarly the second byte of the character code is
compared to the second byte of the start and end codes.

Since the two bytes are considered separately it's pretty clear that a 2-byte
code space range requires 2-byte character codes (but note that Acrobat appears
to promote single-byte codes by inserting '00').

What isn't clear is that ToUnicode CMaps *always* have a two byte code space
range. This seems to be documented only in tech note 5411 the ToUnicode Mapping
File Tutorial where it says (as Luigi correctly points out):

"the following “codespacerange” definition, without exception, shall always be
used: 1 begincodespacerange <0000> <FFFF> endcodespacerange"

This makes it mandatory to have a 2 byte code space range and therefore the
bfrange/bfchar parameters must also be 2 bytes.
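
To show what output conforming to that reading would look like, here is a minimal Python sketch that renders bfrange entries with source codes padded to a fixed byte width. It is an illustration only: `format_bfrange` and its `code_bytes` parameter are hypothetical, and this is not the change made in pdfwrite.

```python
def format_bfrange(entries, code_bytes=2):
    """Render (lo, hi, unicode) integer triples as a bfrange block.

    Source codes are zero-padded to code_bytes bytes; the destination is
    always written as a 16-bit value.
    """
    width = code_bytes * 2  # hex digits per source code
    lines = [f"{len(entries)} beginbfrange"]
    for lo, hi, uni in entries:
        lines.append(f"<{lo:0{width}x}><{hi:0{width}x}><{uni:04x}>")
    lines.append("endbfrange")
    return "\n".join(lines)

entries = [
    (0x04, 0x04, 0x0021),
    (0x2B, 0x2B, 0x0048),
    (0x48, 0x48, 0x0065),
    (0x4F, 0x4F, 0x006C),
    (0x52, 0x52, 0x006F),
]

print(format_bfrange(entries))                # two-byte source codes
print(format_bfrange(entries, code_bytes=1))  # the form written by 8.70
```

With the default width, the first entry is written as `<0004><0004><0021>`, matching the hand-modified block that xpdf accepted.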

In passing Luigi also tells me that the xpdf maintainer has already 'fixed' this
and the next release of xpdf will process these CMaps without complaint. 

Nevertheless, we should fix pdfwrite so it conforms to the 'spec'.

Comment 13 Ken Sharp 2009-08-25 01:19:02 UTC

Fixed in revision 10018, patch here:

http://ghostscript.com/pipermail/gs-cvs/2009-August/009738.html