Bug 691274 - Missing or incorrect ToUnicode when using Identity ordering
Status: RESOLVED WONTFIX
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 8.63
Hardware: PC Windows XP
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2010-04-30 09:23 UTC by Per Sundin
Modified: 2010-05-07 09:17 UTC
Description Per Sundin 2010-04-30 09:23:16 UTC
I want to convert my PostScript files to PDF, using the "Arial Unicode MS" TrueType font and the "Identity" ordering, so that I can support multiple languages with a single font. Since the "Identity" ordering selects glyphs by their UTF-16 values and my PostScript files are encoded using UTF-8, I have created a wrapper procedure for the "show" operator, which converts all strings from UTF-8 to UTF-16 before invoking the real "show" operator.
The resulting PDF is displayed and printed correctly, but the search function does not work. I have examined the PDF file and discovered that the "ToUnicode" CMap is missing.

Is there some way to force the "ToUnicode" to be included in the PDF file when using the "Identity" ordering?

I have also tried to create a custom "Identity-UTF16-H" CMap file. With that CMap a ToUnicode was indeed included in the PDF, but it was not a correct UTF-16 encoding:

<0000> <01ff> <0000> % OK
<0100> <02ff> <0000> % This is not UTF-16!
<0200> <03ff> <0100> % This is not UTF-16!
<0300> <04ff> <0200> % This is not UTF-16!
<0400> <05ff> <0300> % This is not UTF-16!

The name of the ToUnicode CMap above is "Identity-BF-H". What does this mean?
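
For comparison, a ToUnicode mapping in which every 2-byte code maps to the UTF-16BE value equal to itself would presumably contain ranges like the fragment below. This is only a sketch of the expected shape, not output from Ghostscript. It is a fragment of a CMap body rather than a complete program, and each range stays within a single high byte so that only the last byte of the destination is incremented.

```postscript
% Fragment of a ToUnicode CMap body (not a standalone program).
% Each line reads: <first code> <last code> <UTF-16BE value of first code>
5 beginbfrange
<0000> <00ff> <0000>
<0100> <01ff> <0100>
<0200> <02ff> <0200>
<0300> <03ff> <0300>
<0400> <04ff> <0400>
endbfrange
```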
Comment 1 Ken Sharp 2010-04-30 09:38:36 UTC
In order to create a ToUnicode CMap the pdfwrite device needs to be informed about how the glyphs in the file are converted into Unicode code points.

There are various ways this can be done but I would suggest that your PostScript file is simply not giving that information to the device. This is to be expected since there is no standard way to do so in PostScript.

Glyph name conversion is one route, but the most common is the inclusion of a ToUnicode CMap in the PostScript file. I think this is what you will need to do if you want to have pdfwrite generate a ToUnicode CMap in the PDF file for you.
Comment 2 Per Sundin 2010-04-30 11:07:36 UTC
As far as I understand, there is no such thing as a "ToUnicode CMap" in PostScript. It is a pure PDF construct, right? In PDF, ToUnicode is a CMap stream object that should be placed in the dictionary of the root font (the Type 0, "Composite" font). How do I tell Ghostscript to put it there?
Comment 3 Ken Sharp 2010-04-30 11:37:55 UTC
(In reply to comment #2)
> As far as I understand, there is no such thing as a "ToUnicode CMap" in
> PostScript. It is a pure PDF construct, right? 

Well, technically you can give a CMap any name you like, but yes you're correct, I actually meant a GlyphNames2Unicode dictionary. Apologies for the terminology confusion.

The GlyphNames2Unicode dictionary needs to be placed in the FontInfo dictionary of the embedded font.
Comment 4 Per Sundin 2010-04-30 16:40:23 UTC
You are probably right, but I'm not quite sure how to actually do it in code. I have tried, but I get the error "/rangecheck in --.buildcmap--" back from Ghostscript. I have also tried to skip the "defineresource" step, but then I get another error: "/invalidaccess in --put--". It would be much appreciated if you could have a look at my code. Maybe you can see directly what the problem is. Here is the code:

/RootFont /Helvetica-ISOLatin1 findfont def
/EmbeddedFont RootFont /FDepVector get 0 get def
/FontInfo EmbeddedFont /FontInfo get def

/CIDInit /ProcSet findresource
begin
  8 dict
  begin
    begincmap

      /CIDSystemInfo 3 dict dup
      begin
        /Registry (Adobe) def
        /Ordering (Identity) def
        /Supplement 0 def
      end def

      /CMapName /Adobe-Identity-000 def
      /CMapVersion 1.000 def
      /CMapType 2 def
      /WMode 0 def

      1 begincodespacerange
        <0000> <FFFF>
      endcodespacerange
      1 beginbfrange
        <0000> <FFFF> <0000>
      endbfrange

    endcmap
    CMapName currentdict /CMap defineresource pop

    currentdict % Save on operand stack?

  end % CMap dictionary
end % CIDInit ProcSet

FontInfo exch /GlyphNames2Unicode exch put
Comment 5 Ken Sharp 2010-05-01 08:56:45 UTC
GlyphNames2Unicode is not anything to do with the CMap; it's a dictionary entry in the FontInfo dictionary, in the Font dictionary.

This is an undocumented Adobe extension to the PostScript language. As noted previously there is no provision for Unicode in PostScript, so there is no standard method for creating a ToUnicode CMap. The information on this entry noted below is gathered from PostScript files and the observed behaviour of Adobe applications, and may not be complete or correct.

For regular PostScript fonts pdfwrite will try and assemble a ToUnicode CMap using the glyph names of the entries in the CharStrings dictionary. This is not 100% reliable of course as embedded (particularly subset) fonts may use meaningless glyph names, or may simply use standard names for non-standard glyphs.

CIDFonts do not have glyph names, so this approach cannot work. This is where the GlyphNames2Unicode entry is particularly useful, as it can associate either glyph names or CIDs with Unicode points.

The dictionary contains up to 65534 entries, each of which has one of two forms, either:

/glyphname <Unicode code point>

Or 

CID <Unicode code point>
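
As a rough illustration of the two entry forms, the fragment below builds one dictionary keyed by glyph names and one keyed by CIDs. Writing each code point as a 2-byte UTF-16BE string is an assumption (the entry is undocumented), the names G2UNames and G2UCIDs are made up for the example, and the CID keys assume a font in which the CID equals the Unicode value, as in the reporter's setup.

```postscript
% Hypothetical GlyphNames2Unicode-style dictionaries.
% Assumption: each value is the code point as a UTF-16BE string.

/G2UNames 2 dict def
G2UNames /A    <0041> put      % glyph name /A    -> U+0041
G2UNames /Euro <20AC> put      % glyph name /Euro -> U+20AC

/G2UCIDs 2 dict def
G2UCIDs 16#0041 <0041> put     % CID 65   -> U+0041 (Latin A)
G2UCIDs 16#0410 <0410> put     % CID 1040 -> U+0410 (Cyrillic A)
```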


Please note that I haven't attempted any of the following myself, this is an outline of how to proceed, not a recipe.

I'm assuming that you are using the Arial TrueType font from disk, and adding an entry to cidfmap so that Ghostscript is able to treat that font as a CIDFont, using a suitable CMap.

You will need to make a copy of that font dictionary by copying all the contents of the font dict. The FontInfo dict needs to be copied, and into the copy you need to insert a new dictionary named GlyphNames2Unicode. You will need to populate the GlyphNames2Unicode dictionary with appropriate CIDs and Unicode code points. Finally you will need to call definefont with the modified font dict.

I'm fairly sure there will be some missing bits in there, so I'd suggest you start by simply copying the font and defining a new font with a new name from the copied font dictionary, then work on inserting a GlyphNames2Unicode dictionary in the FontInfo dictionary. From there it should be relatively easy to match the CIDs to the Unicode points and create a working ToUnicode CMap in the output PDF.
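
The suggested first step (copy the font and define it under a new name) might look roughly like this. It is a sketch using an ordinary base font, with /Helvetica standing in for whichever font is being copied and /Helvetica-Copy a name invented for the example. The FID entry is skipped because definefont creates its own, and the unique IDs are removed so the copy is not mistaken for the original.

```postscript
% Sketch: copy a font dictionary and define the copy under a new name.
% /Helvetica and /Helvetica-Copy are placeholders for this example.

/Helvetica findfont                 % stack: orig
dup length 1 add dict exch          % stack: new orig
{                                   % stack: new key value
  1 index /FID eq
  { pop pop }                       % never copy FID
  { 2 index 3 1 roll put }          % new[key] = value
  ifelse
} forall                            % stack: new

dup /FontName /Helvetica-Copy put   % give the copy its own name
dup /UniqueID undef                 % no-op if the key is absent
dup /XUID undef
/Helvetica-Copy exch definefont pop

% The copy can now be selected like any other font.
/Helvetica-Copy findfont 12 scalefont setfont
```

Note that this is a shallow copy: nested objects such as FontInfo are still shared with the original font until they are copied separately.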
Comment 6 Per Sundin 2010-05-03 09:43:41 UTC
Yes, the "Arial Unicode MS" is represented by a Type 2 CIDFont dictionary. My original line for creating the composite font is as follows:

> /MyFont /Identity-H /CMap findresource
>  [ /ArialUni /CIDFont findresource ] composefont pop

I'm not sure how to use the "definefont" operator to create a composite font; would it be ok to use "composefont" instead? I was thinking about something like this:

> /EmbeddedFont /ArialUni /CIDFont findresource def
>
> true setglobal
> ... % Make a copy of EmbeddedFont -> EmbeddedFont2
> /FontInfo EmbeddedFont2 /FontInfo get def
> ...
>    % Create ToUnicode dictionary and place it on the operand stack!
> ...
> FontInfo exch /GlyphNames2Unicode exch put
> false setglobal
> 
> /MyFont /Identity-H /CMap findresource [ EmbeddedFont2 ] composefont pop

I noticed that I need to switch to global VM mode in order to be able to save the entry in the font dictionary. I assume that the new copy will also need to be saved in global VM, is that right?

I also noticed that I cannot use the "defineresource" operator to define my new dictionary as a CMap resource, because the value of 2 is not regarded as a valid CMapType value by the PostScript interpreter, it seems. Maybe the ToUnicode CMap does not need to be associated with the "CMap" name?

It will probably take me a while to figure out how to do the copy. As I understand, the copy operator only makes a shallow copy. Maybe there is some other way to do it, that I have not thought about?
Comment 7 Ken Sharp 2010-05-03 10:31:28 UTC
(In reply to comment #6)

> > /MyFont /Identity-H /CMap findresource
> >  [ /ArialUni /CIDFont findresource ] composefont pop
> 
> I'm not sure how to use the "definefont" operator to create a composite font;
> would it be ok to use "composefont" instead?

I'm doubtful, but it might work. To add the dictionary to an existing font you simply need to copy all the contents of the existing font dictionary, add the GlyphNames2Unicode dictionary to the copied FontInfo dictionary and execute definefont (remembering also to alter the FontName, and UIDs if present).


> I was thinking about something
> like this:

That may work, I haven't tried it. As long as composefont copies all the contents of the FontInfo dictionary from the CIDFont to the type 0 font, then it should be fine, and it's a good way to proceed.

I don't know off-hand whether pdfwrite extracts the GlyphNames2Unicode information from the parent type 0 font created by composefont, or whether it expects to find a GlyphNames2Unicode dictionary in each of the descendant fonts. I would guess the former though. This is where this approach may fail, if composefont does not copy the GlyphNames2Unicode dictionary, or does not copy it into the correct place(s).


> I noticed that I need to switch to global VM mode in order to be able to save
> the entry in the font dictionary. I assume that the new copy will also need to
> be saved in global VM, is that right?

You can save global objects in local VM, but you can't save local objects in global VM; see the PostScript Language Reference Manual, page 60 in my edition.

I presume the font dictionary is defined in global VM, which means the FontInfo dictionary is also in global VM, which means any composite objects (such as dictionaries) that you want to store in the FontInfo dictionary must also be in global VM.

Of course, if you copy the contents of the font dictionary, the new font dictionary and its contents need not be in global VM. Fonts are often treated as a special case because you normally expect them to persist for the life of the PostScript program, so they are normally defined in global VM, but it's not compulsory.
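
The VM rule can be seen in a few lines. This is a minimal sketch, assuming it is run at the top level of a job (so the hex strings are scanned while global allocation is in effect); the name G2U is made up for the example.

```postscript
% Sketch: allocate a dictionary (and its string values) in global VM,
% so that it could later be stored into a dictionary that is itself global.

currentglobal                 % remember the current allocation mode
true setglobal                % composite objects created from here on are global
/G2U 2 dict def               % the dictionary itself is allocated in global VM
G2U 16#0041 <0041> put        % these strings are also created in global VM
G2U 16#0410 <0410> put
setglobal                     % restore the remembered mode

G2U gcheck                    % pushes true: G2U lives in global VM
1 dict gcheck                 % pushes false if the restored mode was local
pop pop                       % discard the two booleans
```

Storing a dictionary created in local mode into a global dictionary is what raises /invalidaccess in --put--.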


> I also noticed that I cannot use the "defineresource" operator to define my new
> dictionary  as a CMap resource, because the value of 2 is not regarded as a
> valid CMapType value by the PostScript interpreter, it seems. Maybe the
> ToUnicode CMap does not need to be associated with the "CMap" name?

The ToUnicode CMap shouldn't exist on the PostScript side. This CMap is created from the GlyphNames2Unicode dictionary in the FontInfo dictionary (if present) by the pdfwrite device.

The only CMap you need is the one you use as the argument to composefont to create the CID-keyed instance. I would imagine you want one of the supplied Identity or Unicode mappings, possibly Identity-UTF16-H though I notice you are using Identity-H in the example above.

In PostScript the only valid CMap types are 0 and 1; type 2 is a PDF-only 'ToUnicode' map type.

 
> It will probably take me a while to figure out how to do the copy. As I
> understand, the copy operator only makes a shallow copy. Maybe there is some
> other way to do it, that I have not thought about?

You need to use forall to enumerate the contents of the font dictionary. For each object in the font dictionary you can either shallow copy it, or for composite objects (dictionaries, arrays) you can use forall again to make copies of the contents of the object. I would think that almost everything you need (as composite objects) could be simply 'copy'ed with the exception of the FontInfo dictionary.
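
Pulling the suggestions in this thread together, one possible shape for the copy is sketched below. It has not been tested against pdfwrite, and whether pdfwrite reads the entry from the CIDFont or from the Type 0 font is left open above. /ArialUni is the CIDFont from this report; copyfontdict and /ArialUni-G2U are names invented for the example, and the UTF-16BE string values are an assumption about an undocumented entry.

```postscript
% copyfontdict: fontdict -> newdict
% Copies every entry except FID; FontInfo gets its own writable copy.
/copyfontdict {
  dup length 1 add dict exch           % new orig
  {                                    % new key value
    1 index /FID eq
    { pop pop }                        % the interpreter assigns a fresh FID
    {
      1 index /FontInfo eq
      { dup length 1 add dict copy }   % private copy with room for one more key
      if
      2 index 3 1 roll put             % new[key] = value
    } ifelse
  } forall
  dup /FontInfo known not
  { dup /FontInfo 1 dict put } if      % make sure a FontInfo dictionary exists
} bind def

/ArialUni /CIDFont findresource copyfontdict   % stack: new
dup /CIDFontName /ArialUni-G2U put
dup /XUID undef                                % no-op if the key is absent
dup /UIDBase undef
dup /FontInfo get /GlyphNames2Unicode 2 dict   % stack: new info /G2U g2u
dup 16#0041 <0041> put                         % CID 0x0041 -> U+0041
dup 16#0410 <0410> put                         % CID 0x0410 -> U+0410
put                                            % stack: new
/ArialUni-G2U exch /CIDFont defineresource pop

/MyFont /Identity-H /CMap findresource
  [ /ArialUni-G2U /CIDFont findresource ] composefont pop
```

A real file would need one GlyphNames2Unicode entry for every CID actually used, not just the two shown here.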
Comment 8 Per Sundin 2010-05-04 12:07:44 UTC
I have tried to make copies of the font objects, but that's easier said than done, at least for me. You are more than welcome to give it a try. Here is the complete code of my example:

%!PS-Adobe-3.0

/vpos 720 def % vertical position
/hpos 72 def % horizontal position
/word (xx) def

/newpage
{
  /vpos 720 def
  /hpos 72 def
  hpos vpos moveto
} def

/newline
{
  /vpos vpos 25 sub def
  /hpos 72 def
  hpos vpos moveto
} def

/newhpos
{
  /hpos hpos 25 add def
  hpos vpos moveto
} def

/decodeUtf8
{
  /inStr exch def
  inStr length 2 mul string
  /outStr exch def

  /tmp 0 def
  /i 0 def
  /count 0 def
  
  inStr
  {
    /byte exch def

    i 0 eq
    {
      byte 128 lt % 0x80
      {
        /tmp tmp byte add def
        /i 0 def
      }{
        byte 224 lt % 0xE0
        {
          byte 192 sub 64 mul
          /tmp exch tmp exch add def
          /i 1 def % one more to go!
        }{
          byte 240 lt % 0xF0
          {
            byte 224 sub 4096 mul
            /tmp exch tmp exch add def
            /i 2 def % two more to go!
          }{
            byte 240 sub 262144 mul
            /tmp exch tmp exch add def
            /i 3 def % three more to go
          } ifelse
        } ifelse
      } ifelse
    }{
      i 1 eq { /tmp tmp byte 128 sub add def } if
      i 2 eq { /tmp tmp byte 128 sub 64 mul add def } if
      i 3 eq { /tmp tmp byte 128 sub 4096 mul add def } if
      /i i 1 sub def
    } ifelse

    i 0 eq
    {
      % One character completely read!
      tmp 65535 le
      {
        tmp 256 idiv
        outStr exch count exch put
    
        tmp 256 mod
        outStr exch count 1 add exch put
      }{
        % surrogate pair?
        0 outStr exch count exch put
        0 outStr exch count 1 add exch put
      } ifelse

      /tmp 0 def
      /i 0 def
      /count count 2 add def
    } if
  } forall
  outStr 0 count getinterval
} def

/MyCompFont /Identity-H /CMap findresource [/ArialUni /CIDFont findresource] composefont pop

/MyCompFont findfont
12 scalefont
setfont
newline

% English letters...
(abcdef) decodeUtf8 show
newline

% Russian letters...
<D0 90 D0 91 D0 92 D0 93 D0 94 D0 95> decodeUtf8 show
newline

showpage
Comment 9 Ken Sharp 2010-05-07 08:31:13 UTC
(In reply to comment #8)
> I have tried to make copies of the font objects, but that's easier said than
> done, at least for me. You are more than welcome to give it a try.

I'm sorry but I'm afraid this is beyond the scope of support we can offer to free users. If a commercial customer should request this enhancement then we will rethink the decision.
Comment 10 Per Sundin 2010-05-07 09:17:12 UTC
Ok, thanx anyway. I will try your suggestions later; I just don't have enough time right now. It's interesting, though, that Ghostscript puts in an incorrectly encoded ToUnicode CMap when I use my "Identity-UTF16-H" CMap. Also, when using an "Identity" CMap, isn't it quite obvious that I also want a corresponding ToUnicode?
Best regards,
Per