Bug 690935

Summary: ghostscript loses some letters at gs to pdf conversion
Product: Ghostscript
Reporter: Alex <alex-kostjukov>
Component: PDF Interpreter
Assignee: Alex Cherepanov <alex>
Status: RESOLVED WONTFIX
Severity: major
Priority: P4
Version: 8.64
Hardware: PC
OS: Linux
Customer:
Word Size: ---
Attachments: There is input postscript file
The output pdf file
Output after postscript-pdf-postscript

Description Alex 2009-11-19 07:32:26 UTC
The problem particularly concerns the Cyrillic capital letter “Short I”, U+0419, in
Arial, Courier New, and Times New Roman, and the small letter “Be”, U+0431, in DejaVu
Sans.
Postscript documents containing these letters in the fonts mentioned lose them after
postscript-to-pdf conversion.
This particularly leads to corrupted printouts of openoffice.org documents
printed via the cups print manager. The sequence is as follows: Openoffice.org
sends a postscript document to cups, which applies postscript-to-pdf conversion
using ghostscript utilities, and that is where the letters in question are lost.

The attachments contain a postscript file with only the “Short I” letter in
Arial, and the pdf file made from it by ghostscript, with this letter lost.

Input postscript file was produced by openoffice.org.
Conversion was made via: gs -sDEVICE=pdfwrite -sOutputFile=ShortI.pdf ShortI.ps

Environment:
Ghostscript: 8.64.dfsg.1-0ubuntu8
Openoffice.org: OOO310m19. Build 9420
Fonts: installed by distribution, DejaVu Sans of the latest version was also
tried
Cups: 1.3.9-17ubuntu3.4. People have tried a couple of cups versions. Generally,
the problem starts with the cups version that began using the ps2pdf conversion
provided by ghostscript
OS: Kubuntu 9.04

Here are some bug reports on the issue:
https://bugs.launchpad.net/bugs/449255
http://www.openoffice.org/issues/show_bug.cgi?id=106833

There are very similar issues with other characters and fonts that may
be caused by this problem:
http://www.openoffice.org/issues/show_bug.cgi?id=104050
https://bugs.launchpad.net/openoffice/+bug/376953
http://www.openoffice.org/issues/show_bug.cgi?id=105631

P.S.
How could I send the attachment?
Comment 1 Alex 2009-11-19 07:34:02 UTC
Created attachment 5692
There is input postscript file
Comment 2 Alex 2009-11-19 07:34:38 UTC
Created attachment 5693
The output pdf file
Comment 3 Ken Sharp 2009-11-19 08:38:47 UTC
I am unclear on how you are testing this. The file you have supplied 'The output
file.pdf' displays a single glyph on Acrobat Reader 4 and 5, and Acrobat
Professional 7 and 9.

Converting the file 'There is input postscript file.ps' to PDF using Acrobat
Distiller 9 produces a PDF file with a single identical glyph.

In short, I cannot see a problem with the PDF file produced by pdfwrite.
Comment 4 Alex 2009-11-19 10:56:54 UTC
Thank you for the reply.

You're right, Adobe Reader shows glyph correctly. I used another pdf renderer
that didn't show it.

So, the postscript file from openoffice.org seems to be valid. Unfortunately this
doesn't solve the problem: the “Short I” doesn't print correctly.

Maybe the bug is in the further processing, which finishes with pdf-to-postscript
conversion, after which the result is sent to the printer. I tried to model this
process omitting intermediate transformations, leaving only postscript-pdf-
by:

gs -sDEVICE=pswrite -sOutputFile=ShortI.pdf-gs.ps ShortI.pdf

No glyph for “Short I” was shown in my renderer for ShortI.pdf-gs.ps.
To verify it against  Adobe Reader I converted current result, ShortI.pdf-

gs -sDEVICE=pdfwrite -sOutputFile=ShortI.pdf-gs.pdf ShortI.pdf-gs.ps

Adobe Reader didn't show “Short I” in the resulting pdf, ShortI.pdf-gs.pdf, either.

What is special about the pdf initially created by ghostscript from the input
postscript file, such that it cannot be converted back into postscript?
Comment 5 Alex 2009-11-19 10:59:10 UTC
Created attachment 5695
Output after postscript-pdf-postscript
Comment 6 Alex 2009-11-19 11:12:45 UTC
Somehow my previous message got messed up. In short, the conversion chain
"postscript then pdf then postscript" done using ghostscript doesn't seem to
produce ShortI in the final postscript. It isn't shown when that postscript is
rendered by my system renderer, nor by the gs command line renderer (on screen),
nor by Adobe Reader from a pdf made from this postscript by ghostscript.
Comment 7 Ken Sharp 2009-11-20 00:13:53 UTC
Since the problem occurs with rendering, and the PDF displays correctly in
Acrobat, this is probably not a PDF writer problem. It could be a font issue or
it could be a problem with the PDF interpreter.

Assigning to Alex initially to see if he can tell which component is the problem.
Comment 8 Alex Cherepanov 2009-11-21 07:28:41 UTC
PDF files generated by both Ghostscript and Distiller 5 render correctly
with -dRENDERTTNOTDEF flag.
Comment 9 Ken Sharp 2009-11-23 00:40:09 UTC
What's happening here is that the font is indeed being defined in such a way
that the glyph being used is a .notdef glyph:

/Encoding 256 array def
    0 1 255 {Encoding exch /.notdef put} for
Encoding 0 /glyph3 put
...
...
/CharStrings 2 dict dup begin
/.notdef 0 def
/glyph1 1 def
/glyph2 2 def
/glyph3 3 def
...
...

(ArialMTFID33HGSet2) cvn findfont 100 -100 matrix scale makefont setfont
<01>
show

The show operation uses the character code 0x01, the Encoding is set up with
position 0 being /glyph3 and all other positions being /.notdef. So as a result
we render the /.notdef glyph.
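
The lookup described above can be traced with a small model. This is only an illustrative sketch in Python, not Ghostscript code: it reproduces just the Encoding and CharStrings entries quoted in this comment (the elided "..." parts are left out), and the helper name `resolve` is made up for the example.

```python
NOTDEF = ".notdef"

# /Encoding 256 array def, with every slot initialised to /.notdef
encoding = [NOTDEF] * 256
# Encoding 0 /glyph3 put
encoding[0] = "glyph3"

# /CharStrings as quoted above: glyph name -> TrueType glyph index
charstrings = {NOTDEF: 0, "glyph1": 1, "glyph2": 2, "glyph3": 3}


def resolve(code):
    """Return (glyph name, glyph index) selected for a character code."""
    name = encoding[code]
    if name not in charstrings:
        # A name missing from CharStrings falls back to /.notdef.
        name = NOTDEF
    return name, charstrings[name]


# <01> show: code 0x01 sits in a slot that was never overwritten.
print(resolve(0x01))  # ('.notdef', 0)
```

Only code 0x00 reaches a named glyph in this model; every other code, including the 0x01 actually shown, resolves to /.notdef.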

The default behaviour for Ghostscript is to render TrueType /.notdef glyphs when
the input is PostScript, and *not* to render TrueType /.notdef glyphs when the
input is PDF. Hence why this works when you run the original PostScript, but
doesn't work when you run the PDF file.

We know from the original work on this issue (see bug #689757) that the rules
Acrobat uses on whether to render a /.notdef or not are incomprehensible. In
particular we know that making a font symbolic does not force display. I mention
this because I had thought that the fact that the font was symbolic was why
Acrobat displayed this one.

I believe the simple answer is that if you want to render PDF where real glyphs
are encoded as the /.notdef glyph you will have to set -dRENDERTTNOTDEF as Alex
noted above. This will *also* render notdef glyphs in files where the /.notdef
glyph is defined as a hollow rectangle, which will give rise to the 'hollow
boxes' complaint, which is why this switch defaults to disabled.
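
As a rough summary of the defaults described above, the decision can be written as a tiny function. This is a sketch of the behaviour as stated in this comment, not the actual Ghostscript logic; the function name and the string labels for the input kind are invented for illustration, and whether the flag has any effect on PostScript input is not covered here.

```python
def draws_tt_notdef(input_kind, render_tt_notdef=False):
    """Say whether a TrueType /.notdef glyph would be drawn.

    input_kind is "postscript" or "pdf" (labels chosen for this sketch);
    render_tt_notdef models passing -dRENDERTTNOTDEF on the command line.
    """
    if input_kind == "postscript":
        # PostScript input: /.notdef is rendered by default.
        return True
    if input_kind == "pdf":
        # PDF input: /.notdef is skipped unless the switch is set.
        return render_tt_notdef
    raise ValueError("unknown input kind: " + repr(input_kind))


for kind, flag in [("postscript", False), ("pdf", False), ("pdf", True)]:
    print(kind, flag, draws_tt_notdef(kind, flag))
```

With the switch off, the PDF case returns False, which is the configuration in which the "Short I" disappears.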

In case anyone wants to push back upstream with this, the file was created by
OpenOffice 3.1, and the subset font was created using "SunTypeTools-TT 1.0
gelf". According to the comments, the original font was:

%%Creator: SunTypeTools-TT 1.0 gelf
%- Font subset generated from a source font file:
'/usr/share/fonts/truetype/msttcorefonts/Arial.ttf'
%- Original font name: ArialMT
%- Original font family: Arial
%- Original font sub-family: Regular

Obviously I don't know what the original font looked like, and I haven't decoded
the sfnts array, but I very much doubt if it had a glyph called /.notdef which
was a real Cyrillic character.
Comment 10 Alex 2009-11-23 10:17:00 UTC
Thank you for such a detailed reply.

I tried to model the CUPS pipeline again, using
gs -dRENDERTTNOTDEF -sDEVICE=pswrite
for the final conversion. The result rendered Short I with both renderers I have,
even with the one (Okular) that doesn't render the letter in the input PDF file.
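
For anyone repeating this test, the two conversions can be spelled out as full command lines. The sketch below (Python, only building and printing the commands, not running them) uses the device names and the switch that appear in this thread; the helper `gs_command` is hypothetical, and the file names are taken from the earlier comments, so the output name of the final step is an assumption.

```python
import shlex


def gs_command(device, output, source, render_tt_notdef=False):
    """Build a gs argument list from the switches used in this report."""
    argv = ["gs"]
    if render_tt_notdef:
        argv.append("-dRENDERTTNOTDEF")
    argv += ["-sDEVICE=" + device, "-sOutputFile=" + output, source]
    return argv


# Step 1: postscript to pdf, as in the original description.
ps_to_pdf = gs_command("pdfwrite", "ShortI.pdf", "ShortI.ps")
# Step 2: pdf back to postscript, this time with the switch enabled.
pdf_to_ps = gs_command("pswrite", "ShortI.pdf-gs.ps", "ShortI.pdf",
                       render_tt_notdef=True)

for argv in (ps_to_pdf, pdf_to_ps):
    print(shlex.join(argv))
```

The printed lines can be pasted into a shell; the switch is only added to the second step because that is the one reading PDF.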

As I see it, there is no bug with this in ghostscript; the behavior can be chosen:
either you get Short I but may also get hollow rectangles, or you get no
Short I and no chance of rectangles.

Only specific thing I noted about original font, I mean Arial.ttf, is that the
Short I, Unicode 0419, is represented by combination of two other glyphs, the
base Unicode 0418, and the breve Unicode 0306. They're specified to be
components of the Short I. The font has “.notdef” glyph, it looks like the
famous hollow rectangle, font program complains on it: “Glyph 1295 is called
".notdef", a singularly inept choice of name (only glyph 0 may be called
.notdef)”
The other problematic letter, Be (Unicode 0431) in DejaVu Sans, has no components
defined, so this isn't the rule.

However, font and postscript details are matters where I'm not too strong.
Would I be wrong if I conclude from your description that Short I is
represented in the original postscript in a non-standard way, and thus it may
be rendered with some settings/renderers but may not with others? In other
words, is this way the reason of that the people are getting Short I and other
such encoded letters dropped from printouts in some configurations of CUPS
pipeline and aren't getting in others (in older systems)? And, is the technical
postscript-related reason may exist forcing representing letters in such a way?

Thank you again for the detailed description.
Comment 11 Ken Sharp 2009-11-24 01:24:19 UTC
>Only specific thing I noted about original font, I mean Arial.ttf, is that the
>Short I, Unicode 0419, is represented by combination of two other glyphs, the
>base Unicode 0418, and the breve Unicode 0306. They're specified to be
>components of the Short I. 

I assumed the original font was fine, the font being used in this case is a
subset font containing only the glyphs needed for the document. It appears to
have been created by the SunTypeTools-TT program.

>The font has “.notdef” glyph, it looks like the
>famous hollow rectangle, font program complains on it: “Glyph 1295 is
>called ".notdef", a singularly inept choice of name (only glyph 0 may be called
>.notdef)”

Yes, this is the problem. PostScript uses glyph names, TrueType uses numeric IDs
(GID). In both cases the font technology defines a glyph to be used when the
requested glyph is not present. PostScript calls this glyph '/.notdef', TrueType
defines it as GID 0. Of course we need a way to map PostScript glyph names to
TrueType GIDs, and what happens here is that the PostScript glyph named /.notdef
is not assigned to GID 0.

In fact the glyph named /.notdef is not a fallback glyph at all, it's a real
glyph, and that's where the problem arises.


>Would I be wrong if I conclude from your description that Short I is
>represented in the original postscript in a non-standard way, and thus it may
>be rendered with some settings/renderers but may not with others?

It's not so much non-standard as completely mad, see my comments above :-)
PostScript being a flexible programming language, it's technically possible and
theoretically legal, but it's not sensible.

In PostScript we always render the /.notdef glyph, because that's the way the
specification is written and mostly everyone sticks to the spec. In PDF,
however, although the spec is written so that the /.notdef glyph should be
rendered, Adobe Acrobat 'sometimes' (and I haven't been able to work out a rule
for this) doesn't render the /.notdef but instead leaves a gap equivalent to its
width.

This leads to complaints about 'hollow squares' or 'boxes'. Of course these are
technically correctly rendered, but Acrobat doesn't display them so we are seen
as incorrect.

This is what the RENDERTTNOTDEF flag is for, it defaults to 'don't render'
because that gets better equivalence with Acrobat, but in this case it means a
real glyph doesn't get drawn, because it has been given the name /.notdef.

>is this way the reason of that the people are getting Short I and other
>such encoded letters dropped from printouts in some configurations of CUPS
>pipeline and aren't getting in others (in older systems)?

The RENDERTTNOTDEF flag is relatively new, anyone running an older version of
Ghostscript won't have the flag, and in this case the behaviour is the same for
PostScript as PDF, the /.notdef glyph *is* rendered. So older systems will work
'correctly'.

>And, is the technical
>postscript-related reason may exist forcing representing letters in such a way?

There is no good reason for the glyph to be named /.notdef and this is the
source of all the problems. In fact there are good reasons *not* to name a glyph
as /.notdef: it's confusing at the very least, and if we tried to use a glyph
which was missing in the original font we would get the 'Short I' glyph instead
of the usual hollow square.

That of course is very difficult to spot when proofing, which is the point of TT
fonts using a more or less instantly recognisable 'error' glyph. (PostScript
fonts often use a 'space' for /.notdef, that is, no marks are made)
Comment 12 Smirnovsky Alexander 2009-11-26 11:24:20 UTC
Hello everyone!

I tried to use the option -dRENDERTTNOTDEF while converting ps-file obtained
from OpenOffice 3.1 to pdf-file (using ps2pdf) and my Okular did not show the
Russian letter Short I.
I use Ubuntu 9.10 x86_64, the version of ghostscript is 8.70.

Maybe such behavior is due to patching of ghostscript by the Debian team? I tried
to compile from the original source but without success.

The problem is solved when I use the ps2ps utility and then ps2pdf (without the
-dRENDERTTNOTDEF flag).
Comment 13 Alex 2009-11-26 23:25:38 UTC
To: Ken Sharp

Thank you for the comprehensive explanation.