Bug 691872 - Ghostscript loses umlaut characters when re-distilling PDFs using unembedded "MS Serif" or "MS Sans Serif" bitmap fonts
Summary: Ghostscript loses umlaut characters when re-distilling PDFs using unembedded...
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 8.71
Hardware: PC Windows Vista
Importance: P4 normal
Assignee: Alex Cherepanov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-01-07 14:37 UTC by pipitas
Modified: 2011-01-08 17:57 UTC
0 users

See Also:
Customer:
Word Size: ---


Attachments
Original (pre-damaged) PDF file using un-embedded bitmap fonts "MS Serif" and "MS Sans Serif" (14.76 KB, application/pdf)
2011-01-07 14:37 UTC, pipitas
Re-distilled by Ghostscript, embedded substitute fonts, but umlauts missing (16.88 KB, application/pdf)
2011-01-07 14:38 UTC, pipitas
Original PDF after editing in the missing "/Subtype /Type1" part (14.77 KB, application/pdf)
2011-01-07 14:40 UTC, pipitas
Ghostscript re-distilled the manually edited PDF, including umlauts correctly. (17.76 KB, application/pdf)
2011-01-07 14:41 UTC, pipitas
patch (426 bytes, patch)
2011-01-08 09:01 UTC, Alex Cherepanov

Description pipitas 2011-01-07 14:37:18 UTC
Created attachment 7102
Original (pre-damaged) PDF file using un-embedded bitmap fonts "MS Serif" and "MS Sans Serif"

Ghostscript loses umlaut characters when re-distilling PDFs using unembedded "MS Serif" or "MS Sans Serif" bitmap fonts

Executive Summary:
When Ghostscript re-distills PDFs that use the unembedded bitmap fonts "MS Serif" or "MS Sans Serif", all umlaut characters are lost. I discovered this with GS v8.62 and verified that it still happens with GS v8.71 and v9.00.
Details:

The PDFs in question are already damaged. They were produced by Adobe Acrobat's "PDFMaker" macro. Their fault is an empty /Subtype key in the /Font definition. When Ghostscript re-distills these PDFs using substitute fonts, it assumes /Subtype /Type1, but it fails to embed German umlauts correctly (and possibly other special characters).

However, if one adds the missing /Subtype /Type1 to the original PDF with a plain text editor, Ghostscript embeds the umlauts correctly. So, while this is not a Ghostscript problem per se (as the original files aren't kosher anyway), it would be nice if Ghostscript could repair those files completely instead of only partially.
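
The manual text-editor repair described above could be scripted roughly as follows. This is only a sketch in Python, not part of Ghostscript. It assumes the font dictionaries are flat (no nested << >> inside them) and stored uncompressed, and that the broken entry is either absent or written as an empty name ("/Subtype /"); the exact byte layout of the damaged files is an assumption here. The function name add_missing_subtype is made up for illustration. Like the manual edit, it does not fix the XRef table afterwards.

```python
import re

# A flat dictionary (no nested << >>) that contains "/Type /Font".
# The \b keeps "/Type /FontDescriptor" from matching.
FONT_DICT = re.compile(
    rb"<<(?:(?!<<|>>).)*?/Type\s*/Font\b(?:(?!<<|>>).)*?>>", re.DOTALL
)
# A /Subtype entry; group 1 is empty when the name is empty ("/Subtype /").
SUBTYPE = re.compile(rb"/Subtype\s*/(\w*)")
TYPE_FONT = re.compile(rb"/Type\s*/Font\b")


def add_missing_subtype(pdf_bytes, subtype=b"/Type1"):
    """Return pdf_bytes with a /Subtype added to font dicts that lack one."""

    def fix(match):
        body = match.group(0)
        found = SUBTYPE.search(body)
        if found and found.group(1):
            return body  # already has a non-empty Subtype, leave untouched
        if found:
            # Empty name: replace "/Subtype /" with a valid entry.
            return body[:found.start()] + b"/Subtype " + subtype + body[found.end():]
        # No Subtype at all: insert one right after "/Type /Font".
        return TYPE_FONT.sub(lambda m: m.group(0) + b" /Subtype " + subtype, body, count=1)

    return FONT_DICT.sub(fix, pdf_bytes)
```

Since the byte offsets change, the output has an inconsistent XRef table, so readers will have to reconstruct it, just as with the hand-edited file.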

How to reproduce:

1. Use Excel to create a table using the "MS Sans Serif" or "MS Serif" bitmap fonts. Use German umlauts in some strings.

2. Use the Acrobat PDFMaker macro to create a PDF/A ("ms-sans+serif.pdf"). The PDF will show up in Acrobat and in Reader pretending it's PDF/A.
 
3. However, the file will not validate as PDF/A. Acrobat Pro cannot convert it into a PDF/A that actually validates, and callas pdfToolbox fails at this task as well. Both applications report that they could not embed the missing fonts.

4. The PDF itself is damaged in the sense that it contains an empty "/Subtype" for the font(s). This is also shown by "pdffonts" from the XPDF-utils:

        pdffonts.exe "ms-sans+serif.pdf"
           name                                 type              emb sub uni object ID
           ------------------------------------ ----------------- --- --- --- ---------
           Error: Unknown font type: ''
           Error: Unknown font type: ''
           MS Sans Serif                        unknown           no  no  no      12  0
           MS Serif                             unknown           no  no  no      13  0

5. Ghostscript succeeds in embedding font substitutes (Times/Helvetica) when re-distilling the file with -dPDFSETTINGS=/prepress:

       gswin32c ^
          -Ic:\pa\gs\gs8.62\lib ^
          -o "GS_ms-sans+serif.pdf" ^
          -sDEVICE=pdfwrite ^
          -dPDFSETTINGS=/prepress ^
          "ms-sans+serif.pdf"
       GPL Ghostscript 8.62 (2008-02-29)
       Copyright (C) 2008 Artifex Software, Inc.  All rights reserved.
       This software comes with NO WARRANTY: see the file PUBLIC for details.
       Processing pages 1 through 1.
       Page 1
          **** Warning: Font missing required Subtype, /Type1 assumed.
       Substituting font Helvetica for MS Sans Serif.
       Can't find (or can't open) font file n019003l.pfb.
       Loading NimbusSanL-Regu font from c:\pa\gs\gs8.62\Resource/Font/NimbusSanL-Regu... 2860112 1319936 13553080 12233888 3 done.
          **** Warning: Font missing required Subtype, /Type1 assumed.
       Substituting font Times-Roman for MS Serif.
       Can't find (or can't open) font file n021003l.pfb.
       Loading NimbusRomNo9L-Regu font from c:\pa\gs\gs8.62\Resource/Font/NimbusRomNo9L-Regu... 2860112 1401289 13609632 12298497 3 done.
        
          **** This file had errors that were repaired or ignored.
          **** The file was produced by:
          **** >>>> Adobe PDF Library 9.0 <<<<
          **** Please notify the author of the software that produced this
          **** file that it does not conform to Adobe's published PDF
          **** specification.
        
       pdffonts.exe "GS_ms-sans+serif.pdf"
          name                                 type              emb sub uni object ID
          ------------------------------------ ----------------- --- --- --- ---------
          HJIJZI+Helvetica                     Type 1C           yes yes no      12  0
          KRVUQQ+Times-Roman                   Type 1C           yes yes no      14  0

The repair seems to have succeeded. However, (almost) all the umlaut characters are now missing from the file "GS_ms-sans+serif.pdf".
 
6a. Repair the original PDF "ms-sans+serif.pdf" with a text editor by inserting /Type1 as the value of the /Subtype key for the fonts (not caring about the resulting XRef inconsistency). Save it as "ms-sans+serif-type1-repaired.pdf".

As expected, pdffonts now complains about the damaged XRef table, but it reports the font types as Type 1:

       pdffonts "ms-sans+serif-type1-repaired.pdf"
          Error: PDF file is damaged - attempting to reconstruct xref table...
          name                                 type              emb sub uni object ID
          ------------------------------------ ----------------- --- --- --- ---------
          MS Sans Serif                        Type 1            no  no  no      12  0
          MS Serif                             Type 1            no  no  no      13  0

6b. Re-distill the repaired PDF ("ms-sans+serif-type1-repaired.pdf") with Ghostscript:

       gswin32c ^
         -Ic:\pa\gs\gs8.62\lib ^
         -o "GS_ms-sans+serif-type1-repaired.pdf" ^
         -sDEVICE=pdfwrite ^
         -dPDFSETTINGS=/prepress ^
         "ms-sans+serif-type1-repaired.pdf"
       GPL Ghostscript 8.62 (2008-02-29)
       Copyright (C) 2008 Artifex Software, Inc.  All rights reserved.
       This software comes with NO WARRANTY: see the file PUBLIC for details.
          **** Warning:  An error occurred while reading an XREF table.
          **** The file has been damaged.  This may have been caused
          **** by a problem while converting or transfering the file.
          **** Ghostscript will attempt to recover the data.
       Processing pages 1 through 1.
       Page 1
       Substituting font Helvetica for MS Sans Serif.
       Loading NimbusSanL-Regu font from c:\pa\gs\gs8.62\Resource/Font/NimbusSanL-Regu... 2860112 1321143 13573176 12241104 3 done.
       Substituting font Times-Roman for MS Serif.
       Loading NimbusRomNo9L-Regu font from c:\pa\gs\gs8.62\Resource/Font/NimbusRomNo9L-Regu... 2860112 1402519 13629728 12321258 3 done.
        
          **** This file had errors that were repaired or ignored.
          **** The file was produced by:
          **** >>>> Adobe PDF Library 9.0 <<<<
          **** Please notify the author of the software that produced this
          **** file that it does not conform to Adobe's published PDF
          **** specification.

The result of this re-distillation is OK (all umlauts appear correctly in "GS_ms-sans+serif-type1-repaired.pdf").

       pdffonts "GS_ms-sans+serif-type1-repaired.pdf"
          name                                 type              emb sub uni object ID
          ------------------------------------ ----------------- --- --- --- ---------
          HJIJZI+Helvetica                     Type 1C           yes yes no      12  0
          KRVUQQ+Times-Roman                   Type 1C           yes yes no      14  0

7. Repeat steps 6a+6b, but this time edit /Subtype /TrueType into the damaged original file. The result is also OK, with all umlauts appearing correctly.
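
The pdffonts checks in steps 4 to 6 can be automated when many files need screening. The Python sketch below parses captured pdffonts output by column position, taking the column boundaries from the line of dashes, and lists fonts whose type is "unknown" or which are not embedded. It assumes the table layout shown in the listings above; the function names parse_pdffonts and suspicious are made up for illustration and are not part of the XPDF utilities.

```python
import re

COLUMNS = ("name", "type", "emb", "sub", "uni", "object_id")


def parse_pdffonts(output):
    """Parse the text table printed by pdffonts into a list of dicts."""
    lines = output.splitlines()
    # The ruler is the line made only of dashes and spaces; its runs of
    # dashes give the start and end position of every column.
    ruler_idx = next(
        (i for i, ln in enumerate(lines)
         if ln.strip() and set(ln.strip()) <= {"-", " "}),
        None,
    )
    if ruler_idx is None:
        return []
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_idx])]
    fonts = []
    for ln in lines[ruler_idx + 1:]:
        # Skip blank lines and interleaved "Error: ..." messages.
        if not ln.strip() or ln.lstrip().startswith("Error:"):
            continue
        cells = [ln[start:end].strip() for start, end in spans]
        fonts.append(dict(zip(COLUMNS, cells)))
    return fonts


def suspicious(fonts):
    """Names of fonts with an unknown type or without embedding."""
    return [f["name"] for f in fonts
            if f["type"] == "unknown" or f["emb"] != "yes"]
```

Slicing by column position rather than splitting on whitespace matters here because both the font names ("MS Sans Serif") and the types ("Type 1C") contain spaces.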
Comment 1 pipitas 2011-01-07 14:38:55 UTC
Created attachment 7103
Re-distilled by Ghostscript, embedded substitute fonts, but umlauts missing
Comment 2 pipitas 2011-01-07 14:40:29 UTC
Created attachment 7104
Original PDF after editing in the missing "/Subtype /Type1" part
Comment 3 pipitas 2011-01-07 14:41:51 UTC
Created attachment 7105
Ghostscript re-distilled the manually edited PDF, including umlauts correctly.
Comment 4 Alex Cherepanov 2011-01-08 09:01:59 UTC
Created attachment 7106
patch

Fix the code that repairs missing or incorrect /Subtype attribute. Write
a valid attribute (always /Type1) into the font resource. This helps to 
avoid confusion caused by an invalid value later on.
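
The patch itself is only available as an attachment and is not reproduced here. As an illustration of the behaviour described, the following Python sketch models a font resource as a plain dict and writes a valid /Subtype back into it when the value is missing or not one of the font subtypes defined by the PDF specification. It is a model of the idea, not Ghostscript code; the names repair_subtype and VALID_SUBTYPES are hypothetical.

```python
# Font subtypes defined by the PDF specification.
VALID_SUBTYPES = {
    "Type0", "Type1", "MMType1", "Type3",
    "TrueType", "CIDFontType0", "CIDFontType2",
}


def repair_subtype(font):
    """Return a copy of the font resource with a valid Subtype.

    A missing or invalid Subtype is replaced by "Type1", so code that
    reads the resource later never sees the invalid value.
    """
    if font.get("Subtype") in VALID_SUBTYPES:
        return font
    repaired = dict(font)
    repaired["Subtype"] = "Type1"
    return repaired
```

The point of storing the repaired value, rather than only assuming it at one place, is that every later consumer of the resource sees the same, valid Subtype.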
Comment 5 Alex Cherepanov 2011-01-08 17:57:01 UTC
The patch has been committed as rev. 12011.