Bug 707369

Summary: Ghostscript pdfwrite mangles Unicode characters in certain fonts/files
Product: Ghostscript
Reporter: James R Barlow <james>
Component: PDF Writer
Assignee: Default assignee <ghostpdl-bugs>
Status: RESOLVED WONTFIX
Severity: normal
CC: vincent-gs
Priority: P2
Version: 10.02.0
Hardware: PC
OS: Linux
Customer:
Word Size: ---
Attachments: input file

Description James R Barlow 2023-12-07 21:55:13 UTC
Created attachment 25077 [details]
input file

For the demonstration file in.pdf, programs such as poppler's pdftotext can extract Unicode text from the input file without issue. Other GUI PDF viewers can copy text correctly as well.

The first 3 lines read as follows:

$ pdftotext -enc UTF-8 in.pdf - | head -3

```
MAIF
CS 90000 - 79038 NIORT cedex 9
Société d'assurance mutuelle à cotisations variables
```

After processing with Ghostscript 10.02, the text is corrupted:

$ gs -sDEVICE=pdfwrite -o out.pdf in.pdf
$ pdftotext -enc UTF-8 out.pdf - | head -3

```
MAIF
CS 90000 - 79038 NIORT cedex 9
SociŽtŽ d'assurance mutuelle ˆ cotisations variables
```
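
The whole comparison can be scripted. Below is a minimal reproduction sketch in Python (the subprocess wrapper and filenames are illustrative; it assumes `gs` and `pdftotext` are on PATH):

```
# Minimal reproduction sketch: rewrite in.pdf with pdfwrite, then
# compare the first lines of text extracted from each file.
import subprocess

def first_lines(pdf, n=3):
    """Return the first n lines of UTF-8 text extracted from a PDF."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf, "-"],
        check=True, capture_output=True, text=True,
    )
    return "\n".join(result.stdout.splitlines()[:n])

subprocess.run(["gs", "-sDEVICE=pdfwrite", "-o", "out.pdf", "in.pdf"], check=True)
print(first_lines("in.pdf"))   # third line contains "Société" (correct)
print(first_lines("out.pdf"))  # third line contains "SociŽtŽ" (corrupted)
```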
Comment 1 Ken Sharp 2023-12-08 08:58:14 UTC

*** This bug has been marked as a duplicate of bug 704681 ***
Comment 2 Ken Sharp 2023-12-08 09:30:50 UTC
A quick perusal of the actual input file reveals that the font in question has no ToUnicode CMap. There is, therefore, no Unicode information for that font at all.

In the absence of a ToUnicode CMap copy/paste from PDF files is unreliable.

The file does contain other subsets of the Arial family: in fact, it contains 5 subsets of Arial-BoldMT, 2 of Arial-ItalicMT, and 6 of ArialMT. Of these, one of each face contains a ToUnicode CMap; the remaining 10 instances, and the Verdana subset, do not.
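
This kind of census can be verified by walking the page font resources and checking each font dictionary for a /ToUnicode key. A minimal sketch, assuming the pikepdf library (any toolkit with object-level PDF access would do; the report does not prescribe one):

```
# Sketch: list every font referenced from page resources and whether it
# carries a /ToUnicode CMap. Simplified: fonts referenced only from
# Form XObjects or annotations are not visited.
import pikepdf

with pikepdf.open("in.pdf") as pdf:
    seen = set()
    for page in pdf.pages:
        resources = page.obj.get("/Resources")
        fonts = resources.get("/Font") if resources is not None else None
        if fonts is None:
            continue
        for name in fonts.keys():
            font = fonts[name]
            if font.objgen in seen:  # skip font objects already reported
                continue
            seen.add(font.objgen)
            base = font.get("/BaseFont", "(unnamed)")
            print(base, "ToUnicode:", "/ToUnicode" in font)
```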

The output file contains 3 instances of Arial-BoldMT, one of Arial-ItalicMT, and 5 of ArialMT. Some of the subsets have been combined, resulting in a smaller file. However, this involves re-encoding the fonts, which is entirely legitimate. The appearance of the PDF file remains unchanged.

In the absence of a ToUnicode CMap any PDF consumer is reduced to guessing what the character codes represent. With your input file the 'unusual' glyphs are encoded in a font which uses a MacRoman encoding, and so they copy/paste as you expect. In the output file the font does not have a MacRoman encoding, and so the character codes are treated as ASCII.

The character code is 0x8E (octal 216), which in WinAnsi encoding is a Zcaron and in MacRoman encoding is an eacute. That is why you are getting a Zcaron instead of an eacute.
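
The clash is easy to demonstrate with Python's built-in codecs, with mac_roman and cp1252 standing in for MacRomanEncoding and WinAnsiEncoding (a sketch of the code-point collision, not of Ghostscript's behaviour):

```
# The same byte means different characters under the two encodings.
code = bytes([0x8E])             # character code 0x8E, octal 216
print(code.decode("mac_roman"))  # é (eacute) - what the original intends
print(code.decode("cp1252"))     # Ž (Zcaron) - the "SociŽtŽ" seen above
```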
Comment 3 Vincent Lefevre 2023-12-08 11:17:17 UTC
(In reply to Ken Sharp from comment #2)
> A quick perusal of the actual input file reveals that the font in question
> has no ToUnicode CMap. There is, therefore, no Unicode information for that
> font at all.
> 
> In the absence of a ToUnicode CMap copy/paste from PDF files is unreliable.

This is incorrect. An alternative way is to use the glyph name. This is not as reliable as a ToUnicode CMap, but it seems to work well in practice, at least under Linux, for most characters. I have many PDF files without a ToUnicode CMap (partly due to other bugs in Ghostscript, which does not always regenerate the ToUnicode CMap), and their characters can be interpreted without any problem by pdftotext and the various PDF readers I could try, except for some special math characters. That said, it is possible that they all do this via the poppler library, which has a table converting glyph names to Unicode (poppler/NameToUnicodeTable.h).

That said, I can't see glyph names such as "eacute" for "é" (in "Société") in the file provided by James (after uncompressing the streams with "qpdf --stream-data=uncompress"). I suspect that it is the absence of both a ToUnicode CMap and glyph names that causes the issue with Ghostscript.
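
To make the glyph-name fallback concrete, here is a toy version of the lookup (the three entries are a hand-copied excerpt of the Adobe Glyph List; poppler's real table in poppler/NameToUnicodeTable.h covers the full list):

```
# Toy glyph-name-to-Unicode fallback, usable only when the font's
# Encoding (or its /Differences array) supplies glyph names.
AGL_EXCERPT = {
    "eacute": "\u00e9",    # é
    "agrave": "\u00e0",    # à
    "ccedilla": "\u00e7",  # ç
}

def glyph_name_to_unicode(name):
    return AGL_EXCERPT.get(name)  # None when the name is unknown

print(glyph_name_to_unicode("eacute"))  # é
```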

> In the absence of a ToUnicode CMap any PDF consumer is reduced to guessing
> what the character codes represent. With your input file the 'unusual'
> glyphs are encoded in a font which uses a MacRoman encoding, and so they
> copy/paste as you expect. In the output file the font does not have a
> MacRoman encoding, and so the character codes are treated as ASCII.

But knowing the encoding of the fonts in the input PDF file, couldn't Ghostscript generate a ToUnicode CMap corresponding to this encoding (either by default or with a specific option)?
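
To illustrate what such a generated ToUnicode CMap could contain, here is a sketch using Python's mac_roman codec as a stand-in for MacRomanEncoding (an illustration of the suggestion, not of anything Ghostscript does):

```
# Sketch: derive ToUnicode bfchar entries for the high character codes
# of a font whose Encoding is MacRomanEncoding.
def bfchar_entries(codec="mac_roman"):
    entries = []
    for code in range(0x80, 0x100):
        ch = bytes([code]).decode(codec)
        entries.append(f"<{code:02X}> <{ord(ch):04X}>")
    return entries

for entry in bfchar_entries()[:3]:
    print(entry)
# <80> <00C4>   (Ä)
# <81> <00C5>   (Å)
# <82> <00C7>   (Ç)
```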

BTW, I also have such a file from MAIF (it has the same page layout, except that the margins are different; there are no glyph names either, and the fonts are all from the Arial family, except one, which is Verdana), but in my case:
* the accented characters are preserved by "gs -sDEVICE=pdfwrite" (in the sense that pdftotext still gives the expected ones);
* the PDF file is much smaller (45 KB);
* WinAnsi is used for all fonts, both in the original file and the one generated by Ghostscript.
Comment 4 Ken Sharp 2023-12-08 11:22:25 UTC
(In reply to Vincent Lefevre from comment #3)
> (In reply to Ken Sharp from comment #2)
> > A quick perusal of the actual input file reveals that the font in question
> > has no ToUnicode CMap. There is, therefore, no Unicode information for that
> > font at all.
> > 
> > In the absence of a ToUnicode CMap copy/paste from PDF files is unreliable.
> 
> This is incorrect. An alternative way is to use the glyph name.

Which is still unreliable, as I said. Only the presence of a ToUnicode CMap is reliable (and not even then, I've seen files with incorrect ToUnicode CMaps).

Using the Encoding or glyph names is **not** reliable, as I said.


> But knowing the encoding of the fonts in the input PDF file, couldn't
> Ghostscript generate a ToUnicode CMap corresponding to this encoding (either
> by default or with a specific option)?

Could, yes; not going to. That would be unjustified guessing. There's no reason to assume that the Encoding is relevant, or any better than just leaving the character codes untouched.
Comment 5 Vincent Lefevre 2023-12-08 11:45:40 UTC
(In reply to Ken Sharp from comment #4)
> > But knowing the encoding of the fonts in the input PDF file, couldn't
> > Ghostscript generate a ToUnicode CMap corresponding to this encoding (either
> > by default or with a specific option)?
> 
> Could, yes, not going to. That would be unjustified guessing. There's no
> reason  to assume that the Encoding is relevant or any better than just
> leaving the character codes untouched.

Well, according to https://stackoverflow.com/a/29468049/3782797, 5 encodings (StandardEncoding, MacRomanEncoding, WinAnsiEncoding, MacExpertEncoding, and PDFDocEncoding) are specified by the official PDF specification. So, IMHO, they should be honored (at least via an option).

Or perhaps, for fonts with one of these specified encodings, do not change the font to one that has a different encoding (such a change was what happened here: "With your input file the 'unusual' glyphs are encoded in a font which uses a MacRoman encoding [...] In the output file the font does not have a MacRoman encoding").
Comment 6 Ken Sharp 2023-12-08 11:56:31 UTC
(In reply to Vincent Lefevre from comment #5)

> > Could, yes, not going to. That would be unjustified guessing. There's no
> > reason  to assume that the Encoding is relevant or any better than just
> > leaving the character codes untouched.
> 
> Well, according to https://stackoverflow.com/a/29468049/3782797 5 encodings
> are specified by the official PDF specifications. So, IMHO, they should be
> honored (at least via an option).

No, because it's still just guessing. If there's no ToUnicode then anything else is a guess (in terms of using the Encoding to generate a ToUnicode).

We don't guess at ToUnicode CMaps.

 
> Or perhaps, for fonts with one of these specified encodings, do not change
> the font to one that has a different encoding

Then we wouldn't be able to 'consolidate' fonts, reducing the number of subset fonts and therefore the file size. We know from other reports that this is a desirable feature.

It's also not as easy to do this as you probably expect. The front and back ends are deliberately isolated from each other, so that different interpreters can be used with the same device.

If you want to supply a patch I'll take it under consideration, but we certainly won't do anything which makes this impossible.
Comment 7 Vincent Lefevre 2023-12-08 12:55:34 UTC
(In reply to Ken Sharp from comment #6)
> (In reply to Vincent Lefevre from comment #5)
> > Well, according to https://stackoverflow.com/a/29468049/3782797 5 encodings
> > are specified by the official PDF specifications. So, IMHO, they should be
> > honored (at least via an option).
> 
> No, because it's still just guessing. If there's no ToUnicode then anything
> else is a guess (in terms of using the Encoding to generate a ToUnicode).

I don't understand what you mean by "guessing". That's an encoding from the official PDF specification, so it already defines a mapping from a code to a character; from this point of view, a ToUnicode would not be needed. If the intent is to use a code without any mapping, then a custom encoding should be chosen.
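
In miniature, the mapping meant here is a two-step chain: character code to glyph name via the named base encoding, then glyph name to Unicode via the glyph list (single-entry excerpts, for illustration only):

```
# Two-step chain: code -> glyph name -> Unicode.
MACROMAN_EXCERPT = {0x8E: "eacute"}  # one entry of MacRomanEncoding
AGL_EXCERPT = {"eacute": "\u00e9"}   # one entry of the Adobe Glyph List

def code_to_unicode(code):
    name = MACROMAN_EXCERPT.get(code)
    return AGL_EXCERPT.get(name) if name else None

print(code_to_unicode(0x8E))  # é
```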

The only potentially problematic issue with generating a ToUnicode CMap is that it would add more specific information that was not present in the input file. But note that I suggested this could be added via an option, so that the user would ultimately decide what to do.

> > Or perhaps, for fonts with one of these specified encodings, do not change
> > the font to one that has a different encoding
> 
> Then we wouldn't be able to 'consolidate' fonts, reducing the number of
> subset fonts and therefore the file size. We know from other reports that
> this is a desirable feature.

Yes, in general. But in these reports, was the text interpretation broken like here?

IMHO, allowing correct copy-paste and searching for text is also a desirable feature.

In any case, I think that an option to add a ToUnicode CMap (or glyph names, if assumed to be safer in such a particular case) would be a better solution than trying to preserve the font encodings.
Comment 8 Ken Sharp 2023-12-08 13:08:47 UTC
(In reply to Vincent Lefevre from comment #7)

> I don't understand what you mean by "guessing". That's an encoding from the
> official PDF specification, so that it already defines a mapping from a code
> to a character, and from this point of view, a ToUnicode would not be
> needed. If the intent is to use a code without any mapping, then the custom
> encoding should be chosen.

I'm not going to continue this further. I don't plan to do this; if you want to see it done, then submit a patch. If, as I very strongly suspect, it causes problems, then I'll be able to give you hard evidence.



> Yes, in general. But in these reports, was the text interpretation broken
> like here?

The text interpretation is not broken. PDF was not originally intended to support copy/paste of text; if it had been, the ToUnicode CMap would have been made mandatory.

The ability to get Unicode mappings from a PDF file is not an original goal and is never guaranteed to be possible. PDF is intended for viewing, and the goal is that it should view the same across platforms. The text interpretation here works perfectly: the PDF is viewed correctly by other viewers.

 
> IMHO, allowing correct copy-paste and searching for text is also a desirable
> feature.
> 
> In any case, I think that an option to add a ToUnicode CMap (or glyph names,
> if assumed to be safer in such a particular case) would be a better solution
> than trying to preserve the font encodings.

I disagree. The ToUnicode CMap is more or less a guarantee that the mapping is correct, and can be relied upon, whereas using other information is guesswork.

You are essentially turning a heuristic attempt to guess at the Unicode values into a statement that these are the correct values.

As I said, I do not plan to continue arguing, because it's just a waste of my time. If you want to see this changed, go ahead and submit a patch for consideration.