Bug 704674 - pdfwrite and the PDF interpreter do not preserve ToUnicode values > 2 bytes
Summary: pdfwrite and the PDF interpreter do not preserve ToUnicode values > 2 bytes
Status: UNCONFIRMED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: unspecified
Hardware: PC Linux
Importance: P4 enhancement
Assignee: Ken Sharp
URL:
Keywords:
Duplicates: 706533
Depends on:
Blocks:
 
Reported: 2021-10-28 12:05 UTC by Vincent Lefevre
Modified: 2023-04-04 08:23 UTC
CC List: 1 user

See Also:
Customer:
Word Size: ---


Attachments
testcase (24.68 KB, application/pdf)
2021-10-28 12:05 UTC, Vincent Lefevre
minimal testcase (20.27 KB, application/pdf)
2021-11-01 13:53 UTC, Vincent Lefevre

Description Vincent Lefevre 2021-10-28 12:05:18 UTC
Created attachment 21824
testcase

The ToUnicode CMap of PDF files is no longer preserved by pdfwrite. On the attached testcase chartest5a-tl2021.pdf, pdftotext gives:

Test: « don’t finite float offer affine ».

But on the PDF file generated by "ps2pdf chartest5a-tl2021.pdf out.pdf" with Ghostscript from master (after bug 704478 has been fixed), pdftotext gives:

Test: ń donŠt Ąnite Ćoat offer affine ż.

On the PDF file generated by ps2pdf from Ghostscript 9.27 (Debian/oldstable), pdftotext gives:

Test: « don’t finite float offer affine ».

as expected, showing a regression with the latest Ghostscript versions (including 9.53.3).
Comment 1 Ken Sharp 2021-10-28 12:55:56 UTC
(In reply to Vincent Lefevre from comment #0)

> The ToUnicode CMap of PDF files is no longer preserved by pdfwrite.

I don't believe it ever did. The ToUnicode CMap is not preserved, it is regenerated.

So I'm uncertain whether you want to report this as a bug (I see you've gone back quite a bit further in versions) or as an enhancement.


> attached testcase chartest5a-tl2021.pdf, pdftotext gives:
> 
> Test: « don’t finite float offer affine ».

Don't bother to change anything now, but it would help in future to keep files simple. If you've got multiple characters then multiple single files would be nicer. Extraneous characters have to be removed to make the file easier to follow.
Comment 2 Vincent Lefevre 2021-10-28 13:20:04 UTC
(In reply to Ken Sharp from comment #1)
> (In reply to Vincent Lefevre from comment #0)
> > The ToUnicode CMap of PDF files is no longer preserved by pdfwrite.
> 
> I don't believe it ever did. The ToUnicode CMap is not preserved, it is
> regenerated.

Well, there is a regression compared to Ghostscript 9.27. Then, I don't know the exact cause. But note that the bug seems to occur *only* when there is some *particular* ToUnicode CMap: I couldn't see any issue on PDF files generated by TeX Live before 2021 (even when using my own ToUnicode CMap).

> > attached testcase chartest5a-tl2021.pdf, pdftotext gives:
> > 
> > Test: « don’t finite float offer affine ».
> 
> Don't bother to change anything now, but it would help in future to keep
> files simple. If you've got multiple characters then multiple single files
> would be nicer. Extraneous characters have to be removed to make the file
> easier to follow.

I could try to produce simple testcases with single characters. But note that some breakages occur only due to multiple characters. For instance, with

  Test: float.

"fl" is correctly handled by Ghostscript (from master). But with

  Test: « don’t finite float offer affine ».

"fl" is transformed to "Ć" by the same Ghostscript version. So I'm a bit confused (I don't know what TeX Live does exactly and don't have a tool to get ToUnicode CMap information).
Comment 3 Ken Sharp 2021-10-29 08:43:44 UTC
(In reply to Vincent Lefevre from comment #2)

> I could try to produce simple testcases with single characters. But note
> that some breakages occur only due to multiple characters. For instance, with
> 
>   Test: float.
> 
> "fl" is correctly handled by Ghostscript (from master). But with
> 
>   Test: « don’t finite float offer affine ».
> 
> "fl" is transformed to "Ć" by the same Ghostscript version. So I'm a bit
> confused (I don't know what TeX Live does exactly and don't have a tool to
> get ToUnicode CMap information).

That's because it mattered whether the file used a glyph which could be represented by 2 bytes or not. To observe the problem you needed one which could not be so represented.

When I say 'simpler' I don't necessarily mean a single glyph (though that's the ideal), but wrapping the text required to demonstrate the problem in a bunch of other text (e.g. Test:) to make it comprehensible actually makes it harder to debug.

Commit 8f62213019bc682eeb0ed9467d8841f3770cfda6 fixes the 'bug' portion of this.


The pdfwrite device and PDF interpreter combination cannot currently pass ToUnicode CMap information which requires more than 2 bytes to represent. This is because it was written against the original definition of ToUnicode CMaps, which differed significantly from the current definition.
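
For illustration only, here is a minimal Python sketch (not Ghostscript code) of what "more than 2 bytes" means for a ToUnicode entry. ToUnicode destination values are UTF-16BE strings, so a single BMP character such as U+2019 takes 2 bytes, while a ligature mapped to "ff" takes 4. The function name bfchar_entry and the character codes used are hypothetical, chosen just for this example.

```python
def bfchar_entry(code: int, text: str) -> str:
    """Format one bfchar-style line: <source code> <UTF-16BE destination>."""
    dst = text.encode("utf-16-be").hex().upper()
    return f"<{code:02X}> <{dst}>"


# Hypothetical character codes, for illustration only.
entries = {0x27: "\u2019", 0x0B: "ff"}

for code, text in entries.items():
    size = len(text.encode("utf-16-be"))
    note = "needs more than 2 bytes" if size > 2 else "fits in 2 bytes"
    print(bfchar_entry(code, text), "->", size, "bytes,", note)
```

The second entry is the kind of value that the current code path cannot pass through.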

Altering this will require changes to both the PDF interpreter and the pdfwrite device; we won't be making those changes in the current PDF interpreter because it is obsolescent. I've discussed it with the engineer doing the font work on the new PDF interpreter and we'll address this when we can. Again, this won't be soon: at a fairly unreliable guess, there is no likelihood of this being looked at again before mid-2022.

In the meantime I'm moving this to be an enhancement and altering the title.
Comment 4 Vincent Lefevre 2021-11-01 13:53:22 UTC
Created attachment 21836
minimal testcase

Attached: a minimal testcase obtained with pdflatex on

\documentclass{article}
\usepackage{lmodern}
\pdfglyphtounicode{Scaron}{0160}
\pdfglyphtounicode{ff}{0066 0066}
\pdfgentounicode=1
\begin{document}
\thispagestyle{empty}
' ff
\end{document}

(with these \pdfglyphtounicode commands, the TeX Live version should no longer matter).

On this testcase, the ’ character (U+2019 RIGHT SINGLE QUOTATION MARK) is changed to Š (U+0160 LATIN CAPITAL LETTER S WITH CARON).
Comment 5 Ken Sharp 2021-11-01 14:09:52 UTC
(In reply to Vincent Lefevre from comment #4)

> On this testcase, the ’ character (U+2019 RIGHT SINGLE QUOTATION MARK) is
> changed to Š (U+0160 LATIN CAPITAL LETTER S WITH CARON).

Since the commit referenced in comment #3 pdfwrite no longer, for me at least, writes a ToUnicode CMap for the font in this file. Thus the character is not changed (except inasmuch as there is now no ToUnicode information).

Acrobat (X Pro) is capable of finding the quoteright in the output PDF file by searching for an apostrophe, and of finding the ff ligature by searching for ff or f.

I believe, as I said in comment #3, that this is the same behaviour seen in earlier versions of Ghostscript; when we need more than 2 bytes to store the ToUnicode value we do not preserve the ToUnicode information.

So as I said, the bug portion of this is complete. The enhancement is to alter both the PDF interpreter and the pdfwrite device so that we can pass ToUnicode information which uses more than 2 bytes for the representation.
Comment 6 Vincent Lefevre 2021-11-01 21:49:04 UTC
(In reply to Ken Sharp from comment #5)
> (In reply to Vincent Lefevre from comment #4)
> 
> > On this testcase, the ’ character (U+2019 RIGHT SINGLE QUOTATION MARK) is
> > changed to Š (U+0160 LATIN CAPITAL LETTER S WITH CARON).
> 
> Since the commit referenced in comment #3 pdfwrite no longer, for me at
> least, writes a ToUnicode CMap for the font in this file. Thus the character
> is not changed (except in as much as there is now no ToUnicode information).

Sorry, I hadn't noticed that this commit was a new one (git hashes don't give date/ordering information, so this wasn't obvious), and I was testing against a slightly older version: the one I used for this bug report. The interesting thing is that the original testcase is also handled correctly with this commit:

zira% ps2pdf chartest5a-tl2021.pdf out.pdf
zira% pdftotext out.pdf -
Test: « don’t finite float offer affine ».

But indeed, there are issues with my usual .tex files (which use math symbols) when I include glyphtounicode.tex; however, this was already the case when I tried to use it one year ago (and I thought I had made some mistake, so I didn't go further at that time).

> I believe, as I said in comment #3, that this is the same behaviour seen in
> earlier versions of Ghostscript; when we need more than 2 bytes to store the
> ToUnicode value we do not preserve the ToUnicode information.

I'm confused by the "more than 2 bytes to store the ToUnicode value". It seems that in LaTeX,

\pdfglyphtounicode{ffi}{0066 0066 0069}

does not yield any issue (or perhaps it appears to work just by luck), while if I understand correctly, there are 3 or 6 bytes (66 66 69 or 00 66 00 66 00 69?).

I could also check that

\pdfglyphtounicode{lessorequalslant}{2A7D}
\pdfglyphtounicode{greaterorequalslant}{2A7E}

and some other similar ones seem to be handled correctly after ps2pdf (do they need 2 bytes, or 3 bytes if UTF-8 is used?).

So it may be possible that with the above commit, glyphtounicode.tex could be simplified just a bit to make the pdflatex output work with Ghostscript.

Thanks a lot for the information.
Comment 7 Ken Sharp 2021-11-02 08:11:30 UTC
(In reply to Vincent Lefevre from comment #6)

> Sorry, I hadn't noticed that this commit was a new one (git hashes don't
> give date/ordering information, so this wasn't obvious)

No, but you can click that SHA and it will take you to the commit log, which has a date and time stamp.


> > I believe, as I said in comment #3, that this is the same behaviour seen in
> > earlier versions of Ghostscript; when we need more than 2 bytes to store the
> > ToUnicode value we do not preserve the ToUnicode information.
> 
> I'm confused by the "more than 2 bytes to store the ToUnicode value". It
> seems that in LaTeX,
> 
> \pdfglyphtounicode{ffi}{0066 0066 0069}
> 
> does not yield any issue (or perhaps it appears to work just by luck), while
> if I understand correctly, there are 3 or 6 bytes (66 66 69 or 00 66 00 66
> 00 69?).

There are 6 bytes 0x00 0x66 0x00 0x66 0x00 0x69
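
As a rough check of that byte count, the following Python sketch converts a \pdfglyphtounicode argument into the bytes it represents. It assumes each space-separated group is one 16-bit unit written as hex digits; the helper name glyphtounicode_bytes is hypothetical, and values above FFFF are not handled.

```python
def glyphtounicode_bytes(hex_arg: str) -> bytes:
    """Convert a value such as '0066 0066 0069' to big-endian 16-bit units.

    Assumption: every group is a single 16-bit unit (at most FFFF).
    """
    return b"".join(int(unit, 16).to_bytes(2, "big") for unit in hex_arg.split())


for arg in ("0066 0066 0069", "0066 0066", "0160"):
    data = glyphtounicode_bytes(arg)
    print(arg, "->", data.hex(" "), f"({len(data)} bytes)")
```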

As I keep saying, the code now does not emit a ToUnicode CMap in the output file. This is because the code was written to conform to the original definition of a ToUnicode CMap, which was not the same as a regular CMap and could not store more than 2 bytes as a Unicode code point.

And that is why the bug is still open as an enhancement. Resolving that will take work in both the PDF interpreter and the pdfwrite device and we don't have time to undertake it right at the moment.

Now, in the absence of a ToUnicode CMap a PDF consumer is left with the character code and the glyph name. The glyph names in this particular font are things like /quoteright, /ff, /ffi etc and the character codes happen to line up with the commonly used extended ASCII values.

I don't know which approach the consumers are using, and they may differ, but essentially yes, it's working due to luck (or heuristics if you prefer), just as it always did.
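
Purely as a sketch of what such a heuristic might look like (this is an assumption, not a description of any particular consumer), a reader could first try the glyph name against a table in the style of the Adobe Glyph List and then fall back to treating the character code as printable ASCII. The table excerpt and the function name guess_text are hypothetical.

```python
from typing import Optional

# Small excerpt in the style of the Adobe Glyph List; not the full list.
GLYPH_NAME_TO_TEXT = {
    "quoteright": "\u2019",
    "ff": "\uFB00",
    "ffi": "\uFB03",
}


def guess_text(glyph_name: str, char_code: int) -> Optional[str]:
    """Guess the text for a glyph when no ToUnicode CMap is present."""
    if glyph_name in GLYPH_NAME_TO_TEXT:
        return GLYPH_NAME_TO_TEXT[glyph_name]
    # Fall back to the character code if it is printable ASCII.
    if 0x20 <= char_code < 0x7F:
        return chr(char_code)
    return None
```

A consumer that also decomposes U+FB00 into "ff" for searching would match the Acrobat behaviour described in comment #5, but that step is not shown here.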


> \pdfglyphtounicode{lessorequalslant}{2A7D}
> \pdfglyphtounicode{greaterorequalslant}{2A7E}
> 
> and some other similar ones seem to be handled correctly after ps2pdf (do
> they need 2 bytes, or 3 bytes if UTF-8 is used?).

UTF-8 isn't relevant. Those are 2 byte code points.
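
To make the distinction concrete, this short Python sketch compares the two encodings for those code points. ToUnicode values are expressed in UTF-16BE, so the 3-byte UTF-8 length does not matter here. The function name encoded_sizes is hypothetical.

```python
def encoded_sizes(code_point: int) -> tuple:
    """Return (UTF-16BE length, UTF-8 length) in bytes for one code point."""
    ch = chr(code_point)
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-8"))


for name, cp in [("lessorequalslant", 0x2A7D), ("greaterorequalslant", 0x2A7E)]:
    utf16_len, utf8_len = encoded_sizes(cp)
    print(f"{name}: U+{cp:04X} UTF-16BE={utf16_len} bytes, UTF-8={utf8_len} bytes")
```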
Comment 8 Ken Sharp 2023-04-04 08:23:24 UTC
*** Bug 706533 has been marked as a duplicate of this bug. ***