Bug 704681

Summary: incorrect ToUnicode CMap interpretation, so that pdfwrite generates an incorrect ToUnicode CMap
Product: Ghostscript Reporter: Vincent Lefevre <vincent-gs>
Component: PDF Interpreter Assignee: Ken Sharp <ken.sharp>
Status: UNCONFIRMED ---    
Severity: normal CC: bruno.n.pagani, james
Priority: P4    
Version: master   
Hardware: PC   
OS: Linux   
Customer: Word Size: ---
Attachments: testcase

Description Vincent Lefevre 2021-11-03 03:35:07 UTC
Created attachment 21839
testcase

Following bug 704478 and bug 704674, there still seems to be an issue with the current master (f3d80c26c4916ba112bf1365d2d03e6473c542a9).

The LaTeX source:

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\pdfglyphtounicode{Scaron}{0160}
\pdfgentounicode=1
\begin{document}
\thispagestyle{empty}
'ê
\end{document}

Compiled with pdflatex from TeX Live 2020 (so that the only ToUnicode entry should be for Scaron), I get a PDF file (chartest7.pdf, attached) for which pdftotext gives "’ê" as expected. But running ps2pdf on this file produces a PDF file for which pdftotext gives "Šê", which is unexpected.

Note that both \pdfglyphtounicode{Scaron}{0160} and the ê character have an influence on the ps2pdf result.
Comment 1 Vincent Lefevre 2021-11-03 03:53:56 UTC
It seems that using a ligature (such as ff, as in bug 704674 comment 4) now makes the bug disappear. But perhaps the cause is that a ToUnicode CMap is no longer generated when a \pdfglyphtounicode entry has more than 2 bytes. If so, the fix for ligatures with an associated \pdfglyphtounicode merely hid the bug in those particular cases, and the real bug I was seeing is still there.
Comment 2 Vincent Lefevre 2021-11-03 09:57:58 UTC
(In reply to Vincent Lefevre from comment #0)
> Compiled with pdflatex from TeX Live 2020 (so that the only ToUnicode entry
> should be for Scaron), [...]

Actually, according to the output of "qpdf --stream-data=uncompress", pdflatex emits 2 entries:

0 beginbfrange
endbfrange
2 beginbfchar
<20> <2423>
<92> <0160>
endbfchar

and ps2pdf generates

1 beginbfrange
<27><27><0160>
endbfrange

Note that if I replace \pdfglyphtounicode{Scaron}{0160} with \pdfglyphtounicode{A}{0041}, I get

0 beginbfrange
endbfrange
2 beginbfchar
<20> <2423>
<41> <0041>
endbfchar

and ps2pdf generates

1 beginbfrange
<27><27><2019>
endbfrange

which is OK.

If I replace the "ê" character with an "e", pdflatex gives the same CMap, but ps2pdf doesn't emit a CMap (so everything is fine, because the tools know how to interpret /quoteright).
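
To make the expected result concrete, here is a minimal Python sketch (not Ghostscript code) that parses the bfchar entries emitted by pdflatex and looks up the character code used for the quote. The names `parse_bfchar` and `GLYPH_TO_UNICODE` are hypothetical and exist only for this illustration; the code 0x27 for /quoteright is the one that appears in the ps2pdf output above.

```python
import re

# bfchar section emitted by pdflatex (copied from this comment)
SOURCE_CMAP = """\
2 beginbfchar
<20> <2423>
<92> <0160>
endbfchar
"""

# Hypothetical fallback table: glyph name -> Unicode code point, used when
# the ToUnicode CMap has no entry for a character code.
GLYPH_TO_UNICODE = {"quoteright": 0x2019}


def parse_bfchar(cmap_text):
    """Return {character code: code point} for single-code-point bfchar entries."""
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    return {int(src, 16): int(dst, 16) for src, dst in pairs}


to_unicode = parse_bfchar(SOURCE_CMAP)

# The quote is drawn with character code 0x27, which has no ToUnicode entry,
# so the lookup falls back to the glyph name.
code = 0x27
result = to_unicode.get(code, GLYPH_TO_UNICODE["quoteright"])
print(f"U+{result:04X}")
```

Under these assumptions the lookup for code 0x27 never touches the <92> <0160> entry, which is why a mapping from <27> to <0160> in the output looks wrong.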
Comment 3 Vincent Lefevre 2021-11-03 10:48:39 UTC
Just guessing... It seems that Ghostscript assumes that /quoteright corresponds to <92> (= 146) as in Windows-1252 (WinAnsiEncoding?), and thus takes the <92> <0160> entry from the source ToUnicode CMap to generate its own ToUnicode CMap. Hence the Š (which is U+0160) in place of the right quote.

But the PDF generated by pdflatex has "39 /quoteright" in its /Differences, so that the above assumption would be incorrect.
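
The guess above can be sketched in Python as follows. This only illustrates the suspected mix-up and is not Ghostscript's actual code; `WINANSI_CODE`, `suspected_lookup` and `expected_lookup` are hypothetical names, and only the table entries relevant to this testcase are included.

```python
# ToUnicode entries from the pdflatex-generated PDF (see comment 2)
SOURCE_TO_UNICODE = {0x20: 0x2423, 0x92: 0x0160}

# Font encoding of the source PDF: its /Differences contains "39 /quoteright"
DIFFERENCES = {39: "quoteright"}

# In WinAnsiEncoding, /quoteright sits at code 0x92 (146)
WINANSI_CODE = {"quoteright": 0x92}


def suspected_lookup(code):
    """Suspected behaviour: go through the glyph name and its WinAnsi code."""
    name = DIFFERENCES[code]
    return SOURCE_TO_UNICODE.get(WINANSI_CODE[name])


def expected_lookup(code):
    """Expected behaviour: index the ToUnicode CMap by the character code itself."""
    return SOURCE_TO_UNICODE.get(code)


assert suspected_lookup(39) == 0x0160  # Š, picked up via code 0x92
assert expected_lookup(39) is None     # code 0x27 has no ToUnicode entry
```

If the guess is right, the first function reproduces the Š seen in the ps2pdf output, while the second leaves code 0x27 without a ToUnicode entry, so that readers fall back to /quoteright.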
Comment 4 Vincent Lefevre 2022-03-31 13:53:30 UTC
Under Debian/unstable, this bug is still reproducible with the ghostscript 9.55.0~dfsg-3 Debian package, but no longer reproducible with the ghostscript 9.56.0~dfsg-1 Debian package. So it seems that this only remaining bug has been fixed.

And I couldn't see any regression on my other ToUnicode CMap tests.
Comment 5 Ken Sharp 2022-03-31 14:02:08 UTC
(In reply to Vincent Lefevre from comment #4)
> Under Debian/unstable, this bug is still reproducible with the ghostscript
> 9.55.0~dfsg-3 Debian package, but no longer reproducible with the
> ghostscript 9.56.0~dfsg-1 Debian package. So it seems that this only
> remaining bug has been fixed.

No, nothing has changed here.
Comment 6 Ken Sharp 2022-03-31 14:03:39 UTC
(In reply to Ken Sharp from comment #5)
> (In reply to Vincent Lefevre from comment #4)
> > Under Debian/unstable, this bug is still reproducible with the ghostscript
> > 9.55.0~dfsg-3 Debian package, but no longer reproducible with the
> > ghostscript 9.56.0~dfsg-1 Debian package. So it seems that this only
> > remaining bug has been fixed.
> 
> No, nothing has changed here.

Ah, except that we're using a totally different PDF interpreter, and the ToUnicode stuff is basically less functional. So you're probably falling back to no ToUnicode at all or something.

In any event, this is yet to be addressed.
Comment 7 Vincent Lefevre 2022-03-31 14:40:41 UTC
(In reply to Ken Sharp from comment #6)
> Ah, except that we're using a totally different PDF interpreter, and the
> ToUnicode stuff is basically less functional. So you're probably falling
> back to no ToUnicode at all or something.

Indeed, if I run "ps2pdf chartest7.pdf out.pdf" and inspect both PDF files after "qpdf --stream-data=uncompress", there is a ToUnicode CMap in chartest7.pdf but not in out.pdf. That is fine for me: the file contains only usual characters, so a CMap is not needed (PDF readers can get the characters from their default rules).

And with -dNEWPDF=false (to use the old PDF interpreter), the bug reappears.
As a reminder, the issue was that I got an *incorrect* CMap.

So the bug was actually in the PDF interpreter. I've updated the bug title and the component from PDF Writer to PDF Interpreter to match the real cause.

> In any event, this is yet to be addressed.

Yes, but I think that this would be a new enhancement. Until now, I don't think that I have ever needed a CMap: the one in the PDF file is there only because pdflatex now generates one by default, and I believe that in most cases it is not needed.

So it is fine if this bug is closed (I suppose that there will be no attempt to fix the old PDF interpreter).
Comment 8 Ken Sharp 2022-03-31 14:43:34 UTC
(In reply to Vincent Lefevre from comment #7)

> So the bug was actually in the PDF interpreter. I've updated the bug title
> and the component from PDF Writer to PDF Interpreter to match the real cause.

I'd rather you didn't, since the issue is not as simple as that.


> So it is fine if this bug is closed (I suppose that there will be no attempt
> to fix the old PDF interpreter).

No, we won't be addressing the old interpreter. However, the fact that the current code does not emit a ToUnicode CMap is incorrect and needs to be fixed, at which point the problem will reappear. So can we just leave this bug alone now, please?
Comment 9 Vincent Lefevre 2023-11-24 15:53:53 UTC
I can no longer reproduce this issue with either:
* ghostscript 10.0.0~dfsg-11+deb12u2 in Debian/stable (12.2)
* ghostscript 10.02.1~dfsg-1 in Debian/unstable

i.e. I get ’ê as expected.

From what I can see, Ghostscript no longer generates a ToUnicode CMap on this testcase.

I've also tested other files from which I had derived this simple testcase, and I can no longer see any issue either.

Has the bug eventually been fixed?
Comment 10 Vincent Lefevre 2023-11-24 15:56:11 UTC
Hmm... Actually the issue was visible only with the old PDF interpreter, which has now been dropped entirely.
Comment 11 Ken Sharp 2023-11-24 16:04:09 UTC
(In reply to Vincent Lefevre from comment #9)

> I've also tested other files from which I had derived this simple testcase,
> and I can no longer see any issue either.
> 
> Has the bug eventually been fixed?

Not specifically, but there have been changes.


(In reply to Vincent Lefevre from comment #10)
> Hmm... Actually the issue was visible only with the old PDF interpreter,
> which has now been dropped entirely.

Well, we don't want to close the bug for now. Improving the whole ToUnicode preservation/generation is still a feature we want to work on, if time ever permits, and this bug remains part of the tests for eventually doing that.
Comment 12 Ken Sharp 2023-12-08 08:58:14 UTC
*** Bug 707369 has been marked as a duplicate of this bug. ***