Created attachment 21839: testcase

Following bug 704478 and bug 704674, there still seems to be an issue with the current master (f3d80c26c4916ba112bf1365d2d03e6473c542a9).

The LaTeX source:

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\pdfglyphtounicode{Scaron}{0160}
\pdfgentounicode=1
\begin{document}
\thispagestyle{empty}
'ê
\end{document}

Compiled with pdflatex from TeX Live 2020 (so that the only ToUnicode entry should be for Scaron), I get a PDF file (chartest7.pdf, attached) on which pdftotext gives "’ê" as expected. But ps2pdf on this PDF file generates a PDF file on which pdftotext gives "Šê", which is unexpected.

Note that both \pdfglyphtounicode{Scaron}{0160} and the ê character have an influence on the ps2pdf result.
It seems that using a ligature (such as ff, as in bug 704674 comment 4) now makes the bug disappear. But perhaps the cause is that a ToUnicode CMap is no longer generated due to a \pdfglyphtounicode with more than 2 bytes. In that case, the fix for ligatures with an associated \pdfglyphtounicode just made the bug no longer visible in such particular cases, while the real bug I was seeing is still there.
(In reply to Vincent Lefevre from comment #0)
> Compiled with pdflatex from TeX Live 2020 (so that the only ToUnicode entry
> should be for Scaron), [...]

Actually, with information obtained with "qpdf --stream-data=uncompress", there are 2 entries given by pdflatex:

0 beginbfrange
endbfrange
2 beginbfchar
<20> <2423>
<92> <0160>
endbfchar

and ps2pdf generates

1 beginbfrange
<27><27><0160>
endbfrange

Note that if I replace \pdfglyphtounicode{Scaron}{0160} with \pdfglyphtounicode{A}{0041}, I get

0 beginbfrange
endbfrange
2 beginbfchar
<20> <2423>
<41> <0041>
endbfchar

and ps2pdf generates

1 beginbfrange
<27><27><2019>
endbfrange

which is OK.

If I replace the "ê" character with an "e", pdflatex gives the same CMap, but ps2pdf doesn't emit a CMap (so everything is fine, because the tools know how to interpret /quoteright).
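To compare such CMap fragments without reading the hex by eye, a small parser can turn them into a code-to-character table. The following is only a minimal sketch: the function name parse_tounicode is made up for illustration, it assumes the stream has already been uncompressed (e.g. with qpdf), and it handles only the simple <src> <dst> and <lo> <hi> <dst> forms seen in this report, not the array form of bfrange.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def parse_tounicode(cmap_text):
    """Return {char_code: unicode_string} for simple bfchar/bfrange entries.

    Illustrative helper, not part of any PDF tool.
    """
    mapping = {}
    # bfchar sections hold pairs: <source code> <UTF-16BE destination>
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange sections hold triples: <low code> <high code> <first destination>
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            first = bytes.fromhex(dst).decode("utf-16-be")
            for code in range(int(lo, 16), int(hi, 16) + 1):
                # Successive codes increment the last character of the destination
                # (simplification: assumes it is a single BMP code unit).
                offset = code - int(lo, 16)
                mapping[code] = first[:-1] + chr(ord(first[-1]) + offset)
    return mapping

PDFLATEX_CMAP = """0 beginbfrange
endbfrange
2 beginbfchar
<20> <2423>
<92> <0160>
endbfchar"""

PS2PDF_CMAP = """1 beginbfrange
<27><27><0160>
endbfrange"""

print(parse_tounicode(PDFLATEX_CMAP))
print(parse_tounicode(PS2PDF_CMAP))
```

With the two fragments quoted above, the pdflatex table has entries for codes 0x20 and 0x92 only, while the ps2pdf table maps code 0x27 (the quoteright slot) to U+0160.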
Just guessing... It seems that Ghostscript assumes that /quoteright corresponds to <92> (= 146) like in Windows-1252 (WinAnsiEncoding?), thus it takes the <92> <0160> from the source ToUnicode CMap to generate its ToUnicode CMap. Hence the Š (which is U+0160) in place of the right quote. But the PDF generated by pdflatex has "39 /quoteright" in its /Differences, so that the above assumption would be incorrect.
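The guess above can be written down as a toy model. This is purely a sketch of the suspected behaviour, not Ghostscript code: the function names and the tiny tables are invented for illustration, and the only facts taken as given are that WinAnsiEncoding puts quoteright at code 146 (0x92) and that the usual Unicode value for the glyph name quoteright is U+2019.

```python
# Only the entries needed for this testcase are listed.
WINANSI_CODE = {"quoteright": 0x92}       # WinAnsiEncoding slot for the glyph name
GLYPH_TO_UNICODE = {"quoteright": "\u2019"}  # default Unicode value for the glyph name

def suspected_lookup(glyph_name, source_tounicode):
    """Hypothetical faulty path: index the source CMap by the glyph's WinAnsi code."""
    code = WINANSI_CODE.get(glyph_name)
    if code in source_tounicode:
        return source_tounicode[code]
    return GLYPH_TO_UNICODE[glyph_name]

def expected_lookup(char_code, differences, source_tounicode):
    """Index the source CMap by the code actually used in the font's encoding."""
    if char_code in source_tounicode:
        return source_tounicode[char_code]
    return GLYPH_TO_UNICODE[differences[char_code]]

# From the pdflatex output: /Differences contains "39 /quoteright".
differences = {39: "quoteright"}
# Source ToUnicode CMaps from the two variants of the testcase.
scaron_cmap = {0x20: "\u2423", 0x92: "\u0160"}
a_cmap = {0x20: "\u2423", 0x41: "A"}

print(suspected_lookup("quoteright", scaron_cmap))   # the Scaron variant
print(suspected_lookup("quoteright", a_cmap))        # the A variant
print(expected_lookup(39, differences, scaron_cmap)) # lookup by the real code
```

Under this model the Scaron variant yields U+0160 and the A variant yields U+2019, which matches the two ps2pdf CMaps observed, but it does not explain why the ê character also matters, so it is at best a partial explanation.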
Under Debian/unstable, this bug is still reproducible with the ghostscript 9.55.0~dfsg-3 Debian package, but no longer reproducible with the ghostscript 9.56.0~dfsg-1 Debian package. So it seems that this only remaining bug has been fixed. And I couldn't see any regression on my other ToUnicode CMap tests.
(In reply to Vincent Lefevre from comment #4)
> Under Debian/unstable, this bug is still reproducible with the ghostscript
> 9.55.0~dfsg-3 Debian package, but no longer reproducible with the
> ghostscript 9.56.0~dfsg-1 Debian package. So it seems that this only
> remaining bug has been fixed.

No, nothing has changed here.
(In reply to Ken Sharp from comment #5)
> (In reply to Vincent Lefevre from comment #4)
> > Under Debian/unstable, this bug is still reproducible with the ghostscript
> > 9.55.0~dfsg-3 Debian package, but no longer reproducible with the
> > ghostscript 9.56.0~dfsg-1 Debian package. So it seems that this only
> > remaining bug has been fixed.
>
> No, nothing has changed here.

Ah, except that we're using a totally different PDF interpreter, and the ToUnicode stuff is basically less functional. So you're probably falling back to no ToUnicode at all or something.

In any event, this is yet to be addressed.
(In reply to Ken Sharp from comment #6)
> Ah, except that we're using a totally different PDF interpreter, and the
> ToUnicode stuff is basically less functional. So you're probably falling
> back to no ToUnicode at all or something.

Indeed, if I do "ps2pdf chartest7.pdf out.pdf" and look at both PDF files after "qpdf --stream-data=uncompress", there is a ToUnicode CMap in chartest7.pdf, but not in out.pdf, which is fine for me, since there are only usual characters, so that a CMap is not needed (the PDF readers can get the characters from their default rules).

And with -dNEWPDF=false (to use the old PDF interpreter), the bug reappears. I recall that the issue was that I got an *incorrect* CMap.

So the bug was actually in the PDF interpreter. I've updated the bug title and the component from PDF Writer to PDF Interpreter to match the real cause.

> In any event, this is yet to be addressed.

Yes, but I think that this would be a new enhancement (until now, I don't think that I ever needed a CMap -- the fact that there is one in the PDF file is just because pdflatex now generates one by default, but I believe that in most cases, it is not needed).

So it is fine if this bug is closed (I suppose that there will be no attempt to fix the old PDF interpreter).
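For anyone repeating this comparison, the command lines can be collected in a small script. This is just a sketch: the function name repro_commands and the output file name out-uncompressed.pdf are made up here, the script only prints the commands rather than running them, and -dNEWPDF=false is only meaningful on Ghostscript versions that still ship the old PDF interpreter.

```python
import shlex

def repro_commands(src="chartest7.pdf", out="out.pdf", old_interpreter=False):
    """Build the argv lists for one ps2pdf run plus the two inspection steps."""
    ps2pdf = ["ps2pdf"]
    if old_interpreter:
        # Select the old PDF interpreter, as described in the comment above.
        ps2pdf.append("-dNEWPDF=false")
    ps2pdf += [src, out]
    return [
        ps2pdf,
        # Uncompress streams so that any ToUnicode CMap is readable as text.
        ["qpdf", "--stream-data=uncompress", out, "out-uncompressed.pdf"],
        # Extract the text to stdout to check for the wrong character.
        ["pdftotext", out, "-"],
    ]

for use_old in (False, True):
    for cmd in repro_commands(old_interpreter=use_old):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the lists can be passed to subprocess.run on a machine that has Ghostscript, qpdf and poppler-utils installed.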
(In reply to Vincent Lefevre from comment #7)
> So the bug was actually in the PDF interpreter. I've updated the bug title
> and the component from PDF Writer to PDF Interpreter to match the real cause.

I'd rather you didn't, since the issue is not as simple as that.

> So it is fine if this bug is closed (I suppose that there will be no attempt
> to fix the old PDF interpreter).

No, we won't be addressing the old interpreter. However, the fact that the current code does not emit a ToUnicode CMap is incorrect, and needs to be fixed. At which point the problem will reappear, so can we just leave this bug alone now, please?
I can no longer reproduce this issue with either:
* ghostscript 10.0.0~dfsg-11+deb12u2 in Debian/stable (12.2)
* ghostscript 10.02.1~dfsg-1 in Debian/unstable

i.e. I get ’ê as expected. From what I can see, Ghostscript no longer generates a ToUnicode CMap on this testcase.

I've also tested other files from which I had derived this simple testcase, and I can no longer see any issue either.

Has the bug eventually been fixed?
Hmm... Actually the issue was visible only with the old PDF interpreter, which has now been dropped entirely.
(In reply to Vincent Lefevre from comment #9)
> I've also tested other files from which I had derived this simple testcase,
> and I can no longer see any issue either.
>
> Has the bug eventually been fixed?

Not specifically, but there have been changes.

(In reply to Vincent Lefevre from comment #10)
> Hmm... Actually the issue was visible only with the old PDF interpreter,
> which has now been dropped entirely.

Well, we don't want to close the bug for now. Improving the whole ToUnicode preservation/generation is still a feature we want to work on, if time ever permits, and this bug remains part of the tests for eventually doing that.
*** Bug 707369 has been marked as a duplicate of this bug. ***