Created attachment 22438
archive containing the PDF files obtained with the chartest9 script

When an input PDF file has a character like U+2308 LEFT CEILING and has a ToUnicode CMap, the new PDF interpreter may yield an incorrect ToUnicode CMap in the generated PDF.

Here's the shell script I used for testing:

────────────────────────────────────────────────────────────────────────
#!/bin/sh

set -e

out()
{
  echo -n "$i$j ($1):"
  printf " %s" $(pdftotext chartest9$i$j$2.pdf - | tr -d '\f')
  echo
}

for i in a b
do
  for j in 0 1
  do
    cat <<'EOF' | sed "s/:$i/\\\\lceil/" | \
      sed "s/:a//" | \
      sed "s/J/$j/" > chartest9.tex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\pdfgentounicode=J
\begin{document}
\thispagestyle{empty}
$\in:a$
\end{document}
EOF
    pdflatex chartest9.tex > /dev/null
    mv chartest9.pdf chartest9$i$j.pdf
    out "pdfTeX" ""
    ps2pdf14 chartest9$i$j.pdf chartest9$i$j-new.pdf
    out "gs new" "-new"
    ps2pdf14 -dNEWPDF=false chartest9$i$j.pdf chartest9$i$j-old.pdf
    out "gs old" "-old"
  done
done
────────────────────────────────────────────────────────────────────────

I've attached an archive containing the obtained PDF files.

4 kinds of PDF inputs are tested (a0, a1, b0, b1), where
  * a: the content corresponds to "∈⌈" (ELEMENT OF + LEFT CEILING)
  * b: the content corresponds to "∈" (ELEMENT OF)
  * 0: \pdfgentounicode=0 (pdfTeX does not generate a ToUnicode CMap)
  * 1: \pdfgentounicode=1 (pdfTeX generates a ToUnicode CMap)

I've compared (see the script above for details):
  * pdfTeX: PDF file generated by pdfTeX from TeX Live 2022
  * gs new: PDF file obtained with the new PDF interpreter (default)
  * gs old: PDF file obtained with the old PDF interpreter (-dNEWPDF=false)

I've done the tests with the ghostscript 9.56.1~dfsg-1 Debian package.

If LEFT CEILING is not present, Ghostscript does not generate a ToUnicode CMap in any of these cases, which is fine. But if this character is present:

1. With the old PDF interpreter, Ghostscript generates a correct ToUnicode CMap.

2. With the new PDF interpreter and no input ToUnicode CMap, Ghostscript does not generate a ToUnicode CMap (the only practical issue is that one cannot get unusual characters like LEFT CEILING, but this is not worse than what TeX Live 2022 can yield in any case).

3. With the new PDF interpreter and an input ToUnicode CMap like the one from TeX Live 2022, Ghostscript generates an incorrect ToUnicode CMap, which prevents one from getting usual math characters such as ELEMENT OF.

The results, where I've added ToUnicode CMap information (which I have obtained with "qpdf --stream-data=uncompress" on these PDF files):

a0 (pdfTeX): ∈d  (no CMap)
a0 (gs new): ∈d  (no CMap)
a0 (gs old): ∈⌈  (CMap old)
a1 (pdfTeX): ∈d  (CMap 1)
a1 (gs new):     (CMap 1-new)
a1 (gs old): ∈⌈  (CMap old)
b0 (pdfTeX): ∈   (no CMap)
b0 (gs new): ∈   (no CMap)
b0 (gs old): ∈   (no CMap)
b1 (pdfTeX): ∈   (CMap 1)
b1 (gs new): ∈   (no CMap)
b1 (gs old): ∈   (no CMap)

with the following ToUnicode CMaps:

CMap old:
────────────────────────────────────────
begincmap
/CMapType 2 def
/CMapName/R11 def
1 begincodespacerange
<00><ff>
endcodespacerange
2 beginbfrange
<32><32><2208>
<64><64><2308>
endbfrange
endcmap
────────────────────────────────────────

CMap 1:
────────────────────────────────────────
begincmap
/CIDSystemInfo
<< /Registry (TeX)
/Ordering (lmsy10-lm-mathsy)
/Supplement 0
>> def
/CMapName /TeX-lmsy10-lm-mathsy-0 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
0 beginbfrange
endbfrange
0 beginbfchar
endbfchar
endcmap
────────────────────────────────────────

CMap 1-new:
────────────────────────────────────────
begincmap
/CMapType 2 def
/CMapName/R11 def
1 begincodespacerange
<00><ff>
endcodespacerange
2 beginbfrange
<32><32><00>
<64><64><00>
endbfrange
endcmap
────────────────────────────────────────
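
To make the difference between "CMap old" and "CMap 1-new" easier to see at a glance, here is a minimal sketch of a helper that lists the bfrange entries of a CMap read from standard input. The function name cmap_bfranges is hypothetical (it is not part of qpdf or Ghostscript), and the sketch assumes the simple layout shown above, with one <first><last><destination> triple per line.

```shell
#!/bin/sh
# Hypothetical helper: list the bfrange entries of a ToUnicode CMap (stdin)
# as "<first>-<last> -> <destination>", all values in hex as written.
# Assumes one "<first><last><destination>" triple per line.
cmap_bfranges()
{
  awk '
    /beginbfrange/ { inrange = 1; next }
    /endbfrange/   { inrange = 0; next }
    inrange {
      gsub(/[<>]/, " ")   # turn "<32><32><2208>" into " 32  32  2208 "
      if (NF == 3) printf "%s-%s -> %s\n", $1, $2, $3
    }
  '
}

# Usage with the bfrange part of "CMap old":
cmap_bfranges <<'EOF'
2 beginbfrange
<32><32><2208>
<64><64><2308>
endbfrange
EOF
# prints:
# 32-32 -> 2208
# 64-64 -> 2308
```

Run on "CMap 1-new", the same helper prints 00 as the destination for both codes, i.e. neither 0x32 nor 0x64 maps to U+2208 or U+2308 any longer.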
Created attachment 22439
input PDF file (from the archive) that triggers the issue
Not generating a ToUnicode when none exists in the input is (for me) a conscious decision. The reason being that, in that case, we're basically guessing what the contents should be, and we're guessing based on the same information that is in the output file and thus available to subsequent interpreters. Personally, I feel that kind of heuristic should be left to the final consumer.

The current code in git doesn't produce a ToUnicode CMap when converting chartest9a1-uc.pdf - since the ToUnicode in the input contains no actual information, that seems to fall into the "no ToUnicode" case.

The only change in behaviour in that area that I can recall was this commit:

https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=226cb507884b

so I would guess that's the cause of the difference.

It *seems* to now be behaving as I'd expect, but I won't close this yet in case I've misunderstood something.
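
As an illustration of what "contains no actual information" means for the pdfTeX CMap quoted in the report ("CMap 1", with 0 beginbfrange and 0 beginbfchar), here is a minimal sketch of a check on an extracted CMap. The function name cmap_has_mappings is hypothetical and is not how Ghostscript decides this internally; the sketch only sums the entry counts declared in front of beginbfchar and beginbfrange.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the CMap on stdin declares at least one
# bfchar or bfrange entry, exit 1 if all declared counts are zero.
# Assumes the count and the operator sit on the same line ("0 beginbfrange").
cmap_has_mappings()
{
  awk '
    $2 == "beginbfchar" || $2 == "beginbfrange" { total += $1 }
    END { exit (total > 0 ? 0 : 1) }
  '
}

# Usage with the mapping part of "CMap 1" from the report:
if cmap_has_mappings <<'EOF'
0 beginbfrange
endbfrange
0 beginbfchar
endbfchar
EOF
then
  echo "CMap carries mappings"
else
  echo "CMap is empty"
fi
# prints: CMap is empty
```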
(In reply to Chris Liddell (chrisl) from comment #2)
> The only change in behaviour in that area that I can recall was this commit:
>
> https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=226cb507884b
>
> so I would guess that's the cause of the difference.

I've just seen this comment (I did not receive the mail from Bugzilla, which might have been lost due to the major hardware failure of the storage of my VM that occurred at that time). I'll try to do some tests with this diff against the Debian package to see if this changes the behavior on my side (I'm rather busy ATM, fighting against all kinds of bugs). But first...

> It *seems* to now be behaving as I'd expect, but I won't close this yet in
> case I've misunderstood something.

You have just closed it. So do you have any additional information or do you just confirm?
(In reply to Vincent Lefevre from comment #3)
<SNIP>
> > It *seems* to now be behaving as I'd expect, but I won't close this yet in
> > case I've misunderstood something.
>
> You have just closed it. So do you have any additional information or do you
> just confirm?

As I said, I only left it open for a few days in case I had misunderstood your description; I haven't looked into this any further.
(In reply to Vincent Lefevre from comment #3)
> I've just seen this comment (I did not receive the mail from Bugzilla, which
> might have been lost due to the major hardware failure of the storage of my
> VM that occurred at that time). I'll try to do some tests with this diff
> against the Debian package to see if this changes the behavior on my side
> (I'm rather busy ATM, fighting against all kinds of bugs). But first...
[...]

In the end, I forgot to do the tests. Anyway, with Ghostscript 10.0.0 now in Debian/unstable, I can see that this bug no longer occurs.