Bug 705246 - new PDF interpreter may yield an incorrect ToUnicode CMap with the presence of U+2308 LEFT CEILING in input
Status: RESOLVED WORKSFORME
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Interpreter
Version: 9.56.1
Hardware: PC Linux
Importance: P4 normal
Assignee: Chris Liddell (chrisl)
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-04-21 15:17 UTC by Vincent Lefevre
Modified: 2022-10-29 00:44 UTC

See Also:
Customer:
Word Size: ---


Attachments
archive containing the PDF files obtained with the chartest9 script (11.43 KB, application/x-xz)
2022-04-21 15:17 UTC, Vincent Lefevre
input PDF (from the archive) file that yields the issue (4.25 KB, application/pdf)
2022-04-21 15:19 UTC, Vincent Lefevre

Description Vincent Lefevre 2022-04-21 15:17:31 UTC
Created attachment 22438
archive containing the PDF files obtained with the chartest9 script

When an input PDF file has a character like U+2308 LEFT CEILING and has a ToUnicode CMap, the new PDF interpreter may yield an incorrect ToUnicode CMap in the generated PDF.

Here's a shell script used for some testing:

────────────────────────────────────────────────────────────────────────
#!/bin/sh

set -e

out()
{
  # printf instead of "echo -n", whose behavior is not portable under /bin/sh
  printf '%s (%s):' "$i$j" "$1"
  printf " %s" $(pdftotext chartest9$i$j$2.pdf - | tr -d '\f')
  echo
}

for i in a b
do
  for j in 0 1
  do
    cat <<'EOF' | sed "s/:$i/\\\\lceil/" | \
                  sed "s/:a//" | \
                  sed "s/J/$j/" > chartest9.tex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\pdfgentounicode=J
\begin{document}
\thispagestyle{empty}
$\in:a$
\end{document}
EOF

    pdflatex chartest9.tex > /dev/null
    mv chartest9.pdf chartest9$i$j.pdf
    out "pdfTeX" ""

    ps2pdf14 chartest9$i$j.pdf chartest9$i$j-new.pdf
    out "gs new" "-new"

    ps2pdf14 -dNEWPDF=false chartest9$i$j.pdf chartest9$i$j-old.pdf
    out "gs old" "-old"
  done
done
────────────────────────────────────────────────────────────────────────

I've attached an archive containing the obtained PDF files.

4 kinds of PDF inputs are tested (a0, a1, b0, b1), where
  * a: the content corresponds to "∈⌈" (ELEMENT OF + LEFT CEILING)
  * b: the content corresponds to "∈" (ELEMENT OF)
  * 0: \pdfgentounicode=0 (pdfTeX does not generate a ToUnicode CMap)
  * 1: \pdfgentounicode=1 (pdfTeX generates a ToUnicode CMap)

I've compared (see above script for details):
  * pdfTeX: PDF file generated by pdfTeX from TeX Live 2022
  * gs new: PDF file obtained with the new PDF interpreter (default)
  * gs old: PDF file obtained with the old PDF interpreter (-dNEWPDF=false)

I've done the tests with the ghostscript 9.56.1~dfsg-1 Debian package.

If LEFT CEILING is not present, Ghostscript does not generate a ToUnicode CMap in any of these cases, which is fine. But if this character is present:

1. With the old PDF interpreter, Ghostscript generates a correct ToUnicode CMap.

2. With the new PDF interpreter and no input ToUnicode CMap, Ghostscript does not generate a ToUnicode CMap (the only practical issue is that one cannot get unusual characters like LEFT CEILING, but this is not worse than what TeX Live 2022 can yield in any case).

3. With the new PDF interpreter and an input ToUnicode CMap like the one from TeX Live 2022, Ghostscript generates an incorrect ToUnicode CMap, which prevents one from getting usual math characters such as ELEMENT OF.

The results, where I've added ToUnicode CMap information (which I have obtained with "qpdf --stream-data=uncompress" on these PDF files):

a0 (pdfTeX): ∈d (no CMap)
a0 (gs new): ∈d (no CMap)
a0 (gs old): ∈⌈ (CMap old)
a1 (pdfTeX): ∈d (CMap 1)
a1 (gs new):    (CMap 1-new)
a1 (gs old): ∈⌈ (CMap old)
b0 (pdfTeX): ∈  (no CMap)
b0 (gs new): ∈  (no CMap)
b0 (gs old): ∈  (no CMap)
b1 (pdfTeX): ∈  (CMap 1)
b1 (gs new): ∈  (no CMap)
b1 (gs old): ∈  (no CMap)

with the following ToUnicode CMaps:

CMap old:
────────────────────────────────────────
begincmap
/CMapType 2 def
/CMapName/R11 def
1 begincodespacerange
<00><ff>
endcodespacerange
2 beginbfrange
<32><32><2208>
<64><64><2308>
endbfrange
endcmap
────────────────────────────────────────

CMap 1:
────────────────────────────────────────
begincmap
/CIDSystemInfo
<< /Registry (TeX)
/Ordering (lmsy10-lm-mathsy)
/Supplement 0
>> def
/CMapName /TeX-lmsy10-lm-mathsy-0 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
0 beginbfrange
endbfrange
0 beginbfchar
endbfchar
endcmap
────────────────────────────────────────

CMap 1-new:
────────────────────────────────────────
begincmap
/CMapType 2 def
/CMapName/R11 def
1 begincodespacerange
<00><ff>
endcodespacerange
2 beginbfrange
<32><32><00>
<64><64><00>
endbfrange
endcmap
────────────────────────────────────────
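
For anyone who wants to repeat the comparison, here is a minimal sketch of how the CMaps above can be pulled out of a PDF and checked for null destinations such as <32><32><00>. The helper names extract_cmaps and null_destinations are made up for this example; the sketch assumes that qpdf is installed and that each bfrange/bfchar entry sits on its own line, as in the listings above.

```sh
#!/bin/sh
# Sketch only: the helper names are hypothetical, not part of qpdf or Ghostscript.

# Print every CMap body (begincmap..endcmap) found in an uncompressed PDF.
extract_cmaps()
{
  sed -n '/begincmap/,/endcmap/p' "$1"
}

# Read CMap text on stdin and print the bfrange/bfchar entries whose
# destination consists only of zero digits, e.g. <32><32><00>.
null_destinations()
{
  awk '
    /beginbfrange|beginbfchar/ { inmap = 1; next }
    /endbfrange|endbfchar/     { inmap = 0; next }
    inmap && /<0+>[ \t\r]*$/   { print }
  '
}

# Usage (assumes qpdf is installed):
#   qpdf --stream-data=uncompress chartest9a1-new.pdf uncompressed.pdf
#   extract_cmaps uncompressed.pdf | null_destinations
```

On "CMap 1-new" this should list both bfrange entries, while "CMap old" should give no output.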
Comment 1 Vincent Lefevre 2022-04-21 15:19:13 UTC
Created attachment 22439
input PDF (from the archive) file that yields the issue
Comment 2 Chris Liddell (chrisl) 2022-04-22 08:13:39 UTC
Not generating a ToUnicode when none exists in the input is (for me) a conscious decision. The reason being that, in that case, we're basically guessing what the contents should be, and we're guessing based on the same information that is in the output file and thus available to subsequent interpreters. Personally, I feel that kind of heuristic should be left to the final consumer.

The current code in git doesn't produce a ToUnicode CMap when converting chartest9a1-uc.pdf - since the ToUnicode in the input contains no actual information, that seems to fall into the "no ToUnicode" case.

The only change in behaviour in that area that I can recall was this commit:

https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=226cb507884b

so I would guess that's the cause of the difference.

It *seems* to now be behaving as I'd expect, but I won't close this yet in case I've misunderstood something.
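
For illustration only (this is not Ghostscript's code): a ToUnicode CMap that "contains no actual information", like "CMap 1" in the description, can be recognised from its declared entry counts. A minimal sketch, assuming each "N beginbfchar" / "N beginbfrange" header is on its own line; the helper name cmap_is_empty is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the CMap read on stdin
# declares no mappings, i.e. all "N beginbfchar" / "N beginbfrange"
# counts add up to 0.
cmap_is_empty()
{
  awk '
    $2 == "beginbfchar" || $2 == "beginbfrange" { total += $1 }
    END { exit (total > 0) }
  '
}

# Usage:
#   if cmap_is_empty < cmap.txt; then echo "empty ToUnicode CMap"; fi
```

With the listings from the description, "CMap 1" counts as empty and "CMap old" does not.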
Comment 3 Vincent Lefevre 2022-04-27 08:43:05 UTC
(In reply to Chris Liddell (chrisl) from comment #2)
> The only change in behaviour in that area that I can recall was this commit:
> 
> https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=226cb507884b
> 
> so I would guess that's the cause of the difference.

I've just seen this comment (I did not receive the mail from Bugzilla, which might have been lost due to the major hardware failure of the storage of my VM that occurred at that time). I'll try to do some tests with this diff against the Debian package to see if this changes the behavior on my side (I'm rather busy ATM, fighting against all kinds of bugs). But first...

> It *seems* to now be behaving as I'd expect, but I won't close this yet in
> case I've misunderstood something.

You have just closed it. So do you have any additional information or do you just confirm?
Comment 4 Chris Liddell (chrisl) 2022-04-27 11:32:49 UTC
(In reply to Vincent Lefevre from comment #3)
<SNIP> 
> > It *seems* to now be behaving as I'd expect, but I won't close this yet in
> > case I've misunderstood something.
> 
> You have just closed it. So do you have any additional information or do you
> just confirm?

As I said, I only left it open for a few days in case I had misunderstood your description, I haven't looked into this any more.
Comment 5 Vincent Lefevre 2022-10-29 00:44:47 UTC
(In reply to Vincent Lefevre from comment #3)
> I've just seen this comment (I did not receive the mail from Bugzilla, which
> might have been lost due to the major hardware failure of the storage of my
> VM that occurred at that time). I'll try to do some tests with this diff
> against the Debian package to see if this changes the behavior on my side
> (I'm rather busy ATM, fighting against all kinds of bugs). But first...
[...]

I eventually forgot to do the tests. Anyway, with Ghostscript 10.0.0 now in Debian/unstable, I can see that this bug no longer occurs.