Bug 704478 - pdfwrite emits incorrect ToUnicode entries
Status: UNCONFIRMED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: master
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
QA Contact: Bug traffic

Reported: 2021-10-01 01:07 UTC by Vincent Lefevre
Modified: 2021-10-18 13:25 UTC
CC List: 2 users (robin.watts, zauguin)

Attachments
PDF testcase (12.29 KB, application/pdf)
2021-10-01 01:07 UTC, Vincent Lefevre
incorrect PDF file generated by ps2pdf (4.61 KB, application/pdf)
2021-10-01 01:11 UTC, Vincent Lefevre
PDF testcase 2 (22.40 KB, application/pdf)
2021-10-01 12:38 UTC, Vincent Lefevre
Tests with various TeX Live and Ghostscript versions (124.51 KB, application/x-xz)
2021-10-03 23:17 UTC, Vincent Lefevre

Description - Vincent Lefevre 2021-10-01 01:07:45 UTC

Created attachment 21627: PDF testcase

When running ps2pdf on a PDF file, some characters may be replaced by completely different ones. The glyph is correct, but this makes text non-searchable and partly unreadable via pdftotext.

On my testcase, the issue has been introduced by commit 4d91c6ad3e76e19f36d23a50dce253fbbc7d0560 (Update CFF strings "known" encoding in C) and is still not fixed in master.

LaTeX source to generate the PDF testcase:

    \documentclass[12pt]{article}
    \usepackage[T1]{fontenc}
    \begin{document}
    \thispagestyle{empty}
    Test: float.
    \end{document}

to be compiled with pdflatex. I've attached the generated PDF file "chartest.pdf".

On this file, pdftotext gives "Test: float." as expected. But after executing "ps2pdf chartest.pdf chartest-gs.pdf", "pdftotext chartest-gs.pdf" gives "Test: ŕoat.", which is incorrect: "fl" has been replaced by "ŕ".

Comment 1 - Vincent Lefevre 2021-10-01 01:11:49 UTC

Created attachment 21628: incorrect PDF file generated by ps2pdf

I've also attached the incorrect PDF file generated by ps2pdf: "chartest-gs.pdf".

Note that removing "\usepackage[T1]{fontenc}" or the period after "float" in the LaTeX source makes the issue disappear.

Comment 2 - Vincent Lefevre 2021-10-01 12:38:50 UTC

Created attachment 21639: PDF testcase 2

The issue is actually older, as shown by this second testcase. The text is "Don’t ff.", but ps2pdf yields a PDF file with "DonŠt ff." (the bug was already present in October 2017).

Comment 3 - Vincent Lefevre 2021-10-01 13:55:40 UTC

Note: To reproduce the issue from pdflatex, one needs a recent version, such as the one in Debian/unstable. So it seems that the issue partly comes from pdflatex.

Comment 4 - Ray Johnston 2021-10-01 16:17:40 UTC

This may actually be an issue for Chris -- Ken can reassign if it is not in the pdfwrite device, but is a font issue.
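The reproduction steps from the description can be scripted. The following is a minimal Python sketch, not part of the original report: it assumes the `ps2pdf` and `pdftotext` executables are on the PATH and that pdftotext emits UTF-8, and the helper names (`extract_text`, `roundtrip`, `text_differences`) are made up for illustration.

```python
import difflib
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return the text ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    # Assumption: pdftotext output is UTF-8.
    return result.stdout.decode("utf-8").strip()


def roundtrip(src_pdf: str, dst_pdf: str) -> str:
    """Pass src_pdf through ps2pdf, then return the text of the result."""
    subprocess.run(["ps2pdf", src_pdf, dst_pdf], check=True)
    return extract_text(dst_pdf)


def text_differences(before: str, after: str) -> list[tuple[str, str]]:
    """Return (original, replacement) pairs for every span that differs."""
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    return [
        (before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

With the attached testcase, one would call `text_differences(extract_text("chartest.pdf"), roundtrip("chartest.pdf", "chartest-gs.pdf"))`. Applied to the two strings quoted in the description, the comparison isolates the "fl" to "ŕ" substitution.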
Comment 5 - Ken Sharp 2021-10-01 18:24:48 UTC

I suspect the problem here is that the ToUnicode CMap in the original file does not map the character code for the fl ligature to a single Unicode code point.

Instead it maps it to two code points; 'f' and 'l'. We expect it to be one, which should be U+FB02. We (not unreasonably I feel) expect that a single character code should map to a single Unicode code point, the value here 0x66 0x6c is an approximation.

In the second file the ff ligature maps to 'f' and 'f', again an approximation.

I don't consider this to be a 'major' problem since the appearance of the PDF is unaffected by this Metadata.

It will be some while before I can investigate this further.

In passing, I believe the original file from TeX is technically invalid as it contains this in the Info dictionary:

    /PTEX.Fullbanner (This is pdfTeX, Version 3.141592653-2.6-1.40.22 (TeX Live 2022/dev/Debian) kpathsea version 6.3.4/dev)

That contains parentheses which have not been escaped.

Comment 6 - Vincent Lefevre 2021-10-01 22:58:08 UTC

(In reply to Ken Sharp from comment #5)
> I suspect the problem here is that the ToUnicode CMap in the original file
> does not map the character code for the fl ligature to a single Unicode code
> point.
>
> Instead it maps it to two code points; 'f' and 'l'. We expect it to be one,
> which should be U+FB02. We (not unreasonably I feel) expect that a single
> character code should map to a single Unicode code point, the value here
> 0x66 0x6c is an approximation.

The fl ligature semantically corresponds to the characters 'f' and 'l'. So, at some point, one needs to get these two characters back, so that they are searchable (and readable in a text terminal). Will the various applications do this? (With some PDF files, I can see characters like U+FB02 with pdftotext, and they are not searchable, so I have some doubts.)

> I don't consider this to be a 'major' problem since the appearance of the
> PDF is unaffected by this Metadata.
This is a major issue for me (and the silent breakage makes this worse): being able to search PDF files is very important; this is also important for search engines when indexing PDF files. Converting the contents to text is also very important for some operations (diffs, etc.).

> In passing, I believe the original file from TeX is technically invalid as
> it contains this in the Info dictionary:
>
> /PTEX.Fullbanner (This is pdfTeX, Version 3.141592653-2.6-1.40.22 (TeX Live
> 2022/dev/Debian) kpathsea version 6.3.4/dev)
>
> That contains parentheses which have not been escaped.

Indeed, I can see on an old PDF file that they were escaped in the past (but "qpdf --check" doesn't detect any issue).

Comment 7 - Robin Watts 2021-10-01 23:02:31 UTC

(In reply to Vincent Lefevre from comment #6)
> (In reply to Ken Sharp from comment #5)
> > /PTEX.Fullbanner (This is pdfTeX, Version 3.141592653-2.6-1.40.22 (TeX Live
> > 2022/dev/Debian) kpathsea version 6.3.4/dev)
> >
> > That contains parentheses which have not been escaped.
> Indeed, I can see on an old PDF file that they were escaped in the past (but
> "qpdf --check" doesn't detect any issue).

You don't need to escape balanced parentheses in PDF strings.

pdf_reference17.pdf, chapter 3, p54:

    Literal Strings

    A literal string is written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses and the backslash, which must be treated specially. Balanced pairs of parentheses within a string require no special treatment.

Comment 8 - Ken Sharp 2021-10-02 08:23:42 UTC

(In reply to Vincent Lefevre from comment #6)
> The fl ligature semantically corresponds to the characters 'f' and 'l'. So,
> at some point, one needs to get these two characters back, so that they are
> searchable (and readable in a text terminal). Will the various applications
> do this? (With some PDF files, I can see characters like U+FB02 with
> pdftotext, and they are not searchable, so I have some doubts.)
Not my problem, you should probably take that up with the PDF consumers. A Unicode ligature should be searchable either as a ligature or its components.

> > I don't consider this to be a 'major' problem since the appearance of the
> > PDF is unaffected by this Metadata.
>
> This is a major issue for me (and the silent breakage makes this worse):

In the nicest possible way, that isn't important to me. While we attempt to preserve considerable non-visual information from the input, the goal of pdfwrite has always been that the output should visually match the input.

Comment 9 - Vincent Lefevre 2021-10-02 22:47:56 UTC

(In reply to Ken Sharp from comment #8)
> (In reply to Vincent Lefevre from comment #6)
>
> > The fl ligature semantically corresponds to the characters 'f' and 'l'. So,
> > at some point, one needs to get these two characters back, so that they are
> > searchable (and readable in a text terminal). Will the various applications
> > do this? (With some PDF files, I can see characters like U+FB02 with
> > pdftotext, and they are not searchable, so I have some doubts.)
>
> Not my problem, you should probably take that up with the PDF consumers. A
> Unicode ligature should be searchable either as a ligature or its components.

OK, it seems that the ligatures are searchable with Atril and xpdf (this was not the case in the past for xpdf, or perhaps this was a different issue). The pdftotext command from poppler still mishandles ligatures, but this seems unintended because there was an old bug fixed in 2012 about that, and the source still says:

    Expand ligatures in the Alphabetic Presentation Form block (eg "fi", "ffi") to normal form.

So I've reported a new bug against poppler.

Now, I don't understand why ps2pdf can sometimes handle a mapping to two code points, and sometimes it can't.
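As an illustration of what "expanding ligatures to normal form" means on the consumer side, here is a small Python sketch. It is not poppler's implementation; it simply assumes that Unicode compatibility normalization (NFKC) is an acceptable way to decompose the Latin ligatures U+FB00 to U+FB06, and the function name is invented for the example.

```python
import unicodedata

# The Alphabetic Presentation Forms block holds the Latin ligatures
# U+FB00 (ff) through U+FB06 (st).
LATIN_LIGATURES = range(0xFB00, 0xFB07)


def expand_ligatures(text: str) -> str:
    """Replace Latin ligature code points with their component letters."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LATIN_LIGATURES else ch
        for ch in text
    )
```

Restricting the normalization to the ligature range leaves other characters, such as the curly apostrophe in "don’t", untouched. A consumer that applies this kind of expansion before indexing makes "float" searchable whether the ToUnicode CMap yields U+FB02 or the pair 'f', 'l'.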
Comment 10 - Vincent Lefevre 2021-10-03 23:17:24 UTC

Created attachment 21645: Tests with various TeX Live and Ghostscript versions

I've done additional tests with various TeX Live and Ghostscript versions, showing that issues started to appear when *both* TeX Live 2021 and Ghostscript 9.53 or later (or maybe some version between 9.27 and 9.53, which I did not test) are used.

Now, I don't know yet whether the change in TeX Live 2021 was intended or could be regarded as a bug; however, I could not see any really bad consequence with Ghostscript 9.27. It might be possible that there is also some randomness involved, i.e. an issue that is present somewhere, but not visible in my tests.

Details:

My tests are based on 5 LaTeX source files, which use the lmodern fonts and contain
  * chartest3.tex: Test: « don't ».
  * chartest4[ab].tex: Test: « don't finite float ».
  * chartest5[ab].tex: Test: « don't finite float offer affine ».
where the 4b and 5b versions also contain \pdfglyphtounicode commands for the ligatures (from glyphtounicode.tex), though the tests show that these commands do not have any influence here.

For example, chartest3.tex is:

    \documentclass[12pt]{article}
    \usepackage[utf8]{inputenc}
    \usepackage[T1]{fontenc}
    \usepackage{lmodern}
    \begin{document}
    \thispagestyle{empty}
    Test: « don't ».
    \end{document}

Tested TeX Live versions:
  * 2018: texlive-base 2018.20190227-2 (Debian 10 buster);
  * 2020: texlive-base 2020.20210202-3 (Debian 11 bullseye);
  * 2021: texlive-base 2021.20210921-1 (Debian unstable).

For all PDF files generated by pdflatex, pdftotext gives the normal form (as in the LaTeX source, but with a curly apostrophe), i.e. for 3, 4[ab] and 5[ab] respectively:
  * Test: « don’t ».
  * Test: « don’t finite float ».
  * Test: « don’t finite float offer affine ».

Tested Ghostscript versions:
  * ghostscript 9.27~dfsg-2+deb10u4 (Debian 10 buster);
  * ghostscript 9.53.3~dfsg-7+deb11u1 (Debian 11 bullseye);
  * ghostscript 9.54.0~dfsg-5 (Debian unstable).
The ps2pdf utility from these Ghostscript versions shows 3 different behaviors:

  * Text left unchanged ("=" below).

  * Ligatures are introduced ("L" below):
      Test: « don’t ».   [equivalent to unchanged]
      Test: « don’t ﬁnite ﬂoat ».
      Test: « don’t ﬁnite ﬂoat oﬀer aﬃne ».
    i.e. with U+FB01 LATIN SMALL LIGATURE FI, etc.

  * Characters get trashed ("X" below):
      Test: ń donŠt ż.
      Test: ń donŠt Ąnite Ćoat ż.
      Test: ń donŠt Ąnite Ćoat offer affine ż.
    (As a reminder, this only affects the text part; the glyphs are always OK.)

Summary of results depending on the TeX Live (rows) and Ghostscript (columns) versions:

            9.27   9.53   9.54
    2018    =      =      L
    2020    =      =      L
    2021    =      ==X    X

Note: ==X means that text is unchanged for 3 and 4[ab], but trashed for 5[ab].

As shown above, TeX Live 2018 and 2020 lead to the same results. Note also that these are results based on pdftotext from poppler; I don't know whether this utility converts ligatures to the normal form in some cases (but not always).

All the PDF files are in the attached archive:
  * chartest*-tl20??.pdf for those obtained with pdflatex;
  * chartest*-tl20??-gs???.pdf for those obtained with ps2pdf from the above PDF files.

Comment 11 - Marcel Krüger 2021-10-16 18:27:24 UTC

(In reply to Ken Sharp from comment #5)
> I suspect the problem here is that the ToUnicode CMap in the original file
> does not map the character code for the fl ligature to a single Unicode code
> point.
>
> Instead it maps it to two code points; 'f' and 'l'. We expect it to be one,
> which should be U+FB02. We (not unreasonably I feel) expect that a single
> character code should map to a single Unicode code point, the value here
> 0x66 0x6c is an approximation.
>
> In the second file the ff ligature maps to 'f' and 'f', again an
> approximation.

In the current PDF specification, Section 9.10.3 "ToUnicode CMaps" Example 2 is an example with a ToUnicode CMap mapping ligatures (ff, fl and ffl). There the same approach is used as in pdfLaTeX PDF files: Mapping ligatures to their components.
While this is an example and not normative text, it does suggest that decomposing ligatures in ToUnicode maps is the expected behavior in PDF files. Especially since modern fonts often contain ligatures which do not have an independent code point and therefore have to be decomposed, I think that it is only consistent to do the same for ligatures which are also encoded separately for legacy reasons.

Comment 12 - Ken Sharp 2021-10-16 18:49:48 UTC

(In reply to Marcel Krüger from comment #11)
> In the current PDF specification,

Where possible I'd prefer to use the PDF 1.7 specification. It is freely available, unlike the ISO specification, which makes it much easier for others to refer to and it's also a lot easier to read.

> Section 9.10.3 "ToUnicode CMaps" Example 2
> is an example with a ToUnicode CMap mapping ligatures (ff, fl and ffl).
> There the same approach is used as in pdfLaTeX PDF files: Mapping ligatures
> to their components.
>
> While this is an example and not normative text, it does suggest that
> decomposing ligatures in ToUnicode maps is the expected behavior in PDF
> files.

I believe the example has been deliberately constructed this way precisely in order to demonstrate this kind of mapping. It's not too surprising that the authors of the specification should have chosen to write an example to demonstrate multiple points. I disagree that it implies any kind of expectation, there is nothing in the text to indicate that at all.

For what it's worth the original behavior of pdfwrite did not include a ToUnicode CMap in the output file at all. The searchability of that file appears to be due to Adobe Acrobat being able to find the named glyph (/fi) in the font Encoding, or possibly because the character code happens to match the local encoding in use. Either way, hardly reliable.

I've already said I will look into this, but it will not be soon as I have other things to work on which have a higher priority.
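To make the two kinds of ToUnicode mapping discussed above concrete, the following Python sketch decodes `beginbfchar` entries, whose destination strings are UTF-16BE encoded. It is an illustration only: the function name is invented, `bfrange` sections are not handled, and the character codes in `SAMPLE` are made-up values rather than ones taken from the attached files.

```python
import re

# One bfchar entry: <source character code> <destination UTF-16BE string>.
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(section: str) -> dict[int, str]:
    """Map each source character code to its decoded Unicode string."""
    mapping = {}
    for src_hex, dst_hex in BFCHAR_ENTRY.findall(section):
        mapping[int(src_hex, 16)] = bytes.fromhex(dst_hex).decode("utf-16-be")
    return mapping


# Hypothetical entries: the first decomposes a ligature into two code points,
# the second maps a ligature to a single presentation-form code point.
SAMPLE = """
2 beginbfchar
<1C> <00660069>
<1D> <FB02>
endbfchar
"""
```

Here `parse_bfchar(SAMPLE)` yields a two-character string for the first code and a one-character string for the second, which is the distinction the comments above are arguing about: both forms are expressible in a ToUnicode CMap, and a converter has to cope with either.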
Comment 13 - Vincent Lefevre 2021-10-18 13:13:12 UTC

Some additional information about the possible cause of the incorrect behavior:

According to a discussion in the TeX Live mailing-list, what is new in TeX Live 2021 and would cause the change of behavior in Ghostscript (see results in Comment 10) is that it adds CMaps to the PDF by default:

  https://tug.org/pipermail/tex-live/2021-October/047488.html

Indeed, with TeX Live 2021, \pdfgentounicode=0 (to disable that) avoids the issue.

Actually, I think that the issue is caused by something particular in the CMap, not by the presence of a CMap itself, since with some of my LaTeX files, I was using my own CMap:

    \pdfglyphtounicode{ff}{0066 0066}
    \pdfglyphtounicode{ffi}{0066 0066 0069}
    \pdfglyphtounicode{ffl}{0066 0066 006C}
    \pdfglyphtounicode{fi}{0066 0069}
    \pdfglyphtounicode{fl}{0066 006C}
    \pdfgentounicode=1

(the goal being to avoid the ligatures in the PDF file), and with TeX Live versions before 2021, this code does not yield any issue with Ghostscript (even with the latest Ghostscript 9.54). This suggests that the issue in Ghostscript does not come from these ligatures, but from something else in the CMap.

Comment 14 - Ken Sharp 2021-10-18 13:20:40 UTC

(In reply to Vincent Lefevre from comment #13)
> According to a discussion in the TeX Live mailing-list, what is new in TeX
> Live 2021 and would cause the change of behavior in Ghostscript (see results
> in Comment 10) is that it adds CMaps to the PDF by default:

Realistically it doesn't matter what changed in TeX, we should either preserve the ToUnicode information from the input file, or not emit a ToUnicode CMap which has the wrong code point for the ligatures.

As I noted; previously we didn't emit a ToUnicode CMap, now we do (I suspect because we now correctly recognise a standard font encoding including a ligature). PDF consumers ought to trust the ToUnicode CMap over anything else so embedding a wrong one is bad.
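For regression testing, the three outcomes tabulated in Comment 10 ("=", "L", "X") could be told apart mechanically from the text extracted before and after conversion. The Python sketch below is one possible way to do that, under the assumption that NFKC normalization is a good enough test for "differs only by ligatures"; the function names are hypothetical and nothing here comes from Ghostscript itself.

```python
import unicodedata


def _fold(text: str) -> str:
    """Fold compatibility forms (ligatures among them) to their base letters."""
    return unicodedata.normalize("NFKC", text)


def classify(before: str, after: str) -> str:
    """Return "=", "L" or "X" for text extracted before and after conversion."""
    if after == before:
        return "="  # text left unchanged
    if _fold(after) == _fold(before):
        return "L"  # differs only by compatibility forms such as ligatures
    return "X"  # characters were replaced by unrelated ones
```

Only the "X" case corresponds to a wrong ToUnicode entry; "L" is the single-code-point ligature mapping, which remains searchable in consumers that expand ligatures. Note that NFKC also folds compatibility characters other than ligatures, so a stricter check may be wanted in practice.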