Bug 707237

Summary: ps2pdf modifies ASCII text of a PDF file, breaking conversion to text and searching for text
Product: Ghostscript Reporter: Vincent Lefevre <vincent-gs>
Component: PDF Writer Assignee: Chris Liddell (chrisl) <chris.liddell>
Status: RESOLVED FIXED    
Severity: normal CC: chris.liddell
Priority: P2    
Version: 10.02.0   
Hardware: PC   
OS: Linux   
Customer: Word Size: ---
Attachments: Reduced to something actually debuggable

Description Vincent Lefevre 2023-10-05 14:34:20 UTC
With Debian's ghostscript 10.02.0~dfsg-2 package (under Debian unstable), if I run ps2pdf on the PDF file from

  https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3163.pdf

the text

  arithmetic. Also, C99’s informative annex G offered [...]

is changed to

  arithmetic. !lso, C99’s informative annex G offered [...]

i.e. the letter "A" is changed to the exclamation point "!".

Issues with special Unicode characters are common due to the lack of proper support for ToUnicode CMap, but here, this just concerns an ASCII character.

Note that this is a regression: there is no such issue with the ghostscript 10.0.0~dfsg-11+deb12u1 package under Debian 12 (bookworm). The PDF file may be particularly ugly, but Ghostscript 10.0.0 could handle it.

If I decompress the streams of the original PDF file with "qpdf --stream-data=uncompress", the TJ lines around the problematic text are:

[(i)5(n)-9(tro)10(d)10(u)7(c)9(e)-11(d)10( )-393(n)-9(e)-11(a)-11(r)5(ly)25( )-393(c)9(o)8(m)-8(p)-11(le)-13(t)20(e)11( )-393(su)5(p)11(p)-11(o)8(r)5(t )-396(f)8(o)8(r)5( )-393(th)4(e)11( )-393(I)6(E)7(C)-4( )-370(6)9(0)9(5)9(5)-13(9)9(:)-8(1)9(9)-13(8)9(9)9( )-393(st)-4(a)-11(n)-9(d)10(a)-11(r)5(d)10( )-393(f)8(o)8(r)5( )-393(bi)7(n)13(a)-11(r)5(y)4( )-393(f)8(lo)7(a)-11(ti)25(n)-9(g)] TJ

[(-)] TJ

[(p)-11(o)8(i)5(n)-9(t)20( )] TJ

[<0083>-11<0094>5<008B>5<0096008A>4<008F>-8<0087>-11<0096008B0085>12<01E40003>-301<0004>9<008E0095>-3<0091>8<01E10003>-301<0006>-4<037B>9<037B>9<01EF>-5<00950003>-304<008B>5<0090>-9<0088>8<0091>-13<0094>5<008F>-8<0083>-11<0096008B0098>6<0087>-11<0003>] TJ

[(a)] TJ

[(n)13(n)-9(e)-11(x)6( )-302(G )-304(o)8(f)8(f)8(e)-11(r)5(e)11(d)10( )-302(a)-11( )-302(sp)-13(e)-11(c)9(i)5(f)8(i)5(c)9(a)-11(tio)11(n)-9( )-302(o)8(f)8( )-324(c)9(o)8(m)-8(p)-11(le)-13(x)6( )-302(a)-11(r)5(i)5(th)4(m)-8(e)-11(tic)12( )-302(th)4(a)-11(t )-305(i)5(s )] TJ
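
To make the hex-string TJ line above easier to read, here is a minimal Python sketch that splits an excerpt of it into two-byte character codes. The helper name `tj_hex_codes` and the `GUESSED` table are illustrative assumptions: the table was inferred only by aligning the codes with the visible text ("... arithmetic. Also, C99’s informative ..."), not read from the file's actual ToUnicode CMap.

```python
import re

# Excerpt of the hex-string TJ array quoted above (the part around "Also").
TJ_EXCERPT = (
    "[<01E40003>-301<0004>9<008E0095>-3<0091>8<01E10003>-301"
    "<0006>-4<037B>9<037B>9<01EF>-5<00950003>] TJ"
)

def tj_hex_codes(tj_line, code_bytes=2):
    """Return the character codes found in the <...> hex strings of a TJ array.

    Assumes fixed-width codes (two bytes here); the kerning numbers between
    the strings are ignored.
    """
    step = code_bytes * 2  # hex digits per character code
    codes = []
    for hex_run in re.findall(r"<([0-9A-Fa-f]+)>", tj_line):
        for i in range(0, len(hex_run), step):
            codes.append(int(hex_run[i:i + step], 16))
    return codes

# ASSUMPTION: code-to-character table guessed from the rendered text,
# not taken from the PDF's ToUnicode CMap.
GUESSED = {
    0x01E4: ".", 0x0003: " ", 0x0004: "A", 0x008E: "l", 0x0095: "s",
    0x0091: "o", 0x01E1: ",", 0x0006: "C", 0x037B: "9", 0x01EF: "\u2019",
}

codes = tj_hex_codes(TJ_EXCERPT)
text = "".join(GUESSED.get(code, "?") for code in codes)
print(text)
```

Under this reading, the "A" that ends up as "!" is the two-byte code <0004>, so the text is not stored as ASCII bytes in the content stream even though the extracted character is an ASCII one.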
Comment 1 Chris Liddell (chrisl) 2023-10-05 15:48:32 UTC
Created attachment 24934
Reduced to something actually debuggable
Comment 2 Chris Liddell (chrisl) 2023-10-09 12:25:19 UTC
Fixed in:

https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=c92cd1c24abf4


I don't know what you mean by "lack of proper support for ToUnicode CMap" nor the "this just concerns an ASCII character". The problem is a bug in the parsing of the ToUnicode CMap for a multi-byte (so very clearly not ASCII) CIDFont.
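
For readers unfamiliar with the structure involved, the following Python sketch shows what a ToUnicode CMap for a two-byte CIDFont can look like and how its `bfchar` and `bfrange` entries map codes to Unicode. It does not reproduce the Ghostscript bug or the fix. `SAMPLE_CMAP` is a hypothetical fragment written for illustration, not extracted from n3163.pdf, and `parse_tounicode` is a simplified helper that ignores the array form of `bfrange` and assumes a single-code-unit destination in ranges.

```python
import re

# HYPOTHETICAL ToUnicode CMap fragment for a two-byte CIDFont
# (illustration only; not taken from the PDF in this report).
SAMPLE_CMAP = """
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0004> <0041>
endbfchar
1 beginbfrange
<0083> <0085> <0061>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"

def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} table from bfchar/bfrange blocks."""
    mapping = {}
    # bfchar: one source code mapped to one UTF-16BE destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: consecutive source codes mapped to consecutive destinations.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping

table = parse_tounicode(SAMPLE_CMAP)
print(table[0x0004])
```

With a table like this, a two-byte code such as <0004> can legitimately map to an ASCII character like "A", even though the font itself uses multi-byte codes.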
Comment 3 Vincent Lefevre 2023-10-23 14:19:54 UTC
(In reply to Chris Liddell (chrisl) from comment #2)
> Fixed in:
> 
> https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=c92cd1c24abf4

I'll try to have a look at it in the next few days.

> I don't know what you mean by "lack of proper support for ToUnicode CMap"
> nor the "this just concerns an ASCII character". The problem is a bug in the
> parsing of the ToUnicode CMap for a multi-byte (so very clearly not ASCII)
> CIDFont.

OK. About "lack of proper support for ToUnicode CMap", there have been various regressions in the past few years concerning *non-ASCII* characters, possibly related to the ToUnicode CMap handling (but only affecting particular non-ASCII characters); some of them (related to the ToUnicode CMap) have been fixed, but see bug 704674 and bug 704681, which are still open (at least the second one appeared when switching to the new PDF interpreter). Concerning "this just concerns an ASCII character", I just meant that this was the first time I was seeing an ASCII character not handled correctly.

I still have to analyze new regressions (to myself: cours05.tex). A bug in poppler or a change on the LaTeX side is not excluded either (I'll have to check that too).
Comment 4 Vincent Lefevre 2023-11-20 14:44:36 UTC
(In reply to Chris Liddell (chrisl) from comment #2)
> Fixed in:
> 
> https://git.ghostscript.com/?p=ghostpdl.git;a=commitdiff;h=c92cd1c24abf4

I confirm that the issue is fixed in Debian's package ghostscript 10.02.1~dfsg-1 (from 2023-11-08 in unstable).

Note also that while ghostscript 10.0.0~dfsg-11+deb12u2 (for Debian 12 (bookworm), which is the current Debian/stable) doesn't have any issue with the "A" changed to "!", it has various similar issues with non-ASCII characters in the n3163.pdf file. These issues do not appear in ghostscript 10.02.1~dfsg-1.

But I can still see the regressions from cours05.tex compared to what I got with the old PDF interpreter.