Created attachment 27182
Broken ligature samples

I'm not sure whether this is caused by an actual GS bug or whether the source PDF file is somehow problematic, but IMO it shouldn't happen in either case.

I'm attaching a sample PDF shortened to 1 line. It has valid text encoding, i.e. it's possible to copy and paste text that corresponds to the glyphs, even in Adobe Reader and Adobe Acrobat XI. The PDF was originally created in ancient software and its font encoding is rather complicated, with lots of Differences and partial use of a ToUnicode table.

When I process it with PDFwrite, the encoding of all characters is preserved except one: the fi ligature. It copies as 0x03 in the output file, likely because its glyph is in the 3rd place in /CharStrings. Note this happens even when I merely "re-generate" the source file, without OCR or any other processing. The exact command is:

gswin64c -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=Sample_1line_out.pdf Sample_1line.pdf -c quit

I think it happens because the ligature uses the old Adobe Glyph List name "f_i" instead of the currently used "fi". Or at least, in this thread from 2004 someone says Adobe decided to change it:
https://community.adobe.com/t5/type-typography-discussions/fi-fl-or-f-i-f-l-which-is-right/td-p/1633047

It happens with 10.06.0rc2, installed from gs10060rc2w64.exe, which I got here:
https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/tag/gs10060rc2

I'm also attaching the full original page from the source PDF, though the result is identical.
(In reply to Pavel Hanak from comment #0)
> I'm attaching sample PDF shortened to 1 line. It has valid text encoding,
> i.e. it's possible to copy+paste text that corresponds to glyphs, even in
> Adobe Reader and Adobe Acrobat XI. The PDF was originally created in ancient
> software and its font encoding is rather complicated, with lots of
> Differences and partial use of ToUnicode table. When I process it with
> PDFwrite, encoding of all characters is preserved, except one: fi ligature.
> It copies as 0x03 in the output file, likely because its glyph is in the 3rd
> place in /CharStrings. Note this happens even when I merely "re-generate"
> the source file, without OCR or any other processing. Exact command is:

The original PDF file has a ToUnicode CMap which contains:

1 beginbfrange
<03> <03> [<00660069>]
endbfrange

So that maps a single glyph (character code 0x03) to 2 Unicode code points, U+0066 'f' and U+0069 'i'. The pdfwrite device can't handle that.

*** This bug has been marked as a duplicate of bug 704674 ***
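For anyone reading along, the one-to-many mapping in that bfrange entry can be checked by hand. The following is only a minimal sketch, assuming the array form of a bfrange entry; `parse_bfrange_array_entry` is a hypothetical helper written for this illustration and is not part of Ghostscript. It relies on the fact that ToUnicode destination strings are UTF-16BE.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_bfrange_array_entry(line: str) -> dict[int, str]:
    """Parse one '<lo> <hi> [<dst> ...]' bfrange entry (array form only).

    Hypothetical helper for illustration; it does not handle the
    '<lo> <hi> <dst>' form or entries split across several lines.
    """
    m = re.fullmatch(rf"\s*{HEX}\s*{HEX}\s*\[(.*)\]\s*", line)
    if m is None:
        raise ValueError(f"not an array-form bfrange entry: {line!r}")
    lo = int(m.group(1), 16)
    hi = int(m.group(2), 16)
    dsts = re.findall(HEX, m.group(3))
    if len(dsts) != hi - lo + 1:
        raise ValueError("array length does not match the code range")
    # Each destination is a UTF-16BE string; it may hold several code points.
    return {lo + i: bytes.fromhex(d).decode("utf-16-be") for i, d in enumerate(dsts)}


mapping = parse_bfrange_array_entry("<03> <03> [<00660069>]")
for code, text in mapping.items():
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"char code 0x{code:02X} -> {text!r} ({points})")
```

Character code 0x03 decodes to the two-character string "fi" rather than to a single code point such as U+FB01, which is the case described above.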
OMG, it's at the very bottom of the ToUnicode table and I hadn't noticed it. Sorry about that.