Bug 706533

Summary: Copy/paste ligatures from luaLaTeX with new PDF interpreter produces invalid chars
Product: Ghostscript
Reporter: Justin Beaty <foss>
Component: Text
Assignee: Ken Sharp <ken.sharp>
Status: RESOLVED DUPLICATE
Severity: normal
CC: foss, ghostpdl-bugs
Priority: P4
Version: 10.0.0
Hardware: PC
OS: Linux
Customer:
Word Size: ---
Attachments: input.pdf file
output file with new interpreter
output file with old interpreter

Description Justin Beaty 2023-04-01 20:37:36 UTC
Created attachment 23944 [details]
input.pdf file

Hello again,

I have a very simple .tex file, which I compile with `lualatex input.tex`:

```
\documentclass{article}
\begin{document}
ff, fi, fl, ffi, ffl
\end{document}
```

If I open this file with evince or Adobe Reader and copy/paste, I get "ff, fi, fl, ffi, ffl" as expected.

---

Now, if I run the outputted file through gs 10.00.0:

gs -sDEVICE=pdfwrite -o out-new.pdf -f input.pdf

And copy/paste, I get "昀昀, 昀椀, 昀氀, 昀케, 昀툀".
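
For reference, the first three pasted results can be reproduced outside Ghostscript. The sketch below is only an observation about the symptom, not a description of what the interpreter does internally: it encodes the expected text as UTF-16BE and decodes the same bytes as UTF-16LE. The helper name `byte_swapped` is made up for this illustration.

```python
# Expected paste results for the first three ligatures in the test file.
expected = ["ff", "fi", "fl"]


def byte_swapped(text: str) -> str:
    """Encode as UTF-16BE, then decode the same bytes as UTF-16LE."""
    return text.encode("utf-16-be").decode("utf-16-le")


for text in expected:
    garbled = byte_swapped(text)
    # List the code points so the result can be compared with the pasted text.
    points = " ".join(f"U+{ord(ch):04X}" for ch in garbled)
    print(f"{text!r} -> {garbled!r} ({points})")
```

The ffi and ffl results do not follow this simple pattern: a byte swap alone would give three characters each, while the pasted text has only two.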

---

However, if I use:

gs -sDEVICE=pdfwrite -dNEWPDF=false -o out-old.pdf -f input.pdf

And copy/paste, I get "ff, fi, fl, ffi, ffl" again.

---

I am not sure if this is a problem with gs or with luaLaTeX (my thought is it's the latter). However, since there is a difference between the old and new PDF interpreters, I thought it warranted a bug report.

I haven't been able to test 10.01.1 yet, so I apologize if this has been fixed already.
Comment 1 Justin Beaty 2023-04-01 20:38:07 UTC
Created attachment 23945 [details]
output file with new interpreter
Comment 2 Justin Beaty 2023-04-01 20:38:21 UTC
Created attachment 23946 [details]
output file with old interpreter
Comment 3 Ken Sharp 2023-04-04 08:23:24 UTC
This commit:

34055411d34255d811dd091e7f771b92d4494600

fixes the problem with double characters. The problem with Unicode code point mappings exceeding 4 bytes already has a bug report:

https://bugs.ghostscript.com/show_bug.cgi?id=704674

The result there is somewhat different because that file uses a Font rather than a CIDFont, so the ToUnicode CMap gets dropped entirely, whereas this case produces incorrect values.

But fundamentally the problem is the same: the current code can't cope with ToUnicode CMaps which contain more than 4 bytes' worth of Unicode code points.
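
As a rough illustration of the size limit, ToUnicode destination strings are written as UTF-16BE, so each expanded letter costs two bytes. The sketch below assumes (this was not checked against the attached input.pdf) that each ligature glyph is mapped to its component letters. The helper name `utf16be_hex` is hypothetical.

```python
# Assumed destination text for each ligature glyph in the test file.
ligature_text = ["ff", "fi", "fl", "ffi", "ffl"]


def utf16be_hex(text: str) -> str:
    """Return the UTF-16BE bytes of text as an uppercase hex string."""
    return text.encode("utf-16-be").hex().upper()


for text in ligature_text:
    dest = utf16be_hex(text)
    size = len(dest) // 2  # two hex digits per byte
    status = "exceeds 4 bytes" if size > 4 else "fits in 4 bytes"
    print(f"{text:<3} <{dest}> {size} bytes, {status}")
```

Under that assumption, only the three-letter ligatures (ffi and ffl) need a 6-byte destination and so fall under the limit described here.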

We'll deal with that as one project so I'm just going to add the remaining part of this bug to that report.

*** This bug has been marked as a duplicate of bug 704674 ***
Comment 4 Justin Beaty 2023-04-04 18:37:45 UTC
(In reply to Ken Sharp from comment #3)
> This commit:
> 
> 34055411d34255d811dd091e7f771b92d4494600
> 
> fixes the problem with double characters.

Awesome, I've tested this commit and confirmed the double chars are fixed.

> The problem with Unicode code
> point mappings exceeding 4 bytes already has a bug report:
> 
> https://bugs.ghostscript.com/show_bug.cgi?id=704674

Got it, and thanks for adding me to the CC list there. Since it's a larger refactor, I understand it may take time. Fortunately, I don't have any actual PDFs that have this problem; I was just monkeying around with LaTeX a bit.