706533 – Copy/paste ligatures from luaLaTeX with new PDF interpreter produces invalid chars

Status:	RESOLVED DUPLICATE of bug 704674

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	Text
Version:	10.0.0
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Ken Sharp

Reported:	2023-04-01 20:37 UTC by Justin Beaty
Modified:	2025-05-14 06:33 UTC
CC List:	3 users

Attachments
input.pdf file (3.06 KB, application/pdf) 2023-04-01 20:37 UTC, Justin Beaty
output file with new interpreter (4.32 KB, application/pdf) 2023-04-01 20:38 UTC, Justin Beaty
output file with old interpreter (4.52 KB, application/pdf) 2023-04-01 20:38 UTC, Justin Beaty
input.pdf file (3.43 KB, application/pdf) 2025-05-14 06:32 UTC, Michael Wedl
out.pdf generated by ghostscript 10.05.0 (4.71 KB, application/pdf) 2025-05-14 06:33 UTC, Michael Wedl
out.pdf generated by ghostscript 10.04.0 (4.43 KB, application/pdf) 2025-05-14 06:33 UTC, Michael Wedl

Description Justin Beaty 2023-04-01 20:37:36 UTC

Created attachment 23944
input.pdf file

Hello again,

I have a very simple .tex file, which I compile with `lualatex input.tex`:

```
\documentclass{article}
\begin{document}
ff, fi, fl, ffi, ffl
\end{document}
```

If I open this file with evince or Adobe Reader and copy/paste, I get "ff, fi, fl, ffi, ffl" as expected.

---

Now, if I run the output file through gs 10.00.0:

gs -sDEVICE=pdfwrite -o out-new.pdf -f input.pdf

And copy/paste, I get "昀昀, 昀椀, 昀氀, 昀케, 昀툀".
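One observation about the pasted text, offered only as a sketch and not as a diagnosis of what pdfwrite does internally: for the two-letter ligatures, the pasted characters are what you get when the UTF-16 bytes of the expected text are read in the opposite byte order. The helper name `swap_utf16_byte_order` is made up for this illustration. The three-letter ligatures (ffi, ffl) do not follow this simple pattern, so they are left out.

```python
def swap_utf16_byte_order(text: str) -> str:
    """Encode text as UTF-16LE, then decode the same bytes as UTF-16BE.

    Hypothetical helper for illustration only; it mimics a byte-order
    mix-up and is not taken from Ghostscript.
    """
    return text.encode("utf-16-le").decode("utf-16-be")


# Only the two-letter ligatures from the report are checked here.
for ligature in ["ff", "fi", "fl"]:
    swapped = swap_utf16_byte_order(ligature)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in swapped)
    print(f"{ligature} -> {swapped} ({code_points})")
```

For "ff" the swapped result is U+6600 U+6600, for "fi" it is U+6600 U+6900, and for "fl" it is U+6600 U+6C00, which match the first three pasted strings above.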

---

However, if I use:

gs -sDEVICE=pdfwrite -dNEWPDF=false -o out-old.pdf -f input.pdf

And copy/paste, I get "ff, fi, fl, ffi, ffl" again.

---

I am not sure if this is a problem with gs or with luaLaTeX (my thought is it's the latter). However, since there is a difference between the old and new PDF interpreters, I thought it warranted a bug report.

I haven't been able to test 10.01.1 yet, so I apologize in case this has been fixed already.

Comment 1 Justin Beaty 2023-04-01 20:38:07 UTC

Created attachment 23945
output file with new interpreter

Comment 2 Justin Beaty 2023-04-01 20:38:21 UTC

Created attachment 23946
output file with old interpreter

Comment 3 Ken Sharp 2023-04-04 08:23:24 UTC

This commit:

34055411d34255d811dd091e7f771b92d4494600

fixes the problem with double characters. The problem with Unicode code point mappings exceeding 4 bytes already has a bug report:

https://bugs.ghostscript.com/show_bug.cgi?id=704674

The result there is somewhat different because that file uses a Font rather than a CIDFont, so the ToUnicode CMap gets dropped entirely, whereas in this case it is kept but ends up with incorrect values.

But fundamentally the problem is the same: the current code can't cope with ToUnicode CMaps whose entries contain more than 4 bytes' worth of Unicode code points.
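To make the 4-byte limit concrete, here is a minimal sketch. It assumes the ToUnicode CMap maps each ligature glyph to its expanded letters encoded as UTF-16BE (the encoding the PDF specification uses for ToUnicode values); the attached input.pdf was not inspected to confirm this, and the function names are invented for the illustration.

```python
LIGATURES = ["ff", "fi", "fl", "ffi", "ffl"]


def tounicode_hex(text: str) -> str:
    """Return the UTF-16BE bytes of text as an uppercase hex string,
    i.e. the form a ToUnicode destination value would take."""
    return text.encode("utf-16-be").hex().upper()


def exceeds_four_bytes(text: str) -> bool:
    """True if the UTF-16BE encoding of text needs more than 4 bytes."""
    return len(text.encode("utf-16-be")) > 4


for ligature in LIGATURES:
    encoded = tounicode_hex(ligature)
    note = "more than 4 bytes" if exceeds_four_bytes(ligature) else "fits in 4 bytes"
    print(f"{ligature}: <{encoded}> ({len(encoded) // 2} bytes, {note})")
```

Under that assumption, ff, fi and fl each need 4 bytes, while ffi and ffl need 6, so only the three-letter ligatures fall into the "more than 4 bytes" case.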

We'll deal with that as one project so I'm just going to add the remaining part of this bug to that report.

*** This bug has been marked as a duplicate of bug 704674 ***

Comment 4 Justin Beaty 2023-04-04 18:37:45 UTC

(In reply to Ken Sharp from comment #3)
> This commit:
> 
> 34055411d34255d811dd091e7f771b92d4494600
> 
> fixes the problem with double characters.

Awesome, I've tested this commit and confirmed the double chars are fixed.

> The problem with Unicode code
> point mappings exceeding 4 bytes already has a bug report:
> 
> https://bugs.ghostscript.com/show_bug.cgi?id=704674

Got it and thanks for adding me to the CC list there. Since it's a larger refactor I understand it may take time. Fortunately, I don't have any actual PDFs that have this problem, I was just monkeying around with LaTeX a bit.

Comment 5 Michael Wedl 2025-05-14 06:31:23 UTC

Hi,

It seems that Ghostscript 10.05.0 reintroduced the double-character bug as a regression. The bug is not present in Ghostscript 10.04.0.

I encountered the bug with a PDF generated by WeasyPrint (copy/paste works as expected on the input PDF) after compressing it with Ghostscript (`gs -sDEVICE=pdfwrite -o out.pdf -f input.pdf`).

Comment 6 Michael Wedl 2025-05-14 06:32:21 UTC

Created attachment 26793
input.pdf file

Comment 7 Michael Wedl 2025-05-14 06:33:28 UTC

Created attachment 26794
out.pdf generated by ghostscript 10.05.0

Comment 8 Michael Wedl 2025-05-14 06:33:59 UTC

Created attachment 26795
out.pdf generated by ghostscript 10.04.0