Bug 707960 - Simple text extraction changed from MuPDF 1.22 to 1.24.8
Status: RESOLVED DUPLICATE of bug 707859
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: 1.24.8
Hardware: PC Windows 10
Importance: P2 normal
Assignee: MuPDF bugs
Reported: 2024-08-16 23:55 UTC by Ardo
Modified: 2024-09-02 21:25 UTC



Attachments
Text with ligatures and accented letters (12.59 KB, application/pdf)
2024-08-16 23:55 UTC, Ardo

Description Ardo 2024-08-16 23:55:41 UTC
Created attachment 25957
Text with ligatures and accented letters

Please see the attached PDF (ligatures.pdf).
Note that it contains ligatures (fi, fl) and other non-ASCII characters.

Here is the text extracted with mutool (from MuPDF 1.22):
  mutool convert -F text -O preserve-ligatures  ligatures.pdf
---
Zzz..
L’ape si pos`o sul minu-
scolo fiore blu fluore-
scente.
---


And here is the result with the new mutool (from MuPDF 1.24.8):
---
Zzz..
L’ape si pos`
o sul minu-
scolo fiore blu fluore-
scente.
---

As you can see, the ligatures are still OK, but the accented "o" is extracted
in the first case as
`o

and in the second case as
`<newline>o
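
To pin down exactly where the two outputs diverge, a line diff is enough. The following is a minimal sketch using Python's standard `difflib`; the two strings are copied from the outputs quoted in this report, and the helper name `changed_lines` is only illustrative.

```python
import difflib

# Outputs copied from this report (MuPDF 1.22 vs. MuPDF 1.24.8).
old_output = "Zzz..\nL’ape si pos`o sul minu-\nscolo fiore blu fluore-\nscente.\n"
new_output = "Zzz..\nL’ape si pos`\no sul minu-\nscolo fiore blu fluore-\nscente.\n"


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the removed ('-') and added ('+') lines of a unified diff."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    body = diff[2:]  # skip the '---' / '+++' file header lines
    return [line for line in body if line.startswith(("+", "-"))]


for line in changed_lines(old_output, new_output):
    print(line)
```

The only change reported is the single line containing the accent, which the newer version splits in two right after the grave accent.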

----

Now, I would have expected the accented "o" to be extracted as U+00F2 (ò), but more importantly, I don't see why a newline is inserted between the accent and the "o".
What is the correct result?
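
For context on the U+00F2 expectation: Unicode normalization alone would not produce it, because the extracted accent is a spacing grave (U+0060) placed before the letter, not a combining grave (U+0300) placed after it. The sketch below illustrates this with Python's `unicodedata`, and adds a hypothetical post-processing helper (`compose_grave`, not part of MuPDF) that assumes every "`" followed by a letter, optionally across a newline, is meant as an accent. That assumption holds for this sample but would misfire on text containing literal backticks.

```python
import re
import unicodedata

SPACING_GRAVE = "\u0060"    # "`" as it appears in the extracted text
COMBINING_GRAVE = "\u0300"  # combining grave accent
O_GRAVE = "\u00f2"          # precomposed "ò"

# NFC composes a base letter followed by a combining mark ...
assert unicodedata.normalize("NFC", "o" + COMBINING_GRAVE) == O_GRAVE
# ... but leaves a spacing grave followed by "o" unchanged.
assert unicodedata.normalize("NFC", SPACING_GRAVE + "o") == SPACING_GRAVE + "o"


def compose_grave(text: str) -> str:
    """Hypothetical workaround: turn '`' + optional newline + letter into an accented letter."""
    def repl(match: re.Match) -> str:
        # Put a combining grave after the base letter, then let NFC compose the pair.
        return unicodedata.normalize("NFC", match.group(1) + COMBINING_GRAVE)

    return re.sub(r"`\n?([A-Za-z])", repl, text)


print(compose_grave("L’ape si pos`\no sul minu-"))
```

Applied to either version's output, this yields the same line with a precomposed "ò", at the cost of guessing the author's intent from the character sequence.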
Comment 1 Sebastian Rasmussen 2024-09-02 21:25:39 UTC
Bisecting reveals that the added newline issue was fixed by the fix for bug 707859, recently included in 1.24.9.

*** This bug has been marked as a duplicate of bug 707859 ***