Summary: | Simple text extraction changed from MuPDF 1.22 to 1.24.8 | ||
---|---|---|---|
Product: | MuPDF | Reporter: | Ardo <aldo.w.buratti> |
Component: | mupdf | Assignee: | MuPDF bugs <mupdf-bugs> |
Status: | RESOLVED DUPLICATE | ||
Severity: | normal | CC: | sebastian.rasmussen |
Priority: | P2 | ||
Version: | 1.24.8 | ||
Hardware: | PC | ||
OS: | Windows 10 | ||
Customer: | Word Size: | --- | |
Attachments: | Text with ligatures and accented letters |
Bisecting reveals that the added newline issue was fixed by the change for bug 707859, which was recently included in 1.24.9. *** This bug has been marked as a duplicate of bug 707859 *** |
Created attachment 25957
Text with ligatures and accented letters

Please see the attached PDF (ligatures.pdf). Note that it contains ligatures (fi) (fl) and other non-ASCII characters.

Here is the text extracted with mutool from MuPDF 1.22:

mutool convert -F text -O preserve-ligatures ligatures.pdf

---
Zzz.. L’ape si pos`o sul minu-
scolo fiore blu fluore-
scente.
---

And here is the result with the new mutool (from MuPDF 1.24.8):

---
Zzz.. L’ape si pos`
o sul minu-
scolo fiore blu fluore-
scente.
---

As you can see, the ligatures are still OK, but the accented "o" is extracted as `o in the first case and as `<newline>o in the second.

I would have expected the accented "o" to be extracted as U+00F2, but more importantly, I don't see why a newline is inserted between the accent and the "o".

What is the correct result?
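
For anyone who needs U+00F2 in the extracted text regardless of which mutool version produced it, the following is a minimal post-processing sketch, not something MuPDF does itself. It assumes the accent arrives as a spacing character (such as the grave accent U+0060) placed before its base letter, possibly separated by whitespace or a newline as in the 1.24.8 output. The function name `compose_accents` and the accent table are hypothetical and deliberately incomplete.

```python
import re
import unicodedata

# Hypothetical, incomplete table: spacing accent -> combining mark.
SPACING_TO_COMBINING = {
    "`": "\u0300",       # grave accent
    "\u00b4": "\u0301",  # acute accent
    "^": "\u0302",       # circumflex
    "\u00a8": "\u0308",  # diaeresis
}

# A spacing accent, optional whitespace (including a stray newline),
# then an ASCII base letter.
_ACCENT_THEN_LETTER = re.compile(
    "([" + "".join(re.escape(c) for c in SPACING_TO_COMBINING) + "])"
    r"[ \t\r\n]*([A-Za-z])"
)


def compose_accents(text: str) -> str:
    """Replace 'accent + letter' pairs with the NFC-composed character."""

    def _compose(match: re.Match) -> str:
        accent, letter = match.group(1), match.group(2)
        # A combining mark follows its base letter; NFC then merges the
        # pair into one code point when a precomposed form exists.
        return unicodedata.normalize("NFC", letter + SPACING_TO_COMBINING[accent])

    return _ACCENT_THEN_LETTER.sub(_compose, text)


if __name__ == "__main__":
    old_output = "Zzz.. L\u2019ape si pos`o sul minu-"    # as from 1.22
    new_output = "Zzz.. L\u2019ape si pos`\no sul minu-"  # as from 1.24.8
    print(compose_accents(old_output))
    print(compose_accents(new_output))
```

Both sample strings come out with "pos\u00f2" (posò). The sketch is only safe for plain prose: it also rewrites legitimate backticks or carets that happen to precede a letter, and it removes the line break between the accent and the letter.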