Summary: | Text extraction with MuPDF outputs gibberish on this file, but other tools work well | ||
---|---|---|---|
Product: | MuPDF | Reporter: | Jean Monet <jeanmonet> |
Component: | mupdf | Assignee: | MuPDF bugs <mupdf-bugs> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | jorj.x.mckie, spambin |
Priority: | P4 | ||
Version: | 1.18.0 | ||
Hardware: | PC | ||
OS: | Linux | ||
Customer: | | Word Size: | --- |
Attachments: | PDF on which MuPDF outputs gibberish (text extraction); sample roughly hacked to use Arial | ||
Description
Jean Monet
2021-04-30 21:37:06 UTC
Created attachment 20957
sample roughly hacked to use Arial
@ Tor
Sorry to piggyback on this issue, but I encounter many such examples where the font is supposedly embedded but the text is not searchable / not converted in some common PDF viewers (defeating the whole point of using a PDF).
I botched the substitution (ignoring heights etc.) so roughly that I missed the néant (a Freudian slip over a nothingness :-)
I substituted Arial for the font (since I use Windows) and noticed that the file size dropped by roughly half, which I did not expect, although as plain text the content is < 4 KB.
I appreciate that the core issue raised by the OP is why the embedded fonts do not display in MuPDF on Linux, but I would be interested in related observations, such as the file size the fonts occupy.
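On the file size question, one thing worth checking is whether a font actually carries an embedded font program. In PDF, that data is stored as a stream referenced from the font descriptor under /FontFile, /FontFile2 or /FontFile3. The sketch below is only an illustration: it uses plain Python dicts as a stand-in for parsed PDF objects, and the function name, the descriptor contents and the byte counts are all made up for the example (this is not the MuPDF API).

```python
# Keys under which a PDF font descriptor references an embedded font program.
FONT_FILE_KEYS = ("FontFile", "FontFile2", "FontFile3")


def embedded_font_bytes(descriptor):
    """Return the total size of embedded font data in a font descriptor.

    `descriptor` is a hypothetical dict model of a parsed font descriptor,
    where an embedded font program is represented directly as bytes.
    """
    return sum(len(descriptor[key]) for key in FONT_FILE_KEYS if key in descriptor)


# Illustrative descriptors only; sizes are invented for the example.
original = {"FontName": "SomeEmbeddedFont", "FontFile2": b"\x00" * 20000}
substituted = {"FontName": "Arial"}  # no font program, relies on a system font

print(embedded_font_bytes(original))
print(embedded_font_bytes(substituted))
```

A descriptor with none of these keys contributes no font data to the file, so the viewer has to find the font elsewhere.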
The attached file is not spec compliant, it uses a /ToUnicode which is a name and not a stream. There's a simple workaround though, which shouldn't have any negative effects on well formed PDF files.

Fixed in

commit 546531bc9ba1afb53ead70dbb1860bddbb5053ce
Author: Tor Andersson <tor.andersson@artifex.com>
Date: Mon May 3 14:15:30 2021 +0200

    Bug 703823: Support ToUnicode with built-in CMaps.

    The example file has a /ToUnicode /Identity-H. Add support for this,
    and also any other built in CMap while we're at it.

Mr (Ms?) Spambin, the file size difference is easily explained: your modified version does not contain any embedded font data where the original does.

*** Bug 703213 has been marked as a duplicate of this bug. ***

Thank you for your very quick intervention! Will keenly wait for the updated version on conda-forge.
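For readers following the fix, the idea described in the commit message can be sketched roughly as follows. This is not MuPDF's code: `Name`, `Stream`, `BUILTIN_CMAPS` and `resolve_to_unicode` are hypothetical names, and the set of built-in CMaps is reduced to two well-known predefined ones for illustration. The only point shown is the dispatch: a /ToUnicode that is a stream is used as an embedded CMap, while a /ToUnicode that is a name is looked up among the built-in CMaps instead of being ignored.

```python
from dataclasses import dataclass


@dataclass
class Name:
    """Hypothetical model of a PDF name object, e.g. /Identity-H."""
    value: str


@dataclass
class Stream:
    """Hypothetical model of a PDF stream object holding CMap data."""
    data: bytes


# Assumption: a tiny stand-in for the table of CMaps a reader ships with.
BUILTIN_CMAPS = {"Identity-H", "Identity-V"}


def resolve_to_unicode(font):
    """Decide where the ToUnicode mapping for a font dict should come from."""
    obj = font.get("ToUnicode")
    if isinstance(obj, Stream):
        # Spec-compliant case: the CMap is embedded in the file.
        return ("embedded", obj.data)
    if isinstance(obj, Name) and obj.value in BUILTIN_CMAPS:
        # Non-compliant but recoverable case, as in the example file.
        return ("builtin", obj.value)
    # Missing or unrecognised entry: no ToUnicode mapping available.
    return None


print(resolve_to_unicode({"ToUnicode": Name("Identity-H")}))
```

Well-formed files take the stream branch as before, which matches the remark that the workaround should have no negative effect on them.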