Created attachment 20956
PDF on which MuPDF outputs gibberish (text extraction)

Please see the attached PDF. MuPDF does not decode the text in this PDF properly and outputs gibberish on text extraction.

Text extraction from this document should be possible, because other tools manage it:
- tika (https://github.com/chrismattmann/tika-python, a wrapper for http://tika.apache.org/): text extraction works perfectly.
- Edge: when I open the PDF in Edge's PDF viewer, I can copy/paste the text correctly, so text decoding works as expected.

Admittedly, some other PDF tools also fail on this document, such as pdftotext (poppler). But since some tools do extract the text correctly, it is clearly possible, and it would be great if MuPDF could do it too.
Created attachment 20957
sample roughly hacked to use Arial

@Tor, sorry to piggyback on this issue, but I encounter many such examples where the font is supposedly embedded yet the text is not searchable or not converted in some common PDF viewers (defeating the whole point of using a PDF).

I substituted the font with Arial (since I use Windows), and did it so roughly (ignoring heights etc.) that I missed the "néant" (a Freudian slip over a nothingness :-). I noticed the file size dropped drastically, by about half, which I did not expect, although as plain text the content is < 4 KB.

I appreciate that the core issue in the original post is why the embedded fonts do not display in MuPDF on Linux, but I would be interested in related observations, such as how much of the file size they occupy.
The attached file is not spec compliant: it uses a /ToUnicode entry which is a name and not a stream. There's a simple workaround though, which shouldn't have any negative effects on well-formed PDF files.

Fixed in:

commit 546531bc9ba1afb53ead70dbb1860bddbb5053ce
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Mon May 3 14:15:30 2021 +0200

    Bug 703823: Support ToUnicode with built-in CMaps.

    The example file has a /ToUnicode /Identity-H. Add support for this,
    and also any other built in CMap while we're at it.
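For readers following along, the idea behind the workaround can be sketched in a few lines of Python. This is only an illustration, not MuPDF's actual C code: `PdfName`, `PdfStream`, `BUILTIN_CMAPS`, `parse_bfchar`, `resolve_to_unicode` and `decode` are all hypothetical names, the embedded-CMap parser handles only simple `bfchar` entries, and it assumes that an identity CMap used as /ToUnicode means "the 2-byte character code is the Unicode code point".

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union


@dataclass(frozen=True)
class PdfName:
    """Hypothetical stand-in for a PDF name object such as /Identity-H."""
    value: str


@dataclass(frozen=True)
class PdfStream:
    """Hypothetical stand-in for a PDF stream holding an embedded CMap."""
    data: bytes


CodeToUnicode = Callable[[int], Optional[str]]


def _identity(code: int) -> Optional[str]:
    # Assumption: with an identity CMap, the code is read as the code point.
    return chr(code)


# Hypothetical table of built-in CMaps, keyed by name.
BUILTIN_CMAPS: Dict[str, CodeToUnicode] = {
    "Identity-H": _identity,
    "Identity-V": _identity,
}


def parse_bfchar(data: bytes) -> CodeToUnicode:
    """Very simplified embedded-CMap parser: bfchar sections only."""
    table: Dict[int, str] = {}
    text = data.decode("latin-1")
    for block in re.findall(r"beginbfchar(.*?)endbfchar", text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return table.get


def resolve_to_unicode(entry: Union[PdfName, PdfStream, None]) -> Optional[CodeToUnicode]:
    if isinstance(entry, PdfStream):
        # The spec-compliant case: /ToUnicode is a stream with an embedded CMap.
        return parse_bfchar(entry.data)
    if isinstance(entry, PdfName):
        # The workaround: /ToUnicode is a name, so look up a built-in CMap.
        return BUILTIN_CMAPS.get(entry.value)
    return None


def decode(codes: bytes, entry: Union[PdfName, PdfStream, None]) -> str:
    """Decode 2-byte big-endian character codes using the resolved mapping."""
    mapping = resolve_to_unicode(entry)
    if mapping is None:
        return ""
    out = []
    for i in range(0, len(codes) - 1, 2):
        code = int.from_bytes(codes[i:i + 2], "big")
        out.append(mapping(code) or "\ufffd")  # U+FFFD for unmapped codes
    return "".join(out)
```

In this sketch, well-formed files still take the stream branch unchanged, which is why handling the name case should not affect them; only files whose /ToUnicode is a name reach the built-in lookup.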
Mr (Ms?) Spambin, the file size difference is easily explained: your modified version does not contain any embedded font data where the original does.
*** Bug 703213 has been marked as a duplicate of this bug. ***
Thank you for your very quick intervention! Will keenly wait for the updated version on conda-forge.