703823 – Text extraction with MuPDF outputs gibberish on this file, but other tools work well

Status:	RESOLVED FIXED

Alias:	None

Product:	MuPDF
Classification:	Unclassified
Component:	mupdf
Version:	1.18.0
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	MuPDF bugs

Duplicates (1):	703213

Reported:	2021-04-30 21:37 UTC by Jean Monet
Modified:	2021-05-06 19:41 UTC
CC List:	2 users

Attachments
PDF on which MuPDF outputs gibberish (text extraction) (31.78 KB, application/pdf) 2021-04-30 21:37 UTC, Jean Monet
sample roughly hacked to use Arial (15.01 KB, application/pdf) 2021-05-02 00:09 UTC, spambin

Description Jean Monet 2021-04-30 21:37:06 UTC

Created attachment 20956
PDF on which MuPDF outputs gibberish (text extraction)

Please see attached PDF.

MuPDF does not decode the text in this PDF properly and outputs gibberish.

However, text extraction from this document should be possible, because other tools handle it properly, such as:
- tika (https://github.com/chrismattmann/tika-python, wrapper for http://tika.apache.org/): text extraction works perfectly.
- Edge: when I open the PDF in Edge's PDF viewer, I can copy and paste the text correctly, so text decoding works as expected.

Admittedly, some other PDF tools, such as pdftotext (poppler), do not handle this document either. However, since some tools do manage to extract the text well, it is clearly possible, and it would be great if MuPDF could do it too.

Comment 1 spambin 2021-05-02 00:09:18 UTC

Created attachment 20957
sample roughly hacked to use Arial

@ Tor
Sorry to piggyback on this issue, but I encounter many such examples where the font is supposedly embedded but the text is not searchable or not converted in some common PDF viewers (defeating the whole point of using a PDF).

I botched the substitution (ignoring heights etc.) so roughly that I missed the "néant" (a Freudian slip over a nothingness :-)
I substituted Arial for the font (since I use Windows) and noticed that the file size dropped by half, which I did not expect, although as plain text the content is < 4KB.

I appreciate that the OP's core issue is why the embedded fonts do not display in MuPDF on Linux, but I would be interested in related observations, such as the file size they occupy.

Comment 2 Tor Andersson 2021-05-04 19:00:45 UTC

The attached file is not spec-compliant: its /ToUnicode entry is a name rather than a stream. There's a simple workaround though, which shouldn't have any negative effects on well-formed PDF files.
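
To illustrate the distinction between a /ToUnicode that is a name and one that is a stream, here is a minimal Python sketch that classifies the entry in a raw font dictionary. It is not MuPDF code: the function name `classify_tounicode` is hypothetical, and it assumes the dictionary is available as uncompressed bytes (it will not see dictionaries stored inside compressed object streams). Since streams are always indirect objects in a PDF, a compliant entry is expected to show up as a reference such as `12 0 R`.

```python
import re

# "/ToUnicode" followed by either a name (e.g. /Identity-H)
# or an indirect reference (e.g. 12 0 R).
_TOUNICODE = re.compile(
    rb"/ToUnicode\s*(?:(/[^\s/<>\[\]()]+)|(\d+)\s+(\d+)\s+R)"
)


def classify_tounicode(font_dict: bytes) -> str:
    """Report how /ToUnicode is expressed in a raw font dictionary.

    Returns "missing", "reference" (the form that points at a stream),
    or "name:<Name>" (the non-compliant form described in this bug).
    """
    match = _TOUNICODE.search(font_dict)
    if match is None:
        return "missing"
    if match.group(1) is not None:
        # Drop the leading slash of the PDF name.
        return "name:" + match.group(1)[1:].decode("latin-1")
    return "reference"


if __name__ == "__main__":
    sample = b"<< /Type /Font /Encoding /Identity-H /ToUnicode /Identity-H >>"
    print(classify_tounicode(sample))
```

A result starting with `name:` would indicate the kind of file discussed here; a real checker should use a proper PDF parser rather than a regular expression.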

Fixed in commit 546531bc9ba1afb53ead70dbb1860bddbb5053ce
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Mon May 3 14:15:30 2021 +0200

    Bug 703823: Support ToUnicode with built-in CMaps.
    
    The example file has a /ToUnicode /Identity-H.
    Add support for this, and also any other built in CMap while we're at it.
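
As a rough illustration of what treating /Identity-H as a ToUnicode mapping could mean for text extraction, the sketch below decodes a string of 2-byte character codes by taking each code as the Unicode code point itself. This is an assumption about the effect of the workaround, not a transcription of the MuPDF commit; `decode_identity_h` is a hypothetical helper name.

```python
def decode_identity_h(data: bytes) -> str:
    """Decode 2-byte big-endian codes, assuming code == Unicode code point.

    Assumption: with an identity mapping used as ToUnicode, every
    character code in the string is already the Unicode value.
    """
    if len(data) % 2 != 0:
        raise ValueError("Identity-H strings use 2 bytes per character code")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "big"))
        for i in range(0, len(data), 2)
    )


if __name__ == "__main__":
    # Hex string as it might appear in a content stream: <006E00E90061006E0074>
    print(decode_identity_h(bytes.fromhex("006E00E90061006E0074")))
```

If the codes in a file are glyph indices rather than Unicode values, this identity decoding would still produce gibberish, so it only helps for files whose producer really meant the codes to be Unicode.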

Comment 3 Tor Andersson 2021-05-04 19:04:56 UTC

Mr (Ms?) Spambin, the file size difference is easily explained: your modified version does not contain any embedded font data, whereas the original does.

Comment 4 Tor Andersson 2021-05-05 12:51:47 UTC

*** Bug 703213 has been marked as a duplicate of this bug. ***

Comment 5 Jean Monet 2021-05-06 19:41:51 UTC

Thank you for your very quick intervention! I will eagerly await the updated version on conda-forge.