Bug 701977

Summary: Text search with `mupdf` and `mutool` does not work due to extra spaces / REGRESSION
Product: MuPDF
Reporter: OLCC <oc-spam65>
Component: mupdf
Assignee: MuPDF bugs <mupdf-bugs>
Status: RESOLVED FIXED
Severity: normal
CC: daniel, jorj.x.mckie, simone.perriello, willus0
Priority: P4
Version: 1.15.0
Hardware: PC
OS: Linux
Customer:
Word Size: ---
Attachments: Test example

Description OLCC 2019-12-07 22:53:20 UTC
Created attachment 18694
Test example

Hello,

Please open the attached PDF document with `mupdf` and search for the word "lorem". The search fails to find it on my Debian Testing system with `mupdf` version 1.15.0. The reason is that `mupdf` extracts the text with many extra spaces, which makes the search impossible.

$ mutool draw -F txt test.pdf | head -1
Lor e m

The attached document was created with `pdflatex`. I raised a discussion on the pdftex mailing list, where the conclusion is probably that `mupdf` should not interpret tiny spaces as genuine spaces.
https://tug.org/pipermail/pdftex/2019-December/009162.html

Also, this problem is a REGRESSION between version 1.14.0 and version 1.15.0. 

Indeed, with version 1.14.0, the search worked and we have the correct result:

$ mutool draw -F txt test.pdf | head -1 
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
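
To make the failure concrete, here is a minimal sketch of why a search for "lorem" misses the 1.15.0 output but hits the 1.14.0 output. It is not how MuPDF implements search; the function names are made up for illustration, and the whitespace-insensitive variant is only a diagnostic to confirm that the two outputs differ by spaces alone.

```python
def found(term, text):
    # Plain case-insensitive substring search (illustrative only).
    return term.lower() in text.lower()


def found_ignoring_spaces(term, text):
    # Diagnostic variant: strip all whitespace before comparing.
    def squeeze(s):
        return "".join(s.split()).lower()
    return squeeze(term) in squeeze(text)


out_1_15 = "Lor e m"
out_1_14 = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."

print(found("lorem", out_1_15))                  # False
print(found("lorem", out_1_14))                  # True
print(found_ignoring_spaces("lorem", out_1_15))  # True
```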

This regression was also reported against version 1.16.0:
https://bugs.ghostscript.com/show_bug.cgi?id=701602

Olivier
Comment 1 Tor Andersson 2020-01-02 14:06:15 UTC
The issue here is with the font metrics of the embedded font, possibly related
to FreeType.

The advance of most characters in the embedded font, as reported by FreeType,
is 0. This makes MuPDF believe that there is a gap between each character as
wide as the character itself, and it therefore inserts an artificial space.

Looking at the actual charstrings in the font file I do see plenty of non-zero 'hsbw' instructions, leading me to suspect this may be a problem with FreeType.
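
The effect of a zero advance on a gap-based space heuristic can be sketched as follows. This is a simplified model, not MuPDF's actual text extraction code: the `Glyph` type, the `space_ratio` threshold, and the glyph positions and advances are all invented for illustration. The idea is only that a space is emitted when the next glyph starts noticeably to the right of where the previous glyph's advance says it should.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float        # glyph origin on the baseline (from the page content)
    advance: float  # advance width as reported by the font library


def extract_text(glyphs, font_size=10.0, space_ratio=0.2):
    """Join glyphs, inserting a space wherever the gap is large.

    space_ratio is a hypothetical threshold (fraction of the font size).
    """
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            expected_x = prev.x + prev.advance
            gap = g.x - expected_x
            if gap > space_ratio * font_size:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Made-up positions for the word "Lorem" set at 10 units.
positions = [("L", 0.0), ("o", 6.0), ("r", 11.0), ("e", 15.0), ("m", 19.5)]
good_adv = [6.0, 5.0, 4.0, 4.5, 8.0]  # correct advances
bad_adv = [6.0, 5.0, 0.0, 0.0, 8.0]   # "r" and "e" wrongly report 0

good = [Glyph(c, x, a) for (c, x), a in zip(positions, good_adv)]
bad = [Glyph(c, x, a) for (c, x), a in zip(positions, bad_adv)]

print(extract_text(good))  # Lorem
print(extract_text(bad))   # Lor e m
```

With correct advances every gap is zero and no space is inserted. When a glyph reports an advance of 0, the computed gap equals the glyph's own width, which exceeds the threshold.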
Comment 2 Tor Andersson 2020-01-02 15:16:45 UTC
*** Bug 701602 has been marked as a duplicate of this bug. ***
Comment 3 Tor Andersson 2020-01-02 15:17:11 UTC
*** Bug 701979 has been marked as a duplicate of this bug. ***
Comment 4 Tor Andersson 2020-01-08 10:23:36 UTC
The FreeType bug has been reported upstream here:

https://savannah.nongnu.org/bugs/?57519
Comment 5 Tor Andersson 2020-01-22 10:00:15 UTC
commit 82196fd87d98e3c2412049caf890f675ae802676
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Wed Jan 8 11:22:52 2020 +0100

    Bug 701977: Workaround for bug 57519 in FreeType.
    
    FT_Get_Advance has a bug with certain Type1 fonts, because the fast
    metrics parsing function does not handle 'div' operators.
    
    Disable the fast metrics for Type1 by forcing the use of the old
    Type1 engine for our builds. This will not work for builds using
    the system library, and this workaround should be removed as soon
    as we update to a FreeType release with a fix for this bug.
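
As a rough illustration of the failure mode named in the commit message, the toy model below treats a Type 1 charstring as a list of tokens in which `hsbw` takes a side bearing and a width, and `div` replaces the top two operands with their quotient. It is not FreeType's code: both functions are hypothetical, and the assumption that the fast path falls back to 0 when it meets an operator it does not understand is made only to match the zero advances reported in comment 1.

```python
def advance_full(tokens):
    """Interpret numbers and 'div' up to 'hsbw' and return the width."""
    stack = []
    for tok in tokens:
        if isinstance(tok, (int, float)):
            stack.append(tok)
        elif tok == "div":
            divisor = stack.pop()
            dividend = stack.pop()
            stack.append(dividend / divisor)
        elif tok == "hsbw":
            # Operands are: side bearing, then width.
            return stack[-1]
        else:
            raise ValueError(f"unsupported token: {tok!r}")
    raise ValueError("no hsbw found")


def advance_fast_without_div(tokens):
    """Hypothetical fast path that only understands '<sbx> <wx> hsbw'."""
    stack = []
    for tok in tokens:
        if isinstance(tok, (int, float)):
            stack.append(tok)
        elif tok == "hsbw" and len(stack) == 2:
            return stack[1]
        else:
            return 0  # assumed fallback: give up and report no advance
    return 0


plain = [30, 500, "hsbw"]
with_div = [30, 5000, 10, "div", "hsbw"]  # width written as 5000 / 10

print(advance_full(plain), advance_fast_without_div(plain))
print(advance_full(with_div), advance_fast_without_div(with_div))
```

In this model both parsers agree on the plain charstring, but only the full interpreter recovers the width from the one that uses `div`.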
Comment 6 Tor Andersson 2020-02-21 15:16:31 UTC
*** Bug 702141 has been marked as a duplicate of this bug. ***
Comment 7 Tor Andersson 2020-06-22 16:21:46 UTC
*** Bug 702509 has been marked as a duplicate of this bug. ***
Comment 8 OLCC 2020-11-12 11:15:50 UTC
On my Debian system, I upgraded the package "libfreetype6" from version "2.10.1-2" to version "2.10.2+dfsg-4", and searching in MuPDF now works as expected.

-- 
Olivier, with MuPDF version "1.17.0+ds1-1"
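
For builds that use the system FreeType library, a quick check of the installed package version might look like the sketch below. The threshold of 2.10.2 is inferred only from the observation in the comment above (2.10.1-2 was affected, 2.10.2+dfsg-4 was not), and the helper names are made up. Distributions may backport fixes, so the version number alone is not conclusive.

```python
import re

# Assumption based solely on this report: 2.10.2 is the first version
# observed to work.
FIRST_WORKING = (2, 10, 2)


def upstream_version(debian_version):
    """Extract (major, minor, patch) from a Debian package version string."""
    m = re.match(r"(?:\d+:)?(\d+)\.(\d+)(?:\.(\d+))?", debian_version)
    if not m:
        raise ValueError(f"unrecognised version: {debian_version!r}")
    return tuple(int(part or 0) for part in m.groups())


def has_freetype_fix(debian_version):
    return upstream_version(debian_version) >= FIRST_WORKING


print(has_freetype_fix("2.10.1-2"))       # False
print(has_freetype_fix("2.10.2+dfsg-4"))  # True
```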