Bug 701977 - Text search with `mupdf` and `mutool` does not work due to extra spaces / REGRESSION
Status: RESOLVED FIXED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: 1.15.0
Hardware: PC Linux
Importance: P4 normal
Assignee: MuPDF bugs
Duplicates: 701602 701979 702141 702509
Reported: 2019-12-07 22:53 UTC by OLCC
Modified: 2020-11-12 11:15 UTC

Attachments
Test example (33.31 KB, application/pdf)
2019-12-07 22:53 UTC, OLCC

Description OLCC 2019-12-07 22:53:20 UTC
Created attachment 18694
Test example

Hello,

Please open the attached PDF document with `mupdf` and search for the word "lorem". The search finds nothing on my Debian Testing system with `mupdf` version 1.15.0. The reason is that `mupdf` extracts the text with many extra spaces, so the search term never matches.

$ mutool draw -F txt test.pdf | head -1
Lor e m

The attached document was created with `pdflatex`. I raised the issue on the pdfTeX mailing list, where the tentative conclusion was that `mupdf` should not interpret tiny gaps as real spaces:
https://tug.org/pipermail/pdftex/2019-December/009162.html

Also, this problem is a REGRESSION between version 1.14.0 and version 1.15.0. 

Indeed, with version 1.14.0, the search worked and we have the correct result:

$ mutool draw -F txt test.pdf | head -1 
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
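
The difference between the two outputs can be checked mechanically. The following is a minimal sketch, assuming `mutool` is on the PATH. The helper names `extract_text` and `contains_word` are hypothetical and only wrap the command shown above; the substring test is a simplification and not how MuPDF's own search is implemented.

```python
import subprocess


def extract_text(pdf_path):
    """Run the command from the report: mutool draw -F txt <pdf>."""
    result = subprocess.run(
        ["mutool", "draw", "-F", "txt", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def contains_word(text, word):
    """Case-insensitive substring search (illustrative simplification)."""
    return word.lower() in text.lower()


# First output line as reported above for 1.15.0 and 1.14.0 respectively.
broken = "Lor e m"
expected = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."

print(contains_word(broken, "lorem"))
print(contains_word(expected, "lorem"))
```

With a local copy of the attachment, `contains_word(extract_text("test.pdf"), "lorem")` would give the same check against a real build.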

The same regression has also been reported for version 1.16.0:
https://bugs.ghostscript.com/show_bug.cgi?id=701602

Olivier
Comment 1 Tor Andersson 2020-01-02 14:06:15 UTC
The issue here is with the font metrics of the embedded font, possibly related to FreeType.

The advance of most characters in the embedded font, as reported by FreeType, is 0. This makes MuPDF believe that there is a gap between consecutive characters as wide as the character itself, so it inserts an artificial space.

Looking at the actual charstrings in the font file, I do see plenty of non-zero 'hsbw' instructions, which leads me to suspect that this is a problem in FreeType.
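
To illustrate the mechanism described in this comment, here is a toy sketch of gap-based space insertion. It is not MuPDF's actual code: the function `join_glyphs`, the threshold, the font size and the glyph coordinates are all assumptions chosen for illustration.

```python
def join_glyphs(glyphs, font_size=10.0, threshold=0.2):
    """Join (char, x_origin, advance) tuples into a string.

    A space is inserted when a glyph starts further right than the pen
    position predicted from the previous glyph's advance, by more than
    threshold * font_size. All values here are hypothetical.
    """
    out = []
    pen_x = None  # where the next glyph is expected to start
    for char, x, advance in glyphs:
        if pen_x is not None and x - pen_x > threshold * font_size:
            out.append(" ")
        out.append(char)
        pen_x = x + advance
    return "".join(out)


# Hypothetical layout: five glyphs, each 5 units wide, set tightly.
positions = [("L", 0.0), ("o", 5.0), ("r", 10.0), ("e", 15.0), ("m", 20.0)]
good = [(c, x, 5.0) for c, x in positions]  # correct advances
bad = [(c, x, 0.0) for c, x in positions]   # advances reported as 0

print(join_glyphs(good))  # Lorem
print(join_glyphs(bad))   # L o r e m
```

With zero advances the predicted pen position never moves past the previous origin, so every glyph appears to be preceded by a gap as wide as the previous glyph.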
Comment 2 Tor Andersson 2020-01-02 15:16:45 UTC
*** Bug 701602 has been marked as a duplicate of this bug. ***
Comment 3 Tor Andersson 2020-01-02 15:17:11 UTC
*** Bug 701979 has been marked as a duplicate of this bug. ***
Comment 4 Tor Andersson 2020-01-08 10:23:36 UTC
The FreeType bug has been reported upstream here:

https://savannah.nongnu.org/bugs/?57519
Comment 5 Tor Andersson 2020-01-22 10:00:15 UTC
commit 82196fd87d98e3c2412049caf890f675ae802676
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Wed Jan 8 11:22:52 2020 +0100

    Bug 701977: Workaround for bug 57519 in FreeType.
    
    FT_Get_Advance has a bug with certain Type1 fonts, because the fast
    metrics parsing function does not handle 'div' operators.
    
    Disable the fast metrics for Type1 by forcing the use of the old
    Type1 engine for our builds. This will not work for builds using
    the system library, and this workaround should be removed as soon
    as we update to a FreeType release with a fix for this bug.
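
For context: in Type 1 charstrings the advance width is the second operand of `hsbw` (`sbx wx hsbw`), and `div` replaces the top two operands with their quotient. The sketch below is an illustration only. It works on already-decoded tokens rather than binary charstrings, the function names and the sample tokens are hypothetical, and it does not reproduce how FreeType's fast path actually fails; it only shows how a parser that skips `div` can read a wrong width.

```python
def advance_with_div(tokens):
    """Evaluate tokens up to hsbw, honoring div, and return wx."""
    stack = []
    for tok in tokens:
        if tok == "div":
            den = stack.pop()
            num = stack.pop()
            stack.append(num / den)
        elif tok == "hsbw":
            wx = stack.pop()  # advance width is on top of the stack
            stack.pop()       # sbx (left side bearing), unused here
            return wx
        else:
            stack.append(tok)
    raise ValueError("no hsbw found")


def advance_ignoring_div(tokens):
    """Same walk, but silently skipping div (hypothetical faulty parser)."""
    stack = []
    for tok in tokens:
        if tok == "div":
            continue
        if tok == "hsbw":
            return stack.pop()
        stack.append(tok)
    raise ValueError("no hsbw found")


# Hypothetical charstring prefix: sbx = 30, wx = 5000 / 10.
tokens = [30, 5000, 10, "div", "hsbw"]

print(advance_with_div(tokens))      # 500.0
print(advance_ignoring_div(tokens))  # 10
```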
Comment 6 Tor Andersson 2020-02-21 15:16:31 UTC
*** Bug 702141 has been marked as a duplicate of this bug. ***
Comment 7 Tor Andersson 2020-06-22 16:21:46 UTC
*** Bug 702509 has been marked as a duplicate of this bug. ***
Comment 8 OLCC 2020-11-12 11:15:50 UTC
On my Debian system, I upgraded the package "libfreetype6" from version "2.10.1-2" to version "2.10.2+dfsg-4", and searching in MuPDF now works as expected.

-- 
Olivier, with MuPDF version "1.17.0+ds1-1"