Bug 700466 - Search results sometimes split into horizontally overlapping rectangles
Status: UNCONFIRMED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: fitz
Version: master
Hardware: PC Windows 7
Importance: P4 normal
Assignee: MuPDF bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-09 09:12 UTC by Tamir Evan
Modified: 2024-04-10 00:32 UTC

See Also:
Customer:
Word Size: ---


Attachments
mupdf-gl showing example_033 PDF with search for "commodo", zoomed to show problem (206.74 KB, image/png)
2019-01-09 09:12 UTC, Tamir Evan
test.js mentioned in the first comment (841 bytes, application/x-javascript)
2019-01-09 09:15 UTC, Tamir Evan
The image created by running test.js (423.01 KB, image/png)
2019-01-09 09:17 UTC, Tamir Evan
Image created by running patched test.js with fixed mutool (783.07 KB, image/png)
2019-01-13 11:36 UTC, Tamir Evan
Image created by running patched test.js with fixed mutool (780.39 KB, image/png)
2019-01-13 11:51 UTC, Tamir Evan
Evince selection example (142.68 KB, image/png)
2024-04-10 00:25 UTC, giuli635
KOReader selection example (200.55 KB, image/png)
2024-04-10 00:29 UTC, giuli635
PDF used to make the selection examples (76.54 KB, application/pdf)
2024-04-10 00:30 UTC, giuli635

Description Tamir Evan 2019-01-09 09:12:39 UTC
Created attachment 16681
mupdf-gl showing example_033 PDF with search for "commodo", zoomed to show problem

If I download the example 033 PDF from the tcpdf website (https://tcpdf.org/files/examples/example_033.pdf), open it with mupdf-gl built from the latest git source (commit 6c383df4f897c3ceb562807ed92fe6075efffdaf), and search for "commodo", the last result is shown (see attached image) with two vertical lines, before and after the second 'm', that are darker than the rest of the rectangle. The darker lines appear because the last result is split into three search result rectangles that overlap horizontally.

To demonstrate that, I created a JavaScript file (test.js, to be attached to my next comment) that prints the coordinates of each search result rectangle and creates an image illustrating the split and the overlap. When I run it with mutool from the same build as above:

    mutool run test.js

I get:

    Result 1:
        Xul = 185.01202392578126
        Yul = 204.28689575195313
        Xur = 225.01202392578126
        Yur = 204.28689575195313
        Xll = 185.01202392578126
        Yll = 217.62689208984376
        Xlr = 225.01202392578126
        Ylr = 217.62689208984376
    Result 2:
        Xul = 216.7663116455078
        Yul = 349.3339538574219
        Xur = 266.4544677734375
        Yur = 349.3339538574219
        Xll = 216.7663116455078
        Yll = 360.9745788574219
        Xlr = 266.4544677734375
        Ylr = 360.9745788574219
    Result 3:
        Xul = 185.32339477539063
        Yul = 459.0024719238281
        Xur = 203.61138916015626
        Yur = 459.0024719238281
        Xll = 185.32339477539063
        Yll = 471.93548583984377
        Xlr = 203.61138916015626
        Ylr = 471.93548583984377
    Result 4:
        Xul = 202.3243865966797
        Yul = 459.0024719238281
        Xur = 211.10838317871095
        Yur = 459.0024719238281
        Xll = 202.3243865966797
        Yll = 471.93548583984377
        Xlr = 211.10838317871095
        Ylr = 471.93548583984377
    Result 5:
        Xul = 209.82138061523438
        Yul = 459.0024719238281
        Xur = 225.1753692626953
        Yur = 459.0024719238281
        Xll = 209.82138061523438
        Yll = 471.93548583984377
        Xlr = 225.1753692626953
        Ylr = 471.93548583984377

and an image (to be attached to my third comment).

Note that for results 3-5, all upper Ys are the same, all lower Ys are the same, and the right Xs of each result are larger than the left Xs of the next one. What I should be getting is something like:

    [...]
    Result 3:
        Xul = 185.32339477539063
        Yul = 459.0024719238281
        Xur = 225.1753692626953
        Yur = 459.0024719238281
        Xll = 185.32339477539063
        Yll = 471.93548583984377
        Xlr = 225.1753692626953
        Ylr = 471.93548583984377
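
The expected output above amounts to merging quads that share a line and overlap horizontally. A minimal sketch of such a merge follows; it is an illustration only, not MuPDF's actual implementation. `rectToQuad`, `mergeOverlappingQuads`, and the `tolerance` parameter are hypothetical names. Each quad is assumed to be an array of eight numbers in the order printed above (ul, ur, ll, lr), and quads are assumed to arrive in left-to-right order, which will not hold for right-to-left text.

```javascript
// Hypothetical helpers for illustration; these are not part of MuPDF.

// Build an axis-aligned quad [ulx, uly, urx, ury, llx, lly, lrx, lry]
// from its left/top/right/bottom coordinates.
function rectToQuad(x0, y0, x1, y1) {
    return [x0, y0, x1, y0, x0, y1, x1, y1];
}

// Merge consecutive quads that lie on the same line (same top and bottom
// within `tolerance`) and whose horizontal extents overlap or touch.
// Assumes left-to-right ordering of the input.
function mergeOverlappingQuads(quads, tolerance) {
    var merged = [];
    var i;
    for (i = 0; i < quads.length; i++) {
        var q = quads[i];
        var last = merged.length > 0 ? merged[merged.length - 1] : null;
        var sameLine = last !== null &&
            Math.abs(last[1] - q[1]) <= tolerance &&
            Math.abs(last[5] - q[5]) <= tolerance;
        if (sameLine && q[0] <= last[2] + tolerance) {
            // Extend the right edge of the previous quad.
            last[2] = Math.max(last[2], q[2]);
            last[6] = Math.max(last[6], q[6]);
        } else {
            merged.push(q.slice());
        }
    }
    return merged;
}

// Results 2-5 from the output above.
var results = [
    rectToQuad(216.7663116455078, 349.3339538574219, 266.4544677734375, 360.9745788574219),
    rectToQuad(185.32339477539063, 459.0024719238281, 203.61138916015626, 471.93548583984377),
    rectToQuad(202.3243865966797, 459.0024719238281, 211.10838317871095, 471.93548583984377),
    rectToQuad(209.82138061523438, 459.0024719238281, 225.1753692626953, 471.93548583984377)
];
var merged = mergeOverlappingQuads(results, 0.001);
```

With this input, `merged` holds two quads: result 2 unchanged, and a single quad spanning X from 185.32339477539063 to 225.1753692626953, matching the expected "Result 3" above.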
Comment 1 Tamir Evan 2019-01-09 09:15:09 UTC
Created attachment 16683
test.js mentioned in the first comment
Comment 2 Tamir Evan 2019-01-09 09:17:06 UTC
Created attachment 16686
The image created by running test.js
Comment 3 Tor Andersson 2019-01-11 13:31:57 UTC
commit eaa4040b69fbb01f77056a4c40f7404627bc499b
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Wed Jan 9 15:35:30 2019 +0100

    Bug 700466: Use same quad merging threshold for text search as selection.
Comment 4 Tamir Evan 2019-01-13 11:36:21 UTC
Created attachment 16722
Image created by running patched test.js with fixed mutool

(In reply to Tor Andersson from comment #3)
> commit eaa4040b69fbb01f77056a4c40f7404627bc499b
> Author: Tor Andersson <tor.andersson@artifex.com>
> Date:   Wed Jan 9 15:35:30 2019 +0100
> 
>     Bug 700466: Use same quad merging threshold for text search as selection.

That commit gives the desired result for the example I gave, but doesn't solve the underlying problem.

Another example:

If I download the PDF from http://beta.hebrewbooks.org/pagefeed/hebrewbooks_org_9717_1.pdf (saved as HebrewBooksOrg_9717_page_1.pdf), patch test.js with:

--- test.js	2019-01-13 12:55:01.155221700 +0200
+++ test1.js	2019-01-13 12:55:06.818031600 +0200
@@ -1,4 +1,4 @@
-var doc = new Document('example_033.pdf');
+var doc = new Document('HebrewBooksOrg_9717_page_1.pdf');
 var page = doc.loadPage(0);
 
 var tansform = [4,0,0,4,0,0];
@@ -6,7 +6,7 @@
 var pixmap = page.toPixmap(tansform, DeviceRGB);
 var device = new DrawDevice(Identity, pixmap);
 
-var arr = page.search('commodo');
+var arr = page.search('\u05d4\u05e0\u05e9\u05de'); // He-Nun-Shin-Mem
 var i;
 for(i = 0; i < arr.length; i++)
 {
@@ -31,4 +31,4 @@
 }
 
 device.close();
-pixmap.saveAsPNG('example_033.png');
+pixmap.saveAsPNG('HebrewBooksOrg_9717_page_1.png');

and run it with mutool built from the latest git (commit eaa4040b69fbb01f77056a4c40f7404627bc499b), I get 11 rectangles (where I should now be getting 6), and the attached image.

The commit has improved the situation: if I run the patched test.js with an older version of mutool (built from commit 4f08f6adbbb7d6f5d3dc0257b9fc0bb79a3c55cd), I get 23 rectangles.
Comment 5 Tamir Evan 2019-01-13 11:51:12 UTC
Created attachment 16723
Image created by running patched test.js with fixed mutool

(By mistake I uploaded the wrong image)
Comment 6 giuli635 2024-04-10 00:25:31 UTC
Created attachment 25587
Evince selection example
Comment 7 giuli635 2024-04-10 00:29:49 UTC
Created attachment 25588
KOReader selection example
Comment 8 giuli635 2024-04-10 00:30:51 UTC
Created attachment 25589
PDF used to make the selection examples
Comment 9 giuli635 2024-04-10 00:32:01 UTC
Hello. Using KOReader, which uses MuPDF as its backend, I found a bug that I first thought was in KOReader. I created an issue there, but they found that it comes from MuPDF.

The issue is about selection boxes.
Here is how it is supposed to look, in Evince:
https://bugs.ghostscript.com/attachment.cgi?id=25587

Here is what happens with MuPDF in KOReader:
https://bugs.ghostscript.com/attachment.cgi?id=25588

And here is the page of the PDF used in the images:
https://bugs.ghostscript.com/attachment.cgi?id=25589

And finally, a diagnostic made by one of KOReader's maintainers:
    <span font="DLYZCT+LatinModernMath-Regular" wmode="0" bidi="0" trm="9.96264 0 0 9.96264">
        <g unicode="𝜆" glyph="4470" x="236.69" y="353.349" adv=".583"/>
        <g unicode="𝑥" glyph="1319" x="242.49822" y="353.349" adv=".572"/>
        <g unicode="." glyph="15" x="248.19684" y="353.349" adv=".278"/>
        <g unicode="𝑥" glyph="1319" x="250.96645" y="353.349" adv=".572"/>
        <g unicode="𝑧" glyph="1321" x="256.66508" y="353.349" adv=".465"/>
    </span>
    <span font="IWOTBY+LibreBaskerville-Regular" wmode="0" bidi="0" trm="10.161893 0 0 9.96264">
        <g unicode="a" glyph="66" x="264.867" y="353.349" adv=".554"/>
        <g unicode="n" glyph="79" x="270.4967" y="353.349" adv=".689"/>
        <g unicode="d" glyph="69" x="277.49827" y="353.349" adv=".675"/>
    </span>

04/09/24-21:40:58 DEBUG dict lookup word: 𝜆𝑥𝑦.𝑦𝑥 {
  "124x270+497+545"
} --[[table: 0x7c9d79e20440]]