Bug 706642 - Word breaking regression following workaround for strange text order
Status: RESOLVED FIXED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: 1.22.0
Hardware: All All
Importance: P4 normal
Assignee: MuPDF bugs
 
Reported: 2023-04-21 18:54 UTC by Mark Mentovai
Modified: 2023-06-06 06:42 UTC

Attachments
KJFK RNAV (GPS) Y RWY 31R, AIRAC 2304 (301.07 KB, application/pdf)
2023-04-21 18:54 UTC, Mark Mentovai

Description Mark Mentovai 2023-04-21 18:54:23 UTC
Created attachment 24173
KJFK RNAV (GPS) Y RWY 31R, AIRAC 2304

Starting in mupdf 1.22.0, and persisting in the current trunk, text extraction occasionally merges adjacent words into one, where previously they were extracted as independent words.

I’ve tracked this down to https://github.com/ArtifexSoftware/mupdf/commit/500c0299af8d086d0a20bb2c3a1e7d5c72872cc9, which closed bug 706426.

The attached file, 00610RY31R.PDF, exhibits the problem, which occurs in two different locations in this file:

1. In the “briefing strip”, the row above the plan view (map), “GND CON” and “CLNC DEL” run up against each other such that “CONCLNC” is incorrectly extracted as a single word.

2. In the “minimums table”, the table at the bottom right, at the last row (labeled CIRCLING) in columns A and B, the text “640-1 627 (700-1)” is extracted as “640-1627 (700-1)”.

I can confirm the difference. In this test, mutool.bad is built from https://github.com/ArtifexSoftware/mupdf/commit/0e5f97d675009246d6ea8a65e2c1481822457539 (current trunk), and mutool.good is the same with 500c0299af8d reverted.

mark@arm-and-hammer zsh% diff -u \
    <(mupdf_prefix/bin/mutool.good draw -F txt -KK 00610RY31R.PDF) \
    <(mupdf_prefix/bin/mutool.bad draw -F txt -KK 00610RY31R.PDF)
--- /dev/fd/11	2023-04-21 14:51:30
+++ /dev/fd/12	2023-04-21 14:51:30
@@ -118,8 +118,7 @@
 460/45
 447 (500-   )
 
-627 (700-1)
-640-1
+640-1627 (700-1)
 667 (700-1   )
 
 680-2
@@ -205,8 +204,7 @@
 13
 13
 
-CLNC DEL
-GND CON
+GND CONCLNC DEL
 NEW YORK APP CON
 KENNEDY TOWER
Comment 1 Robin Watts 2023-05-11 14:39:18 UTC
Fixed with:

commit dfaf6071fc6ef28c311dbedf81c2139fd1f8ffc4 (golden/master)
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Thu May 11 14:37:28 2023 +0100

    Bug 706642: Text extraction; maybe add spaces when prepending lines.

    When we prepend a line to another one, if there is a suitable gap
    between it and the line we are prepending it to, insert a space.

    Logic matches the addition of spaces when simply adding chars.

Probably also relies on:

commit bc140682ab56188d0bbec06a4572be06ccb406f8
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Thu May 11 12:14:56 2023 +0100

    Bug 706718: Don't prepend text extracted lines if vertically shifted.

    The bugfix for 706426 was incorrect, in that it did not check for
    text extracted lines being vertically shifted when considering them
    for prepending.

    Fixed here.

Thanks for the report.
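
For readers following along, the behaviour described in the two commit messages above can be sketched roughly as follows. This is an illustration only, not the MuPDF implementation: the `Line` type, the `prepend_line` helper, and both thresholds (`SPACE_GAP`, `SHIFT_TOL`) are hypothetical names and values chosen for the example. It assumes left-to-right text where each extracted line has a left edge, a right edge, a baseline, and a font size.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, expressed as fractions of the font size.
SPACE_GAP = 0.15   # horizontal gap above which a space is inserted
SHIFT_TOL = 0.20   # baseline difference above which lines are not joined


@dataclass
class Line:
    """Hypothetical stand-in for an extracted line of text."""
    text: str
    x0: float        # left edge
    x1: float        # right edge
    baseline: float  # vertical position of the baseline
    size: float      # font size


def prepend_line(existing: Line, new: Line) -> Optional[Line]:
    """Try to prepend `new` to the left of `existing`.

    Returns the merged line, or None when the two lines are vertically
    shifted relative to each other and should stay separate.
    """
    size = max(existing.size, new.size)

    # Vertical-shift check (the idea described for bug 706718).
    if abs(existing.baseline - new.baseline) > SHIFT_TOL * size:
        return None

    # Gap check (the idea described for bug 706642): insert a space only
    # when there is a suitable gap between the end of the prepended text
    # and the start of the existing text.
    gap = existing.x0 - new.x1
    separator = " " if gap > SPACE_GAP * size else ""

    return Line(
        text=new.text + separator + existing.text,
        x0=new.x0,
        x1=existing.x1,
        baseline=existing.baseline,
        size=size,
    )
```

With made-up coordinates, prepending "GND CON" (right edge at 96) to "CLNC DEL" (left edge at 100) at font size 8 leaves a gap of 4, which exceeds the assumed threshold of 1.2, so the sketch joins them with a space rather than producing "CONCLNC".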
Comment 2 Robin Powell 2023-06-06 06:42:19 UTC
Just for your interest in seeing how your stuff is used downstream, https://github.com/foobnix/LibreraReader/issues/1115 :)