Created attachment 24173
KJFK RNAV (GPS) Y RWY 31R, AIRAC 2304

Starting in mupdf 1.22.0, and persisting in the current trunk, text extraction occasionally merges adjacent words into one, where previously they were extracted as independent words. I’ve tracked this down to https://github.com/ArtifexSoftware/mupdf/commit/500c0299af8d086d0a20bb2c3a1e7d5c72872cc9, which closed bug 706426.

The attached file, 00610RY31R.PDF, exhibits the problem, which occurs in two different locations in this file:

1. In the “briefing strip”, the row above the plan view (map), “GND CON” and “CLNC DEL” run up against each other such that “CONCLNC” is incorrectly extracted as a single word.
2. In the “minimums table”, the table at the bottom right, at the last row (labeled CIRCLING) in columns A and B, the text “640-1 627 (700-1)” is extracted as “640-1627 (700-1)”.

I can confirm the difference. In this test, mutool.bad is built from https://github.com/ArtifexSoftware/mupdf/commit/0e5f97d675009246d6ea8a65e2c1481822457539 (current trunk), and mutool.good is the same with 500c0299af8d reverted.

mark@arm-and-hammer zsh% diff -u \
    <(mupdf_prefix/bin/mutool.good draw -F txt -KK 00610RY31R.PDF) \
    <(mupdf_prefix/bin/mutool.bad draw -F txt -KK 00610RY31R.PDF)
--- /dev/fd/11 2023-04-21 14:51:30
+++ /dev/fd/12 2023-04-21 14:51:30
@@ -118,8 +118,7 @@
 460/45
 447 (500- )
-627 (700-1)
-640-1
+640-1627 (700-1)
 667 (700-1 )
 680-2
@@ -205,8 +204,7 @@
 13
 13
-CLNC DEL
-GND CON
+GND CONCLNC DEL
 NEW YORK APP CON
 KENNEDY TOWER
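For anyone wanting to check other files for the same regression, here is a minimal sketch that flags words in the "bad" output that look like two "good" words glued together. It assumes the two `mutool draw -F txt` outputs have already been captured as strings; the function name `find_merged_words` is made up for this illustration and is not part of mupdf. It is only a heuristic, since it compares whitespace-separated tokens and ignores their position on the page.

```python
def find_merged_words(good_text, bad_text):
    """Return (merged, left, right) for each word in bad_text that is absent
    from good_text but splits into two words that both occur in good_text."""
    good_words = set(good_text.split())
    merged = []
    for word in bad_text.split():
        if word in good_words:
            continue  # this word was extracted identically before
        # Try every split point; report the first one where both halves are known.
        for i in range(1, len(word)):
            left, right = word[:i], word[i:]
            if left in good_words and right in good_words:
                merged.append((word, left, right))
                break
    return merged


# Sample lines taken from the diff in this report.
good = "627 (700-1)\n640-1\nCLNC DEL\nGND CON\n"
bad = "640-1627 (700-1)\nGND CONCLNC DEL\n"
for merged, left, right in find_merged_words(good, bad):
    print(f"{merged!r} looks like {left!r} + {right!r}")
```

On the sample lines this reports both locations described above (“640-1627” and “CONCLNC”).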
Fixed with:

commit dfaf6071fc6ef28c311dbedf81c2139fd1f8ffc4 (golden/master)
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Thu May 11 14:37:28 2023 +0100

    Bug 706642: Text extraction; maybe add spaces when prepending lines.

    When we prepend a line to another one, if there is a suitable gap
    between it and the line we are prepending it to, insert a space.

    Logic matches the addition of spaces when simply adding chars.

Probably also relies on:

commit bc140682ab56188d0bbec06a4572be06ccb406f8
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Thu May 11 12:14:56 2023 +0100

    Bug 706718: Don't prepend text extracted lines if vertically shifted.

    The bugfix for 706426 was incorrect, in that it did not check for
    text extracted lines being vertically shifted when considering
    them for prepending. Fixed here.

Thanks for the report.
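To illustrate the idea behind the two commits, here is a rough sketch of the prepend decision. This is not the actual MuPDF code (which is C and works on its own structured-text types); the `Line` class, the `prepend` function, and both thresholds are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    text: str
    x0: float        # left edge
    x1: float        # right edge
    baseline: float  # vertical position of the baseline
    size: float      # font size


# Assumed thresholds, expressed as fractions of the font size.
SPACE_GAP = 0.2
BASELINE_TOLERANCE = 0.1


def prepend(first: Line, second: Line) -> Optional[Line]:
    """Try to put `first` in front of `second`; return None if they should stay separate."""
    size = max(first.size, second.size)
    # Second commit's idea: do not prepend lines that are vertically shifted.
    if abs(first.baseline - second.baseline) > BASELINE_TOLERANCE * size:
        return None
    # First commit's idea: insert a space when there is a suitable horizontal gap.
    gap = second.x0 - first.x1
    separator = " " if gap > SPACE_GAP * size else ""
    return Line(first.text + separator + second.text,
                first.x0, second.x1, second.baseline, size)


gnd = Line("GND CON", 0.0, 40.0, 100.0, 10.0)
clnc = Line("CLNC DEL", 45.0, 90.0, 100.0, 10.0)
print(prepend(gnd, clnc).text)  # GND CON CLNC DEL
```

With a gap of 5 units at font size 10, the sketch inserts the space, so the words from the briefing strip stay separate; with no vertical check or gap check, the same input would produce “GND CONCLNC DEL”.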
Just for your interest in seeing how your stuff is used downstream, https://github.com/foobnix/LibreraReader/issues/1115 :)