PyMuPDF issue: https://github.com/pymupdf/PyMuPDF/issues/3650
Problem file: https://github.com/user-attachments/files/16070748/test.pdf
Reproducer: mutool draw test.txt test.pdf
Other extractor tools do work.
Fixed with:

commit b8415aec6a130c09ababed4f4f1ffd102b115c0d
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Mon Jul 22 17:05:59 2024 +0100

    Bug707859: Tweak text extraction

    We were already allowing for a slight overlap in characters when
    extracting, but the test file in this bug has chars squeezed
    together slightly more than we were expecting.

    Consider the following:

    (Diagram exploded vertically for clarity - in the test file the
    chars are on the same line).

    +--------+
    |        |
    |        |
    +--------+
        +--------+
        |        |
        |        |
        +--------+
        |<-s-|

    's' in the diagram is 'spacing' in the code.

    The existing code copes with s being negative, if its absolute
    size is smaller than SPACE_DIST (0.15). In this case s is around
    -3.5 which absolute is comfortably less than SPACE_MAX_DIST (0.8).

    Such cases were falling into the 'just consider it as a new line'
    case.
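For anyone following along, here is a rough sketch of the threshold logic the commit message describes. It is not the MuPDF code. The function name `classify_overlap` and the sample spacings are made up for illustration, and only the two constants and their values come from the message. The "after" column assumes the tweak tolerates overlaps up to SPACE_MAX_DIST, which the message implies but does not spell out.

```python
SPACE_DIST = 0.15      # value quoted in the commit message
SPACE_MAX_DIST = 0.8   # value quoted in the commit message


def classify_overlap(spacing: float, tolerance: float) -> str:
    """Classify a negative spacing (overlapping characters).

    Hypothetical helper: an overlap whose absolute size is below
    `tolerance` keeps the character on the current line; anything
    larger is treated as the start of a new line.
    """
    if spacing >= 0:
        raise ValueError("this sketch only covers overlapping chars (s < 0)")
    if abs(spacing) < tolerance:
        return "same line"
    return "new line"


# Illustrative overlaps chosen to fall on either side of the two
# thresholds; they are not measurements from the test file.
for s in (-0.1, -0.5, -1.2):
    before = classify_overlap(s, SPACE_DIST)      # behaviour described as existing
    after = classify_overlap(s, SPACE_MAX_DIST)   # assumed behaviour after the tweak
    print(f"s={s:+.2f}  before: {before:9}  after (assumed): {after}")
```

Only the middle sample changes classification between the two tolerances, which is the kind of case the message says was being treated as a new line.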
*** Bug 707960 has been marked as a duplicate of this bug. ***