707859 – mudraw creates linebreaks between adjacent characters

Status:	RESOLVED FIXED

Product:	MuPDF
Classification:	Unclassified
Component:	mupdf
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Assignee:	MuPDF bugs

Duplicates (1):	707960

Reported:	2024-07-02 15:27 UTC by Jorj
Modified:	2024-09-02 21:25 UTC
CC List:	2 users

Description Jorj 2024-07-02 15:27:38 UTC

PyMuPDF issue: https://github.com/pymupdf/PyMuPDF/issues/3650

Problem file: https://github.com/user-attachments/files/16070748/test.pdf

Reproducer: mutool draw -o test.txt test.pdf

Other text extraction tools handle this file correctly.

Comment 1 Robin Watts 2024-07-25 13:55:31 UTC

Fixed with:

commit b8415aec6a130c09ababed4f4f1ffd102b115c0d
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Mon Jul 22 17:05:59 2024 +0100

    Bug707859: Tweak text extraction

    We were already allowing for a slight overlap in characters
    when extracting, but the test file in this bug has chars
    squeezed together slightly more than we were expecting.

    Consider the following: (Diagram exploded vertically for
    clarity - in the test file the chars are on the same line).

      +--------+
      |        |
      |        |
      +--------+
          +--------+
          |        |
          |        |
          +--------+
          |<-s-|

    's' in the diagram is 'spacing' in the code.

    The existing code copes with s being negative, if its absolute
    size is smaller than SPACE_DIST (0.15). In this case s is around
    -0.35, whose absolute size is larger than SPACE_DIST but
    comfortably less than SPACE_MAX_DIST (0.8).

    Such cases were falling into the 'just consider it as a new line'
    case.
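
The threshold logic described in the commit message can be illustrated with a minimal sketch. This is not the MuPDF source: the function name classify_overlap is hypothetical, the two constants are simply the values quoted above, and the sketch assumes the fix amounts to tolerating overlaps up to SPACE_MAX_DIST, which the commit message implies but does not state outright. The spacing values in the loop are illustrative and are not taken from the test file.

```python
# Threshold values as quoted in the commit message.
SPACE_DIST = 0.15
SPACE_MAX_DIST = 0.8


def classify_overlap(spacing: float, tolerance: float) -> str:
    """Classify an overlap (negative spacing) between adjacent characters.

    Hypothetical helper: returns "same line" when the overlap is small
    enough to be tolerated, otherwise "new line".
    """
    if spacing >= 0:
        # Positive gaps (word spaces and so on) are outside this sketch.
        raise ValueError("this sketch only models overlaps (spacing < 0)")
    return "same line" if abs(spacing) < tolerance else "new line"


# Compare the tolerance described as the existing behaviour (SPACE_DIST)
# with the assumed wider tolerance (SPACE_MAX_DIST).
for s in (-0.1, -0.5, -1.0):
    before = classify_overlap(s, SPACE_DIST)
    after = classify_overlap(s, SPACE_MAX_DIST)
    print(f"spacing={s}: before={before}, after={after}")
```

Under these assumptions, an overlap whose absolute size lies between the two thresholds is the only case whose classification changes: it was treated as a new line with the narrower tolerance and stays on the same line with the wider one.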

Comment 2 Sebastian Rasmussen 2024-09-02 21:25:39 UTC

*** Bug 707960 has been marked as a duplicate of this bug. ***