707289 – Text extraction with missing word separation

Status:	UNCONFIRMED

Alias:	None

Product:	MuPDF
Classification:	Unclassified
Component:	mupdf
Version:	1.23.4
Hardware:	PC All

Importance:	P2 normal
Assignee:	Julian Smith

Reported:	2023-10-23 16:49 UTC by Jorj
Modified:	2023-11-14 23:31 UTC
CC List:	0 users

Attachments
look at page 12 (1-based) to reproduce. (61 bytes, text/plain) 2023-10-23 16:49 UTC, Jorj

Description Jorj 2023-10-23 16:49:03 UTC

Created attachment 24993
look at page 12 (1-based) to reproduce.

Word separation is not recognized in the attached file.
`mutool draw -o test.txt test.pdf` does not recognize word breaks on page 12 and delivers:
"sureenough,bothfellwiththesameaccelerationandreachedthe"
instead of:
"sure enough, both fell with the same acceleration and reached the"

Comment 1 Jorj 2023-10-23 16:59:01 UTC

The associated PyMuPDF issue on GitHub: https://github.com/pymupdf/PyMuPDF/issues/2755

Comment 2 Tor Andersson 2023-11-14 16:41:25 UTC

commit b9d3868c390017eadeb36a864771a2cb673504ca
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Wed Nov 8 17:12:37 2023 +0100

    Set space width in fz_font for use with stext-device.
    
    Base missing space detection on values scaled with the font's actual
    space width when it is available.
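
For illustration only (this is not the code from the commit above, and none of these names exist in MuPDF): a minimal Python sketch of the general idea of gap-based space detection. It assumes each glyph carries a left edge `x` and an `advance`, that `space_width` is given in em units, and that the `factor` and `fallback_em` values are arbitrary placeholders. When the font's space width is known, the threshold is scaled by it; otherwise a fixed fraction of the font size is used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    char: str
    x: float        # left edge of the glyph (hypothetical units)
    advance: float  # horizontal advance of the glyph


def lay_out_words(words: List[str], advance: float, word_gap: float) -> List[Glyph]:
    """Place glyphs left to right, leaving word_gap between words and no space glyphs."""
    glyphs: List[Glyph] = []
    x = 0.0
    for i, word in enumerate(words):
        if i > 0:
            x += word_gap
        for ch in word:
            glyphs.append(Glyph(ch, x, advance))
            x += advance
    return glyphs


def join_with_spaces(
    glyphs: List[Glyph],
    font_size: float,
    space_width: Optional[float] = None,  # font's space width in em, if known
    factor: float = 0.5,                  # assumed fraction of a space that counts as a break
    fallback_em: float = 0.2,             # assumed fixed threshold when no space width is known
) -> str:
    """Insert a space wherever the gap between adjacent glyphs exceeds the threshold."""
    if space_width is not None and space_width > 0:
        threshold = factor * space_width * font_size
    else:
        threshold = fallback_em * font_size

    out: List[str] = []
    prev_end: Optional[float] = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > threshold:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.advance
    return "".join(out)


# A font with a narrow space: the word gap (1.5) is below the fixed fallback
# threshold (0.2 * 10 = 2.0) but above the font-scaled one (0.5 * 0.25 * 10 = 1.25).
glyphs = lay_out_words(["sure", "enough,"], advance=5.0, word_gap=1.5)
print(join_with_spaces(glyphs, font_size=10.0))                    # sureenough,
print(join_with_spaces(glyphs, font_size=10.0, space_width=0.25))  # sure enough,
```

In this toy setup a fixed threshold merges the words, as in the reported output, while a threshold scaled by the font's own space width separates them. Whether the attached file fails for this reason is an assumption here, not something the report confirms.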

Comment 3 Tor Andersson 2023-11-14 23:31:24 UTC

The commit that fixed this bug has been reverted because it caused problems with another file.