Bug 707289 - Text extraction with missing word separation
Status: UNCONFIRMED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: 1.23.4
Hardware: PC All
Importance: P2 normal
Assignee: Julian Smith
Reported: 2023-10-23 16:49 UTC by Jorj
Modified: 2023-11-14 23:31 UTC

Attachments
look at page 12 (1-based) to reproduce. (61 bytes, text/plain)
2023-10-23 16:49 UTC, Jorj

Description Jorj 2023-10-23 16:49:03 UTC
Created attachment 24993
look at page 12 (1-based) to reproduce.

Word separation is not recognized in the attached file.
`mutool draw -o test.txt test.pdf` does not detect word breaks and produces:
"sureenough,bothfellwiththesameaccelerationandreachedthe"
instead of:
"sure enough, both fell with the same acceleration and reached the"
Comment 1 Jorj 2023-10-23 16:59:01 UTC
The associated PyMuPDF issue on GitHub: https://github.com/pymupdf/PyMuPDF/issues/2755
Comment 2 Tor Andersson 2023-11-14 16:41:25 UTC
commit b9d3868c390017eadeb36a864771a2cb673504ca
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Wed Nov 8 17:12:37 2023 +0100

    Set space width in fz_font for use with stext-device.
    
    Base missing space detection on values scaled with the font's actual
    space width when it is available.
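
The idea in the commit message can be illustrated with a small sketch. This is not the MuPDF implementation: the `Glyph` type, the function name, and the `ratio` and `fallback_em` constants are all hypothetical. It only shows how a gap threshold scaled by the font's own space width can catch word breaks that a threshold based on font size alone would miss.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Glyph:
    char: str
    x: float        # horizontal origin of the glyph
    advance: float  # horizontal advance of the glyph


def join_with_spaces(
    glyphs: Sequence[Glyph],
    font_size: float,
    space_width: Optional[float] = None,
    ratio: float = 0.5,        # assumed fraction of the space width
    fallback_em: float = 0.2,  # assumed fraction of the font size
) -> str:
    """Insert a space wherever the gap between glyphs exceeds a threshold."""
    if space_width is not None and space_width > 0:
        # Scale the threshold with the font's actual space width.
        threshold = ratio * space_width
    else:
        # No space width known: fall back to a guess based on font size.
        threshold = fallback_em * font_size
    out = []
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x - (prev.x + prev.advance)
            if gap > threshold:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# "to" and "be" separated by a gap of 1.0 unit at font size 10.
glyphs = [Glyph("t", 0, 3), Glyph("o", 3, 5), Glyph("b", 9, 5), Glyph("e", 14, 5)]
print(join_with_spaces(glyphs, 10.0))                   # font-size fallback
print(join_with_spaces(glyphs, 10.0, space_width=1.5))  # scaled by space width
```

With these made-up numbers the fallback threshold is 2.0, so the 1.0 gap is not treated as a word break, while a known space width of 1.5 gives a threshold of 0.75 and the break is detected.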
Comment 3 Tor Andersson 2023-11-14 23:31:24 UTC
The commit that fixed this bug has been reverted because it caused problems with another file.