Bug 707289

Summary: Text extraction with missing word separation
Product: MuPDF
Reporter: Jorj <jorj.x.mckie>
Component: mupdf
Assignee: Julian Smith <julian.smith>
Status: UNCONFIRMED ---    
Severity: normal    
Priority: P2    
Version: 1.23.4   
Hardware: PC   
OS: All   
Customer:
Word Size: ---
Attachments: look at page 12 (1-based) to reproduce.

Description Jorj 2023-10-23 16:49:03 UTC
Created attachment 24993
look at page 12 (1-based) to reproduce.

Word separation is not recognized in the attached file.
`mutool draw -o test.txt test.pdf` does not recognize word breaks and delivers:
"sureenough,bothfellwiththesameaccelerationandreachedthe"
instead of:
"sure enough, both fell with the same acceleration and reached the"
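To check extracted text for this symptom without reading every line, a rough heuristic is to flag lines that contain an implausibly long token. The following is only a sketch: the function names and the 30-character threshold are assumptions made for illustration and are not part of MuPDF or PyMuPDF.

```python
def longest_token(line):
    """Return the length of the longest whitespace-delimited token in a line."""
    return max((len(token) for token in line.split()), default=0)


def looks_unseparated(line, max_token_len=30):
    """Heuristic: a token longer than max_token_len suggests lost word breaks.

    max_token_len is an assumed threshold; tune it for the language and
    content of the document being checked.
    """
    return longest_token(line) > max_token_len


# The two strings quoted in this report.
bad = "sureenough,bothfellwiththesameaccelerationandreachedthe"
good = "sure enough, both fell with the same acceleration and reached the"

print(looks_unseparated(bad), looks_unseparated(good))
```

Running this over each line of `test.txt` would point at the affected lines. Long URLs or identifiers can trigger false positives.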

Comment 1 Jorj 2023-10-23 16:59:01 UTC
The associated PyMuPDF issue on GitHub: https://github.com/pymupdf/PyMuPDF/issues/2755

Comment 2 Tor Andersson 2023-11-14 16:41:25 UTC
commit b9d3868c390017eadeb36a864771a2cb673504ca
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Wed Nov 8 17:12:37 2023 +0100

    Set space width in fz_font for use with stext-device.
    
    Base missing space detection on values scaled with the font's actual
    space width when it is available.
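The idea in this commit message can be illustrated with a small, self-contained sketch. It is not the MuPDF implementation: the `Glyph` type, the function names, and both constants (`GAP_FACTOR`, `FALLBACK_EM_FRACTION`) are hypothetical values chosen for illustration. A space is inserted when the horizontal gap between two adjacent glyphs exceeds a threshold. That threshold is derived from the font's own space width when it is known, and from a fixed fraction of the font size otherwise.

```python
from dataclasses import dataclass

# Hypothetical constants, chosen only for this illustration.
GAP_FACTOR = 0.5            # fraction of the font's space width that counts as a gap
FALLBACK_EM_FRACTION = 0.2  # fraction of the font size used when no space width is known


@dataclass
class Glyph:
    char: str
    x0: float  # left edge of the glyph
    x1: float  # right edge of the glyph (x0 + advance)


def insert_missing_spaces(glyphs, font_size, space_width=None):
    """Join glyphs into a string, adding a space wherever the gap is large.

    space_width is the font's space advance, already scaled to the same
    units as the glyph coordinates, or None when the font does not provide it.
    """
    if space_width is not None and space_width > 0:
        threshold = GAP_FACTOR * space_width
    else:
        threshold = FALLBACK_EM_FRACTION * font_size

    out = []
    prev = None
    for glyph in glyphs:
        if prev is not None and prev.char != " " and glyph.char != " ":
            if glyph.x0 - prev.x1 > threshold:
                out.append(" ")
        out.append(glyph.char)
        prev = glyph
    return "".join(out)


def layout(words, char_width, word_gap):
    """Place words on a line with no explicit space glyphs between them."""
    glyphs, x = [], 0.0
    for word in words:
        for ch in word:
            glyphs.append(Glyph(ch, x, x + char_width))
            x += char_width
        x += word_gap
    return glyphs


# A font with a narrow space: words are separated by a gap of 1.5 units.
glyphs = layout(["both", "fell"], char_width=5.0, word_gap=1.5)
print(insert_missing_spaces(glyphs, font_size=10.0))                   # bothfell
print(insert_missing_spaces(glyphs, font_size=10.0, space_width=2.0))  # both fell
```

With the fixed fallback the 1.5-unit gap stays below the threshold of 2.0 and the words run together. With the font's space width of 2.0 the threshold drops to 1.0 and the word break is detected. A threshold that adapts like this may also split text in fonts whose reported space width is unusually small.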

Comment 3 Tor Andersson 2023-11-14 23:31:24 UTC
The commit that fixed this bug has been reverted because it caused problems with another file.