Summary: | Text extraction with missing word separation | ||
---|---|---|---|
Product: | MuPDF | Reporter: | Jorj <jorj.x.mckie> |
Component: | mupdf | Assignee: | Julian Smith <julian.smith> |
Status: | UNCONFIRMED --- | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | 1.23.4 | ||
Hardware: | PC | ||
OS: | All | ||
Customer: | | Word Size: | --- |
Attachments: | look at page 12 (1-based) to reproduce. |
The associated PyMuPDF issue on GitHub: https://github.com/pymupdf/PyMuPDF/issues/2755

commit b9d3868c390017eadeb36a864771a2cb673504ca
Author: Tor Andersson <tor.andersson@artifex.com>
Date: Wed Nov 8 17:12:37 2023 +0100

Set space width in fz_font for use with stext-device.

Base missing space detection on values scaled with the font's actual space width when it is available.

The commit that fixed this bug has been reverted because it caused problems with another file.
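The idea in the commit message, scaling the missing-space threshold by the font's real space width when it is known, can be illustrated with a small sketch. This is not MuPDF's code: the `Glyph` type, the `join_with_spaces` and `layout` helpers, and the `fallback_ratio` and `factor` values are all hypothetical and chosen only to show why a fixed, font-size-based threshold can miss word gaps in a font with a narrow space.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    char: str
    x0: float  # left edge of the glyph
    x1: float  # right edge of the glyph (x0 + advance)


def join_with_spaces(
    glyphs: List[Glyph],
    font_size: float,
    space_width: Optional[float] = None,
    fallback_ratio: float = 0.2,  # assumed guess for space width / font size
    factor: float = 0.5,          # assumed fraction of a space that counts as a gap
) -> str:
    """Insert a space wherever the gap between two glyphs exceeds a threshold.

    If the font's actual space width is known, the threshold is scaled by it;
    otherwise a guess derived from the font size is used.
    """
    base = space_width if space_width is not None else fallback_ratio * font_size
    threshold = factor * base
    out: List[str] = []
    prev: Optional[Glyph] = None
    for g in glyphs:
        if prev is not None and g.x0 - prev.x1 > threshold:
            out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


def layout(words: List[str], advance: float, word_gap: float) -> List[Glyph]:
    """Place glyphs left to right with a fixed advance and a gap between words."""
    glyphs: List[Glyph] = []
    x = 0.0
    for i, word in enumerate(words):
        if i:
            x += word_gap
        for ch in word:
            glyphs.append(Glyph(ch, x, x + advance))
            x += advance
    return glyphs


# A 10-unit font whose space is only 1.0 wide; words are 0.9 apart.
glyphs = layout(["sure", "enough,", "both"], advance=4.0, word_gap=0.9)
without_width = join_with_spaces(glyphs, font_size=10.0)
with_width = join_with_spaces(glyphs, font_size=10.0, space_width=1.0)
```

With the assumed fallback the threshold is 1.0 and the 0.9 gaps are ignored. With the font's own space width the threshold drops to 0.5 and the word breaks are detected. The same sensitivity could also produce spurious spaces in other documents, which may be related to the revert mentioned above, although the report does not say what went wrong with the other file.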
Created attachment 24993: look at page 12 (1-based) to reproduce.

Word separation is not recognized in the attached file. `mutool draw -o test.txt test.pdf` does not recognize word breaks and delivers:

"sureenough,bothfellwiththesameaccelerationandreachedthe"

instead of:

"sure enough, both fell with the same acceleration and reached the"
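To check a given build against this report, the extracted text can be compared with the expected phrase. The helper below is a minimal sketch with a hypothetical name, `missing_word_breaks`. It assumes the text has already been extracted into a string, for example by reading the `test.txt` produced by the `mutool draw` command above.

```python
def missing_word_breaks(extracted: str, expected: str) -> bool:
    """Return True if `expected` appears in `extracted` only with its spaces removed."""
    squeezed = "".join(expected.split())  # the expected phrase without whitespace
    return expected not in extracted and squeezed in extracted


expected = "sure enough, both fell with the same acceleration and reached the"
buggy = "sureenough,bothfellwiththesameaccelerationandreachedthe"

bug_present = missing_word_breaks(buggy, expected)
bug_absent = missing_word_breaks(expected, expected)
```

The check is deliberately narrow. It flags only the exact failure described here, and it would not notice a phrase that is split across lines in the output.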