PyMuPDF post: https://github.com/pymupdf/PyMuPDF/discussions/3838
Problem file: https://github.com/user-attachments/files/16853235/test.pdf
Problem detail: When extracting (plain) text from page 0, the output is completely different from what any PDF viewer (including MuPDF-GL) displays, and from what other text extractors return. For example (page 0), instead of extracting this:
Your document is P1% aligned with OSHA standards. Let’s change that!
the following text is extracted:
Your document is 82.35% aligned wi th OSHA standards. Let’s change that!
When the StructureTreeRoot object is semi-manually removed from the PDF's catalog, behavior is "back to normal."
Report in PyMuPDF Discussions has been converted to a proper bug report: https://github.com/pymupdf/PyMuPDF/issues/3845
Object 462 in the structure tree contains the ActualText "Your document is 82.35% aligned with OSHA standards. Let\220s change that!", which replaces the visual text "Your document is P1% aligned with OSHA standards. Let's change that!". As far as I read the PDF specification, this ActualText should be taken into account when extracting text from a PDF for searching, indexing, etc. So I can't say that MuPDF is doing anything wrong here. It could be argued that MuPDF's fz_stext_device should have an option to ignore meta text, e.g. ActualText.
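For readers puzzled by the `\220` in the ActualText string: it is an octal escape inside a PDF literal string, and byte 0x90 in PDFDocEncoding is the right single quotation mark, which matches the "Let’s" seen in the extracted text. The sketch below only illustrates that decoding step. The helper name `decode_pdf_literal` and the one-entry table are made up for this example, and it assumes the string is PDFDocEncoded (no UTF-16 byte order mark).

```python
import re

# Partial PDFDocEncoding table: only the byte needed here. Assumption: the
# string is PDFDocEncoded. Other bytes in 0x18-0x1F and 0x80-0x9F would need
# their own entries; this sketch does not cover them.
PDFDOC_OVERRIDES = {0x90: "\u2019"}  # right single quotation mark


def decode_pdf_literal(body: str) -> str:
    """Resolve \\ddd octal escapes only (not \\n, \\( etc.), then map to Unicode."""
    def octal_to_char(match):
        byte = int(match.group(1), 8) & 0xFF
        # Fallback chr(byte) is only right for ASCII and the Latin-1 range.
        return PDFDOC_OVERRIDES.get(byte, chr(byte))

    return re.sub(r"\\([0-7]{1,3})", octal_to_char, body)


decoded = decode_pdf_literal(r"Let\220s change that!")
```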
I think such an option would make sense. However, as in all similar cases, the programmer has no prior knowledge that a situation needs extra care. After all, NOT delivering existing ActualText is also a problem, one that we just recently resolved. So it is a challenging task to explain such a complex, non-intuitive situation.
I have a proposed patch that adds a device flag to inhibit actual text. If that is accepted during review I think that might be the best course of action here.
But I still think applying ActualText is a good default since the PDF spec proposes exactly that!
Fixed a while ago by

commit 887a7b0bac393c3df1fc97fbdd4d933290974b1b
Author: Sebastian Rasmussen <sebras@gmail.com>
Date:   Tue Sep 10 00:03:36 2024 +0200

    Bug 708005: Add device flag to ignore ActualText replacing original text.

    This makes it possible for users to opt out of using ActualText
    when extracting the text.
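To picture what the opt-out changes, here is a toy model in plain Python. It is not MuPDF or PyMuPDF code: `MarkedSpan`, `extract_text` and the `IGNORE_ACTUALTEXT` constant are hypothetical names invented for this sketch, and the real device works on marked-content sequences rather than a list of spans. It only shows the default (ActualText wins) versus the opt-out (visual text is kept) for the sentence from the report.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical flag for this sketch only; not a MuPDF or PyMuPDF constant.
IGNORE_ACTUALTEXT = 1


@dataclass
class MarkedSpan:
    """A run of page text, optionally tagged with replacement ActualText."""
    visual_text: str                   # what the glyphs on the page show
    actual_text: Optional[str] = None  # ActualText from the structure tree, if any


def extract_text(spans: List[MarkedSpan], flags: int = 0) -> str:
    """Join spans, preferring ActualText unless the opt-out flag is set."""
    parts = []
    for span in spans:
        use_actual = span.actual_text is not None and not (flags & IGNORE_ACTUALTEXT)
        parts.append(span.actual_text if use_actual else span.visual_text)
    return "".join(parts)


# The sentence from the report: glyphs show "P1%", ActualText says "82.35%".
page = [
    MarkedSpan(
        visual_text="Your document is P1% aligned with OSHA standards.",
        actual_text="Your document is 82.35% aligned with OSHA standards.",
    ),
    MarkedSpan(visual_text=" Let\u2019s change that!"),
]

default_text = extract_text(page)                   # ActualText applied
visual_only = extract_text(page, IGNORE_ACTUALTEXT)  # opt-out: glyph text kept
```

With the flag unset, the result matches what was reported as surprising; with it set, the result matches what viewers display.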