Bug 708005 - Erroneous StructureTree interpretation
Status: RESOLVED FIXED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: unspecified
Hardware: All All
Importance: P2 critical
Assignee: Sebastian Rasmussen
Reported: 2024-09-04 10:37 UTC by Jorj
Modified: 2024-11-21 03:36 UTC

Description Jorj 2024-09-04 10:37:33 UTC
PyMuPDF post: https://github.com/pymupdf/PyMuPDF/discussions/3838

Problem file: https://github.com/user-attachments/files/16853235/test.pdf

Problem detail:
Extracting (plain) text from page 0 yields output that differs completely from what any PDF viewer (including MuPDF-GL) displays and from what other text extractors return.

For example (page 0), instead of extracting this:

Your document is P1% aligned with
OSHA standards. Let’s change that!

the following text is extracted:

Your document is 82.35% aligned wi
th OSHA standards. Let’s change that!

When the StructTreeRoot entry is semi-manually removed from the PDF's catalog, behavior is "back to normal."
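
For anyone who wants to try the workaround from a script, here is a minimal sketch using PyMuPDF. The function name and the save-and-reopen step are my own assumptions, not something from the report; it only mirrors the manual removal of the structure tree entry from the catalog and has not been run against the problem file.

```python
def extract_without_struct_tree(path, page_number=0):
    """Return page text after dropping the catalog's StructTreeRoot entry.

    Hypothetical helper: it mimics the manual workaround described above.
    """
    import pymupdf  # PyMuPDF, imported lazily so the sketch loads without it

    doc = pymupdf.open(path)
    try:
        catalog_xref = doc.pdf_catalog()
        # Setting a key to "null" removes it from the catalog dictionary.
        doc.xref_set_key(catalog_xref, "StructTreeRoot", "null")
        # Assumption: reopen from bytes so no cached structure data is reused.
        data = doc.tobytes()
    finally:
        doc.close()

    stripped = pymupdf.open(stream=data, filetype="pdf")
    try:
        return stripped[page_number].get_text()
    finally:
        stripped.close()
```

Calling `extract_without_struct_tree("test.pdf")` on a local copy of the problem file should then return the text without the structure tree's influence, if the observation above holds.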
Comment 1 Jorj 2024-09-04 11:36:24 UTC
Report in PyMuPDF Discussions has been converted to a proper bug report: https://github.com/pymupdf/PyMuPDF/issues/3845
Comment 2 Sebastian Rasmussen 2024-09-09 11:54:41 UTC
Object 462 in the structure tree contains the ActualText "Your document is 82.35% aligned with OSHA standards. Let\220s change that!" which replaces the visual text "Your document is P1% aligned with OSHA standards. Let's change that!".
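
As a side note on the `\220` in that string: it is an octal escape inside a PDF literal string, and byte 0x90 in PDFDocEncoding is the right single quotation mark. A small sketch of the decoding follows; the helper name is hypothetical, and the table covers only the one code point needed here, with every other byte treated as Latin-1 as a simplification.

```python
import re

# Deliberately tiny subset of PDFDocEncoding: just the byte used in this bug.
PDFDOC_SUBSET = {0x90: "\u2019"}  # octal 220 -> right single quotation mark


def decode_pdf_literal(raw: str) -> str:
    """Resolve \\ddd octal escapes; other bytes fall back to Latin-1 (simplification)."""

    def replace_octal(match):
        code = int(match.group(1), 8) & 0xFF
        return PDFDOC_SUBSET.get(code, chr(code))

    return re.sub(r"\\([0-7]{1,3})", replace_octal, raw)


decoded = decode_pdf_literal(r"Let\220s change that!")
print(decoded)
```

So the ActualText carries a typographic apostrophe, which matches what the reporter saw in the extracted output.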

As far as I read the PDF specification, ActualText should be taken into account when extracting text from a PDF for searching, indexing, etc. So I can't say that MuPDF is doing anything wrong here.

It could be argued that MuPDF's fz_stext_device should have an option to ignore meta text, e.g. ActualText.
Comment 3 Jorj 2024-09-09 12:08:20 UTC
I think such an option would make sense.

However, as in all similar cases, the programmer has no prior knowledge that a situation needs extra care.
After all, NOT delivering existing ActualText is also a problem that we just recently resolved.

So it is a challenging task to explain such a complex, non-intuitive situation.
Comment 4 Sebastian Rasmussen 2024-09-09 23:38:53 UTC
I have a proposed patch that adds a device flag to inhibit actual text. If that is accepted during review I think that might be the best course of action here.
Comment 5 Sebastian Rasmussen 2024-09-09 23:39:56 UTC
But I still think applying ActualText is a good default since the PDF spec proposes exactly that!
Comment 6 Sebastian Rasmussen 2024-11-21 03:36:17 UTC
Fixed a while ago by

commit 887a7b0bac393c3df1fc97fbdd4d933290974b1b
Author: Sebastian Rasmussen <sebras@gmail.com>
Date:   Tue Sep 10 00:03:36 2024 +0200

    Bug 708005: Add device flag to ignore ActualText replacing original text.
    
    This makes it possible for users to opt out of using ActualText when
    extracting the text.
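
To illustrate what such an opt-out means in practice, here is a toy model in Python. It is not MuPDF code: the `Span` class, the `extract_text` function and the `ignore_actual_text` parameter are all hypothetical names, used only to show the default (ActualText wins) versus the opt-out (visual text wins).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Toy stand-in for a piece of page content (hypothetical, not a MuPDF type)."""
    visual_text: str                   # what the glyphs on the page spell out
    actual_text: Optional[str] = None  # replacement text from the structure tree


def extract_text(spans: List[Span], ignore_actual_text: bool = False) -> str:
    """Prefer ActualText by default; fall back to visual text when opted out."""
    parts = []
    for span in spans:
        if span.actual_text is not None and not ignore_actual_text:
            parts.append(span.actual_text)
        else:
            parts.append(span.visual_text)
    return "".join(parts)


spans = [
    Span(
        visual_text="Your document is P1% aligned with OSHA standards.",
        actual_text="Your document is 82.35% aligned with OSHA standards.",
    )
]
default_text = extract_text(spans)
visual_only = extract_text(spans, ignore_actual_text=True)
print(default_text)
print(visual_only)
```

The default branch matches the behavior defended in comment 5, and the opt-out branch corresponds to what the reporter expected to get.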