Bug 707890 - Mutool clean with page selection yields text not extractable
Status: RESOLVED FIXED
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: unspecified
Hardware: All All
Importance: P2 major
Assignee: MuPDF bugs
Reported: 2024-07-19 14:31 UTC by Jorj
Modified: 2024-07-22 15:58 UTC

Description Jorj 2024-07-19 14:31:33 UTC
PyMuPDF issue https://github.com/pymupdf/PyMuPDF/issues/3705

File link: https://github.com/user-attachments/files/16312121/946f8445-6373-4f32-994c-04c495e2e7e9.pdf

Reproducer:
-----------
Delivers expected output:
mutool draw -o test.txt test.pdf

However:
---------
mutool clean test.pdf 1-30

then:
mutool draw -o mutool-30.txt out.pdf

Produces crippled, almost empty text: https://github.com/user-attachments/files/16313402/mutool-30.txt
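
The steps above could be scripted roughly as follows. This is only a sketch: the helper names `build_commands` and `run_reproducer` are hypothetical, and unlike the original commands it passes the page range to the first `mutool draw` call as well (an assumption on my part, so that both text files cover the same pages). It relies on `mutool clean` writing to `out.pdf` when no output file is named, as in the report.

```python
import subprocess
from pathlib import Path


def build_commands(pdf="test.pdf", pages="1-30"):
    """Return the three mutool invocations from the report as argv lists."""
    return [
        # Text extraction from the original file (page range added here
        # so the comparison covers the same pages; not in the report).
        ["mutool", "draw", "-o", "test.txt", pdf, pages],
        # Page selection; output goes to out.pdf as in the report.
        ["mutool", "clean", pdf, pages],
        # Text extraction from the cleaned file.
        ["mutool", "draw", "-o", "mutool-30.txt", "out.pdf"],
    ]


def run_reproducer(pdf="test.pdf", pages="1-30"):
    """Run the commands and return (chars before, chars after) cleaning."""
    for cmd in build_commands(pdf, pages):
        subprocess.run(cmd, check=True)
    before = Path("test.txt").read_text(encoding="utf-8", errors="replace")
    after = Path("mutool-30.txt").read_text(encoding="utf-8", errors="replace")
    return len(before), len(after)
```

On an affected build, the second number returned by `run_reproducer()` would be expected to be much smaller than the first; after the fix the two should be close.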
Comment 1 Robin Watts 2024-07-22 15:58:54 UTC
Fixed with:

commit cbe65e8144782a684e1fec56e5dd3dd26beaf65b (golden/master)
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Fri Jul 19 17:41:22 2024 +0100

    Bug 707890: Carry over structparent information when cleaning.

    We were completely omitting the structure tree when copying.
    This meant that information like "ActualText" was missing,
    resulting in problems when doing text extraction.

    Here we copy the entirety of the Structure Tree across, and
    regenerate the ParentTree so that the Page StructParents still
    point to the right thing.

    We do NOT cut the actual Structure Tree down, so the file remains
    larger than it maybe needs to be - but it is at least correct now.
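
To illustrate the idea of regenerating the ParentTree, here is a toy model in Python. It is not MuPDF's code. Pages and the ParentTree are plain dicts and lists, `select_pages` is a hypothetical helper, and renumbering the keys from zero is an assumption about one possible way to keep each kept page's `StructParents` pointing at the right entry. As in the commit, the structure elements themselves are not trimmed.

```python
def select_pages(pages, parent_tree, keep):
    """Keep the pages at the given 0-based indices and rebuild the ParentTree.

    pages:       list of dicts; a page may carry a "StructParents" key.
    parent_tree: dict mapping a StructParents key to that page's list of
                 structure elements (stand-ins for indirect references).
    Returns (new_pages, new_parent_tree); the inputs are left unmodified.
    """
    new_pages = []
    new_parent_tree = {}
    for index in keep:
        page = dict(pages[index])  # shallow copy so the input is untouched
        old_key = page.get("StructParents")
        if old_key is not None and old_key in parent_tree:
            # Assign the next free key and carry the entry across.
            new_key = len(new_parent_tree)
            new_parent_tree[new_key] = parent_tree[old_key]
            page["StructParents"] = new_key
        else:
            # No usable entry: drop the dangling key rather than keep it.
            page.pop("StructParents", None)
        new_pages.append(page)
    return new_pages, new_parent_tree


# Made-up three-page document; keep only the last two pages.
pages = [
    {"name": "p1", "StructParents": 0},
    {"name": "p2", "StructParents": 1},
    {"name": "p3", "StructParents": 2},
]
parent_tree = {0: ["H1"], 1: ["P_a", "P_b"], 2: ["Figure"]}
new_pages, new_tree = select_pages(pages, parent_tree, [1, 2])
```

In this model, each kept page still resolves to the same structure elements through its new key, which is the property the commit message describes.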