Bug 707890

Summary: Mutool clean with page selection yields text not extractable
Product: MuPDF
Reporter: Jorj <jorj.x.mckie>
Component: mupdf
Assignee: MuPDF bugs <mupdf-bugs>
Status: RESOLVED FIXED
Severity: major
CC: robin.watts
Priority: P2
Version: unspecified
Hardware: All
OS: All
Customer:
Word Size: ---

Description Jorj 2024-07-19 14:31:33 UTC
PyMuPDF issue https://github.com/pymupdf/PyMuPDF/issues/3705

File link: https://github.com/user-attachments/files/16312121/946f8445-6373-4f32-994c-04c495e2e7e9.pdf

Reproducer:
-----------
This delivers the expected output:
mutool draw -o test.txt test.pdf

However:
---------
mutool clean test.pdf 1-30

(no output file is named, so the result is written to out.pdf)

then:
mutool draw -o mutool-30.txt out.pdf

produces crippled, almost empty text: https://github.com/user-attachments/files/16313402/mutool-30.txt

Comment 1 Robin Watts 2024-07-22 15:58:54 UTC
Fixed with:

commit cbe65e8144782a684e1fec56e5dd3dd26beaf65b (golden/master)
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Fri Jul 19 17:41:22 2024 +0100

    Bug 707890: Carry over structparent information when cleaning.

    We were completely omitting the structure tree when copying.
    This meant that information like "ActualText" was missing,
    resulting in problems when doing text extraction.

    Here we copy the entirety of the Structure Tree across, and
    regenerate the ParentTree so that the Page StructParents still
    point to the right thing.

    We do NOT cut the actual Structure Tree down, so the file remains
    larger than it maybe needs to be - but it is at least correct now.
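
The ParentTree regeneration described in the commit message can be illustrated with a toy model. This is only a sketch in Python, not MuPDF's actual C code: the Page class, the rebuild_parent_tree function and the dict standing in for the ParentTree number tree are all hypothetical, and renumbering the keys from 0 is an assumption made for the illustration. The point it shows is that each kept page must end up with a StructParents key that still resolves to the same structure elements as before the page selection.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical stand-in for a PDF page object."""
    label: str
    struct_parents: Optional[int]  # models the page's /StructParents key


def rebuild_parent_tree(
    pages: List[Page],
    parent_tree: Dict[int, List[str]],
    keep: List[int],
) -> Tuple[List[Page], Dict[int, List[str]]]:
    """Keep the pages at the 0-based indices in `keep` and build a new
    parent tree whose keys match the kept pages' new StructParents values.

    Assumption: keys are renumbered from 0 in page order. The structure
    elements themselves are copied unchanged, mirroring the commit's note
    that the structure tree is not cut down.
    """
    kept_pages: List[Page] = []
    new_tree: Dict[int, List[str]] = {}
    for index in keep:
        page = pages[index]
        new_key = None
        if page.struct_parents is not None:
            # Allocate the next free key and carry the old entry across.
            new_key = len(new_tree)
            new_tree[new_key] = list(parent_tree[page.struct_parents])
        kept_pages.append(Page(page.label, new_key))
    return kept_pages, new_tree


# Four pages; the third has no structure information at all.
pages = [Page("p1", 0), Page("p2", 1), Page("p3", None), Page("p4", 2)]
parent_tree = {0: ["H1"], 1: ["P", "Span"], 2: ["Figure"]}

# Keep only the second and fourth pages.
kept, tree = rebuild_parent_tree(pages, parent_tree, keep=[1, 3])
for page in kept:
    elems = tree.get(page.struct_parents)
    print(page.label, page.struct_parents, elems)
```

In this model, dropping the parent tree entirely (the behaviour before the fix) would leave every kept page's key pointing at nothing, which is the analogue of the missing "ActualText" information during text extraction.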