PyMuPDF issue https://github.com/pymupdf/PyMuPDF/issues/3705

File link: https://github.com/user-attachments/files/16312121/946f8445-6373-4f32-994c-04c495e2e7e9.pdf

Reproducer:
-----------

Delivers expected output:

mutool draw -o test.txt test.pdf

However:
---------

mutool clean test.pdf 1-30

then:

mutool draw -o mutool-30.txt out.pdf

Produces crippled, almost empty text:
https://github.com/user-attachments/files/16313402/mutool-30.txt

Fixed with:

commit cbe65e8144782a684e1fec56e5dd3dd26beaf65b (golden/master)
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Fri Jul 19 17:41:22 2024 +0100

    Bug 707890: Carry over structparent information when cleaning.

    We were completely omitting the structure tree when copying.
    This meant that information like "ActualText" was missing,
    resulting in problems when doing text extraction.

    Here we copy the entirety of the Structure Tree across, and
    regenerate the ParentTree so that the Page StructParents still
    point to the right thing.

    We do NOT cut the actual Structure Tree down, so the file
    remains larger than it maybe needs to be - but it is at least
    correct now.
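
Illustration (not MuPDF code):

The idea of regenerating the ParentTree for a subset of pages can be sketched with a toy model. Everything below is an assumption made for illustration: plain Python dicts stand in for PDF objects, the function name regenerate_parent_tree is hypothetical, and renumbering the keys densely from 0 is one possible choice, not necessarily what the commit does. The structure elements themselves are left untouched, mirroring the note that the Structure Tree is not cut down.

```python
# Toy model only: plain dicts stand in for PDF objects. This is not MuPDF code.

def regenerate_parent_tree(pages, parent_tree, keep):
    """Return (new_pages, new_parent_tree) for the kept page indices.

    pages: list of dicts, each optionally holding an integer "StructParents" key.
    parent_tree: dict mapping that key to a list of structure element names.
    keep: 0-based indices of the pages to retain, in output order.
    """
    new_pages = []
    new_parent_tree = {}
    for index in keep:
        page = dict(pages[index])  # copy so the input list is left untouched
        old_key = page.get("StructParents")
        if old_key is not None and old_key in parent_tree:
            new_key = len(new_parent_tree)  # assumption: renumber densely from 0
            new_parent_tree[new_key] = parent_tree[old_key]
            page["StructParents"] = new_key
        else:
            # No usable entry: drop the key rather than leave it dangling.
            page.pop("StructParents", None)
        new_pages.append(page)
    return new_pages, new_parent_tree


pages = [
    {"name": "p1", "StructParents": 0},
    {"name": "p2", "StructParents": 1},
    {"name": "p3", "StructParents": 2},
]
parent_tree = {0: ["H1"], 1: ["P-a", "P-b"], 2: ["Span-ActualText"]}

# Keep the first and third page, analogous to passing a page range to clean.
kept_pages, kept_tree = regenerate_parent_tree(pages, parent_tree, keep=[0, 2])
for page in kept_pages:
    print(page["name"], kept_tree[page["StructParents"]])
```

In this model, each kept page's StructParents value still resolves to the same structure elements it pointed at before the copy, which is the property text extraction relies on to find entries such as "ActualText".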