PyMuPDF bug: https://github.com/pymupdf/PyMuPDF/issues/3863
Problem file: https://github.com/user-attachments/files/17007483/test2.pdf
Problem description: This file contains 8 pages with redaction rectangles covering each page completely. The pages consist of scanned images only: no text, no vector graphics. Some pages are rotated. When applying the redactions using "images=0, graphics=0, text=0", some of the rotated pages are fully emptied, whereas the expected result is a no-op. The following error message is displayed: "syntax error: cannot find XObject resource 'Im1'".
I can reproduce this by opening the input file in mupdf-gl, pressing 'R', choosing not to draw black boxes, choosing to keep images, graphics and text, and then pressing "Redact document".
This input file is tricky! Page 5, stored in PDF object 15 0 R, has /Contents 32 0 R, but so does page 7, stored in PDF object 17 0 R. Note that it is not the resource dictionaries that are shared between the pages, it is the content stream! Presumably because the scanned images just happened to be exactly the same size and needed to be scaled by the same amount? So when pdf_redact_page() and its collaborator pdf_filter_content_stream() filter the contents of page 5 and rename the image resources in that page's resource dictionary, the rewritten content stream ALSO becomes the content stream of page 7, but the resource dictionary of page 7 is not updated. This leaves the page 7 content stream and its page resources in a broken state. Later on, when mupdf-gl's step_redact_all_pages() calls pdf_redact_page() for page 7, the image resources cannot be found in the page's resource dictionary under the names the content stream expects.
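For anyone who wants to check other files for the same layout, here is a minimal sketch that lists content stream objects used by more than one page. The helper names find_shared_contents() and contents_by_page() are made up for this example; the only PyMuPDF call relied on is Page.get_contents(), which returns the xrefs of a page's content streams. Page numbers are reported 1-based to match the description above.

```python
from collections import defaultdict


def find_shared_contents(contents_by_page):
    """Return {contents xref: [page numbers]} for xrefs used by more than one page."""
    users = defaultdict(list)
    for page_number, xrefs in contents_by_page.items():
        # set() so that a page listing the same stream twice is counted once
        for xref in set(xrefs):
            users[xref].append(page_number)
    return {xref: pages for xref, pages in users.items() if len(pages) > 1}


def contents_by_page(path):
    """Collect the content stream xrefs of every page (hypothetical helper)."""
    import pymupdf  # imported lazily; older releases use `import fitz`

    with pymupdf.open(path) as doc:
        # page.number is 0-based, so add 1 for human-friendly page numbers
        return {page.number + 1: page.get_contents() for page in doc}


if __name__ == "__main__":
    import sys

    for xref, pages in find_shared_contents(contents_by_page(sys.argv[1])).items():
        print(f"contents object {xref} 0 R is shared by pages {pages}")
```

If the analysis above is right, running this on test2.pdf should report object 32 as shared by pages 5 and 7, but I have not verified the output myself.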
Knowing that and looking at the source, I think the problem is in pdf_filter_page_contents(), because it may create a new page Contents stream object, but only if the existing Contents object is not a stream (in that case it is presumably an array of streams). If it were to create a new page Contents stream object unconditionally, this file could be redacted without problems.
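To make the difference between the two behaviours concrete, here is a toy model in plain Python. It is not MuPDF code: the object table, the redact_page() helper and the "R_" renaming scheme are all invented for illustration, so the resource name in the simulated error differs from the 'Im1' seen in the real message. It only shows why rewriting a shared stream in place breaks the second page, and why always allocating a new stream object avoids that.

```python
import re

DO_OPERATOR = re.compile(r"/(\w+) Do")


def redact_page(objects, page, always_new_stream):
    """Toy no-op redaction: rename the page's resources and rewrite its stream."""
    contents_xref = page["contents"]
    stream = objects[contents_xref]
    # Fail when the stream uses a name the page's resources do not define.
    for name in DO_OPERATOR.findall(stream):
        if name not in page["resources"]:
            raise KeyError(f"cannot find XObject resource '{name}'")
    # Hypothetical renaming scheme: prefix every resource name with "R_".
    renames = {name: "R_" + name for name in page["resources"]}
    page["resources"] = {renames[n]: v for n, v in page["resources"].items()}
    new_stream = DO_OPERATOR.sub(lambda m: f"/{renames[m.group(1)]} Do", stream)
    if always_new_stream:
        # Proposed behaviour: the page gets its own, freshly numbered stream object.
        new_xref = max(objects) + 1
        objects[new_xref] = new_stream
        page["contents"] = new_xref
    else:
        # In-place update: every other page pointing at this object sees the change.
        objects[contents_xref] = new_stream


def run(always_new_stream):
    """Redact pages 5 and 7, which share contents object 32; return the errors."""
    objects = {32: "q 595 0 0 842 0 0 cm /Im1 Do Q"}
    pages = {
        5: {"contents": 32, "resources": {"Im1": "image A"}},
        7: {"contents": 32, "resources": {"Im1": "image B"}},
    }
    errors = []
    for number, page in pages.items():
        try:
            redact_page(objects, page, always_new_stream)
        except KeyError as exc:
            errors.append((number, exc.args[0]))
    return errors


if __name__ == "__main__":
    print("in place:  ", run(always_new_stream=False))
    print("new stream:", run(always_new_stream=True))
```

With the in-place update, page 5 succeeds and page 7 fails because its stream already uses the renamed resource while its dictionary still has the old name. With a new stream object per page, both pages are processed without errors.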
Interesting! I did not realize that the same content object could be reused by different pages. Is that even legal according to the PDF spec? I did not find an explicit statement on this.
If it is not explicitly forbidden it must be allowed?
Fixed in
commit 934cc6babad2d389d5fbe4128183c628107443de
Author: Sebastian Rasmussen <sebras@gmail.com>
Date: Fri Sep 27 03:23:27 2024 +0200
Bug 708032: When redacting pages, create new content stream objects, do not replace them.
In the file from the bug both page 5 and page 7 refer to the same contents stream object, 32 0 R. So when page 5 is redacted its resources will be renamed and its contents stream will be updated accordingly. But this also replaces the contents stream for page 7, whose resources will not be renamed. Later on when page 7 is redacted, its already updated contents stream refers to renamed resources that still exist only under their original names in its resource dictionary.
If the page's contents consisted of an array of streams, or if the stream object was entirely missing, a new contents stream object would be created; otherwise the contents stream object was updated in place. By updating the contents stream object in place, one page's contents would also change another page's contents. The fix is therefore to always create a new stream object for the new contents stream of the redacted page.