708032 – Incorrect full page redactions

Summary: Incorrect full page redactions

Status:	RESOLVED FIXED

Alias:	None

Product:	MuPDF
Classification:	Unclassified
Component:	mupdf
Version:	master
Hardware:	All All

Importance:	P2 normal
Assignee:	MuPDF bugs

Reported:	2024-09-15 21:46 UTC by Jorj
Modified:	2024-09-27 20:46 UTC
CC List:	1 user

Description Jorj 2024-09-15 21:46:56 UTC

PyMuPDF bug: https://github.com/pymupdf/PyMuPDF/issues/3863

Problem file: https://github.com/user-attachments/files/17007483/test2.pdf

Problem description:
This file contains 8 pages with redaction rectangles covering each page completely.
The pages consist of scanned images only - no text, no vector graphics.
Some pages are rotated.
When applying the redactions using "images=0, graphics=0, text=0", some of the rotated pages are fully emptied, whereas the expected result is a no-op.
The following error message is displayed: "syntax error: cannot find XObject resource 'Im1'".

Comment 1 Sebastian Rasmussen 2024-09-27 01:14:13 UTC

I can reproduce this by opening the input file in mupdf-gl, pressing 'R', choosing not to draw black boxes and to keep images, graphics and text, and then pressing "Redact document".

Comment 2 Sebastian Rasmussen 2024-09-27 01:16:25 UTC

This input file is tricky! Page 5 stored in PDF object 15 0 R has /Contents 32 0 R, but so has page 7 stored in PDF object 17 0 R. Note that it is not the resource dictionaries that are shared between pages, it is the content stream! Presumably because the scanned images just happened to be exactly the same size and needed to be scaled by the same amount?
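
A quick way to spot this kind of sharing is to invert the page-to-contents mapping and look for stream objects used by more than one page. The sketch below is only an illustration: `find_shared_contents` and the sample layout are made up for this note, and only the fact that pages 5 and 7 both point at object 32 comes from the file. With PyMuPDF the per-page lists could presumably be collected with `page.get_contents()`, but that step is left out here so the example runs without the library.

```python
from collections import defaultdict


def find_shared_contents(contents_by_page):
    """Return {stream object number: [pages]} for streams used by 2+ pages.

    contents_by_page maps a page number to the list of object numbers
    that make up that page's /Contents entry.
    """
    users = defaultdict(list)
    for page_no, xrefs in contents_by_page.items():
        for xref in xrefs:
            users[xref].append(page_no)
    return {xref: pages for xref, pages in users.items() if len(pages) > 1}


# Hypothetical layout: only "pages 5 and 7 share object 32" is taken from
# the report, the other object numbers are invented for illustration.
layout = {4: [31], 5: [32], 6: [33], 7: [32]}
print(find_shared_contents(layout))  # {32: [5, 7]}
```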

So when pdf_redact_page() and its collaborator pdf_filter_content_stream() go and filter the contents of page 5 and rename the image resources in the page resource dictionary of that page, the rewritten content stream ALSO becomes the content stream of page 7, because the stream object is shared, but the page resources of page 7 are not renamed.

This leaves the page 7 content stream and its page resources in a broken state.

Later on when mupdf-gl's step_redact_all_pages() calls pdf_redact_page() for page 7, the image resources cannot be found under the names the content stream expects in the page's resource dictionary.
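
The failure mode can be reproduced in a toy model. This is not MuPDF code: the PDF is reduced to a dict of numbered objects, a content stream is reduced to the list of resource names it uses, and the names `Img`, `Img_r` and the helper functions are invented for illustration. Only the sharing of object 32 between pages 5 and 7 mirrors the file.

```python
def redact_in_place(objects, page):
    """Toy version of the old behaviour: rename the page's resources and
    overwrite the existing content stream object with the filtered stream."""
    renames = {name: name + "_r" for name in page["resources"]}
    page["resources"] = {renames[n]: ref for n, ref in page["resources"].items()}
    stream = objects[page["contents"]]
    objects[page["contents"]] = [renames.get(tok, tok) for tok in stream]


def missing_resources(objects, page):
    """Names the content stream uses that the page's resources lack."""
    return [tok for tok in objects[page["contents"]] if tok not in page["resources"]]


# One shared content stream (object 32), two separate resource dictionaries.
objects = {32: ["Img"]}
page5 = {"resources": {"Img": 20}, "contents": 32}
page7 = {"resources": {"Img": 22}, "contents": 32}

redact_in_place(objects, page5)
print(missing_resources(objects, page5))  # []
print(missing_resources(objects, page7))  # ['Img_r']
```

Page 7 was never touched, yet its content stream now asks for a name that its own resource dictionary does not contain.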

Comment 3 Sebastian Rasmussen 2024-09-27 01:23:11 UTC

Knowing that and looking at the source, I think that the problem is in pdf_filter_page_contents(), because it may create a new page Contents stream object, but only if the existing Contents object is not a stream (in which case it is presumably an array of streams). If it were to create a new page Contents stream object unconditionally, this file could be redacted without problems.

Comment 4 Jorj 2024-09-27 12:45:18 UTC

Interesting! I did not realize that the same content objects were reused between different pages.
Is that even legal as per the PDF spec? I did not find explicit comments in this regard.

Comment 5 Sebastian Rasmussen 2024-09-27 20:45:29 UTC

If it is not explicitly forbidden it must be allowed?

Comment 6 Sebastian Rasmussen 2024-09-27 20:46:04 UTC

Fixed in

commit 934cc6babad2d389d5fbe4128183c628107443de
Author: Sebastian Rasmussen <sebras@gmail.com>
Date:   Fri Sep 27 03:23:27 2024 +0200

    Bug 708032: When redacting pages, create new content stream objects, do not replace them.
    
    In the file from the bug both page 5 and page 7 refer to the same
    contents stream object, 32 0 R.
    
    So when page 5 is redacted its resources will be renamed and its
    contents stream will be updated accordingly. But this also replaces the
    contents stream for page 7, whose resources will not be renamed.
    Later on when page 7 is redacted, its already updated contents stream
    refers to the renamed resources, which still exist only under their
    original names in its resource dictionary.
    
    If the page's contents consisted of an array of streams or if the
    stream object was entirely missing a new contents stream object would
    be created, otherwise the contents stream object was updated in place.
    By updating the contents stream object in place one page's contents
    would also change another page's contents.
    
    The fix is therefore to always create a new stream object for the
    new contents stream of the redacted page.
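
The effect of the fix can be sketched in a toy model that compares the two strategies. This is an assumption-laden illustration, not the actual patch: objects are a plain dict, a content stream is the list of resource names it uses, and `redact_page`, `is_consistent`, the `always_new` flag and the `Img` names are all hypothetical.

```python
def redact_page(objects, page, always_new):
    """Rename the page's resources and store the filtered content stream,
    either in a freshly numbered object or over the existing one."""
    renames = {name: name + "_r" for name in page["resources"]}
    page["resources"] = {renames[n]: ref for n, ref in page["resources"].items()}
    new_stream = [renames.get(tok, tok) for tok in objects[page["contents"]]]
    if always_new:
        new_num = max(objects) + 1       # allocate a new object number
        objects[new_num] = new_stream
        page["contents"] = new_num       # only this page points at it
    else:
        objects[page["contents"]] = new_stream  # also changes other users


def is_consistent(objects, page):
    """True if every name the content stream uses is in the page's resources."""
    return all(tok in page["resources"] for tok in objects[page["contents"]])


def run(always_new):
    objects = {32: ["Img"]}
    page5 = {"resources": {"Img": 20}, "contents": 32}
    page7 = {"resources": {"Img": 22}, "contents": 32}
    redact_page(objects, page5, always_new)
    # Check page 7 before it is redacted itself.
    return is_consistent(objects, page5), is_consistent(objects, page7)


print(run(always_new=False))  # (True, False)
print(run(always_new=True))   # (True, True)
```

With a new object per redacted page, the shared object 32 is left untouched, so page 7 stays valid until its own redaction gives it a separate stream.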