Bug 693545 - remove identical streams
Summary: remove identical streams
Status: RESOLVED FIXED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: master
Hardware: PC Linux
Importance: P4 enhancement
Assignee: Tor Andersson
Reported: 2013-01-10 02:01 UTC by liucougar
Modified: 2013-01-15 10:48 UTC

Attachments
deduplicate identical streams (1.56 KB, patch)
2013-01-10 02:03 UTC, liucougar

Description liucougar 2013-01-10 02:01:54 UTC
the attached patch adds support for deduplicating identical streams

it does not decompress streams; instead, the two objects have to match exactly (same filter, same length) before the compressed stream data is compared. if the streams are identical, one of them is removed
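
a rough sketch of the idea, for illustration only: this is not the attached patch and not MuPDF code. `struct raw_stream`, `same_filter` and `dedupe_streams` are hypothetical names, and a stream is reduced here to a filter name, a length and its still-compressed bytes. the cheap fields are checked first, and the bytes are only compared once those agree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a PDF stream object; not a MuPDF type. */
struct raw_stream {
    const char *filter;        /* e.g. "FlateDecode"; NULL if unfiltered */
    size_t length;             /* number of still-compressed bytes */
    const unsigned char *data; /* the compressed bytes, never decoded here */
};

/* Two filter names match if both are absent or both are the same string. */
static int same_filter(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    return strcmp(a, b) == 0;
}

/*
 * For each stream i, set keep[i] to the index of the earliest identical
 * stream, or to i itself when no earlier stream matches. A writer could
 * then drop every stream with keep[i] != i and repoint references to it.
 */
static void dedupe_streams(const struct raw_stream *s, size_t n, size_t *keep)
{
    assert(n == 0 || (s != NULL && keep != NULL));

    for (size_t i = 0; i < n; i++) {
        keep[i] = i;
        for (size_t j = 0; j < i; j++) {
            if (keep[j] != j)
                continue; /* j is already a duplicate of something earlier */
            if (s[i].length != s[j].length)
                continue; /* cheap check first: lengths must match */
            if (!same_filter(s[i].filter, s[j].filter))
                continue; /* same bytes under another filter are not equal */
            if (s[i].length == 0 ||
                memcmp(s[i].data, s[j].data, s[i].length) == 0) {
                keep[i] = j;
                break;
            }
        }
    }
}
```

the pairwise loop is quadratic in the number of streams; it is kept that way here only to make the comparison order easy to follow.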

please let me know what you think about this enhancement. thanks
Comment 1 liucougar 2013-01-10 02:03:47 UTC
Created attachment 9203 [details]
deduplicate identical streams
Comment 2 Robin Watts 2013-01-11 16:32:36 UTC
The patch as supplied does not work as it's incorrectly dealing with fz_buffers and memcmping the incorrect length. Also, there are potential problems in the error handling.

I have a fix based on this same idea going through review now though, and will update the bug here when it is committed.

http://git.ghostscript.com/?p=user/robin/mupdf.git;a=commitdiff;h=6bc1ca3cfc19440b99c2efc919c2ec607fa51666

Many thanks!
Comment 3 Robin Watts 2013-01-11 17:48:45 UTC
Fixed in:

commit e145b71a5a7462660e210d40ada498e01c7407a3
Author: Robin Watts <robin.watts@artifex.com>
Date:   Fri Jan 11 16:18:05 2013 +0000

    Bug 693545: Extend pdfwrite to remove identical streams.

    When writing pdf files, we currently have the option to remove duplicate
    copies of objects; all streams are treated as being different though.

    Here we add the option to spot duplicate streams too.

    Based on a patch submitted by Heng Liu. Many thanks!

Many thanks!
Comment 4 liucougar 2013-01-11 19:02:05 UTC
thanks for landing this

should pdfclean be modified to add something like the following to its usage message?

		"\t-gggg\tin addition to -ggg merge duplicate objects with streams\n"
Comment 5 zeniko 2013-01-11 20:08:01 UTC
(In reply to comment #3)
Shouldn't it be

> if (lena == lenb && memcmp(dataa, datab, lena) == 0)
>     differ = 0;

i.e. the two lengths must match, else with lena > lenb this will cause a read access violation (and with lena < lenb this might consider streams identical which aren't)?
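
A minimal, self-contained sketch of the length-checked comparison described above. The variable names follow the quoted snippet, but `buffers_differ` is a hypothetical helper written for illustration, not a function from the MuPDF source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if the two buffers differ and 0 if they are byte-for-byte
 * identical. The lengths are compared first, so memcmp never reads past
 * the end of the shorter buffer and a shared prefix is not mistaken for
 * equality.
 */
static int buffers_differ(const unsigned char *dataa, size_t lena,
                          const unsigned char *datab, size_t lenb)
{
    int differ = 1;

    /* A non-empty buffer must have a valid pointer. */
    assert(lena == 0 || dataa != NULL);
    assert(lenb == 0 || datab != NULL);

    if (lena == lenb && (lena == 0 || memcmp(dataa, datab, lena) == 0))
        differ = 0;

    return differ;
}
```

With `"abcd"` (length 4) and `"abc"` (length 3), this reports a difference in both argument orders. A bare `memcmp` using only one of the two lengths would either read one byte too many or compare just the common prefix.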
Comment 6 Robin Watts 2013-01-15 10:48:01 UTC
Fixed in:

commit 7231417c1e4cf1c8a5601a54a24e6366bee3a8c9
Author: Robin Watts <robin.watts@artifex.com>
Date:   Sat Jan 12 11:49:25 2013 +0000

    Bug 693545: Fix typo in previous commit.

    When adding code to spot identical streams, I got the logic in
    a test reversed as a result of a last minute change. Corrected here.
    Thanks to zeniko for pointing this out.