Bug 691661 - pdfclean writes pdf with missing japanese fonts when given -g option
Summary: pdfclean writes pdf with missing japanese fonts when given -g option
Status: RESOLVED INVALID
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: unspecified
Hardware: PC Windows XP
Importance: P4 major
Assignee: Tor Andersson
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-02 00:01 UTC by Benjamin Ullian
Modified: 2012-04-20 15:00 UTC

See Also:
Customer:
Word Size: ---


Attachments
The input PDF file for which pdfclean 0.7 causes breakage. (223.53 KB, application/force-download)
2010-10-02 00:01 UTC, Benjamin Ullian
the output from pdf when -g option given (217.89 KB, application/force-download)
2010-10-02 00:02 UTC, Benjamin Ullian
Another input example. (259.42 KB, application/force-download)
2010-10-02 00:05 UTC, Benjamin Ullian
Yet another input example (391.30 KB, application/force-download)
2010-10-02 00:06 UTC, Benjamin Ullian

Description Benjamin Ullian 2010-10-02 00:01:27 UTC
Created attachment 6772
The input PDF file for which pdfclean 0.7 causes breakage.

pdfclean reproducibly "cleans" used fonts out of Japanese PDFs. This started occurring with version 0.7, and is reproducible with both self-compiled and pre-compiled pdfclean binaries, on both Linux and Windows.


Attached is a PDF which, when cleaned, produces a font-broken PDF with pdfclean 0.7 but not with 0.6.
Comment 1 Benjamin Ullian 2010-10-02 00:02:14 UTC
Created attachment 6773
the output from pdf when -g option given
Comment 2 Benjamin Ullian 2010-10-02 00:02:53 UTC
Also, Adobe Reader bookmarks come out messed up, so it is not just the content but also the bookmark tree that is affected.
Comment 3 Benjamin Ullian 2010-10-02 00:05:46 UTC
Created attachment 6774
Another input example.
Comment 4 Benjamin Ullian 2010-10-02 00:06:08 UTC
Created attachment 6775
Yet another input example
Comment 5 Sebastian Rasmussen 2010-10-02 13:46:32 UTC
I assume that these are the kind of errors you see?

sebras@host:/tmp$ pdfdraw -m jpn_1.pdf-clean-g.pdf 
warning: unknown cid collection: �5�
                                    ��0�
                                        warning: unknown cid collection: �ܖ�-ˮÕ�L
warning: unknown cid collection: ��������
                                         warning: freetype failed to load glyph: invalid argument
warning: freetype load glyph (gid 690): invalid argument
warning: freetype failed to load glyph: invalid argument
[...]

Using a clean checkout of the latest development branch, I cannot
reproduce your problem. So it seems to me that the pdfclean you are
using might not have been properly rebuilt or tagged. I remember
breaking pdfclean some time back, but it should be fixed by now. :)

sebras@host:/tmp$ pdfclean -g jpn_1.pdf out.pdf && pdfdraw -m out.pdf
page out.pdf 1 20ms
page out.pdf 2 42ms
page out.pdf 3 30ms
page out.pdf 4 18ms

For jpn_3.pdf I _do_ get four errors on page 65 even for the original file though, and of course they remain after cleaning:

sebras@host:/tmp$ pdfdraw -m jpn_3.pdf 65; pdfclean -g jpn_3.pdf out.pdf && pdfdraw -m out.pdf 65
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
page jpn_3.pdf 65 46ms
total 46ms / 1 pages for an average of 46ms
fastest page 65: 46ms
slowest page 65: 46ms
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
page out.pdf 65 47ms
total 47ms / 1 pages for an average of 47ms
fastest page 65: 47ms
slowest page 65: 47ms

If you are calling pdfclean with -ggg, then it is a whole different matter, because in that case even I get easily reproducible errors. Is this what you are experiencing?
sebras@host:/tmp$ pdfclean -ggg jpn_1.pdf out.pdf && pdfdraw -m out.pdf
+ fitz/filt_flate.c:92: readflated(): zlib error: incorrect header check
| fitz/stm_read.c:29: fz_read(): read error
| fitz/stm_read.c:77: fz_readall(): read error
| mupdf/pdf_stream.c:389: pdf_loadstream(): cannot read raw stream (387 0 R)
| mupdf/pdf_page.c:27: pdf_loadpagecontentsarray(): cannot load content stream part 6/7 (387 0 R)
| mupdf/pdf_page.c:52: pdf_loadpagecontents(): cannot load content stream array (6 0 R)
| mupdf/pdf_page.c:217: pdf_loadpage(): cannot load page contents (6 0 R)
| apps/pdfdraw.c:100: drawpage(): cannot load page 1 in file 'out.pdf'
\ apps/pdfdraw.c:35: die(): aborting

This I will look into, but I don't think this is the type of problem you are having?
Comment 6 Benjamin Ullian 2010-10-02 14:42:56 UTC
You are correct: the bug I noticed was with the single -g argument on the 0.7 TAG, but I checked out the latest git sources and the problem is no longer reproducible when I compile those.

Digging into the PDFs, it seems the PDF hex strings <...> were coming out entirely differently in the 0.7 TAG. Perhaps it was a decryption issue?
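
For anyone who wants to test the decryption hypothesis by hand, here is a minimal Python sketch of how an Arcfour (RC4) encrypted PDF string is normally decrypted: the per-object key is derived from the file encryption key plus the object and generation numbers, as described in the PDF specification for RC4-based security handlers. This is not MuPDF's code. The function names are made up for illustration, and the file key used in the usage note is a placeholder, since the real key for the attached PDFs is not known here.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4 (Arcfour); the same call encrypts and decrypts."""
    # Key-scheduling: permute the 256-entry state using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    # Keystream generation, XORed with the data byte by byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def object_key(file_key: bytes, num: int, gen: int) -> bytes:
    """Per-object RC4 key: MD5 over the file key followed by the low
    3 bytes of the object number and the low 2 bytes of the generation
    number (both little-endian), truncated to min(len(file_key) + 5, 16)."""
    material = (
        file_key
        + (num & 0xFFFFFF).to_bytes(3, "little")
        + (gen & 0xFFFF).to_bytes(2, "little")
    )
    return hashlib.md5(material).digest()[: min(len(file_key) + 5, 16)]
```

With the real file key, something like `rc4(object_key(file_key, 693, 0), bytes.fromhex("15ADBE3CE202"))` should give the plaintext of a string stored in object 693 0. A wrong key, or a wrong object number fed into the derivation, yields bytes that look random.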



Most strings in these PDFs are UCS-2 and so should, upon decryption, begin with a BOM (FE FF ...). Here's an example of a bookmark where the 0.7 TAG version of pdfclean was obviously incorrect:

(from input pdf #1, line 7080, object 693 0)
/Title <15ADBE3CE202>  (this is Arcfour encrypted)


(from TAG 0.7 cleaned pdf #1, line 4853, object 693 0)
/Title<32274D4C7C35>  ( no BOM .... )


(from git-bleeding-edge cleaned pdf #1, line 4853, object 693 0)
/Title<FEFF88687D19>  (notice the BOM -- this is correct)
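
The BOM check above can be scripted. Below is a minimal Python sketch that takes the hex digits from between the angle brackets and decodes them as UTF-16BE only when the FE FF marker is present. The function name is made up for illustration. A string without a BOM is not necessarily broken in general (PDF text strings may also use PDFDocEncoding), but for these Japanese titles a missing BOM is a strong hint that decryption went wrong.

```python
from typing import Optional


def decode_bom_hex_string(hex_digits: str) -> Optional[str]:
    """Decode the contents of a PDF hex string <...> as UTF-16BE.

    Returns None when the bytes do not start with the FE FF byte
    order mark, i.e. when the string is not BOM-prefixed Unicode.
    """
    raw = bytes.fromhex(hex_digits)
    if raw[:2] != b"\xfe\xff":
        return None
    return raw[2:].decode("utf-16-be")


# The three /Title values quoted above.
for label, digits in [
    ("input (still encrypted)", "15ADBE3CE202"),
    ("0.7 TAG output", "32274D4C7C35"),
    ("git output", "FEFF88687D19"),
]:
    print(label, "->", decode_bom_hex_string(digits))
```

Only the git output decodes: its payload is the two code points U+8868 and U+7D19. The other two values return None.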
Comment 7 Robin Watts 2012-04-20 15:00:23 UTC
From reading the comments it seems that this is fixed. If not, please reopen with a restatement of what the problem is. Thanks.