Created attachment 6772 [details] The input PDF file for which pdfclean 0.7 causes breakage. pdfclean reproducibly "cleans" the fonts in use out of Japanese PDFs. This started occurring with version 0.7 and is reproducible with both self-compiled and pre-compiled pdfclean binaries, on both Linux and Windows. Attached is a PDF which, when cleaned, produces a font-broken PDF with pdfclean 0.7 but not with 0.6.
Created attachment 6773 [details] The output from pdfclean when the -g option is given.
Also, the Adobe Reader bookmarks come out garbled, so it is not just the page content but also the bookmark tree that is affected.
Created attachment 6774 [details] Another input example.
Created attachment 6775 [details] Yet another input example
I assume that these are the kind of errors you see?

sebras@host:/tmp$ pdfdraw -m jpn_1.pdf-clean-g.pdf
warning: unknown cid collection: �5� ��0�
warning: unknown cid collection: �ܖ�-ˮÕ�L
warning: unknown cid collection: ��������
warning: freetype failed to load glyph: invalid argument
warning: freetype load glyph (gid 690): invalid argument
warning: freetype failed to load glyph: invalid argument
[...]

When I try to reproduce your problem using a clean checkout of the latest development branch, I cannot. So to me it seems as if your pdfclean might not have been properly rebuilt or tagged. I remember breaking pdfclean some time back, but it should be fixed by now. :)

sebras@host:/tmp$ pdfclean -g jpn_1.pdf out.pdf && pdfdraw -m out.pdf
page out.pdf 1 20ms
page out.pdf 2 42ms
page out.pdf 3 30ms
page out.pdf 4 18ms

For jpn_3.pdf I _do_ get four errors on page 65 even for the original file though, and of course they remain after cleaning:

sebras@host:/tmp$ pdfdraw -m jpn_3.pdf 65; pdfclean -g jpn_3.pdf out.pdf && pdfdraw -m out.pdf 65
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
page jpn_3.pdf 65 46ms
total 46ms / 1 pages for an average of 46ms
fastest page 65: 46ms
slowest page 65: 46ms
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
warning: freetype load glyph (gid 654): invalid argument
page out.pdf 65 47ms
total 47ms / 1 pages for an average of 47ms
fastest page 65: 47ms
slowest page 65: 47ms

If you are calling pdfclean with -ggg, then it is a whole different matter, because in that situation even I get easily reproducible errors. Is this what you experience?
sebras@host:/tmp$ pdfclean -ggg jpn_1.pdf out.pdf && pdfdraw -m out.pdf
+ fitz/filt_flate.c:92: readflated(): zlib error: incorrect header check
| fitz/stm_read.c:29: fz_read(): read error
| fitz/stm_read.c:77: fz_readall(): read error
| mupdf/pdf_stream.c:389: pdf_loadstream(): cannot read raw stream (387 0 R)
| mupdf/pdf_page.c:27: pdf_loadpagecontentsarray(): cannot load content stream part 6/7 (387 0 R)
| mupdf/pdf_page.c:52: pdf_loadpagecontents(): cannot load content stream array (6 0 R)
| mupdf/pdf_page.c:217: pdf_loadpage(): cannot load page contents (6 0 R)
| apps/pdfdraw.c:100: drawpage(): cannot load page 1 in file 'out.pdf'
\ apps/pdfdraw.c:35: die(): aborting

This I will look into, but I don't think this is the type of problem you are having?
You are correct: the bug I noticed was with the single -g argument on the 0.7 tag. I have since checked out the latest git sources, and the problem is no longer reproducible when I compile those.

Digging into the PDFs, it seems the PDF hex strings <...> were coming out entirely differently with the 0.7 tag; perhaps it was a decryption issue? Most strings in these PDFs are UCS-2 and so should, once decrypted, begin with a BOM (FE FF ...). Here is an example of a bookmark where the 0.7 tag version of pdfclean was obviously incorrect:

(from input PDF #1, line 7080, object 693 0)
/Title <15ADBE3CE202> (this is Arcfour-encrypted)

(from the 0.7 tag cleaned PDF #1, line 4853, object 693 0)
/Title<32274D4C7C35> (no BOM)

(from the git bleeding-edge cleaned PDF #1, line 4853, object 693 0)
/Title<FEFF88687D19> (notice the BOM; this is correct)
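
For anyone wanting to repeat the BOM check on other strings, here is a minimal Python sketch. It is not taken from the pdfclean sources; the helper names (`rc4`, `object_key`, `parse_hex_string`, `decode_text_string`) and the all-zero file key in the usage note are made up for illustration. The per-object key derivation follows my reading of the RC4 case in the PDF specification (MD5 over the file key plus the low bytes of the object and generation numbers), and nothing in this report establishes that this is the step where the 0.7 tag went wrong.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4 (Arcfour). Encrypting and decrypting are the same operation."""
    s = list(range(256))
    j = 0
    for i in range(256):  # key scheduling
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # keystream generation, XORed onto the data
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) & 0xFF])
    return bytes(out)


def object_key(file_key: bytes, obj_num: int, gen_num: int) -> bytes:
    """Per-object RC4 key: MD5(file key + 3 low bytes of the object number
    + 2 low bytes of the generation number), truncated to min(n + 5, 16).
    The file key itself comes from the encryption dictionary and password,
    which this sketch does not handle."""
    material = (
        file_key
        + (obj_num & 0xFFFFFF).to_bytes(3, "little")
        + (gen_num & 0xFFFF).to_bytes(2, "little")
    )
    return hashlib.md5(material).digest()[: min(len(file_key) + 5, 16)]


def parse_hex_string(token: str) -> bytes:
    """Turn a PDF hex string such as '<FEFF88687D19>' into raw bytes."""
    body = token.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError("not a PDF hex string: %r" % token)
    digits = "".join(body[1:-1].split())  # whitespace inside <...> is ignored
    if len(digits) % 2:
        digits += "0"  # an odd final digit is padded with 0
    return bytes.fromhex(digits)


def decode_text_string(raw: bytes):
    """Return the text if raw starts with the FE FF BOM, otherwise None.
    None only means 'not BOM-prefixed UTF-16BE', not necessarily 'broken'."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return None
```

With these helpers, `decode_text_string(parse_hex_string("<FEFF88687D19>"))` yields a two-character string (U+8868 U+7D19), while the same call on `<32274D4C7C35>` returns `None`. Decrypting a string from object 693 0 would look like `rc4(object_key(file_key, 693, 0), raw)`, where `file_key` has to be recovered from the document first; a key derived from the wrong object or generation number would produce exactly this kind of BOM-less output, though that is only one possible explanation.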
From reading the comments it seems that this is fixed. If not, please reopen with a restatement of what the problem is. Thanks.