Summary: | pdfinflt.ps creates an incomplete PDF | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | Igor Melichev <igor.melichev> |
Component: | Test Framework | Assignee: | Alex Cherepanov <alex> |
Status: | RESOLVED WONTFIX | ||
Severity: | normal | CC: | christinedelight.top85, leonardo, sags5495 |
Priority: | P3 | Keywords: | bountiable |
Version: | master | ||
Hardware: | PC | ||
OS: | Windows XP | ||
Customer: | Word Size: | --- | |
Attachments: |
cur.pdf
cur1.pdf
Patch for "Error: /rangecheck in --get--" etc.
Additional test files for "Error: /ioerror in --readstring--".
Patch for "dicts remain on the o-stack".
"Consecutive" diffs for (H), (I), and (J). |
Description
Igor Melichev
2004-11-13 14:19:36 UTC
Does using pdfinflt.ps on those files end with an "Error: /ioerror in --readstring--"? Or an "Error: /undefined in --get--"? Does this seem to be the same problem encountered with the PDFs freely downloadable from http://www.acumentraining.com/AcumenJournal.html? (Let's choose Volume 9, September 2001; this is affected by both problems, although the 2nd one becomes visible only after fixing the first.)

NO, pdfinflt finishes with no error, but the generated PDF looks truncated. Will attach an example.

Created attachment 1052 [details]
cur.pdf
An incomplete PDF created from 000040cf.000_60.pdf.

Created attachment 1053 [details]
cur1.pdf
An incomplete PDF created from 86554321.pdf.

P2 and P1 are for customer bugs only.

I don't know where the type of damage in cur.pdf (from comment #3) could come from. cur1.pdf from comment #4 seems to be affected by (A) "Error: /ioerror in --readstring--" below; the only thing I don't understand is how pdfinflt.ps could finish with no error in such a case.

Here's a list of the problems I found, with pointers to more information and patches:

(A) Symptom: "Error: /ioerror in --readstring--" with encrypted PDFs. See comment #8 below for details and a patch, and comment #9 for some more test files.
(B) Symptom: "Error: /undefined in --get--" with encrypted PDFs. Note that usually this error is masked by (A) above. See bug #688149 "Problems, including one security-related, with handling dictionaries".
(C) Symptom: "Error: /undefined in --get--" (again), this time with PDF1.5+ xref streams. See bug #688152 "'Undefined in get' and extra trailer keys with pdfwrite.ps and PDF1.5+"
(D) Symptom: "Error: /undefined in /.bigstring" with PDF1.5+ xref streams. See bug #688151 "PDF interpreter needs languagelevel 3, not 2", also bug #688150 "ReusableStreamDecode available but failing at languagelevel 2".
(E) Miscellaneous:
- "Error: /rangecheck in --get--" if input PDF contains "/Filter []";
- abbreviated filter names are not processed;
- /JBig2Dec filter, if activated via /filterstoremove, is not processed correctly.
For these see comment #7 below.

Applying all the patches may produce conflicts, especially in pdfinflt.ps. Comment #10 includes its final version, with (A) and (E) applied and with the indents corrected according to ps_style.htm.

Created attachment 1462 [details]
Patch for "Error: /rangecheck in --get--" etc.

(E.i) For PDFs containing streams with "/Filter []" (empty array!), current pdfinflt.ps fails with "Error: /rangecheck in --get--". I found such files generated by PowerPdf (author Takeshi Kanno; I think there is at least one other piece of software with the same name). See, for example, LineExample.pdf at http://www.est.hi-ho.ne.jp/takeshi_kanno/powerpdf/ .

(E.ii) Abbreviated filter names are not recognized and thus not applied. Although legal only inside content streams, these are accepted by Adobe viewers (and by Ghostscript, I think) everywhere a filter name is expected.

(E.iii) The /JBig2Dec filter is currently not applied, but, should it ever need to be activated, uncommenting the line in /filterstoremove is not enough. The current code misses the special processing for /JBIG2Globals.

---
The suggested patch changes the way filters are processed. The old code extracted and applied them one by one. The new code splits the /Filter and /DecodeParms in 2 parts. The 1st part, the filters to apply, is passed to lib\pdf_base.ps::/applyfilters. The rest goes into the output PDF. I preferred to use proven-and-tested procs and avoid duplicating code, even if the new code is a bit larger (PostScript is not so good at manipulating arrays...).

Created attachment 1463 [details]
Patch for "Error: /ioerror in --readstring--".

Current pdfinflt.ps fails with "Error: /ioerror in --readstring--" when attempting to process encrypted PDFs.
The attached patch makes pdfinflt.ps work on encrypted PDFs, on condition of having owner rights on that PDF. Note there's one more bug affecting the processing of encrypted PDFs, (B) in comment #6 above, that becomes visible after applying this patch.
- The owner password must be known (-sPDFPassword=...) or must be blank. Otherwise pdfinflt.ps prints a message and quits (even if the PDF's author set no restrictions). If the owner password is blank but not the user password, one must supply the non-blank (user) one. (Cannot get this combination with Distiller, tricky to get with Ghostscript).
- PDFs with owner password == user password == not blank are rejected even if the correct password is supplied, to conform to the 2nd note on page 67 of the PDF Reference Manual, 2nd edition (PDF1.3), quoted below:
"Note: If the owner and user passwords are the same, the document is always opened with user access privileges. It is therefore impossible in these circumstances to obtain owner privileges for the document."
I haven't found this exact note in later editions, only remains of it, like "Opening the document with the correct owner password (assuming it is not the same as the user password)" - note the text in parentheses - on page 95 of the 4th edition (PDF1.5).
- lib\pdfwrite.ps::/pdfcopystream has 2 more parameters, but the change is backward compatible. The new parameters remain on the o-stack after /pdfcopystream's return, so old customisations of this proc will simply ignore them and work as they did before, without any change. <tostreamkey|null> is not used by pdfinflt.ps, because the latter outputs unencrypted PDFs. However, I included it for the future (add/remove encryption, merge PDFs, etc, all of this without redistilling).
- Encryption comes with some legal trouble. IMHO the attached patch is compliant. However, I'm not a lawyer, so seek qualified legal advice before making this code public. The same trouble currently exists with Ghostscript's PDF interpreter; this patch doesn't fix anything there.

Created attachment 1464 [details]
Additional test files for "Error: /ioerror in --readstring--".
ZIP file containing encrypted PDFs with all combinations of empty/
non empty passwords. The set includes 3 more exotic files, 2 with blank
owner passwords and one with user password == owner password.
PDF filenames specify the passwords, with "none" for blank ones.
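
The password rules stated for the patch for (A) (comment #8, with the erratum in comment #11) can be modelled in a few lines. This is only a sketch of the decision as described in this report, written in Python rather than PostScript; the function name and the use of plain-text passwords are assumptions made for illustration, and the real patch works on the PDF's encryption dictionary, not on clear-text strings.

```python
def can_inflate(owner_pw: str, user_pw: str, supplied: str = "") -> bool:
    """Hypothetical model of the owner-access rule; "" stands for a blank password.

    `supplied` plays the role of -sPDFPassword=... (blank if omitted or
    given as "-sPDFPassword=" with nothing after the "=").
    """
    # Identical non-blank owner and user passwords: the document always opens
    # with user privileges, so owner access cannot be obtained at all.
    if owner_pw == user_pw and owner_pw != "":
        return False
    # Otherwise the supplied password must match the owner password,
    # which may itself be blank.
    return supplied == owner_pw


# Password combinations like those the test archive is described as covering.
print(can_inflate("owner", "user", "owner"))  # owner password supplied
print(can_inflate("owner", "user", "user"))   # only the user password supplied
print(can_inflate("same", "same", "same"))    # owner == user, non-blank
print(can_inflate("", "user"))                # blank owner password
```

Modelling the "owner == user" case as an early rejection mirrors the note quoted from the PDF Reference: no supplied password can yield owner privileges there.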
Created attachment 1465 [details]
Final toolbin\pdfinflt.ps.
With (A) and (E) (comment #6) applied and indents corrected. The patches for (A) and (E) don't have correct indents, to shrink the diffs. Note that (A) changes lib\pdfwrite.ps also, and this change is needed too.

Erratum concerning comment #8:
> "If the owner password is blank but not the user password, one must
> supply the non-blank (user) one."
In fact, it IS possible to specify a blank password on the command line with "-sPDFPassword=" (nothing following the "="), so one does not have to specify the user password in this case.

Making this bountiable (hopefully SaGS will be able to provide patches and thus harvest a bounty).

Created attachment 1498 [details]
Revised patch for "Error: /ioerror in --readstring--".
Trying to keep the patches for (A) and (C) (see comment #6) separate, the suggested patch for (A) reproduces the anomaly that causes (C). Since the patch for (C) (in bug 688152 "'Undefined in get' and extra trailer keys with pdfwrite.ps and PDF1.5+") is for head, it does not fix the new code. Here I replace the patch for (A), found in comment #8, with one that does not include the mentioned anomaly, so that after applying both patches both problems are solved.

Created attachment 1499 [details]
Patch for "dicts remain on the o-stack".
A 6th problem with pdfinflt.ps:
(F) Symptoms: none visible, but tools\pdfinflt.ps::/pdfcopystream pushes
a dict on the o-stack that it never uses and forgets it there.
With all PDFs I tested this produced no error message, and the output
was always OK. Examining lib\pdfwrite.ps::/pdfwrite, it seems this will
be the case with any file. However, the o-stack grows with one element
per PDF stream and this may create problems in the future; plus, a
possible o-stack overflow with really huge PDFs.
Created attachment 1500 [details]
Final toolbin\pdfinflt.ps.
With the patches for (A) (the revised one, from comment #13), (E) and (F) applied and indents corrected.

Created attachment 1501 [details]
Final toolbin\pdfinflt.ps.
Sorry, for comment #15 I picked up a wrong copy.

Two more problems with the PDF interpreter (the PDF loader, to be more precise) that also affect pdfinflt.ps:

(G) Symptom: with hybrid-ref PDFs, objects accessible only to PDF1.5+ viewers (via /XRefStm) are not written in the output file, or are written as null objects. The output file is otherwise valid; it looks like the original displayed by a PDF1.4 viewer. See bug #688282 "Xref-streams ignored in hybrid-ref PDFs" for details and a patch.

(H) Symptom: same as (G), but with compressed object streams and an xref table that is (or the PDF loader considers it to be) invalid, so the xref rebuild logic is triggered. See bug 688283 "'trailer' without 'startxref'" (and the sample mentioned there) for such a case. The base cause of this is that the xref rebuild logic does not detect and handle compressed object streams. (Q: Should I file a separate bug report for this?)

Back to the file in comment #3: there is almost no chance of doing anything about it without a sample file. Especially since I believe comment #2 ("NO, pdfinflt finishes with no error") is wrong. Having something written into the output file means pdfwrite.ps::pdfwrite has been entered. pdfwrite unconditionally calls pdfwritestartxref, which unconditionally writes "%%EOF" to the output. So, unless an error occurs we have "%%EOF". These procedures are not being executed in a stopped context, so errors are not masked.

Another scenario: the output file is closed prematurely, so pdfwritestartxref cannot write the "%%EOF". But then it gets an ioerror when attempting to, doesn't it?

3rd scenario, equally unlikely: GS closes without flushing file buffers. Then, the outcome is meaningless and the problem could be anything.
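
The argument above rests on a simple observable: a file written to completion ends with "startxref", an offset, and "%%EOF". A quick way to test an output file for that trailer might look like the following Python sketch. The function name and the 1024-byte window are assumptions chosen for illustration, not part of pdfinflt.ps or Ghostscript.

```python
def looks_complete(pdf_bytes: bytes, tail: int = 1024) -> bool:
    """Return True if the last `tail` bytes contain 'startxref' followed by '%%EOF'."""
    tail_bytes = pdf_bytes[-tail:]
    eof = tail_bytes.rfind(b"%%EOF")
    if eof == -1:
        # No end-of-file marker: the writer never reached its final step.
        return False
    # 'startxref' must appear somewhere before the final '%%EOF'.
    return tail_bytes.rfind(b"startxref", 0, eof) != -1


complete = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\nstartxref\n9\n%%EOF\n"
truncated = b"%PDF-1.4\n1 0 obj\n<<"
print(looks_complete(complete), looks_complete(truncated))
```

A file that fails this check while the run reported no error would point at one of the less likely scenarios listed above (premature close or unflushed buffers).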
I do understand those files are private. But what if you overwrite all streams' contents with spaces, using a binary editor (assuming there's a file of reasonable size)? Then, I guess, all real content, sensitive, copyrighted or otherwise, is gone. I *think* the modified file will be useful for debugging. Looking at where the output is truncated, I don't think the error happens while decompressing a stream, but while walking the object graph.

Assigning to alex to evaluate and approve the patches.

*** Bug 688835 has been marked as a duplicate of this bug. ***

The patch in Comment #16 fails with all files listed in Description. The most frequent failure is "The owner password is required to process this file". At the same time, 000040cf.000_60.pdf is not encrypted (has no Encrypt dictionary at all). 86554321.pdf has 2 xrefs and 2 trailer dictionaries; the first one includes /Encrypt, the second one does not. Adobe opens all of them with no password, and Ghostscript reads them with no password when running without pdfinflt.ps.

The full list of failures:
000040cf.000_60.pdf - Undefined in /BXlevel
86554321.pdf - the owner password
AdobeLic.pdf - the owner password
Es001-01.pdf - the owner password
ICPconcept.pdf - the owner password
Jahr2000.pdf - the owner password
NECPNTD.pdf - the owner password
p2b-100.pdf - the owner password
rf1025.pdf - the owner password
RodinCIDEmbed.pdf - the owner password
test.pdf - the owner password
test3.pdf - the owner password
test_multipage_prob.pdf - syntaxerror in --token--

The last file prints this:
**** Warning: File has an invalid xref entry: 4. Rebuilding xref table.
Converting H:\AuxFiles\comparefiles\test_multipage_prob.pdf to decompr- test_multipage_prob.pdf
**** Warning: stream operator not terminated by valid EOL.
Error: /syntaxerror in --token--
(Well, the xref is bad, but we still want to decompress the file).
One useful change has been committed to HEAD: http://ghostscript.com/pipermail/gs-cvs/2006-September/006822.html

Dear SaGS, I viewed your code in Comment #16, and I apologize that we had no resources to understand it. The change is too big. I suggest you divide the change into consecutive patches with appropriate comments, so that we can understand and test it step by step.

Created attachment 2547 [details]
"Consecutive" diffs for (A), (E), and (F).

"Consecutive" diffs and other help
-------
I had previously attached separate diffs for the 3 patches that change pdfinflt.ps, but if you find "consecutive" patches easier to work with, the ZIP attached to this comment contains them. These are the patches for (F) (tame), (E) (easy), and (A), in this order, starting from SVN TRUNK -r7114. To help with (A), I can send you the file I used to create the encoded portion of pdfinflt.ps; just tell me to whom I should send it (I notice the latest comments are not posted by the same person the bug is assigned to). It contains the code in clear text, with comments. I'll post (comment #26) a summary of the many bugs referred to.

About "The owner password is required..."
-------
For encrypted PDFs, use "-sPDFPassword=<ownerpassword>", as mentioned in comment #8, with an erratum in comment #11. This is a must. If the USER password is blank, neither Adobe Reader nor GS asks for a password. This is ok for VIEWING the PDF. pdfinflt.ps, however, has to REMOVE ENCRYPTION (and thus any protection) to fulfill its goal. To do this, it is mandatory to ask for the OWNER password. See PDF Reference Manual for PDF 1.4 and later plus applicable errata, section 1.4/1.5 (varies with the edition) "Intellectual property", and read the paragraph containing "Authors ... must make reasonable efforts to ...". AFAIK Adobe initially introduced this requirement in November 2001, and gave it the current form between 2003/05/07 and 2003/06/18.
This is what I'm referring to in the last paragraph of comment #8, and the reason I marked all attachments dealing with this issue as private.

About "Undefined in /BXlevel" and invalid PDFs in general
-------
In general, to add code for handling a certain class of invalid PDFs I need samples of those PDFs. That being said, see comment #25.

Other notes
-------
- The patches posted 1 year ago were each relative to CVS HEAD, and when applying all of them there are some conflicts in pdfinflt.ps. The file in comment #16 is the result of solving these conflicts and correcting the indents. This final file is not suitable for looking at the changes.
- You cannot just paste pdfinflt.ps from comment #16 into a "stock" copy of GS. The patch for (A) makes a related change to lib\pdfwrite.ps too.

Created attachment 2548 [details]
"Consecutive" diffs for (H), (I), and (J).
- (I) and (J) are new and are described below.
- (H) has already been described in bug #688283, but that patch conflicts with the one for (J). Attached archive includes a "consecutive" diff for (H).

Two more patches for the PDF loader:

(I) Symptoms:
- "Error: /undefined in /BXlevel"
- sometimes not reporting unrecognized tokens
The PDF content stream operators BX/EX are used to suppress error reporting (so-called "compatibility sections"). To implement them, GS tracks the nesting level of BX/EX in "BXlevel", and ignores unknown executable names found while BXlevel > 0. However, the procedure that examines BXlevel and reports errors (lib\pdf_base.ps::.pdftokenerror) is called from outside content streams too. This has 2 categories of consequences:
- If GS is currently before the 1st page, after the last, or between pages, an "Error: /undefined in /BXlevel" results because BXlevel is created by lib\pdf_main::pdfshowpage_finish and available only while interpreting the pages. This always happens with pdfopt.ps and pdfinflt.ps, because these never interpret pages.
It also happens if the invalid token is found in the file trailer, during the rebuild, maybe in other cases too.
- If GS loads a PDF object with BX in effect, invalid tokens in this object won't be reported. According to my reading of the PDF Ref and experimenting with Adobe Reader, BX is supposed to hide only unknown operators in content streams, not any unknown tokens in objects (like resources) that happen to be loaded while interpreting a content stream with BX in effect.
There are other filed-and-fixed bugs that lead to "Error: /undefined in /BXlevel" (bug #688675, bug #688695, bug #688787), but each time the change was to work around the particular PDF invalidness.
Current patch corrects the use of BXlevel as follows:
- the <opdict> for content streams gets a special marker;
- .pdftokenerror examines BXlevel if and only if it finds the marker in <opdict>, meaning the invalid token came from a content stream and not from anywhere else.

(J) [Better] error recovery in the PDF loader
Currently, GS exits if an error happens during .[dec]pdfrun, the exception being that if this happens while reading the file trailer GS triggers the rebuild logic. This is not good enough for using toolbin\pdfinflt.ps with damaged PDFs. While handling absolutely any damage is not really feasible, the attached patch improves the situation.
Changes:
- lib\pdf_base.ps::.pdftokenerror does a "stop" if finding an unknown token anywhere but in content streams. There's no point in ignoring the token and continuing, because a missing token makes keys in a dict become values and vice-versa, elements' indices in PDF arrays matter, etc.
- lib\pdf_sec::.decpdfrun is brought to the same level as lib\pdf_base::.pdfrun; there are some tweaks in .pdfrun not made in .decpdfrun (not all connected to error recovery, for example .pdffixname is but PDFStepCount isn't).
- The stack depth computed by lib\pdf_base::.pdfrun was 1-off; this was compensated by a 1-off (in the opposite direction) in lib\pdf_base.ps::.pdftokenerror.
- lib\pdf_base.ps::.pdfruncontext was not restoring the context (LocalResources, DefaultQstate) in case of errors. This was not so important, because GS was aborting in such a case, but now we'd like to continue.
- Calls to .[dec]pdfrun trap errors, and act as follows:
  - errors in the trailer dictionary are not recoverable (actually, they trigger a rebuild, but the error will be encountered again and GS aborts);
  - errors in the xref table trigger the rebuild; there's no change here compared to unpatched GS;
  - invalid indirect objects are replaced with "null";
  - unknown operators (= invalid executable names) in content streams are ignored, together with their operands; this is the same behaviour as before the patch.

About object streams: There are 2 definitions of lib\pdf_base.ps::resolveobjectstream.
- The 1st, not used, is less performant than needed.
- The 2nd, actually used, aborts if an error happens while scanning the stream (because it will find an incorrect object count upon return from .pdfrun). Since it does a "raw" scan, ignoring the individual object byte limits, it is not possible to replace a damaged object with null and continue with the next one.
If you consider the performance less important than dealing with damaged PDFs, I suggest switching to the 1st implementation. I think the 1st implementation can be made less complicated. I also think (but have not verified) the 2nd implementation is incorrect. About all of this, another time, in another bug report. (This report is already too crowded.)
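
The decision logic described in (I) and (J) can be summarised in a short model. The Python below is only a sketch of the behaviour as described in this comment, not the PostScript in the patch: the marker key, the exception class, and both function names are hypothetical, and the assumption that an unknown content-stream operator outside BX/EX is reported before being ignored is my reading of the description.

```python
CONTENT_STREAM_MARKER = "content-stream-marker"  # hypothetical key, not the patch's name


class UnknownTokenError(Exception):
    """Raised for an unknown token found outside a content stream."""


def token_error(token, opdict, bx_level=0):
    """Model of the patched .pdftokenerror decision."""
    # BXlevel is consulted only when the marker shows that the token
    # came from a content stream.
    if CONTENT_STREAM_MARKER in opdict:
        return "ignored silently" if bx_level > 0 else "ignored with warning"
    # Anywhere else, skipping a token would shift dict keys and array
    # indices, so the load is stopped instead.
    raise UnknownTokenError(token)


def load_indirect_object(tokens, opdict):
    """Model of a trapped .pdfrun call: a damaged object becomes null (None)."""
    try:
        for token in tokens:
            if token not in opdict:
                token_error(token, opdict)
    except UnknownTokenError:
        return None
    return list(tokens)


page_ops = {CONTENT_STREAM_MARKER: True, "q": 1, "Q": 1}
print(token_error("Foo", page_ops, bx_level=1))
print(load_indirect_object(["obj", "bogus"], {"obj": 1}))
```

Keeping the BXlevel lookup behind the marker test is what avoids the "undefined in /BXlevel" error in tools that never interpret pages, since the variable is then never read outside a content stream.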
Order to apply the "consecutive" diff files
-------
There are 2 sequences of dependencies:
i : apply the "consecutive" diffs for (F) then (E) then (A)
ii : apply the "consecutive" diffs for (I) then (J) then (H)
Not that the fixes are fundamentally interdependent (except that (J) needs the "36#pgops" marker introduced by (I)), but the modified lines are physically too close for GNU patch to find the context it needs and apply the diffs.

Summary of patches
-------
Patches that touch toolbin\pdfinflt.ps [plus another file]:

(F) Dicts forgotten on the o-stack.
Details: comment #14
Patch:
- initially attachment #1499 [details] (from comment #14)
- "consecutive" patch in attachment #2547 [details] (from comment #24) as Bug687796-r7114-to-r7114F.diff.txt

(E) Miscellaneous.
Details: comment #7
Patch:
- initially attachment #1462 [details] (from comment #7)
- "consecutive" patch in attachment #2547 [details] (from comment #24) as Bug687796-r7114F-to-r7114FE.diff.txt

(A) Ability to inflate (and decrypt) encrypted files.
Details: comment #8
Patch:
- initially attachment #1498 [details] (from comment #13)
- "consecutive" patch in attachment #2547 [details] (from comment #24) as Bug687796-r7114FE-to-r7114FEA.diff.txt

Patches that fix bugs in the PDF loader:

(C) Bug #688152 "'Undefined in get' and extra trailer keys with pdfwrite.ps and PDF1.5+"
Details: bug #688152
Patch: attachment #1652 [details] (from bug #688152 comment #4)

(G) Bug #688282 "Xref-streams ignored in hybrid-ref PDFs"
Details: bug #688282
Patch: attachment #1650 [details] (from bug #688282 comment #1)

(I) Incorrect use of "BXlevel"
Details: comment #25 point (I)
Patch:
- "consecutive" patch in attachment #2548 [details] (from comment #25) as Bug687796-r7114-to-r7114I.diff.txt

Patches that improve PDF loader's behaviour when given incorrect PDFs:

(J) [Better] error recovery in the PDF loader
Details: comment #25 point (J)
Patch:
- "consecutive" patch in attachment #2548 [details] (from comment #25) as Bug687796-r7114I-to-r7114IJ.diff.txt

(H) Bug #688283 "'trailer' without 'startxref'"
Details: bug #688283
Patch:
- initially attachment #1651 [details] (from bug #688283 comment #1)
  note: a patch committed in the meantime created a conflict; see bug #688283 comment #2.
- "consecutive" patch in attachment #2548 [details] (from comment #25) as Bug687796-r7114IJ-to-r7114IJH.diff.txt

Patches that are already committed:
(B) Bug #688149.
(D) Bug #688150.

This is a very elderly bug that has recently fallen to me. My first inclination is that we should no longer support pdfinflt.ps; if customers need this functionality I think we should suggest they use MuPDF for it. A number of other issues are 'probably' resolved one way or another over the last 7 or 8 years. For example the PDF interpreter now does a lot more towards ignoring errors (and not triggering the rebuild case).

But, I feel bad that SaGS has done a lot of work on this for no reward.
Some of his patches have already been adopted and so for me the right way to deal with this is to close the bug and give SaGS the bounty for it. I will bring this up at the IRC meeting tonight.

Discussed on IRC and the consensus agrees with me. We would prefer to deliver this functionality using MuPDF in the future.

SaGS, we would like you to collect the bounty on this bug. I suspect you know how to proceed on that; please contact one of us if not, or follow up here. |