Bug 691946 - Conversion to PDF becomes slower and slower
Summary: Conversion to PDF becomes slower and slower
Status: NOTIFIED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.00
Hardware: All
OS: All
Importance: P2 enhancement
Assignee: Ken Sharp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-02-09 09:31 UTC by artifex
Modified: 2011-10-02 02:34 UTC

See Also:
Customer: 870
Word Size: ---


Attachments
list.pdf (2.05 MB, application/pdf)
2011-02-09 09:32 UTC, artifex

Description artifex 2011-02-09 09:31:04 UTC
For generating PDF/A, we convert the file list.pdf to PDF.
The contents of the pages seem very similar. The first pages are converted at a reasonable speed, but the time for processing a page grows more and more, and it seems there is an endless loop at the end of the document.

The conversion to another format (e.g. TIFFG4) is performed at a reasonable speed.

GS call:  gs -dNOPAUSE -dBATCH -o out.pdf -sDEVICE=pdfwrite list.pdf
Comment 1 artifex 2011-02-09 09:32:53 UTC
Created attachment 7223 [details]
list.pdf
Comment 2 Ken Sharp 2011-02-09 10:19:22 UTC
This is unlikely to be completely fixable. 

Unlike a bitmap format, PDF requires that certain objects (for example the pages tree) are not written to the final output until the file is complete. These objects must be maintained in memory until the file is complete, which leads to increased memory use, and in the case of some structures adding a new object gets slower as the number of objects increases (all previously defined objects of that type need to be searched).
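
To illustrate the first point, here is a minimal sketch in C (purely hypothetical, not pdfwrite's actual data structures) of why something like the pages tree can only grow in memory during the job and be written once the device is closed:

/* Hypothetical sketch: the /Pages tree references every page, so it can only
 * be emitted after the last page has been processed, and the list of page
 * references must be kept in memory for the whole job. */
#include <stdio.h>
#include <stdlib.h>

typedef struct pages_tree_s {
    long *page_object_ids;     /* one entry per page, held until close */
    size_t count, capacity;
} pages_tree_t;

static void record_page(pages_tree_t *tree, long object_id)
{
    if (tree->count == tree->capacity) {
        tree->capacity = tree->capacity ? tree->capacity * 2 : 64;
        /* error handling omitted for brevity */
        tree->page_object_ids = realloc(tree->page_object_ids,
                                        tree->capacity * sizeof(long));
    }
    tree->page_object_ids[tree->count++] = object_id;  /* memory use grows per page */
}

/* Only at end of job can the /Pages dictionary and its /Kids array be written. */
static void write_pages_tree(FILE *out, const pages_tree_t *tree, long tree_id)
{
    size_t i;

    fprintf(out, "%ld 0 obj\n<< /Type /Pages /Count %lu /Kids [",
            tree_id, (unsigned long)tree->count);
    for (i = 0; i < tree->count; i++)
        fprintf(out, " %ld 0 R", tree->page_object_ids[i]);
    fprintf(out, " ] >>\nendobj\n");
}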

This can be seen when using the -Z: switch, which shows the page output time. The time taken to output the page is quick and does not change; what does increase is the time taken to interpret each page.
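
For reference, the page timing can be seen by adding that switch to the command line from the report, along the lines of:

gs -Z: -dNOPAUSE -dBATCH -o out.pdf -sDEVICE=pdfwrite list.pdf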

The cost of checking for reuse of objects also increases as the number of objects and pages grows, particularly when checking whether a font is reused or not.

A number of performance issues in this area (particularly font reuse) were identified and fixed since the 9.0 release and will be in 9.01, but it is nevertheless true that as the number of pages increases, the time taken to add a new page does increase. For what it's worth, the same effect is also true of Adobe Acrobat Distiller.

The 'endless loop' at the end of the document is almost certainly caused by the garbage collector freeing all the memory which has been used in the course of creating the PDF file.

All that being said, I will look at this particular file to see where the time is being spent and whether some optimisation is possible. I suspect the main problem is that the PDF interpreter loads all the fonts anew at the beginning of each page; this was recently identified as an issue with non-PDF output too. Not reloading the fonts can give up to 3 times better performance on files with numerous pages, and the benefit to PDF creation is even greater. However, not reloading them can cause incorrect output, so more work is required in this area.
Comment 3 Ken Sharp 2011-02-09 16:19:28 UTC
The problem is re-use of resources. The original PDF file uses a form on every page, and also uses the same set of images on many of the pages. pdfwrite does not yet support preserving form XObjects, and is unable to detect reuse of images.

As a result, each time the form is used its contents are written out to the output PDF file, and each time an image is reused it is written to the destination file again.

It looks like the slowdown is mostly caused by the need to add these objects to the xref table, and to write out large quantities of image data (the final file is > 20Mb, 10 times the original).

There's already an enhancement request for preserving form XObjects; I'm puzzled as to why the image usage should show such a performance decline.

Profiling shows that the majority of the time is spent comparing dictionaries after finishing an image. I am not yet sure what is being compared or why; I'm continuing to investigate.
Comment 4 Ken Sharp 2011-02-10 09:15:50 UTC
The cause is an existing feature.

Every time an image is completed it is compared against all previously encountered images to see if it is the same as any of them. If it is then we reuse the existing resource, otherwise we create a new resource.

In this file, although some of the images *are* reused on subsequent pages, many are not. As the number of stored images rises, the time taken to check a new image against all previous images increases, leading to the drop in performance. With this checking disabled, the job proceeds at a more reasonable pace: on page 100 it takes 14 seconds per page, as opposed to 53 seconds per page with the checking in place. However, the output file is 50% bigger (30Mb as opposed to 20Mb).
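
In rough C terms the checking behaves like this (a sketch only, with invented names, not the real pdfwrite code); the linear scan is what makes page 100 so much slower than page 1, and skipping it is what trades speed for file size:

/* Hypothetical sketch: when detection is enabled, each newly completed image
 * is compared against every image stored so far, so the cost of handling the
 * Nth image grows with N; when it is disabled the image is simply written
 * again, which is faster but enlarges the output. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct image_record_s {
    const unsigned char *bits;   /* serialised image data */
    size_t length;
    long resource_id;            /* id of the resource already in the output */
} image_record_t;

/* Returns the resource id of an identical stored image, or -1 if the image
 * must be written to the output (again). */
static long find_reusable_image(const image_record_t *stored, size_t stored_count,
                                const unsigned char *bits, size_t length,
                                bool detect_duplicates)
{
    size_t i;

    if (!detect_duplicates)
        return -1;                        /* always emit: faster, larger file */

    for (i = 0; i < stored_count; i++) {  /* O(stored_count) per new image */
        if (stored[i].length == length &&
            memcmp(stored[i].bits, bits, length) == 0)
            return stored[i].resource_id; /* reuse the existing resource */
    }
    return -1;
}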

I think that extending the hashing already performed on the stream data of composite objects, so that it also covers the dictionary data, might give better performance without losing the size benefit of reusing images. I'll try coding that and see if it helps.
Comment 5 Ken Sharp 2011-02-18 17:37:44 UTC
I have now made a couple of changes which should help with this. Firstly, there is a new command line switch '-dDetectDuplicateImages', which defaults to true; if you set it to false then pdfwrite will not try to spot duplicate images, and will process this job faster.
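
For example, applied to the command line from the report, that would be along the lines of:

gs -dNOPAUSE -dBATCH -dDetectDuplicateImages=false -o out.pdf -sDEVICE=pdfwrite list.pdf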

Secondly, I have completely reworked the code for identifying identical composite objects (in real terms, things like fonts, images, colour spaces, functions, shadings etc). We now lazily generate an MD5 hash 'fingerprint' when these kinds of objects are compared: if no hash has yet been generated for an object, we generate one before the comparison, and then compare the hashes.

This makes equality testing much faster. It still takes time, and the more objects that must be compared the longer it takes, so it still gets slower as more objects are stored.
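
The idea, in sketch form (invented names, and assuming an MD5 implementation with the common init/append/finish interface; the real code is in the revisions noted below):

/* Hypothetical sketch of lazy fingerprinting: the digest for an object is
 * computed at most once, the first time the object takes part in a
 * comparison, and equality testing then becomes a 16-byte compare instead
 * of a full comparison of stream and dictionary data. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include "md5.h"    /* assumed MD5 interface: md5_init/md5_append/md5_finish */

typedef struct cos_object_s {
    const unsigned char *bytes;   /* serialised dictionary + stream data */
    size_t length;
    unsigned char digest[16];
    bool digest_valid;
} cos_object_t;

static void ensure_digest(cos_object_t *obj)
{
    if (!obj->digest_valid) {
        md5_state_t md5;

        md5_init(&md5);
        md5_append(&md5, obj->bytes, (int)obj->length);
        md5_finish(&md5, obj->digest);
        obj->digest_valid = true;    /* computed at most once per object */
    }
}

static bool objects_equal(cos_object_t *a, cos_object_t *b)
{
    ensure_digest(a);
    ensure_digest(b);
    /* the hashes are compared; a byte-for-byte check could confirm a match */
    return memcmp(a->digest, b->digest, 16) == 0;
}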

For comparison I ran the list.pdf file three ways:

9.01: 2642.45 seconds
With the new comparison code: 963.04 seconds
With -dDetectDuplicateImages=false: 790.71 seconds

The file size for the first two cases was 20,453,524 bytes; for the case where duplicates are not detected, the file size was 33,366,369 bytes.

To implement this there were 5 revisions, r12168 to r12172. With all these in place there should be a significant performance improvement, and there now exists the possibility of a trade-off between performance and output file size.


The final delay at EOJ is due to the garbage collector freeing all the memory used by pdfwrite to track the multitude of objects created during the course of the conversion, and there isn't really very much that can be done about that, I'm afraid, at least not in the short term.

I'm closing the bug as FIXED with the revisions noted above.