The customer reports and I've confirmed that Ghostscript takes a long time to render the attached PDF file. This bug file was originally party of Bug 688830. The command line I used for testing: bin/gs -sDEVICE=tiff24nc -sOutputFile=test.tif -r300 ./688830_c.pdf
*** Bug 688830 has been marked as a duplicate of this bug. ***
Created attachment 3503 [details] 689534.zip
The top CPU consumers are garbage collection, image_init_clues(), and c_pdf14trans_read(). image_init_clues() uses inefficient format - an array of structures that fills the CPU cache quickly. Probably, independent arrays for every field in Fortran style will work better. c_pdf14trans_read() uses memcpy to access unaligned data. This is not needed on most architectures. Direct access to unaligned data is acceptable on x86 and has lower performance penalty than calling a function. We cannot count on compilers to replace memcpy() with optimal inlined code. % cumulative self self total time seconds seconds calls s/call s/call name 15.73 25.95 25.95 28462 0.00 0.00 gc_trace 9.79 42.11 16.16 397780 0.00 0.00 image_init_clues 5.48 51.16 9.05 199526748 0.00 0.00 igc_reloc_refs 4.62 58.78 7.62 23765490 0.00 0.00 c_pdf14trans_read 3.99 65.36 6.59 10380 0.00 0.02 interp 3.72 71.50 6.14 96809083 0.00 0.00 refs_set_reloc 3.66 77.54 6.05 24250500 0.00 0.00 gs_create_pdf14trans 3.61 83.50 5.96 96180286 0.00 0.00 refs_compact 3.26 88.88 5.38 239113766 0.00 0.00 ptr_struct_mark 2.59 93.16 4.28 96830745 0.00 0.00 refs_clear_marks 2.50 97.28 4.12 206814362 0.00 0.00 igc_reloc_ref_ptr_nocheck 2.34 101.14 3.87 111402573 0.00 0.00 igc_reloc_ref_ptr 2.13 104.65 3.51 96040028 0.00 0.00 check_l_mark 1.67 107.41 2.76 415682956 0.00 0.00 change_enum_ptrs 1.15 109.31 1.90 821446 0.00 0.00 gc_objects_compact 1.10 111.12 1.81 821446 0.00 0.00 gc_objects_set_reloc 1.04 112.84 1.72 49 0.04 0.87 clist_playback_band 1.03 114.55 1.71 179848619 0.00 0.00 igc_reloc_struct_ptr 1.02 116.23 1.68 103292721 0.00 0.00 change_reloc_ptrs 0.99 117.86 1.64 824000 0.00 0.00 gc_objects_clear_marks 0.94 119.41 1.55 8543205 0.00 0.00 dstack_find_name_by_index 0.88 120.85 1.45 239344279 0.00 0.00 names_mark_index 0.87 122.29 1.44 824000 0.00 0.00 gc_do_reloc 0.86 123.71 1.42 4026 0.00 0.00 alloc_save__filter_changes 0.78 125.00 1.29 103921363 0.00 0.00 change_clear_marks 0.73 126.21 1.21 23765490 0.00 0.00 gs_imager_setmatrix 0.69 127.35 1.14 1986482 0.00 0.00 names_ref 0.65 128.42 1.07 60181733 0.00 0.00 sgets 0.64 129.48 1.06 6613004 0.00 0.00 chunk_locate_ptr 0.59 130.46 0.98 3196441 0.00 0.00 pdf14_mark_fill_rectangle 0.59 131.44 0.98 419297888 0.00 0.00 gcst_get_memory_ptr 0.53 132.32 0.88 213207 0.00 0.00 image_render_color 0.53 133.19 0.88 38119554 0.00 0.00 art_pdf_composite_pixel_alpha_8 0.46 133.95 0.76 30804949 0.00 0.00 i_free_object 0.45 134.69 0.74 239718 0.00 0.00 inflate_fast 0.41 135.36 0.67 96201948 0.00 0.00 refs_do_reloc 0.40 136.02 0.67 23765686 0.00 0.00 read_create_compositor 0.40 136.68 0.66 477358 0.00 0.00 pdf14_pop_transparency_group 0.39 137.32 0.64 23765490 0.00 0.00 sget_matrix
With the default parameters shown this file takes 1035 seconds (138 seconds parse time) -- that's on my 2GHz Intel Core Duo. With the following command line, the 'parse' stage takes 35 seconds and the file completes in 495 seconds. gswin32c -dBufferSpace=40000000 -r300 -sDEVICE=tiff24nc -o nul -c 60000000 setvmthreshold -f 689534.pdf Note that this file writes more than 300Mb to the clist temp files (388Mb for the default settings, 308Mb with the increased BufferSpace). Since the slow parse time can be (mostly) addressed by these parameters, the slow rendering time is assigned to Igor as a clist issue.
Rev 8554 takes 21 minutes on Xeon 3.06GHz for f:\cygwin\bin\time ..\..\gs-hd\bin\gswin32c.exe -IF:/AFPL/gs- hd/lib;f:\afpl\fonts -r144 -dNOPAUSE -dBATCH -sDEVICE=pcx24b - sOutputFile=cur.pcx 689534.pdf It's not a transparency problem, so I delay it until complete with tramsparency.
Folks, I don't see what can I improve here in the kernel level. The document is large, and it needs big resources to rasterize. Particularly, the default band size (13 pixels) is too small for it. It causes 764 bands, and most objects are too fraggled. When I set -dBufferSpace=40000000 -r300, the ducument completes in 20 minutes on the old Zeon 3.05. Since the page size id 53x33 inches, it is equivalent to 20 A4 pages, so we have 1 minuter per 1 A4 page - no so bad. I believe that Ray should write a proper documentation for end users about banding paramenets and memory settings for large pages, so I bumped the priority for bug 689668. Besides that, I think a rewriting of the PDF interpreter in C may help. I see the document causes a lot of PS operations for each small graphic object, and it would be nice to decreese this expense. I recommend Alex to look whether he can optimize the PDF interpreter. Please build gs with XCFLAGS=/DDEBUG_TRACE_PS_OPERATORS and run this : gswin32c.exe -Z!L -dBufferSpace=40000000 -r300 -dNOPAUSE -dNOOUTERSAVE - dBATCH -dEPSFitPage -sDEVICE=tiffg4 -sOutputFile=cur.tif 689534.pdf 2>j:\o It will generate a 2Gb trace in 3 hours, and one can see that we execute 200+ PS operators for each simple object like this : 3321.89 336.66 m 3322.89 336.66 l S Passing the bug to Ray because I believe that the documentation is the critical issue here. If you folks have ideas what to improve in the kernel, I'm open to listen them.
Also if one devides the 300MB clist by 20 A4 pages, he gets a reasonable value 15MB per page. Well it is big, but not dramatically. Maybe it could be some smaller with a better image compression when writing to clist, but I don't expect a rocket speed up here: 300 MB spends only few seconds to wtite to HDD.
On my Intel Core 2 Duo processor, using the following parameters: gswin32c -sDEVICE=tiff24nc -r300 -o nul -dBufferSpace=32000000 -Z: -q -c 32000000 setvmthreshold -f 689534.pdf I see 42 seconds parse time, then 272 sec total time. A problem exists in the clist logic since the memory at render time is supposed to be fairly 'fixed' (only high-level paths cause temporary growth), but watching the memory usage (Task Manager on Windwos, 'top' on linux) I see that the total memory used is about 100K bytes at start of rendering, but GROWS gradually as the rendering progresses to > 580k bytes before the page is done. This may be a memory leak in the clist/graphics library. I will collect more information about the memory use during each band.
Note in previous comment, all memory usages should be M bytes, not K bytes. I transcribed the number of 1K blocks, thus 500K blocks == 500 Mbytes.
The memory leak during clist rendering was traced to lost pdf14 'maskbuf's. In 'pdf14_pop_transparency_mask', if the 'tos' which is to be placed in the ctx->maskbuf has a non-NULL maskbuf, this value will never be used, and will never be freed since the 'pdf14_pop_transparency_group' only uses the tos->maskbuf and will free at most one level deeper. Since we never use it, freeing it when popping the transparency mask seems more efficient than adding recursive free when popping the transparency group. The change which cured the leak for me was: --- src/gdevp14.c (revision 8677) +++ src/gdevp14.c (working copy) @@ -1014,6 +1014,11 @@ if_debug1('v', "[v]pdf14_pop_transparency_mask, idle=%d\n", tos->idle); ctx->stack = tos->saved; + if (tos->maskbuf) { + /* The maskbuf of the ctx->maskbuf entry is never used, free it now */ + pdf14_buf_free(tos->maskbuf, ctx->memory); + tos->maskbuf = NULL; + } ctx->maskbuf = tos; return 0; } =============================================================================== This allows the attached file to complete in 124Mb using the parameters in comment #8 (versus >500M) and incidentally completed in 174 sec. vs. the 272 sec of the original code (a 56% performance improvement). Passing back to Igor for review, since this is in an area he modified most recently.
Patch committed. Keeping open at Igor's request.
I see no reason to keep this bug open. Further performance improvements are always welcome, but not directly relevant to this bug.