689534 – Poor performance reading PDF file

Bug 689534 - Poor performance reading PDF file

Summary: Poor performance reading PDF file

Status:	NOTIFIED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PDF Interpreter (show other bugs)
Version:	0.00
Hardware:	PC Linux

Importance:	P2 normal
Assignee:	leonardo

URL:
Keywords:

Duplicates (1):	688830 (view as bug list)
Depends on:	689668
Blocks:
	Show dependency tree

Reported:	2007-10-26 20:19 UTC by Marcos H. Woehrmann
Modified:	2008-12-19 08:31 UTC (History)
CC List:	3 users (show)

See Also:
Customer:	870
Word Size:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Marcos H. Woehrmann 2007-10-26 20:19:25 UTC

The customer reports and I've confirmed that Ghostscript takes a long time to
render the attached PDF file.

This bug file was originally party of Bug 688830.

The command line I used for testing:

  bin/gs -sDEVICE=tiff24nc -sOutputFile=test.tif -r300 ./688830_c.pdf

Comment 1 Marcos H. Woehrmann 2007-10-26 20:19:58 UTC

*** Bug 688830 has been marked as a duplicate of this bug. ***

Comment 2 Marcos H. Woehrmann 2007-10-26 20:29:32 UTC

Created attachment 3503 [details]
689534.zip

Comment 3 Alex Cherepanov 2007-10-27 07:29:52 UTC

The top CPU consumers are garbage collection, image_init_clues(), and 
c_pdf14trans_read().

image_init_clues() uses inefficient format - an array of structures that
fills the CPU cache quickly. Probably, independent arrays for every field
in Fortran style will work better.

c_pdf14trans_read() uses memcpy to access unaligned data. This is not needed
on most architectures. Direct access to unaligned data is acceptable on x86
and has lower performance penalty than calling a function. We cannot count on
compilers to replace memcpy() with optimal inlined code.

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 15.73     25.95    25.95    28462     0.00     0.00  gc_trace
  9.79     42.11    16.16   397780     0.00     0.00  image_init_clues
  5.48     51.16     9.05 199526748     0.00     0.00  igc_reloc_refs
  4.62     58.78     7.62 23765490     0.00     0.00  c_pdf14trans_read
  3.99     65.36     6.59    10380     0.00     0.02  interp
  3.72     71.50     6.14 96809083     0.00     0.00  refs_set_reloc
  3.66     77.54     6.05 24250500     0.00     0.00  gs_create_pdf14trans
  3.61     83.50     5.96 96180286     0.00     0.00  refs_compact
  3.26     88.88     5.38 239113766     0.00     0.00  ptr_struct_mark
  2.59     93.16     4.28 96830745     0.00     0.00  refs_clear_marks
  2.50     97.28     4.12 206814362     0.00     0.00  igc_reloc_ref_ptr_nocheck
  2.34    101.14     3.87 111402573     0.00     0.00  igc_reloc_ref_ptr
  2.13    104.65     3.51 96040028     0.00     0.00  check_l_mark
  1.67    107.41     2.76 415682956     0.00     0.00  change_enum_ptrs
  1.15    109.31     1.90   821446     0.00     0.00  gc_objects_compact
  1.10    111.12     1.81   821446     0.00     0.00  gc_objects_set_reloc
  1.04    112.84     1.72       49     0.04     0.87  clist_playback_band
  1.03    114.55     1.71 179848619     0.00     0.00  igc_reloc_struct_ptr
  1.02    116.23     1.68 103292721     0.00     0.00  change_reloc_ptrs
  0.99    117.86     1.64   824000     0.00     0.00  gc_objects_clear_marks
  0.94    119.41     1.55  8543205     0.00     0.00  dstack_find_name_by_index
  0.88    120.85     1.45 239344279     0.00     0.00  names_mark_index
  0.87    122.29     1.44   824000     0.00     0.00  gc_do_reloc
  0.86    123.71     1.42     4026     0.00     0.00  alloc_save__filter_changes
  0.78    125.00     1.29 103921363     0.00     0.00  change_clear_marks
  0.73    126.21     1.21 23765490     0.00     0.00  gs_imager_setmatrix
  0.69    127.35     1.14  1986482     0.00     0.00  names_ref
  0.65    128.42     1.07 60181733     0.00     0.00  sgets
  0.64    129.48     1.06  6613004     0.00     0.00  chunk_locate_ptr
  0.59    130.46     0.98  3196441     0.00     0.00  pdf14_mark_fill_rectangle
  0.59    131.44     0.98 419297888     0.00     0.00  gcst_get_memory_ptr
  0.53    132.32     0.88   213207     0.00     0.00  image_render_color
  0.53    133.19     0.88 38119554     0.00     0.00 
art_pdf_composite_pixel_alpha_8
  0.46    133.95     0.76 30804949     0.00     0.00  i_free_object
  0.45    134.69     0.74   239718     0.00     0.00  inflate_fast
  0.41    135.36     0.67 96201948     0.00     0.00  refs_do_reloc
  0.40    136.02     0.67 23765686     0.00     0.00  read_create_compositor
  0.40    136.68     0.66   477358     0.00     0.00  pdf14_pop_transparency_group
  0.39    137.32     0.64 23765490     0.00     0.00  sget_matrix

Comment 4 Ray Johnston 2008-02-27 21:59:51 UTC

With the default parameters shown this file takes 1035 seconds (138 seconds parse
time) -- that's on my 2GHz Intel Core Duo.

With the following command line, the 'parse' stage takes 35 seconds and the file
completes in 495 seconds.

gswin32c -dBufferSpace=40000000 -r300 -sDEVICE=tiff24nc -o nul -c 60000000
setvmthreshold -f 689534.pdf

Note that this file writes more than 300Mb to the clist temp files (388Mb for
the default settings, 308Mb with the increased BufferSpace).

Since the slow parse time can be (mostly) addressed by these parameters,
the slow rendering time is assigned to Igor as a clist issue.

Comment 5 leonardo 2008-02-28 01:40:27 UTC

Rev 8554 takes 21 minutes on Xeon 3.06GHz for 

f:\cygwin\bin\time ..\..\gs-hd\bin\gswin32c.exe -IF:/AFPL/gs-
hd/lib;f:\afpl\fonts -r144 -dNOPAUSE -dBATCH -sDEVICE=pcx24b -
sOutputFile=cur.pcx 689534.pdf

It's not a transparency problem, so I delay it until complete with tramsparency.

Comment 6 leonardo 2008-04-27 23:47:44 UTC

Folks, I don't see what can I improve here in the kernel level. The document is 
large, and it needs big resources to rasterize. Particularly, the default band 
size (13 pixels) is too small for it. It causes 764 bands, and most objects are 
too fraggled. When I set -dBufferSpace=40000000 -r300, the ducument completes 
in 20 minutes on the old Zeon 3.05. Since the page size id 53x33 inches, it is 
equivalent to 20 A4 pages, so we have 1 minuter per 1 A4 page - no so bad. I 
believe that Ray should write a proper documentation for end users about 
banding paramenets and memory settings for large pages, so I bumped the 
priority for bug 689668.

Besides that, I think a rewriting of the PDF interpreter in C may help. I see 
the document causes a lot of PS operations for each small graphic object, and 
it would be nice to decreese this expense. I recommend Alex to look whether he 
can optimize the PDF interpreter. Please build gs with 
XCFLAGS=/DDEBUG_TRACE_PS_OPERATORS and run this :

gswin32c.exe -Z!L -dBufferSpace=40000000  -r300 -dNOPAUSE -dNOOUTERSAVE -
dBATCH -dEPSFitPage -sDEVICE=tiffg4 -sOutputFile=cur.tif 689534.pdf 2>j:\o

It will generate a 2Gb trace in 3 hours, and one can see that we execute 200+ 
PS operators for each simple object like this :

3321.89 336.66 m
3322.89 336.66 l
S

Passing the bug to Ray because I believe that the documentation is the critical 
issue here.

If you folks have ideas what to improve in the kernel, I'm open to listen them.

Comment 7 leonardo 2008-04-27 23:55:16 UTC

Also if one devides the 300MB clist by 20 A4 pages, he gets a reasonable value 
15MB per page. Well it is big, but not dramatically. Maybe it could be some 
smaller with a better image compression when writing to clist, but I don't 
expect a rocket speed up here: 300 MB spends only few seconds to wtite to HDD.

Comment 8 Ray Johnston 2008-04-28 10:06:35 UTC

On my Intel Core 2 Duo processor, using the following parameters:

gswin32c -sDEVICE=tiff24nc -r300 -o nul -dBufferSpace=32000000 -Z: -q -c
32000000 setvmthreshold -f 689534.pdf

I see 42 seconds parse time, then 272 sec total time.

A problem exists in the clist logic since the memory at render time is supposed
to be fairly 'fixed' (only high-level paths cause temporary growth), but
watching the memory usage (Task Manager on Windwos, 'top' on linux) I see that
the total memory used is about 100K bytes at start of rendering, but GROWS 
gradually as the rendering progresses to > 580k bytes before the page is done.

This may be a memory leak in the clist/graphics library. I will collect more
information about the memory use during each band.

Comment 9 Ray Johnston 2008-04-28 11:18:53 UTC

Note in previous comment, all memory usages should be M bytes, not K bytes. I
transcribed the number of 1K blocks, thus 500K blocks == 500 Mbytes.

Comment 10 Ray Johnston 2008-04-30 12:27:57 UTC

The memory leak during clist rendering was traced to lost pdf14 'maskbuf's.

In 'pdf14_pop_transparency_mask', if the 'tos' which is to be placed in the
ctx->maskbuf has a non-NULL maskbuf, this value will never be used, and will
never be freed since the 'pdf14_pop_transparency_group' only uses the tos->maskbuf
and will free at most one level deeper. Since we never use it, freeing it when
popping the transparency mask seems more efficient than adding recursive free
when popping the transparency group.

The change which cured the leak for me was:

--- src/gdevp14.c       (revision 8677)
+++ src/gdevp14.c       (working copy)
@@ -1014,6 +1014,11 @@
 
     if_debug1('v', "[v]pdf14_pop_transparency_mask, idle=%d\n", tos->idle);
     ctx->stack = tos->saved;
+    if (tos->maskbuf) {
+       /* The maskbuf of the ctx->maskbuf entry is never used, free it now */
+       pdf14_buf_free(tos->maskbuf, ctx->memory);
+       tos->maskbuf = NULL;
+    }
     ctx->maskbuf = tos;
     return 0;
 }
===============================================================================

This allows the attached file to complete in 124Mb using the parameters in
comment #8 (versus >500M) and incidentally completed in 174 sec. vs. the 272
sec of the original code (a 56% performance improvement).

Passing back to Igor for review, since this is in an area he modified most
recently.

Comment 11 Ray Johnston 2008-05-06 22:54:02 UTC

Patch committed. Keeping open at Igor's request.

Comment 12 Ray Johnston 2008-07-16 23:07:45 UTC

I see no reason to keep this bug open. Further performance improvements are
always welcome, but not directly relevant to this bug.