Bug 690346

Summary: many pcl files from the customer test file directory have problems.
Product: GhostPCL Reporter: Henry Stiles <henry.stiles>
Component: PCL interpreterAssignee: Henry Stiles <henry.stiles>
Status: RESOLVED FIXED    
Severity: normal CC: ralph.giles
Priority: P4    
Version: master   
Hardware: Macintosh   
OS: MacOS X   
Customer: Word Size: ---

Description Henry Stiles 2009-03-19 19:45:59 UTC
390002wts.pcl has a severe performance problems and these have no output:

ERROR running 52 of 7884 tests
 tests_private/customer_tests/LP085R16.ONE (pbmraw 600dpi)
no output
 tests_private/customer_tests/LP085R16.ONE (ppmraw 600dpi)
no output
 tests_private/customer_tests/LP085R16.ONE (bitcmyk 600dpi)
no output
 tests_private/customer_tests/LP085R16.TWO (ppmraw 600dpi)
no output
 tests_private/customer_tests/LP085R16.TWO (pbmraw 600dpi)
no output
 tests_private/customer_tests/LP085R16.ONE (bitrgb 600dpi)
no output
 tests_private/customer_tests/LP085R16.TWO (bitrgb 600dpi)
no output
 tests_private/customer_tests/LP085R16.TWO (bitcmyk 600dpi)
no output
 tests_private/customer_tests/dots.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/dots.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/dots.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/dots.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/font_selection.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/font_selection.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/font_selection.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/font_selection.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/large_char.pcl (ppmraw 600dpi)
no output
 tests_private/customer_tests/large_char.pcl (pbmraw 600dpi)
no output
 tests_private/customer_tests/large_char.pcl (bitcmyk 600dpi)
no output
 tests_private/customer_tests/large_char.pcl (bitrgb 600dpi)
no output
 tests_private/customer_tests/lt.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/lt.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/lt.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/lt0.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/lt.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/lt0.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/lt0.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/lt0.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/lt8.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/lt8.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/lt82.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/lt82.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/lt8.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/lt8.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/lt82.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/lt82.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/pvsres.dru (ppmraw 600dpi)
no output
 tests_private/customer_tests/pvsres.dru (pbmraw 600dpi)
no output
 tests_private/customer_tests/pvsres.dru (bitcmyk 600dpi)
no output
 tests_private/customer_tests/pvsres.dru (bitrgb 600dpi)
no output
 tests_private/customer_tests/prec.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/prec.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/prec.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/prec.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/sc.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/sc.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/sc.test (bitrgb 600dpi)
no output
 tests_private/customer_tests/sc.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/ul3.test (ppmraw 600dpi)
no output
 tests_private/customer_tests/ul3.test (pbmraw 600dpi)
no output
 tests_private/customer_tests/ul3.test (bitcmyk 600dpi)
no output
 tests_private/customer_tests/ul3.test (bitrgb 600dpi)
no output
Comment 1 Ray Johnston 2009-03-19 22:10:52 UTC
I am not sure the 'no output' cases are related to the poor performance of
39002wts.pcl

This file (and probably many others) are entering clist mode because the
cluster regression has -dMaxBitmap=30000000 instead of -dMaxBitmap=300000000
(30 million instead of 300 million).

When this happens, because there is a severe memory leak that I need to
track down. It is somewhat worse with the "ialloc" memory manager than with
the "chunk" memory manager because chunks never get freed with the ialloc
manager and memory usage grows faster.

Note that the PCL build defaults to BAND_LIST_STORAGE=memory. Using 'file',
and clist mode this seems to be about the same as with BAND_LIST_STORAGE
= memory, so this is probably not related to the 'memfile' logic.

With page buffer mode, either memory manager completes this job in about
29 seconds (DEBUG build) on my laptop and used about 105Mb. I didn't let the
ialloc case finish, but the chunk case took 3160 seconds and used over 896 Mb.

I think that I can improve the chunk manager case a bit by skipping the
search of chunks when the size requested is not available in any chunk
(a largest_free value for the chain of chunks), but the memory leak when
clist is used is more serious.
Comment 2 Ray Johnston 2009-03-19 23:04:13 UTC
I completed the chunk memory testing with BAND_LIST_STORAGE=file. It used
893Mb (similar) but completed in 2804 sec (10% faster). Big whoop!

There are 1,135,832 objects leaked in this test!!! That's the most important
thing to investigate. Both the RAM requirement growth and the performance are
critical issues.

For the performance issue with 390002wts.pcl, I will assume responsibility
since this is a clist issue (doesn't occur in page buffer mode).

Henry can look at the other issues.

If Henry wants to open a separate bug for the non-performance issues,
that seems reasonable to me.
Comment 3 Ray Johnston 2009-03-19 23:22:26 UTC
Capturing some of the IRC messages on this topic:

On the performance issue, Henry discovered "maybe there is something screwy in
my profile but I show a lot of time on the memset on line 228 of gximage1.c when
running 39002wts.pcl"

Looking at this code this is from Igor, rev 8803, (as if the mangled log and
comment language isn't enough to tell). The salient part of the log message is:
"Working on this patch we suspected memory leaks from image enumerators
 with some devices."

The analysis of the leaks detected with the chunk allocator shows that most
of the leaks (1,135,798) are all 'bytes' of size=768. What do you want to bet
that this is the size of an image_enum ?

More later...
Comment 4 Ray Johnston 2009-03-20 08:55:59 UTC
Ralph reminded me that Henry had recently fixed a leak of indexed color tables.
Size == 768 is the tables. I had not done 'svn up' before testing. After
updating to rev 9576, with the chunk memory manager, using the -dMaxBitmap
at 30 million, and BAND_LIST_STORAGE=file the job completes in a reasonable
time (2 min, but with a LOT of debug messages still in).

With the BAND_LIST_STORAGE=memeory and the chunk allocator, a non-DEBUG build
completes in 108 sec. and uses 40Mb max.

With the ialloc, and BAND_LIST_STORAGE=memory, a non-DEBUG build takes 250
seconds just to complete 30 pages and has used over 350Mb. Something wrong here.

Comment 5 Ray Johnston 2009-03-21 17:18:24 UTC
More info...

The ialloc memory and the chunk memory (with clist memfile both begin the
clist rendering with 4,627K bytes allocated (within a few), but by the end
of rendering the page (30 bands, band_width=5100, band_height=222) the chunk
memory is net zero (same amount of memory used), BUT the ialloc memory has
allocated  16,436K bytes (max used = 1,652Kb). I will continue investigating
and will look into what's happening, but at this point using the ialloc
memory for an embedded system is totally out of the question.

Speaking of questions -- since this is happening during clist playback, WHY
DON'T WE SEE THIS WITH ANY POSTSCRIPT FILES ??? (the garbage collector never
runs during rendering a page).

Henry, I am assigning this to me for the 39000wts.pcl problem. If you want to
open a separate bug for the other issues, that's fine with me, or if you want
to leave them all together, I'll pass it back to you after the memory issues
are understood (particularly in light of my PS comment above).
Comment 6 Ray Johnston 2009-03-21 17:20:13 UTC
Adding Ralph since he is 'memory manager' ;-)
But for now this only happens with the clist. Henry will continue to get
updates since he is the reporter (and he gets updates on everything anyway, as
do I).
Comment 7 Henry Stiles 2009-03-21 17:25:54 UTC
> Ralph reminded me that Henry had recently fixed a leak of indexed color tables.

Correct, in 689822 I wrote:

> This should be partially fixed, reassigning back to ray for followup on
Comment > #11.

I guess you didn't see it.  I'd like to get it closed but I thought you should
follow up on the comment.
Comment 8 Ralph Giles 2009-03-21 19:29:48 UTC
Thanks for the cc, Ray. For now I'm happy to follow along as you investigate.
Comment 9 Ray Johnston 2009-03-23 22:57:05 UTC
I've traced the 'ialloc' functions (base/gsalloc.c) and the problem appears to
be systemic. This allocator is just not very good at recovering freed blocks
and only performs 'annealing' (chunk_consolidate_free) lazily, and after the
free space has been fragmented.

The ialloc memory allocates 30 20,000 byte chunks PER BAND, while the chunk
allocator allocates ZERO chunks. The pattern of allocations and frees, and the
particular (lazy) allocation method used in the ialloc manager seem to be
severely impacted by this file. There are 543 images in the first band,
and the image allocations and the color space switching all add up to a
sequence that confuses this allocator.

Note that I enabled CONSOLIDATE_BEFORE_ADDING_CHUNK in base/gsalloc.c and 
encountered SEGFAULT problems. This code has at least one error that I fixed,
but since this code may not have ever been tested, I didn't work on it any
more.

I will assign this over to Ralph to see if he wants to dig into the problems
in 'ialloc'. I think that the problem is not caused by the clist or other
code 'leaking' during the rendering.

Sorry, Ralph.

I am available to discuss this and can provide logs if desired.

I will apply some of the tricks I've seen in tracing through the ialloc
manager to see if I can speed up the chunk allocator (without causing the
problems seen with this file).

Comment 10 Henry Stiles 2009-03-24 00:41:40 UTC
I'm looking at this also.  

Upon playback there are at least as many chunks allocated as there are images in
the file.  When playing back the band list an image enumerator is created for
each high level image.  The image enumerators are huge because of the color
index lookups that are statically defined in each image enumerator.  This
results in ialloc giving each image enumerator being allocated in its own chunk.
 It is the same for pcl, pdf or postscript (the pcl file can be converted to pdf
or postscript with the appropriate high level device).

It is interesting the file works fine if high level images are turned off. 
There will always be large image enum allocations with or without high level
images. In the former case, the large enumerator, that gets its own chunk, is
allocated by the reader, in the latter it's the writer.  But only reader
allocations seem to foul up the ialloc allocator.  With high level images the
clist writer does not allocate the cache color index lookup table (fortunately).
 I guess the pattern of allocations could explain why the allocator fails when
the allocations are done in the reader but that's weak and needs more analysis.

Also, the postscript and pdf equivalents of these files don't have performance
problems, I believe the free object method behaves differently when the maximum
limit is reached in pdf and ps because of the presence of a gc, but I haven't
thoroughly analyzed this.  I have not been able to reproduce the performance
problem using NOGC, but I discovered NOGC doesn't really disable everything.
Comment 11 Henry Stiles 2009-03-24 12:40:34 UTC
Following up on the meeting (and the last paragraph of comment #10). I am able
to reproduce the problem in the postscript analog of the pcl problem by
increasing gsalloc.c:FORCE_GC_LIMIT.

/usr/bin/time ./gs -Z: -q -dNOPAUSE -r600 -dNOGC -sDEVICE=ppmraw
-sOutputFile=/dev/null /Users/henrys/ghostpdl_old/gs/bin/foo.ps -c quit
[:]width=5100, band_width=5100, band_height=222, nbands=30
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Start time = 0.098639, memory allocated = 8514620, used = 8406523
% Outputpage start time = 0.497928, memory allocated = 15982592, used = 15276340
[:]clist_end_page at cfile=620826, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 1.26203, memory allocated = 16002688, used = 15283484
% Outputpage start time = 1.63702, memory allocated = 31251968, used = 29304426
[:]clist_end_page at cfile=1543152, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 3.45691, memory allocated = 31251968, used = 29308026
% Outputpage start time = 3.59004, memory allocated = 35569024, used = 33348548
[:]clist_end_page at cfile=461034, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 4.28374, memory allocated = 35569024, used = 33352148
% Outputpage start time = 4.41724, memory allocated = 39524352, used = 36909798
[:]clist_end_page at cfile=459978, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 5.13661, memory allocated = 39524352, used = 36913398
% Outputpage start time = 5.25474, memory allocated = 42876800, used = 39952320
[:]clist_end_page at cfile=469686, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 5.98285, memory allocated = 42876800, used = 39955920
% Outputpage start time = 6.45363, memory allocated = 56659072, used = 52488562
[:]clist_end_page at cfile=1437019, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 8.24536, memory allocated = 56659072, used = 52492162
% Outputpage start time = 8.36988, memory allocated = 59689984, used = 55236424
[:]clist_end_page at cfile=432477, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 9.04811, memory allocated = 59689984, used = 55240024
% Outputpage start time = 9.25027, memory allocated = 64529536, used = 59711914
[:]clist_end_page at cfile=524303, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 10.0608, memory allocated = 64529536, used = 59715514
% Outputpage start time = 10.277, memory allocated = 69549952, used = 64415800
[:]clist_end_page at cfile=580423, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 11.1792, memory allocated = 69549952, used = 64419400
% Outputpage start time = 12.242, memory allocated = 91451008, used = 84529122
[:]clist_end_page at cfile=2056139, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 15.0061, memory allocated = 91451008, used = 84532722
% Outputpage start time = 15.2523, memory allocated = 95768064, used = 88543780
[:]clist_end_page at cfile=474928, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 16.038, memory allocated = 95768064, used = 88547380
% Outputpage start time = 16.2574, memory allocated = 99843968, used = 92237082
[:]clist_end_page at cfile=561631, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
% Outputpage end time = 17.1707, memory allocated = 99843968, used = 92240682
% Outputpage start time = 17.3719, memory allocated = 103337088, used = 95413896
[:]clist_end_page at cfile=483904, bfile=1488
[:]width=5100, band_width=5100, band_height=222, nbands=30
Comment 12 Henry Stiles 2009-03-25 12:02:00 UTC
Aha, The degradation in performance when high level images is turned on it due
to the banding machinery allocating the enum more frequently then the writer
when high level images are off.  There are 6616 image enum allocation without
high level images and 125398 allocations when using high level images.  It
seemed unlikely this discrepancy was due to the pattern of allocations.
Comment 13 Ray Johnston 2009-03-25 12:13:09 UTC
While the clist performance is expected to be higher, the progressive
degradation as we process more pages is due to the accumulation of memory and
the amount of time it takes to do the scan of all chunks before allocating
another one.

Without this accumulation (when using the chunk memory alocator) the processing
time is reasonable, as is the total memory used to process the file.
Comment 14 Henry Stiles 2009-03-25 14:33:51 UTC
We really want to fix ialloc and not maintain 2 chunk allocators, so I have
ruled out using the chunk allocator in pcl.  It isn't clear there is a problem
with ialloc per se, it may be how pcl is setting up the allocator, as stated the
postscript equivalent of the problem does not accumulate memory unless
FORCE_GC_LIMIT is substantially increased.  I had thought gs_consolidate_free()
at end page time would clear this up, but it didn't work.  I'll have to look
more carefully at what the other languages do, to get rid of these accumulating
chunks.
Comment 15 Henry Stiles 2009-03-26 14:45:52 UTC
the performance problem is fixed in ghospdl 9599, which is why it was assigned
to Ralph, reassigning to Henry to look at the remaining pcl files.
Comment 16 Ray Johnston 2009-03-31 22:52:47 UTC
Thanks to Henry for identifying that the memory accumulation was due to 'lost'
strings. The 'ialloc' manager only recovers the space when the string being
freed is immediately adjacent to the 'ctop' of the chunk. The gs_nogc_reclaim
has the side effect of replacing the alloc_string, free_string and consolidate_
free procs with "smarter" versions that are better at recovering the space of
strings that are not immediately adjacent to the ctop.

The chunk memory manager does not have the defect that the ialloc does, since
strings are treated the same as other elements and are immediately added to
the free list of the chunk, annealing adjacent free spaces as needed.

The allocation pattern that caused this persists and with PostScript (or
PDF) this type of file causes accumulation of space. Setting the vmthreshold
to a large number exhibits the growth. The sequence of allocation/free that
causes the problem is attributed to the gxclrast.c read_set_color_space
when the color space is 'Indexed'. The palette table is allocated as a string,
but if the previous color space was also Indexed, its table was not freed
until AFTER the new table was allocated, so the the free_string was not
adjacent to ctop, thus 'losing' the string free space with the ialloc manager
until a GC happened.

I changed the 'read_set_color_space' function to rc_decrement the old color
space BEFORE allocating the new one and with the gs_nogc_reclaim code removed
in pl/plmain.c, the memory growth is gone and the file completes to ppmraw
600 dpi in 67 sec. Since the current svn code completes in the same time,
there is no performance disadvantage to running the nogc string functions
with the 39002wts.pcl file.

I will perform regression testing on the gxclrast.c change and commit it if
it looks OK since it will help PS files which use the 'stock' ialloc string
methods.
Comment 17 Henry Stiles 2010-09-10 16:02:18 UTC
Looks like this was finished up and never closed.