Summary: | many pcl files from the customer test file directory have problems. | ||
---|---|---|---|
Product: | GhostPCL | Reporter: | Henry Stiles <henry.stiles> |
Component: | PCL interpreter | Assignee: | Henry Stiles <henry.stiles> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | ralph.giles |
Priority: | P4 | ||
Version: | master | ||
Hardware: | Macintosh | ||
OS: | MacOS X | ||
Customer: | Word Size: | --- |
Description
Henry Stiles
2009-03-19 19:45:59 UTC
I am not sure the 'no output' cases are related to the poor performance of 39002wts.pcl This file (and probably many others) are entering clist mode because the cluster regression has -dMaxBitmap=30000000 instead of -dMaxBitmap=300000000 (30 million instead of 300 million). When this happens, because there is a severe memory leak that I need to track down. It is somewhat worse with the "ialloc" memory manager than with the "chunk" memory manager because chunks never get freed with the ialloc manager and memory usage grows faster. Note that the PCL build defaults to BAND_LIST_STORAGE=memory. Using 'file', and clist mode this seems to be about the same as with BAND_LIST_STORAGE = memory, so this is probably not related to the 'memfile' logic. With page buffer mode, either memory manager completes this job in about 29 seconds (DEBUG build) on my laptop and used about 105Mb. I didn't let the ialloc case finish, but the chunk case took 3160 seconds and used over 896 Mb. I think that I can improve the chunk manager case a bit by skipping the search of chunks when the size requested is not available in any chunk (a largest_free value for the chain of chunks), but the memory leak when clist is used is more serious. I completed the chunk memory testing with BAND_LIST_STORAGE=file. It used 893Mb (similar) but completed in 2804 sec (10% faster). Big whoop! There are 1,135,832 objects leaked in this test!!! That's the most important thing to investigate. Both the RAM requirement growth and the performance are critical issues. For the performance issue with 390002wts.pcl, I will assume responsibility since this is a clist issue (doesn't occur in page buffer mode). Henry can look at the other issues. If Henry wants to open a separate bug for the non-performance issues, that seems reasonable to me. Capturing some of the IRC messages on this topic: On the performance issue, Henry discovered "maybe there is something screwy in my profile but I show a lot of time on the memset on line 228 of gximage1.c when running 39002wts.pcl" Looking at this code this is from Igor, rev 8803, (as if the mangled log and comment language isn't enough to tell). The salient part of the log message is: "Working on this patch we suspected memory leaks from image enumerators with some devices." The analysis of the leaks detected with the chunk allocator shows that most of the leaks (1,135,798) are all 'bytes' of size=768. What do you want to bet that this is the size of an image_enum ? More later... Ralph reminded me that Henry had recently fixed a leak of indexed color tables. Size == 768 is the tables. I had not done 'svn up' before testing. After updating to rev 9576, with the chunk memory manager, using the -dMaxBitmap at 30 million, and BAND_LIST_STORAGE=file the job completes in a reasonable time (2 min, but with a LOT of debug messages still in). With the BAND_LIST_STORAGE=memeory and the chunk allocator, a non-DEBUG build completes in 108 sec. and uses 40Mb max. With the ialloc, and BAND_LIST_STORAGE=memory, a non-DEBUG build takes 250 seconds just to complete 30 pages and has used over 350Mb. Something wrong here. More info... The ialloc memory and the chunk memory (with clist memfile both begin the clist rendering with 4,627K bytes allocated (within a few), but by the end of rendering the page (30 bands, band_width=5100, band_height=222) the chunk memory is net zero (same amount of memory used), BUT the ialloc memory has allocated 16,436K bytes (max used = 1,652Kb). I will continue investigating and will look into what's happening, but at this point using the ialloc memory for an embedded system is totally out of the question. Speaking of questions -- since this is happening during clist playback, WHY DON'T WE SEE THIS WITH ANY POSTSCRIPT FILES ??? (the garbage collector never runs during rendering a page). Henry, I am assigning this to me for the 39000wts.pcl problem. If you want to open a separate bug for the other issues, that's fine with me, or if you want to leave them all together, I'll pass it back to you after the memory issues are understood (particularly in light of my PS comment above). Adding Ralph since he is 'memory manager' ;-) But for now this only happens with the clist. Henry will continue to get updates since he is the reporter (and he gets updates on everything anyway, as do I). > Ralph reminded me that Henry had recently fixed a leak of indexed color tables. Correct, in 689822 I wrote: > This should be partially fixed, reassigning back to ray for followup on Comment > #11. I guess you didn't see it. I'd like to get it closed but I thought you should follow up on the comment. Thanks for the cc, Ray. For now I'm happy to follow along as you investigate. I've traced the 'ialloc' functions (base/gsalloc.c) and the problem appears to be systemic. This allocator is just not very good at recovering freed blocks and only performs 'annealing' (chunk_consolidate_free) lazily, and after the free space has been fragmented. The ialloc memory allocates 30 20,000 byte chunks PER BAND, while the chunk allocator allocates ZERO chunks. The pattern of allocations and frees, and the particular (lazy) allocation method used in the ialloc manager seem to be severely impacted by this file. There are 543 images in the first band, and the image allocations and the color space switching all add up to a sequence that confuses this allocator. Note that I enabled CONSOLIDATE_BEFORE_ADDING_CHUNK in base/gsalloc.c and encountered SEGFAULT problems. This code has at least one error that I fixed, but since this code may not have ever been tested, I didn't work on it any more. I will assign this over to Ralph to see if he wants to dig into the problems in 'ialloc'. I think that the problem is not caused by the clist or other code 'leaking' during the rendering. Sorry, Ralph. I am available to discuss this and can provide logs if desired. I will apply some of the tricks I've seen in tracing through the ialloc manager to see if I can speed up the chunk allocator (without causing the problems seen with this file). I'm looking at this also. Upon playback there are at least as many chunks allocated as there are images in the file. When playing back the band list an image enumerator is created for each high level image. The image enumerators are huge because of the color index lookups that are statically defined in each image enumerator. This results in ialloc giving each image enumerator being allocated in its own chunk. It is the same for pcl, pdf or postscript (the pcl file can be converted to pdf or postscript with the appropriate high level device). It is interesting the file works fine if high level images are turned off. There will always be large image enum allocations with or without high level images. In the former case, the large enumerator, that gets its own chunk, is allocated by the reader, in the latter it's the writer. But only reader allocations seem to foul up the ialloc allocator. With high level images the clist writer does not allocate the cache color index lookup table (fortunately). I guess the pattern of allocations could explain why the allocator fails when the allocations are done in the reader but that's weak and needs more analysis. Also, the postscript and pdf equivalents of these files don't have performance problems, I believe the free object method behaves differently when the maximum limit is reached in pdf and ps because of the presence of a gc, but I haven't thoroughly analyzed this. I have not been able to reproduce the performance problem using NOGC, but I discovered NOGC doesn't really disable everything. Following up on the meeting (and the last paragraph of comment #10). I am able to reproduce the problem in the postscript analog of the pcl problem by increasing gsalloc.c:FORCE_GC_LIMIT. /usr/bin/time ./gs -Z: -q -dNOPAUSE -r600 -dNOGC -sDEVICE=ppmraw -sOutputFile=/dev/null /Users/henrys/ghostpdl_old/gs/bin/foo.ps -c quit [:]width=5100, band_width=5100, band_height=222, nbands=30 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Start time = 0.098639, memory allocated = 8514620, used = 8406523 % Outputpage start time = 0.497928, memory allocated = 15982592, used = 15276340 [:]clist_end_page at cfile=620826, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 1.26203, memory allocated = 16002688, used = 15283484 % Outputpage start time = 1.63702, memory allocated = 31251968, used = 29304426 [:]clist_end_page at cfile=1543152, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 3.45691, memory allocated = 31251968, used = 29308026 % Outputpage start time = 3.59004, memory allocated = 35569024, used = 33348548 [:]clist_end_page at cfile=461034, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 4.28374, memory allocated = 35569024, used = 33352148 % Outputpage start time = 4.41724, memory allocated = 39524352, used = 36909798 [:]clist_end_page at cfile=459978, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 5.13661, memory allocated = 39524352, used = 36913398 % Outputpage start time = 5.25474, memory allocated = 42876800, used = 39952320 [:]clist_end_page at cfile=469686, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 5.98285, memory allocated = 42876800, used = 39955920 % Outputpage start time = 6.45363, memory allocated = 56659072, used = 52488562 [:]clist_end_page at cfile=1437019, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 8.24536, memory allocated = 56659072, used = 52492162 % Outputpage start time = 8.36988, memory allocated = 59689984, used = 55236424 [:]clist_end_page at cfile=432477, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 9.04811, memory allocated = 59689984, used = 55240024 % Outputpage start time = 9.25027, memory allocated = 64529536, used = 59711914 [:]clist_end_page at cfile=524303, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 10.0608, memory allocated = 64529536, used = 59715514 % Outputpage start time = 10.277, memory allocated = 69549952, used = 64415800 [:]clist_end_page at cfile=580423, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 11.1792, memory allocated = 69549952, used = 64419400 % Outputpage start time = 12.242, memory allocated = 91451008, used = 84529122 [:]clist_end_page at cfile=2056139, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 15.0061, memory allocated = 91451008, used = 84532722 % Outputpage start time = 15.2523, memory allocated = 95768064, used = 88543780 [:]clist_end_page at cfile=474928, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 16.038, memory allocated = 95768064, used = 88547380 % Outputpage start time = 16.2574, memory allocated = 99843968, used = 92237082 [:]clist_end_page at cfile=561631, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 % Outputpage end time = 17.1707, memory allocated = 99843968, used = 92240682 % Outputpage start time = 17.3719, memory allocated = 103337088, used = 95413896 [:]clist_end_page at cfile=483904, bfile=1488 [:]width=5100, band_width=5100, band_height=222, nbands=30 Aha, The degradation in performance when high level images is turned on it due to the banding machinery allocating the enum more frequently then the writer when high level images are off. There are 6616 image enum allocation without high level images and 125398 allocations when using high level images. It seemed unlikely this discrepancy was due to the pattern of allocations. While the clist performance is expected to be higher, the progressive degradation as we process more pages is due to the accumulation of memory and the amount of time it takes to do the scan of all chunks before allocating another one. Without this accumulation (when using the chunk memory alocator) the processing time is reasonable, as is the total memory used to process the file. We really want to fix ialloc and not maintain 2 chunk allocators, so I have ruled out using the chunk allocator in pcl. It isn't clear there is a problem with ialloc per se, it may be how pcl is setting up the allocator, as stated the postscript equivalent of the problem does not accumulate memory unless FORCE_GC_LIMIT is substantially increased. I had thought gs_consolidate_free() at end page time would clear this up, but it didn't work. I'll have to look more carefully at what the other languages do, to get rid of these accumulating chunks. the performance problem is fixed in ghospdl 9599, which is why it was assigned to Ralph, reassigning to Henry to look at the remaining pcl files. Thanks to Henry for identifying that the memory accumulation was due to 'lost' strings. The 'ialloc' manager only recovers the space when the string being freed is immediately adjacent to the 'ctop' of the chunk. The gs_nogc_reclaim has the side effect of replacing the alloc_string, free_string and consolidate_ free procs with "smarter" versions that are better at recovering the space of strings that are not immediately adjacent to the ctop. The chunk memory manager does not have the defect that the ialloc does, since strings are treated the same as other elements and are immediately added to the free list of the chunk, annealing adjacent free spaces as needed. The allocation pattern that caused this persists and with PostScript (or PDF) this type of file causes accumulation of space. Setting the vmthreshold to a large number exhibits the growth. The sequence of allocation/free that causes the problem is attributed to the gxclrast.c read_set_color_space when the color space is 'Indexed'. The palette table is allocated as a string, but if the previous color space was also Indexed, its table was not freed until AFTER the new table was allocated, so the the free_string was not adjacent to ctop, thus 'losing' the string free space with the ialloc manager until a GC happened. I changed the 'read_set_color_space' function to rc_decrement the old color space BEFORE allocating the new one and with the gs_nogc_reclaim code removed in pl/plmain.c, the memory growth is gone and the file completes to ppmraw 600 dpi in 67 sec. Since the current svn code completes in the same time, there is no performance disadvantage to running the nogc string functions with the 39002wts.pcl file. I will perform regression testing on the gxclrast.c change and commit it if it looks OK since it will help PS files which use the 'stock' ialloc string methods. Looks like this was finished up and never closed. |