Bug 702099

Summary: Unexepcted large memory use, and error on Windows
Product: Ghostscript Reporter: Ken Sharp <ken.sharp>
Component: Graphics LibraryAssignee: Ray Johnston <ray.johnston>
Status: NOTIFIED FIXED    
Severity: normal    
Priority: P2    
Version: master   
Hardware: PC   
OS: Windows 10   
Customer: 650 Word Size: ---

Description Ken Sharp 2020-02-06 10:22:51 UTC
Created attachment 18834 [details]
customer file

Using the attached customer file and the following command line:

-dNumRenderingThreads=4 -sDEVICE=fpng -r720 -sOutputFile=out.png -dBufferSpace=67108864 -dNOPAUSE -dNOPROMPT -dBATCH -dPrinted -dNOPSICC -dUseCropBox 20140408_tfnsw_learner_journey_a2.pdf

Ghostscript on Windows, either 32 or 64-bit, debug or release, throws an error and exits. The debug build gives more information so I'm quoting that:

------------------------------------------------------------------------------
D:\temp>\ghostpdl\debugbin\gswin64c -sBandListStorage=memory -dNumRenderingThreads=4 -sDEVICE=tiffsep -r720 -sOutputFile=out.tif -dBufferSpace=67108864 -dNOPAUSE -dNOPROMPT -dBATCH -dPrinted -dNOPSICC -dUseCropBox D:\ken\Downloads\20140408_tfnsw_learner_journey_a2.pdf
GPL Ghostscript GIT PRERELEASE 9.51 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Querying operating system for font files...
Substituting font Helvetica-Bold for Arial-BoldMT.
Loading NimbusSans-Bold font from %rom%Resource/Font/NimbusSans-Bold... 4520268 3037334 3860664 2347275 4 done.
Substituting font Helvetica for ArialMT.
Loading NimbusSans-Regular font from %rom%Resource/Font/NimbusSans-Regular... 4586380 3213439 3934408 2422462 4 done.
Substituting font Helvetica-BoldOblique for Arial-BoldItalicMT.
Loading NimbusSans-BoldItalic font from %rom%Resource/Font/NimbusSans-BoldItalic... 5200212 3854798 4432352 2901641 4 done.
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
[a+]gs_malloc(memfile newphys)(65536) = 0x0: exceeded limit, used=2147452530, max=2147452530
Clist I/O error: cbp past end of buffer
Clist I/O error: cbp past end of buffer
Clist I/O error: cbp past end of buffer
Clist I/O error: cbp past end of buffer

Thread 3 rendering thread ending memory state...
                                    memory dump done.

Thread 2 rendering thread ending memory state...
                                    memory dump done.

Thread 1 rendering thread ending memory state...
                                    memory dump done.

Thread 0 rendering thread ending memory state...
                                    memory dump done.
Error: /unknownerror in --showpage--
Operand stack:
   1   true
Execution stack:
   %interp_exit   .runexec2   --nostringval--   showpage   --nostringval--   2   %stopped_push   --nostringval--   showpage   showpage   false   1   %stopped_push   1990   1   3   %oparray_pop   1989   1   3   %oparray_pop   1977   1   3   %oparray_pop   1978   1   3   %oparray_pop   showpage   showpage   2   1   1   showpage   %for_pos_int_continue   1981   1   7   %oparray_pop   showpage   showpage   1840   0   9   %oparray_pop   showpage   showpage
Dictionary stack:
   --dict:746/1123(ro)(G)--   --dict:1/20(G)--   --dict:80/200(L)--   --dict:80/200(L)--   --dict:135/256(ro)(G)--   --dict:318/325(ro)(G)--   --dict:30/32(L)--   --dict:6/9(L)--   --dict:7/20(L)--
Current allocation mode is local
GPL Ghostscript GIT PRERELEASE 9.51: Unrecoverable error, exit code 1


------------------------------------------------------------------------------

Process Monitor shows that the GS executable is using 1.5GB of memory immediately before the failed allocation.

The customer says that running on Linux or Mac OS succeeds, but I don't have access to a machine running either OS with that much memory available. Reducing the resolution, or the number of threads, causes the file to run to completion using less memory (2 threads completes doe me peaking at just under 1.5 GB).

The file is comparatively simple, I see no transparency nor Pattern colour spaces. However two images do have Interpolation set to true. If I run the failing command line with the addition of -dNOINTERPOLATE then the file runs to completion peaking at ~500MB of memory usage (its also *much* faster).
Comment 1 Ken Sharp 2020-02-06 10:26:31 UTC
I've tentatively assigned this to Ray, because its multi-threading and clist. However the -dNOINTERPOLATE may mean its really Robin's.

I guess the questions are:

Why does this use so much memory ?
Why doesn't it exit on the first failed allocation ?
Why does 64-bit Ghostscript on Windows apparently have a limit of 2GB memory (my machine has 64GB physical memory available) ?

This isn't a crash, its a graceful exit, so I suspect simply explaining the situation might be sufficient.
Comment 2 Ken Sharp 2020-02-06 12:43:45 UTC
So following up it seems the problem is due to the fact that the customer's command line has -dBufferSpace=67108864 and has 4 rendering threads, and the fact that 64-bit Windows has a limit of 2GB on total memory allocation.

Chrisl points out that gdevprn.c line 254 has a loop where we try to allocate BufferSpace and run round decreasing it until we get a vlue which we can actually allocate. However, this takes no account of multiple rendering threads, so we can end up in the case where we can allocate BufferSpace for one thread, but not for as many threads as are spawned.

I'm guessing that if we were to improve this loop to acount for multiple threads, we would be able to reduce the BufferSpace to fit into the available memory, and this problem would not arise.

On a related note we shoudl not be limiting Windows 64-bit builds to 2GB memory, and I'll open a separate bug for that.
Comment 3 Ray Johnston 2020-02-06 16:26:25 UTC
OK, the use of BandListStorage=memory is the first issue. The size of the
clist is really not limited, depending on the complexity of the page, and
really has nothing to do with the BufferSpace size.

On most modern computers, using BandListStorage=file is no slower, and if the
memfile is working correctly, it is supposed to switch to compression if an
allocation fails -- I'll look into that. Obviously, compressing, then
expanding the clist during playback will be painfully slow, particularly with
multiple threads. The BandListStorage=memory is really intended for diskless
systems (embedded printers).

Next, if we do fail when writing the clist (for any reason) we should *NOT*
continue and try to render, much less with multiple threads. I will fix that!

A BufferSpace setting of 64m (67,108,864) is totally reasonable, and is
probably not at issue here. Generally larger buffer sizes results in more
lines per band which makes the clist (slightly) more efficient, particularly
if there are skewed images.

Lastly, the clist logic *tries* to defer interpolation to rendering so that
multiple threads can contribute, but there are many (too many to mention)
conditions that can cause the clist writing to fall back to processing the
image in 'default' mode, which when interpolating will result in 'copy_color'
primitives for all the pixels. This will slow things down (interpolating and
writing the image) and expand the clist size.

While -dNOINTERPOLATE is a 'sledgehammer' to avoid interpolation, we have a
control over how much interpolation is performed. Interpolation to 720dpi
is probably not visible therefore wasted time. The better control is to use
-dInterpolateControl=__ as described in
   https://www.ghostscript.com/doc/current/Use.htm#Rendering_parameters

For example, -dInterpolateControl=4 will interpolate images that have the
'/Interpolate true' in the image dictionary, but ONLY to 4x4 pixels blocks,
i.e. 180dpi when using -r720.

This speeds up processing, even if the interpolation is being performed in
multiple rendering threads.

I will look into the problems (trying to render after memfile creation error,
and why switching to compression was not invoked), but limiting interpolation
and using disk based clist are STRONGLY recommended in general.
Comment 4 Ray Johnston 2020-02-06 23:35:59 UTC
I get a somewhat different result than Ken did. Running 32-bit Windows, I
see the problem is that the large memfile is actually a pattern-clist. This
is 585,319,595 in total size (the original Image, object 46 in the PDF is
663 lines, 663 pixels per line, RGB). The image at 720dpi, at the scaling
of the PDF this turns into a 13945x13945 image (after interpolation). This
calculates to 583,389,075 bytes of RGB data. This data is written to the
'main' clist.

It is in the threads when we are reading from the 'memfile' data for the
pattern from the main thread for the pattern cache entry. Thus 4 threads
need 4*585,319,595 = 2,341,278,380 bytes, so we get allocation failures.

I'm looking at why the memfile doesn't trigger compression when the alloc
fails. If the memfile compression is used, it reduces the required storage
to < 32Mb.

Note that with -dInterpolateControl=4 it seems that the method, while faster,
does not reduce the clist filesize. I will discuss this internally to decide
if we want to investigate it another method.
Comment 5 Ray Johnston 2020-02-07 02:58:07 UTC
Apparently improvements in the image interpolation has improved things so that
the patch to use default image processing when an interpolated image is
being processed into a pattern-clist (see bug 694514).

With current HEAD (SHA e193b5d) this takes 23.3 seconds. With the patch removed
that restricts HL images to the pattern-clist, this file completes in 24.8
seconds.

With this file, the pattern-clist (without HL images) uses > 500Mb, and the
file (a single page) takes 192 sec. Enabling HL images reduces the pattern-
clist to 1.2Mb and the page completes in 36 sec.

With -dNumRenderingThreads=4 (and with HL images enabled for pattern-clist)
this file completes in < 15 sec.
Comment 6 Ray Johnston 2020-02-07 06:45:48 UTC
I'm not totally convinced from a git bisect run (since I have to remove the
disable of pattern-clist between bisect builds), but initially it showed up
as commit 7e342b59, but this might have been masking an earlier commit that
allowed the pattern-clist mode to efficiently support interpolation.
Comment 7 Ray Johnston 2020-02-12 00:47:26 UTC
I did more testing with the file from bug 694514 and the commit 8117f3e31
lets that file complete in under 3 seconds when HL interpolated images are
enabled for pattern_clist. The previous commit 0f3691890 takes almost 32
seconds.

Without the pattern_clist HL enabling of interpolated images, the 8117f3e31
takes over 4 seconds.

Looks like the patch for bug694514 is no longer needed.
Comment 8 Ray Johnston 2020-02-13 15:02:51 UTC
Fixed in commit 1f6b351a5048b5a14d52431c833da6ed3386a7bf