Summary: | Unexepcted large memory use, and error on Windows | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | Ken Sharp <ken.sharp> |
Component: | Graphics Library | Assignee: | Ray Johnston <ray.johnston> |
Status: | NOTIFIED FIXED | ||
Severity: | normal | ||
Priority: | P2 | ||
Version: | master | ||
Hardware: | PC | ||
OS: | Windows 10 | ||
Customer: | 650 | Word Size: | --- |
Description
Ken Sharp
2020-02-06 10:22:51 UTC
I've tentatively assigned this to Ray, because its multi-threading and clist. However the -dNOINTERPOLATE may mean its really Robin's. I guess the questions are: Why does this use so much memory ? Why doesn't it exit on the first failed allocation ? Why does 64-bit Ghostscript on Windows apparently have a limit of 2GB memory (my machine has 64GB physical memory available) ? This isn't a crash, its a graceful exit, so I suspect simply explaining the situation might be sufficient. So following up it seems the problem is due to the fact that the customer's command line has -dBufferSpace=67108864 and has 4 rendering threads, and the fact that 64-bit Windows has a limit of 2GB on total memory allocation. Chrisl points out that gdevprn.c line 254 has a loop where we try to allocate BufferSpace and run round decreasing it until we get a vlue which we can actually allocate. However, this takes no account of multiple rendering threads, so we can end up in the case where we can allocate BufferSpace for one thread, but not for as many threads as are spawned. I'm guessing that if we were to improve this loop to acount for multiple threads, we would be able to reduce the BufferSpace to fit into the available memory, and this problem would not arise. On a related note we shoudl not be limiting Windows 64-bit builds to 2GB memory, and I'll open a separate bug for that. OK, the use of BandListStorage=memory is the first issue. The size of the clist is really not limited, depending on the complexity of the page, and really has nothing to do with the BufferSpace size. On most modern computers, using BandListStorage=file is no slower, and if the memfile is working correctly, it is supposed to switch to compression if an allocation fails -- I'll look into that. Obviously, compressing, then expanding the clist during playback will be painfully slow, particularly with multiple threads. The BandListStorage=memory is really intended for diskless systems (embedded printers). Next, if we do fail when writing the clist (for any reason) we should *NOT* continue and try to render, much less with multiple threads. I will fix that! A BufferSpace setting of 64m (67,108,864) is totally reasonable, and is probably not at issue here. Generally larger buffer sizes results in more lines per band which makes the clist (slightly) more efficient, particularly if there are skewed images. Lastly, the clist logic *tries* to defer interpolation to rendering so that multiple threads can contribute, but there are many (too many to mention) conditions that can cause the clist writing to fall back to processing the image in 'default' mode, which when interpolating will result in 'copy_color' primitives for all the pixels. This will slow things down (interpolating and writing the image) and expand the clist size. While -dNOINTERPOLATE is a 'sledgehammer' to avoid interpolation, we have a control over how much interpolation is performed. Interpolation to 720dpi is probably not visible therefore wasted time. The better control is to use -dInterpolateControl=__ as described in https://www.ghostscript.com/doc/current/Use.htm#Rendering_parameters For example, -dInterpolateControl=4 will interpolate images that have the '/Interpolate true' in the image dictionary, but ONLY to 4x4 pixels blocks, i.e. 180dpi when using -r720. This speeds up processing, even if the interpolation is being performed in multiple rendering threads. I will look into the problems (trying to render after memfile creation error, and why switching to compression was not invoked), but limiting interpolation and using disk based clist are STRONGLY recommended in general. I get a somewhat different result than Ken did. Running 32-bit Windows, I see the problem is that the large memfile is actually a pattern-clist. This is 585,319,595 in total size (the original Image, object 46 in the PDF is 663 lines, 663 pixels per line, RGB). The image at 720dpi, at the scaling of the PDF this turns into a 13945x13945 image (after interpolation). This calculates to 583,389,075 bytes of RGB data. This data is written to the 'main' clist. It is in the threads when we are reading from the 'memfile' data for the pattern from the main thread for the pattern cache entry. Thus 4 threads need 4*585,319,595 = 2,341,278,380 bytes, so we get allocation failures. I'm looking at why the memfile doesn't trigger compression when the alloc fails. If the memfile compression is used, it reduces the required storage to < 32Mb. Note that with -dInterpolateControl=4 it seems that the method, while faster, does not reduce the clist filesize. I will discuss this internally to decide if we want to investigate it another method. Apparently improvements in the image interpolation has improved things so that the patch to use default image processing when an interpolated image is being processed into a pattern-clist (see bug 694514). With current HEAD (SHA e193b5d) this takes 23.3 seconds. With the patch removed that restricts HL images to the pattern-clist, this file completes in 24.8 seconds. With this file, the pattern-clist (without HL images) uses > 500Mb, and the file (a single page) takes 192 sec. Enabling HL images reduces the pattern- clist to 1.2Mb and the page completes in 36 sec. With -dNumRenderingThreads=4 (and with HL images enabled for pattern-clist) this file completes in < 15 sec. I'm not totally convinced from a git bisect run (since I have to remove the disable of pattern-clist between bisect builds), but initially it showed up as commit 7e342b59, but this might have been masking an earlier commit that allowed the pattern-clist mode to efficiently support interpolation. I did more testing with the file from bug 694514 and the commit 8117f3e31 lets that file complete in under 3 seconds when HL interpolated images are enabled for pattern_clist. The previous commit 0f3691890 takes almost 32 seconds. Without the pattern_clist HL enabling of interpolated images, the 8117f3e31 takes over 4 seconds. Looks like the patch for bug694514 is no longer needed. Fixed in commit 1f6b351a5048b5a14d52431c833da6ed3386a7bf |