693333 – pxlmono generates huge PCL files from some PDF files

Summary: pxlmono generates huge PCL files from some PDF files

Status:	RESOLVED WONTFIX

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	PXL Driver
Version:	9.05
Hardware:	PC Linux

Importance:	P4 normal
Assignee:	Default assignee

URL:
Keywords:	bountiable

Depends on:
Blocks:

Reported:	2012-09-14 10:10 UTC by Aimadati
Modified:	2023-05-23 15:40 UTC
CC List:	5 users

See Also:
Customer:
Word Size:	---

Attachments
Test files test1.pdf and test2.pdf (711.83 KB, application/x-gzip) 2012-09-14 10:10 UTC, Aimadati

Description Aimadati 2012-09-14 10:10:52 UTC

Created attachment 8935
Test files test1.pdf and test2.pdf

Some PDF files generate very large PCL files when converted with pxlmono (Ghostscript 9.05 on CentOS 6.2).
I have read the similar bug 692329 (http://bugs.ghostscript.com/show_bug.cgi?id=692329); in that case the problem seemed to be related to the colorspace and ICC profile of the image contained in the example PDF. In this case, the huge output size occurs with different types of images in the PDF files. I include two example files (extracted from larger ones, where the problem is more severe).

1) The attached test1.pdf file generates a 218.8MB PCL file with the following command:
gs -dNOPAUSE -dBATCH -dPARANOIDSAFER -dNOINTERPOLATE -sDEVICE=pxlmono -sOutputFile=test1.pcl -f test1.pdf

(The same file generates just 917.2KB with -sDEVICE=ljet4)

The test1.pdf contains two images with the following info:
<</BitsPerComponent 1/ColorSpace/DeviceGray/Decode[0.0 1.0]/Filter/JBIG2Decode/Height 6800/Length 396491/Name/X/Subtype/Image/Type/XObject/Width 4672>>
<</BitsPerComponent 1/ColorSpace/DeviceGray/Decode[0.0 1.0]/Filter/JBIG2Decode/Height 6800/Length 1185/Name/X/Subtype/Image/Type/XObject/Width 4672>>

2) The attached test2.pdf file generates a 14.4MB PCL with the following command:
gs -dNOPAUSE -dBATCH -dPARANOIDSAFER -dNOINTERPOLATE -sDEVICE=pxlmono -sOutputFile=test2.pcl -f test2.pdf

(The same file generates just 394.2KB with -sDEVICE=ljet4)

The test2.pdf contains three images with the following info:
<</Type/XObject/Subtype/Image/Width 674/Height 674/ColorSpace/DeviceRGB/BitsPerComponent 8/Filter/DCTDecode/Interpolate true/Length 17598>>
<</Type/XObject/Subtype/Image/Width 1856/Height 406/ColorSpace/DeviceGray/Matte[0 0 0]/BitsPerComponent 8/Interpolate false/Filter/FlateDecode/Length 90884>>
<</Type/XObject/Subtype/Image/Width 1856/Height 406/ColorSpace/DeviceRGB/BitsPerComponent 8/Interpolate false/SMask 4 0 R/Filter/FlateDecode/Length 170906>>

Comment 1 Hin-Tak Leung 2012-09-19 17:19:37 UTC

FWIW, neither size problem is ICC-related. Using r11305 (just before the ICC merge), test1 results in 229.4M and test2 in 83.4M. Current-ish head (plus possibly some uncommitted changes I have) gives 229.4M and 1.8M respectively;
and the 2nd file can be improved further, down to 1.4M, by using -dCompressMode=3.

So my preliminary analysis would say test1 has always been poor; but test2 is either already fixed, or a fix is available among the "some uncommitted changes".

Somebody else please feel free to test unmodified head.

AFAICT, test1 is large because JBIG2 is a lot better than other compression schemes for certain types of images; adding -dCompressMode=3 helps inputs with DCTDecode (i.e. JPEGs) and possibly other "smooth images".

Adding -r72 gives 8M and 1.38M respectively, so the issue is somewhat resolution-related. I suspect ljet4 does not have the same default resolution.

Comment 2 James Cloos 2012-09-20 00:35:38 UTC

Both seem to default to -r600.  Adding -r600 to ljet4 doesn’t change the output and the pxlmono output says:

  @PJL SET RESOLUTION=600

Interestingly, the second page of test1.pdf, which has a mostly white background image, causes most of the pain in the pxlmono output, even though ljet4 outputs a smaller file for page 2:

   908 -rw-r--r-- 1 cloos cloos    927057 Sep 19 18:50 test1-1.pcl
   928 -rw-r--r-- 1 cloos cloos    946822 Sep 19 18:49 test1-1.pxl
    12 -rw-r--r-- 1 cloos cloos     12117 Sep 19 18:50 test1-2.pcl
223108 -rw-r--r-- 1 cloos cloos 228456262 Sep 19 18:49 test1-2.pxl

For page 1, the pxl is only slightly larger, but it falls apart for page 2.

Using -dCompressMode=3 helps page 1 a bit, but doesn’t much change page 2.

Having just remembered pxldis.py: the reason the page 2 pxl is so large is that instead of outputting an image, it outputs rectangles:

uint16_box 1173 48 1174 50 BoundingBox
// Op Pos: 861  Op fOff: 10473  Op Hex: A0  Level: 0
Rectangle

pxldis.py is still going (cue rabbit), but it looks like one Rectangle per pixel of the pdf’s 400-dpi image.


The difference comes down to gdev_vector_fill_parallelogram() vs what ljet4 gets from its NULL fill_rectangle proc, which itself gets called due to ljet4’s NULL fill_parallelogram proc.

It looks like something clips the image, resulting in the one-rect-per-pixel behaviour.

(Removing the “the page intentionally left blank” annot didn’t change the output size.)

Comment 3 James Cloos 2012-09-20 18:01:34 UTC

Another note:

For test1 (the bitmap images), the difference shows up when pclxl_begin_image() is called.  After the matrix invert and multiply, for page 1 (which outputs an image), mat.xy == mat.yx == 0.0.  For page 2 (which outputs the pixels as rectangles), mat.xy = -0.00313569396 and mat.yx = 0.00313568092.

This comes from the matrix multiply with ctm_only(pis):

(page1) p ctm_only(pis)
$7 = {xx = 7008, xy = 0, yx = 0, yy = -10200, tx = 0, ty = 10200}

(page2) p ctm_only(pis)
{xx = 7008, xy = -14.6499624, yx = -21.3226318, yy = -10200, tx = 10.6613159, ty = 10207.3252}

I think that is due to the contents of objects 25 vs 2 in test1.pdf:

25 0 obj
stream
q 840.9600067 0 0 1224 0 0 cm /Im0 Do Q endstream
endobj

2 0 obj
stream
q 840.9600067 1.7579956 -2.5587158 1224 1.2793579 -0.8789978 cm /Im0 Do Q endstream
endobj

The second image isn’t quite aligned to the pixel grid.
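The gdb values above can be reproduced by concatenating each `cm` matrix with the device's default matrix. The following is a minimal sketch, not Ghostscript code: the `affine_t` type and helper names are hypothetical, and the default matrix (600 dpi, flipped y axis, 10200-pixel page height) is an assumption inferred from the page 1 CTM.

```c
/* Affine matrix using Ghostscript's field names (hypothetical stand-in type). */
typedef struct {
    double xx, xy, yx, yy, tx, ty;
} affine_t;

/* Return m1 followed by m2 (row-vector convention: p' = p * m1 * m2). */
static affine_t affine_concat(affine_t m1, affine_t m2)
{
    affine_t r;
    r.xx = m1.xx * m2.xx + m1.xy * m2.yx;
    r.xy = m1.xx * m2.xy + m1.xy * m2.yy;
    r.yx = m1.yx * m2.xx + m1.yy * m2.yx;
    r.yy = m1.yx * m2.xy + m1.yy * m2.yy;
    r.tx = m1.tx * m2.xx + m1.ty * m2.yx + m2.tx;
    r.ty = m1.tx * m2.xy + m1.ty * m2.yy + m2.ty;
    return r;
}

/* Assumed default device matrix: 600 dpi, y flipped, page 10200 pixels high. */
static affine_t default_ctm_600dpi(void)
{
    affine_t d = { 600.0 / 72.0, 0.0, 0.0, -600.0 / 72.0, 0.0, 10200.0 };
    return d;
}

/* The cm operands of object 2 (page 2) in test1.pdf, as quoted above. */
static affine_t page2_cm(void)
{
    affine_t m = { 840.9600067, 1.7579956, -2.5587158, 1224.0,
                   1.2793579, -0.8789978 };
    return m;
}

/* An image can be emitted upright only if both off-diagonal terms are zero. */
static int is_axis_aligned(affine_t m)
{
    return m.xy == 0.0 && m.yx == 0.0;
}
```

Calling `affine_concat(page2_cm(), default_ctm_600dpi())` gives off-diagonal terms close to the -14.65 and -21.32 shown by gdb, so `is_axis_aligned()` fails for page 2.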


There are a couple of options to fix this:

gs could ignore tiny rotations or skews and output a simple upright image.

it could render such images to an upright rectangle and output that as an image.

and there was another which I thought of while my tea was brewing, but which I’ve already lost. ☹

In this particular case, not outputting white pixels also would reduce the file size enormously.


Do you have a preference on a style of fix for this?

Comment 4 Hin-Tak Leung 2012-09-20 19:21:23 UTC

(In reply to comment #3)
...  For page 2 (which outputs the pixels as
> rectangles), mat.xy = -0.00313569396 and mat.yx = 0.00313568092.

That would be about 1 in 300, which is probably noticeable for a whole page - 11 inches at 600 dpi, worst case scenario is about 22 pixels off, 1/30 of an inch, or 1mm thereabouts.
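The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in this comment; the function names are hypothetical, and the skew coefficient is treated as drift per unit of travel, as the comment does.

```c
/* Drift, in device pixels, accumulated over length_in inches at dpi
 * for a given skew coefficient (drift per unit travelled). */
static double skew_drift_pixels(double skew, double length_in, double dpi)
{
    double s = skew < 0.0 ? -skew : skew;   /* only the magnitude matters */
    return s * length_in * dpi;
}

/* Convert a distance in device pixels to millimetres. */
static double pixels_to_mm(double pixels, double dpi)
{
    return pixels / dpi * 25.4;
}
```

With a skew of 1/300 over 11 inches at 600 dpi this gives 22 pixels, a little under 1mm; using the measured 0.00313569396 instead gives slightly under 21 pixels.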

> In this particular case, not outputting white pixels also would reduce the file
> size enormously.

That would probably be wrong, or at least would need to be done properly: images can be part of a ROP3 group, i.e. depending on what the current ROP3 state is, the white part can be used to indicate (1) transparent, i.e. showing the color "below", (2) painted with the current paint/brush color/pattern, (3) something else... so not outputting them would be wrong.
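To illustrate why a white source pixel is not always a no-op, here is a minimal sketch of evaluating one bit of a ROP3 code. It assumes the common convention that the 8-bit code is a truth table indexed by (pattern, source, destination) and that a set bit means white; `rop3_bit` is a hypothetical helper, not part of the driver.

```c
/* Evaluate one bit of a ROP3 code, assuming the 8-bit code is a truth
 * table indexed by (pattern << 2) | (source << 1) | dest. */
static int rop3_bit(unsigned rop, int pattern, int source, int dest)
{
    unsigned index = ((unsigned)(pattern & 1) << 2) |
                     ((unsigned)(source & 1) << 1) |
                     (unsigned)(dest & 1);
    return (int)((rop >> index) & 1u);
}
```

Under that assumption, with code 0xCC (copy source) a white source pixel over a black destination yields white, so omitting it changes the page. With 0x88 (source AND destination) the same pixel leaves the destination unchanged, so white behaves as transparent. Whether a white pixel can be dropped therefore depends on the raster operation in force.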

Comment 5 Hin-Tak Leung 2012-09-23 20:23:54 UTC

(In reply to comment #4)
> (In reply to comment #3)
> ...  For page 2 (which outputs the pixels as
> > rectangles), mat.xy = -0.00313569396 and mat.yx = 0.00313568092.
> 
> That would be about 1 in 300, which is probably noticeable for a whole page -
> 11 inches at 600 dpi, worst case scenario is about 22 pixels off, 1/30 of an
> inch, or 1mm thereabouts.

There is one more reason to be careful about grid-fitting approximately aligned images: tiling and band modes. The user content may be tiled, so misalignments could be noticeable (e.g. 1-pixel-wide gaps), and Ghostscript's graphics core also renders large images and sends them to the PXL driver in bands.

Comment 6 Henry Stiles 2012-09-25 19:30:03 UTC

(In reply to comment #3)
> Another note:
> 
> For test1 (the bitmap images), the difference shows up when pclxl_begin_image()
> is called.  After the matrix invert and multiply, for page 1 (which outputs an
> image), mat.xy == mat.yx == 0.0.  For page 2 (which outputs the pixels as
> rectangles), mat.xy = -0.00313569396 and mat.yx = 0.00313568092.
> 

Interesting, page 2 is mostly white so the graphics library should have aggregated the pixels into larger rectangles of a single color.  Are we getting adjacent rectangles of the same color or maybe the color changes imperceptibly and I don't see it in the output?  I haven't looked at the XL code generated.

Comment 7 James Cloos 2012-09-25 20:08:10 UTC

I’m a bit rushed, so just a summary.

The pxl contains one rectangle blob for each pixel in the pdf’s image.

The code made me think that turning NOINTERPOLATE off might avoid that, but it doesn’t.

Outputting larger rectangles for monochromatic sections is the third option I had thought of (and then lost).

Even just one image (rectangle iff monochromatic) per row would be significantly better.

To account for the rotation/skew, ideal is probably one image (r iff m) for the rows which start at column X, another for those which start at column X+1, etc.

Alternatively, even though it is a vector output device, switching to raster output style just for the image — and back to vector to set the text atop the image — seems like the most efficient output.

Based on comment #6 I take it that gs already ought to be generating larger rectangles.  I don’t have time right now to look at the code again, but I don’t remember whether it tried to do that.  I left a run in gdb, it is currently at:

#0  pclxl_begin_image (dev=0x1c882c0, pis=0x19215b0, pim=0x7fffffff7e70, format=gs_image_format_chunky, prect=0x0, pdcolor=0x1b19bd8, pcpath=0x1b19868, mem=0x1901ae0, pinfo=0x7fffffff7a10)
    at ../gs/base/gdevpx.c:1794
#1  0x00000000008cdf00 in gx_default_begin_typed_image (dev=0x1c882c0, pis=0x19215b0, pmat=0x0, pic=0x7fffffff7e70, prect=0x0, pdcolor=0x1b19bd8, pcpath=0x1b19868, memory=0x1901ae0, pinfo=0x7fffffff7a10)
    at ../gs/base/gdevddrw.c:1023
#2  0x0000000000844787 in gs_image_begin_typed (pic=0x7fffffff7e70, pgs=0x19215b0, uses_color=0, ppie=0x7fffffff7a10) at ../gs/base/gsimage.c:244
#3  0x000000000052669b in zimage_setup (i_ctx_p=0x19268c0, pim=0x7fffffff7e70, sources=0x7fffffff7a58, uses_color=0, npop=1) at ../gs/psi/zimage.c:179
#4  0x00000000005268fa in image1_setup (i_ctx_p=0x19268c0, has_alpha=0) at ../gs/psi/zimage.c:242
#5  0x0000000000526919 in zimage1 (i_ctx_p=0x19268c0) at ../gs/psi/zimage.c:253
#6  0x00000000004db035 in do_call_operator (op_proc=0x5268fc <zimage1>, i_ctx_p=0x19268c0) at ../gs/psi/interp.c:86
#7  0x00000000004dd39b in interp (pi_ctx_p=0x1901630, pref=0x7fffffff8b10, perror_object=0x1901618) at ../gs/psi/interp.c:1174
#8  0x00000000004db808 in gs_call_interp (pi_ctx_p=0x1901630, pref=0x7fffffff8b10, user_errors=0, pexit_code=0x7fffffff8dac, perror_object=0x1901618) at ../gs/psi/interp.c:501
#9  0x00000000004db635 in gs_interpret (pi_ctx_p=0x1901630, pref=0x7fffffff8b10, user_errors=0, pexit_code=0x7fffffff8dac, perror_object=0x1901618) at ../gs/psi/interp.c:459
#10 0x00000000004cf044 in gs_main_interpret (minst=0x1901598, pref=0x7fffffff8b10, user_errors=0, pexit_code=0x7fffffff8dac, perror_object=0x1901618) at ../gs/psi/imain.c:235
#11 0x00000000004cfd0f in gs_main_run_string_continue (minst=0x1901598, str=0x7fffffff8b90 "<2F746D702F67735F4F4641626778> run\n", length=35, user_errors=0, pexit_code=0x7fffffff8dac, perror_object=0x1901618)
    at ../gs/psi/imain.c:598
#12 0x00000000004d472f in gsapi_run_string_continue (lib=0x18dd1c0, str=0x7fffffff8b90 "<2F746D702F67735F4F4641626778> run\n", length=35, user_errors=0, pexit_code=0x7fffffff8dac) at ../gs/psi/iapi.c:210
#13 0x000000000040530e in ps_impl_dnit_job (instance=0x19013b8) at ../psi/psitop.c:541
#14 0x00000000008f5653 in pl_dnit_job (instance=0x19013b8) at ../pl/pltop.c:204
#15 0x0000000000956abf in close_job (universe=0x7fffffff9100, pti=0x7fffffffa250) at ../pl/plmain.c:212
#16 0x0000000000957086 in pl_main_aux (argc=7, argv=0x7fffffffa488, disp=0x0) at ../pl/plmain.c:374
#17 0x000000000095784f in pl_main (argc=7, argv=0x7fffffffa488) at ../pl/plmain.c:516
#18 0x0000000000956a00 in main (argc=7, argv=0x7fffffffa488) at ../pl/realmain.c:19

Comment 8 Hin-Tak Leung 2012-09-25 20:38:34 UTC

(In reply to comment #7)
> I’m a bit rushed, so just a summary.
> 
> The pxl contains one rectangle blob for each pixel in the pdf’s image.
> 
> The code made me think that turning NOINTERPOLATE off might avoid that, but it
> doesn’t.
> 
> Outputting larger rectangles for monochromatic sections is the third option I
> had thought of (and then lost).
<snipped>

Yes and no. The graphic core sends image bands or strips (explained in comment 5) to the driver, along with their attributes, like the interpolate flag you mentioned. If all the combinations of such are compatible with PCL's imaging model, the band is rendered as a PXL image. Otherwise the driver bounces it back to the graphic core and it gets re-sent as individual pixels. In a nutshell.

The conditions under which a core image band can be rendered as a PCL image are a combination of things: the interpolate flag, etc. One possibility I just thought of is odd color-depths - PCL only does 1, 8, 24, so any odd color-depths, plus alpha channels, are always rendered as individual pixels.

And you still need to address the fact that an image can be part of an ROP3 group, and white, etc. does not always mean "white".

Comment 9 Henry Stiles 2012-09-25 20:57:26 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > I’m a bit rushed, so just a summary.
> > 
> > The pxl contains one rectangle blob for each pixel in the pdf’s image.
> > 
> > The code made me think that turning NOINTERPOLATE off might avoid that, but it
> > doesn’t.
> > 
> > Outputting larger rectangles for monochromatic sections is the third option I
> > had thought of (and then lost).
> <snipped>
> 
> Yes and no. The graphic core sends image bands or strips (explained in comment
> 5) to the driver, and its attributes, like the interpolate flag you
> mentioned. If all the combinations of such are compatible with PCL's imaging
> model, it is rendered as a PXL image. Otherwise the driver bounces it back to
> the graphic core and it gets re-sent as individual pixels. In a nutshell.
> 
> The conditions that a core image band can be rendered as PCL image, are a
> combinations of things: the interpolate flag, etc. One possibility I just think
> of is odd color-depths - PCL only does 1, 8, 24, so any odd color-depths, plus
> alpha channels, are always rendered as individual pixels.
>

Once the image is punted back to the library the library should not be creating one rectangle per pixel, unless the color changes every pixel.  It should be creating a rectangle the size of the contiguous like color pixels.  Is the color changing every pixel?
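The coalescing described here can be sketched as a run-length pass over one row of pixels. This is an illustrative sketch only, not the library's actual code path; `run_t` and `coalesce_row` are hypothetical names.

```c
#include <stddef.h>

/* One horizontal run of identically coloured pixels: columns [x0, x1). */
typedef struct {
    size_t x0, x1;
    unsigned char color;
} run_t;

/* Coalesce a row of width pixels into runs of equal colour.
 * Writes at most max_runs entries and returns the number written. */
static size_t coalesce_row(const unsigned char *row, size_t width,
                           run_t *runs, size_t max_runs)
{
    size_t n = 0, start = 0, x;

    if (width == 0)
        return 0;
    for (x = 1; x <= width; x++) {
        /* Close the current run at the end of the row or on a colour change. */
        if (x == width || row[x] != row[start]) {
            if (n == max_runs)
                return n;           /* output buffer full; stop early */
            runs[n].x0 = start;
            runs[n].x1 = x;
            runs[n].color = row[start];
            n++;
            start = x;
        }
    }
    return n;
}
```

An all-white row collapses to a single run, so a page where most rows are white would need only a few rectangles per row rather than one per pixel.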

Comment 10 James Cloos 2012-09-26 07:04:42 UTC

It doesn’t change that often.

There are 24644 black and 31744956 white pixels in the (400 dpi) image.

Also, it isn’t quite one rect per source pixel.  There are just under three rects for each five source pixels.  (My earlier comment was a guess; the disassembly took nearly an hour and was still going when I posted.)

There are stray set pixels in the image which don’t show up when viewing the image at 100 dpi, but 6100 of the 6800 rows are all white.

There are 1626 SetBrushSource commands in the disassembly, alternating between 0 and 255.
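As a rough cross-check of these figures, the sketch below estimates the rectangle count from the file size. The 12 bytes per Rectangle is an assumption (one data-type tag, an 8-byte uint16_box, an attribute tag and id, and the operator byte), and the helper names are hypothetical.

```c
/* Image dimensions quoted for the JBIG2 images in test1.pdf. */
enum { IMAGE_WIDTH = 4672, IMAGE_HEIGHT = 6800 };

/* Total number of source pixels in the image. */
static long total_pixels(void)
{
    return (long)IMAGE_WIDTH * IMAGE_HEIGHT;
}

/* Estimate how many Rectangle operators fit in file_bytes, given an
 * assumed cost per rectangle (12 bytes is an assumption, see above). */
static long estimated_rectangles(long file_bytes, long bytes_per_rect)
{
    return file_bytes / bytes_per_rect;
}
```

The pixel total matches the 24644 black plus 31744956 white pixels, and `estimated_rectangles(228456262L, 12L)` comes to about 19 million, which is consistent with just under three rectangles per five source pixels.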

Comment 11 Hin-Tak Leung 2012-10-30 05:25:31 UTC

(In reply to comment #10)
<snipped>
> There are stray set pixels in the image which don’t show up when viewing the
> image at 100 dpi, but 6100 of the 6800 rows are all white.
<snipped>

Regardless of how often pixels are white, you still cannot omit them - already explained that 'white' can mean "transparent" (i.e. show background color/image) or 'pattern' or what not...

Comment 12 Binaria Digital 2014-03-06 02:28:26 UTC

Hi,
Have you found any solution to this bug? Any progress? Do you plan to correct it in future versions?
(I recently checked the test1.pdf file with Ghostscript 9.10 (the current latest version) and got a 219MB PCL file, as in previous versions)

Comment 13 Hin-Tak Leung 2014-03-06 12:43:20 UTC

Like I said a while ago, test2 is no longer an issue with 9.10. test1 is probably going to continue to be an issue as the full-page image is probably semi-transparent? - the text in the middle is rendered from a courier font.

Comment 14 Hin-Tak Leung 2014-03-07 06:03:50 UTC

I checked all the conditions about emitting high-level image structs. I think the one and only issue with test1 is already discussed in comment 3 and 4: the image on page 2 is slightly misaligned with the pixel grid. What you could do is (1) snap to grid within a misalignment tolerance of your choosing.

Then the pxl driver would output high-level image structs, but in strips, for the sort of full-page images you have. So there will be tears between strips. To avoid tears, you could set -dMaxBitmap=<largenumber>, but then your entire page will be slightly rotated, and the rotation (about 1mm) is potentially noticeable/important; memory consumption will also go up with -dMaxBitmap=<largenumber>. So your choice is between (2a) tears at regular intervals, or (2b) memory consumption and noticeable misalignment at top/bottom of page, depending on the value of -dMaxBitmap=<largenumber>.

Comment 15 Hin-Tak Leung 2014-03-07 07:30:06 UTC

As a proof of concept, I added this snippet:

====================================
--- a/gs/devices/vector/gdevpx.c
+++ b/gs/devices/vector/gdevpx.c
@@ -1814,6 +1814,11 @@ pclxl_begin_image(gx_device * dev,
      * These have one of the diagonals being zeros
      * (and the other diagonals having non-zeros).
      */
+    if (fabs((mat.xy * mat.yx) / (mat.xx * mat.yy)) < 5e-6) {
+      if_debug1('|', "tol %f\n", fabs((mat.xy * mat.yx) / (mat.xx * mat.yy)));
+      mat.xy = 0.0;
+      mat.yx = 0.0;
+    }
     if ((!((mat.xx * mat.yy != 0) && (mat.xy == 0) && (mat.yx == 0)) &&
          !((mat.xx == 0) && (mat.yy == 0) && (mat.xy * mat.yx != 0))) ||
         (pim->ImageMask ?
================================

and indeed the output is 1.03MB; and it does 8 strips per page. I would suggest the original poster chooses a magic tolerance (5e-6 above), and decides whether tearing into 8 strips (potentially visible/unsightly) is acceptable, or a higher memory requirement with -dMaxBitmap to emit a single slightly mis-aligned image is preferred.

The patch also needs some extra work for *almost*-90 degree rotated images, and needs protection against division by zero.
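One possible shape for that extra work is sketched below: the same product test, rearranged so that nothing is divided, and applied to both the almost-upright and the almost-90-degree case. This is a hedged sketch rather than a tested patch; `mat2_t` and `snap_near_orthogonal` are hypothetical names, and the tolerance remains a value the user has to choose.

```c
#include <math.h>

/* The four linear terms of the image matrix (hypothetical stand-in type). */
typedef struct {
    double xx, xy, yx, yy;
} mat2_t;

/* Zero the negligible pair of terms if the matrix is within tol of an
 * upright or 90-degree-rotated orientation. Returns 1 if the matrix is
 * axis-aligned on return, 0 if it was left untouched. */
static int snap_near_orthogonal(mat2_t *m, double tol)
{
    double diag = fabs(m->xx * m->yy);
    double anti = fabs(m->xy * m->yx);

    if (diag > 0.0 && anti < tol * diag) {   /* almost upright */
        m->xy = 0.0;
        m->yx = 0.0;
        return 1;
    }
    if (anti > 0.0 && diag < tol * anti) {   /* almost rotated by 90 degrees */
        m->xx = 0.0;
        m->yy = 0.0;
        return 1;
    }
    return 0;
}
```

Comparing `anti < tol * diag` instead of `anti / diag < tol` avoids the division, so a matrix with a zero diagonal simply fails the first test and falls through to the second.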

However, this is a rather special application for *slightly misaligned*, and *full-page* images - scans of old books in this case, with no text and no alignment with any other page element above/below/on-the-side to worry about. 1/30 inch is about 2-3pt, or half a typical character's width off. If you instead have captions over/under almost-full-images, it would be visually noticeable i.e. left side of caption higher by this amount to right side, etc.

Being "approximately" correct or "almost" aligned isn't acceptable for general usage, so I think the general issue with files like test1.pdf is unfixable, and this should be closed as WONTFIX.

Comment 16 Ryszard Trojnacki 2018-03-13 01:56:42 UTC

I have a similar problem with pclxl_begin_image, but in my case mat.xx=-0.000000, mat.yy=-0.000000:
Added in code:
    if_debug4('o', "mat.xx=%f, mat.yy=%f, mat.xy=%f, mat.yx=%f\n", mat.xx, mat.yy, mat.xy, mat.yx);
    if_debug3('o', "(mat.xx == 0) => %d, (mat.yy == 0) => %d, (mat.xy * mat.yx != 0) => %d\n", (mat.xx == 0),  (mat.yy == 0), (mat.xy * mat.yx != 0));
Got output:
    mat.xx=-0.000000, mat.yy=-0.000000, mat.xy=-2.000190, mat.yx=2.000201
    (mat.xx == 0) => 0, (mat.yy == 0) => 0, (mat.xy * mat.yx != 0) => 1

I have added an additional condition:
    if( (fabs(mat.xx)<0.000001) && (fabs(mat.yy)<0.000001) ) {
        if_debug0('o', "Zero fix\n");
        mat.xx=0.0;
        mat.yy=0.0;
    }
And now my file is small: from 41 MB (2-page PDF) down to 3MB.

I'm not sure if this makes any difference to the output, but the epsilon here is less than 0.000001.

Comment 17 Hin-Tak Leung 2018-03-13 04:14:00 UTC

(In reply to Ryszard Trojnacki from comment #16)

> I have added additional condition:
>     if( (fabs(mat.xx)<0.000001) && (fabs(mat.yy)<0.000001) ) {
...

> I'm not sure if this makes any difference on output, but the epsilon here is
> less than 0.000001.

This isn't quite the correct maths. If it were done more or less correctly, you would want to do something like this:

(1) mat.xx and mat.yy are one pair; mat.xy and mat.yx are the other pair.
(2) when the matrix is approximately orthogonal, both members of one pair are large and both members of the other pair are small.

i.e. mat.xx ~ mat.yy << mat.xy ~ mat.yx
or  mat.xx ~ mat.yy >> mat.xy ~ mat.yx

(I am skipping all the absolute signs for simplicity - we are talking about absolute values - removing any signs - in all this discussion).

(3) so you want to rank the 4 numbers by magnitude, then compare
the 2nd and the 3rd in rank of size, and use that comparison as the tolerance test.
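Steps (1) to (3) can be sketched as follows. Because the two large values must come from the same pair, ranking all four magnitudes reduces to comparing the smaller member of one pair with the larger member of the other. The type and function names are hypothetical, and the tolerance is an assumed ratio, not a value from the driver.

```c
#include <math.h>

/* The four linear terms of the image matrix (hypothetical stand-in type). */
typedef struct {
    double xx, xy, yx, yy;
} img_mat_t;

static double min2(double a, double b) { return a < b ? a : b; }
static double max2(double a, double b) { return a > b ? a : b; }

/* Snap the matrix if one pair dominates the other by a factor of 1/tol.
 * Returns 1 if the matrix is axis-aligned on return, 0 otherwise. */
static int snap_by_rank(img_mat_t *m, double tol)
{
    double diag_lo = min2(fabs(m->xx), fabs(m->yy));
    double diag_hi = max2(fabs(m->xx), fabs(m->yy));
    double anti_lo = min2(fabs(m->xy), fabs(m->yx));
    double anti_hi = max2(fabs(m->xy), fabs(m->yx));

    /* Ranks 1 and 2 are xx, yy; rank 3 (anti_hi) is tiny next to rank 2. */
    if (diag_lo > 0.0 && anti_hi < tol * diag_lo) {
        m->xy = 0.0;
        m->yx = 0.0;
        return 1;
    }
    /* Ranks 1 and 2 are xy, yx; rank 3 (diag_hi) is tiny next to rank 2. */
    if (anti_lo > 0.0 && diag_hi < tol * anti_lo) {
        m->xx = 0.0;
        m->yy = 0.0;
        return 1;
    }
    return 0;
}
```

Unlike a test on each small term in isolation, this leaves a matrix alone when only one member of a pair is small, for example a shear where xy is 0.1 and yx is 0.0001.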


FWIW, since my last message, I have thought of a "correct" way of fixing this: it should be possible to collect all the little rectangles and put them into an image, after the fact. However, this would take quite substantial engineering/testing time, which I won't be able to get round to in the near future.

Comment 18 Peter Cherepanov 2021-01-03 00:37:12 UTC

The test1.pdf file still generates a 218M PCL file.