Bug 692323

Summary:	High-resolution rendering is very slow on some files
Product:	Ghostscript	Reporter:	Leonid Lukomskij <leonidas>
Component:	Graphics Library	Assignee:	Michael Vrhel <michael.vrhel>
Status:	NOTIFIED FIXED
Severity:	normal	CC:	alex, henry.stiles, robin.watts
Priority:	P2
Version:	9.02
Hardware:	PC
OS:	Windows XP
Customer:	582	Word Size:	---
Attachments:	Problem file; A modest speed-up

Description Leonid Lukomskij 2011-07-05 18:30:29 UTC

I found several files that hang the rendering process at high resolution.

I tried a resolution of 2400 dpi, but the wait for a result was too long.

So I used the following command line with a resolution of 1200 dpi:

gswin32c -dBATCH -dSAFER -dNOPAUSE -sDEVICE=tiff32nc -r1200 -sOutputFile=ti.tif "Tiquet de Salida.pdf"

Much more complex files take less time to process, but this file took 35 minutes.

At 2400 dpi I killed the process after 2 hours.

Comment 1 Leonid Lukomskij 2011-07-05 18:31:33 UTC

Created attachment 7641 [details]
Problem file

Comment 2 Alex Cherepanov 2011-07-05 22:18:14 UTC

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 54.88    233.62   233.62   458996     0.00     0.00  s_IScale_process
 25.68    342.92   109.30 1905055701     0.00     0.00  cmsTetrahedralInterp16
  4.72    363.01    20.09    35828     0.00     0.01  image_render_interpolate_icc
  2.10    371.97     8.96 14130107     0.00     0.00  pdf14_fill_rectangle
  1.78    379.54     7.57 540794233     0.00     0.00  art_pdf_composite_pixel_alpha_8
  1.25    384.87     5.33 1904849015     0.00     0.00  Pack4Words
  1.18    389.90     5.03   227129     0.00     0.00  PrecalculatedXFORM
  0.96    393.98     4.08      802     0.01     0.01  pdf14_compose_group
  0.73    397.08     3.10 1904289271     0.00     0.00  Unroll3Words
  0.66    399.88     2.80   564023     0.00     0.00  gs_memset
  0.59    402.41     2.53 179953330     0.00     0.00  gs_memcpy
  0.59    404.92     2.51   912714     0.00     0.00  cmsTrilinearInterp16
  0.58    407.37     2.45                             Pack4BytesSwapSwapFirst
  0.58    409.82     2.45   455488     0.00     0.00  calculate_contrib
  0.54    412.11     2.29                             Unroll3BytesSwap

Comment 3 Alex Cherepanov 2011-07-06 17:12:16 UTC

Created attachment 7644 [details]
A modest speed-up

A speed improvement of about 7% can easily be achieved by recoding cmsTetrahedralInterp16() more efficiently.

Besides equivalent transformations, this patch replaces division with multiplication and causes a large number of trivial differences.

Comment 4 Robin Watts 2011-07-06 18:25:50 UTC

Moving the loops into the ifs makes perfect sense.

Replacing the divisions by 'something other than division' would be great, but the proposed change gives a different answer on 50% of all calculations.

Comment 5 Robin Watts 2011-07-06 18:35:50 UTC

I think we can get an exact match and remove the division by using Alex's patch with the addition of:

 Rest += 0x8000;

before each Output calculation.

Comment 6 Robin Watts 2011-07-06 18:39:30 UTC

Pardon me. I meant that the calculation of Rest, which ends in + 0x7fff, should be changed to end in + 0x8000 to give us an exact match, I believe.
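
The rounding question above can be checked numerically. The sketch below is neither Alex's patch nor the lcms source. It assumes that the original expression has the form (Rest + 0x7fff) / 0xffff, that Rest is an unsigned 32-bit value no larger than 0xffff * 0xffff, and it uses a well-known shift-and-add identity as one possible division-free replacement. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # emulate unsigned 32-bit arithmetic


def div_ffff_reference(rest):
    """Assumed original form: rounded division by 0xffff with a 0x7fff bias."""
    return (rest + 0x7FFF) // 0xFFFF


def div_ffff_shift(rest):
    """Division-free variant: bias by 0x8000, then fold the high word back in."""
    t = (rest + 0x8000) & MASK32
    return ((t + (t >> 16)) & MASK32) >> 16


def count_mismatches(values):
    """Number of inputs on which the two variants disagree."""
    return sum(1 for v in values if div_ffff_reference(v) != div_ffff_shift(v))
```

Under these assumptions the two functions agree across the unsigned range. Signed values of Rest, which the real interpolation code may produce, are outside the scope of this sketch.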

Comment 7 Alex Cherepanov 2011-07-17 20:17:37 UTC

Slow rendering is caused by the combination of image interpolation and transparency.

On my box the file takes  -- 19 min
with -dNOTRANSPARENCY     --  4 min
with -dNOINTERPOLATE      -- 1.5 min
with -dNOTRANSPARENCY and
     -dNOINTERPOLATE      --  6 sec

Profile with -dNOTRANSPARENCY
 time   seconds   seconds    calls   s/call   s/call  name    
 51.11     44.48    44.48    57725     0.00     0.00  s_IScale_process
 30.99     71.45    26.97 479219716     0.00     0.00  cmsTetrahedralInterp16
  3.80     74.76     3.31     4439     0.00     0.02  image_render_interpolate_icc
  2.52     76.95     2.19   183084     0.00     0.00  gs_memcpy
  2.49     79.12     2.17   127833     0.00     0.00  gs_memset
  1.54     80.46     1.34    57444     0.00     0.00  PrecalculatedXFORM
  1.40     81.68     1.22 479690364     0.00     0.00  Pack4Words
  0.90     82.46     0.78 479690306     0.00     0.00  Unroll3Words

Profile with -dNOINTERPOLATE
 time   seconds   seconds    calls   s/call   s/call  name    
 25.96      8.56     8.56   124286     0.00     0.00  pdf14_fill_rectangle
 22.89     16.11     7.55 540742933     0.00     0.00  art_pdf_composite_pixel_alpha_8
 13.46     20.55     4.44      802     0.01     0.01  pdf14_compose_group
  8.67     23.41     2.86   603800     0.00     0.00  gs_memset
  7.85     26.00     2.59 180731988     0.00     0.00  gs_memcpy
  5.46     27.80     1.80    18907     0.00     0.00  gx_build_blended_image_row
  5.28     29.54     1.74    24509     0.00     0.00  image_render_color_icc
  2.43     30.34     0.80 112653530     0.00     0.00  art_pdf_composite_group_8
  1.33     30.78     0.44                             art_pdf_union_mul_8
  0.58     30.97     0.19    49916     0.00     0.00  gs_memmove

Profile with -dNOTRANSPARENCY and -dNOINTERPOLATE 
 time   seconds   seconds    calls   s/call   s/call  name    
 47.80      2.28     2.28   433362     0.00     0.00  gs_memset
 33.12      3.86     1.58   793521     0.00     0.00  gs_memcpy
  1.68      3.94     0.08  1411935     0.00     0.00  f
  1.47      4.01     0.07   470670     0.00     0.00  cmsTrilinearInterp16
  1.47      4.08     0.07      602     0.00     0.00  TT_RunIns
  1.26      4.14     0.06   470650     0.00     0.00  cmsXYZ2LabEncoded
  1.05      4.19     0.05     1472     0.00     0.00  image_render_color_icc
  1.05      4.24     0.05        5     0.01     0.07  cmsSample3DGrid
  0.84      4.28     0.04   655037     0.00     0.00  cmsTetrahedralInterp16
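
As a rough reading of the wall-clock times at the top of this comment, the speed-up from each option can be worked out directly. This is only arithmetic on the reported figures; the dictionary keys and the function name are illustrative.

```python
# Wall-clock times reported in this comment, converted to seconds.
timings = {
    "default": 19 * 60,
    "-dNOTRANSPARENCY": 4 * 60,
    "-dNOINTERPOLATE": 90,  # 1.5 min
    "-dNOTRANSPARENCY -dNOINTERPOLATE": 6,
}


def speedups(times, baseline="default"):
    """Speed-up of every configuration relative to the baseline run."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


ratios = speedups(timings)
# If the two options acted as independent factors, the combined speed-up
# would be about the product of the individual ones.
independent_estimate = ratios["-dNOTRANSPARENCY"] * ratios["-dNOINTERPOLATE"]
```

The measured combined speed-up (190x) is well above the product of the individual ones (about 60x), which fits the statement that the slowdown comes from the combination of interpolation and transparency.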

Comment 8 Ray Johnston 2011-07-18 01:49:42 UTC

Interesting that -dNOTRANSPARENCY reduced the number of calls to 
cmsTetrahedralInterp16 from 1,905,055,701 to 479,219,716.

Also tetrahedral interpolation won't be used if we use the 'simple'
ps_***.icc profiles that don't have a lookup table. Profiling and timing
with the "simple" profiles would be of interest.
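
A quick check of those call counts against the self times in the two profiles, using only numbers already posted in this bug (variable names are illustrative):

```python
# (calls, self seconds) for cmsTetrahedralInterp16, copied from the profiles
# in comment 2 (default run) and comment 7 (-dNOTRANSPARENCY run).
with_transparency = (1_905_055_701, 109.30)
no_transparency = (479_219_716, 26.97)

call_ratio = with_transparency[0] / no_transparency[0]
time_ratio = with_transparency[1] / no_transparency[1]

# Approximate cost of one call in nanoseconds for each run.
ns_per_call = [
    1e9 * secs / calls for calls, secs in (with_transparency, no_transparency)
]
```

Both ratios come out close to 4, so the per-call cost looks roughly constant and the extra time tracks the extra calls.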

Comment 9 Michael Vrhel 2011-09-12 01:45:08 UTC

The issue is that with interpolation we always interpolate in the source color space and then perform the color conversion on the interpolated values.  For large scale-ups this *dramatically* increases our computational cost.  The solution is to do the color conversion up front, prior to scaling, when we are scaling the image up.  I am working on this fix now and should be done in a couple of weeks or sooner.
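
A minimal cost model of the point above, counting only how many pixels go through color conversion in each order. This is a sketch with hypothetical function names and image sizes, not Ghostscript code, and it ignores the cost of the interpolation itself.

```python
def conversions_interpolate_first(src_w, src_h, scale):
    """Scale first, then color-convert every output pixel."""
    return int(src_w * scale) * int(src_h * scale)


def conversions_convert_first(src_w, src_h, scale):
    """Color-convert the source pixels once, then scale the converted image."""
    return src_w * src_h


def cheaper_order(scale):
    """Order that needs fewer conversions under this simple model."""
    return "convert_first" if scale > 1 else "interpolate_first"


# Hypothetical example: a 100x100 image enlarged 16x in each direction.
before = conversions_interpolate_first(100, 100, 16)
after = conversions_convert_first(100, 100, 16)
```

In this model the saving grows with the square of the scale factor, which is why converting first only pays off for enlargements.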

Comment 10 Michael Vrhel 2011-09-14 18:07:40 UTC

So with my fix at 1200dpi we only take 204 seconds now.  Cluster testing now.

Comment 11 Michael Vrhel 2011-09-20 04:40:33 UTC

So with the commit

http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=6aa157b438ac69f9732b9f7b29e8570cceb50e8e
which makes us perform color management before interpolation if we are doing enlargements

and

http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=6e2eda3cca4f8e13a9139c77aad5da524fa62d76
which fixed a float/double mix-up in the interpolation code

I have the following times for 

-sDEVICE=tiff32nc -rX -o nul: input_file.pdf

At X=1200 it runs in about 200 seconds.
At X=2400 it runs in about 61 minutes.

If we add the option -dNOINTERPOLATE, the 2400 case runs in 200 seconds.  When rendering at very high resolutions like 2400 dpi, where the output will clearly be halftoned, I would suggest not allowing interpolation of images (i.e. use -dNOINTERPOLATE).
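
For anyone scripting these runs, the command lines above can be assembled as follows. This is a sketch: the helper name is hypothetical, the executable name is taken from the original report, and every option shown is one already used in this bug.

```python
def build_gs_args(input_file, resolution, interpolate=True, output="nul:"):
    """Build the Ghostscript argument list used for the timings in this bug."""
    args = ["gswin32c", "-sDEVICE=tiff32nc", f"-r{resolution}"]
    if not interpolate:
        # Skip image interpolation, as suggested for very high resolutions.
        args.append("-dNOINTERPOLATE")
    args += ["-o", output, input_file]
    return args


args_2400 = build_gs_args("input_file.pdf", 2400, interpolate=False)
```

The resulting list can be passed to subprocess.run(args_2400, check=True) on a machine where Ghostscript is installed.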

Comment 12 Michael Vrhel 2011-09-20 17:39:30 UTC

Closing since we have made significant improvements.   Note that at 2400 dpi the customer may want to use -dNOINTERPOLATE.