Summary: | High-resolution rendering is very slow on some files | ||
---|---|---|---|
Product: | Ghostscript | Reporter: | Leonid Lukomskij <leonidas> |
Component: | Graphics Library | Assignee: | Michael Vrhel <michael.vrhel> |
Status: | NOTIFIED FIXED | ||
Severity: | normal | CC: | alex, henry.stiles, robin.watts |
Priority: | P2 | ||
Version: | 9.02 | ||
Hardware: | PC | ||
OS: | Windows XP | ||
Customer: | 582 | Word Size: | --- |
Attachments: | Problem file; A modest speed-up | ||
Description
Leonid Lukomskij
2011-07-05 18:30:29 UTC
Created attachment 7641 [details]
Problem file

% time | cumulative seconds | self seconds | calls | self s/call | total s/call | name
---|---|---|---|---|---|---
54.88 | 233.62 | 233.62 | 458996 | 0.00 | 0.00 | s_IScale_process
25.68 | 342.92 | 109.30 | 1905055701 | 0.00 | 0.00 | cmsTetrahedralInterp16
4.72 | 363.01 | 20.09 | 35828 | 0.00 | 0.01 | image_render_interpolate_icc
2.10 | 371.97 | 8.96 | 14130107 | 0.00 | 0.00 | pdf14_fill_rectangle
1.78 | 379.54 | 7.57 | 540794233 | 0.00 | 0.00 | art_pdf_composite_pixel_alpha_8
1.25 | 384.87 | 5.33 | 1904849015 | 0.00 | 0.00 | Pack4Words
1.18 | 389.90 | 5.03 | 227129 | 0.00 | 0.00 | PrecalculatedXFORM
0.96 | 393.98 | 4.08 | 802 | 0.01 | 0.01 | pdf14_compose_group
0.73 | 397.08 | 3.10 | 1904289271 | 0.00 | 0.00 | Unroll3Words
0.66 | 399.88 | 2.80 | 564023 | 0.00 | 0.00 | gs_memset
0.59 | 402.41 | 2.53 | 179953330 | 0.00 | 0.00 | gs_memcpy
0.59 | 404.92 | 2.51 | 912714 | 0.00 | 0.00 | cmsTrilinearInterp16
0.58 | 407.37 | 2.45 | | | | Pack4BytesSwapSwapFirst
0.58 | 409.82 | 2.45 | 455488 | 0.00 | 0.00 | calculate_contrib
0.54 | 412.11 | 2.29 | | | | Unroll3BytesSwap

Created attachment 7644 [details]
A modest speed-up
A speed improvement of about 7% can be easily achieved by re-coding cmsTetrahedralInterp16() more efficiently.
Besides equivalent transformations, this patch replaces division with multiplication, which causes a large number of trivial differences.
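
The trade-off between removing a division and keeping bit-exact output can be illustrated with a small sketch. This is not the attached patch and not the lcms source: it assumes the value being scaled is a non-negative product of two 16-bit quantities and that the original rounding has the form `(x + 0x7FFF) / 0xFFFF`. The function names `div65535_ref`, `div65535_shift` and `count_mismatches` are hypothetical. The sketch compares a shift-based replacement under two rounding biases, the same pair of constants discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PRODUCT (0xFFFFu * 0xFFFFu)   /* largest 16-bit x 16-bit product */

/* Reference: rounded division by 0xFFFF using a real divide.
 * 0xFFFF is odd, so x / 0xFFFF is never exactly halfway between two
 * integers, and adding 0x7FFF before dividing rounds to nearest. */
static uint32_t div65535_ref(uint32_t x)
{
    assert(x <= MAX_PRODUCT);
    return (x + 0x7FFFu) / 0xFFFFu;
}

/* Division-free candidate: add a bias, then fold the high 16 bits back in.
 * The bias is a parameter so that two candidates can be compared. */
static uint32_t div65535_shift(uint32_t x, uint32_t bias)
{
    assert(x <= MAX_PRODUCT);
    uint32_t t = x + bias;
    return (t + (t >> 16)) >> 16;
}

/* Count inputs in [0, limit] where the candidate disagrees with the
 * reference. A 64-bit loop counter avoids wrap-around at the top end. */
static uint32_t count_mismatches(uint32_t bias, uint32_t limit)
{
    assert(limit <= MAX_PRODUCT);
    uint32_t bad = 0;
    for (uint64_t x = 0; x <= limit; x++) {
        if (div65535_shift((uint32_t)x, bias) != div65535_ref((uint32_t)x))
            bad++;
    }
    return bad;
}
```

In this sketch a bias of 0x8000 reproduces the reference over the whole stated range, while a bias of 0x7FFF already disagrees at x = 32768. Signed intermediate values, which a real interpolation routine may produce, are outside what the sketch covers.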
Moving the loops into the ifs makes perfect sense. Replacing the divisions with 'something other than division' would be great, but the proposed change gives a different answer on 50% of all calculations. I think we can get an exact match and remove the division by using Alex's patch with the addition of: Rest += 0x8000; before each Output calculation.

Pardon me. I meant that the calculation of Rest, which ends in + 0x7fff, should be changed to end in + 0x8000 to give us an exact match, I believe.

Slow rendering is caused by the combination of image interpolation and transparency. On my box the file takes:

Options | Time
---|---
(none) | 19 min
-dNOTRANSPARENCY | 4 min
-dNOINTERPOLATE | 1.5 min
-dNOTRANSPARENCY and -dNOINTERPOLATE | 6 sec

Profile with -dNOTRANSPARENCY

% time | cumulative seconds | self seconds | calls | self s/call | total s/call | name
---|---|---|---|---|---|---
51.11 | 44.48 | 44.48 | 57725 | 0.00 | 0.00 | s_IScale_process
30.99 | 71.45 | 26.97 | 479219716 | 0.00 | 0.00 | cmsTetrahedralInterp16
3.80 | 74.76 | 3.31 | 4439 | 0.00 | 0.02 | image_render_interpolate_icc
2.52 | 76.95 | 2.19 | 183084 | 0.00 | 0.00 | gs_memcpy
2.49 | 79.12 | 2.17 | 127833 | 0.00 | 0.00 | gs_memset
1.54 | 80.46 | 1.34 | 57444 | 0.00 | 0.00 | PrecalculatedXFORM
1.40 | 81.68 | 1.22 | 479690364 | 0.00 | 0.00 | Pack4Words
0.90 | 82.46 | 0.78 | 479690306 | 0.00 | 0.00 | Unroll3Words

Profile with -dNOINTERPOLATE

% time | cumulative seconds | self seconds | calls | self s/call | total s/call | name
---|---|---|---|---|---|---
25.96 | 8.56 | 8.56 | 124286 | 0.00 | 0.00 | pdf14_fill_rectangle
22.89 | 16.11 | 7.55 | 540742933 | 0.00 | 0.00 | art_pdf_composite_pixel_alpha_8
13.46 | 20.55 | 4.44 | 802 | 0.01 | 0.01 | pdf14_compose_group
8.67 | 23.41 | 2.86 | 603800 | 0.00 | 0.00 | gs_memset
7.85 | 26.00 | 2.59 | 180731988 | 0.00 | 0.00 | gs_memcpy
5.46 | 27.80 | 1.80 | 18907 | 0.00 | 0.00 | gx_build_blended_image_row
5.28 | 29.54 | 1.74 | 24509 | 0.00 | 0.00 | image_render_color_icc
2.43 | 30.34 | 0.80 | 112653530 | 0.00 | 0.00 | art_pdf_composite_group_8
1.33 | 30.78 | 0.44 | | | | art_pdf_union_mul_8
0.58 | 30.97 | 0.19 | 49916 | 0.00 | 0.00 | gs_memmove

Profile with -dNOTRANSPARENCY and -dNOINTERPOLATE

% time | cumulative seconds | self seconds | calls | self s/call | total s/call | name
---|---|---|---|---|---|---
47.80 | 2.28 | 2.28 | 433362 | 0.00 | 0.00 | gs_memset
33.12 | 3.86 | 1.58 | 793521 | 0.00 | 0.00 | gs_memcpy
1.68 | 3.94 | 0.08 | 1411935 | 0.00 | 0.00 | f
1.47 | 4.01 | 0.07 | 470670 | 0.00 | 0.00 | cmsTrilinearInterp16
1.47 | 4.08 | 0.07 | 602 | 0.00 | 0.00 | TT_RunIns
1.26 | 4.14 | 0.06 | 470650 | 0.00 | 0.00 | cmsXYZ2LabEncoded
1.05 | 4.19 | 0.05 | 1472 | 0.00 | 0.00 | image_render_color_icc
1.05 | 4.24 | 0.05 | 5 | 0.01 | 0.07 | cmsSample3DGrid
0.84 | 4.28 | 0.04 | 655037 | 0.00 | 0.00 | cmsTetrahedralInterp16

Interesting that -dNOTRANSPARENCY reduced the number of calls to cmsTetrahedralInterp16 from 1,905,055,701 to 479,219,716. Also, tetrahedral interpolation won't be used if we use the 'simple' ps_***.icc profiles that don't have a lookup table. Profiling and timing with the 'simple' profiles would be of interest.

The issue here is that, with interpolation, we always perform the interpolation in the source color space and then perform the color conversion on the interpolated values. When we do large scale-ups this *dramatically* increases our computational cost. The solution is for us to do the color conversion up-front, prior to scaling, when we are scaling the image up. I am working on this fix now and should be done in a couple of weeks or sooner.

So with my fix, at 1200 dpi we only take 204 seconds now. Cluster testing now.

So with the commit http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=6aa157b438ac69f9732b9f7b29e8570cceb50e8e, which makes us perform color management before interpolation if we are doing enlargements, and http://git.ghostscript.com/?p=ghostpdl.git;a=commit;h=6e2eda3cca4f8e13a9139c77aad5da524fa62d76, which fixed a float/double mix-up in the interpolation code, I have the following times for -sDEVICE=tiff32nc -rX -o nul: input_file.pdf

At X=1200 it runs in about 200 seconds.

At X=2400 it runs in about 61 minutes.

If we add the option -dNOINTERPOLATE, the 2400 case runs in 200 seconds.

When rendering at very high resolutions like 2400, where the output will clearly be halftoned, I would suggest not allowing interpolation of images (i.e. use -dNOINTERPOLATE). Closing since we have made significant improvements.
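
The cost argument behind the fix (convert colors before enlarging rather than after) can be shown with a toy one-dimensional model. Everything here is an assumption made for illustration: `toy_convert` is a hypothetical stand-in for an ICC transform, the scaler is plain linear interpolation rather than Ghostscript's image scaler, and none of the names come from the Ghostscript or lcms sources. The sketch only counts how many conversions each ordering performs.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long g_convert_calls = 0;   /* counts toy color conversions */

/* Hypothetical stand-in for a color transform. It is a linear map, so
 * both orderings below yield the same samples and only the cost differs. */
static double toy_convert(double v)
{
    g_convert_calls++;
    return 0.5 * v + 0.25;
}

/* Linear interpolation of src[0..n-1] at output sample i, where the
 * output row holds n * k samples (k is the enlargement factor). */
static double lerp_sample(const double *src, size_t n, size_t k, size_t i)
{
    assert(n > 0 && k > 0 && i < n * k);
    double pos = (double)i / (double)k;           /* position in source units */
    size_t i0 = (size_t)pos;
    size_t i1 = (i0 + 1 < n) ? i0 + 1 : n - 1;    /* clamp at the right edge */
    double f = pos - (double)i0;
    return src[i0] * (1.0 - f) + src[i1] * f;
}

/* Order A: enlarge first, then convert every output sample (n * k calls). */
static void scale_then_convert(const double *src, size_t n, size_t k,
                               double *dst)
{
    for (size_t i = 0; i < n * k; i++)
        dst[i] = toy_convert(lerp_sample(src, n, k, i));
}

/* Order B: convert the n source samples once, then enlarge (n calls).
 * tmp must have room for n samples, dst for n * k samples. */
static void convert_then_scale(const double *src, size_t n, size_t k,
                               double *dst, double *tmp)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = toy_convert(src[i]);
    for (size_t i = 0; i < n * k; i++)
        dst[i] = lerp_sample(tmp, n, k, i);
}
```

For a two-dimensional image enlarged by k in each direction the gap grows to a factor of k squared. With a non-linear transform the two orderings would no longer produce identical samples, so the reordering changes output slightly as well as cost.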
Note that at 2400 dpi the customer may want to use -dNOINTERPOLATE.