Bug 691840 - image code too slow
Summary: image code too slow
Status: RESOLVED FIXED
Alias: None
Product: GhostPCL
Classification: Unclassified
Component: PCL raster
Version: unspecified
Hardware: PC All
Importance: P4 normal
Assignee: Henry Stiles
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-12-16 03:16 UTC by Henry Stiles
Modified: 2011-02-24 16:43 UTC
CC List: 1 user

See Also:
Customer:
Word Size: ---


Attachments
copy_mono.patch (1.48 KB, patch)
2010-12-16 03:19 UTC, Henry Stiles

Description Henry Stiles 2010-12-16 03:16:53 UTC
The attached file renders in 18 seconds using the gs_image API and in 9 seconds using copy_mono directly.
Comment 1 Henry Stiles 2010-12-16 03:19:31 UTC
Created attachment 7034 [details]
copy_mono.patch

Quick hack to demonstrate direct copy_mono performance with this test file.  Obviously checks are needed to enable the fast path, but those should not significantly affect the time.
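
For reference, a minimal sketch of what the direct copy_mono path amounts to (schematic only, not the attached patch; variable names are placeholders):

    /* After decompressing one raster row, hand it straight to the device's
     * copy_mono procedure instead of going through the general gs_image
     * machinery.  In real code the checks mentioned above (1-bit data,
     * device space, no scaling, default ROP) must pass first. */
    code = dev_proc(dev, copy_mono)(dev,
                                    row,               /* decompressed row     */
                                    0,                 /* data_x               */
                                    raster,            /* bytes per source row */
                                    gx_no_bitmap_id,
                                    x, y,              /* device position      */
                                    width, 1,          /* one row at a time    */
                                    gx_no_color_index, /* 0 bits transparent   */
                                    (gx_color_index)1);/* 1 bits marked        */
    if (code < 0)
        return code;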
Comment 2 Henry Stiles 2010-12-16 03:58:11 UTC
The command line is: time pcl6 -r600 -sDEVICE=pbmraw -o /dev/null 1000pages.pcl
Comment 3 norbert.janssen 2010-12-16 07:18:23 UTC
(In reply to comment #1)
Another optimization would be to have the decompression routines also store the actual valid size in pseed_rows[] (the size field is the maximum scan length, i.e. out to the end of the clip rectangle).
Only these valid data bytes need to be processed; the remaining data is 0.
For the example file this reduces the raster scans from a few hundred bytes to 6 bytes (but only when the ROP is paint, i.e. dst = dst | src, so that white pixels have no effect on the destination).
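
A sketch of that idea (illustrative only; the real change would live in the raster decompression code):

    /* Report how many bytes of a decompressed seed row actually carry data.
     * Trailing zero bytes are all-white, and under the paint ROP
     * (dst = dst | src) they cannot change the destination, so later stages
     * can stop at 'valid' instead of walking the full scan length. */
    static int
    valid_row_bytes(const unsigned char *row, int row_size)
    {
        int valid = row_size;
        while (valid > 0 && row[valid - 1] == 0)
            valid--;
        return valid;
    }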
Comment 4 Henry Stiles 2010-12-16 07:40:14 UTC
Hello Norbert, I need the complete collection of files that you have identified as performance problems to decide which direction to take here.  If all or many of the files use images in device space, the direct copy_mono is probably what we want to do.  I also need to know the target processor for the final device.  Please note that much of the raster code will end up going through copy_mono(), which is faster on big-endian architectures than on little-endian ones.  I'm off to bed now but will continue looking at this tomorrow, and will check out your suggestion of storing the actual size of the rows.
Comment 5 norbert.janssen 2010-12-16 14:02:46 UTC
(In reply to comment #4)
> Hello Norbert, I need the complete collection of files that you have identified
> as performance problems to decide which direction to take here.  If all or many
> of the files use images in device space the direct copy_mono is probably what
> we want to do.  I also need to know the target processor for the final device. 
> Please note much of the raster code will end up going through copy_mono() which
> is faster on big endian architectures vs. little endian.  I'm off to bed now
> but will continue looking at this tomorrow, and will check out your suggestion
> of storing the actual size of the rows.

Target processor is an Intel T7500/Q9550 running XP Embedded (32-bit).
I will put the files on peeves/homme/norbert/raster_performance/PCLTestFiles.tgz (some of them are huge, i.e. 700MB).

pseed_rows[].size + a new pseed_rows[].valid_data (also to be put in the clist: cmd_opv_image_data, etc.).
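
A hypothetical shape of that change (struct and field names invented for illustration):

    typedef struct pcl_seed_row_s {
        byte *data;        /* decompressed row buffer                        */
        uint  size;        /* allocated size: full scan length to clip edge  */
        uint  valid_data;  /* bytes actually produced by the decompressor;   */
                           /* would also be carried in cmd_opv_image_data    */
    } pcl_seed_row_t;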
Comment 6 Henry Stiles 2010-12-16 19:52:56 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Hello Norbert, I need the complete collection of files that you have identified
> > as performance problems to decide which direction to take here.  If all or many
> > of the files use images in device space the direct copy_mono is probably what
> > we want to do.  I also need to know the target processor for the final device. 
> > Please note much of the raster code will end up going through copy_mono() which
> > is faster on big endian architectures vs. little endian.  I'm off to bed now
> > but will continue looking at this tomorrow, and will check out your suggestion
> > of storing the actual size of the rows.
> 
> Targetprocessor is Intel T7500/Q9550 XPembedded-32bit.
> I will put the files on
> peeves/homme/norbert/raster_performance/PCLTestFiles.tgz (some of them are
> huge, i.e. 700MB)
> 
> pseed_rows[].size + new pseed_rows[].valid_data (also to be put in clist :
> cmd_opv_image_data, etc.

I think with the copy_mono optimization this seed row size will not be very significant, but I could be wrong; we should measure the benefit of each improvement as we go.  I am a little surprised you don't have the memory budget to support a full frame buffer for 600 dpi monochrome; that would really speed things up.  Maybe you are using banding for mt (multithreaded) rendering, but I doubt the mt speedup would trump full-frame rendering with the expected job mix for this device.
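
For scale (assuming Letter or A4 pages): at 600 dpi a Letter page in 1-bit monochrome is about 5100 x 6600 pixels, i.e. 5100 * 6600 / 8 ≈ 4.2 MB per frame (A4 is about 4.3 MB), which is modest next to the memory a T7500/Q9550 class machine typically has.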
Comment 7 norbert.janssen 2010-12-17 07:35:40 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #4)
> > > Hello Norbert, I need the complete collection of files that you have identified
> > > as performance problems to decide which direction to take here.  If all or many
> > > of the files use images in device space the direct copy_mono is probably what
> > > we want to do.  I also need to know the target processor for the final device. 
> > > Please note much of the raster code will end up going through copy_mono() which
> > > is faster on big endian architectures vs. little endian.  I'm off to bed now
> > > but will continue looking at this tomorrow, and will check out your suggestion
> > > of storing the actual size of the rows.
> > 
> > Targetprocessor is Intel T7500/Q9550 XPembedded-32bit.
> > I will put the files on
> > peeves/homme/norbert/raster_performance/PCLTestFiles.tgz (some of them are
> > huge, i.e. 700MB)
> > 
> > pseed_rows[].size + new pseed_rows[].valid_data (also to be put in clist :
> > cmd_opv_image_data, etc.
> 
> I think with the copy_mono optimization this seed row size will not be very
> significant but I could be wrong, we should measure the benefit of each
> improvement as we go.  I am a little surprised you don't have the memory budget
> to support a full frame buffer for 600 dpi monochrome, that would really speed
> things up.  Maybe you are using banding for mt (multithreaded) rendering but I
> would doubt the mt speedup would trump full-frame with the expected job mix for
> this device.

But if the seed row's actual size is correctly reported, I can pick this up in begin_typed_image and do my own optimized rasterization (only if certain conditions are met, i.e. b&w, orthogonal, non-scaled, ROP, ...).
Note that I also added a flag to gs_image1_s to specify whether it is a character bitmap (PI_IsCharacter in gximage.c) or not, so I can have different handling in my own gdevoce (character versus raster graphics).
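
Schematically, the gate in such a begin_typed_image override could look like this (the predicate and the ROP helper are only illustrative of the conditions listed above, not existing API):

    /* Take the private fast path only for 1-bit images placed in device
     * space without rotation or scaling, and only for an acceptable ROP
     * (plain source copy or paint).  A flag like PI_IsCharacter could be
     * tested here as well to treat character bitmaps differently. */
    static bool
    take_fast_image_path(const gs_image1_t *pim, const gs_matrix *mat,
                         gs_logical_operation_t lop)
    {
        return pim->BitsPerComponent == 1     /* b&w                     */
            && mat->xy == 0 && mat->yx == 0   /* orthogonal              */
            && mat->xx == 1 && mat->yy == 1   /* not scaled              */
            && rop_is_copy_or_paint(lop);     /* hypothetical ROP check  */
    }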

We do support a full frame buffer (that is our current implementation, without clist), but we are experimenting with multithreaded rendering to see if we can benefit from multicore CPUs (for this we need banding/clist).  I also implemented asynchronous interpreting/rendering (using elements from gdevprna.c, gdevbmpa.c, gdevp2up.c).  With this the interpreter can start on the next page while the current page/clist is being rendered by another thread running a rendering device (clist reader).
So the interpreter delivers a clist which is saved in a list, and a different thread picks it up, opens a reader device, attaches the clist and a bitmap, and calls gdev_prn_render_pages() for it.
This was already running more than a year ago, but in the multi-language build (PSI/PCL/XPS) there was, with the memory manager used at the time, a weird effect where objects were freed while still being needed by the renderer (I suspected the PostScript garbage collection, as I had the psi + pcl + xps interpreters enabled; when the interpreter was done, pl_dnit_job did some cleanup while the renderer was still busy).

Another aspect I want to investigate is not starting render threads for each page and killing them again after rendering, but keeping them alive for the next page (a kind of render-thread pool), just attaching the correct clist and band memory (setup_buf_device).  This should also remove some thread-creation overhead from the mt rendering.
I also want to check that the render threads do not hinder each other while processing the same clist but rendering in different bands, i.e. a lockout/mutex on the clist or memory manager, because the steps in performance gain when going from NumRenderingThreads 1 - 2 - 3 - 4 are a bit disappointing (it is not always faster to have more threads).
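
A very rough sketch of the keep-alive render-thread idea, in plain pthreads with invented names (the real code would go through the clist reader device and setup_buf_device):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct band_job_s {
        void *page_clist;              /* clist handed over by the interpreter */
        int   first_band, last_band;   /* bands this worker should render      */
        struct band_job_s *next;
    } band_job_t;

    static band_job_t     *job_queue;  /* filled by the interpreter thread */
    static pthread_mutex_t queue_lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_nonempty = PTHREAD_COND_INITIALIZER;

    /* Placeholder for attaching the clist and band memory (setup_buf_device)
     * and rendering job->first_band .. job->last_band. */
    static void render_bands(band_job_t *job) { (void)job; }

    /* Each worker is created once and then loops forever, so there is no
     * per-page thread create/destroy overhead.  The mutex here only guards
     * the job queue; contention inside the shared clist reader or memory
     * manager is a separate question that this sketch does not answer. */
    static void *
    render_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);
            while (job_queue == NULL)
                pthread_cond_wait(&queue_nonempty, &queue_lock);
            band_job_t *job = job_queue;
            job_queue = job->next;
            pthread_mutex_unlock(&queue_lock);
            render_bands(job);
        }
        return NULL;
    }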
Comment 8 Henry Stiles 2011-02-24 16:43:26 UTC
I think the performance issues in this thread have been resolved.  As usual please reopen if I'm mistaken.