Bug 695374 - pcl text performance creep
Summary: pcl text performance creep
Status: NOTIFIED WORKSFORME
Alias: None
Product: GhostPCL
Classification: Unclassified
Component: PCL interpreter (show other bugs)
Version: 9.14
Hardware: PC Windows 7
: P2 enhancement
Assignee: Ray Johnston
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-18 01:26 UTC by norbert.janssen
Modified: 2014-10-19 19:01 UTC (History)
1 user (show)

See Also:
Customer: 661
Word Size: ---


Attachments
comparison of vs2013 profiling data (136.05 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2014-07-21 23:41 UTC, norbert.janssen
Details
comparison of VS2013 profiling data (166.50 KB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
2014-07-22 00:32 UTC, norbert.janssen
Details
timing.pdf (248.82 KB, image/png)
2014-07-24 16:49 UTC, Marcos H. Woehrmann
Details

Note You need to log in before you can comment on or make changes to this bug.
Description norbert.janssen 2014-07-18 01:26:37 UTC
I uploaded a testfile (25pages.pcl) and some profiling data (for 9.04, 9.05, 9.06 and 9.14) on peeves/myhomedirectory/20140721_profiledata.
We experience an increasing loss in performance in printing the testfile when going from 9.04, 9.05, 9.06 to 9.14
It's a stupid pcl-file, but I think the problem is in the number of calls to the copy_portrait routing. 9.06 -> 9.14 is 20% more calls.
And I think that in 9.14 the characters are blitted 1 scan at a time.
Can you confirm this?
Comment 1 Marcos H. Woehrmann 2014-07-18 11:26:53 UTC
I ran this file with the releases of GhostPCL after 9.03 and see a 22% slowdown between 9.07 and 9.10.  However, 9.14 (and the current master) are back to the same speed as 9.03 through 9.07.

If there is interest I can isolate which commits caused and then fixed the slowdown.
Comment 2 norbert.janssen 2014-07-20 22:57:07 UTC
I know 9.07-9.10 were slower, but I also see that 9.14 is 4% slower than 9.06 and 9.06 is 3% slower than 9.05
I did run with the following commandline:
-sDEVICE=bit -r600 -dNOPAUSE -dLeadingEdge=3 -sOutputFile=nul 25pages.pcl

It could be that the LeadingEdge=3 (equal to LongEdgeFeed paper) is causing this? But that is the mode we normally is it in our controller.

What I find strange is that the number of calls to show_proceed is 20% up in 9.14 w.r.t. 9.06 (see the profile-data on peeves).
Comment 3 Henry Stiles 2014-07-21 21:39:22 UTC
(In reply to norbert.janssen from comment #2)
> I know 9.07-9.10 were slower, but I also see that 9.14 is 4% slower than
> 9.06 and 9.06 is 3% slower than 9.05
> I did run with the following commandline:
> -sDEVICE=bit -r600 -dNOPAUSE -dLeadingEdge=3 -sOutputFile=nul 25pages.pcl
> 
> It could be that the LeadingEdge=3 (equal to LongEdgeFeed paper) is causing
> this? But that is the mode we normally is it in our controller.
> 
> What I find strange is that the number of calls to show_proceed is 20% up in
> 9.14 w.r.t. 9.06 (see the profile-data on peeves).

I could not reproduce the call count increase on linux with gprof.  And even so that wouldn't account for the bulk of the slowdown which seems to have occurred between 9.05 and 9.06.  It might be difficult to isolate 3% across 300 revisions but I'll have a look tomorrow.
Comment 4 Henry Stiles 2014-07-21 22:33:44 UTC
Oh, I see you are reporting a 4% slowdown between 9.06 and 9.14, and another 3% slowdown from 9.05 to 9.06?  I've reproduced a total slowdown from 9.05 to 9.14 of 4%, but 9.06 improved upon 9.05 by a small amount (less than a second):

from an i7 linux box:

9.05 19.5s
9.06 18.8s
9.15 20.5s
Comment 5 Ray Johnston 2014-07-21 23:06:04 UTC
May I suggest that -dUseFastColor=true be used. I don't know if it is
reasonable for this printer company, but many other printer companies that have
post rendering color correction or other processing prefer the faster method
that does not rely on Ghostscript internal color management and does not invoke
the lcms(2)ICC color management.
Comment 6 norbert.janssen 2014-07-21 23:41:02 UTC
Created attachment 11064 [details]
comparison of vs2013 profiling data

Comparison of 9.04-9.05, 9.05-9.06, 9.06-9.14 (for functioncalls, time spent in a function, and %time spent in function incl called functions
Comment 7 norbert.janssen 2014-07-21 23:41:15 UTC
I don't know whether gprof does sampling (i.e. at fixed intervals determine which function it is in) or uses instrumentation (monitor each function entry/exit).
With VS2013 sampling (fastest mode) I also got a different profile, but with instrumentation I see this increase in calls to show_proceed.

Attached a excel-sheet with some comparison results from the profiles.

I tried the -dUseFastColor=true, but it didn't give any improvement.
Comment 8 norbert.janssen 2014-07-22 00:32:14 UTC
Created attachment 11065 [details]
comparison of VS2013 profiling data

profiling data comparison. Includes -dUseFastColor (9.14)
Comment 9 Henry Stiles 2014-07-22 09:04:19 UTC
Marcos has agreed to compile a timing history after which he will assign this back to me.
Comment 10 Marcos H. Woehrmann 2014-07-24 16:49:34 UTC
Created attachment 11077 [details]
timing.pdf

The attached chart shows the timing of GhostPCL for the following command line:

pcl6 -sDEVICE=bit -r600 -dNOPAUSE -dLeadingEdge=3 \
     -sOutputFile=/dev/null ./25pages.pcl

The vertical scale is seconds and the horizontal scale is an arbitrary commit id that increases by 1 per commit.  

The references on the chart are:

A 8909 58937f6debfbed7675a0ce5cb8d0aa629e3fa7b8   7.46    6.84  
B 8914 8ebb6896c6f7ba59a6e7dec1267a65a66f817f1c   8.01   20.22 
C 9524 c561232cf26e060b89fc4f3bd4bf5c679731d4db   9.69    4.54  
D 9533 dcc172feb018bfe0613f90ce76ce25dd1eafe3ce   9.92    6.35  
E 9537 7276c1f94cfa55aed0554d45858652ee1cac2097  10.54   -5.98 
F 9539 1969b28c60d666b5eab0050169b5b9aae1a28db5   9.95    8.54  
G 9540 2d2cc321df4223fd7cf64946ac2fa963fa1cea3d  10.80   -9.17 
H 9542 68fa228ade5d2f7496b96676aaa1f82be4584362   9.78  -13.29
I 9547 31ca73d856f974c7e1422948b2fb5794af55bf58   8.44   -6.87 
J 9567 edd06c70150166780f1360a42499569f3a52903e   7.91   19.34 
K 9568 afe18de12b530d8d95a132755516847e8fc3f88e   9.44   -5.93 
L 9576 38ed5ffaf81405a3e8ff094afbc2ada05eaf630c   8.90  -10.79
M 9577 184dcfd570a96328798e36d8e08495eb426d67f9   7.94   -5.42 
N 9582 c186880d7e3651c8a76e55bc981fc22e4a57ea94   7.51   13.85 
O 9587 b87f410b4084c84242c254c70c379b96c8fca2d7   8.56  -12.62
P 9892 12e2eafa38a9932bb136879374423de133075a4a   7.50    8.40  
Q 9897 c79b0ee1dbc1cbbcbe406abd4bc47f617ed5cb2c   8.13   -8.12 
R 9902 0c5a6d8c142371b84542ab9ea175501f02cfb9bd   7.43   -8.88 

The columns in this table are as follows:

identifier
commit id
commit hash
execution time in seconds
percentage change in time for previous release (positive indicates slower).

The three lines in the chart are the min/max/mean of three runs of the command.  If there was >0.5 second between min and max that data was discarded and three new runs were performed.

The change in resolution of the data (i.e. for about 100 points after 'B' on the chart) is caused by a different sampling frequency in areas where the timing doesn't change (I stepped through the commits by 25 commits and then by 1 commits in areas where changes occurred).  

I can't explain the "ringing" that occurs, i.e. at points D, E, & F.
Comment 11 Henry Stiles 2014-07-29 09:38:23 UTC
Hello Marcos, it would be useful to include timings for 9.14, 9.06 and 9.05 to investigate Norbert's claim in Comment #2 that 9.14 < 9.06 < 9.05. where '<' is slower than.
Comment 12 Marcos H. Woehrmann 2014-07-29 10:32:52 UTC
Here are the times for the released versions of GhostPCL.  All times are in seconds and the average of 3 runs:

9.05: 7.84
9.06: 7.60
9.10: 9.57
9.14: 7.42

These results are consistent with what I posted in comment 1.
Comment 13 norbert.janssen 2014-07-30 04:17:47 UTC
(In reply to Marcos H. Woehrmann from comment #12)
> Here are the times for the released versions of GhostPCL.  All times are in
> seconds and the average of 3 runs:
> 
> 9.05: 7.84
> 9.06: 7.60
> 9.10: 9.57
> 9.14: 7.42
> 
> These results are consistent with what I posted in comment 1.

Perhaps the difference I see can come from the fact that I'm running from Win7 (OS64bit, pcl6.exe=32bit). You are probably running on Linux.(In reply to Marcos H. Woehrmann from comment #12)
> Here are the times for the released versions of GhostPCL.  All times are in
> seconds and the average of 3 runs:
> 
> 9.05: 7.84
> 9.06: 7.60
> 9.10: 9.57
> 9.14: 7.42
> 
> These results are consistent with what I posted in comment 1.

Could it be that the difference in timing comes from Win7 (OS64bit, pcl6.exe -32bit) versus Linux?
Comment 14 Henry Stiles 2014-07-30 11:53:56 UTC
I imagine it must be a platform difference.  I assume you are testing our code without your modifications or the UFST, please confirm, and I'll look some more.

(In reply to norbert.janssen from comment #13)
> (In reply to Marcos H. Woehrmann from comment #12)
> > Here are the times for the released versions of GhostPCL.  All times are in
> > seconds and the average of 3 runs:
> > 
> > 9.05: 7.84
> > 9.06: 7.60
> > 9.10: 9.57
> > 9.14: 7.42
> > 
> > These results are consistent with what I posted in comment 1.
> 
> Perhaps the difference I see can come from the fact that I'm running from
> Win7 (OS64bit, pcl6.exe=32bit). You are probably running on Linux.(In reply
> to Marcos H. Woehrmann from comment #12)
> > Here are the times for the released versions of GhostPCL.  All times are in
> > seconds and the average of 3 runs:
> > 
> > 9.05: 7.84
> > 9.06: 7.60
> > 9.10: 9.57
> > 9.14: 7.42
> > 
> > These results are consistent with what I posted in comment 1.
> 
> Could it be that the difference in timing comes from Win7 (OS64bit, pcl6.exe
> -32bit) versus Linux?
Comment 15 Ken Sharp 2014-07-30 12:08:01 UTC
(In reply to Henry Stiles from comment #14)
> I imagine it must be a platform difference.  I assume you are testing our
> code without your modifications or the UFST, please confirm, and I'll look
> some more.

If needed I can run on Windows 7, 64-bit and do some tests
Comment 16 norbert.janssen 2014-07-31 04:43:13 UTC
(In reply to Ken Sharp from comment #15)
> (In reply to Henry Stiles from comment #14)
> > I imagine it must be a platform difference.  I assume you are testing our
> > code without your modifications or the UFST, please confirm, and I'll look
> > some more.
> 
> If needed I can run on Windows 7, 64-bit and do some tests

I indeed did the timings with just the plain build for pcl6_msvc.mak (no changes, except 1 line added to plmain.c(a pl_print_usage(&inst, "Finished: "), just befor the /* -- End Main loop -- */.
The OS is Win7 64bit, but the pcl6.exe is VS2012/VS2013 build for win32.
Comment 17 Henry Stiles 2014-08-18 17:22:26 UTC
(In reply to norbert.janssen from comment #2)
> I know 9.07-9.10 were slower, but I also see that 9.14 is 4% slower than
> 9.06 and 9.06 is 3% slower than 9.05
> I did run with the following commandline:
> -sDEVICE=bit -r600 -dNOPAUSE -dLeadingEdge=3 -sOutputFile=nul 25pages.pcl
> 
> It could be that the LeadingEdge=3 (equal to LongEdgeFeed paper) is causing
> this? But that is the mode we normally is it in our controller.
> 
> What I find strange is that the number of calls to show_proceed is 20% up in
> 9.14 w.r.t. 9.06 (see the profile-data on peeves).

The number of calls to show_proceed in 9.06 and 9.14 is 172,637.  There must be as many calls to show_proceed as there are characters the interpreter attempts to render.  It might be possible to have the same output and a different number of calls to show_proceed if some character were clipped (I'm assuming you are seeing the same result in 9.06 vs. 9.14).  Can you verify there is a call count difference by adding a global counter to show_proceed in each release and printing that value at the end of the pcl job.  Perhaps this is just some bug with the profiler sending us off on the wrong track.
Comment 18 norbert.janssen 2014-08-26 06:58:06 UTC
(In reply to Henry Stiles from comment #17)
> (In reply to norbert.janssen from comment #2)
> > I know 9.07-9.10 were slower, but I also see that 9.14 is 4% slower than
> > 9.06 and 9.06 is 3% slower than 9.05
> > I did run with the following commandline:
> > -sDEVICE=bit -r600 -dNOPAUSE -dLeadingEdge=3 -sOutputFile=nul 25pages.pcl
> > 
> > It could be that the LeadingEdge=3 (equal to LongEdgeFeed paper) is causing
> > this? But that is the mode we normally is it in our controller.
> > 
> > What I find strange is that the number of calls to show_proceed is 20% up in
> > 9.14 w.r.t. 9.06 (see the profile-data on peeves).
> 
> The number of calls to show_proceed in 9.06 and 9.14 is 172,637.  There must
> be as many calls to show_proceed as there are characters the interpreter
> attempts to render.  It might be possible to have the same output and a
> different number of calls to show_proceed if some character were clipped
> (I'm assuming you are seeing the same result in 9.06 vs. 9.14).  Can you
> verify there is a call count difference by adding a global counter to
> show_proceed in each release and printing that value at the end of the pcl
> job.  Perhaps this is just some bug with the profiler sending us off on the
> wrong track.

Back from a 4 weeks vacation ;)

I added a char_counter to show_proceed and I indeed get the same number of calls for show_proceed (both 9.06 and 9.14 172637 calls). So indeed it's strange that the FunctionSummary.csv on 9.06 gives a different count for show_proceed.

Doing the same for copy_portrait (according to *.csv 9.06 = 2568552, 9.14 = 3084317) this gives 9.06 = 5223352, 9.14 = 5001817 calls (so 4% less for 9.14, which was 4% slower than 9.06).

So it seems that even the profiler with instrumentation gives incorrect information and the profiler is indeed sending us off on the wrong track.

I need to reconsider on why I see a different trend in the timings for 9.04, 9.05, 9.06, 9.14 than you see.
Ps. I did not consider 9.07 timings because we did not use it in our product. And 9.10 was definitely worse in performance (due to new fapi which missed some caching?).
Comment 19 Chris Liddell (chrisl) 2014-08-26 07:28:42 UTC
(In reply to norbert.janssen from comment #18)
> I did not consider 9.07 timings because we did not use it in our
> product. And 9.10 was definitely worse in performance (due to new fapi which
> missed some caching?).

There were several reasons the new code in 9.10 was considerably slower - mainly because I had explicitly focused that first release on getting the output as correct as possible, rather than on performance.
Comment 20 norbert.janssen 2014-08-27 01:45:01 UTC
I like to investigate a different approach:
a) -dNumRenderingThreads (multithreaded rendering in bands)
b) deferred page rendering: interpreting next pages, while rendering current page) i.e. use a separate thread (not interpreter thread) to do the rendering.

however: a) gives worse performance.
obj/pcl6 -sDEVICE=bit -r600 -sOutputFile=nul -dNOPAUSE 25pages.pcl
=> 9.96 secs (for 47 pages).

obj/pcl6 -sDEVICE=bit -r600 -sOutputFile=nul -dNOPAUSE -dMaxBitmap=0 -dBandBufferSpace=4000000 -dBandHeight=1500 -dNumRenderingThreads=5 25pages.pcl
=> 13.79 secs (for 47 pages).

playing with BandHeight/NumRenderingThreads does not really help.
It looks like the (render)threads are not running in parallel. (windows task manager still shows only CPU corresponding to 1 core, i.e. on an 8 core just 13%)
can it be that the threads are locking each other (competing for resources/locks)?

can you confirm this?
Comment 21 norbert.janssen 2014-08-27 02:06:12 UTC
Extra info:
I did same for latest trunk, and it seems even slower (trunk=14.78 secs versus 9.14=13.36 secs) when doing multithreaded rendering.
using:
obj/pcl6 -sDEVICE=bit -r600 -dMaxBitmap=0 -dBandBufferSpace=4000000 -dBandHeight=1500 -dNumRenderingThreads=5 -sOutputFile=nul -dNOPAUSE 25pages.pcl

with: obj/pcl6 -sDEVICE=bit -r600 -sOutputFile=nul -dNOPAUSE 25pages.pcl
I get: trunk=9.77 sec, 9.14=9.95 sec (so slightly faster)
Comment 22 Ray Johnston 2014-09-06 09:41:19 UTC
On my Windows 7 Pro laptop, i7 2640M 2.8GHz, I am seeing the following:

NRT    sec.
 0    19.6
 1    18.6
 2    18.2
 3    18.3
 4    18.5

Monitoring with task manager, the "Performance" tab shows a very constant 25%,
which is consistent with Norbert's experience in that regard.

I am adding higher accuracy timing functions to the rendering threads (based
on "QueryPerformanceCounter" in order to analyze the various thread level
"wait" times during mutex and semaphore waits to try and analyze these issues.
Comment 23 Henry Stiles 2014-09-09 07:00:20 UTC
Best to have a bug per problem.  It looks like this has drifted out of my domain.  Assigning to Ray.
Comment 24 norbert.janssen 2014-09-18 10:47:19 UTC
(In reply to norbert.janssen from comment #21)
> Extra info:
> I did same for latest trunk, and it seems even slower (trunk=14.78 secs
> versus 9.14=13.36 secs) when doing multithreaded rendering.
> using:
> obj/pcl6 -sDEVICE=bit -r600 -dMaxBitmap=0 -dBandBufferSpace=4000000
> -dBandHeight=1500 -dNumRenderingThreads=5 -sOutputFile=nul -dNOPAUSE
> 25pages.pcl
> 
> with: obj/pcl6 -sDEVICE=bit -r600 -sOutputFile=nul -dNOPAUSE 25pages.pcl
> I get: trunk=9.77 sec, 9.14=9.95 sec (so slightly faster)

Note that the times measured for ghostpdl-9.14 of 9.95sec is without clist, so the extra time for the 13.36secs is probably the overhead needed for the clist-creation (in memory) and the thread-creation overhead.
From previous investigations (2-3 years ago), I know that I needed at least 3 threads to get below the 9.95 secs again.
This was with windows7; but at that time we had win-xp as platform, and there the performance became worse with more threads, so was no solution for us.
Comment 25 Ray Johnston 2014-09-23 08:06:03 UTC
The timing of 25pages.pcl (47 pages) is pretty much as expected.

Page mode is faster than clist mode with NumRenderingThreads=0 and using
NRT > 0 helps the rendering, up to the point where the overhead of using more
threads swamps the time each thread spends rendering.

I added 'parse done' time to the PCL code and note that when the total time
for the 47 pages is 15 seconds, the total time spent rendering with NRT=0
was 2.3 seconds (12.2 seconds were spent parsing, i.e. writing the clist).

Using NRT=2 (on my core i7 laptop) gets the overall time down to 13.5 seconds
by reducing the render time total to .65 seconds. The page mode time for this
file was 12.7 seconds.

The bottom line is that there isn't enough "work" to be done when rendering
this file to recoup the overhead time of creating the clist.

Closing this a WORKSFORME since it is working as expected, and there isn't
anything "funny" going on with multi-threaded rendering.