Bug 690232

Summary: indeterminate rendering of 13-03 and 13-12.PS
Product: Ghostscript Reporter: Ralph Giles <ralph.giles>
Component: Graphics LibraryAssignee: Marcos H. Woehrmann <marcos.woehrmann>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: master   
Hardware: PC   
OS: Linux   
Customer: Word Size: ---
Attachments: thir.tar.bz2 - valgrind logs
memdebug.patch

Description Ralph Giles 2009-01-15 14:38:22 UTC
Since we started the automated cluster regressions, the files ps3cet/13-03.PS
and ps3cet/13-12.PS have shown inconsistent rendering. This can be difficult to
reproduce on a single machine, but tests on the cluster have shown some 12
distinct output files generated by multiple runs.

Since the perpetual occurrence of these file in the cluster regression reports
is distracting from real rendering differences do the code changes, the files
have been temporarily removed from the suite. They should be put back after this
bug is resolved.
Comment 1 Ralph Giles 2009-01-15 14:52:12 UTC
Raising the priority since this affects the quality of our regression testing.

Assigning to Alex for initial investigation.
Comment 2 Alex Cherepanov 2009-01-17 06:25:19 UTC
Created attachment 4729 [details]
thir.tar.bz2 - valgrind logs

Valgrind generates many warnings with this command line

valgrind ./gs/debugobj/gs -r600 -dNOPAUSE -dBATCH -sDEVICE=ppmraw
-sOutputFile=/dev/null - <../13-12.PS >13-12.log 2>&1

but similar warnings occur in other files that don't show indeterminism.
Comment 3 Ralph Giles 2009-01-17 12:07:53 UTC
In case it makes a difference, the actual cluster test command line is:

./bin/gs -q -I$HOME/fonts -dQUIET -dNOPAUSE -dBATCH -K1000000
-dMaxBitmap=30000000 -sDEVICE=ppmraw -r600
-sOutputFile="|md5sum>/tmp/13-12.PS.ppmraw.600.md5" -dJOBSERVER
%rom%Resource/Init/gs_cet.ps - < 13-12.PS

(files outside the ps3cet directory run with -dSAFER and withouth gs_cet.ps)
Comment 4 Ken Sharp 2009-01-18 02:37:33 UTC
I looked at these some time back and concluded that many (possibly all) are due
to single pixel differences on glyphs, usually on curves or flexes. I suspect
that these are differences in hinting, probably caused by uninitialised
variables in the hinting routines.

Ralph did produce a valgrind output for me using the cluster regression setup, I
can mail it to you if you are interested Alex.
Comment 5 Alex Cherepanov 2009-01-18 08:15:57 UTC
I found that on peeves.ghostscript.com an attempt to enumerate fonts
fails and get reported by the test.
(*) {pop} 128 string /Font resourceforall
The same configuration works fine on my home box.

Ken, please compress and attach your Valgrind and stdout logs.
Comment 6 Henry Stiles 2009-01-22 09:31:21 UTC
FWIW, the pcl jobs that fail on the cluster with inconsistent rendering only
occur w/ optimization -O2 (any linux platform).  By that I don't mean a
production build is required, but `make debug XCFLAGS=-O2' is sufficient to
reproduce a problem.
Comment 7 Ray Johnston 2009-01-22 09:42:41 UTC
Assign to Marcos to see if this is related to -O2 
Comment 8 Marcos H. Woehrmann 2009-01-22 16:20:11 UTC
I've spent some time on this and am having trouble reproducing the indeterminism.  I ran 13-03.PS and 
13-12.PS 100 times each on my AMD64 Linux box, with gshead (r9386) and -O2 all md5sums matched.  I 
tried the same thing on my iMac and those matched as well.

What's the frequency of different results on the cluster?  Are the differences confined to some particular 
page?
Comment 9 Ralph Giles 2009-01-22 18:48:43 UTC
I had the same problems reproducing it locally, although I often got different
results on different machines. Did the md5sum match between your mac and linux
systems?

It's easy to reproduce even on a single node of the cluster, however. From 100
runs on red node n122118:

$ for i in `seq 100`; do bin/gs -q -sDEVICE=ppmraw -r600 -o
"|md5sum>>13-03.PS.md5" %rom%Resource/Init/gs_cet.ps
~/tests_private/ps/ps3cet/13-03.PS.disabled; done
$ cat 13-03.PS.md5 | sort | uniq
1a74a02bbb33248d86031e712fc92436  -
247c7e29985595dc6da379326f763bdc  -
33975a3aa80fb37673242c72b8832b01  -
b2c76fbd90fb79b6ae579fb59ea3f6d5  -
b6c5533d48e99f6801be272f58906769  -
ff03b1911c4093258f1f453617b454cd  -
Comment 10 Marcos H. Woehrmann 2009-01-30 20:58:21 UTC
I still haven't found problem this but have spent so much time on it that I felt obligated to list my 
findings:

Optimization doesn't seem to affect this problem (-O2 and -O0 fail at the same rate).

At 600 DPI the problem is evident, at 300 DPI it also occurs but very infrequently.

The only differences in the output appears to be in glyph rendering.  I've added lots of debugging 
statements to gstype1.c and even when the output is identical different paths are sometimes taken 
through that code.

I can't get this to happen on my Linux box (amd64) nor my iMac (Intel).  

I've been concentrating on the file 13-03.PS.  I've edited this file and disabled several of the pages:
> /DoDefaultFlag false def
> /DoSyntaxFlag true  def
> /DoRangeFlag false def
> /DoIllegalFlag false def
> /DoGstateFlag false def
> /DoSpecialFlag false def

This reduces the number of pages generated to 4.  The later the page the more differences (page 4 has 
the most differences, page 1 never has differences).  If I set DoSyntaxFlag to false the test is reduced to 
3 pages (pages 2, 3, and 4), none of which ever show any differences.

I believe the problem is somehow related to memory management (perhaps garbage collection).  I've 
been experimenting with command line parameters and if I use the following:

-dBufferSpace=50000000 -dMaxBitmap=50000000 -c "20000000 setvmthreshold"

the number of differences goes down.  If I make the numbers bigger the frequency of differences go up 
(I haven't isolated which parameter(s) are significant).  Note that the frequency of differences is 
generally small (~10% of the runs show a difference on page 4) but has a reasonably high variability, so 
it's possible that the command line parameter effect isn't real.

It's hard to debug this; running under gdb would be of no help since I won't know if I'm in wrong or 
right case (I'm defining "right" as the common case).  Any suggestions are appreciated.
Comment 11 Ken Sharp 2009-01-31 01:12:49 UTC
> The only differences in the output appears to be in glyph rendering.  
> I've  added lots of debugging 
> statements to gstype1.c and even when the output is identical different paths 
> are sometimes taken 
> through that code.

That in itself sounds like a problem to me, the code path shouldn't be
non-deterministic either. 

Because this is essentially impossible to debug (doesn't fail reliably), my
approach was going to be to fix up the warnings from valgrind and see if
anything got better. I still can't really think of a better solution.
Comment 12 Marcos H. Woehrmann 2009-01-31 09:26:43 UTC
Thanks for the suggestion, but I tried that early on and it didn't make any difference.

I'm running with "-Z@$?", but that doesn't make any difference.
Comment 13 Ray Johnston 2009-01-31 11:07:02 UTC
Created attachment 4772 [details]
memdebug.patch

Besides setting the vmthreshold to a high value, you could also try -dNOGC
to see if that eliminates the differences.

Note that -Z@$? only works for a DEBUG build, unless you modify the code in
base/gsmdebug.h (for -Z@) and psi/ilocate.c (for -Z?) and psi/store.h (for
-Z?).

This patch will enable those debugging flags for a non-DEBUG build.

(ARCH_SIZEOF_PTR is arbitrary -- any 'defined' symbol that is commonly present 

will suffice)
does this.
Comment 14 Marcos H. Woehrmann 2009-02-05 20:58:35 UTC
Based on our discussion I tried the following:

Running on the green cluster instead of red -- no difference
Running the gs binary compiled on my amd64 machine on the cluster -- no difference
Removing the Init/gs_cet.ps before the test file -- no difference

Comment 15 Marcos H. Woehrmann 2009-02-05 21:08:10 UTC
So far I've isolated the problem to the gx_bits_cache_alloc() routine in gxbcache.c (it's not in this code, 
but this is the code that exhibits the indeterminism).  Here's the relevant portion of the code (with lots 
of debug statements):

if_debug0('k', "[k]mhw 53\n");
    while (fsize < ssize1 && fsize != ssize) {
if_debug4('k', "[k]mhw 54 %ld %ld %ld %ld\n",(long) fsize,(long) ssize1,(long) fsize,(long) ssize);
        if (!cb_head_is_free(cbh_next)) {       /* Ask the caller to free the entry. */
            if (fsize)
                cbh->size = fsize;
            *pcbh = cbh_next;
            return -1;
        }
        fsize += cbh_next->size;
if_debug1('k', "[k]mhw 55  %ld\n",(long) cbh_next->size);
        if_debug2('K', "[K]merging free bits 0x%lx(%u)\n",
                  (ulong) cbh_next, cbh_next->size);
        cbh_next = (gx_cached_bits_head *) ((byte *) cbh + fsize);
if_debug0('k', "[k]mhw 56\n");
    }
if_debug0('k', "[k]mhw 57\n");
    if (fsize > ssize) {        /* fsize >= ssize1 */
if_debug0('k', "[k]mhw 58\n");


And here the two different paths through this code:

[k]mhw 53
[k]mhw 54 0 2284 0 2276
[k]mhw 55  364
[k]mhw 56
[k]mhw 54 364 2284 364 2276
[k]mhw 55  364
[k]mhw 56
[k]mhw 54 728 2284 728 2276
[k]mhw 55  372
[k]mhw 56
[k]mhw 54 1100 2284 1100 2276
[k]mhw 55  380
[k]mhw 56
[k]mhw 54 1480 2284 1480 2276
[k]mhw 55  380
[k]mhw 56
[k]mhw 54 1860 2284 1860 2276
[k]mhw 55  388
[k]mhw 56
[k]mhw 54 2248 2284 2248 2276
[k]mhw 55  396
[k]mhw 56
[k]mhw 57
[k]mhw 58


and


[k]mhw 53
[k]mhw 54 0 2284 0 2276
[k]mhw 55  2996
[k]mhw 56
[k]mhw 57
[k]mhw 58


Comment 16 Henry Stiles 2009-02-06 08:09:30 UTC
So that code shows the font cache is different between runs, but font cache
entries can be relocated because of memory... so this may be "deterministic". 
If that's true, a prediction would be -dNOCACHE should make the problem go away.
 Debug switch -Zm with different runs might tell us why the font order in the
cache is different. 
Comment 17 Marcos H. Woehrmann 2009-02-06 12:00:18 UTC
I haven't tired -dNOCACHE yet but it seems likely that will make a difference since I have discovered that 
changing

  #define bmax_LARGE 500000      /* bmax - space for cached chars */

to

  #define bmax_LARGE 5000000      /* bmax - space for cached chars */

in gsfont.c makes the indeterminism go away. 
Comment 18 Marcos H. Woehrmann 2009-02-06 12:28:57 UTC
Confirmed, -dNOCACHE makes the code deterministic.  Now to figure out why :-)  And why this is an 
issue only on the cluster!
Comment 19 Marcos H. Woehrmann 2009-02-07 23:02:23 UTC
The problem is in gxccman.c; specifically the initializing of pair->hash in gx_add_fm_pair():

    pair->hash = (uint) (ulong) pair % 549;  /* scramble bits */

On my amd64 box the pair pointer is deterministic, on the cluster it's not.  

To demonstrate this I've instrumented the code thus:

    pair->hash = (uint) (ulong) pair % 549;  /* scramble bits */
    if_debug2('k', "[k] mhw 199 %d 0x%08lx\n",pair->hash,(ulong) pair);
    pair->mxx = mxx, pair->mxy = mxy;

If I run the code on my amd64 box multiple times, saving stdout to log.1..log.n, and then:

  foreach i ( log.* )
  grep "mhw 199" $i | head -1
  end

I get:

  [k] mhw 199 523 0x017adc78
  [k] mhw 199 523 0x017adc78
  [k] mhw 199 523 0x017adc78
  [k] mhw 199 523 0x017adc78
  [k] mhw 199 523 0x017adc78

On the cluster the same experiment produces:

  [k] mhw 199 537 0x09ee3524
  [k] mhw 199 391 0x092a5524
  [k] mhw 199 91 0x09c20524
  [k] mhw 199 517 0x098a8524
  [k] mhw 199 505 0x094eb524
  [k] mhw 199 49 0x0a9d9524


Comment 20 Marcos H. Woehrmann 2009-02-09 20:32:28 UTC
Proposed patch:

Index: gsfont.c
===============================================================
====
--- gsfont.c	(revision 9460)
+++ gsfont.c	(working copy)
@@ -269,6 +269,7 @@
     pdir->san = 0;
     pdir->global_glyph_code = NULL;
     pdir->text_enum_id = 0;
+    pdir->hash = 42;
     return pdir;
 }
 
Index: gxfcache.h
===============================================================
====
--- gxfcache.h	(revision 9460)
+++ gxfcache.h	(working copy)
@@ -300,6 +300,8 @@
     uint enum_index;		/* index (N) */
     uint enum_offset;		/* ccache.table[offset] is N'th non-zero entry */
 
+    uint hash;
+
     /* User parameter AlignToPixels. */
     bool align_to_pixels;
 
Index: gxccman.c
===============================================================
====
--- gxccman.c	(revision 9460)
+++ gxccman.c	(working copy)
@@ -333,9 +333,8 @@
 	return code;
     }
     pair->FontType = font->FontType;
-    /* The OSF/1 compiler doesn't like casting a pointer to */
-    /* a shorter int.... */
-    pair->hash = (uint) (ulong) pair % 549;	/* scramble bits */
+    pair->hash = (uint) dir->hash % 549;
+    dir->hash += 371;
     pair->mxx = mxx, pair->mxy = mxy;
     pair->myx = myx, pair->myy = myy;
     pair->num_chars = 0;
Comment 21 Marcos H. Woehrmann 2009-02-10 08:42:43 UTC
*** Bug 690233 has been marked as a duplicate of this bug. ***
Comment 22 Marcos H. Woehrmann 2009-02-10 09:41:27 UTC
Fixed with r9466: http://www.ghostscript.com/pipermail/gs-cvs/2009-February/009043.html
Comment 23 Marcos H. Woehrmann 2009-02-10 10:24:12 UTC
*** Bug 690234 has been marked as a duplicate of this bug. ***