Bug 695778 - Long delay converting .ps file to .pdf file
Summary: Long delay converting .ps file to .pdf file
Status: RESOLVED FIXED
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.14
Hardware: PC Linux
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2015-01-07 09:04 UTC by madbiologist
Modified: 2015-01-13 07:03 UTC



Attachments
Original djvu file (6.24 MB, image/vnd.djvu)
2015-01-07 09:04 UTC, madbiologist
PostScript file, compressed (13.63 MB, application/x-bzip)
2015-01-07 09:39 UTC, Till Kamppeter

Description madbiologist 2015-01-07 09:04:39 UTC
Created attachment 11405
Original djvu file

While investigating the problem of evince taking an hour and twenty minutes to print the attached djvu file, it was discovered that if it is printed to a PS file it prints quite quickly, but then using ps2pdf to convert the .ps file to a .pdf file takes 1 hour and 17 minutes.

Running the "top" command while the ps2pdf conversion is in progress shows that gs from the ubuntu USER is at the top of the list with 100% use of one of my four CPU cores (single CPU state/separate CPU states can be toggled by pressing "1" while top is running). SHaRed memory was fixed at 3776 KiB, while VIRTual memory size and RESident memory size steadily increased to over 100000 KiB each after 10 minutes and continued to steadily increase to over 200000 KiB before the conversion completed after 1 hour and 17 minutes.

Unless I am mistaken, gs is ghostscript. This occurs on Ubuntu 14.04 "Trusty Tahr" with ghostscript 9.10~dfsg-0ubuntu10.2 and on Ubuntu 14.10 "Vivid Vervet" with ghostscript 9.14~dfsg-0ubuntu4.
Comment 1 madbiologist 2015-01-07 09:20:24 UTC
The resulting .PS file is too big to attach here, so you will have to generate it using Evince.

Some further comments by Till Kamppeter:

"Sending the PostScript file directly to a native PostScript printer, for
example using the command

nc -w1 <IP address of printer> 9100 < Grammar\ 4.ps

leads to a printout in a reasonable time (2-3 pages quickly one after
each other, then 5 seconds pause, 2-3 pages again and so forth).

This means that the PostScript file produced by evince is awkward, but Ghostscript is really TOO slow, meaning that there is room for improvement/fixing in Ghostscript.

Independent of this, the djvu software used by evince needs improvement,
too, first it should generate better, easier to process PostScript, and
second it should be able to directly convert djvu into PDF, as PDF is
THE standard format for printable documents under all operating systems
(please consider reporting an upstream bug here, too)."
Comment 2 Ray Johnston 2015-01-07 09:25:48 UTC
We need either a PS file or a PDF file. This file type is not supported.

If you can't make the PS file you test with available, we will have to close
this as INVALID.
Comment 3 Till Kamppeter 2015-01-07 09:39:42 UTC
Created attachment 11406
PostScript file, compressed

I am attaching the mentioned PostScript file, bzip2-compressed to be accepted by the server.
Comment 4 Ken Sharp 2015-01-12 09:22:44 UTC
The PostScript is pretty nearly a pathological case for pdfwrite. It seems that for every page a new (type 3) font is created, and at the end of the page specifically discarded (this is almost unheard of in PostScript programming). The page, which seems to be originally a bitmap, is then reconstructed by drawing each 'glyph' (in reality a bitmap). This includes all the page 'furniture' such as boxes and images, as well as actual text.

The glyphs are shown using the 'glyphshow' operator, which is ordinarily a rarely used operator (though this is the second Linux application which makes extensive use of it that I've seen). Basically this is laziness on the part of the PostScript producer. Rather than produce properly encoded fonts and use the various show operators, they just pull glyphs directly from a huge font.

Now for PostScript that's fine, and although it's lazy and ugly it will work. The problem for PDF is that there is *NO* equivalent to glyphshow in PDF. This means comparisons against PostScript rendering aren't useful.

The basic problem is that fonts in PDF *must* be accessed by an Encoding, which limits them to 255 glyphs, while the glyphshow operator can use arbitrarily large fonts. So we need to create multiple PDF fonts to reproduce the PostScript usage. I see approximately 1500 fonts being created for 150 pages.
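
As a rough illustration of the splitting described above, the sketch below assigns the n-th distinct glyph of a large font to one of several smaller fonts and to a character code within it. It assumes the 255-glyph limit quoted above and a simple fill-in-order policy. The names glyph_slot, fonts_needed and slot_for_glyph are hypothetical and are not taken from pdfwrite, which may split fonts differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: where a glyph lands once a big font is split up. */
typedef struct {
    size_t font_index;  /* which of the generated PDF fonts holds the glyph */
    unsigned int code;  /* character code within that font */
} glyph_slot;

/* Number of fonts needed to hold glyph_count distinct glyphs when each
 * font has slots_per_font usable codes (assumed to be 255 here). */
static size_t fonts_needed(size_t glyph_count, size_t slots_per_font)
{
    assert(slots_per_font > 0);
    /* Round up: a partly filled font still counts as a font. */
    return (glyph_count + slots_per_font - 1) / slots_per_font;
}

/* Map the n-th distinct glyph (0-based, in order of first use) to a
 * font and a code, filling each font completely before starting the next. */
static glyph_slot slot_for_glyph(size_t n, size_t slots_per_font)
{
    glyph_slot s;
    assert(slots_per_font > 0);
    s.font_index = n / slots_per_font;
    s.code = (unsigned int)(n % slots_per_font);
    return s;
}
```

Under these assumptions, a page that draws 2,000 distinct bitmaps as glyphs would need fonts_needed(2000, 255), that is 8 fonts, and the 256th distinct glyph would become code 0 of the second font.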

Now because glyphshow is an 'unusual' operator, we start by assuming that we are capturing an ordinary type 3 font use, which results in us creating a CharProc with no ID. Later (because we discover this is a glyphshow) we have to delete that CharProc (and create a new one in a different font). Unfortunately, because we created it without an ID, we have to search the entire list of stored CharProcs to find the one we want to delete from that list. As that list grows longer (and remember, we have 1500+ fonts to look at), the search time grows longer and longer. A few pages proceed quickly, middling numbers start to get slow (from about 10 pages or so) and beyond about 50 or so pages it starts to get excruciating. Profiling the code shows that the majority of the time is spent in a loop searching for the CharProc we want to delete.

I have a potential solution which works by adding a temporary ID to the CharProc, and then short-circuiting the search. This improves the time taken to run the entire file from ~84 minutes to ~10 minutes. Checking against Adobe Acrobat I find that it takes ~9 minutes for the same task.
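
The effect of short-circuiting a list search can be sketched in C as follows. This is not the Ghostscript source: the charproc type and both function names are hypothetical, and the list is assumed to be a singly linked list with unique IDs. Each function returns the number of nodes it visited so that the difference in cost is visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a stored CharProc; not Ghostscript's real type. */
typedef struct charproc {
    long id;                /* temporary ID assigned when the entry is created */
    struct charproc *next;  /* next entry in the list */
} charproc;

/* Unlink the entry with the given ID but keep walking to the end of the
 * list, so every call costs as much as the list is long. */
static size_t remove_full_scan(charproc **head, long id)
{
    size_t visited = 0;
    charproc **link = head;
    assert(head != NULL);
    while (*link != NULL) {
        visited++;
        if ((*link)->id == id)
            *link = (*link)->next;  /* unlinked, but the loop continues */
        else
            link = &(*link)->next;
    }
    return visited;
}

/* Same removal, but stop as soon as the target has been unlinked. */
static size_t remove_short_circuit(charproc **head, long id)
{
    size_t visited = 0;
    charproc **link = head;
    assert(head != NULL);
    while (*link != NULL) {
        visited++;
        if ((*link)->id == id) {
            *link = (*link)->next;
            break;                  /* short-circuit: nothing left to find */
        }
        link = &(*link)->next;
    }
    return visited;
}
```

If new entries are added at the head of the list (an assumption here, not something stated in this report), the entry to delete is usually the most recent one, so the short-circuit version visits one node where the full scan visits all of them.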

The 'short-circuit' check should have been in place already and wasn't, so I want to look at some more places in the code which might benefit from the same optimisation. Assuming that I don't also find any problems I will commit this change tomorrow.

Given the (horrible) nature of the PostScript I don't see any scope for further improvement in performance with this file.
Comment 5 Ray Johnston 2015-01-12 09:35:37 UTC
Just a side comment. Since djvu does text with glyphshow rather than doing
OCR and producing a valid encoding, then using show, the text in the
resulting PDF is totally unsearchable and copy/paste from the PDF will be
garbage.

I don't know if this djvu file has the optional "OCR" layer, but if so, it
isn't making it into the PostScript, since that would have to use show and
define the appropriate Encoding.
Comment 6 Till Kamppeter 2015-01-12 12:09:02 UTC
Madbiologist, if you have not yet done so, please also report a bug to djvu upstream (or whatever software turns djvu to PS in evince) asking for less awkward PS output and also for direct PDF output.
Comment 7 Ken Sharp 2015-01-13 03:23:18 UTC
Commit 3e7115492c378ffa324c0a083244a785a6a61f82 addresses this issue as described in comment #4. Reviewing the other cases didn't reveal any other obvious instances which needed addressing.

For me this is now about 8 times faster than it was, and is broadly the same speed as Acrobat. I don't see any scope for significant improvement on this.
Comment 8 madbiologist 2015-01-13 06:37:42 UTC
The top utility seems to show that evince itself turns djvu to PS, although the printing to a PS file happens fairly quickly so I might have missed it.  I have filed https://bugzilla.gnome.org/show_bug.cgi?id=742559 about evince producing an awkward/nearly pathological PS file.

I have filed https://bugzilla.gnome.org/show_bug.cgi?id=742561 for evince being unable to directly print/convert a djvu file to a PDF file.
Comment 9 madbiologist 2015-01-13 06:53:57 UTC
Thanks Ken for the quick fix.  Although it is unfortunate to have to wait 10 minutes for a document to convert/print, it is bearable.  I'll take an 88% speed improvement any day :)  Hopefully someone improves evince's PS output as quickly as you have improved ghostscript's ability to deal with it.  Well done, and thanks.

Will this patch be in Ghostscript 9.16?  And when is Ghostscript 9.16 scheduled for release?
Comment 10 Ken Sharp 2015-01-13 07:03:39 UTC
(In reply to madbiologist from comment #9)
> Thanks Ken for the quick fix.  Although it is unfortunate to have to wait 10
> minutes for a document to convert/print, it is bearable.

Well, performance is now broadly similar to Acrobat Distiller, and a profile doesn't reveal any other hot spots, so I tend to conclude that's probably about as good as it gets for files of this nature. It is faster to print to a PostScript printer; the fact that PDF has no equivalent of glyphshow is the major problem.


> speed improvement any day :)  Hopefully someone improves evince's PS output
> as quickly as you have improved ghostscript's ability to deal with it.  Well
> done, and thanks.

No problem, it's always nice to do performance improvements. The fact that the routine to clean up resources was checking all resources of that type, even after it had found its target (!), was obviously a long-standing bug.

> 
> Will this patch be in Ghostscript 9.16?

Unless some reason crops up to remove it before release. It's such a big gain for this kind of file I'd try hard to find a work-around even if it does break something (which I consider unlikely).


>  And when is Ghostscript 9.16
> scheduled for release?

GS is released at approximately 6-monthly intervals; the last release was September 2014, so the next release should be in March this year.