Bug 702914 - Font caching and performance when embedding Fonts
Status: RESOLVED INVALID
Alias: None
Product: Ghostscript
Classification: Unclassified
Component: PDF Writer
Version: 9.52
Hardware: PC Windows 7
Importance: P4 normal
Assignee: Ken Sharp
Reported: 2020-09-17 11:38 UTC by Hakan
Modified: 2021-08-17 15:50 UTC

Attachments
multipage input file (4.31 MB, application/pdf)
2020-09-17 11:38 UTC, Hakan

Description Hakan 2020-09-17 11:38:48 UTC
Created attachment 19835
multipage input file

Hello,

My question is whether font embedding performance with the pdfwrite device can be improved through more intelligent caching of the available font resources.

Tests below on Windows, 64bit.

This command line executes in *58* seconds and produces a lot of warnings that a font cannot be found. The warning is repeated for the same font file on each page, until the font is substituted intelligently.

C:\Program Files\gs\gs9.52\bin>gswin64c.exe -dNOSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dEmbedAllFonts=true -sFONTPATH="%windir%/fonts;" -dPDFSETTINGS=/prepress -dPassThroughJPEGImages=true -sOutputFile="c:\temp\output.pdf" "c:\temp\input.pdf"

When I remove the -sFONTPATH for testing purposes, the command below finishes in *18* seconds in the same environment.

C:\Program Files\gs\gs9.52\bin>gswin64c.exe -dNOSAFER -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -dEmbedAllFonts=true -dPDFSETTINGS=/prepress -dPassThroughJPEGImages=true -sOutputFile="c:\temp\output.pdf" "c:\temp\input.pdf"

In this case built-in fonts like Helvetica and Times Roman are used instead of TrueType fonts from the Windows system. To a degree this is fine, but there are many cases where non-standard fonts present in the %windir%/fonts folder must be loaded.

My question is if the caching and embedding of those Font resources is as efficient as it could be.

The test file in this example is just representative. I have many other files where font embedding with Windows fonts takes 5 to 10 minutes or more.

The intelligence of finding substitutes and the actual embedding itself work very well, my post is only asking about a possible performance improvement because I deal with very large files and this part proves to be the bottleneck.

I have prototype code with iText that does such font embedding in the range of milliseconds to a few seconds for the same files, but it is not as reliable as Ghostscript when 'strange' PDFs are used: it fails to replace some font streams. When it works, though, it is faster by a factor of 10 or more. So I am guessing that there is some room for improvement in the GS font embedding code without sacrificing quality and reliability.

In case it matters, my interest is limited to Windows (TTF) and Mac fonts and not the more exotic font types. I know the PDF format can host more than one type of font and that this is a complex issue in itself.

Thank you
Comment 1 Peter Cherepanov 2021-01-02 02:47:40 UTC
I confirm that caching of font resources in Ghostscript is limited to Type 1 fonts, although it appears to work for TrueType fonts too. gs_fonts.ps decides whether the font is cached as:
/FontType .findfontvalue { 1 eq } { //false } ifelse
My testing shows the following times:
all fonts global, no FONTPATH 13.8 s
all fonts local, no FONTPATH 18.3 s
all fonts global, FONTPATH 16.6 s
all fonts local, FONTPATH 46.2 s
as is, no FONTPATH 14.0 s
as is, FONTPATH 46.0 s
Comment 2 Ken Sharp 2021-08-14 09:04:10 UTC
(In reply to Hakan from comment #0)

 
> C:\Program Files\gs\gs9.52\bin>gswin64c.exe -dNOSAFER -dBATCH -dNOPAUSE
> -sDEVICE=pdfwrite -dEmbedAllFonts=true -sFONTPATH="%windir%/fonts;"
> -dPDFSETTINGS=/prepress -dPassThroughJPEGImages=true
> -sOutputFile="c:\temp\output.pdf" "c:\temp\input.pdf"
> 
> when I remove the -sFONTPATH for testing purposes, the command below
> finishes in *18* seconds on the same environment
> 
> C:\Program Files\gs\gs9.52\bin>gswin64c.exe -dNOSAFER -dBATCH -dNOPAUSE
> -sDEVICE=pdfwrite -dEmbedAllFonts=true -dPDFSETTINGS=/prepress
> -dPassThroughJPEGImages=true -sOutputFile="c:\temp\output.pdf"
> "c:\temp\input.pdf"
> 
> In this case Build-In Fonts like Helvetica and Times Roman are used instead
> of TrueType Fonts from the Windows system.  To a degree this is fine but
> there are many cases where Non-Standard Fonts, present in the %windir%/fonts
> folder must be loaded.
> 
> My question is if the caching and embedding of those Font resources is as
> efficient as it could be.
> 
> The Test file in this example is just representative. I have many other
> files where Font embedding with Windows Fonts take 5 to 10 Minutes or more.

You are asking Ghostscript to search the entire Windows Fonts directory for every instance of a missing font. Obviously this takes time. Lots of time, depending on how many fonts you have. Each font file must be opened and checked to find the font name; if you have hundreds of font files and many missing fonts, this process will take place many, many times.

Instead of searching the entire font directory, you could create a fontmap which lists a font name and a substitute font file to be used. Obviously this is much quicker to process.

But if you want to leave Ghostscript doing all the work for you, then you're going to have to accept the performance hit.

I don't see this as a bug, and there is already scope for the user to reduce the overhead.
Comment 3 Ken Sharp 2021-08-14 09:10:46 UTC
I forgot to mention that because fonts and other resources in PDF are looked up using the object number, whereas in PostScript they are referenced by name, we are forced to flush all fonts between pages in the current PDF interpreter.

This is in order to ensure that we get the correct font when PDF producers use non-unique names for the fonts in a PDF file, and that we don't end up reusing a font referenced from a prior page when a subsequent page uses a font with a different object number but the same name.
Comment 4 Ray Johnston 2021-08-14 17:47:50 UTC
As Ken says, you don't get anything for free, but you _can_ pre-load fonts
before running the PDF so that the save/restore performed for each page doesn't
"forget" the font.

I used the following in a file "preload.ps":

[ /Arial-BoldItalicMT   /Arial-BoldMT   /Arial-ItalicMT   /ArialMT
  /ArialNarrow   /ArialNarrow-BoldItalic   /CourierNewPSMT 
  /TimesNewRomanPS-BoldMT   /TimesNewRomanPSMT   /Verdana
]
{ findfont pop } forall

so the command line ends with "c:\temp\preload.ps" "c:\temp\input.pdf"

and with -sFONTPATH="C:/Windows/Fonts" it scans the FONTPATH only once and
then, on my system with a debug build, loads the fonts only once. The scan
of my Windows fonts reports:
  Scanning C:/Windows/Fonts for fonts... 1074 files, 882 scanned, 838 new fonts
then it loads the fonts BEFORE page 1.

With -sFONTPATH it takes 40 sec (versus 55 sec without the preload), and
without the preload or -sFONTPATH it takes 29 sec.

Note that without the -sFONTPATH and with the preload it takes 25 sec. This is
probably due to the processing of the TTF vs. Type 1 fonts.

Also note that even with my Windows/Fonts directory, which has 1074 files
containing 838 fonts that Ghostscript can use, the scan ALWAYS happens only
once (even without preload.ps) and takes only 0.86 sec.
Comment 5 Hakan 2021-08-15 08:45:43 UTC
Hello, 

thank you all for the comments. 

(In reply to Ken Sharp)
> I don't see this as a bug

I agree that this post is not a bug. That's why my post starts with...

My question is if something can be done to improve Font embedding performance when using pdfwrite device with more intelligent caching of available Font resources.

(In reply to Ken Sharp)
> You are asking Ghostscript to search the entire Windows Fonts directory for
> every instance of a missing font. Obviously this takes time. Lots of time,
> depending on how many fonts you have. Each font file must be opened and
> checked to find the font name, if you have hundreds of font files, and many
> missing fonts, then this process will take place many, many times.

I was hoping to find a way to do all, or as much as possible, of the time-consuming font file/name lookup work in advance and cache the results, because fonts do not get installed every day. The user knows when new fonts are installed, and font changes can also be detected in code.

> Instead of searching the entire font directory, you could create a fontmap
> which lists a font name and a substitute font file to be used. Obviously
> this is much quicker to process.

This is exactly what I was hoping to do. I don't know the steps to create a fontmap, but I would be very happy if I could trigger a GS command that reads the entire Windows font directory, or the paths supplied by the user, and creates a fontmap/cached file. The only purpose of this preprocessing stage would be to create a fontmap. This step would be run only on demand, or something like once a week, or when it is known that the system fonts have changed.

At the actual processing stage, I would like to pass in a link to the cached fontmap file - guiding GS to pull all font filenames instead of reading the windows directory again.

Is there a URL that points to documentation on how fontmap files are created and consumed? Or can someone kindly post an example here?

(In reply to Ray Johnston)

Thank you for the tip with "preload.ps"

> Also note that even with my Windows/Fonts directory which has 1074 files that
> has 838 fonts that Ghostscript can use, this ALWAYS only happens once (even
> without preload.ps) and only takes 0.86 sec.

I did not understand which command line or process takes only 0.86 sec. Is that an internal process?
Comment 6 Ken Sharp 2021-08-15 12:06:08 UTC
(In reply to Hakan from comment #5)

> > Instead of searching the entire font directory, you could create a fontmap
> > which lists a font name and a substitute font file to be used. Obviously
> > this is much quicker to process.
> 
> This is exactly what I was hoping to do. I don't know the steps to create a
> fontmap but I would be very happy if I could trigger a GS command that reads
> the entire Windows Font directory or the users paths given and creates a
> fontmap/cached file. The purpose of this preprocessing stage would be only
> to create a fontmap. This step would be done only on demand, or something
> like once a week, or when it is known that system fonts changed.
> 
> At the actual processing stage, I would like to pass in a link to the cached
> fontmap file - guiding GS to pull all font filenames instead of reading the
> windows directory again.
> 
> Is there a URL that guides me to documentation how fontmap files are created
> and consumed ? Or can someone kindly post an example here ?

https://www.ghostscript.com/doc/9.54.0/Fonts.htm#About

Or the doc directory of the Ghostscript installation.

Ghostscript comes supplied with an example Fontmap.GS and cidfmap, both of which contain comments documenting their use and syntax.
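
For illustration only, a minimal Fontmap sketch might look like the following. Each entry maps a PostScript font name either to a font file or to another font name (an alias), and ends with a semicolon. The font names are taken from the preload list earlier in this thread; the file names under C:/Windows/Fonts are assumptions and need to be checked against the actual installation.

```postscript
% Fontmap sketch: /FontName (path to font file) ;
% The .ttf file names below are assumed, not verified for any given system.
/ArialMT             (C:/Windows/Fonts/arial.ttf)    ;
/Arial-BoldMT        (C:/Windows/Fonts/arialbd.ttf)  ;
/TimesNewRomanPSMT   (C:/Windows/Fonts/times.ttf)    ;
/Verdana             (C:/Windows/Fonts/verdana.ttf)  ;

% An alias entry maps one font name onto another name:
/Arial               /ArialMT                        ;
```

A file like this would normally be selected with the -sFONTMAP= switch described in the Ghostscript documentation (Use.htm and Fonts.htm); the exact search behaviour should be confirmed there for the version in use.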
Comment 7 Ray Johnston 2021-08-15 16:27:21 UTC
Sorry if my timing information was not clear.

To get the timing of the initial steps, I added "usertime =" lines to the
preload.ps. With a 'release' build on my laptop, scanning takes 0.25 seconds
and then loading the 10 selected fonts takes another 0.14 seconds.

With the release build, the entire run takes 13.9 seconds with the preload.

Without the preload on my laptop it takes 21.9 seconds.

This 0.39 seconds is the only time that the creation of a Fontmap would save
since the extra time (when 'preload' is not used) is actually caused by the
repeated reloading of the fonts as pages are processed. The FONTPATH is only
scanned ONCE to load all the fonts that can be used into an internal Fontmap
that has the path to the file containing the font for each font.

---- preload.ps with timing info printed ----
%!PS
usertime pop	% start the usertime
/X findfont pop	% load a font that doesn't exist to force scanning FONTPATH
usertime =
[ /Arial-BoldItalicMT /Arial-BoldMT /Arial-ItalicMT /ArialMT
  /ArialNarrow /ArialNarrow-BoldItalic /CourierNewPSMT
  /TimesNewRomanPS-BoldMT /TimesNewRomanPSMT /Verdana
]
{ findfont pop } forall
usertime =
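
The two "usertime =" lines above print raw clock readings in milliseconds, so each phase's cost is the difference between readings. A small variation, sketched here with the same font list, stores the readings and prints the per-phase intervals directly. The names t0, t1 and t2 are illustrative only and are not part of the original preload.ps.

```postscript
%!PS
/t0 usertime def            % clock reading before any font lookup
/X findfont pop             % nonexistent font forces the FONTPATH scan
/t1 usertime def            % reading after the scan
[ /Arial-BoldItalicMT /Arial-BoldMT /Arial-ItalicMT /ArialMT
  /ArialNarrow /ArialNarrow-BoldItalic /CourierNewPSMT
  /TimesNewRomanPS-BoldMT /TimesNewRomanPSMT /Verdana
]
{ findfont pop } forall     % load each font once, before page 1
/t2 usertime def            % reading after the fonts are loaded
(FONTPATH scan, ms: ) print t1 t0 sub =
(font loading, ms:  ) print t2 t1 sub =
```

The font list should be replaced with the fonts that the input files actually reference.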
Comment 8 Hakan 2021-08-17 10:28:14 UTC
Following Ray's advice did indeed provide the most benefit.

58 seconds without preload on the supplied input file drops to 20 seconds with preload.ps and the correct list of fonts.

Creating and experimenting with the fontmap files themselves did not provide a significant measurable benefit.

Thank you very much
Comment 9 Henry Stiles 2021-08-17 14:41:00 UTC
We've seen several reports from you, Hakan. Can you tell us how Ghostscript is being used? What product do you support?
Comment 10 Hakan 2021-08-17 15:50:41 UTC
(In reply to Henry Stiles from comment #9)
> We've seen several reports from you Hakan, can you tell us how Ghostscript
> is being used.  What product do you support?

Hello Henry, I would be happy to answer. I have responded privately via email.
Best Regards