Bug 701945 - infinitely slow rendering of a PDF file
Summary: infinitely slow rendering of a PDF file
Status: RESOLVED FIXED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: 1.16.0
Hardware: PC All
Importance: P4 normal
Assignee: MuPDF bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-24 01:27 UTC by Krzysztof Kowalczyk
Modified: 2023-02-19 01:11 UTC
4 users

See Also:
Customer:
Word Size: ---


Attachments
slow rendering of a pdf (3.44 MB, application/pdf)
2019-11-24 01:27 UTC, Krzysztof Kowalczyk
Details
HACKY patch to show performance gains in MuPDF (18.50 KB, application/mbox)
2020-04-21 20:44 UTC, Ger Hobbelt
Details

Description Krzysztof Kowalczyk 2019-11-24 01:27:48 UTC
Created attachment 18645 [details]
slow rendering of a pdf

Originally reported against SumatraPDF: https://github.com/sumatrapdfreader/sumatrapdf/issues/1030

The attached PDF is a rather complex, single-page vector file, but Chrome (i.e. Pdfium, i.e. Foxit) renders it reasonably fast (just a few seconds). Even pdf.js (https://mozilla.github.io/pdf.js/web/viewer.html) renders it in a few seconds.

MuPDF never finishes (I let it run for minutes). Looks like an infinite loop somewhere: it keeps doing things but never finishes fz_run_display_list.

Here's a profiling sample:

Function Name	Total CPU [unit, %]
 + SumatraPDF.exe (PID: 11464)	5682 (100.00%)
| - fz_draw_stroke_path	5648 (99.40%)
| - fz_run_display_list	5648 (99.40%)
| - fz_stroke_path	5648 (99.40%)
| - PdfEngineImpl::RenderBitmap	5648 (99.40%)
| - RenderCache::RenderCacheThread	5648 (99.40%)
| - fz_knockout_begin	2878 (50.65%)
| - fz_knockout_end	2757 (48.52%)
| - memcpy_repmovs	1871 (32.93%)
| - fz_copy_pixmap_rect	1871 (32.93%)
| - fz_blend_pixmap_knockout	1819 (32.01%)
| - fz_blend_knockout	1817 (31.98%)
| - [External Call] RtlFreeHeap	938 (16.51%)
| - _free_base	938 (16.51%)
| - fz_drop_pixmap_imp	938 (16.51%)
| - fz_drop_storable	938 (16.51%)
| - fz_clear_pixmap	696 (12.25%)
| - memset_repmovs	694 (12.21%)
| - do_scavenging_malloc	311 (5.47%)
| - fz_new_pixmap	311 (5.47%)
| - fz_new_pixmap_with_data	311 (5.47%)
| - _malloc_base	310 (5.46%)
Comment 1 Krzysztof Kowalczyk 2019-11-24 01:34:04 UTC
I take back the "infinite" part. It finished after probably more than 10 minutes.
Comment 2 Ger Hobbelt 2020-04-21 20:43:05 UTC
Some analysis done and results committed here: https://github.com/GerHobbelt/mupdf/commit/165d4088305fefc4880acf6f5535254cd8b83d5c

---

As described in the commit message below, I was looking for ways to make MuPDF page rendering abortable and time-limited. The PDF attached by the OP was used for profiling a local build of a (patched) MuPDF codebase, which, I believe, translates 1:1 to the mainline codebase.

The profiling results do not offer a true solution (even though a 30+ minute run was cut down to about 10 seconds), but they may help the maintainers, who understand this code better than I do, decide what might be worth doing about it.

Copy of the commit message (diff attached as patch file):

----

# Work done re https://bugs.ghostscript.com/show_bug.cgi?id=701945 :: 

(Using the profiler (MSVC2019, Win10/64), it turns out that the *big* CPU consumers are the fz_knockout_begin() and fz_knockout_end() calls, clocking in at a combined >92% of total CPU used.)



The goal of this work was to find out if (and how) I could get MuPDF to render any page within a given time frame and, if that is not possible, to at least render *something* -- even though it will be 'corrupt' -- without barfing a hairball via fz_throw().

The goal here is to first check this and make it possible, so I will know where to patch the code properly later on when I want this behaviour:
interruptible and/or time-limited renders which give a **warning** without crashing through a C# .NET interface layer thanks to the setjmp/longjmp in fz_try/fz_throw/fz_catch.



To that end, this HACKY CODE HACK has been done:

- first there was the `do_not_edit` flag in the major CPU guzzling 
  routine fz_run_display_list() @ >98% CPU.

- then I added code to turn that flag on after a bit of time has elapsed. 

  So as **not** to hammer the OS and eat up lots of CPU myself, the time 
  check was limited to once per 100+ progress points: 

  `if (prev_progress + 100 <= progress && !do_not_draw)`

- then I checked which costly calls could be safely skipped.

  + turns out, alas, that simply setting the `clipped` flag, while it
    *looked* nice on the surface, immediately started to bomb the code 
    with fz_throw() exceptions due to failing checks, so that was/is 
    a no-go.

  + turns out anything that's a group and thus expected to maintain 
    some sort of push/pop stack, is out of the question for a direct 
    skip in the big switch() statement.

  + as over 90% of the CPU cost was expended in the fz_fill_path() and 
    fz_stroke_path() functions, and more specifically in the
    fz_knockout_begin/fz_knockout_end calls therein, I quickly
    attempted to hack the triggering BLENDMODE (bad idea, not just 
    because the struct is undefined there, but also because a patched
    blendmode would affect other code chunks, as there are plenty of 
    other checks for blendmode elsewhere in the code!)

- after having tested with the `do_not_draw` variable as a *flag* to 
  turn off fz_fill_path() and fz_stroke_path() entry, the next step 
  was a tweak:

  the `do_not_edit` flag was turned into a counter, and simple 
  modulo code (as shown in this commit) was added to ensure at least 
  *some* of the elements were rendered, so you'd see *something*. 
  I knew this was b0rked (no longer matching the source PDF), but 
  showing anything quickly is better than nothing, given the 
  aforementioned goal.

- then we went for a more thorough approach: we *could* patch the 
  fz_context struct and thus propagate the `do_not_draw` flag/counter 
  to almost anywhere in the code

  + this was done and the knockout begin/end calls have been disabled,
    resulting in a *significant* drop in total render time. Still 
    the example PDF would take far more than 60 seconds to render 
    (**debug build with no optimizations!**)

  + NOTE the removal of the superfluous blendmode check in 
    knockout_begin() -- while the check in the knockout_end() call 
    SHOULD remain: it has been ascertained that all calls to 
    knockout_begin and _end are prefixed by that blendmode 
    check. The performance gain is of course trivial, but I was at it 
    anyway while patching those conditionals to include the new 
    ctx->do_not_draw flag check.

- the next round of profiling, where fill_path and stroke_path are 
  severely restricted using the do_not_draw logic, showed a 
  ~ 60/30% division of CPU cost between the remaining fz_fill_path() 
  and fz_stroke_path() calls on the one hand (60%) and the 
  fz_end_group() call on the other hand (30%). 
  Going through the code and *rigorously* conditionally disabling 
  all knockout begin/end calls took care of that 60%.

- next round, it was the fz_group_begin and fz_group_end calls 
  that consumed the bulk of the CPU time.

  + I ran a series of tests to see what'd happen if I dialed the 
    draw/fill calls back up, now that the new logic for disabling 
    the knockout begin/end code was in place. 
    This was done by tweaking the init code for `dnd_modulus` so 
    as to deliver a modulus of zero in the end, as it turned out 
    that the grouping code was now the big consumer anyhow.

  + This performance bottleneck is taken care of with the patch 
    in source/fitz/device.c

    YES, I know this is probably going to ruin other PDFs which 
    *need* the grouping, but it turns out that the rendered image 
    for the test PDF is quite nice with all those grouping calls 
    shut down. 

    Render times are now down to 10+ seconds for a fully instrumented 
    debug build on my i7 hardware.

- further inspection of a new round of profiler runs shows me 
  that the low-hanging fruit is *fini* by now: the cost is spread 
  over several calls now (with the sample PDF now rendering its 
  entire image using all 2M+ path nodes!) and turns out to be 
  another 60/30 division, this time between the collective of 
  fz_convert_rasterizer() --> fz_convert_gel() calls @ 60%+, while 
  the rest is gobbled by the PDF parser: 
  pdf_process_stream @ 35% --> pdf_lex() @ 22% 
                             + pdf_process_keyword() @ 12%, 
  while drilldown into pdf_lex reveals lex_number() @ 14%, 
  and the next one in there is lex_white() @ 2.1%. 
  Drilldown into lex_number() --> fz_atof() @ 9%, which is divided 
  up between fz_strtof() @ 5% and, surprisingly, errno @ 3.2%.
  
  Alas, loads of fun, but there's little I can do about this, as I 
  don't see how I could "improve" fz_convert_gel(): the cost is pretty 
  much distributed among various calls in there. (fz_scan_convert_aa() 
  takes almost all of it, but that unfolds into 
  blit_aa @ 2.9%, undelta_aa @ 2.1%, even_odd_aa @ 1.2%, etc., and 
  nothing there is an obvious "speed hack" candidate to me.)

Conclusion:

- number one costly calls are the fz_knockout_begin() and 
  fz_knockout_end() calls here. 

  I don't know how many PDFs will trigger that KNOCKOUT blending 
  mode, but good golly, Miss Molly! What percentages!

- number two (which only becomes visible on the radar when you have 
  killed number one *totally* and *utterly*) is the combo of 
  fz_group_begin() and fz_group_end()

- when you take out those grouping calls as well, then the third 
  runner-up is fz_convert_gel() (by way of being called by 
  fz_convert_rasterizer() from various places)

- the last bit of cost is then at the final end: the stuff you 
  cannot do without when doing *anything* with a PDF: the parser. 

  As usual on most systems when a parser shows up on the radar 
  of a profiler run, it's actually the *lexer* that's the cost, 
  but then again: we have to realize that a 2M+ path nodes' 
  complex geo map drawing in a PDF will take a serious number 
  of *numbers* to lex. ;-)


Also, again, *do realize* that the ranking of these costly calls 
is such that in that level 1,2,3 list, *nothing* shows up on the radar 
when you haven't killed the topmost still-active item, so focusing 
on improving the PDF parser performance would be **utterly ludicrous**.



This entire drama started at a 30+ minute run time (instrumented debug build, aborted!!!) and was brought down to ~10 seconds in the end, while, as luck would have it, the disabled/skipped code did not very noticeably damage the rendered output of THIS PDF.

Of course, YMMV. 

Those groupings and blending modes aren't in there for no reason!

The attached patch represents the end state of this story / hack. Repo where this was done is at the github link at the top of this msg.




P.S.: 

I found out by code inspection that there's a cookie->abort flag, which might be useful for my second goal: interruptible => abortable page renderings!

However, do note that fiddling with that flag will cause an error to be fz_throw()n -- flip that flag in mupdf while running the app in the debugger and you'll get an alert box and a blank screen!

Comment 3 Ger Hobbelt 2020-04-21 20:44:27 UTC
Created attachment 19161 [details]
HACKY patch to show performance gains in MuPDF
Comment 4 clamky 2023-02-18 17:14:52 UTC
- time ~/x/scm/git/mupdf/build/release/mutool draw -o 1.png /home/malc/x/1/pdf/ENR6.3-9.slow.rendering.pdf 1
page /home/malc/x/1/pdf/ENR6.3-9.slow.rendering.pdf 1
~/x/scm/git/mupdf/build/release/mutool draw -o 1.png  1  2345.96s user 0.44s system 99% cpu 39:15.47 total

mupdf from 4ea8f996e4be0e1eb05c9559da3409d6a32277a8
Comment 5 Krzysztof Kowalczyk 2023-02-18 18:02:08 UTC
The issue here is that fz_draw_fill_path and fz_draw_stroke_path call fz_knockout_begin() and fz_knockout_end() thousands of times.

As an experiment I disabled them and the file rendered quickly.

That's not the solution, but maybe there's a heuristic to avoid those operations, as both pdf.js and pdfium render this quickly.


Here's my disable test:

diff --git a/mupdf/source/fitz/draw-device.c b/mupdf/source/fitz/draw-device.c
index d8e7b6481..64114af66 100644
--- a/mupdf/source/fitz/draw-device.c
+++ b/mupdf/source/fitz/draw-device.c
@@ -652,7 +652,7 @@ fz_draw_fill_path(fz_context *ctx, fz_device *devp, const fz_path *path, int eve
        if (fz_flatten_fill_path(ctx, rast, path, ctm, flatness, bbox, &bbox))
                return;

-       if (state->blendmode & FZ_BLEND_KNOCKOUT)
+       if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
                state = fz_knockout_begin(ctx, dev);

        eop = resolve_color(ctx, &op, color, colorspace, alpha, color_params, colorbv, state->dest, dev->overprint_possible);
@@ -675,7 +675,7 @@ fz_draw_fill_path(fz_context *ctx, fz_device *devp, const fz_path *path, int eve
                fz_convert_rasterizer(ctx, rast, even_odd, state->group_alpha, colorbv, 0);
        }

-       if (state->blendmode & FZ_BLEND_KNOCKOUT)
+       if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
                fz_knockout_end(ctx, dev);
 }

@@ -715,7 +715,7 @@ fz_draw_stroke_path(fz_context *ctx, fz_device *devp, const fz_path *path, const
        if (fz_flatten_stroke_path(ctx, rast, path, stroke, ctm, flatness, linewidth, bbox, &bbox))
                return;

-       if (state->blendmode & FZ_BLEND_KNOCKOUT)
+       if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
                state = fz_knockout_begin(ctx, dev);

        eop = resolve_color(ctx, &op, color, colorspace, alpha, color_params, colorbv, state->dest, dev->overprint_possible);
@@ -757,7 +757,7 @@ fz_draw_stroke_path(fz_context *ctx, fz_device *devp, const fz_path *path, const
        printf("\n");
 #endif

-       if (state->blendmode & FZ_BLEND_KNOCKOUT)
+       if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
                fz_knockout_end(ctx, dev);
 }
Comment 6 Robin Watts 2023-02-19 01:11:16 UTC
Fixed with:

commit b649763a6fde8af79c0055b2cc2ea4f89d4a9c85
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Thu Feb 16 19:09:56 2023 +0000

    Bug 701945: Avoid knockout processing for alpha=1.

    If alpha=1 then knockout has no effect. This enables us to save
    a lot of work.

Thanks for the report, and analysis.