Created attachment 18645 [details]
slow rendering of a pdf

Originally reported against SumatraPDF: https://github.com/sumatrapdfreader/sumatrapdf/issues/1030

The attached PDF is a rather complex, single-page vector file, but Chrome (i.e. Pdfium, i.e. Foxit) renders it reasonably fast (just a few seconds). Even pdf.js (https://mozilla.github.io/pdf.js/web/viewer.html) renders it in a few seconds. MuPDF never finishes (I let it run for minutes). Looks like an infinite loop somewhere: it keeps doing things but never finishes fz_run_display_list.

Here's a profiling sample:

Function Name                        Total CPU [unit, %]
+ SumatraPDF.exe (PID: 11464)        5682 (100.00%)
| - fz_draw_stroke_path              5648 (99.40%)
| - fz_run_display_list              5648 (99.40%)
| - fz_stroke_path                   5648 (99.40%)
| - PdfEngineImpl::RenderBitmap      5648 (99.40%)
| - RenderCache::RenderCacheThread   5648 (99.40%)
| - fz_knockout_begin                2878 (50.65%)
| - fz_knockout_end                  2757 (48.52%)
| - memcpy_repmovs                   1871 (32.93%)
| - fz_copy_pixmap_rect              1871 (32.93%)
| - fz_blend_pixmap_knockout         1819 (32.01%)
| - fz_blend_knockout                1817 (31.98%)
| - [External Call] RtlFreeHeap       938 (16.51%)
| - _free_base                        938 (16.51%)
| - fz_drop_pixmap_imp                938 (16.51%)
| - fz_drop_storable                  938 (16.51%)
| - fz_clear_pixmap                   696 (12.25%)
| - memset_repmovs                    694 (12.21%)
| - do_scavenging_malloc              311 (5.47%)
| - fz_new_pixmap                     311 (5.47%)
| - fz_new_pixmap_with_data           311 (5.47%)
| - _malloc_base                      310 (5.46%)
I take back the "infinite" part. It finished after probably more than 10 minutes.
Some analysis done and results committed here:
https://github.com/GerHobbelt/mupdf/commit/165d4088305fefc4880acf6f5535254cd8b83d5c

---

As described in the commit message below, I was looking for ways to make MuPDF page rendering abortable and time-limited. The PDF attached by the OP has been used for profiling the local build of a (patched) MuPDF codebase which, I believe, translates 1:1 to the mainline codebase.

The profiling results do not offer a true solution (even though a 30+ minute run was cut down to deliver in 10 seconds only), but they MAY hopefully help the maintainers, who understand this code better than I do, decide what might be worth doing about it.

Copy of the commit message (diff attached as patch file):

----

# Work done re https://bugs.ghostscript.com/show_bug.cgi?id=701945

(Using the profiler (MSVC2019, Win10/64), it turns out that the *big* CPU consumers are the fz_knockout_begin() and fz_knockout_end() calls, clocking in at a lump sum of >92% of total CPU used.)

The goal of this work was to find out if (and how) I could get MuPDF to render any page within a given time frame and, if that is not possible, to at least render *something* without barfing a hairball via fz_throw(), even though the result will be 'corrupt'. The goal here is to first check this and make it possible, so I will know where to patch the code properly later on when I want this behaviour: interruptible and/or time-limited renders which give a **warning** without crashing through a C#/.NET interface layer, thanks to setjmp/longjmp in fz_try/fz_throw/fz_catch.

To that end, this HACKY CODE HACK has been done:

- First there was the `do_not_edit` flag in the major CPU-guzzling routine fz_run_display_list() @ >98% CPU.
- Then I added code to turn that flag on after a bit of time has elapsed.
  **Not** wanting to hammer the OS and eat up lots of CPU myself, the time check was limited to once per 100+ progress points: `if (prev_progress + 100 <= progress && !do_not_draw)`
- Then I checked which costly calls could be safely skipped.
  + Turns out, alas, that simply setting the `clipped` flag, while it *looked* nice on the surface, immediately started to bomb the code with fz_throw() exceptions due to failing checks, so that was/is a no-go.
  + Turns out anything that's a group, and thus expected to maintain some sort of push/pop stack, is out of the question for a direct skip in the big switch() statement.
  + As over 90% of the CPU cost was expended in the fz_fill_path() and fz_stroke_path() functions, and more specifically in the fz_knockout_begin()/fz_knockout_end() calls therein, it was quickly attempted to hack the triggering BLENDMODE (bad idea, not just because the struct is undefined there, but also because a patched blendmode would affect other code chunks, as there are plenty of other checks for blendmode elsewhere in the code!).
- After having tested with the `do_not_draw` variable as a *flag* to turn off fz_fill_path() and fz_stroke_path() entry, the next step was to tweak: the `do_not_edit` flag was turned into a counter, and simple modulo code (as shown in this commit) was added to ensure at least *some* of the elements were rendered, so you'd see *something*. Knowing this was b0rked (not matching the source PDF any more), showing anything quickly is still better than nothing, given the aforementioned goal.
- Then we went for a more thorough approach: we *could* patch the fz_context struct and thus propagate the `do_not_draw` flag/counter to almost anywhere in the code.
  + This was done, and the knockout begin/end calls have been disabled, resulting in a *significant* drop in total render time.
    Still, the example PDF would take far more than 60 seconds to render (**debug build with no optimizations!**).
  + NOTE the removal of the superfluous blendmode check in fz_knockout_begin() -- while the check in the fz_knockout_end() call SHOULD remain --: it has been ascertained that all calls to fz_knockout_begin() and _end() are prefixed by that blendmode check. The performance gain is of course trivial, but I was at it anyway, patching those conditionals to include the new ctx->do_not_draw flag check.
- The next round of profiling, where fill_path and stroke_path are severely restricted using the do_not_draw logic, showed a ~60/30% division of CPU cost between the remaining fz_fill_path() and fz_stroke_path() calls on the one hand (60%) and the fz_end_group() call on the other (30%). Going through the code and **rigorously** conditionally disabling all knockout begin/end calls took care of that 60%.
- Next round, it was the fz_begin_group and fz_end_group calls which consumed the bulk of the CPU time.
  + I ran a series of tests to see what'd happen if I dialed the draw/fill calls back up, now that the new logic for disabling the knockout begin/end code was in place. This was done by tweaking the init code for `dnd_modulus` so as to deliver a modulus of zero in the end, as it turned out that the grouping code was now the big consumer anyhow.
  + This performance bottleneck is taken care of with the patch in source/fitz/device.c. YES, I know this is probably going to ruin other PDFs which *need* the grouping, but it turns out that the rendered image for the test PDF is quite nice with all those grouping calls shut down. Render times are now down to 10+ seconds for a fully instrumented debug build on my i7 hardware.
- Further inspection of a new round of profiler runs shows me that the low-hanging fruit is *fini* by now: the costly calls are several now (with the sample PDF now rendering its entire image using all 2M+ path nodes!),
  and it turns out to be another 60/30 division, this time between the collective of fz_convert_rasterizer() --> fz_convert_gel() calls @ 60%+, while the rest is gobbled up by the PDF parser: pdf_process_stream @ 35% --> pdf_lex() @ 22% + pdf_process_keyword() @ 12%. A drill-down into pdf_lex() reveals lex_number() @ 14%, while the next one in there is lex_white() @ 2.1%. Drill-down into lex_number() --> fz_atof() @ 9%, which is divided between fz_strtof() @ 5% and, surprisingly, errno @ 3.2%.

Alas, loads of fun, but there's little I can do about this, as I don't see how I could "improve" fz_convert_gel(): the cost is pretty much distributed among various calls in there. (fz_scan_convert_aa() takes almost all of it, but that unfolds into blit_aa @ 2.9%, undelta_aa @ 2.1%, even_odd_aa @ 1.2%, etc., and nothing that's obvious to me to "speed hack" there.)

Conclusion:

- Number one among the costly calls are the fz_knockout_begin() and fz_knockout_end() calls here. I don't know how many PDFs will trigger that KNOCKOUT blending mode, but good golly, Miss Molly! What percentages!
- Number two (which only becomes visible on the radar when you have killed number one *totally* and *utterly*) is the combo of fz_begin_group() and fz_end_group().
- When you take out those grouping calls as well, the third runner-up is fz_convert_gel() (by way of being called by fz_convert_rasterizer() from various places).
- The last bit of cost is then at the very end: the stuff you cannot do without when doing *anything* with a PDF: the parser. As usual, when a parser shows up on the radar of a profiler run it's actually the *lexer* that carries the cost, but then again: we have to realize that a complex geo map drawing with 2M+ path nodes in a PDF will take a serious number of *numbers* to lex.
;-)

Also, again, *do realize* that the ranking of these costly calls is such that in that 1-2-3 list above, *nothing* shows up on the radar until you have killed the topmost still-active item, so focusing on improving the PDF parser performance would be **utterly ludicrous**.

This entire drama started at a 30+ minute run time (instrumented debug build, aborted!!!) and was brought down to ~10 seconds in the end, while as luck would have it, the disabled/skipped code did not very noticeably damage the rendered output of THIS PDF. Of course, YMMV. Those groupings and blending modes aren't in there for no reason!

The attached patch represents the end state of this story/hack. The repo where this was done is at the GitHub link at the top of this message.

P.S.: I found out by code inspection that there's a cookie->abort flag, which might be useful for my second goal: interruptible => abortable page renderings! However, do note that flipping that flag will cause an error to be fz_throw()n -- flip it in mupdf while running the app in the debugger and you'll get an alert box and a blank screen!
Created attachment 19161 [details] HACKY patch to show performance gains in MuPDF
- time ~/x/scm/git/mupdf/build/release/mutool draw -o 1.png /home/malc/x/1/pdf/ENR6.3-9.slow.rendering.pdf 1
page /home/malc/x/1/pdf/ENR6.3-9.slow.rendering.pdf 1
~/x/scm/git/mupdf/build/release/mutool draw -o 1.png  1  2345.96s user 0.44s system 99% cpu 39:15.47 total

mupdf from 4ea8f996e4be0e1eb05c9559da3409d6a32277a8
The issue here is that fz_draw_fill_path and fz_draw_stroke_path call fz_knockout_begin() and fz_knockout_end() thousands of times.

As an experiment I disabled them and the file rendered quickly. That's not the solution, but maybe there's a heuristic to avoid those operations, as both pdf.js and pdfium render this quickly.

Here's my disable test:

diff --git a/mupdf/source/fitz/draw-device.c b/mupdf/source/fitz/draw-device.c
index d8e7b6481..64114af66 100644
--- a/mupdf/source/fitz/draw-device.c
+++ b/mupdf/source/fitz/draw-device.c
@@ -652,7 +652,7 @@ fz_draw_fill_path(fz_context *ctx, fz_device *devp, const fz_path *path, int eve
 	if (fz_flatten_fill_path(ctx, rast, path, ctm, flatness, bbox, &bbox))
 		return;

-	if (state->blendmode & FZ_BLEND_KNOCKOUT)
+	if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
 		state = fz_knockout_begin(ctx, dev);

 	eop = resolve_color(ctx, &op, color, colorspace, alpha, color_params, colorbv, state->dest, dev->overprint_possible);
@@ -675,7 +675,7 @@ fz_draw_fill_path(fz_context *ctx, fz_device *devp, const fz_path *path, int eve
 		fz_convert_rasterizer(ctx, rast, even_odd, state->group_alpha, colorbv, 0);
 	}

-	if (state->blendmode & FZ_BLEND_KNOCKOUT)
+	if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
 		fz_knockout_end(ctx, dev);
 }

@@ -715,7 +715,7 @@ fz_draw_stroke_path(fz_context *ctx, fz_device *devp, const fz_path *path, const
 	if (fz_flatten_stroke_path(ctx, rast, path, stroke, ctm, flatness, linewidth, bbox, &bbox))
 		return;

-	if (state->blendmode & FZ_BLEND_KNOCKOUT)
+	if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
 		state = fz_knockout_begin(ctx, dev);

 	eop = resolve_color(ctx, &op, color, colorspace, alpha, color_params, colorbv, state->dest, dev->overprint_possible);
@@ -757,7 +757,7 @@ fz_draw_stroke_path(fz_context *ctx, fz_device *devp, const fz_path *path, const
 	printf("\n");
 #endif

-	if (state->blendmode & FZ_BLEND_KNOCKOUT)
+	if (0 && state->blendmode & FZ_BLEND_KNOCKOUT)
 		fz_knockout_end(ctx, dev);
 }
Fixed with:

commit b649763a6fde8af79c0055b2cc2ea4f89d4a9c85
Author: Robin Watts <Robin.Watts@artifex.com>
Date:   Thu Feb 16 19:09:56 2023 +0000

    Bug 701945: Avoid knockout processing for alpha=1.

    If alpha=1 then knockout has no effect. This enables us
    to save a lot of work.

Thanks for the report, and analysis.