Bug 702509 - Possible v1.17 regression in fz_new_stext_page_from_page function (getting text and character positions)
Status: RESOLVED DUPLICATE of bug 701977
Alias: None
Product: MuPDF
Classification: Unclassified
Component: fitz
Version: unspecified
Hardware: PC Windows 10
Importance: P4 normal
Assignee: MuPDF bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-21 21:22 UTC by willus
Modified: 2020-12-15 01:17 UTC

See Also:
Customer:
Word Size: ---


Attachments
Zip archive w/attached files (161.57 KB, application/x-zip-compressed)
2020-06-21 21:22 UTC, willus
Outputs from mudraw.exe on chartest2.pdf file (54.41 KB, application/zip)
2020-06-22 18:24 UTC, willus
mudraw -Fstext chartest2.pdf output w/v1.17, FT v2.10.2, compiled w/old engine switch (91.03 KB, text/xml)
2020-06-23 03:13 UTC, willus
mutool draw -Fstext chartest2.pdf (90.29 KB, application/xml)
2020-06-23 09:47 UTC, Tor Andersson

Description willus 2020-06-21 21:22:38 UTC
Created attachment 19360
Zip archive w/attached files

Running the fz_new_stext_page_from_page() function on the attached PDF file in MuPDF v1.17 gives strange results: it seems to "invent" spaces inside many of the words.  MuPDF v1.14 (the last version I had compiled before this) gave a much more normal result with the identical PDF file, with no such issues.  I'd appreciate it if somebody at Artifex could verify this.

Attachments included in the zip archive:
1. The main C function I used to get the char positions--I apologize it's not self-contained.
2. The PDF file I used
3. The output when compiled w/mupdf v1.14
4. The output when compiled w/mupdf v1.17

This is compiled on a Windows 10 system, 64-bit, with MinGW / gcc v9.3.1.
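
One rough way to quantify the "invented spaces" symptom is to rebuild each line of text from the structured-text XML and measure how many tokens are suspiciously short. The sketch below is only an illustration: it assumes the stext XML contains line elements whose char descendants carry the character in a c attribute, and the SAMPLE string, the function names, and the max_len threshold are all hypothetical (nothing here is taken from the attachments).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the ASSUMED stext XML shape (line > font > char with a
# "c" attribute). It imitates the word "helpful" with two spurious spaces.
SAMPLE = """<page>
<block><line>
<font name="F1" size="9">
<char c="h"/><char c=" "/><char c="e"/><char c=" "/>
<char c="l"/><char c="p"/><char c="f"/><char c="u"/><char c="l"/>
</font>
</line></block>
</page>"""


def line_texts(xml_text):
    """Return one string per <line>, built by joining the c attribute of its chars."""
    root = ET.fromstring(xml_text)
    return [
        "".join(ch.get("c", "") for ch in line.iter("char"))
        for line in root.iter("line")
    ]


def short_token_ratio(text, max_len=2):
    """Fraction of whitespace-separated tokens that are at most max_len characters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) <= max_len) / len(tokens)


if __name__ == "__main__":
    for text in line_texts(SAMPLE):
        print(repr(text), round(short_token_ratio(text), 2))
```

A noticeably higher ratio for one build than for another on the same PDF would be a hint, not proof, that spaces are being inserted inside words.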
Comment 1 Tor Andersson 2020-06-22 14:28:31 UTC
The output of "mutool draw -Fstext chartest2.pdf" looks perfectly reasonable to me. We have run into bugs in certain versions of FreeType that affect this though. See bug 701977 for an example.

Which version of FreeType did you link against? If you have linked against a system version of FreeType with that bug present, that could explain the problems.
Comment 2 willus 2020-06-22 16:09:57 UTC
Thank you for investigating so quickly.  I linked against FreeType 2.10.2, released 8 May 2020.  I take it you recommend the FreeType package included with the MuPDF v1.17 source distribution (2.10.0)?  It seems strange that 2.10.2 would introduce a new bug over 2.10.0, but I will try the exact package that is in the MuPDF v1.17 distro.
Comment 3 Tor Andersson 2020-06-22 16:21:46 UTC
The crucial part here is that the copy in thirdparty/freetype builds with a compiler flag that works around the bug in FreeType.

*** This bug has been marked as a duplicate of bug 701977 ***
Comment 4 willus 2020-06-22 18:24:38 UTC
Created attachment 19366
Outputs from mudraw.exe on chartest2.pdf file
Comment 5 willus 2020-06-22 18:27:58 UTC
I added a new attachment.  I built mudraw.exe from MuPDF v1.14 with FreeType 2.9.1 and FreeType 2.10.2 and then from MuPDF v1.17 with the same two FreeType versions.  In all cases, I get the anomalous results only with MuPDF v1.17.  Mudraw from MuPDF v1.14 worked fine with either FreeType package, including the most recent FreeType package (v2.10.2).  I must be compiling v1.17 differently somehow...different flag?  Any clues on where to focus my efforts would be appreciated.  Could it be in how I built up the font source files?

Can you post the output from your mutool -Fstext run?
Comment 6 willus 2020-06-22 20:52:17 UTC
FYI, when I copied my font data files from my MuPDF v1.14 build folder to my MuPDF v1.17 build folder (Droid*.c, Nimbus*.c, Standard*.c, Dingbats.c), the result changed--it got better (fewer inserted spaces), but not as good as the straight MuPDF v1.14 results.
Comment 7 willus 2020-06-23 03:11:42 UTC
Hi--sorry for all the additional comments, Tor--and I really appreciate your help.  I followed the discussion of the other bug report (701977) to the FreeType bug report where you recommended compiling FreeType with T1_CONFIG_OPTION_OLD_ENGINE=1.  This definitely helped in my case but still didn't give the exact same results as MuPDF v1.14 for my test file (the blocks/lines were organized slightly differently--see attached).  Did you use this option (T1_CONFIG_OPTION_OLD_ENGINE=1) to build your version of mutool?  Is that why you got good results with it?  Again, I'd like to see a post of your mutool -Fstext results on my test file so I can compare to it.

I would also like to understand why MuPDF v1.14 doesn't seem to be affected by the freetype bug.
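
When comparing the text output of two builds, it can help to separate "same characters, different spacing or line/block organization" from "different characters". The following is a minimal sketch of that idea using only the Python standard library; the function names and the sample strings are hypothetical and are not the actual mutool output.

```python
import difflib


def strip_ws(text):
    """Remove all whitespace so only the non-space characters are compared."""
    return "".join(text.split())


def compare_extractions(old, new):
    """Return (same_chars, diff_lines) for two extracted-text strings.

    same_chars is True when the two texts differ at most in whitespace,
    i.e. in spacing or in how lines and blocks were split.
    diff_lines is a unified diff of the two texts, line by line.
    """
    same_chars = strip_ws(old) == strip_ws(new)
    diff_lines = list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
    return same_chars, diff_lines


if __name__ == "__main__":
    # Hypothetical stand-ins for the output of two different builds.
    old = "helpful comments.\nIf you consider yourself"
    new = "help ful comments.\nIf you consider yourself"
    same, diff = compare_extractions(old, new)
    print("only whitespace differs:", same)
    print("\n".join(diff))
```

If same_chars is True, the remaining diff lines point only at spacing and line-breaking differences, which is where inserted spaces and changed block/line grouping would both show up.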
Comment 8 willus 2020-06-23 03:13:37 UTC
Created attachment 19367
mudraw -Fstext chartest2.pdf output w/v1.17, FT v2.10.2, compiled w/old engine switch
Comment 9 Tor Andersson 2020-06-23 09:42:59 UTC
So, the bug I referenced before was fixed, but there's another related bug that this file exposes. I've reported it upstream here:

https://savannah.nongnu.org/bugs/index.php?58646

Until that bug is fixed you'll have to build freetype using the T1_CONFIG_OPTION_OLD_ENGINE=1 option.

The heuristics for structured text extraction are continuously tweaked and improved, so I'm not surprised you're getting slightly different results from mupdf 1.14 and 1.17 even with the same FreeType.

MuPDF 1.14 was built against an older version of FreeType that did not have these bugs. These bugs were introduced when they added a special fast metrics-only type1 decoder that's used to get the font metrics only without parsing the outlines.
Comment 10 Tor Andersson 2020-06-23 09:44:30 UTC
Here is the output from the latest mupdf source using a FreeType with T1_CONFIG_OPTION_OLD_ENGINE=1.

$ mutool draw -Ftext chartest2.pdf 

This section is inserted by LATEX; you do not insert it. You
just add the names and information in the \additionalauthors
command at the start of the document.

A.6
References
Generated by bibtex from your .bib file. Run latex, then
bibtex, then latex twice (to resolve references) to create the
.bbl file. Insert that
.bbl file into the .tex source file and
comment out the command \thebibliography.

B.
MORE HELP FOR THE HARDY
The sig-alternate.cls file itself is chock-full of succinct and
helpful comments.
If you consider yourself a moderately
experienced to expert user of LATEX, you may find reading
it useful but please remember not to change it.
Comment 11 Tor Andersson 2020-06-23 09:47:08 UTC
Created attachment 19371
mutool draw -Fstext chartest2.pdf
Comment 12 willus 2020-06-23 14:13:27 UTC
Tor--thank you for patiently answering my questions.  Very helpful!  I will say it is a bit frustrating to me that text selection/outline info hasn't stabilized more reliably.  Like you said, it seems to be continually getting tweaked and often with unintended consequences.  It is probably the thing I have fought with the most whenever I have moved to a new release of MuPDF (and, correspondingly, FreeType).  But I cannot complain too much.  MuPDF is an amazingly complete and helpful library which I am very pleased to be able to use.  Thanks to you and the team for all of the hard work.
Comment 13 spambin 2020-12-15 01:14:22 UTC
Tor

Since both first and second freetype issues
 were fixed in freetype 10.2.2 and 10.2.3
 are you able to sync the MuPDF lib to the latest system version?

Reason I raise this is that SumatraPDF is following your lead
 and applying your fix to more static 10.2.0

see discussion in https://github.com/sumatrapdfreader/sumatrapdf/issues/1632
Comment 14 spambin 2020-12-15 01:17:08 UTC
(In reply to spambin from comment #13)
> Tor
> 
> Since both first and second freetype issues
>  were fixed in freetype 10.2.2 and 10.2.3
>  are you able to sync MuPDf lib to latest system version ?
> 
> Reason I raise this is that SumatraPDF is following your lead
>  and applying your fix to more static 10.2.0
> 
> see discussion in https://github.com/sumatrapdfreader/sumatrapdf/issues/1632

Sorry those should have read 2.10.#