Created attachment 19360 [details] Zip archive w/attached files
Running the fz_new_stext_page_from_page() function on the attached PDF file in MuPDF v1.17 gives strange results--it seems to "invent" spaces inside many of the words. MuPDF v1.14 (the last version I had compiled before) gave a much more normal result with this identical PDF file--no such issues. I'd appreciate it if somebody at Artifex could verify this.
Attachments included in the zip archive:
1. The main C function I used to get the char positions--I apologize it's not self-contained.
2. The PDF file I used
3. The output when compiled w/MuPDF v1.14
4. The output when compiled w/MuPDF v1.17
This is compiled on a Windows 10 system, 64-bit, with MinGW / gcc v9.3.1.
The output of "mutool draw -Fstext chartest2.pdf" looks perfectly reasonable to me. We have run into bugs in certain versions of FreeType that affect this though. See bug 701977 for an example. Which version of FreeType did you link against? If you have linked against a system version of FreeType with that bug present, that could explain the problems.
Thank you for investigating so quickly. I linked against Freetype 2.10.2, released 8 May 2020. I take it you recommend the Freetype package included with the MuPDF v1.17 source distribution--2.10.0? It seems strange that 2.10.2 would introduce a new bug over 2.10.0, but I will try the exact package that is in the mupdf v1.17 distro.
The crucial part here is that the copy in thirdparty/freetype builds with a compiler flag that works around the bug in FreeType. *** This bug has been marked as a duplicate of bug 701977 ***
Created attachment 19366 [details] Outputs from mudraw.exe on chartest2.pdf file
I added a new attachment. I built mudraw.exe from MuPDF v1.14 with FreeType 2.9.1 and FreeType 2.10.2 and then from MuPDF v1.17 with the same two FreeType versions. In all cases, I get the anomalous results only with MuPDF v1.17. Mudraw from MuPDF v1.14 worked fine with either FreeType package, including the most recent FreeType package (v2.10.2). I must be compiling v1.17 differently somehow...different flag? Any clues on where to focus my efforts would be appreciated. Could it be in how I built up the font source files? Can you post the output from your mutool draw -Fstext run?
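When comparing the output of several builds like this, it can help to check whether two extractions differ only in whitespace, since that is exactly the symptom described (extra spaces inside words). The following is a minimal sketch in plain C; the function names are hypothetical helpers made up for illustration and are not part of MuPDF.

```c
#include <ctype.h>
#include <stddef.h>

/* Advance past any whitespace and return the new position. */
static const char *skip_ws(const char *s)
{
    while (*s && isspace((unsigned char)*s))
        s++;
    return s;
}

/*
 * Return 1 if a and b contain the same non-whitespace characters in the
 * same order, 0 otherwise. Two extractions that pass this check differ
 * only in where spaces and line breaks were placed.
 */
int same_ignoring_whitespace(const char *a, const char *b)
{
    for (;;) {
        a = skip_ws(a);
        b = skip_ws(b);
        if (*a != *b)
            return 0;
        if (*a == '\0')
            return 1;
        a++;
        b++;
    }
}

/* Count whitespace characters; a surplus in one output hints at invented spaces. */
size_t count_whitespace(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (isspace((unsigned char)*s))
            n++;
    return n;
}
```

If same_ignoring_whitespace() reports a match but count_whitespace() is higher for one build, the glyphs themselves were extracted identically and only the spacing decisions changed.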
FYI, when I copied my font data files from my MuPDF v1.14 build folder to my MuPDF v1.17 build folder (Droid*.c, Nimbus*.c, Standard*.c, Dingbats.c), the result changed--it got better (fewer inserted spaces), but not as good as the straight MuPDF v1.14 results.
Hi--sorry for all the additional comments, Tor--and I really appreciate your help. I followed the discussion of the other bug report (701977) to the FreeType bug report where you recommended compiling FreeType with T1_CONFIG_OPTION_OLD_ENGINE=1. This definitely helped in my case but still didn't give the exact same results as MuPDF v1.14 for my test file (the blocks/lines were organized slightly differently--see attached). Did you use this option (T1_CONFIG_OPTION_OLD_ENGINE=1) to build your version of mutool? Is that why you got good results with it? Again, I'd like to see a post of your mutool draw -Fstext results on my test file so I can compare to it. I would also like to understand why MuPDF v1.14 doesn't seem to be affected by the FreeType bug.
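For anyone rebuilding FreeType themselves: the option discussed here is a compile-time macro, so it has to be set when FreeType itself is compiled. A minimal sketch, assuming the FreeType source tree keeps its build options in include/freetype/config/ftoption.h (where, as far as I know, this macro ships commented out). Passing -DT1_CONFIG_OPTION_OLD_ENGINE=1 to the compiler while building FreeType should have the same effect.

```c
/*
 * In FreeType's build options header (assumed location:
 * include/freetype/config/ftoption.h). Defining this compiles the
 * older Type 1 engine into FreeType.
 */
#define T1_CONFIG_OPTION_OLD_ENGINE 1
```

Setting the macro only when compiling MuPDF or the application presumably has no effect, because the code it guards lives inside FreeType. That would explain why linking against a system FreeType built without it still shows the problem.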
Created attachment 19367 [details] mudraw -Fstext chartest2.pdf output w/v1.17, FT v2.10.2, compiled w/old engine switch
So, the bug I referenced before was fixed, but there's another related bug that this file exposes. I've reported it upstream here:
https://savannah.nongnu.org/bugs/index.php?58646
Until that bug is fixed you'll have to build FreeType using the T1_CONFIG_OPTION_OLD_ENGINE=1 option.
The heuristics for structured text extraction are continuously tweaked and improved, so I'm not surprised you're getting slightly different results from MuPDF 1.14 and 1.17 even with the same FreeType.
MuPDF 1.14 was built against an older version of FreeType that did not have these bugs. These bugs were introduced when they added a special fast metrics-only Type 1 decoder that's used to get the font metrics only without parsing the outlines.
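To illustrate why wrong font metrics can show up as extra spaces, here is a deliberately simplified sketch of a gap-based space heuristic. This is an assumption about the general mechanism, not MuPDF's actual code: the struct, the function name, and the threshold are all hypothetical. The idea is that an extractor expects the next glyph to start where the previous one ends (start position plus advance width), and treats a large enough gap as a space. If the font engine under-reports the advance width, every glyph boundary can look like a gap.

```c
#include <stddef.h>

/* One glyph on a horizontal baseline (hypothetical, simplified). */
struct glyph {
    float x;        /* pen position where the glyph starts */
    float advance;  /* advance width reported by the font engine */
};

/*
 * Count the gaps that a naive extractor would turn into spaces. A gap is
 * the distance from where the previous glyph is expected to end
 * (x + advance) to where the next glyph actually starts.
 * space_frac * font_size is an illustrative threshold, not MuPDF's value.
 */
size_t count_inferred_spaces(const struct glyph *g, size_t n,
                             float font_size, float space_frac)
{
    size_t spaces = 0;
    for (size_t i = 1; i < n; i++) {
        float expected = g[i - 1].x + g[i - 1].advance;
        float gap = g[i].x - expected;
        if (gap > space_frac * font_size)
            spaces++;
    }
    return spaces;
}
```

With three glyphs placed 5 units apart, an advance of 5 leaves no gaps and no spaces are inferred. If the same glyphs report an advance of only 2, each boundary shows a gap of 3, and with a threshold of 2 a space is inferred at both boundaries even though the page content has not changed.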
Here is the output from the latest mupdf source using a FreeType with T1_CONFIG_OPTION_OLD_ENGINE=1.
$ mutool draw -Ftext chartest2.pdf
This section is inserted by LATEX; you do not insert it. You just add the names and information in the \additionalauthors command at the start of the document.
A.6 References
Generated by bibtex from your .bib file. Run latex, then bibtex, then latex twice (to resolve references) to create the .bbl file. Insert that .bbl file into the .tex source file and comment out the command \thebibliography.
B. MORE HELP FOR THE HARDY
The sig-alternate.cls file itself is chock-full of succinct and helpful comments. If you consider yourself a moderately experienced to expert user of LATEX, you may find reading it useful but please remember not to change it.
Created attachment 19371 [details] mutool draw -Fstext chartest2.pdf
Tor--thank you for patiently answering my questions. Very helpful! I will say it is a bit frustrating to me that text selection/outline info hasn't stabilized more reliably. Like you said, it seems to be continually getting tweaked and often with unintended consequences. It is probably the thing I have fought with the most whenever I have moved to a new release of MuPDF (and, correspondingly, FreeType). But I cannot complain too much. MuPDF is an amazingly complete and helpful library which I am very pleased to be able to use. Thanks to you and the team for all of the hard work.
Tor, since both the first and second FreeType issues were fixed in freetype 10.2.2 and 10.2.3, are you able to sync the MuPDF lib to the latest system version? The reason I raise this is that SumatraPDF is following your lead and applying your fix to the more static 10.2.0. See the discussion in https://github.com/sumatrapdfreader/sumatrapdf/issues/1632
(In reply to spambin from comment #13)
> Tor, since both the first and second FreeType issues were fixed in
> freetype 10.2.2 and 10.2.3, are you able to sync the MuPDF lib to the
> latest system version?
>
> The reason I raise this is that SumatraPDF is following your lead and
> applying your fix to the more static 10.2.0.
>
> See the discussion in https://github.com/sumatrapdfreader/sumatrapdf/issues/1632
Sorry, those should have read 2.10.#