Bug 702358 - mutool show outputs utf-8 encoded text as hex-encoded strings
Summary: mutool show outputs utf-8 encoded text as hex-encoded strings
Status: RESOLVED FIXED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: 1.16.1
Hardware: PC Linux
Importance: P4 normal
Assignee: MuPDF bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-24 19:59 UTC by Hardy Pottinger
Modified: 2020-05-07 08:34 UTC

See Also:
Customer:
Word Size: ---


Attachments
A PDF file which includes UTF-8 encoded strings in the TOC/bookmark entries. (11.92 MB, application/pdf)
2020-04-24 19:59 UTC, Hardy Pottinger

Description Hardy Pottinger 2020-04-24 19:59:52 UTC
Created attachment 19171
A PDF file which includes UTF-8 encoded strings in the TOC/bookmark entries.

We use mutool to provide a linked TOC view for an embedded PDF viewer, and this has worked well. However, we recently discovered that for certain PDF files with UTF-8 encoded text, mutool show prints hex-escaped byte sequences in place of the UTF-8 characters. I have attached one such file. JHOVE handles this file properly: it identifies the file as well-formed and valid, and the TOC items all display with their correct UTF-8 characters. This suggests that mutool show is not handling those characters correctly.

Steps to reproduce:

Run the following command on the attached file:

mutool show 3v76q8q5.pdf outline

Results: text such as this appears:
"Bibliograf\xC3\xADa"

Expected results: UTF-8 characters, such as:
"Bibliografía"
Comment 1 Tor Andersson 2020-04-27 10:58:10 UTC
This behavior is as intended, but maybe not as useful as it could be. I'm working on a patch to print the strings with verbatim UTF-8 characters rather than hex-encoding them.

If you want the outline in an easily parsed format, you can save this code as show-outline.js and use "mutool run show-outline.js input.pdf" to get the outline as JSON.

    print(JSON.stringify(new Document(scriptArgs[0]).loadOutline());
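A slightly longer sketch of the same script, with the parentheses balanced. It relies only on the calls used in the one-liner above (Document, loadOutline, scriptArgs, print); the function name outlineAsJson and the scriptArgs guard are additions for illustration and are not part of MuPDF.

```javascript
// show-outline.js: intended to be started as
//   mutool run show-outline.js input.pdf
// Document, scriptArgs and print are globals supplied by "mutool run".

// Hypothetical helper name: load the outline of one PDF and serialize it.
function outlineAsJson(path) {
  var doc = new Document(path);
  return JSON.stringify(doc.loadOutline());
}

// Only print when the mutool run globals are present, so the file can
// also be loaded elsewhere without failing.
if (typeof scriptArgs !== "undefined") {
  print(outlineAsJson(scriptArgs[0]));
}
```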
Comment 2 Hardy Pottinger 2020-04-27 17:32:23 UTC
Thanks for working on a patch! We have a workaround that uses a regex to find all the hex-encoded strings and convert them back into the original UTF-8 encoded characters. But I do think your patch would be generally useful. Maybe enable the new behavior with a flag, in case someone is depending on the hex-encoded strings?

BTW, your sample JS snippet is missing a final closing paren ')'. It's very cool to be able to get this information as JSON, and that might be useful for a future project.
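
The regex workaround itself is not shown in this thread. A minimal sketch of one way to do it, assuming the output contains only \xHH escapes of the kind shown in the description: the helper name decodeHexEscapes is hypothetical, and the sketch does not try to handle escaped backslashes that happen to precede an "x".

```javascript
// Hypothetical helper, not part of mutool: turn runs of \xHH escapes
// back into the UTF-8 characters they encode.
function decodeHexEscapes(text) {
  // Match consecutive \xHH escapes as one run so that the bytes of a
  // multi-byte UTF-8 sequence are decoded together.
  return text.replace(/(?:\\x[0-9A-Fa-f]{2})+/g, function (run) {
    // "\xC3\xAD" becomes "%C3%AD", which decodeURIComponent reads as UTF-8.
    var percentEncoded = run.replace(/\\x/g, "%");
    try {
      return decodeURIComponent(percentEncoded);
    } catch (e) {
      // The bytes are not valid UTF-8: leave the escapes untouched.
      return run;
    }
  });
}
```

Applied to the line from the description, decodeHexEscapes('"Bibliograf\\xC3\\xADa"') gives "Bibliografía" with the surrounding quotes kept.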
Comment 3 Tor Andersson 2020-05-07 08:34:25 UTC
commit 54b21c0d2cc7635267cee58fca7e644a8f90e67a
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Mon Apr 27 11:22:57 2020 +0200

    Bug 702358: Use unicode escapes for printf %q strings.
    
    Also add a %Q for writing non-ASCII characters in UTF-8 (like %C).
    
    Use %Q when showing the outline.
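
    The difference between the two styles named in the commit message can be illustrated in JavaScript. This is an illustration only, not MuPDF's C implementation: quoteString and hex are hypothetical helpers, and the exact escape forms (a four-digit \u escape for non-ASCII, \x for control characters) are assumptions.

```javascript
// Illustration only: hypothetical helpers, not MuPDF's formatting code.
// keepNonAscii = false mimics an "escape everything non-ASCII" style,
// keepNonAscii = true  mimics a "write non-ASCII verbatim" style.
function quoteString(text, keepNonAscii) {
  var out = '"';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var code = text.charCodeAt(i);
    if (ch === '"' || ch === "\\") {
      out += "\\" + ch;                 // escape quote and backslash
    } else if (code < 32 || code === 127) {
      out += "\\x" + hex(code, 2);      // assumed form for control characters
    } else if (code > 127 && !keepNonAscii) {
      out += "\\u" + hex(code, 4);      // assumed form for non-ASCII
    } else {
      out += ch;                        // printable ASCII, or verbatim non-ASCII
    }
  }
  return out + '"';
}

// Upper-case hexadecimal, zero-padded to the given width.
function hex(n, width) {
  var s = n.toString(16).toUpperCase();
  while (s.length < width) s = "0" + s;
  return s;
}
```

    With these helpers, quoteString("Bibliografía", false) yields "Bibliograf\u00EDa" and quoteString("Bibliografía", true) yields "Bibliografía".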