Created attachment 19171: a PDF file which includes UTF-8 encoded strings in the TOC/bookmark entries.

We use mutool to provide a linked TOC view for an embedded PDF viewer, and this has worked well. However, we recently discovered that for certain PDF files with UTF-8 encoded strings, the output shows hex-encoded strings in place of the UTF-8 characters. I have attached one such file.

Note that JHOVE handles this file properly: it identifies the file as well-formed and valid, and the TOC items all display with their correct UTF-8 characters. This suggests that mutool show is not handling those characters correctly.

Steps to reproduce: run the following command on the attached file:

    mutool show 3v76q8q5.pdf outline

Results: text such as this appears: "Bibliograf\xC3\xADa"

Expected results: UTF-8 characters, such as: "Bibliografía"
This behavior is as intended, but maybe not as useful as it could be. I'm working on a patch to print the strings with verbatim UTF-8 characters rather than hex-encoding them.

If you want the outline in an easily parsed format, you can save this code as show-outline.js and use "mutool run show-outline.js input.pdf" to get the outline as JSON:

    print(JSON.stringify(new Document(scriptArgs[0]).loadOutline());
Thanks for working on a patch! We have a workaround that uses a regex to find all the hex-encoded strings and convert them back into the original UTF-8 characters. Still, I think your patch would be generally useful. Maybe enable the new behavior with a flag, in case someone depends on the hex-encoded strings?

BTW, your sample JS snippet is missing a final closing paren ')'.

It's very cool to be able to get this information as JSON, and that might be useful for a future project.
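For anyone who needs the same stopgap, the regex workaround mentioned above could look roughly like the sketch below. This is not our exact code: the helper name decodeHexEscapes is made up for illustration, it assumes the only escapes that need undoing are \xHH byte escapes like the ones in the report, and it assumes a runtime that provides TextDecoder (for example Node.js).

```javascript
// Hypothetical helper (not part of mutool): turn runs of \xHH escapes
// back into the UTF-8 text they encode.
function decodeHexEscapes(s) {
  // Match each run of consecutive \xHH escapes so that the bytes of a
  // multi-byte UTF-8 sequence are decoded together.
  return s.replace(/(?:\\x[0-9A-Fa-f]{2})+/g, function (run) {
    // Drop the empty piece before the first "\x", then parse each byte.
    const bytes = run.split("\\x").slice(1).map(function (h) {
      return parseInt(h, 16);
    });
    return new TextDecoder("utf-8").decode(Uint8Array.from(bytes));
  });
}

console.log(decodeHexEscapes("Bibliograf\\xC3\\xADa")); // prints "Bibliografía"
```

Decoding a whole run at once matters because a character such as "í" is two bytes (C3 AD); converting each escape on its own would produce two wrong characters. Other escapes the tool may emit (quotes, backslashes) are left untouched by this sketch.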
commit 54b21c0d2cc7635267cee58fca7e644a8f90e67a
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Mon Apr 27 11:22:57 2020 +0200

    Bug 702358: Use unicode escapes for printf %q strings.

    Also add a %Q for writing non-ASCII characters in UTF-8 (like %C).
    Use %Q when showing the outline.