Created attachment 6865: input PDF, output XML+TXT, PNG showing the problem. Hi MuPDF folks, Thanks for the terrific library and tools. I have recently started working with PDFs in Hebrew, which is written right-to-left (RTL). I've noticed that pdfdraw, in both -t (text) and -tt (XML) modes, outputs the letters in the wrong order. In Unicode strings, as I'm sure you already know, the order of characters in the string ("order" as counted by *increasing* character index) should correspond to the reading order. The Unicode BiDi algorithm then uses the directional properties of the characters themselves to decide the display direction of each subsequence, so the string is stored in reading order and still rendered correctly. As such, the bytes of a string in an LTR language and the bytes of a string in an RTL language should always describe characters __in reading order__. It seems that pdfdraw, for RTL languages, outputs strings in LTR order (the reverse of what is expected): the leftmost character is output first, and the renderer then renders it RTL, so the string appears reversed. I've attached a simple PDF and the .xml and .txt files produced by the bleeding-edge pdfdraw from git. The problem is apparent even in the first line of the file, which I have marked up in the PNG to illustrate it. I would be happy to work on a fix myself by somehow integrating the Unicode BiDi algorithm, but before I attempt that I was wondering whether this problem is already known to your team, whether a fix is already in development, and whether you have any suggestions as to where this would best be implemented in MuPDF. Thanks, Ben
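To make the logical-versus-visual distinction in the report concrete, here is a small Python sketch. It is not taken from the attachment; the word used (Hebrew "shalom") and the helper name `bidi_classes` are illustrative assumptions. It only shows that a string in reading order starts with the first character read, and that walking the same glyphs left to right across the page yields the reverse.

```python
import unicodedata

# "shalom" in logical (reading) order: shin, lamed, vav, final mem.
logical = "\u05e9\u05dc\u05d5\u05dd"

# What a left-to-right walk over the glyphs of a purely RTL line yields:
# the same characters, leftmost glyph first.
visual = logical[::-1]


def bidi_classes(s):
    """Return the Unicode bidirectional class of each character (hypothetical helper)."""
    return [unicodedata.bidirectional(ch) for ch in s]


if __name__ == "__main__":
    print([f"U+{ord(ch):04X}" for ch in logical])
    print([f"U+{ord(ch):04X}" for ch in visual])
    print(bidi_classes(logical))
```

A BiDi-aware renderer given `visual` will lay it out right-to-left again, which is why text extracted in visual order appears reversed on screen.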
See http://code.google.com/p/sumatrapdf/source/detail?r=1693 for the patch we're currently using in SumatraPDF's MuPDF. We'll accept patches to further improve our situation.
PDF has all text in visual order, and that's what pdfdraw currently emits. SumatraPDF has a patch that does BiDi reordering and also converts some common characters from Unicode Normalization Form D (decomposed) to Normalization Form C (precomposed). I'm considering a more thorough approach based on the Unicode data tables in MuPDF, but I have not had time to implement it yet. *** This bug has been marked as a duplicate of bug 691056 ***
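For anyone following along, the two steps mentioned above (reordering visual-order text and composing NFD to NFC) can be sketched roughly as follows in Python. This is not the SumatraPDF patch and not the full Unicode BiDi algorithm (UAX #9); it is a deliberately naive illustration. The function names `visual_to_logical` and `to_nfc` are hypothetical, and the reordering assumes a left-to-right base direction and ignores digits, punctuation, mirrored brackets, and combining marks.

```python
import unicodedata

# Strong right-to-left bidi classes (Hebrew letters are "R", Arabic letters are "AL").
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(line):
    """Naively reverse each run of RTL letters (plus the spaces between them).

    Assumption: `line` is in visual (left-to-right on the page) order.
    """
    out = []
    i, n = 0, len(line)
    while i < n:
        if not is_rtl(line[i]):
            out.append(line[i])
            i += 1
            continue
        # Extend the run over RTL letters and whitespace, then trim it
        # so that it ends on the last RTL letter seen.
        j, last_rtl = i, i
        while j < n and (is_rtl(line[j]) or line[j].isspace()):
            if is_rtl(line[j]):
                last_rtl = j
            j += 1
        out.append(line[i:last_rtl + 1][::-1])
        i = last_rtl + 1
    return "".join(out)


def to_nfc(text):
    """Compose decomposed sequences, e.g. 'e' + U+0301 becomes U+00E9."""
    return unicodedata.normalize("NFC", text)
```

For example, `visual_to_logical("abc \u05dd\u05d5\u05dc\u05e9")` returns `"abc \u05e9\u05dc\u05d5\u05dd"`. A real fix would need proper embedding levels from UAX #9, which is presumably what the more thorough table-based approach would provide.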