691746 – PdfDraw outputs Hebrew letters in reverse order.

Status:	RESOLVED DUPLICATE of bug 691056

Alias:	None

Product:	MuPDF
Classification:	Unclassified
Component:	apps
Version:	unspecified
Hardware:	PC All

Importance:	P4 normal
Assignee:	Tor Andersson

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-11-03 19:30 UTC by Benjamin Ullian
Modified:	2011-02-02 20:32 UTC
CC List:	2 users

See Also:
Customer:
Word Size:	---

Attachments
Input PDF, output XML+TXT, PNG showing the problem. (126.08 KB, application/x-zip-compressed) 2010-11-03 19:30 UTC, Benjamin Ullian

Description Benjamin Ullian 2010-11-03 19:30:36 UTC

Created attachment 6865
Input PDF, output XML+TXT, PNG showing the problem.

Hi MuPDF folks,

Thanks for the terrific library and tools. 

I have recently started working with PDFs in Hebrew, which is written right-to-left (RTL). What I've noticed is that pdfdraw, in both -t (text) and -tt (XML) modes, outputs the letters in the wrong order.

In Unicode strings, as I'm sure you already know, the order of characters in the string ("order" as counted by *increasing* character index) should correspond to the reading order. The characters themselves, through the Unicode BiDi algorithm, then impart directionality to subsequences of the string, so the string is stored in reading order and still rendered correctly.

As such, the bytes of a string in an LTR language and the bytes of a string in an RTL language should always describe characters __in reading order__. It seems that pdfdraw, for RTL languages, outputs strings in LTR order, which is the reverse of what is expected: the leftmost character is output first, and then the renderer renders the string RTL, so it appears reversed.
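
To make the distinction concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not pdfdraw code: the helper names `is_rtl` and `visual_to_logical` are hypothetical, and reversing runs of RTL characters is a naive heuristic that falls far short of the full Unicode BiDi algorithm (it ignores digits, brackets, mirroring, and embedding levels).

```python
import unicodedata

# Strong right-to-left bidi classes: "R" (e.g. Hebrew) and "AL" (e.g. Arabic).
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """Return True if the character has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(line):
    """Naively convert a left-to-right visual-order line to reading order.

    Each maximal run that starts and ends with an RTL character, and contains
    only RTL characters and whitespace, is reversed in place. This is an
    assumption-heavy heuristic, not an implementation of the BiDi algorithm.
    """
    chars = list(line)
    n = len(chars)
    i = 0
    while i < n:
        if not is_rtl(chars[i]):
            i += 1
            continue
        last_rtl = i  # index of the last RTL character seen in this run
        k = i
        while k < n and (is_rtl(chars[k]) or chars[k].isspace()):
            if is_rtl(chars[k]):
                last_rtl = k
            k += 1
        chars[i:last_rtl + 1] = chars[i:last_rtl + 1][::-1]
        i = last_rtl + 1
    return "".join(chars)


# Two Hebrew words in logical (reading) order, written with escapes so the
# source itself is unaffected by an editor's bidi rendering.
logical = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
# Visual order as extracted left to right: the same characters, reversed.
visual = logical[::-1]
assert visual_to_logical(visual) == logical
```

In the example, `visual` stands in for the kind of output described above, where the leftmost glyph comes first; the heuristic recovers `logical` only because the line is a single, pure-Hebrew run.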


I've attached a simple PDF and the .xml and .txt files from the bleeding-edge pdfdraw in git. The problem is apparent from even the first line in the file, which I have marked up in the PNG to illustrate the problem.


I would be happy to work on a fix for this problem myself, by somehow integrating the Unicode BiDi algorithm. Before I attempt such an endeavor, though, I was wondering whether this problem is known to your team, whether a fix is already in development, and whether you have any suggestions as to where this would best be implemented in MuPDF.


Thanks
Ben

Comment 1 zeniko 2010-11-03 23:07:56 UTC

See http://code.google.com/p/sumatrapdf/source/detail?r=1693 for the patch we're currently using in SumatraPDF's MuPDF. We'll accept patches to further improve our situation.

Comment 2 Tor Andersson 2011-02-02 20:32:14 UTC

PDF has all text in visual order, and that's what pdfdraw currently emits. SumatraPDF has a patch that does BiDi reordering and also converts some common characters from Unicode Normalization Form D to Normalization Form C (precomposed). I'm considering a more thorough approach based on the Unicode data tables in MuPDF, but I have not had time to implement it yet.

*** This bug has been marked as a duplicate of bug 691056 ***
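
For readers unfamiliar with the normalization step mentioned in Comment 2, the following Python sketch shows what converting from Normalization Form D (decomposed) to Form C (precomposed) means. It uses Python's standard `unicodedata` module purely as an illustration; it is not the SumatraPDF patch, and the helper name `to_precomposed` is hypothetical.

```python
import unicodedata


def to_precomposed(text):
    """Return the text in Unicode Normalization Form C (precomposed)."""
    return unicodedata.normalize("NFC", text)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points (Form D).
decomposed = "e\u0301"
# After NFC the pair becomes the single code point U+00E9.
composed = to_precomposed(decomposed)
assert composed == "\u00e9"
assert len(decomposed) == 2 and len(composed) == 1
```

Text extracted from a PDF may contain a base letter and its combining mark as separate code points, so normalizing to Form C can make the output easier to search and compare.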