Bug 692308

Summary:	improve extracting text in right-to-left alphabets
Product:	MuPDF	Reporter:	zeniko
Component:	mupdf	Assignee:	MuPDF bugs <mupdf-bugs>
Status:	RESOLVED FIXED
Severity:	normal	CC:	tor.andersson
Priority:	P4
Version:	unspecified
Hardware:	PC
OS:	Windows 7
URL:	http://code.google.com/p/sumatrapdf/issues/detail?id=1466
Customer:		Word Size:	---

Description zeniko 2011-06-28 14:42:16 UTC

Adobe Reader is much more successful at extracting text, e.g. from http://www.ice.gov/doclib/sevis/pdf/sevis_arabic_fs.pdf (one of the first results for http://www.google.com/search?q=arabic+ext%3Apdf ). This seems partly due to dev_text not expecting RTL text and inserting too many unintended line breaks, and partly due to differences in Unicode normalization.
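
To illustrate the normalization point, here is a minimal Python sketch. It assumes the divergence comes from the PDF encoding Arabic as presentation-form code points (U+FE70..U+FEFF), which is common but not confirmed for this particular file. The helper name normalize_extracted is hypothetical and not part of MuPDF.

```python
import unicodedata


def normalize_extracted(text: str) -> str:
    """Fold compatibility characters (e.g. Arabic presentation forms)
    to their base letters using NFKC. Hypothetical helper, not MuPDF API."""
    return unicodedata.normalize("NFKC", text)


# "salam" written with presentation forms, in logical order:
# SEEN initial form, LAM-ALEF ligature final form, MEEM isolated form.
shaped = "\uFEB3\uFEFC\uFEE1"

# The same word written with the base Arabic letters SEEN, LAM, ALEF, MEEM.
base = "\u0633\u0644\u0627\u0645"

# NFC leaves presentation forms untouched, so a plain comparison
# or search against the base letters fails.
print(unicodedata.normalize("NFC", shaped) == base)

# NFKC expands the ligature and maps each form to its base letter.
print(normalize_extracted(shaped) == base)
```

Whether an extractor should apply NFKC is a design choice: it makes the output searchable with ordinary Arabic input, but it also folds other compatibility characters (such as Latin ligatures) that a caller may want to keep.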

Comment 1 Tor Andersson 2014-04-17 06:03:19 UTC

Hopefully fixed in commit cffcdf1ab2189a55b09b8ac74d552e6a2e809510
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Fri May 3 16:33:31 2013 +0200

    Add simple visual-to-logic RTL reordering as a text extraction pass.
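
For readers unfamiliar with the idea, the following Python sketch shows one very simple form of visual-to-logical reordering. It is not the code from the commit above and not the full Unicode Bidirectional Algorithm; it only assumes that a line arrives in left-to-right visual order and that a line dominated by RTL characters should be reversed, with Latin and digit runs restored afterwards. The function name visual_to_logical is made up for this example.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # strong right-to-left (Hebrew, Arabic)
LTR_RUN_CLASSES = {"L", "EN"}    # Latin letters and European digits


def visual_to_logical(line: str) -> str:
    """Reorder one line from visual (left-to-right glyph) order
    to logical (reading) order. Simplified sketch only."""
    classes = [unicodedata.bidirectional(ch) for ch in line]
    rtl = sum(c in RTL_CLASSES for c in classes)
    ltr = sum(c == "L" for c in classes)
    if rtl <= ltr:
        # Treat the line as left-to-right and leave it alone.
        return line

    # Reverse the whole line, which puts the RTL text in reading order
    # but also flips any embedded numbers and Latin words.
    chars = list(reversed(line))

    # Flip each embedded LTR run back to its original order.
    out = []
    i = 0
    while i < len(chars):
        j = i
        while j < len(chars) and unicodedata.bidirectional(chars[j]) in LTR_RUN_CLASSES:
            j += 1
        if j > i:
            out.extend(reversed(chars[i:j]))
            i = j
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


# Hebrew "shalom" followed by 123, as the glyphs appear left to right on the page.
visual = "123 \u05DD\u05D5\u05DC\u05E9"
logical = "\u05E9\u05DC\u05D5\u05DD 123"
print(visual_to_logical(visual) == logical)
```

A per-line majority vote is a crude way to pick the base direction; mixed-direction lines, mirrored brackets, and nested embeddings need the rules in the Unicode Bidirectional Algorithm.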