Bug 692308

Summary: improve extracting text in right-to-left alphabets
Product: MuPDF
Reporter: zeniko
Component: mupdf
Assignee: MuPDF bugs <mupdf-bugs>
Status: RESOLVED FIXED    
Severity: normal
CC: tor.andersson
Priority: P4    
Version: unspecified   
Hardware: PC   
OS: Windows 7   
URL: http://code.google.com/p/sumatrapdf/issues/detail?id=1466
Customer:
Word Size: ---

Description zeniko 2011-06-28 14:42:16 UTC
Adobe Reader is much more successful at extracting text, e.g. from http://www.ice.gov/doclib/sevis/pdf/sevis_arabic_fs.pdf (one of the first results from http://www.google.com/search?q=arabic+ext%3Apdf ). This seems partly due to dev_text not expecting RTL text and inserting too many unintended line breaks, and partly due to differences in Unicode normalization.
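
To illustrate the kind of normalization difference mentioned above: Arabic PDFs often encode glyphs as presentation forms (for example the lam-alef ligature U+FEFB), while another extractor may emit the base letters instead. The report does not say which normalization either program applies, so the following is only a sketch that assumes compatibility normalization (NFKC) is the relevant step; the helper name fold_presentation_forms is hypothetical.

```python
import unicodedata


def fold_presentation_forms(text):
    """Map compatibility characters (e.g. Arabic presentation forms,
    Latin ligatures) to their base letters using NFKC.

    Assumption: NFKC is used here for illustration only; the bug report
    does not state which normalization Adobe Reader or MuPDF performs.
    """
    return unicodedata.normalize("NFKC", text)


# U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM becomes LAM + ALEF.
ligature = "\uFEFB"
folded = fold_presentation_forms(ligature)
print([hex(ord(ch)) for ch in folded])
```

Two extractors that differ only in this step would return strings that look alike on screen but compare unequal and behave differently when searched.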
Comment 1 Tor Andersson 2014-04-17 06:03:19 UTC
Hopefully fixed in commit cffcdf1ab2189a55b09b8ac74d552e6a2e809510
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Fri May 3 16:33:31 2013 +0200

    Add simple visual-to-logic RTL reordering as a text extraction pass.
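
For readers unfamiliar with the term, a visual-to-logical pass takes characters in the order they appear on the page (left to right) and restores reading order for right-to-left runs. The sketch below is not the code from the commit above and is far simpler than the full Unicode Bidirectional Algorithm; it assumes a line is given as a string in visual order and simply reverses each run of RTL characters. The names is_rtl and visual_to_logical are hypothetical.

```python
import unicodedata

# Bidi classes treated as right-to-left: Hebrew-style (R) and Arabic letters (AL).
RTL_CLASSES = {"R", "AL"}
# Classes that end an RTL run: left-to-right letters and numbers.
RUN_BREAKERS = {"L", "EN", "AN"}


def is_rtl(ch):
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(line):
    """Reverse each maximal RTL run in a visually ordered line.

    Neutral characters (spaces, punctuation) are kept inside a run only
    when they sit between two RTL characters. This is an illustrative
    simplification, not the MuPDF implementation.
    """
    out = []
    i, n = 0, len(line)
    while i < n:
        if not is_rtl(line[i]):
            out.append(line[i])
            i += 1
            continue
        j = i + 1
        end = j  # one past the last RTL character seen in this run
        while j < n:
            if is_rtl(line[j]):
                end = j + 1
            elif unicodedata.bidirectional(line[j]) in RUN_BREAKERS:
                break
            j += 1
        out.append(line[i:end][::-1])
        i = end
    return "".join(out)


# The Hebrew word "shalom" as it appears left to right on the page.
visual = "abc \u05DD\u05D5\u05DC\u05E9 123"
logical = visual_to_logical(visual)
print([hex(ord(ch)) for ch in logical])
```

Latin text and digits are left in place, so mixed lines keep their left-to-right parts unchanged; numbers embedded inside RTL text would need the fuller rules of the Bidirectional Algorithm.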