Bug 692308 - improve extracting text in right-to-left alphabets
Status: RESOLVED FIXED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: unspecified
Hardware: PC Windows 7
Importance: P4 normal
Assignee: MuPDF bugs
URL: http://code.google.com/p/sumatrapdf/i...
Reported: 2011-06-28 14:42 UTC by zeniko
Modified: 2014-04-17 06:03 UTC

Description zeniko 2011-06-28 14:42:16 UTC
Adobe Reader is much more successful at extracting text, e.g. from http://www.ice.gov/doclib/sevis/pdf/sevis_arabic_fs.pdf (one of the first results for http://www.google.com/search?q=arabic+ext%3Apdf ). This seems to be partly because dev_text does not expect RTL text and inserts too many unintended line breaks, and partly because of differences in Unicode normalization.
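To illustrate the kind of normalization difference that could be involved: one plausible source, assumed here rather than confirmed for this file, is that the PDF encodes Arabic presentation forms (contextual glyph shapes and ligatures) where another extractor emits the base letters. A minimal Python sketch of folding those forms with NFKC follows; the function name is hypothetical and this is not what MuPDF or Adobe Reader is known to do.

```python
import unicodedata


def fold_presentation_forms(text: str) -> str:
    """Fold compatibility characters (e.g. Arabic presentation forms)
    to their base letters using Unicode NFKC normalization.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize("NFKC", text)


# U+FEFB is the isolated LAM WITH ALEF ligature; NFKC expands it to
# U+0644 (lam) followed by U+0627 (alef).
ligature = "\uFEFB"
folded = fold_presentation_forms(ligature)
print([f"U+{ord(c):04X}" for c in folded])
```

Two extractors that disagree on whether to apply such folding would produce text that looks the same on screen but does not compare equal or match in search.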
Comment 1 Tor Andersson 2014-04-17 06:03:19 UTC
Hopefully fixed in commit cffcdf1ab2189a55b09b8ac74d552e6a2e809510
Author: Tor Andersson <tor.andersson@artifex.com>
Date:   Fri May 3 16:33:31 2013 +0200

    Add simple visual-to-logic RTL reordering as a text extraction pass.
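
For readers unfamiliar with the idea, here is a rough Python sketch of what a visual-to-logical reordering pass can look like. It is not the code from the commit above (MuPDF is written in C, and the commit is not reproduced here); the function names and the run-detection rule are assumptions made for illustration. It takes one line of characters in visual (left-to-right on the page) order and reverses each run of strong right-to-left characters, including neutral characters such as spaces that sit between them.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}           # strong right-to-left bidi classes
RUN_BREAKERS = {"L", "EN", "AN"}    # strong LTR letters and digits end a run


def is_rtl(ch: str) -> bool:
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(visual: str) -> str:
    """Reverse each RTL run in a line given in visual order.

    Simplified sketch: it ignores paragraph base direction, digits
    embedded in RTL text, mirrored brackets, and combining marks.
    """
    out = []
    i, n = 0, len(visual)
    while i < n:
        if not is_rtl(visual[i]):
            out.append(visual[i])
            i += 1
            continue
        # Extend the run across neutrals, but end it at the last RTL
        # character seen before any LTR letter or digit.
        j, end = i, i + 1
        while j < n and unicodedata.bidirectional(visual[j]) not in RUN_BREAKERS:
            if is_rtl(visual[j]):
                end = j + 1
            j += 1
        out.append(visual[i:end][::-1])
        i = end
    return "".join(out)


# Hebrew letters gimel, bet, alef as they appear left to right on the page;
# the logical (reading) order is alef, bet, gimel.
print(visual_to_logical("abc \u05D2\u05D1\u05D0 def"))
```

A full solution would follow the Unicode Bidirectional Algorithm; this sketch only covers the simple case of unbroken RTL runs on a line.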