Bug 691056

Summary: Normalize unicode text into NFC for better search.
Product: MuPDF Reporter: zeniko
Component: mupdf Assignee: MuPDF bugs <mupdf-bugs>
Status: CONFIRMED ---    
Severity: enhancement CC: bullian, christinedelight.top85, sebastian.rasmussen, tor.andersson
Priority: P4    
Version: unspecified   
Hardware: PC   
OS: Windows XP   
Customer: Word Size: ---
Bug Depends on:    
Bug Blocks: 690681    
Attachments: example with combining characters ('å' as U+02DA 'a')

Description zeniko 2010-01-13 11:45:33 UTC
... else every consumer of pdf_loadtextfromtree will have to do it instead.

See http://code.google.com/p/sumatrapdf/source/detail?r=1693 for what we currently need to make documents searchable in SumatraPDF.
Comment 1 zeniko 2010-01-13 11:46:12 UTC
Ligature expansion has also been reported on its own as bug 690681.
Comment 2 Sebastian Rasmussen 2010-08-08 17:37:51 UTC
The poppler guys kindly link to another testfile:
https://bugs.freedesktop.org/show_bug.cgi?id=19154
Comment 3 Tor Andersson 2011-02-02 20:32:14 UTC
*** Bug 691746 has been marked as a duplicate of this bug. ***
Comment 4 Benjamin Ullian 2011-02-07 05:22:53 UTC
ICU4C has an implementation of "inverse BiDi" visual-to-logical reordering, available at

http://icu-project.org/apiref/icu4c/ubidi_8h.html

with UBiDiReorderingMode from {UBIDI_REORDER_INVERSE_NUMBERS_AS_L, UBIDI_REORDER_INVERSE_LIKE_DIRECT, UBIDI_REORDER_INVERSE_FOR_NUMBERS_SPECIAL}
Comment 5 Tor Andersson 2014-05-15 08:13:21 UTC
Created attachment 10913
example with combining characters ('å' as U+02DA 'a')
Comment 6 Tor Andersson 2014-05-15 08:15:30 UTC
Fixed:

We expand the standard ligatures.
We expand ligatures using one-to-many ToUnicode CMap tables.
We run a (rudimentary) RTL visual-to-logical reordering pass.
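
For readers unfamiliar with the first item, here is a rough sketch of what expanding the standard ligatures means for extracted text. It is written in Python for brevity and is not MuPDF's code; the table `STANDARD_LIGATURES` and the function `expand_ligatures` are hypothetical names, and the table only covers the Latin f-ligatures U+FB00 to U+FB04.

```python
# Hypothetical lookup table for illustration only (not taken from MuPDF).
# Maps each Latin f-ligature code point to the letters it stands for.
STANDARD_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text: str) -> str:
    """Replace each known ligature character with its component letters."""
    return "".join(STANDARD_LIGATURES.get(ch, ch) for ch in text)
```

With this, a page that stores "office" using the single "fi" ligature glyph can be matched by a search for the plain six-letter word.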

Missing:

Normalizing text into NFC.
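
As an illustration of the missing step, a consumer of the extracted text could normalize both the page text and the search string before comparing them. The sketch below uses Python's standard `unicodedata` module rather than anything in MuPDF; `normalize_for_search` and `find` are hypothetical helper names. One caveat, assuming the attachment really uses U+02DA: that is the spacing RING ABOVE, not the combining U+030A, so NFC alone leaves it unchanged and such a file would probably need extra handling.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Return the NFC form, composing base letters with combining marks."""
    return unicodedata.normalize("NFC", text)


def find(haystack: str, needle: str) -> int:
    """Search after normalizing both sides.

    The returned index refers to the normalized haystack, or -1 if absent.
    """
    return normalize_for_search(haystack).find(normalize_for_search(needle))
```

Here 'a' followed by U+030A composes to the single code point U+00E5, so a search for the precomposed 'å' finds text stored in decomposed form.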