Bug 691056 - Normalize unicode text into NFC for better search.
Summary: Normalize unicode text into NFC for better search.
Status: CONFIRMED
Alias: None
Product: MuPDF
Classification: Unclassified
Component: mupdf
Version: unspecified
Hardware: PC Windows XP
Importance: P4 enhancement
Assignee: MuPDF bugs
URL:
Keywords:
Duplicates: 691746
Depends on:
Blocks: 690681
Reported: 2010-01-13 11:45 UTC by zeniko
Modified: 2018-08-28 06:53 UTC (History)
CC List: 4 users

See Also:
Customer:
Word Size: ---


Attachments
example with combining characters ('å' as U+02DA 'a') (33.45 KB, application/force-download)
2014-05-15 08:13 UTC, Tor Andersson

Description zeniko 2010-01-13 11:45:33 UTC
... else every consumer of pdf_loadtextfromtree will have to do it instead.

See http://code.google.com/p/sumatrapdf/source/detail?r=1693 for what we currently 
need to make documents searchable in SumatraPDF.
Comment 1 zeniko 2010-01-13 11:46:12 UTC
Ligature expansion has also been reported on its own as bug 690681.
Comment 2 Sebastian Rasmussen 2010-08-08 17:37:51 UTC
The Poppler guys kindly link to another test file:
https://bugs.freedesktop.org/show_bug.cgi?id=19154
Comment 3 Tor Andersson 2011-02-02 20:32:14 UTC
*** Bug 691746 has been marked as a duplicate of this bug. ***
Comment 4 Benjamin Ullian 2011-02-07 05:22:53 UTC
ICU4C has an implementation of "inverse BiDi" visual-to-logical-reordering, available at 

http://icu-project.org/apiref/icu4c/ubidi_8h.html

with UBiDiReorderingMode from {UBIDI_REORDER_INVERSE_NUMBERS_AS_L, UBIDI_REORDER_INVERSE_LIKE_DIRECT, UBIDI_REORDER_INVERSE_FOR_NUMBERS_SPECIAL}
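
For illustration only, the basic idea behind visual-to-logical reordering can be sketched as below. This is not ICU's inverse BiDi algorithm and not MuPDF code; the function names are hypothetical, and the sketch only reverses runs of strong right-to-left characters (with single spaces between them). Numbers and mixed-direction runs are not handled.

```python
import unicodedata

# Bidi classes of strong right-to-left characters (Hebrew, Arabic, ...).
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """Return True if ch is a strong right-to-left character."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def visual_to_logical(text):
    """Reverse each run of RTL characters found in visual (left-to-right) order.

    A hypothetical, rudimentary pass: a run consists of RTL characters,
    plus single spaces that are directly followed by another RTL character.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if not is_rtl(text[i]):
            out.append(text[i])
            i += 1
            continue
        j = i
        while j < n and (
            is_rtl(text[j])
            or (text[j] == " " and j + 1 < n and is_rtl(text[j + 1]))
        ):
            j += 1
        out.append(text[i:j][::-1])  # visual order -> logical order
        i = j
    return "".join(out)
```

A Hebrew word extracted in visual order, such as "\u05dd\u05d5\u05dc\u05e9", comes back in logical order as "\u05e9\u05dc\u05d5\u05dd", while surrounding Latin text is left untouched.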
Comment 5 Tor Andersson 2014-05-15 08:13:21 UTC
Created attachment 10913
example with combining characters ('å' as U+02DA 'a')
Comment 6 Tor Andersson 2014-05-15 08:15:30 UTC
Fixed:

We expand the standard ligatures.
We expand ligatures using one-to-many ToUnicode CMap tables.
We run a (rudimentary) RTL visual-to-logical reordering pass.
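
As a rough illustration of what expanding the standard ligatures means for search, here is a minimal sketch. It is not MuPDF's implementation; the table and the function name are assumptions made for this example, covering only the Latin ligatures in the U+FB00..U+FB06 range.

```python
# Hypothetical lookup table: Unicode Latin ligatures -> plain letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # LONG S T ligature, folded to "st" for searching
    "\ufb06": "st",
}


def expand_ligatures(text):
    """Replace each ligature code point with its component letters."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)
```

With this, extracted text such as "e\ufb03cient \ufb01le" becomes "efficient file", so a plain-text query for "file" can match it.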

Missing:

Normalizing text into NFC.
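
To show why NFC matters for search, here is a minimal sketch using Python's standard unicodedata module. It is not MuPDF code, and the helper names are hypothetical. It assumes both the extracted text and the query are normalized before comparison.

```python
import unicodedata


def normalize_for_search(text):
    """Compose base letters and combining marks into precomposed forms (NFC)."""
    return unicodedata.normalize("NFC", text)


def contains(haystack, needle):
    """Substring search that ignores composed/decomposed differences."""
    return normalize_for_search(needle) in normalize_for_search(haystack)


decomposed = "Ha\u030ar"  # 'a' followed by U+030A COMBINING RING ABOVE
precomposed = "\u00e5"    # U+00E5 LATIN SMALL LETTER A WITH RING ABOVE

# A raw comparison misses the match; the normalized one finds it.
raw_match = precomposed in decomposed
normalized_match = contains(decomposed, precomposed)

# Note: U+02DA RING ABOVE is a spacing character, not a combining mark,
# so NFC leaves the sequence U+02DA 'a' unchanged.
spacing_ring = normalize_for_search("\u02daa")
```

Without normalization, a query typed as a precomposed 'å' does not match text stored as 'a' plus a combining ring; after NFC both sides use the same code point.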