691918 – Fixes for the Unicode Decoding resource(s)

Status:	RESOLVED FIXED

Alias:	None

Product:	Ghostscript
Classification:	Unclassified
Component:	Resource
Version:	master
Hardware:	PC All

Importance:	P4 normal
Assignee:	Ken Sharp

URL:
Keywords:	bountiable

Depends on:
Blocks:

Reported:	2011-01-30 20:18 UTC by SaGS
Modified:	2011-02-10 10:34 UTC
CC List:	2 users

See Also:
Customer:
Word Size:	---

Attachments
Suggested new version of ‘Resource\Decoding\Unicode’ (138.96 KB, text/plain) 2011-01-30 20:21 UTC, SaGS
Suggested new version of ‘Resource\Decoding\FCO_Unicode’ (70.80 KB, text/plain) 2011-01-30 20:25 UTC, SaGS

Description SaGS 2011-01-30 20:18:27 UTC

Many mappings in Ghostscript’s ‘Resource\Decoding\Unicode’ differ from the current version of the Adobe Glyph List, and many names are missing. Also, the names from the Zapf Dingbats font were removed in revision 4030, unintentionally I think, because the log message only says ‘Fix : gs/Resource/Decoding/Unicode contained wrong codes for Cyrillic and Hebrew.’

I found these problems in the context of bug #691907 ‘PDFs with TrueType fonts from Windows PostScript files not searchable’: in the absence of a GlyphNames2Unicode generated by the driver, this resource is used to construct the PDF ToUnicode CMap, which in turn is used for searching the PDF and copying text.

I’ll attach below (comment #1) an updated version of this file.

I also attach my best attempt at updating the ‘FCO_Unicode’ resource, although this one is more problematic. Details in comment #2.
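
To illustrate why missing names matter for searching, here is a minimal Python sketch of how a glyph-name table could feed the `bfchar` entries of a ToUnicode CMap. This is not Ghostscript's actual code: the function name `to_unicode_bfchar`, the tiny `DECODING` excerpt, and the sample `ENCODING` are assumptions made up for the example. The name `a1` stands in for a Zapf Dingbats name that is absent from the table.

```python
# Hypothetical excerpt of a glyph-name -> Unicode table (values as in the AGL).
DECODING = {"A": 0x0041, "hyphen": 0x002D, "mu": 0x00B5}


def to_unicode_bfchar(encoding, decoding):
    """Return ToUnicode 'bfchar' lines for a {char code: glyph name} encoding.

    Codes whose glyph name is absent from `decoding` get no entry, so the
    corresponding text cannot be searched or copied.
    """
    lines = []
    for code in sorted(encoding):
        name = encoding[code]
        if name in decoding:
            lines.append("<%02X> <%04X>" % (code, decoding[name]))
    return lines


# Illustrative single-byte encoding; 'a1' is missing from DECODING on purpose.
ENCODING = {0x41: "A", 0x2D: "hyphen", 0xB5: "mu", 0x80: "a1"}
BFCHAR = to_unicode_bfchar(ENCODING, DECODING)
```

In this sketch code 0x80 silently gets no mapping, which is the symptom described for names that the resource no longer contains.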

Comment 1 SaGS 2011-01-30 20:21:46 UTC

Created attachment 7168
Suggested new version of ‘Resource\Decoding\Unicode’

Bug #691918 : Update the mappings in the 'Unicode' Decoding resource, both fixing mappings for existing names and adding new names/ mappings. See comment in code for the various sources of glyph names.

Notes:
- The attachment is not a diff, but the full file. Because of formatting 
  changes (see comment in code), a diff would be unreadable.
- There are 112 changes to the mappings, and a lot of new names. No names 
  were deleted.
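
Since a textual diff is not practical here, the counts above can be checked by comparing the two tables as name-to-code dictionaries. The following is only a sketch under that assumption; `compare_tables` is a hypothetical helper, and `OLD` and `NEW` are two-or-three-entry excerpts, not the real files.

```python
def compare_tables(old, new):
    """Classify differences between two {glyph name: code point} tables."""
    # Names present in both tables whose code point differs.
    changed = {n: (old[n], new[n]) for n in old if n in new and old[n] != new[n]}
    # Names only in the new table, and names only in the old one.
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    return changed, added, deleted


# Tiny illustrative excerpts: 'mu' changes its mapping, 'hyphen' is new.
OLD = {"A": 0x0041, "mu": 0x03BC}
NEW = {"A": 0x0041, "mu": 0x00B5, "hyphen": 0x002D}

CHANGED, ADDED, DELETED = compare_tables(OLD, NEW)
```

An empty `DELETED` list corresponds to the statement that no names were deleted.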

Comment 2 SaGS 2011-01-30 20:25:22 UTC

Created attachment 7169
Suggested new version of ‘Resource\Decoding\FCO_Unicode’

Bug #691918 : Update the mappings in the 'FCO_Unicode' Decoding resource, but without adding any new names. The updates follow the current Adobe Glyph List v2.0 and include changes that were made in the past to the ‘Unicode’ resource but not to ‘FCO_Unicode’.

DETAILS -------

When it was added to the repository, ‘FCO_Unicode’ was a slightly modified copy of ‘Unicode’. Since then, mappings in ‘Unicode’ were corrected a few times but not those in ‘FCO_Unicode’, so ‘FCO_Unicode’ suffered from bitrot.

To see what differences are really intended, I compared the original ‘FCO_Unicode’ with the ‘Unicode’ version at that time. The differences I found and what I did about them are listed below:

- The following 2 names had different mappings in ‘FCO_Unicode’ and
  ‘Unicode’ at the time that ‘FCO_Unicode’ was added, and there are 
  explicit comments that these differences are desired; I preserved 
  them:

    fraction            mapped to 16#2215 instead of 16#2044 [kept]
    macron              mapped to 16#02C9 instead of 16#00AF [kept]

- The following were present in ‘FCO_Unicode’ but not in ‘Unicode’, 
  without any comment stating why; assuming these could come from 
  slightly wrong names in the FCO fonts, I preserved these differences:

    Delta               mapped to 16#0394 [AGLv2: Deltagreek]
    Omega               mapped to 16#03A9 [AGLv2: Omegagreek]
    periodcentered      mapped to 16#2022 [AGLv2: bullet]
    periodcentered.1    mapped to 16#00B7 [AGLv2: periodcentered]

- The following were also present in ‘FCO_Unicode’ but not in ‘Unicode’,
  without documenting the difference, but are now present in the new 
  ‘Unicode’ resource with identical mapping:

    hyphen              mapped to 16#002D [not a difference anymore]
    scedilla            mapped to 16#015F [not a difference anymore]
    Scedilla            mapped to 16#015E [not a difference anymore]
    tcommaaccent        mapped to 16#0163 [not a difference anymore]
    Tcommaaccent        mapped to 16#0162 [not a difference anymore]

- The following had different mapping in ‘FCO_Unicode’ and ‘Unicode’,
  but ‘Unicode’ was changed to the same mapping as in the old ‘FCO_Unicode’:

    mu                  mapped to 16#00B5, not 16#03BC
                                          [not a difference anymore]

- The following was present in both ‘FCO_Unicode’ and ‘Unicode’, but was
  removed from ‘Unicode’ since then; because originally it was not a 
  FCO-specific difference, I removed it also from the new ‘FCO_Unicode’:

    idot                originally mapped to 16#0069 [removed; AGLv2: i]

- The following was a duplicate mapping in ‘FCO_Unicode’, absent from
  ‘Unicode’. Since these resource types don’t allow duplicates (only
  one mapping is effective, the others are silently ignored), and because
  the ‘Unicode’ name assigned at the time for 16#021A looks incorrect 
  (upsilonlatin), I did not preserve this old mapping:

    Tcommaaccent        duplicate mapping to 16#021A [removed]

  (From another point of view, tcommaaccent -> 16#0163 and 
  Tcommaaccent -> 16#0162 are historical accidents. The 16#0163/ 16#0162
  have a cedilla, the t/ T with comma below are at 16#021B/ 16#021A.
  I kept the mappings in AGLv2, I did not fix them.)
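
Because duplicates are silently ignored, they are easy to miss by eye. Below is a small sketch of a duplicate check in Python. It assumes entries are written as `/name 16#XXXX` (a name followed by a PostScript radix number); the regular expression, the helper `find_duplicates`, and the sample text are illustrative and not taken from the attachments. The sketch only reports duplicates and makes no claim about which of them is the effective one.

```python
import re

# Assumed entry layout: "/glyphname 16#XXXX".
ENTRY = re.compile(r"/(\S+)\s+16#([0-9A-Fa-f]+)")


def find_duplicates(text):
    """Return (name, first_code, later_code) for every repeated glyph name."""
    seen = {}
    duplicates = []
    for name, hex_code in ENTRY.findall(text):
        code = int(hex_code, 16)  # 16#021A -> 0x021A
        if name in seen:
            duplicates.append((name, seen[name], code))
        else:
            seen[name] = code
    return duplicates


SAMPLE = """
/Tcommaaccent 16#0162
/tcommaaccent 16#0163
/Tcommaaccent 16#021A
"""
DUPLICATES = find_duplicates(SAMPLE)
```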

Not having those ‘FCO’ fonts to check, and wanting to fix the wrong 
mappings but at the same time be as conservative as possible, the attached 
‘FCO_Unicode’ file:

- starts by being a copy of the new ‘Unicode’...
- ... but does not contain any NEW names compared to the existing
  ‘FCO_Unicode’ (new names would be useful only if the fonts use them) ...
- ... and preserves the 2+4 differences in the 1st 2 groups above.
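
The three steps above can be sketched in Python as follows. This is an illustration of the approach, not the script used to produce the attachment: `build_fco_table` is a hypothetical helper, the input tables are short excerpts, and `hypotheticalNew` is a made-up name standing in for any name that exists only in the new ‘Unicode’.

```python
def build_fco_table(new_unicode, old_fco_names, overrides):
    """Copy the new table, keep only names FCO already had, re-apply overrides."""
    table = {n: c for n, c in new_unicode.items() if n in old_fco_names}
    table.update(overrides)  # the 2+4 preserved FCO-specific differences
    return table


# Excerpt of the new 'Unicode' table; 'hypotheticalNew' is invented.
NEW_UNICODE = {"fraction": 0x2044, "macron": 0x00AF, "mu": 0x00B5,
               "hypotheticalNew": 0x263A}

# Names in the existing FCO table ('idot' is no longer in the new 'Unicode').
OLD_FCO_NAMES = {"fraction", "macron", "mu", "idot", "Delta", "Omega",
                 "periodcentered", "periodcentered.1"}

# The preserved differences listed in the first two groups.
OVERRIDES = {"fraction": 0x2215, "macron": 0x02C9,
             "Delta": 0x0394, "Omega": 0x03A9,
             "periodcentered": 0x2022, "periodcentered.1": 0x00B7}

FCO = build_fco_table(NEW_UNICODE, OLD_FCO_NAMES, OVERRIDES)
```

With these inputs `idot` drops out because it is absent from the new ‘Unicode’, and `hypotheticalNew` drops out because the old FCO table never had it.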

If you consider a different approach (like adding all new AGL names too)
as more suitable, either drop me a note here or, at your choice, make 
any changes you consider necessary.

Comment 3 Henry Stiles 2011-02-03 18:01:20 UTC

Ken, this should probably be a bounty.

Comment 4 Ken Sharp 2011-02-04 10:58:46 UTC

Adding the bountiable keyword at Henry's suggestion. (Henry, is that all I need to do?)

I've reviewed the changes and I'm happy with them; however, we are currently in code freeze for the 9.01 release, so I can't commit them yet. I'll leave the bug report open until the freeze is lifted and then commit and close.

Comment 5 Ken Sharp 2011-02-10 10:34:31 UTC

Changes adopted in revision 12139

http://ghostscript.com/pipermail/gs-cvs/2011-February/012297.html